This is an automated email from the ASF dual-hosted git repository.

tallison pushed a change to branch 2.x
in repository https://gitbox.apache.org/repos/asf/tika.git
was 0a55b4a4e TIKA-2354 -- .doc is missing many pictures

This change permanently discards the following revisions:

discard 0a55b4a4e TIKA-2354 -- .doc is missing many pictures
discard 21bcc5595 TIKA-2343 -- change put to post for multipart
discard fe3971a69 TIKA-2352 -- bug fix for WordPerfect parser via Pascal Essiembre. Pull request 176. Split to different change list...argh.
discard 9ef078778 TIKA 2343 -- add text-main/boilerpipe option to tika-server
discard babb2534e TIKA-2352 -- bug fix for WordPerfect parser via Pascal Essiembre. Pull request 176.
discard 62e5a8477 TIKA-2350
discard 6930ff025 TIKA-2311 -- try OPC before ZipFile. This can work better on some truncated files.
discard 4e1e87ff2 TIKA-2348 -- include caught exception in EMF/WMF rethrows
discard 7c4258917 Merge remote-tracking branch 'origin/2.x' into 2.x
discard c67e62236 TIKA-2349 -- try to match embedded docs by digest in tika-eval "Compare"
discard e7ad4ec15 TIKA-2309 fixed tika-parser-crypto-bundle IT
discard 3743e4d67 TIKA-2309 Time Stamped Data Envelope parser
discard 51190df6e TIKA-2339 - remove test file that was identified by one av program as potentially contain MDropper. We assess this as a false positive, but we've chosen to remove the file to allow users with this av program to build Tika.
discard 73147a239 update javadoc for Latin1StringsParser
discard a847a863d TIKA-1195 and TIKA-2329, upgrade to POI 3.16-final and add xlsb parser
discard 870ec187e In rare cases, elapsed can == 3000 exactly. Fix this.
discard 143efc8d9 TIKA-2311 -- maintain x-tika-ooxml mime type for truncated ooxml
discard d2907f41a TIKA-2325
discard 110247fcf turn off debug statement
discard 6b9e36e3f TIKA-2323
discard fce6626f2 TIKA-2319 follow up
discard 37b8864ed TIKA-2319
discard 96a8ddd84 TIKA-2318
discard c4888d59e Merge remote-tracking branch 'origin/2.x' into 2.x
discard 67a5e91b2 TIKA-2317 warn user if max content length is hit; allow for easier parameterization by commandline
discard d8e4b5f6e Added explicit test scope for junit
discard 363675554 Bumped junit and slf4j versions
discard 747b121fd Update mailing list archive links
discard 3e925166a Merge remote-tracking branch 'origin/2.x' into 2.x
discard 1826112e6 TIKA-2302 -- make macro extraction configurable and set default to false
discard f87948d28 Merge changelog update
discard 78c31eb61 TIKA-1772 More WebVTT unit tests
discard d12c87b6d Merge 3c02c4b to the new 2.x test documents area
discard e34498bbe TIKA-1772 More test WebVTT files - no text header, and a custom one
discard 2df5c536b TIKA-1772 More WebVTT magic - for cases with no header, and with custom headers
discard e3fead445 TIKA-2307 -- include finer grained supported types so that users can control includes/excludes with decorator via config
discard fcccda6cc TIKA-2307
discard 29d7d7ceb TIKA-2300 record streams that can't be read via pkg's metadata via Aeham Abushwashi
discard 77f25f2e7 clean up unit tests
discard 4ed7fccc3 TIKA 2287 -- bug fixes
discard 51cc80d24 TIKA-2236 upgrade PDFBox to 2.0.5 and JempBox to 1.8.13
discard 7344209a1 clean up from sax docx work
discard 93cb9717e TIKA-2295 -- extract images from odt
discard da2dce946 TIKA-2242 -- fix handling of annotations and <p> within a <p> in odt.
discard 15e22679f TIKA-1879 -- add more granularity to recipients in Outlook/PST emails
discard 380af5b32 TIKA-2290 -- fix bug that prevents passing of ocr strategy via headers in tika-server
discard 5719bf788 TIKA-2287 -- bug fix, improve handling when ref tables already exist
discard 875c3a151 TIKA-2287 -- add jdbc
discard 70895fcd9 TIKA-1865 clean up, deduplicate MailUtil, bug fix
discard a12cae48f TIKA-1865 bug fix
discard 2ebc90a5c TIKA-2281 applied to PSTParser
discard 0274a2816 TIKA-1865 step 2 the other parsers 1
discard f70ea7a8f TIKA-1865 -- step 1 split out sender name from sender email exchange info where possible in msg files
discard 24160a1c0 TIKA-2281 add mapi message type
discard 81f1591fe TIKA-2285 -- triggering file didn't actually trigger string index out of bounds exception, but there could be one with a null or very short styleName
discard 4843ca157 TIKA-2286
discard d0ebfda73 fix tika-eval bug - include child file extension instead of parent
discard 82509f32c TIKA-1857 xfa fix
discard 5925bcb58 TIKA-2279 - simplify token counting
discard 6dcad8896 TIKA-2273 -- improve configuration of encoding detectors. TODO: figure out loading in tika-app bundle and turn tests back on.
discard b2a462c6d TIKA 2276 -- cleanup
discard a279d039d TIKA-2278 clean up extract exception handling
discard 35756b142 TIKA-2276 try to reuse parsers from ParseContext rather than creating own
discard 4ebc441bd TIKA-2276 -- pass through TikaConfig if not specified via ParseContext in AutoDetectParser
discard 0ce764915 TIKA-2275
discard 824d176c9 TIKA-2269 -- fix potential NPE in FeedParser via Julien Nioche.
discard 544ba9752 TIKA-2267 -- add common tokens for some languages into tika-eval
discard 81150859b TIKA-1332 -- add English Spanish common tokens; fix logging
discard 61532258f TIKA-1332 3rd time's the charm. Fix dependencies with IOUtils.
discard 44612ae40 TIKA-1332 fix pom for 2.0
discard 0d04b499a TIKA-1332 downgrade to Lucene 5.x so that this can run w/ Java 7
discard 69dd0328b TIKA-1332 fix one profiler report and whitespace
discard 5e49c3308 TIKA-1332 initial commit of tika-eval. More work remains.
discard 6bfe5d565 TIKA-2246 and TIKA-2247 -add parsers for EMF and WMF
discard d9f376c12 TIKA-2134 - remove npe catch after upgrade to POI 3.16.beta2
discard 0d7f5bad0 TIKA-2198 - add null check to Tika after upgrade to POI 3.16-beta2
discard 27e81b97a TIKA-2181 upgrade to POI 3 16 beta2, make sure to upgrade overall bundle
discard cf3996ed0 TIKA 2181 upgrade to POI 3 16 beta2
discard 7b0655cc1 TIKA 2259 -- improve url extraction from PDFs = copy Tilman Hausherr's code from PDFBOX 3644
discard 2d4889f44 TIKA 2025 -- fix xls/x testBigIntegersWGeneralFormat to work in multiple locales
discard 28010d90d Mimetype for SAS Xport (XPT) files
discard 534a52598 TIKA-2255 Mime detection unit tests for SAS files
discard a79de0ccf TIKA-2255 Magic for older sas data files
discard 4d8feaee5 Move to Tika 2.x location
discard 6287b75b5 TIKA-2255 Test SAS files
discard 3df8ce8b2 TIKA-2251 improve exception handling in SAX pptx/docx parsers
discard 235c2adab TIKA-2249 -- update javadocs to alert devs that tables are not "maintained" by the PDFParser
discard 4599374d6 Merge remote-tracking branch 'origin/2.x' into 2.x
discard 985c1aef8 TIKA 2244 -- be more parsimonious with BufferedInputStream. AutoDetectReader
discard bd667acde TIKA-2250 As of RFC7903, the official mime type for EMF is now an image one and without the x- prefix
discard 6668d78fa TIKA-2250 As of RFC7903, the official mime type for WMF is now an image one and without the x- prefix
discard 58d56c33f TIKA-2250 As of RFC7903, the official mime type for BMP is now the one without the x- prefix
discard 78828176a TIKA-2244 -- be more parsimonious with BufferedInputStream via Josh Hight.
discard 8d783d27a TIKA-2232 -- log/warn if jbig2 is not on classpath
discard 161b122ba TIKA-2240 -- improve mime detection for .wri files
discard 9dbff6065 TIKA-2231 - allow underscored language codes (e.g. "chi_tra") in TesseractOCRParser via Graham Russell.
discard 4374bcecf TIKA-2242 fix style markup in ODT
discard 45a9b77d6 TIKA-2241 Add new config dumping option STATIC_FULL which lists all supported+active mime types for parsers
discard dd70fd33a TIKA-2232 -- ImageParser shouldn't allege that it can handle jbig2 when jbig2 library is not on class path
discard cd98c4cf3 TIKA-2238 add mime detection for embedded MSEquation files
discard ce4e7e7d9 TIKA 2134 -- handle missing parts more robustly
discard 28b53bd4d TIKA 2159 handle preparse/embedded IO exceptions uniformly
discard 681615731 TIKA-2210 -- add experimental SAX parser for pptx and update (also TIKA-2191 and TIKA-2220)
discard 2d908d59b TIKA-2237
discard e02084cc6 TIKA-2192
discard 0bc9bd896 TIKA-2232 -- add processing of jbig2 (with necessary non ASL 2.0 libs) via Pascal Essiembre
discard c14e75070 TIKA-2235 -- bump default dpi for images created via PDF for OCR to 300 dpi via Matthew Caruana Galizia
discard 850de1467 TIKA-2234 get rid of ThreadLocal
discard aaa661e25 TIKA-2228 from Pascal Essiembre and TIKA-2230.
discard f0863bcea Merge remote-tracking branch 'origin/2.x' into 2.x
discard f1a541378 TIKA-2190 -- Add test file for maintain spacing
discard 4e3534da0 Move new test file to the 2.x location
discard 785e47413 Manually merge changelog
discard cdb6456bb TIKA-2224 Test OneNote file from Krishnan Narayan plus unit test
discard 71584b2de TIKA-2224 We now differ from HTTPD on onenote formats, as we have subtypes they lack
discard db21ee158 TIKA-2224 Mime sub-entry for .onepkg, a cab file holding other onenote files
discard bb76d986a TIKA-2224 Mime magic for OneNote
discard d8fa3c2a8 TIKA-1946 updates, detection of wordperfect 5.0 and 5.1 as well as quattropro 7-8 vs quattropro 9
discard 39cf35551 TIKA_2226 add exception for unsupported formats
discard 4383e3da7 TIKA-1946 -- initial commit to add parsers for WordPerfect and QuattroPro. Many thanks to Pascal Essiembre for contributing these!!!
discard 337d38304 TIKA-2211 -- make sure that head (<style>) content isn't showing up in body in the EpubParser
discard c9fcb3315 TIKA-2211 modify test file to include style information to test that we're excluding it.
discard 50c1dc69d update OCR config to include default for output type
discard 0d30aa1b2 TIKA 2190 -- add configurability for preserve interword spacing
discard 54154e004 TIKA-2219 make sure to transmit charset name in detectAll via Pascal Essiembre -- fix test method to get inputstream from zip
discard 68f305864 TIKA-2219 make sure to transmit charset name in detectAll via Pascal Essiembre
discard ee761ac00 TIKA-2221 -- correctly catch and rethrow encrypted document exception as EncryptedDocumentException in WordExtractor via Matthew Caruana Galizia
discard ffb25af1b Merge remote-tracking branch 'origin/2.x' into 2.x
discard 4f04b6c3e TIKA-2218 -- add a new new locations within a pptx to check for embedded objects
discard d8853fe31 Update to PDFBox 2.0.4
discard 300100fcb TIKA-2090: Allow extraction of PDActions (including Javascript) from PDFs (TIKA-2090).
discard 3d08da79f TIKA-2187 -- make "ignore deleted" as the default in the experimental SAX .docx parser and update the WordExtractor to include extraction of deleted text if requested by the user.
discard 32162f59e TIKA 1321 initial commit
discard de103c81f TIKA-2096 -- fix example, sorry...
discard 1bb7c3384 TIKA-2179 -- add detection and parsing for word2006ml files -- this modification somehow fell to a different change list
discard e5e4d4d91 TIKA-2096 change default to extract embedded documents even if the user forgets to specify an AutoDetectParser in the ParseContext
discard a47a69933 TIKA-2169 fix xhtml in ocr
discard 2f452304b Add mime detection and parser for Word 2006ML format (TIKA-2179).
discard 8c01e4d8e TIKA-2116 upgrade to POI 3.16-beta1
discard 7df6fe4be TIKA-2170 fix unit test to allow for different exceptions depending on cause of timeout.
discard 7adfe1cb5 TIKA-2170 allow configuration of timeout for ForkServer
discard 9a68f4ccc TIKA-2174 -- clean up
discard 3f24e6c3e TIKA-2174 -- add ppm and update changes.txt
discard ab009aeb7 TIKA-2159 -- first step
discard f2661f997 TIKA-2174 add jpx and jp2 to Tesseract
discard 7422218eb TIKA-2173 - first steps. Need to integrate parameter configuration into 2.x before I can do the rest
discard bcd59cee7 TIKA-2171 - upgrade sqlite parser
discard 2c9412ab1 TIKA-2171 - upgrade sqlite parser
discard 1d1bc0dd7 TIKA-1933 - clean up one more place where we aren't closing the ForkParser and are leaving behind a tmp ForkParser jar
discard 2d5189186 TIKA-2157 - handle zip exception in embedded file
discard 6ca74bec6 improve unit test for TIKA-2098
discard a6978521f TIKA-2111 - ExecutableParser should set rather than add a Content-Type
discard 4b393a6f9 TIKA-2144 - avoid npe if styles doesn't exist (odd, indeed, but if MSWord can handle it, we should, too).
discard 936e3ac16 TIKA-2130
discard 4c3bb1560 TIKA-2133
discard c5f4f5263 TIKA-2127 : npe if there is no notes master)
discard 7e66e4979 TIKA-2123: digester fails with multiple digests on large files
discard 30e03de89 TIKA-2122: Extract all headers from MSG/RFC822
discard 1e55953bc TIKA-2113-- upgrade metadata-extractor to 2.9.1
discard 3fe8ef819 Merge remote-tracking branch 'origin/2.x' into 2.x
discard af74ea5c9 TIKA-2110-- log full exception throughout tika-batch
discard 1ec8c0947 Tesseract may see the t in haystack as a ! some times...
discard 1ab6c81ce TIKA-2106 -- need to lower case hocr/txt suffix, thanks to Eric Pugh. This closes #136
discard b84fcc584 TIKA-2101 -- don't call MAPIMessage's close()
discard cde4c0aa8 TIKA-2098 small clean up. Test for writelimitreached for each catchable IOException. Many thanks to Alexander Kazakov for finding this and submitting https://github.com/apache/tika/pull/134
discard 4392681af TIKA-2097 fix npe in mbox parser
discard be78c549a Extract PDF DocInfo metadata into separate keys to prevent overwriting by XMP metadata (TIKA-2057).
discard 94789a963 Tika-2095 include Tika version in tika-server's GREETING
discard bd7208929 * Re-enable fileUrl for tika-server (TIKA-2081). Fix commandline options not to include '-'
discard ce1fc3720 * Re-enable fileUrl for tika-server (TIKA-2081). If you choose, to use this feature, beware of the security vulnerabilities! See: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2015-3271
discard 673533d0e TIKA-2093- Add Tesseract's hOCR output format as an option, via Eric Pugh. This commit also catches 2.x up to trunk; there were clearly some other changes to Tesseract that hadn't yet made it into 2.x.
discard d543378a8 TIKA-2069 -- extract macros from MSOffice docs, fix tests to find target metadata object in any order
discard 66f433471 TIKA-2069 -- extract macros from MSOffice files.
discard 32d9ece8d * Maintain passed-in mime in TXTParser (TIKA-2047).
discard 12b1d435b TIKA-2013 -- upgrade to POI 3.15 -- don't forget to close new NPOIFS and MAPIMessage
discard 1b32e3186 TIKA-2015 -- upgrade to PDFBox 2.0.3
discard 92453f5e7 GitHub user haisi opened a pull request: https://github.com/apache/tika/pull/132
discard 176f3aded TIKA 2055 catch exception when totalTime out of unsigned int range in ooxml, include test file
discard ae0cb3059 TIKA 2055 catch exception when totalTime out of unsigned int range in ooxml
discard 9f6241161 Merge changes for TIKA-2064 to 2.x
discard e58ade381 TIKA-2064 Test Stata DTA files from Michael Stepner, plus detection unit test
discard 443a21e3f TIKA-2064 Mime types, with magic, for mostly-xml Stata DTA files. (Awaiting suitably licensed file for testing)
discard 4636f95b2 TIKA-1255 and TIKA-2078 -- fix hyperlinks that include formatting and fix hyperlinks with multiple runs in docx
discard f112c88fb TIKA-2075 - Expose Additional TikaService methods
discard f8092d3bd TIKA-2073 - Tika Language Detect Project should include Bundle Activator and packaging consistent with other modules
discard 7a0280c77 TIKA-2071 - DefaultParser and CompositeParser does not filter excludedParsers from dynamic ServiceLoader Parsers
discard 587dcb772 TIKA-2072 - Create TikaServiceFactory for creating TikaService
discard d57a85274 TIKA-2070 - Add Encoding Detector and Language Detectors to Dynamic Service Loader
discard b73cd8ce8 TIKA-2074 - ServiceLoader can use Class files loaded via dynamic loading
discard a0f365524 TIKA-2067 upgrade maven plugin dependencies -- revert felix bundle
discard 8ff89d419 TIKA-2067 upgrade maven plugin dependencies
discard 164bf52c8 TIKA-2066 upgrade commons-io to 2.5
discard b2a7e382a TIKA-2065 upgrade forbiddenapis
discard 8234b96fe TIKA-2061 - Added Adobe BSD license to tika-xmp
discard 5d9db6bec TIKA-2063 - Added Vorbis bundle to bundle parent.
discard fcefaae59 TIKA-2063 - Create vorbis bundle
discard dc841e6ba TIKA-2060 - Added toggle to ClassLoaderUtils for OSGi
discard cebf72382 TIKA-2062 - Remove bouncy castle inlining in bundles
discard 4704d976c TIKA-2061 - Embed xmpcore in tika-xmp since it is not a proper bundle
discard 59e0ca0fc TIKA-2059 - Merge multimedia and pdf parser modules and bundles
discard 87b6d5d7d TIKA-2007 upgrade jackson, needed to update CachedTranslator (diff btwn trunk and 2.x)
discard db513d6ad TIKA-2007 upgrade jackson
discard 27bc383eb TIKA-1980 via Joseph Naegele
discard 09bd22fb4 TIKA-1938 via Joseph Naegele
discard 5358bf1e1 TIKA-1938 via Joseph Naegele
discard b41c0b2a8 TIKA-2041 - add important diffs between new copy/paste from ICU4J and legacy code which may have included Tika-specific mods.
discard 6ebbd7ef7 cleanup MatParser
discard fc7c372f5 TIKA-2048
discard 1c582aba6 TIKA-2040 - prevent permanent hang/oom on corrupt chm file
discard 9f6c71fa6 TIKA-2041, upgrade ICU4j's charset detector to avoid multithreading bug.
discard f89887d2f TIKA-2037 Merge fixes for 2.x
discard 53310facc Changelog update
discard 65cc9bcec TIKA-2042 MBOX magic and detection unit test
discard 31374a39b TIKA-2037 RFC822Parser should wrap the James InputStream of embedded resources to avoid problems with downstream detection or extraction
discard d6ce10b41 Email with attachment for testing extraction issues
discard 8b951a43c TIKA-2039 upgrade to jackcess 2.1.4
discard f4bacf859 TIKA-2025 increase number of significant digits extracted in "general" format in xls/xlsx
discard e27526b84 TIKA-2030 - fix test file so that it is correctly detected
discard cdfacdb41 Merge remote-tracking branch 'origin/2.x' into 2.x
discard 87e1e23b4 TIKA-2030 - add handling for <text:s/> element to ODT parser. Thanks to David Pilato for opening this issue.
discard 573527bbc Merge branch '2.x' of https://git-wip-us.apache.org/repos/asf/tika into 2.x
discard 2a7e52ec4 fix getRecursiveJson -> getRecursiveMetadata in TikaTest, no json is involved here...not sure why Intellij didn't catch this one. sorry.
discard 2eb4804d1 fix getRecursiveJson -> getRecursiveMetadata in TikaTest, no json is involved here...
discard 4678d6733 TIKA-2024 extract original path name from OLE1.0 embedded objects
discard c7a6bcac4 Convert new lines from windows to unix
discard dd3c2a486 TIKA-2026 -- improve extraction of attachments for PPT, PPTX, XLSX
discard 933af20e8 rm inconsistently capitalized test files
discard e62f23057 TIKA-2024 extract original file name/path where possible, take 1
discard c84855f67 TIKA-2022 - clean up -- make entries private, move more into EndianUtils
discard 865c45cd5 fix indentation
discard 5bc597dc8 TIKA-2023 -- clean up RTFParser to use EndianUtils and IOUtils.readFully
discard b14b47e76 TIKA-2022 -- add parser for applefile
discard cd12917fa TIKA-2020 -- remove 3 parameter parse() and simplify CAD tests
discard 0c71b2ffc TIKA-2020, remove 3 parameter parse() and simplify CAD tests
discard 6bb6827e0 add startDocument and endDocument() to PRTParser so that it works with the ToXMLHandler
discard 767442614 fix indents and whitespace
discard 1ce93ed9e TIKA-2019 -- fix WordMLParser and SpreadsheetMLParser
discard 2f5537380 TIKA-2009 -- add detection for Endnote Import files
discard b600b6701 make sure to test magic for vcs/ics/asx
discard 73ce7681c TIKA-2009 -- add magic for djvu
discard b3bf5141b TIKA-2008 -- change metadata key to TikaCoreProperties.MODIFIER
discard 60d4e3ff2 TIKA-2008 -- add mime definition and parser for MSOwnerFile
discard ffaa4deaa TIKA-2004 -- add mime definitions for Windows Media Metafile
discard f90193aa0 TIKA-2006 -- add mime definitions for ical and vcal
discard b480d43f5 TIKA-1996 -- Upgrade to PDFBox 2.0.2
discard ac52e5c15 TIKA-1999: fix setter, update changes.txt
discard 89062edb0 TIKA-1999: add configurable limit to number of events extracted in XMPMM history.
discard ebe702898 TIKA-1994 -- Integrate TesseractOCR with full page image rendering for PDFs
discard e5a7604bc TIKA-1992 -- check for duplicate inline images by COSStream not object name.
discard e05dd5bf4 TIKA-1990 -- need to add JPEG filters to embedded stream when handling embedded jpegs in PDFParser
discard b1c00c050 TIKA-1985 -- ignore test until we get permission to use test file