Thanks, Sergey. I pushed TIKA-1367 to 1.7 because we have had a DISCUSS thread going for a few weeks about getting 1.6 out. Do you have a patch for TIKA-1367 right now? If so, I'm happy to incorporate it and roll an RC #2 to get it in. If you don't have a patch yet, would you mind terribly if we pushed out 1.6, which already has a ton of great updates, and then rolled a 1.7 shortly thereafter (or whenever you finish TIKA-1367)?
Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Sergey Beryozkin <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday, July 28, 2014 11:38 AM
To: "[email protected]" <[email protected]>
Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1

>+0 given that it appears that the tika-parsers dependencies
>documentation issue has been pushed away. I'm getting confused why.
>
>Thanks. Sergey
>
>[1] https://issues.apache.org/jira/browse/TIKA-1367
>
>On 28/07/14 17:16, Tyler Palsulich wrote:
>> +1
>>
>> OSX 10.9.3, Java 1.7
>>
>> Tyler
>>
>>
>> On Mon, Jul 28, 2014 at 7:09 AM, Allison, Timothy B. <[email protected]>
>> wrote:
>>
>>> +1
>>>
>>> Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
>>> Windows 7, Java 1.7
>>>
>>> I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs
>>> (all formats) plus all available msoffice-x files in govdocs1, yielding
>>> 10,413 docs. There were several improvements in text extraction for PDFs
>>> (mostly spacing) and 4 fewer exceptions (2 ppt, 1 doc and 1 pdf).
>>>
>>> There was one regression:
>>> http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx
>>>
>>> Stacktrace:
>>> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
>>>     at java.lang.String.checkBounds(String.java:371)
>>>     at java.lang.String.<init>(String.java:415)
>>>     at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
>>>     at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
>>>     at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:91)
>>>     at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:63)
>>>     at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE(AbstractOOXMLExtractor.java:250)
>>>     at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
>>>     at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:115)
>>>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>>>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
>>>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)
>>>
>>>
>>> -----Original Message-----
>>> From: Mattmann, Chris A (3980) [mailto:[email protected]]
>>> Sent: Monday, July 28, 2014 12:22 AM
>>> To: [email protected]
>>> Cc: [email protected]
>>> Subject: [VOTE] Apache Tika 1.6 release candidate #1
>>>
>>> Hi Folks,
>>>
>>> A candidate for the Tika 1.6 release is available at:
>>>
>>> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
>>>
>>> The release candidate is a zip archive of the sources in:
>>>
>>> http://svn.apache.org/repos/asf/tika/tags/1.6/
>>>
>>> The SHA1 checksum of the archive is
>>> 076ad343be56a540a4c8e395746fa4fda5b5b6d3.
>>>
>>> A Maven staging repository is available at:
>>>
>>> https://repository.apache.org/content/repositories/orgapachetika-1003/
>>>
>>> Please vote on releasing this package as Apache Tika 1.6.
>>> The vote is open for the next 72 hours and passes if a majority of at
>>> least three +1 Tika PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Tika 1.6
>>> [ ] -1 Do not release this package because...
>>>
>>> Thank you!
>>>
>>> Cheers,
>>> Chris
>>>
>>> P.S. Here is my +1!
