Hi,

TIKA-1422 is related and is also a blocker. Both issues are caused by the TesseractOCRParser. Once I added the TesseractOCRParser to the META-INF/services list of Parsers in r1626341, it took precedence over the ImageParser that was used previously. I've discussed this with Chris at some length.
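To make the shadowing effect concrete, here is a minimal sketch of how a newly registered parser can displace an existing one for the same media type. It is a toy model, not Tika's code: `ToyParser`, `resolve`, and the `image/png` type are hypothetical names chosen for illustration, and the sketch assumes that the first parser in the list to claim a type keeps it. The real resolution rule in Tika may differ.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ParserPrecedenceSketch {

    /** Hypothetical stand-in for a parser: a name plus the media types it claims. */
    public static final class ToyParser {
        final String name;
        final Set<String> supportedTypes;

        public ToyParser(String name, Set<String> supportedTypes) {
            this.name = name;
            this.supportedTypes = supportedTypes;
        }
    }

    /**
     * Builds a media type -> parser name lookup.
     * Assumption for this sketch only: the first parser in the list
     * that claims a type keeps it (putIfAbsent never overwrites).
     */
    public static Map<String, String> resolve(List<ToyParser> parsersInOrder) {
        Map<String, String> byType = new LinkedHashMap<>();
        for (ToyParser parser : parsersInOrder) {
            for (String type : parser.supportedTypes) {
                byType.putIfAbsent(type, parser.name);
            }
        }
        return byType;
    }

    public static void main(String[] args) {
        // Illustrative media type; both toy parsers claim it.
        ToyParser image = new ToyParser("ImageParser", Set.of("image/png"));
        ToyParser ocr = new ToyParser("TesseractOCRParser", Set.of("image/png"));

        // Only the image parser is registered.
        System.out.println(resolve(List.of(image)).get("image/png"));      // ImageParser
        // The OCR parser is registered as well and is consulted first.
        System.out.println(resolve(List.of(ocr, image)).get("image/png")); // TesseractOCRParser
    }
}
```

In this model nothing about the image parser changes; it simply stops being selected once another parser claims the same type ahead of it.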
I thought discovery order might make a difference. But there are two methods of getting the list of available Parsers from DefaultParser: getDefaultParsers and getParsers. getDefaultParsers orders the list of Parsers alphabetically, with precedence given to user-defined Parsers (those not in the org.apache.tika package). getParsers orders the Parsers in discovery order, so Parsers at the top have precedence over those at the bottom. So, moving the Tesseract Parser above the Image Parsers didn't make a difference.

A simple, temporary fix would be to remove the TesseractOCRParser from the services list. But is there a way to add the Parser without it messing everything else up?

Sorry about this. I shouldn't have committed the services change without testing first.

Tyler

On Sep 22, 2014 6:59 AM, "Hong-Thai Nguyen" <hngu...@customermatrix.com> wrote:

> Hi,
>
> I've added a test for this case at r1626706.
> We are having TIKA-1421 which blocks the release.
>
> Hong-Thai
>
> -----Original Message-----
> From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
> Sent: Thursday, September 11, 2014 23:07
> To: dev@tika.apache.org
> Subject: RE: NPE on all *.odt, odp, .ods documents
>
> > From: Hong-Thai Nguyen
> > Sent: September 11, 2014 1:40:08pm PDT
> > To: dev@tika.apache.org
> > Subject: Re: NPE on all *.odt, odp, .ods documents
> >
> > I was wrong when I said that all OpenDocument files failed. Some files
> > passed, but a lot of them failed with an NPE in OpenDocumentParser line 161.
>
> OK, thanks for clarifying.
>
> So I assume we now have a unit test that would fail without the fix, yes?
>
> Thanks,
>
> -- Ken
>
> > I'm looking at OpenDocumentParser.java in 1.6. The bug comes from the
> > block at lines 126-130 when the input is a TikaInputStream (our case):
> >
> >     if (container instanceof ZipFile) {
> >         zipFile = (ZipFile) container;
> >     } else if (tis.hasFile()) {
> >         zipFile = new ZipFile(tis.getFile());
> >     }
> >
> > zipFile is sometimes never created.
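The quoted branch leaves zipFile unset whenever the container is not a ZipFile and the stream has no backing file. Below is a minimal sketch of that path with an explicit null check. The names `openIfPossible` and `backingFile` are hypothetical stand-ins, not the real OpenDocumentParser code, and the sketch does not claim to reproduce the actual TIKA-1412 fix.

```java
import java.io.File;
import java.io.IOException;
import java.util.zip.ZipFile;

public class ZipFileGuardSketch {

    /**
     * Mirrors the shape of the quoted branch with hypothetical parameters:
     * 'container' stands in for an already opened container object and
     * 'backingFile' for the stream's file, which may be absent (null).
     * Returns null when neither branch applies.
     */
    public static ZipFile openIfPossible(Object container, File backingFile) throws IOException {
        ZipFile zipFile = null;
        if (container instanceof ZipFile) {
            zipFile = (ZipFile) container;
        } else if (backingFile != null) {
            zipFile = new ZipFile(backingFile);
        }
        return zipFile; // still null if neither condition held
    }

    public static void main(String[] args) throws IOException {
        // Neither a ZipFile container nor a backing file is available.
        ZipFile zipFile = openIfPossible(null, null);
        if (zipFile == null) {
            // Dereferencing zipFile here without this check would throw
            // a NullPointerException.
            System.out.println("No ZipFile available; this case needs explicit handling.");
        } else {
            try (ZipFile opened = zipFile) {
                System.out.println("Entries: " + opened.size());
            }
        }
    }
}
```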
> >
> > For information, this bug is really fixed in 1.7-SNAPSHOT. Here is a
> > detailed comparison of the two versions on the same corpus:
> >
> > 1.6:
> > 14-09-09 16:17:43 INFO (DocumentConversionErrorPlugin.java : 115)
> > [pool-2-thread-2] Summary of document conversion errors:
> > - pdf (7)
> > - pptx (10)
> > - doc (6)
> > - ppt (14)
> > - xls (9)
> > - dwg (4)
> > - odp (495)
> > - odt (839)
> > - pps (2)
> > - ods (1)
> >
> > 1.7-SNAPSHOT:
> > - pdf (7)
> > - pptx (10)
> > - doc (6)
> > - ppt (14)
> > - xls (9)
> > - dwg (4)
> > - odp (2)
> > - pps (2)
> >
> > On Thu, Sep 11, 2014 at 8:55 PM, Ken Krugler <kkrugler_li...@transpac.com> wrote:
> >
> >>> From: Hong-Thai Nguyen
> >>> Sent: September 11, 2014 5:21:41am PDT
> >>> To: dev@tika.apache.org
> >>> Subject: NPE on all *.odt, odp, .ods documents
> >>>
> >>> Hi all,
> >>>
> >>> I've tested conversion with Tika 1.6 on our corpus, and all OpenOffice
> >>> document types fail with an NPE. A fix has been made in
> >>> https://issues.apache.org/jira/browse/TIKA-1412, but it is only
> >>> available from 1.7. That's a fatal error for me.
> >>
> >> I'm curious - don't we have unit tests for OpenOffice document types?
> >>
> >> If so, then why are they passing, but all docs tried by Hong-Thai fail?
> >>
> >> -- Ken
> >>
> >>> Should we release a 1.6.1 with the fix for TIKA-1412?
> >>>
> >>> Stack trace:
> >>> Caused by: com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@318e5904
> >>>     at com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(DocumentConverter.java:233)
> >>>     at com.polyspot.document.converter.DocumentConverter.convert(DocumentConverter.java:127)
> >>>     at com.polyspot.wscrawlers.PsDocConverter.getConvertedDocument(PsDocConverter.java:83)
> >>>     ... 22 more
> >>> Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@318e5904
> >>>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:246)
> >>>     at com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(DocumentConverter.java:225)
> >>>     ... 24 more
> >>> Caused by: java.lang.NullPointerException
> >>>     at org.apache.tika.parser.odf.OpenDocumentParser.parse(OpenDocumentParser.java:161)
> >>>     at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
> >>>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
> >>>     ... 25 more
> >>>
> >>> --
> >>> --------------
> >>> Hong-Thai
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr