Hi,

TIKA-1422 is related and is also a blocker. Both issues are caused by the
TesseractOCRParser. Once I added the TesseractOCRParser to the
META-INF/services list of Parsers in r1626341, it took precedence over the
previous ImageParser. I've discussed this with Chris at some length.

I thought discovery order might make a difference. However, there are two
methods for getting the list of available Parsers from DefaultParser:
getDefaultParsers and getParsers. getDefaultParsers orders the list of
Parsers alphabetically, with precedence given to user-defined Parsers (those
not in the org.apache.tika package). getParsers orders the Parsers in
discovery order, so Parsers at the top take precedence over those at the
bottom. So, moving the Tesseract Parser above the Image Parsers didn't make
a difference.
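
To illustrate why reordering the services file had no effect, here is a
minimal, self-contained sketch. It does not call Tika. It only models the
ordering described above (user-defined classes first, then alphabetical by
class name), and the comparator, the class ParserOrderSketch, and
com.example.MyParser are hypothetical names made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ParserOrderSketch {

    // Hypothetical comparator modelled on the description in this message:
    // names outside org.apache.tika sort first, then plain alphabetical order.
    // This is an assumption for illustration, not Tika's actual code.
    static final Comparator<String> USER_FIRST_THEN_ALPHABETICAL = (a, b) -> {
        boolean aIsTika = a.startsWith("org.apache.tika.");
        boolean bIsTika = b.startsWith("org.apache.tika.");
        if (aIsTika != bIsTika) {
            return aIsTika ? 1 : -1; // the user-defined name goes first
        }
        return a.compareTo(b);
    };

    // Returns a sorted copy, leaving the discovery-order list untouched.
    static List<String> sorted(List<String> discovered) {
        List<String> copy = new ArrayList<>(discovered);
        copy.sort(USER_FIRST_THEN_ALPHABETICAL);
        return copy;
    }

    public static void main(String[] args) {
        // Two different "discovery orders" of the same three class names.
        List<String> tesseractFirst = Arrays.asList(
                "org.apache.tika.parser.ocr.TesseractOCRParser",
                "org.apache.tika.parser.image.ImageParser",
                "com.example.MyParser");
        List<String> imageFirst = Arrays.asList(
                "org.apache.tika.parser.image.ImageParser",
                "com.example.MyParser",
                "org.apache.tika.parser.ocr.TesseractOCRParser");

        System.out.println(sorted(tesseractFirst));
        System.out.println(sorted(tesseractFirst).equals(sorted(imageFirst)));
    }
}
```

Once the list is sorted by name, the position of an entry in the services
file no longer influences the result, which is consistent with what I saw.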

A simple, temporary fix would be to remove the TesseractOCRParser from the
services list. But, is there a way to add the Parser without it messing
everything else up?

Sorry about this. I shouldn't have committed the services change without
testing first.

Tyler
On Sep 22, 2014 6:59 AM, "Hong-Thai Nguyen" <hngu...@customermatrix.com>
wrote:

> Hi,
>
> I've added a test for this case in r1626706.
> We have TIKA-1421, which is blocking the release.
>
> Hong-Thai
>
> -----Original Message-----
> From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
> Sent: Thursday, September 11, 2014 23:07
> To: dev@tika.apache.org
> Subject: RE: NPE on all *.odt, odp, .ods documents
>
>
> > From: Hong-Thai Nguyen
> > Sent: September 11, 2014 1:40:08pm PDT
> > To: dev@tika.apache.org
> > Subject: Re: NPE on all *.odt, odp, .ods documents
> >
> > I was wrong to say that all OpenDocument files failed. Some files
> > passed, but a lot of them failed with an NPE at OpenDocumentParser
> > line 161.
>
> OK, thanks for clarifying.
>
> So I assume we now have a unit test that would fail without the fix, yes?
>
> Thanks,
>
> -- Ken
>
> >
> > I'm looking at OpenDocumentParser.java in 1.6. The bug comes from the
> > block at lines 126-130 when the input is a TikaInputStream (our case):
> >     if (container instanceof ZipFile) {
> >         zipFile = (ZipFile) container;
> >     } else if (tis.hasFile()) {
> >         zipFile = new ZipFile(tis.getFile());
> >     }
> >
> > When neither branch is taken, zipFile is never created.
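
As a rough illustration of the failure mode in the quoted block, the sketch
below uses only java.util.zip. It is not the TIKA-1412 fix and does not use
Tika's API: the class ZipSourceSketch and its methods are hypothetical. It
shows one way to avoid dereferencing a ZipFile that was never created, by
falling back to streaming when no backing file exists.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipSourceSketch {

    /** Counts archive entries; fileOrNull may be null when only a stream exists. */
    static int countEntries(File fileOrNull, InputStream stream) {
        try {
            if (fileOrNull != null) {
                // A backing file exists, so random access through ZipFile works.
                try (ZipFile zip = new ZipFile(fileOrNull)) {
                    return zip.size();
                }
            }
            // No backing file: read entries sequentially rather than using a
            // ZipFile reference that would still be null at this point.
            int count = 0;
            try (ZipInputStream in = new ZipInputStream(stream)) {
                while (in.getNextEntry() != null) {
                    count++;
                }
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a tiny two-entry archive in memory; entry names are illustrative. */
    static byte[] sampleZip() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ZipOutputStream out = new ZipOutputStream(bytes)) {
                out.putNextEntry(new ZipEntry("mimetype"));
                out.write("application/vnd.oasis.opendocument.text"
                        .getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
                out.putNextEntry(new ZipEntry("content.xml"));
                out.write("<doc/>".getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stream-only input: the case where neither quoted branch would apply.
        System.out.println(countEntries(null, new ByteArrayInputStream(sampleZip())));
    }
}
```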
> >
> >
> > For information, this bug really is fixed in 1.7-SNAPSHOT. Here is a
> > detailed comparison of the two versions on the same corpus:
> > 1.6:
> > 14-09-09 16:17:43 INFO  (DocumentConversionErrorPlugin.java : 115)
> > [pool-2-thread-2] Summary of document conversion errors:
> > - pdf (7)
> > - pptx (10)
> > - doc (6)
> > - ppt (14)
> > - xls (9)
> > - dwg (4)
> > - odp (495)
> > - odt (839)
> > - pps (2)
> > - ods (1)
> >
> > 1.7-SNAPSHOT:
> > - pdf (7)
> > - pptx (10)
> > - doc (6)
> > - ppt (14)
> > - xls (9)
> > - dwg (4)
> > - odp (2)
> > - pps (2)
> >
> >
> > On Thu, Sep 11, 2014 at 8:55 PM, Ken Krugler
> > <kkrugler_li...@transpac.com>
> > wrote:
> >
> >>
> >>> From: Hong-Thai Nguyen
> >>> Sent: September 11, 2014 5:21:41am PDT
> >>> To: dev@tika.apache.org
> >>> Subject: NPE on all *.odt, odp, .ods documents
> >>>
> >>> Hi all,
> >>>
> >>> I've tested conversion with Tika 1.6 on our corpus, and all OpenOffice
> >>> document types fail with an NPE. A fix has been made in
> >>> https://issues.apache.org/jira/browse/TIKA-1412, but it is only
> >>> available from 1.7.
> >>> That's a fatal error for me.
> >>
> >> I'm curious - don't we have unit tests for OpenOffice document types?
> >>
> >> If so, then why are they passing, but all docs tried by Hong-Thai fail?
> >>
> >> -- Ken
> >>
> >>>
> >>> Should we release a 1.6.1 with the fix for TIKA-1412?
> >>>
> >>> Stack trace:
> >>> Caused by: com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@318e5904
> >>>     at com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(DocumentConverter.java:233)
> >>>     at com.polyspot.document.converter.DocumentConverter.convert(DocumentConverter.java:127)
> >>>     at com.polyspot.wscrawlers.PsDocConverter.getConvertedDocument(PsDocConverter.java:83)
> >>>     ... 22 more
> >>> Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@318e5904
> >>>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:246)
> >>>     at com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(DocumentConverter.java:225)
> >>>     ... 24 more
> >>> Caused by: java.lang.NullPointerException
> >>>     at org.apache.tika.parser.odf.OpenDocumentParser.parse(OpenDocumentParser.java:161)
> >>>     at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
> >>>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
> >>>     ... 25 more
> >>>
> >>> --
> >>> --------------
> >>> Hong-Thai
>
>
>
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>
>
>
>
>
>
