RE: migrating Tika to 2.0.0

Allison, Timothy B. Tue, 07 Jul 2015 12:40:19 -0700

Thank you, Andreas.  I opened PDFBox-2856.

How about tiffs not being handled by ExtractImages...is this expected?
>    I also noticed that the tiff file is no longer extracted (2.0.0 logger 
> says tiff not handled, but a tiff is extracted with 1.8.9).  Is this expected?


Thank you, again.

Best,

          Tim
-----Original Message-----
From: Andreas Lehmkuehler [mailto:[email protected]] 
Sent: Tuesday, July 07, 2015 3:08 PM
To: [email protected]
Subject: Re: migrating Tika to 2.0.0

Hi,

On 07.07.2015 at 18:59, Allison, Timothy B. wrote:
> All,
>
>    As part of TIKA-1285, I updated Jeremy Anderson's original patch for our 
> wrapper for PDFBox 2.0.0 on Tika.  I'm having some problems running the unit 
> tests because at least one of our files [0] is causing hefty resource 
> utilization, which sends my laptop into paging.  The parse does eventually 
> stop, and content is extracted.
Which version of PDFBox are you using? I guess the latest SNAPSHOT?

>    I also tried this file outside of Tika and used the straight PDFBox-app ( 
> both ExtractImages and ExtractText), and performance is also far, far slower 
> when compared with 1.8.9.
I ran some quick tests and I can confirm that 2.0.0 is 4-5 times slower than 
1.8.9 when extracting the text from the given pdf.

>    Many apologies if this issue has already been identified.
AFAIK, it was unknown until now. Please create a JIRA ticket and thanks for 
reporting.

>
>    I also noticed that the tiff file is no longer extracted (2.0.0 logger 
> says tiff not handled, but a tiff is extracted with 1.8.9).  Is this expected?
>
>           Thank you!
>
>                Best,
>
>                       Tim
> [0] 
> https://issues.apache.org/jira/secure/attachment/12743988/testPDF_childAttachments.pdf
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>

BR
Andreas
