On 08.07.2015 at 04:19, Allison, Timothy B. wrote:
There are two embedded/inline images (not regular attachments) that are
processed by pdfbox app's ExtractImages.
In 1.8.9, there's a tiff (lightbulb) and a jpeg (flag/fireworks). With trunk,
there is a log warning saying that tiff isn't supported and then an empty tiff
file and a jpeg.
You need to add jai_imageio.jar to your build, and also the levigo jbig2
plugin, as in the 1.8 version.
https://pdfbox.apache.org/1.8/dependencies.html
If it still doesn't work, could you please post the log message?
Tilman
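
One way to check whether the plugins mentioned above are actually visible at runtime is to ask ImageIO which formats it can handle. The following is only a minimal sketch using the standard javax.imageio API; the class name ImageIoFormatCheck is made up for illustration, and the format name "jbig2" is an assumption about how the levigo plugin registers itself. The result depends on the JVM and on which jars are on the classpath, so no particular output is implied.

```java
import javax.imageio.ImageIO;

// Hypothetical helper class (not part of PDFBox or Tika).
public class ImageIoFormatCheck {

    // True if at least one registered ImageIO reader claims the format name.
    public static boolean canRead(String formatName) {
        return ImageIO.getImageReadersByFormatName(formatName).hasNext();
    }

    // True if at least one registered ImageIO writer claims the format name.
    public static boolean canWrite(String formatName) {
        return ImageIO.getImageWritersByFormatName(formatName).hasNext();
    }

    public static void main(String[] args) {
        // "jbig2" is an assumed format name; adjust if the plugin uses another.
        for (String format : new String[] {"jpeg", "tiff", "jbig2"}) {
            System.out.printf("%-6s read=%b write=%b%n",
                    format, canRead(format), canWrite(format));
        }
    }
}
```

If "tiff" reports write=false, a missing TIFF writer on the classpath would be a plausible explanation for an empty extracted tiff file, though the log message is still the more reliable diagnostic.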
-----Original Message-----
From: Tilman Hausherr [mailto:[email protected]]
Sent: Tuesday, July 07, 2015 3:48 PM
To: [email protected]
Subject: Re: migrating Tika to 2.0.0
On 07.07.2015 at 21:39, Allison, Timothy B. wrote:
Thank you, Andreas. I opened PDFBox-2856.
How about tiffs not being handled by ExtractImages...is this expected?
I also noticed that the tiff file is no longer extracted (2.0.0 logger
says tiff not handled, but a tiff is extracted with 1.8.9). Is this expected?
What tiff? When displaying it with Adobe Reader, I see a word file and a
joboptions file.
Tilman
Thank you, again.
Best,
Tim
-----Original Message-----
From: Andreas Lehmkuehler [mailto:[email protected]]
Sent: Tuesday, July 07, 2015 3:08 PM
To: [email protected]
Subject: Re: migrating Tika to 2.0.0
Hi,
On 07.07.2015 at 18:59, Allison, Timothy B. wrote:
All,
As part of TIKA-1285, I updated Jeremy Anderson's original patch for our
wrapper for PDFBox 2.0.0 on Tika. I'm having some problems running the unit
tests because at least one of our files [0] is causing hefty resource
utilization, which sends my laptop into paging. The parse does eventually
stop, and content is extracted.
What version of PDFBox are you using? I guess the latest SNAPSHOT?
I also tried this file outside of Tika and used the straight PDFBox-app
(both ExtractImages and ExtractText), and performance is also far, far slower
when compared with 1.8.9.
I ran some quick tests and I can confirm that 2.0.0 is 4-5 times slower than
1.8.9 when extracting the text from the given pdf.
Many apologies if this issue has already been identified.
AFAIK, it was unknown until now. Please create a JIRA ticket and thanks for
reporting.
I also noticed that the tiff file is no longer extracted (2.0.0 logger
says tiff not handled, but a tiff is extracted with 1.8.9). Is this expected?
Thank you!
Best,
Tim
[0]
https://issues.apache.org/jira/secure/attachment/12743988/testPDF_childAttachments.pdf
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
BR
Andreas