The file contains two embedded/inline images (not regular attachments) that 
pdfbox-app's ExtractImages processes.

In 1.8.9, ExtractImages writes a tiff (lightbulb) and a jpeg (flag/fireworks).  
With trunk, it logs a warning that tiff isn't supported and then writes an 
empty tiff file and a jpeg.

-----Original Message-----
From: Tilman Hausherr [mailto:[email protected]] 
Sent: Tuesday, July 07, 2015 3:48 PM
To: [email protected]
Subject: Re: migrating Tika to 2.0.0

On 07.07.2015 at 21:39, Allison, Timothy B. wrote:
> Thank you, Andreas.  I opened PDFBox-2856.
>
> How about tiffs not being handled by ExtractImages...is this expected?
>>     I also noticed that the tiff file is no longer extracted (2.0.0 logger 
>> says tiff not handled, but a tiff is extracted with 1.8.9).  Is this 
>> expected?

What tiff? When I display the PDF in Adobe Reader, I see a Word file and a 
joboptions file.

Tilman

> Thank you, again.
>
> Best,
>
>            Tim
> -----Original Message-----
> From: Andreas Lehmkuehler [mailto:[email protected]]
> Sent: Tuesday, July 07, 2015 3:08 PM
> To: [email protected]
> Subject: Re: migrating Tika to 2.0.0
>
> Hi,
>
> On 07.07.2015 at 18:59, Allison, Timothy B. wrote:
>> All,
>>
>>     As part of TIKA-1285, I updated Jeremy Anderson's original patch for our 
>> wrapper for PDFBox 2.0.0 on Tika.  I'm having some problems running the unit 
>> tests because at least one of our files [0] is causing hefty resource 
>> utilization, which sends my laptop into paging.  The parse does eventually 
>> stop, and content is extracted.
> Which version of PDFBox are you using? I guess the latest SNAPSHOT?
>
>>     I also tried this file outside of Tika with the plain PDFBox-app 
>> (both ExtractImages and ExtractText), and performance is also far, far 
>> slower than with 1.8.9.
> I ran some quick tests and I can confirm that 2.0.0 is 4-5 times slower than
> 1.8.9 when extracting the text from the given pdf.
>
>>     Many apologies if this issue has already been identified.
> AFAIK, it was unknown until now. Please create a JIRA ticket and thanks for
> reporting.
>
>>     I also noticed that the tiff file is no longer extracted (2.0.0 logger 
>> says tiff not handled, but a tiff is extracted with 1.8.9).  Is this 
>> expected?
>>
>>            Thank you!
>>
>>                 Best,
>>
>>                        Tim
>> [0] 
>> https://issues.apache.org/jira/secure/attachment/12743988/testPDF_childAttachments.pdf
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
> BR
> Andreas
>

