There are two embedded/inline images (not regular attachments) that are processed by pdfbox app's ExtractImages.
In 1.8.9, there's a tiff (lightbulb) and a jpeg (flag/fireworks). With trunk, there is a log warning saying that tiff isn't supported and then an empty tiff file and a jpeg.

-----Original Message-----
From: Tilman Hausherr [mailto:[email protected]]
Sent: Tuesday, July 07, 2015 3:48 PM
To: [email protected]
Subject: Re: migrating Tika to 2.0.0

On 07.07.2015 at 21:39, Allison, Timothy B. wrote:
> Thank you, Andreas. I opened PDFBox-2856.
>
> How about tiffs not being handled by ExtractImages...is this expected?
>> I also noticed that the tiff file is no longer extracted (2.0.0 logger
>> says tiff not handled, but a tiff is extracted with 1.8.9). Is this
>> expected?

What tiff? When displaying it with Adobe Reader, I see a word file and a joboptions file.

Tilman

> Thank you, again.
>
> Best,
>
> Tim
> -----Original Message-----
> From: Andreas Lehmkuehler [mailto:[email protected]]
> Sent: Tuesday, July 07, 2015 3:08 PM
> To: [email protected]
> Subject: Re: migrating Tika to 2.0.0
>
> Hi,
>
> On 07.07.2015 at 18:59, Allison, Timothy B. wrote:
>> All,
>>
>> As part of TIKA-1285, I updated Jeremy Anderson's original patch for our
>> wrapper for PDFBox 2.0.0 on Tika. I'm having some problems running the unit
>> tests because at least one of our files [0] is causing hefty resource
>> utilization, which sends my laptop into paging. The parse does eventually
>> stop, and content is extracted.
> What version of PDFBox are you using? I guess the latest SNAPSHOT?
>
>> I also tried this file outside of Tika and used the straight PDFBox-app
>> (both ExtractImages and ExtractText), and performance is also far, far
>> slower when compared with 1.8.9.
> I ran some quick tests and I can confirm that 2.0.0 is 4-5 times slower than
> 1.8.9 when extracting the text from the given pdf.
>
>> Many apologies if this issue has already been identified.
> AFAIK, it was unknown until now. Please create a JIRA ticket and thanks for
> reporting.
>
>> I also noticed that the tiff file is no longer extracted (2.0.0 logger
>> says tiff not handled, but a tiff is extracted with 1.8.9). Is this
>> expected?
>>
>> Thank you!
>>
>> Best,
>>
>> Tim
>> [0]
>> https://issues.apache.org/jira/secure/attachment/12743988/testPDF_childAttachments.pdf
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
> BR
> Andreas
