I'm using PDFBox 0.7.4 together with Aperture.

While crawling, I often get the following warning and exception:

"[Aug 14 15:17:05] WARN  (PdfExtractor.java:119) - IOException while extracting full-text of file:////De-fs003/projects/Active/EC305479%20-%20EUTELSAT%20W2M%20&%20I3K/SC200129%20-%20EUTELSAT%20W2M%20&%20I3K/24%20-%20Client%20Supplied%20Information/01%20I3K%20Supplied%20Information/Signed_W2M_GSRD_iss02_rev06_27Dec2007.pdf
java.util.zip.ZipException: incorrect data check
 at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:140)
 at org.pdfbox.filter.FlateFilter.decode(FlateFilter.java:111)
 at org.pdfbox.cos.COSStream.doDecode(COSStream.java:313)
 at org.pdfbox.cos.COSStream.doDecode(COSStream.java:235)
 at org.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:170)
 at org.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:101)
 at org.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:132)
 at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:205)
 at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:177)
 at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:339)
 at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:263)
 at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:219)
 at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:152)
 at org.semanticdesktop.aperture.extractor.pdf.PdfExtractor.extractFullText(PdfExtractor.java:112)
 at org.semanticdesktop.aperture.extractor.pdf.PdfExtractor.processDocument(PdfExtractor.java:100)
 at org.semanticdesktop.aperture.extractor.pdf.PdfExtractor.extract(PdfExtractor.java:62)"

Searching the net for this exception suggests that it indicates a corrupted 
archive file. That is fair enough: there is not much to be done if an archive 
file is corrupt.

However, what surprises me is that the file being processed at the time is a 
.pdf file. Why is the java.util.zip library being used?
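
For context, here is a minimal sketch of one likely explanation, offered as an assumption rather than a confirmed diagnosis of this particular file. PDF streams are commonly compressed with the FlateDecode filter, which uses the same zlib/deflate format that java.util.zip implements, and the FlateFilter frame in the trace above suggests that is the path taken here. The sketch does not use PDFBox at all. The class FlateCheckDemo and its helper methods are hypothetical names made up for illustration. It compresses a small piece of content-stream-like text, damages the trailing checksum, and decompresses it again with InflaterInputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;
import java.util.zip.ZipException;

// Hypothetical demo class; not part of PDFBox or Aperture.
public class FlateCheckDemo {

    // Compress bytes into a zlib stream (header, deflate data, Adler-32 trailer).
    static byte[] deflate(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out)) {
            dos.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decompress a zlib stream. Returns the decoded text, or
    // "ZipException: <message>" if the compressed data is damaged.
    static String inflate(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[256];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (ZipException e) {
            return "ZipException: " + e.getMessage();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.US_ASCII);
    }

    // Flip one bit in the last byte, which belongs to the Adler-32 checksum.
    static byte[] corruptChecksum(byte[] zlibStream) {
        byte[] copy = zlibStream.clone();
        copy[copy.length - 1] ^= 0x01;
        return copy;
    }

    public static void main(String[] args) {
        byte[] good = deflate("BT /F1 12 Tf (Hello) Tj ET".getBytes(StandardCharsets.US_ASCII));
        System.out.println(inflate(good));                  // intact stream
        System.out.println(inflate(corruptChecksum(good))); // damaged stream
    }
}
```

With the intact stream the original text comes back. With the damaged checksum, the second call should report a ZipException of the same kind as in the trace. If this assumption holds, the warning points to a damaged compressed stream inside the PDF rather than to a separate archive file.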

Thanks,
Gert.




Please help Logica to respect the environment by not printing this email  / 
Pour contribuer comme Logica au respect de l'environnement, merci de ne pas 
imprimer ce mail /  Bitte drucken Sie diese Nachricht nicht aus und helfen Sie 
so Logica dabei, die Umwelt zu schützen. /  Por favor ajude a Logica a 
respeitar o ambiente nao imprimindo este correio electronico.



This e-mail and any attachment is for authorised use by the intended 
recipient(s) only. It may contain proprietary material, confidential 
information and/or be subject to legal privilege. It should not be copied, 
disclosed to, retained or used by, any other party. If you are not an intended 
recipient then please promptly delete this e-mail and any attachment and all 
copies and inform the sender. Thank you.



