I'm using PDFBox 0.7.4 together with Aperture. While crawling, I often get the following warning and exception:

[Aug 14 15:17:05] WARN (PdfExtractor.java:119) - IOException while extracting full-text of file:////De-fs003/projects/Active/EC305479%20-%20EUTELSAT%20W2M%20&%20I3K/SC200129%20-%20EUTELSAT%20W2M%20&%20I3K/24%20-%20Client%20Supplied%20Information/01%20I3K%20Supplied%20Information/Signed_W2M_GSRD_iss02_rev06_27Dec2007.pdf
java.util.zip.ZipException: incorrect data check
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:140)
    at org.pdfbox.filter.FlateFilter.decode(FlateFilter.java:111)
    at org.pdfbox.cos.COSStream.doDecode(COSStream.java:313)
    at org.pdfbox.cos.COSStream.doDecode(COSStream.java:235)
    at org.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:170)
    at org.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:101)
    at org.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:132)
    at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:205)
    at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:177)
    at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:339)
    at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:263)
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:219)
    at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:152)
    at org.semanticdesktop.aperture.extractor.pdf.PdfExtractor.extractFullText(PdfExtractor.java:112)
    at org.semanticdesktop.aperture.extractor.pdf.PdfExtractor.processDocument(PdfExtractor.java:100)
    at org.semanticdesktop.aperture.extractor.pdf.PdfExtractor.extract(PdfExtractor.java:62)

Searching the net, this exception seems to indicate a corrupted archive file. That is fine as far as it goes: there is not much to be done if the file is corrupt. What surprises me, however, is that the file being processed at the time is a .pdf file. Why is the java.util.zip library being used?

Thanks,
Gert.
Please help Logica to respect the environment by not printing this email / Pour contribuer comme Logica au respect de l'environnement, merci de ne pas imprimer ce mail / Bitte drucken Sie diese Nachricht nicht aus und helfen Sie so Logica dabei, die Umwelt zu schützen. / Por favor ajude a Logica a respeitar o ambiente nao imprimindo este correio electronico. This e-mail and any attachment is for authorised use by the intended recipient(s) only. It may contain proprietary material, confidential information and/or be subject to legal privilege. It should not be copied, disclosed to, retained or used by, any other party. If you are not an intended recipient then please promptly delete this e-mail and any attachment and all copies and inform the sender. Thank you.