[ 
https://issues.apache.org/jira/browse/TIKA-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15962710#comment-15962710
 ] 

Tim Allison commented on TIKA-2320:
-----------------------------------

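Thank you for opening this.  I suspect the cause is not in the Tika layer
but in PDFBox: the stack trace shows the DataFormatException surfacing from
org.apache.pdfbox.filter.FlateFilter while PDFont.readCMap reads a font
stream.  Please open an issue on PDFBox's JIRA and share your triggering
document with them (if you can).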
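
For anyone following along, the innermost exception in the trace comes from
java.util.zip.Inflater, which PDFBox's FlateFilter calls to decompress
stream data.  The sketch below uses only the JDK to show that class
rejecting a malformed stream.  The class FlateProbe, its methods, and the
five-byte sample are hypothetical and are not taken from Tika, PDFBox, or
the reporter's document.  The sample is a zlib header followed by a
dynamic-Huffman block header that declares 288 literal/length codes, more
than the 286 that DEFLATE permits, which is the condition zlib reports as
"too many length or distance symbols".

```java
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class FlateProbe {
    // Hypothetical sample: zlib header (0x78 0x9C), then a final dynamic-Huffman
    // block whose HLIT field is 31 (257 + 31 = 288 codes; the limit is 286).
    static final byte[] MALFORMED =
            {0x78, (byte) 0x9C, (byte) 0xFD, (byte) 0xFF, (byte) 0xFF};

    /** Returns null if the data inflates cleanly, otherwise a failure description. */
    static String describeFailure(byte[] zlibData) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(zlibData);
            byte[] buffer = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                // No progress and nothing left to read: the stream ended early.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    return "truncated stream";
                }
            }
            return null;
        } catch (DataFormatException e) {
            return String.valueOf(e.getMessage());
        } finally {
            inflater.end();
        }
    }

    /** Compresses a small input into a well-formed zlib stream for comparison. */
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        byte[] out = new byte[raw.length + 64]; // ample for small inputs
        int n = deflater.deflate(out);
        deflater.end();
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        System.out.println("valid:     " + describeFailure(deflate("hello".getBytes())));
        System.out.println("malformed: " + describeFailure(MALFORMED));
    }
}
```

Running a similar probe over the bytes of the failing stream object, once
extracted from the PDF, would be one way to confirm for the PDFBox report
that the data itself is rejected by the JDK's inflater.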

> java.util.zip.DataFormatException when parsing a PDF
> ----------------------------------------------------
>
>                 Key: TIKA-2320
>                 URL: https://issues.apache.org/jira/browse/TIKA-2320
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.14
>         Environment: java version "1.8.0_121"
> Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
>            Reporter: Moritz Becker
>
> I use the following code to parse a PDF:
> {code:java}
> PDFParser pdfparser = new PDFParser();
> pdfparser.parse(Test.class.getResourceAsStream("/testdoc.pdf"),
>         handler, metadata, pcontext);
> {code}
> This results in the following exception:
> {noformat}
> Exception in thread "main" org.apache.tika.exception.TikaException: Unable to extract PDF content
>       at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:133)
>       at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:150)
>       at com.curecomp.tika.Test.main(Test.java:28)
> Caused by: java.io.IOException: java.util.zip.DataFormatException: too many length or distance symbols
>       at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:82)
>       at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:69)
>       at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:162)
>       at org.apache.pdfbox.pdmodel.font.PDFont.readCMap(PDFont.java:189)
>       at org.apache.pdfbox.pdmodel.font.PDFont.<init>(PDFont.java:134)
>       at org.apache.pdfbox.pdmodel.font.PDSimpleFont.<init>(PDSimpleFont.java:84)
>       at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:164)
>       at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
>       at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:143)
>       at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:815)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:472)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:446)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
>       at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
>       at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
>       at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:141)
>       at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
>       at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>       at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:111)
>       ... 2 more
> Caused by: java.util.zip.DataFormatException: too many length or distance symbols
>       at java.util.zip.Inflater.inflateBytes(Native Method)
>       at java.util.zip.Inflater.inflate(Inflater.java:259)
>       at java.util.zip.Inflater.inflate(Inflater.java:280)
>       at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:107)
>       at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:73)
>       ... 21 more
> {noformat}
> The PDF can be read using Adobe Reader XI 11.0.12.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
