[ 
https://issues.apache.org/jira/browse/PDFBOX-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15406165#comment-15406165
 ] 

Tilman Hausherr commented on PDFBOX-3452:
-----------------------------------------

We already do a lot to make PDFBox more lenient... see your two fixed issues 
from yesterday. But some are just too difficult (i.e. would require some major 
work). Maybe Andreas can do something some day, but currently he's busy.

Btw, even seemingly correct PDF files do not always yield usable extracted 
text.

> IOException at org.apache.pdfbox.pdfparser.BaseParser.readStringNumber
> ----------------------------------------------------------------------
>
>                 Key: PDFBOX-3452
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3452
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.2
>            Reporter: Yauheni Salopiy
>              Labels: WK
>         Attachments: 95s-0316-rpt0242-21-appendix-16-f-vol177.pdf, PDFBOX-3452_LOG.txt
>
>
> Apache Tika 1.14-SNAPSHOT (PDFBox 2.0.2) throws the following exception on 
> text extraction from a valid PDF document:
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@6c25e6c4
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
>       at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>       at com.wolterskluwer.atlas.transformer.processFileResources.DocumentsTextExtractor.extractText(DocumentsTextExtractor.java:44)
>       at com.wolterskluwer.atlas.transformer.processFileResources.DocumentsTextExtractor.main(DocumentsTextExtractor.java:134)
> Caused by: java.io.IOException: Number '???·???????Wk®)i?v' is getting too long, stop reading at offset 266260
>       at org.apache.pdfbox.pdfparser.BaseParser.readStringNumber(BaseParser.java:1379)
>       at org.apache.pdfbox.pdfparser.BaseParser.readLong(BaseParser.java:1341)
>       at org.apache.pdfbox.pdfparser.BaseParser.readObjectNumber(BaseParser.java:1278)
>       at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:739)
>       at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:721)
>       at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:652)
>       at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:612)
>       at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:215)
>       at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:249)
>       at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:840)
>       at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:780)
>       at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:130)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       ... 6 more
> Please find the failing document and a log with the stack trace in the attachments.


