[ 
https://issues.apache.org/jira/browse/PDFBOX-5153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17316025#comment-17316025
 ] 

Andreas Lehmkühler commented on PDFBOX-5153:
--------------------------------------------

The index returned by COSParser#findString was off by one, and I'm wondering 
why it didn't show up earlier. The bug was introduced a year ago; see 
PDFBOX-3888.
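
For readers unfamiliar with this kind of defect, here is a minimal sketch of
an off-by-one in a byte-pattern search. It is not the PDFBox code: the class
FindStringSketch and both methods are hypothetical and only mirror the idea
of a search that should return the start index of a match. The second method
deliberately reproduces a one-position shift, so any caller that slices the
data at the reported index ends up with one byte too many or too few.

```java
import java.nio.charset.StandardCharsets;

public class FindStringSketch {

    // Hypothetical helper, not the PDFBox implementation: returns the index
    // of the first byte of the first occurrence of pattern in data, or -1.
    static int findString(byte[] data, byte[] pattern) {
        if (pattern.length == 0) {
            return 0;
        }
        for (int start = 0; start + pattern.length <= data.length; start++) {
            int matched = 0;
            while (matched < pattern.length
                    && data[start + matched] == pattern[matched]) {
                matched++;
            }
            if (matched == pattern.length) {
                return start; // position of the first matching byte
            }
        }
        return -1;
    }

    // Deliberately buggy variant for illustration: a found index is
    // reported one position past the true start of the match.
    static int findStringShifted(byte[] data, byte[] pattern) {
        int index = findString(data, pattern);
        return index < 0 ? index : index + 1;
    }

    public static void main(String[] args) {
        byte[] data = "<< /Length 5 >>\nstream\nHELLO\nendstream"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] marker = "endstream".getBytes(StandardCharsets.US_ASCII);

        int correct = findString(data, marker);
        int shifted = findStringShifted(data, marker);

        // Everything before the marker; the shifted variant keeps one
        // extra byte (the first byte of the marker itself).
        System.out.println("correct prefix length: " + correct);
        System.out.println("shifted prefix length: " + shifted);
    }
}
```

Inputs where nothing depends on the exact byte position would behave the same
with either variant, which is one way such a shift can stay unnoticed for a
long time.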

> New flatefilter exception on Tika unit test files with 3.0.0-RC1
> ----------------------------------------------------------------
>
>                 Key: PDFBOX-5153
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5153
>             Project: PDFBox
>          Issue Type: Task
>          Components: Parsing
>    Affects Versions: 3.0.0 PDFBox
>            Reporter: Tim Allison
>            Assignee: Andreas Lehmkühler
>            Priority: Trivial
>
> On TIKA-3347, we're integrating PDFBox 3.0.0-RC1.  We're getting new flate 
> filter exceptions on a set of files that I _think_ I created with PDFBox a 
> while ago.
> Looks like we're also getting xref exceptions.
> I would not be surprised in the least to learn that I did something wrong in 
> the creation of these files and that they are corrupt!
> I can replicate this issue with {{java -jar pdfbox-app-3.0.0-RC1.jar 
> export:text}}
> {noformat}
> SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
> Error extracting text for document [IOException]: 
> java.util.zip.DataFormatException: invalid block type
> {noformat}
> One of the files: 
> https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/test/resources/test-documents/testPDF_no_extract_yes_accessibility_owner_user.pdf
>  


