[ 
https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093152#comment-14093152
 ] 

Thomas Chojecki commented on PDFBOX-2250:
-----------------------------------------

[~lehmi]
"All fixes and improvements are targeting the non-sequential parser and I won't 
port those changes to the old parser."

As far as I remember, the old parser already has this feature or a similar 
one. It was needed as a fix for a third-party library that creates documents 
whose offsets are mismatched by 2 or 3 bytes. You can find it in the PDFParser 
class, line 923 (resolveConflicts). 
https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java#L923
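
To illustrate the kind of small-offset correction described above, here is a 
minimal sketch. It is not the actual resolveConflicts code: the class 
OffsetProbe, its method names, and the maxDrift parameter are hypothetical and 
only show the general idea of looking for the "N G obj" header a few bytes 
around the offset declared in the XRef table.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox.
final class OffsetProbe {

    /**
     * Looks for the header "objNr genNr obj" at the declared offset and, if it
     * is not there, at most maxDrift bytes before or after it.
     * Returns the corrected offset, or -1 if no header was found in range.
     */
    static int findObjectHeader(byte[] data, int declared, int objNr, int genNr, int maxDrift) {
        byte[] header = (objNr + " " + genNr + " obj").getBytes(StandardCharsets.US_ASCII);
        for (int drift = 0; drift <= maxDrift; drift++) {
            if (matchesAt(data, declared + drift, header)) {
                return declared + drift;
            }
            if (drift > 0 && matchesAt(data, declared - drift, header)) {
                return declared - drift;
            }
        }
        return -1;
    }

    /** True if pattern starts at pos and is not the tail of a longer number. */
    static boolean matchesAt(byte[] data, int pos, byte[] pattern) {
        if (pos < 0 || pos + pattern.length > data.length) {
            return false;
        }
        // Reject "13 0 obj" when looking for "3 0 obj".
        if (pos > 0 && data[pos - 1] >= '0' && data[pos - 1] <= '9') {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (data[pos + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The real header starts 2 bytes after the declared offset 9.
        byte[] pdf = "%PDF-1.4\n\r\n3 0 obj\n<<>>\nendobj\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(findObjectHeader(pdf, 9, 3, 0, 3));
    }
}
```

Keeping the search window this small is what makes the approach fairly safe: 
within 2 or 3 bytes there is little room to hit an unrelated object header.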

I haven't read the whole conversation, but you wrote something about a 
200-byte self-healing range. This can cause problems with PDFs that are broken 
and include PDF documents as file attachments. The FlateDecode algorithm 
sometimes does not compress every block, so it can leave some plaintext PDF 
blocks which contain parts like "endstream" or "endobj". In this case the 
self-healing algorithm can run into such an uncompressed block and fail to 
read the object.
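
The effect can be reproduced with plain java.util.zip. The following is only a 
sketch under an assumption: it forces stored (uncompressed) blocks with 
Deflater.NO_COMPRESSION, whereas a real encoder decides per block whether to 
store or compress. The class StoredBlockDemo and its helpers are hypothetical 
names. It shows that the keyword "endobj" of an embedded document stays 
readable inside the encoded stream, where a keyword scan could find it.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical demo class, not part of PDFBox.
final class StoredBlockDemo {

    /** Encodes data as a zlib stream using the given compression level. */
    static byte[] deflate(byte[] data, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Returns the first index of needle in haystack, or -1 if absent. */
    static int indexOf(byte[] haystack, byte[] needle) {
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            boolean match = true;
            for (int j = 0; j < needle.length && match; j++) {
                match = haystack[i + j] == needle[j];
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Stands in for a fragment of an attached PDF document.
        byte[] embedded = "1 0 obj\n<< /Length 0 >>\nstream\nendstream\nendobj\n"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] keyword = "endobj".getBytes(StandardCharsets.US_ASCII);

        // Level 0 makes zlib emit stored blocks, i.e. the input bytes verbatim.
        byte[] stored = deflate(embedded, Deflater.NO_COMPRESSION);
        System.out.println("keyword visible in stored stream: "
                + (indexOf(stored, keyword) >= 0));
    }
}
```

A scan that only looks for "endstream" or "endobj" within a fixed byte range 
cannot tell such a hit inside stream data from a real object boundary.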

I hope you understand what I mean :-) 

PS: something off-topic. I think the signature implementation only works with 
the old parser. So maybe someone can post this info on the website if the 
default parser implementation changes.

> Improve XRef self healing mechanism
> -----------------------------------
>
>                 Key: PDFBOX-2250
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2250
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 1.8.6, 1.8.7, 2.0.0
>            Reporter: Andreas Lehmkühler
>            Assignee: Andreas Lehmkühler
>
> PDFBOX-1769 introduced a "self healing" mechanism to repair corrupt XRef 
> offsets. But that one was just a starter and there remain a lot of issues to 
> be solved. I'm planning to solve at least some of them.
> All fixes and improvements are targeting the non-sequential parser and I 
> won't port those changes to the old parser.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
