[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178547#comment-14178547 ]

Tilman Hausherr commented on PDFBOX-2250:

Ignore the last commit message (wrong issue).

Improve XRef self healing mechanism

Key: PDFBOX-2250
URL: https://issues.apache.org/jira/browse/PDFBOX-2250
Project: PDFBox
Issue Type: Improvement
Components: Parsing
Affects Versions: 1.8.6, 1.8.7, 2.0.0
Reporter: Andreas Lehmkühler
Assignee: Andreas Lehmkühler
Fix For: 1.8.8, 2.0.0
Attachments: 055794.pdf, 113223.pdf, PDFBOX-2250-107425-empty-xref.pdf, PDFBOX-2250-110264-xref-zeronumber.pdf, PDFBOX-2250-229205.pdf, PDFBOX-2250-233566.pdf

PDFBOX-1769 introduced a self healing mechanism to repair corrupt XRef offsets. But that one was just a starter and there remain a lot of issues to be solved. I'm planning to solve at least some of them. All fixes and improvements are targeting the non-sequential parser and I won't port those changes to the old parser.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177084#comment-14177084 ]

ASF subversion and git services commented on PDFBOX-2250:

Commit 1633185 from [~lehmi] in branch 'pdfbox/branches/1.8'
[ https://svn.apache.org/r1633185 ]

PDFBOX-2250: replaced the 200 bytes seek repair mechanism with a brute force search, optimized the xref repair mechanism, lower the minimum start offset
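For readers who want a picture of what a brute force search for objects could look like, here is a minimal sketch. It is not the code from the commit above: the class `BruteForceObjectScan` and its method are hypothetical names, and the sketch assumes the whole file fits into a byte array. It scans for every `N G obj` marker and records where the object number starts.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not PDFBox API): finds all "N G obj" markers in a file. */
class BruteForceObjectScan {

    private static boolean isDigit(byte b) {
        return b >= '0' && b <= '9';
    }

    private static boolean isWhitespace(byte b) {
        return b == ' ' || b == '\n' || b == '\r' || b == '\t' || b == '\f' || b == 0;
    }

    private static boolean isLetterOrDigit(byte b) {
        return isDigit(b) || (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z');
    }

    /** Maps "objNr genNr" to the byte offset at which the object number starts. */
    static Map<String, Long> findObjectOffsets(byte[] data) {
        Map<String, Long> offsets = new LinkedHashMap<>();
        for (int i = 0; i + 2 < data.length; i++) {
            if (data[i] != 'o' || data[i + 1] != 'b' || data[i + 2] != 'j') {
                continue;
            }
            // Reject longer words such as "object".
            if (i + 3 < data.length && isLetterOrDigit(data[i + 3])) {
                continue;
            }
            // Whitespace must precede "obj"; this also rejects "endobj".
            int p = i - 1;
            if (p < 0 || !isWhitespace(data[p])) {
                continue;
            }
            while (p >= 0 && isWhitespace(data[p])) {
                p--;
            }
            // Walk backwards over the generation number.
            int genEnd = p + 1;
            while (p >= 0 && isDigit(data[p])) {
                p--;
            }
            int genStart = p + 1;
            if (genStart == genEnd || p < 0 || !isWhitespace(data[p])) {
                continue;
            }
            while (p >= 0 && isWhitespace(data[p])) {
                p--;
            }
            // Walk backwards over the object number.
            int numEnd = p + 1;
            while (p >= 0 && isDigit(data[p])) {
                p--;
            }
            int numStart = p + 1;
            if (numStart == numEnd) {
                continue;
            }
            String key = new String(data, numStart, numEnd - numStart, StandardCharsets.ISO_8859_1)
                    + " " + new String(data, genStart, genEnd - genStart, StandardCharsets.ISO_8859_1);
            // A later definition of the same key overwrites an earlier one.
            offsets.put(key, (long) numStart);
        }
        return offsets;
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4\n1 0 obj\n<< >>\nendobj\n2 0 obj\n<< >>\nendobj\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(findObjectOffsets(pdf));
    }
}
```

A scan like this also matches `N G obj` text that merely happens to sit inside stream data, so its result is a list of candidates rather than a verified table.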
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177103#comment-14177103 ]

ASF subversion and git services commented on PDFBOX-2250:

Commit 1633186 from [~lehmi] in branch 'pdfbox/branches/1.8'
[ https://svn.apache.org/r1633186 ]

PDFBOX-2250: skip empty xref table followed by trailer, leave call that will create empty instead of null curXrefTrailerObj when xref table is empty (merged from trunk)
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177115#comment-14177115 ]

ASF subversion and git services commented on PDFBOX-2250:

Commit 1633187 from [~lehmi] in branch 'pdfbox/branches/1.8'
[ https://svn.apache.org/r1633187 ]

PDFBOX-2250: include key for Invalid object stream xref object reference IOException, treat fileOffset == 0 like fileOffset == null (merged from trunk)
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14176290#comment-14176290 ]

Andreas Lehmkühler commented on PDFBOX-2250:

[~tilman] Thanks for the fast feedback. I already have a fix for the first issue (055794.pdf). I assumed that the relevant data of a valid pdf starts at offset 15, but in your case it already starts at offset 10. I'll have a look at the other one too.
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14176293#comment-14176293 ]

ASF subversion and git services commented on PDFBOX-2250:

Commit 1632895 from [~lehmi] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1632895 ]

PDFBOX-2250: optimized the xref repair mechanism, lower the minimum start offset
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14176295#comment-14176295 ]

Andreas Lehmkühler commented on PDFBOX-2250:

I've improved the xref repair mechanism and fixed the issue with the object that could not be found, as mentioned by Tilman (055794.pdf).

[~tilman] I had a look at 113223.pdf, but I can't find any object which can't be found. Can you give me a pointer where to look?

*TODO*
- the xref repair algorithm simply searches for the nearest offset, which may fail if more than one xref table is present
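To illustrate the TODO above, here is a minimal sketch of a nearest-offset repair. It assumes the whole file is available as a byte array, and the class `NearestXrefSearch` and its methods are hypothetical names, not PDFBox API. It collects every standalone `xref` keyword and picks the one closest to the claimed offset. With two tables in the file, the closest one is not necessarily the right one, which is the weakness the TODO describes.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not PDFBox API): nearest-offset repair for an xref table offset. */
class NearestXrefSearch {

    /** Returns the offsets of all "xref" keywords that are not part of "startxref". */
    static List<Long> findXrefKeywords(byte[] data) {
        List<Long> hits = new ArrayList<>();
        for (int i = 0; i + 4 <= data.length; i++) {
            if (data[i] != 'x' || data[i + 1] != 'r' || data[i + 2] != 'e' || data[i + 3] != 'f') {
                continue;
            }
            boolean partOfStartxref = i >= 5
                    && new String(data, i - 5, 5, StandardCharsets.ISO_8859_1).equals("start");
            if (!partOfStartxref) {
                hits.add((long) i);
            }
        }
        return hits;
    }

    /** Returns the candidate closest to the claimed offset, or -1 if there is none. */
    static long nearest(List<Long> candidates, long claimedOffset) {
        long best = -1;
        long bestDistance = Long.MAX_VALUE;
        for (long candidate : candidates) {
            long distance = Math.abs(candidate - claimedOffset);
            if (distance < bestDistance) {
                best = candidate;
                bestDistance = distance;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4\nxref\n0 1\ntrailer\nstartxref\n12\n%%EOF\nxref\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        List<Long> hits = findXrefKeywords(pdf);
        System.out.println(hits + " -> " + nearest(hits, 12));
    }
}
```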
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14176298#comment-14176298 ]

Tilman Hausherr commented on PDFBOX-2250:

In the version before the last changes, I got this error:

SCHWERWIEGEND: Can't find the object xref at offset 25554
Exception in thread main java.io.IOException: Error: Expected a long type at offset 25554, instead got '?Tl*OjlP^d8jp1Y#@J+\)cfaMC\?Y+WgkWs.4@'

Now it works without any error or warning and renders perfectly.

The xref offset is wrong, and every object offset in the xref is wrong too, e.g. the first object is said to be at 16 but is really at 17. The next one is said to be at 69 but is really at 71, and so on.
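The off-by-one offsets described above can be detected by checking whether the bytes at the claimed offset really spell out the object header. The following is only a sketch: `ObjectOffsetCheck` is a hypothetical class, not PDFBox API, and it assumes that object number, generation number and `obj` are separated by single spaces.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not PDFBox API): verifies an offset taken from the xref table. */
class ObjectOffsetCheck {

    /** Returns true if "objNr genNr obj" starts exactly at the given offset. */
    static boolean pointsAtObject(byte[] data, long offset, long objNr, int genNr) {
        byte[] expected = (objNr + " " + genNr + " obj").getBytes(StandardCharsets.ISO_8859_1);
        if (offset < 0 || offset + expected.length > data.length) {
            return false;
        }
        int start = (int) offset;
        // A digit right before the offset means we landed inside a longer number.
        if (start > 0 && data[start - 1] >= '0' && data[start - 1] <= '9') {
            return false;
        }
        for (int i = 0; i < expected.length; i++) {
            if (data[start + i] != expected[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4\n%abcdef\n1 0 obj\n<< >>\nendobj\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("claimed offset 16: " + pointsAtObject(pdf, 16, 1, 0));
        System.out.println("real offset 17: " + pointsAtObject(pdf, 17, 1, 0));
    }
}
```

When the check fails, a repair step can look for the real position of the object, for example with a scan of the surrounding bytes or of the whole file.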
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14131655#comment-14131655 ]

ASF subversion and git services commented on PDFBOX-2250:

Commit 1624567 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1624567 ]

PDFBOX-2250: leave call that will create empty instead of null curXrefTrailerObj when xref table is empty
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109074#comment-14109074 ]

Timo Boehme commented on PDFBOX-2250:

[~tilman] your patch treats an object from the xref table with offset 0 the same as a non-declared object id, i.e. as a null object. While this might seem obvious to a (C) programmer (0 == null), it is not part of the PDF specification (at least I couldn't find it). Furthermore, with the ongoing xref offset healing project we try to find the correct offset for an object id. Treating the entry as a null object would prevent that, only because it happens to have offset 0 rather than some other wrong offset. Thus I'm not sure this patch is a good idea.

On the other hand, offset 0 will be wrong in any case (assuming the required header) and is quite rare, so if we have a document which can be cured by this patch it might be ok.
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109295#comment-14109295 ]

John Hewson commented on PDFBOX-2250:

[~lehmi], the spec says that, but the [Adobe Supplement to the ISO 32000|http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/adobe_supplement_iso32000.pdf], which describes Acrobat's behaviour, adds that:

{quote}
Acrobat viewers require only that the header appear somewhere within the first 1024 bytes of the file.
{quote}

Such PDFs are not unheard of. PDFParser#parseHeader() contains code which scans the PDF for the header (although it doesn't limit the scan to 1024 bytes and contains some misleading comments).
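A header scan limited to the first 1024 bytes, as the quoted supplement describes, might look like the sketch below. This is not the code of PDFParser#parseHeader(): `HeaderScan` is a hypothetical class, and it assumes that the whole `%PDF-` marker has to lie inside the 1024-byte window, which the quote does not spell out.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not PDFBox API): looks for "%PDF-" near the start of a file. */
class HeaderScan {

    /** Size of the window in which the header marker is accepted. */
    static final int LIMIT = 1024;

    /** Returns the offset of "%PDF-" within the first LIMIT bytes, or -1 if absent. */
    static int findHeaderOffset(byte[] data) {
        byte[] marker = "%PDF-".getBytes(StandardCharsets.ISO_8859_1);
        int lastStart = Math.min(data.length, LIMIT) - marker.length;
        for (int i = 0; i <= lastStart; i++) {
            boolean match = true;
            for (int j = 0; j < marker.length; j++) {
                if (data[i + j] != marker[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] pdf = "junk\n%PDF-1.4\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("header offset: " + findHeaderOffset(pdf));
    }
}
```

Any bytes in front of the returned offset are leading junk, and a parser has to decide whether object offsets count from the start of the file or from the header.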
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109320#comment-14109320 ]

Andreas Lehmkühler commented on PDFBOX-2250:

[~jahewson] I guess that advanced feature was added to skip rubbish in front of the file header and/or a malformed header, but it wasn't meant to allow constructs like the one you've mentioned above.
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109343#comment-14109343 ]

John Hewson commented on PDFBOX-2250:

Allowing rubbish in front of the file header is _almost_ the same construct, as even in malformed documents the header is usually at the beginning of a line. The end result is the same though: we do allow an object at offset 0.
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107959#comment-14107959 ]

Andreas Lehmkühler commented on PDFBOX-2250:

No, according to the spec a pdf shall start with the pdf version:

{quote}
7.5.2 File Header
The first line of a PDF file shall be a header consisting of the 5 characters %PDF- followed by a version number of the form 1.N, where N is a digit between 0 and 7.
{quote}
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106686#comment-14106686 ]

Andreas Lehmkühler commented on PDFBOX-2250:

IMHO it's ok to treat a 0 offset as a null offset, as such a value is invalid and thus doesn't make any sense. In the mentioned pdf the 0 offset belongs to the object 3 0, which is referenced in the catalog but doesn't exist in the pdf.
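The rule discussed here, treating an xref offset of 0 like a missing entry, can be written down in a few lines. This is a sketch only: `XrefLookup` is a hypothetical class, and the plain `Map` from "objNr genNr" to offset stands in for whatever structure the parser really uses.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not PDFBox API): xref lookup that rejects offset 0. */
class XrefLookup {

    /** Returns the stored offset, or null if the entry is absent or has offset 0. */
    static Long usableOffset(Map<String, Long> xref, String key) {
        Long offset = xref.get(key);
        // Offset 0 cannot hold an object if the file starts with the header.
        return (offset == null || offset == 0L) ? null : offset;
    }

    public static void main(String[] args) {
        Map<String, Long> xref = new HashMap<>();
        xref.put("1 0", 17L);
        xref.put("3 0", 0L); // invalid entry, as in the pdf mentioned above
        System.out.println("1 0 -> " + usableOffset(xref, "1 0"));
        System.out.println("3 0 -> " + usableOffset(xref, "3 0"));
    }
}
```

Returning null for both cases lets the caller handle them on one path, either by resolving the reference to a null object or by handing the key to a repair step.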
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107365#comment-14107365 ]

John Hewson commented on PDFBOX-2250:

Wouldn't the following be a legal start of a PDF file, given that the header need only be in the first 1024 bytes? In this case, object 1 0 would be at offset 0.

{code}
1 0 obj
/Count 1
/Type /Pages
/Kids [5 0 R]
endobj
%PDF-1.4
2 0 obj
/Pages 1 0 R
/Type /Catalog
endobj
etc...
{code}

But maybe we shouldn't support such silliness.
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105105#comment-14105105 ]

ASF subversion and git services commented on PDFBOX-2250:

Commit 1619296 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1619296 ]

PDFBOX-2250: include key for Invalid object stream xref object reference IOException
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104711#comment-14104711 ]

ASF subversion and git services commented on PDFBOX-2250:

Commit 1619255 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1619255 ]

PDFBOX-2250: skip empty xref table followed by trailer
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14094303#comment-14094303 ]

ASF subversion and git services commented on PDFBOX-2250:

Commit 1617528 from [~lehmi] in branch 'pdfbox/branches/1.8'
[ https://svn.apache.org/r1617528 ]

PDFBOX-2250: don't override offset/trailer when parsing the cross reference stream of a hybrid xref table
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14094317#comment-14094317 ]

Andreas Lehmkühler commented on PDFBOX-2250:

All mentioned pdfs are working now.
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14094515#comment-14094515 ]

Tilman Hausherr commented on PDFBOX-2250:

Yes!
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093086#comment-14093086 ]

ASF subversion and git services commented on PDFBOX-2250:

Commit 1617339 from [~lehmi] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1617339 ]

PDFBOX-2250: parse an optional cross-reference stream to get object numbers for compressed objects as well
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093088#comment-14093088 ]

ASF subversion and git services commented on PDFBOX-2250:

Commit 1617340 from [~lehmi] in branch 'pdfbox/branches/1.8'
[ https://svn.apache.org/r1617340 ]

PDFBOX-2250: parse an optional cross-reference stream to get object numbers for compressed objects as well
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093106#comment-14093106 ]

Andreas Lehmkühler commented on PDFBOX-2250:

In contrast to the old parser, the non-sequential one didn't parse cross-reference streams. I've added that feature so that object references, especially those for compressed objects, can now be found. This should improve the parser once more when it comes to pdfs using object streams.

I've used this [sample pdf|http://bewerbung.fh-kaernten.at/fileadmin/Anleitung-PDF-erstellen.pdf] provided by Martin Tappler on dev@pdfbox.
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093152#comment-14093152 ] Thomas Chojecki commented on PDFBOX-2250: - [~lehmi] All fixes and improvements are targeting the non-sequential parser and I won't port those changes to the old parser. The old parser already has this feature or similar one as I remember. This was needed as fix for a third party lib that creates documents that have a miss matched offset by 2 or 3 bytes. You can find it in the PDFParser class line 923 (resolveConflicts). https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java#L923 I don't have read the whole coversation, but you wrote something of 200 bytes self healing range. This can cause problems with pdfs that are broken and include pdf documents as file attachment. The flatdecode algorithm sometimes does not compress each block, so it will leave some plaintext pdf blocks whick can contain parts like endstream or endobj. In this case it can happen that the self healing algorithm runs into such an uncompressed block and fail reading the object. I hope you understand what I mean :-) PS: some offtopic things. I think the signature implementation only work with the old parser. So maybe someone can post this info on the website if the default parser implementation change. Improve XRef self healing mechanism --- Key: PDFBOX-2250 URL: https://issues.apache.org/jira/browse/PDFBOX-2250 Project: PDFBox Issue Type: Improvement Components: Parsing Affects Versions: 1.8.6, 1.8.7, 2.0.0 Reporter: Andreas Lehmkühler Assignee: Andreas Lehmkühler PDFBOX-1769 introduced a self healing mechanism to repair corrupt XRef offsets. But that one was just a starter and there remain a lot of issues to be solved. I'm planing to solve at least some of them. All fixes and improvements are targeting the non-sequential parser and I won't port those changes to the old parser. 
-- This message was sent by Atlassian JIRA (v6.2#6252)
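To make the offset-repair idea in the comment above concrete, here is a minimal sketch of a bounded search for an object header near the offset stated in the xref table. It is not the PDFBox implementation: the class `OffsetRepairSketch`, the method `findObjectHeader` and the `window` parameter are hypothetical names chosen for illustration, and the whole file is assumed to fit in a byte array.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only; not part of PDFBox. */
final class OffsetRepairSketch {

    /**
     * Searches within +/- window bytes of the offset stated in the xref table
     * for the header "objNr gen obj". Returns the match closest to the stated
     * offset, or -1 if there is none.
     */
    static long findObjectHeader(byte[] pdf, long statedOffset, int objNr, int gen, int window) {
        byte[] header = (objNr + " " + gen + " obj").getBytes(StandardCharsets.ISO_8859_1);
        long from = Math.max(0, statedOffset - window);
        long to = Math.min((long) pdf.length - header.length, statedOffset + window);
        long best = -1;
        for (long i = from; i <= to; i++) {
            if (startsAt(pdf, (int) i, header) && precededByWhitespace(pdf, (int) i)) {
                if (best < 0 || Math.abs(i - statedOffset) < Math.abs(best - statedOffset)) {
                    best = i;
                }
            }
        }
        return best;
    }

    /** True if pattern occurs in data at position pos (caller guarantees bounds). */
    private static boolean startsAt(byte[] data, int pos, byte[] pattern) {
        for (int j = 0; j < pattern.length; j++) {
            if (data[pos + j] != pattern[j]) {
                return false;
            }
        }
        return true;
    }

    /** Rejects matches such as "1 0 obj" found inside "11 0 obj". */
    private static boolean precededByWhitespace(byte[] data, int pos) {
        if (pos == 0) {
            return true;
        }
        byte b = data[pos - 1];
        return b == ' ' || b == '\n' || b == '\r' || b == '\t' || b == '\f' || b == 0;
    }
}
```

Note that this sketch does nothing about the problem raised in the comment: if the window overlaps an uncompressed block of an embedded PDF, a header found there would be a false positive, so a real repair step would need an additional plausibility check on whatever it finds.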
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093186#comment-14093186 ] Tilman Hausherr commented on PDFBOX-2250: How/where can I see the difference between before and after for the sample PDF? Should the rendering be different, or something else?
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093187#comment-14093187 ] Andreas Lehmkühler commented on PDFBOX-2250: [~tilman] I should have been more specific. Use PDFDebugger and have a look at the root node. There isn't any StructTreeRoot when using the non-sequential parser without my changes.
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093191#comment-14093191 ] Andreas Lehmkühler commented on PDFBOX-2250: [~tchojecki] I'm aware of that feature, but that was more of a brute-force search. The non-sequential parser follows the spec and relies on the xref table itself. This issue targets some of the known problems, such as wrong offsets. The bugfix/improvement for compressed objects was just a side product, as I stumbled upon that missing part.
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093197#comment-14093197 ] Tilman Hausherr commented on PDFBOX-2250: These files can no longer be opened with the non-sequential parser:
- PDFBOX-1577 (Missing root object specification in trailer)
- PDFBOX-1756 (Catalog cannot be found)
- PDFBOX-2251 (Missing root object specification in trailer)
- PDFBOX-1512 (immo-kurier_arsenal_93x62.pdf): Missing root object specification in trailer
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093248#comment-14093248 ] Andreas Lehmkühler commented on PDFBOX-2250: Thanks for the fast feedback, I'll have a look
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14086481#comment-14086481 ] ASF subversion and git services commented on PDFBOX-2250: Commit 1615956 from [~lehmi] in branch 'pdfbox/branches/1.8' [ https://svn.apache.org/r1615956 ] PDFBOX-2250: removed false warn message, 0 as count value is allowed
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14086483#comment-14086483 ] ASF subversion and git services commented on PDFBOX-2250: Commit 1615957 from [~lehmi] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1615957 ] PDFBOX-2250: removed false warn message, 0 as count value is allowed
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14086489#comment-14086489 ] Andreas Lehmkühler commented on PDFBOX-2250: Some of the files Tim mentioned on the mailing list (e.g. testComment.pdf) produced the following warning:
{code}
Count in xref table is 0 at offset 68229
{code}
This was a false warning, as 0 is allowed as a count value within an xref table. I've removed that message, and Tim's PDFs no longer produce any of the given warnings. I'm continuing with the files Tilman mentioned.
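As an illustration of the point about the count value, the following sketch parses the header line of a classic xref subsection ("first object number", then "count") and accepts a count of 0 instead of warning about it. The class `XrefSubsectionHeader` and its `parse` method are hypothetical and only meant to show the rule; they are not the PDFBox code touched by the commit.

```java
/** Hypothetical value class for illustration only; not part of PDFBox. */
final class XrefSubsectionHeader {

    final long firstObjectNumber;
    final int count;

    private XrefSubsectionHeader(long firstObjectNumber, int count) {
        this.firstObjectNumber = firstObjectNumber;
        this.count = count;
    }

    /**
     * Parses a subsection header such as "0 6" or "0 0".
     * A count of 0 is accepted: it describes an empty subsection.
     */
    static XrefSubsectionHeader parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected '<first> <count>' but got: " + line);
        }
        long first = Long.parseLong(parts[0]);
        int count = Integer.parseInt(parts[1]);
        // Only negative values are rejected; zero is a legal count.
        if (first < 0 || count < 0) {
            throw new IllegalArgumentException("Negative value in xref subsection header: " + line);
        }
        return new XrefSubsectionHeader(first, count);
    }
}
```

With this rule, `XrefSubsectionHeader.parse("0 0")` yields a header with `count == 0`, and a caller would simply read zero entries for that subsection.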
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14082335#comment-14082335 ] Andreas Lehmkühler commented on PDFBOX-2250: I've started with a small set of [sample pdfs|http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/test/resources/test-documents/] provided by TIKA (thanks to [~talli...@mitre.org] for the pointer)
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14082338#comment-14082338 ] ASF subversion and git services commented on PDFBOX-2250: Commit 1615139 from [~lehmi] in branch 'pdfbox/branches/1.8' [ https://svn.apache.org/r1615139 ] PDFBOX-2250: detect XRef streams to avoid false positives when searching for the XRef table/stream
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14082343#comment-14082343 ] ASF subversion and git services commented on PDFBOX-2250: Commit 1615141 from [~lehmi] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1615141 ] PDFBOX-2250: detect XRef streams to avoid false positives when searching for the XRef table/stream
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14082350#comment-14082350 ] Andreas Lehmkühler commented on PDFBOX-2250: The former implementation wasn't able to detect XRef streams, which led to false positives ("Can't find the object xref at offset", e.g. in the TIKA PDF testPDF_childAttachments.pdf). After adding a detector, those messages vanished.
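The comment above does not say how the detector works, so the following is only a sketch of one plausible approach, under the assumption that the bytes at the candidate offset are inspected directly: a classic table starts with the keyword xref, while an XRef stream is an indirect object whose dictionary contains /Type /XRef. The class `XrefKindSketch`, the enum `Kind`, the method `classify` and the 1024-byte look-ahead are all hypothetical choices for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical classifier for illustration only; not part of PDFBox. */
final class XrefKindSketch {

    enum Kind { TABLE, STREAM, UNKNOWN }

    private static final Pattern TABLE_KEYWORD = Pattern.compile("\\s*xref\\s");
    private static final Pattern OBJ_HEADER = Pattern.compile("\\s*\\d+\\s+\\d+\\s+obj\\b");
    private static final Pattern STREAM_KEYWORD = Pattern.compile("\\bstream\\b");
    private static final Pattern XREF_TYPE = Pattern.compile("/Type\\s*/XRef\\b");

    /** Classifies what starts at the given offset: xref table, XRef stream, or neither. */
    static Kind classify(byte[] pdf, int offset) {
        if (offset < 0 || offset >= pdf.length) {
            return Kind.UNKNOWN;
        }
        // Inspect a limited look-ahead window; ISO-8859-1 maps bytes 1:1 to chars.
        int length = Math.min(pdf.length - offset, 1024);
        String head = new String(pdf, offset, length, StandardCharsets.ISO_8859_1);

        if (TABLE_KEYWORD.matcher(head).lookingAt()) {
            return Kind.TABLE;
        }
        Matcher header = OBJ_HEADER.matcher(head);
        if (!header.lookingAt()) {
            return Kind.UNKNOWN;
        }
        // The dictionary sits between the object header and the "stream" keyword.
        Matcher stream = STREAM_KEYWORD.matcher(head);
        if (!stream.find(header.end())) {
            return Kind.UNKNOWN;
        }
        String dictionary = head.substring(header.end(), stream.start());
        return XREF_TYPE.matcher(dictionary).find() ? Kind.STREAM : Kind.UNKNOWN;
    }
}
```

A caller that only looks for the xref keyword would report an offset pointing at an XRef stream as broken; distinguishing the two cases first avoids that kind of false positive.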