[ https://issues.apache.org/jira/browse/PDFBOX-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13024998#comment-13024998 ]
Adam Nichols commented on PDFBOX-1000:
--------------------------------------

I'll upload XrefEntry tonight. I also noticed that I made some slight changes to some other classes, but when I did a diff, they looked unrelated to this task. If it doesn't work as expected, let me know and I'll double check.

1.) My point was that [1] above is more difficult to parse than this (note the spaces between objects):

    31 0 obj << /Length 45 0 R /Length1 568 /Length2 1017 /Length3 0 >>

It would be much easier if the objects were separated in some way, such as with spaces. However, not all software does this, and since whitespace separation is not required by the spec, we can't depend on it.

Another, related, issue I ran into: when I read in "45", is that a COSInteger or an indirect reference? We don't know until we read the next "word". The next word is "0", so we still don't know whether it is an int or an indirect reference, but if the word after that is "R" then we know it's an indirect reference and we can process it. In the first example the last word was "R/Length1", which requires cleaning up before we can identify it as an "R". It's not unsolvable, but it makes things more difficult.

Currently, reading a "word" is defined (by me) as reading until whitespace is encountered. I suppose we could change this to reading until isWhitespace(c) || '/' == c || ']' == c || '>' == c (or something similar). I didn't test that because I thought it would cause problems with entries like "/Name Some string with name/identifier here", but on second thought that won't be a problem, as it'll just take more calls to readWord() to read in all the data for that object.

2.) Yes, the parser reads and parses all in one step. I suppose we could just read it into a string and then parse it after reading/parsing the xref table.
Or just read and ignore until we find the beginning, note the offset, and then read/parse it after dealing with the xref table. I think we'll also need a flag to tell us whether to use recursion to dereference objects. Normally we would, but not for the trailer or the root.

3.) We should be able to get something respectable fairly quickly, at which point I'll commit it to the official SVN after going over all modifications to existing classes to make sure they won't have any unintended side effects. In the meantime a unified diff/patch should work okay.

Here's my plan:
a.) Add a way to enable/disable recursive parsing. Recursion will be on by default, turned off for parsing the trailer/root, and then turned back on.
b.) Change readWord() to stop at '/', ']' and '>' (excluding the first character, which can be any non-whitespace).
c.) Clean up the ugly hacks, which are properly resolved by updating readWord().
d.) See if the above changes put the code at a reasonable starting point. If so, and if it won't cause any issues with the normal parser, commit to SVN.

> Conforming parser
> -----------------
>
>                 Key: PDFBOX-1000
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1000
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Parsing
>            Reporter: Adam Nichols
>            Assignee: Adam Nichols
>         Attachments: ConformingPDDocument.java, ConformingPDFParser.java, ConformingPDFParserTest.java, gdb-refcard.pdf
>
>
> A conforming parser will start at the end of the file and read backward until
> it has read the EOF marker, the xref location, and trailer[1]. Once this is
> read, it will read in the xref table so it can locate other objects and
> revisions. This also allows skipping objects which have been rendered
> obsolete (per the xref table)[2]. It also allows the minimum amount of
> information to be read when the file is loaded, and then subsequent
> information will be loaded if and when it is requested.
> This is all laid out in the official PDF specification, ISO 32000-1:2008.
> Existing code will be re-used where possible, but this will require new
> classes in order to accommodate the lazy reading which is a very different
> paradigm from the existing parser. Using separate classes will also
> eliminate the possibility of regression bugs from making their way into the
> PDDocument or BaseParser classes. Changes to existing classes will be kept
> to a minimum in order to prevent regression bugs.
> [1] Section 7.5.5 "Conforming readers should read a PDF file from its end"
> [2] Section 7.5.4 "the entire file need not be read to locate any particular
> object"
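
As a rough illustration of the readWord() change proposed in the comment (plan item b) and of the look-ahead needed to tell a plain integer from an indirect reference, here is a minimal Java sketch. It is not the code from the attached ConformingPDFParser.java: the class ReadWordSketch and the helpers endsWord(), readAllWords(), isUnsignedInteger() and isIndirectReference() are hypothetical names invented for this example, and it works on an in-memory String rather than a stream.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names are PDFBox APIs.
final class ReadWordSketch {
    private final String data;
    private int pos = 0;

    ReadWordSketch(String data) {
        this.data = data;
    }

    // White-space characters as listed in ISO 32000-1, section 7.2.2.
    static boolean isWhitespace(char c) {
        return c == 0 || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
    }

    // A word ends at whitespace or at one of the delimiters from plan item b.
    static boolean endsWord(char c) {
        return isWhitespace(c) || c == '/' || c == ']' || c == '>';
    }

    // Returns the next word, or null once the input is exhausted.
    String readWord() {
        while (pos < data.length() && isWhitespace(data.charAt(pos))) {
            pos++;
        }
        if (pos >= data.length()) {
            return null;
        }
        int start = pos++; // the first character may be any non-whitespace
        while (pos < data.length() && !endsWord(data.charAt(pos))) {
            pos++;
        }
        return data.substring(start, pos);
    }

    // Convenience helper: split a whole input into words.
    static List<String> readAllWords(String input) {
        ReadWordSketch reader = new ReadWordSketch(input);
        List<String> words = new ArrayList<>();
        for (String w = reader.readWord(); w != null; w = reader.readWord()) {
            words.add(w);
        }
        return words;
    }

    // True if the word consists only of the ASCII digits 0-9.
    static boolean isUnsignedInteger(String word) {
        return !word.isEmpty() && word.chars().allMatch(c -> c >= '0' && c <= '9');
    }

    // "45 0 R" is only recognisable once all three words have been seen.
    static boolean isIndirectReference(List<String> words, int i) {
        return i + 2 < words.size()
                && isUnsignedInteger(words.get(i))
                && isUnsignedInteger(words.get(i + 1))
                && "R".equals(words.get(i + 2));
    }

    public static void main(String[] args) {
        List<String> words = readAllWords("/Length 45 0 R/Length1 568/Length2 1017");
        System.out.println(words);
        System.out.println(isIndirectReference(words, 1));
    }
}
```

With this rule "R/Length1" splits into "R" and "/Length1" without any cleanup. One consequence of stopping at '>' is that a closing ">>" comes back as two separate ">" words, so a caller would have to recombine them.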