[jira] [Commented] (PDFBOX-1000) Conforming parser

Adam Nichols (JIRA) Mon, 25 Apr 2011 15:58:45 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13024998#comment-13024998
 ]


Adam Nichols commented on PDFBOX-1000:
--------------------------------------

I'll upload XrefEntry tonight.  I also noticed that I made some slight changes 
to some other classes, but when I did a diff, it looked like they were 
unrelated to this task.  If it doesn't work as expected, let me know and I'll 
double check.

1.) My point was that [1] above is more difficult to parse than this (note the 
spaces between objects):
31 0 obj
<< /Length 45 0 R /Length1 568 /Length2 1017 /Length3 0 >> 

It would be much easier if the objects were separated in some way, like with 
spaces.  However, not all software does this and since white space separation 
is not required per the spec, we can't depend on this.

Another, related issue I ran into: when I read in "45", is that a COSInteger 
or an indirect reference?  We don't know until we read the next "word".  The 
next word is "0", so we still don't know if it is an int or an indirect 
reference, but if the next "word" is an "R" then we know it's an indirect 
reference and we can process it.  In the first example the last word was 
"R/Length1", which requires cleaning up before we can identify it as an "R".  
It's not unsolvable, but it makes things more difficult.
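
A rough sketch of that lookahead, assuming the words have already been split 
apart cleanly.  The class and method names here are made up for illustration 
and are not part of PDFBox:

```java
import java.util.ArrayList;
import java.util.List;

class RefLookahead {
    /** Labels each word, merging an "n g R" triple into one reference token. */
    static List<String> classify(String[] words) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < words.length) {
            boolean isRef = isInt(words[i])
                    && i + 2 < words.length
                    && isInt(words[i + 1])
                    && "R".equals(words[i + 2]);
            if (isRef) {
                // Two integers followed by "R": an indirect reference.
                out.add("ref(" + words[i] + "," + words[i + 1] + ")");
                i += 3;
            } else if (isInt(words[i])) {
                // An integer that is not the start of a reference.
                out.add("int(" + words[i] + ")");
                i++;
            } else {
                out.add("other(" + words[i] + ")");
                i++;
            }
        }
        return out;
    }

    /** True if s consists only of decimal digits (no sign handling here). */
    static boolean isInt(String s) {
        if (s.isEmpty()) {
            return false;
        }
        for (int k = 0; k < s.length(); k++) {
            if (!Character.isDigit(s.charAt(k))) {
                return false;
            }
        }
        return true;
    }
}
```

The point is only that "45" cannot be classified until two more words have 
been seen; a real parser would build COS objects instead of strings.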

Currently, reading a "word" is defined (by me) as reading until whitespace is 
encountered.  I suppose we could change this to reading until isWhitespace(c) 
|| '/' == c || ']' == c || '>' == c   (or something similar).  I didn't test 
that because I was thinking it would cause problems with entries like 
"/Name Some string with name/identifier here", but on second thought that 
won't be a problem, as it'll just take more calls to readWord() to read in 
all the data for that object.
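
A minimal sketch of that readWord() variant, reading from an in-memory string 
rather than the real input stream.  WordReader is a hypothetical class, not 
existing PDFBox code, and Character.isWhitespace() is only an approximation of 
the spec's whitespace set:

```java
class WordReader {
    private final String data;
    private int pos = 0;

    WordReader(String data) {
        this.data = data;
    }

    /** The three stop characters proposed above; the spec lists more. */
    static boolean isDelimiter(char c) {
        return c == '/' || c == ']' || c == '>';
    }

    /** Returns the next word, or null once the input is exhausted. */
    String readWord() {
        // Skip leading whitespace.
        while (pos < data.length() && Character.isWhitespace(data.charAt(pos))) {
            pos++;
        }
        if (pos >= data.length()) {
            return null;
        }
        // The first character is always accepted, even if it is a delimiter.
        int start = pos++;
        while (pos < data.length()) {
            char c = data.charAt(pos);
            if (Character.isWhitespace(c) || isDelimiter(c)) {
                break;
            }
            pos++;
        }
        return data.substring(start, pos);
    }
}
```

With this rule the input "45 0 R/Length1 568" comes back as the words "45", 
"0", "R", "/Length1" and "568", so the "R" no longer needs cleaning up.  One 
side effect is that ">>" is returned as two separate ">" words.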


2.) Yes, the parser reads/parses all in one step.  I suppose we could just 
read it into a string and then parse it after reading/parsing the xref table.  
Or just read & ignore until we find the beginning, mark down the offset, and 
then read/parse it after dealing with the xref table.  I think we'll also need 
a flag to tell us if we want to use recursion to dereference objects or not.  
Normally we would, but not for the trailer nor root.
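
One way such a flag could look, as a sketch only.  LazyResolver and Ref are 
hypothetical names, objects are keyed by object number alone (generation 
numbers are ignored), and there is no protection against reference cycles:

```java
import java.util.HashMap;
import java.util.Map;

class LazyResolver {
    /** Placeholder for an unresolved "n g R" reference. */
    static final class Ref {
        final int objNum;
        final int genNum;

        Ref(int objNum, int genNum) {
            this.objNum = objNum;
            this.genNum = genNum;
        }
    }

    private final Map<Integer, Object> objects = new HashMap<>();
    private boolean recursive = true; // dereferencing is on by default

    void put(int objNum, Object value) {
        objects.put(objNum, value);
    }

    void setRecursive(boolean recursive) {
        this.recursive = recursive;
    }

    /** Follows references while the flag is on; otherwise returns them as-is. */
    Object resolve(Object value) {
        if (recursive && value instanceof Ref) {
            return resolve(objects.get(((Ref) value).objNum));
        }
        return value;
    }
}
```

The trailer/root would be parsed after setRecursive(false), leaving Ref 
placeholders in place, and the flag would be switched back on afterwards.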

3.) We should be able to get something which is respectable fairly quickly at 
which point I'll commit it to the official SVN after going over any and all 
modifications to existing classes to make sure they won't have any unintended 
side-effects.  In the meantime a unified diff/patch should work okay.

Here's my plan:
a.) Add a way to enable/disable recursive parsing.  Recursion will be on by 
default, off for parsing the trailer/root, and then turned back on.
b.) Change readWord() to stop at '/' ']' and '>' (excluding the first 
character, which can be any non-whitespace).
c.) Clean up the ugly hacks which are properly resolved by updating readWord()
d.) See if the above changes put the code into a reasonable starting point.  If 
so, and if it won't cause any issues with the normal parser, commit to svn.



> Conforming parser
> -----------------
>
>                 Key: PDFBOX-1000
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1000
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Parsing
>            Reporter: Adam Nichols
>            Assignee: Adam Nichols
>         Attachments: ConformingPDDocument.java, ConformingPDFParser.java, 
> ConformingPDFParserTest.java, gdb-refcard.pdf
>
>
> A conforming parser will start at the end of the file and read backward until 
> it has read the EOF marker, the xref location, and trailer[1].  Once this is 
> read, it will read in the xref table so it can locate other objects and 
> revisions.  This also allows skipping objects which have been rendered 
> obsolete (per the xref table)[2].  It also allows the minimum amount of 
> information to be read when the file is loaded, and then subsequent 
> information will be loaded if and when it is requested.  This is all laid out 
> in the official PDF specification, ISO 32000-1:2008.
> Existing code will be re-used where possible, but this will require new 
> classes in order to accommodate the lazy reading which is a very different 
> paradigm from the existing parser.  Using separate classes will also 
> eliminate the possibility of regression bugs from making their way into the 
> PDDocument or BaseParser classes.  Changes to existing classes will be kept 
> to a minimum in order to prevent regression bugs.
> [1] Section 7.5.5 "Conforming readers should read a PDF file from its end"
> [2] Section 7.5.4 "the entire file need not be read to locate any particular 
> object"

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
