[ https://issues.apache.org/jira/browse/PDFBOX-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13024725#comment-13024725 ]
Adam Nichols commented on PDFBOX-1000:
--------------------------------------

I ran into some interesting problems while working on this tonight.

1.) I realized that some PDFs do not have spaces delimiting each item in a dictionary. For an example see [1]. I looked it up in the spec and found nothing that requires white space, which means this appears to be a conforming PDF, but it's a nightmare to parse. I hacked together code which is sufficient to parse my test PDF, but I need to find a better way to deal with this. The current "solution" just codes around this one PDF and isn't actually solving anything. Reading until we hit whitespace will sometimes get us the entire object, but sometimes it gets us multiple objects (one object and parts of the next). Reading until "]", ">>" or "/" would lead to false positives, as any of these characters can legitimately appear in a string object. I'll have to think more about this one...

2.) I solved the infinite recursion problem by keeping loaded objects in memory and referencing the Map. However, the dictionary is parsed before the xref table is read, so there's no way to read these objects as we go. For now I just iterate through the root element and read all of the items in the trailer dictionary (e.g. Root, Info, Size) which are weak references, as this is required info for doing anything at all with the PDF.

Relatedly, I need to find a way to do lazy evaluation on this. Currently, when the parser reads in the root object, it ends up traversing the entire tree. While this isn't a "problem", it shouldn't be necessary until the user requests information from these objects. Deferring it will reduce load times, memory usage, and CPU usage (for loading), and will just generally be a good thing.

3.) This isn't really a problem, but for the record, the new parser doesn't currently have support for streams. The parser will just ignore them for now, which seems like a reasonable solution.

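For item 1, here is a minimal sketch of one possible direction. It is not the code from the classes being uploaded. ISO 32000-1 section 7.2.2 lists the delimiter characters ( ) < > [ ] { } / and %, and a token ends at whitespace or at any of them. If the scanner consumes a whole string object as one token (balanced parentheses with backslash escapes, or <...> for hex strings), then "]", ">>" and "/" inside a string can no longer cause false positives. The class name PdfTokenSketch and the method tokenize are hypothetical, and the sketch works on a String rather than a stream to keep it short.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not part of PDFBox: splits PDF syntax into tokens. */
class PdfTokenSketch {
    // Delimiter characters per ISO 32000-1, section 7.2.2.
    private static final String DELIMITERS = "()<>[]{}/%";

    static boolean isWhitespace(char c) {
        return c == 0 || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
    }

    static boolean isDelimiter(char c) {
        return DELIMITERS.indexOf(c) >= 0;
    }

    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        int n = s.length();
        int i = 0;
        while (i < n) {
            char c = s.charAt(i);
            int start = i;
            if (isWhitespace(c)) {
                i++;
                continue;
            } else if (c == '%') {
                // Comment: skip to the end of the line, emit no token.
                while (i < n && s.charAt(i) != '\n' && s.charAt(i) != '\r') i++;
                continue;
            } else if (c == '(') {
                // Literal string: track nesting depth and skip escaped characters.
                int depth = 0;
                do {
                    char d = s.charAt(i);
                    if (d == '\\') i++;          // skip the escaped character
                    else if (d == '(') depth++;
                    else if (d == ')') depth--;
                    i++;
                } while (i < n && depth > 0);
            } else if (c == '<' && i + 1 < n && s.charAt(i + 1) == '<') {
                i += 2;                          // dictionary open
            } else if (c == '>' && i + 1 < n && s.charAt(i + 1) == '>') {
                i += 2;                          // dictionary close
            } else if (c == '<') {
                // Hex string: read up to and including the closing '>'.
                while (i < n && s.charAt(i) != '>') i++;
                if (i < n) i++;
            } else if (c == '/') {
                // Name: the solidus plus following regular characters.
                i++;
                while (i < n && !isWhitespace(s.charAt(i)) && !isDelimiter(s.charAt(i))) i++;
            } else if (isDelimiter(c)) {
                i++;                             // '[', ']', '{', '}' or a stray ')' / '>'
            } else {
                // Number or keyword: read until whitespace or any delimiter.
                while (i < n && !isWhitespace(s.charAt(i)) && !isDelimiter(s.charAt(i))) i++;
            }
            tokens.add(s.substring(start, Math.min(i, n)));
        }
        return tokens;
    }
}
```

Calling PdfTokenSketch.tokenize on the dictionary from [1] splits "R/Length1" into "R" and "/Length1" even though no space separates them. A real readObject() would still need to build objects from these tokens and handle streams, which this sketch ignores.
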
Since people are interested, I'll upload my classes, but they're far from being ready to commit right now. Patches and suggestions are certainly welcome, especially for the readObject() method of ConformingPDFParser.

[1] 31 0 obj <</Length 45 0 R/Length1 568/Length2 1017/Length3 0>>

> Conforming parser
> -----------------
>
> Key: PDFBOX-1000
> URL: https://issues.apache.org/jira/browse/PDFBOX-1000
> Project: PDFBox
> Issue Type: New Feature
> Components: Parsing
> Reporter: Adam Nichols
> Assignee: Adam Nichols
>
> A conforming parser will start at the end of the file and read backward until
> it has read the EOF marker, the xref location, and trailer[1]. Once this is
> read, it will read in the xref table so it can locate other objects and
> revisions. This also allows skipping objects which have been rendered
> obsolete (per the xref table)[2]. It also allows the minimum amount of
> information to be read when the file is loaded, and then subsequent
> information will be loaded if and when it is requested. This is all laid out
> in the official PDF specification, ISO 32000-1:2008.
> Existing code will be re-used where possible, but this will require new
> classes in order to accommodate the lazy reading which is a very different
> paradigm from the existing parser. Using separate classes will also
> eliminate the possibility of regression bugs from making their way into the
> PDDocument or BaseParser classes. Changes to existing classes will be kept
> to a minimum in order to prevent regression bugs.
> [1] Section 7.5.5 "Conforming readers should read a PDF file from its end"
> [2] Section 7.5.4 "the entire file need not be read to locate any particular
> object"

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
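
To illustrate the "read from the end" step in the quoted issue description, here is a minimal sketch under stated assumptions and not the ConformingPDFParser implementation. It looks only at the last `window` bytes of the file, finds the final %%EOF marker, and parses the byte offset that follows the preceding startxref keyword. The names StartXrefSketch and findStartXref are hypothetical, as is the choice of window size by the caller. The whole file is passed as a byte array only to keep the example self-contained.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch, not part of PDFBox: locates the xref offset from the file's tail. */
class StartXrefSketch {
    /**
     * Returns the offset written after the last "startxref" keyword that
     * precedes the final "%%EOF" marker, or -1 if either is missing or malformed.
     */
    static long findStartXref(byte[] data, int window) {
        // Only the tail of the file is examined.
        int from = Math.max(0, data.length - window);
        // ISO-8859-1 maps every byte to exactly one char, so indices stay aligned.
        String tail = new String(data, from, data.length - from, StandardCharsets.ISO_8859_1);

        int eof = tail.lastIndexOf("%%EOF");
        if (eof < 0) return -1;

        // Search backward from the EOF marker for the keyword.
        int kw = tail.lastIndexOf("startxref", eof);
        if (kw < 0) return -1;

        // The offset sits between the keyword and the EOF marker.
        String digits = tail.substring(kw + "startxref".length(), eof).trim();
        try {
            return Long.parseLong(digits);
        } catch (NumberFormatException e) {
            return -1;
        }
    }
}
```

In a real parser the tail would more likely be read with java.io.RandomAccessFile (seek to length minus window, then read), so that the rest of the file is not loaded until the xref table says where to look.
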