Hi

> Quote from Maruan Sahyoun <[email protected]>:
>
>> Hi,
>
> Hi,
>
>> ...
>> I (re-)started working on the new PDFParser. The PDFLexer as a
>> foundation - together with some tests - is ready so far. It might need
>> some more improvements moving forward.
>
> I have a "maybe silly" question: what about the nonSeq parser we already
> have? Doesn't it offer everything we need to parse a document? What are
> the differences between the two?
That's a very valid question. The main difference is that the PDFLexer etc.
started as part of a discussion about having a PDF-spec-conforming parser.
As part of that, the idea was to start completely from scratch to

a) maybe come up with a new approach
b) include new ideas
c) revisit all objects and implementations
d) follow the spec as closely as possible

So there is nothing wrong with the nonSeq parser. In fact it is a very good
improvement over the older parsers.

>> I'm currently working on the first part of the parser implementation,
>> which is a 'non caching' parser. It generates PD and COS level objects
>> but only keeps the necessary minimum, e.g. Xref, Trailer ...
>
> That sounds really great. Do you already have a concept for how this
> will work?

Well, the lexer already works that way. It reads token by token but
forgets about the tokens read; only a minimum of information is kept. If,
for example, you skip a token, the information about that particular token
is gone, and the information for a token is only completely gathered if
you get the token. For the 'non caching' parser, e.g. a page is only
parsed if you get it. If one is not interested in a particular page, the
objects making up that page are not parsed. The idea is to some extent
similar to XML processing using a DOM or the event or cursor parser. The
approach for the PDFLexer was taken from XML stream processing.

> I think something like lazily initializing objects on access would be a
> nice feature.

Yes, that's the idea. Only parse what's requested.

>> ...
>> As the parser parses the PDF I think about firing events, e.g. to react
>> to malformed PDFs. I consider this to be a better approach than
>> overriding methods or putting workarounds into the core code.
>>
>> I think I could use the PDFLexer e.g. to create an FDF parser to fix
>> the current importFDF() issues and maybe use that as a test suite for
>> the PDFLexer.
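To make the cursor-style, forget-as-you-go behaviour concrete, here is a
minimal Java sketch of the idea. This is only an illustration, not the
actual PDFLexer code: the class and method names (PdfLexer, nextToken(),
getTokenText()) and the toy token grammar are made up for the example.

```java
// Hypothetical sketch of a StAX-cursor-style lexer: it advances token by
// token, keeps no history, and materializes a token's text only on request.
// Names and grammar here are illustrative, not the real PDFBox API.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PdfLexerSketch {

    enum TokenKind { NAME, NUMBER, EOF }

    static class PdfLexer {
        // Toy grammar: /Name tokens and integer tokens only.
        private static final Pattern TOKEN = Pattern.compile("/(\\w+)|(\\d+)");
        private final Matcher m;
        private TokenKind kind = null;
        private String text = null;      // filled lazily, only on getTokenText()
        private boolean pendingText = false;

        PdfLexer(String input) {
            this.m = TOKEN.matcher(input);
        }

        // Advance to the next token. The previous token's state is
        // discarded - nothing read so far is cached.
        TokenKind nextToken() {
            text = null;                 // forget the previous token
            if (m.find()) {
                kind = (m.group(1) != null) ? TokenKind.NAME : TokenKind.NUMBER;
                pendingText = true;
            } else {
                kind = TokenKind.EOF;
                pendingText = false;
            }
            return kind;
        }

        // Gather the token's content only when the caller asks for it.
        // If the caller skipped ahead instead, that information is gone.
        String getTokenText() {
            if (pendingText && text == null) {
                text = (kind == TokenKind.NAME) ? m.group(1) : m.group(2);
            }
            return text;
        }
    }

    public static void main(String[] args) {
        PdfLexer lexer = new PdfLexer("/Type 42 /Pages 7");
        lexer.nextToken();                        // NAME   -> we want this one
        System.out.println(lexer.getTokenText()); // Type
        lexer.nextToken();                        // NUMBER -> skipped: its text
        lexer.nextToken();                        // NAME      is never gathered
        System.out.println(lexer.getTokenText()); // Pages
    }
}
```

The same principle would scale up a level: a 'non caching' parser built on
such a lexer only assembles the COS/PD objects for a page when that page
is actually requested.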
>>
>> What about setting up a sandbox to share some initial code without
>> cluttering the current trunk?
>
> An alternative to a branch can be a GitHub fork. Just fork pdfbox and
> make your changes. You can always merge from the "upstream" (the
> original project) and use up-to-date classes. Afterwards you can create
> a pull request to provide a patch with the change set.

Good idea - let's see what others are saying.

>> BR
>> Maruan Sahyoun
>
> Best regards
> Thomas
