Hi Maruan,

On 07.12.2013 13:39, Maruan Sahyoun wrote:
Hi,

just wanted to give you an update on the new PDF parser.

I (re-)started working on the new PDFParser. The PDFLexer as its foundation - 
together with some tests - is ready so far. It might need some more improvements 
moving forward.
Cool!

I'm currently working on the first part of the parser implementation, which is a 'non 
caching' parser. It generates PD- and COS-level objects but only keeps the necessary 
minimum, e.g. the Xref and the trailer, and doesn't keep pages, resources etc. in memory. 
On top of that sits a 'caching' parser which keeps what has been parsed. I don't know if 
that's doable, but the idea is that applications like merging or splitting PDFs could 
benefit from a 'non caching' parser. The pure COS-level parsing is done (e.g. generating 
a COSDictionary from tokens), but some additional work is needed around higher-level 
structures, e.g. linearized PDFs. Initially the parser reuses most of the existing 
classes where possible. Unfortunately the COS-level classes, for example, don't share a 
common set of methods for instantiating them.

Question: Can we agree on how objects are instantiated, e.g. 
Obj.getInstance(token) or new Obj(token)?

This only makes sense if the objects themselves, like pages or resources, can be 
fully cloned, so that cloned or imported objects no longer have a dependency on 
the original object. This could benefit PDF merging, as one could then close a 
PDF that is no longer needed. I think this will affect the current PD model.
I don't see any real advantages or disadvantages to either approach. Maybe the first
could be enforced by an interface or an abstract method in the superclass, so that
every subclass has to implement it.
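To make the second idea concrete, here is a minimal sketch of what such an abstract method in a common superclass could look like. All names (COSToken, COSBase, initFromToken, getInstance) are illustrative assumptions, not the current PDFBox API:

```java
// Hypothetical sketch: a uniform instantiation contract for COS-level objects,
// enforced by an abstract method in the superclass. Not the actual PDFBox API.

/** A minimal token as produced by the lexer (illustrative). */
class COSToken {
    final String text;
    COSToken(String text) { this.text = text; }
}

/** Common superclass forcing every subclass to implement token-based setup. */
abstract class COSBase {
    /** Each subclass decides how to initialize itself from a token. */
    abstract void initFromToken(COSToken token);

    /** Single entry point the parser can use regardless of concrete type. */
    static <T extends COSBase> T getInstance(T instance, COSToken token) {
        instance.initFromToken(token);
        return instance;
    }
}

/** Example subclass: a PDF name object. */
class COSName extends COSBase {
    String name;
    @Override
    void initFromToken(COSToken token) {
        // strip the leading '/' of a PDF name token
        this.name = token.text.startsWith("/") ? token.text.substring(1) : token.text;
    }
}
```

The parser would then always call COSBase.getInstance(...) and never a type-specific constructor directly, which keeps the instantiation style consistent across all COS classes.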

Question: Can we already clone, and what needs to be done to make that possible? Could 
we add an importPage() so the imported page is completely independent (and stored 
in memory or in a file-based cache)?
Hmmm, I'm not sure, but it looks like the answer is no.
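Just to illustrate what "completely independent" would mean in practice, here is a sketch of a deep import that recursively copies a page's dictionary tree so it no longer references objects owned by the source document. The class and method names (PageDict, PageImporter, importPage) are hypothetical, not the current PD model:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a deep page import. After importPage() returns, the
// copy shares no mutable objects with the source, so the source document could
// be closed without affecting the imported page.

/** Simplified stand-in for a page's underlying dictionary. */
class PageDict {
    final Map<String, Object> entries = new HashMap<>();
}

class PageImporter {
    /** Recursively copies a dictionary; nested dictionaries are cloned too. */
    static PageDict importPage(PageDict source) {
        PageDict copy = new PageDict();
        for (Map.Entry<String, Object> e : source.entries.entrySet()) {
            Object value = e.getValue();
            copy.entries.put(e.getKey(),
                value instanceof PageDict ? importPage((PageDict) value) : value);
        }
        return copy;
    }
}
```

A real implementation would also have to handle indirect references, streams, and shared resources (and avoid infinite recursion on cyclic structures), which is where the hard part lies.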

As the parser parses the PDF, I'm thinking about firing events, e.g. to react to 
malformed PDFs. I consider this a better approach than overriding 
methods or putting workarounds into the core code.
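As a rough sketch of the event idea, assuming a hypothetical listener interface (none of these names exist in PDFBox today): the parser fires an event when it encounters a malformed construct, and registered listeners decide how to react, instead of the core parser being subclassed:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: listeners react to malformed-PDF events instead of
// workarounds living in the core parser code. All names are illustrative.

interface ParseErrorListener {
    /** Return true if the problem was handled and parsing may continue. */
    boolean onMalformed(long offset, String description);
}

class EventFiringParser {
    private final List<ParseErrorListener> listeners = new ArrayList<>();

    void addListener(ParseErrorListener l) { listeners.add(l); }

    /** Called internally when the parser hits something unexpected. */
    boolean fireMalformed(long offset, String description) {
        boolean handled = false;
        for (ParseErrorListener l : listeners) {
            handled |= l.onMalformed(offset, description);
        }
        return handled;
    }
}
```

A strict application could register a listener that throws, while a lenient repair tool could log the event and let parsing continue - without either touching the parser itself.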


I think I could use the PDFLexer to, e.g., create an FDF parser to fix the 
current importFDF() issues, and maybe use that as a test suite for the PDFLexer.



What about setting up a sandbox to share some initial code without cluttering the 
current trunk?
That's a good idea, as it is quite hard to discuss such complex questions in
theory. How about creating a separate branch for the new parser?

WDYT?


BR
Maruan Sahyoun

BR
Andreas Lehmkühler
