Re: [DISCUSS] PDFParser

Thomas Chojecki Sat, 07 Dec 2013 06:04:48 -0800


Quoting Maruan Sahyoun <[email protected]>:

Hi,

Hi,

...
I (re-)started working on the new PDFParser. The PDFLexer as a foundation - together with some tests - is ready so far. It might need some more improvements moving forward.

I have a maybe silly question: what about the nonSeq parser we already have? Doesn't it offer everything we need to parse a document? What are the differences between the two?

I'm currently working on the first part of the parser implementation, which is a 'non caching' parser. It generates PD and COS level objects but only keeps the necessary minimum, e.g. Xref, Trailer ..

That sounds really great. Do you already have a concept of how this will work?

I think something like lazily initializing objects on access would be a nice feature.
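
To make the lazy-initialization idea concrete, here is a minimal sketch in plain Java. The names LazyObject, get() and isLoaded() are made up for illustration and are not part of PDFBox or of the new parser; the supplier stands in for whatever code would read an object from the file.

```java
import java.util.function.Supplier;

// Hypothetical holder (not a PDFBox class): the wrapped value is only
// produced the first time get() is called, then reused.
final class LazyObject<T> {
    private final Supplier<T> loader; // stands in for "parse this object from the file"
    private T value;
    private boolean loaded;

    LazyObject(Supplier<T> loader) {
        this.loader = loader;
    }

    // Runs the loader on first access only; later calls return the stored value.
    synchronized T get() {
        if (!loaded) {
            value = loader.get();
            loaded = true;
        }
        return value;
    }

    synchronized boolean isLoaded() {
        return loaded;
    }
}

final class LazyDemo {
    public static void main(String[] args) {
        int[] parseCount = {0};
        LazyObject<String> obj = new LazyObject<>(() -> {
            parseCount[0]++;              // count how often "parsing" happens
            return "contents of object 12 0";
        });
        System.out.println(obj.isLoaded()); // false: nothing parsed yet
        obj.get();
        obj.get();
        System.out.println(parseCount[0]);  // 1: parsed once, on first access
    }
}
```

A 'non caching' variant could simply drop the stored value and call the loader on every access, trading memory for repeated reads.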

...
As the parser parses the PDF, I am thinking about firing events, e.g. to react to malformed PDFs. I consider this a better approach than overriding methods or putting workarounds into the core code.
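
A rough sketch of how such events could look, assuming a simple listener interface. ParseEvent, ParseEventListener and EventFiringParser are hypothetical names for illustration only, and the header check is a toy stand-in for real parsing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical event: a problem found at a given byte offset.
final class ParseEvent {
    final long offset;
    final String message;

    ParseEvent(long offset, String message) {
        this.offset = offset;
        this.message = message;
    }
}

// Hypothetical callback that client code registers to react to problems.
interface ParseEventListener {
    void onMalformed(ParseEvent event);
}

// Toy parser: it only checks the header line, but reports problems to
// listeners instead of hard-coding a workaround.
final class EventFiringParser {
    private final List<ParseEventListener> listeners = new ArrayList<>();

    void addListener(ParseEventListener listener) {
        listeners.add(listener);
    }

    // Returns true for a well-formed header; otherwise fires an event
    // and returns false so the caller can decide how to continue.
    boolean checkHeader(String firstLine) {
        if (firstLine != null && firstLine.startsWith("%PDF-")) {
            return true;
        }
        fire(new ParseEvent(0, "missing %PDF- header"));
        return false;
    }

    private void fire(ParseEvent event) {
        for (ParseEventListener listener : listeners) {
            listener.onMalformed(event);
        }
    }
}
```

A caller would register a listener with addListener(e -> ...) and decide there whether to log, repair or abort, so the core parsing code stays free of special cases.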
I think I could use the PDFLexer, e.g. to create an FDF parser to fix the current importFDF() issues, and maybe use that as a test suite for the PDFLexer.
What about setting up a sandbox to share some initial code without cluttering the current trunk?

An alternative to a branch can be a GitHub fork. Just fork pdfbox and make your changes. You can always merge from the "upstream" (the original project) and use up-to-date classes. In the end you can create a pull request to produce a patch with the changeset.

BR
Maruan Sahyoun


Best regards
Thomas
