Quoting Maruan Sahyoun <[email protected]>:

Hi,
Hi,

...
I (re)started working on the new PDFParser. The PDFLexer as a foundation, together with some tests, is ready so far. It might need some more improvements going forward.
I have a maybe silly question: what about the nonSeq parser we already have? Doesn't it offer everything we need to parse a document? What are the differences between the two?

I'm currently working on the first part of the parser implementation, which is a 'non-caching' parser. It generates PD and COS level objects but keeps only the necessary minimum, e.g. Xref, Trailer ..
That sounds really great. Do you already have a concept of how this will work?

I think something like lazily initializing objects on access would be a nice feature.
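
To make the lazy-initialization idea concrete, here is a minimal sketch. LazyObject and its loader are hypothetical names for illustration only, not existing PDFBox classes; the loader stands in for whatever would seek to an object's offset and parse it. Note that the holder keeps the parsed value after the first access, which is an assumption; a strictly non-caching parser might drop it again.

```java
import java.util.function.Supplier;

/** Hypothetical holder that defers parsing an object until it is first accessed. */
final class LazyObject<T> {
    // Assumed to do the real work, e.g. seek to the object's offset and parse it.
    private final Supplier<T> loader;
    private T value;
    private boolean loaded;

    LazyObject(Supplier<T> loader) {
        this.loader = loader;
    }

    /** Runs the loader on the first call, then returns the stored result. */
    synchronized T get() {
        if (!loaded) {
            value = loader.get();
            loaded = true;
        }
        return value;
    }

    /** Reports whether the loader has already run. */
    synchronized boolean isLoaded() {
        return loaded;
    }
}
```

A caller would wrap each indirect object in such a holder when reading the xref table and only pay the parsing cost for the objects it actually touches.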

...
As the parser parses the PDF, I am thinking about firing events, e.g. to react to malformed PDFs. I consider this a better approach than overriding methods or putting workarounds into the core code.
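
One possible shape for such events, as a rough sketch only: ParseEventListener, EventingParser and fireMalformed are hypothetical names, not part of the existing PDFBox API. The parser reports a problem to registered listeners and lets them decide whether parsing should continue.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical callback interface; not part of the existing PDFBox API. */
interface ParseEventListener {
    /** Returns true to continue parsing, false to abort. */
    boolean onMalformed(long offset, String message);
}

/** Hypothetical parser skeleton that reports problems instead of working around them in place. */
class EventingParser {
    private final List<ParseEventListener> listeners = new ArrayList<>();

    void addListener(ParseEventListener listener) {
        listeners.add(listener);
    }

    /** Notifies every listener; parsing continues only if all of them agree. */
    boolean fireMalformed(long offset, String message) {
        boolean proceed = true;
        for (ParseEventListener listener : listeners) {
            // Non-short-circuit AND so that every listener is always notified.
            proceed &= listener.onMalformed(offset, message);
        }
        return proceed;
    }
}
```

With this shape a lenient application registers a listener that logs and returns true, while a strict validator returns false, and the core parsing code stays free of special cases.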


I think I could use the PDFLexer, e.g. to create an FDF parser to fix the current importFDF() issues, and maybe use that as a test suite for the PDFLexer.

What about setting up a sandbox to share some initial code without cluttering the current trunk?
An alternative to a branch is a GitHub fork. Just fork pdfbox and make your changes. You can always merge from "upstream" (the original project) to stay on up-to-date classes. Afterwards you can create a pull request, which provides a patch with the changeset.

BR
Maruan Sahyoun

Best regards
Thomas


