Quoting Maruan Sahyoun <[email protected]>:
Hi,
Hi,
...
I (re)started working on the new PDFParser. The PDFLexer as a
foundation, together with some tests, is ready so far. It might need
some more improvements moving forward.
I have a maybe silly question: what about the nonSeq parser we
already have? Doesn't it offer everything we need to parse a document?
What are the differences between the two?
I'm currently working on the first part of the parser implementation
which is a 'non caching' parser. It generates PD and COS level
objects but only keeps the necessary minimum, e.g. Xref, Trailer ..
That sounds really great. Do you already have a concept of how this
will work?
I think something like lazily initializing objects on access would be
a nice feature.
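A minimal sketch of what lazy initialization on access could look like
in Java. LazyObject and its loader are hypothetical names used only for
illustration, not existing PDFBox classes; the loader stands in for
whatever would seek to an object's offset and parse it.

```java
import java.util.function.Supplier;

/**
 * Hypothetical holder (not a PDFBox class) that defers creating an
 * object until it is first requested, then caches the result.
 */
final class LazyObject<T> {
    // Assumption: the loader does the expensive work, e.g. parsing
    // the object from the file at the offset given by the xref.
    private final Supplier<T> loader;
    private T value;
    private boolean loaded;

    LazyObject(Supplier<T> loader) {
        this.loader = loader;
    }

    /** Runs the loader on the first call only; later calls reuse the value. */
    synchronized T get() {
        if (!loaded) {
            value = loader.get();
            loaded = true;
        }
        return value;
    }

    /** Reports whether the loader has already run. */
    synchronized boolean isLoaded() {
        return loaded;
    }
}

final class LazyObjectDemo {
    public static void main(String[] args) {
        LazyObject<String> page = new LazyObject<>(() -> "parsed page");
        System.out.println(page.isLoaded()); // nothing parsed yet
        System.out.println(page.get());      // triggers the loader
        System.out.println(page.isLoaded());
    }
}
```

With something like this, the parser would only need to keep the xref
and trailer around and could hand out LazyObject instances for
everything else.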
...
As the parser parses the PDF, I am thinking about firing events, e.g.
to react to malformed PDFs. I consider this to be a better approach
than overriding methods or putting workarounds into the core code.
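A minimal sketch of the event idea, assuming a simple listener
interface. ParseEventListener, EventFiringParser and parseHeader are
hypothetical names for illustration only, and the header check is a
toy stand-in for real parsing.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical callback for problems found while parsing. */
interface ParseEventListener {
    void onMalformed(long offset, String message);
}

/** Hypothetical parser that reports problems instead of hiding workarounds. */
final class EventFiringParser {
    private static final String HEADER = "%PDF-";
    private final List<ParseEventListener> listeners = new ArrayList<>();

    void addListener(ParseEventListener listener) {
        listeners.add(listener);
    }

    private void fire(long offset, String message) {
        for (ParseEventListener listener : listeners) {
            listener.onMalformed(offset, message);
        }
    }

    /**
     * Toy check: returns true if a "%PDF-" header is present. If it is
     * preceded by junk, an event carrying the header offset is fired
     * and parsing continues; if it is missing, an event is fired and
     * false is returned.
     */
    boolean parseHeader(String data) {
        int index = data.indexOf(HEADER);
        if (index == 0) {
            return true;
        }
        if (index > 0) {
            fire(index, "junk before header");
            return true;
        }
        fire(0, "header not found");
        return false;
    }
}

final class EventFiringParserDemo {
    public static void main(String[] args) {
        EventFiringParser parser = new EventFiringParser();
        parser.addListener((offset, message) ->
                System.out.println("malformed at " + offset + ": " + message));
        parser.parseHeader("xx%PDF-1.4");
    }
}
```

The caller decides what to do with each event (log, repair, abort), so
the core parsing code stays free of special cases.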
I think I could use the PDFLexer e.g. to create an FDF parser to
fix the current importFDF() issues and maybe use that as a test
suite for the PDFLexer.
What about setting up a sandbox to share some initial code without
cluttering the current trunk?
An alternative to a branch can be a GitHub fork. Just fork pdfbox
and make your changes. You can always merge from "upstream" (the
original project) to stay on up-to-date classes. Afterwards you can
create a pull request to submit the changeset as a patch.
BR
Maruan Sahyoun
Best regards
Thomas