Hi, just wanted to give you an update on the new PDF parser.
I have (re)started work on the new PDFParser. The PDFLexer, which serves as the foundation, is ready together with some tests, although it may need further improvements as the work moves forward.

I'm currently working on the first part of the parser implementation, a 'non-caching' parser. It generates PD and COS level objects but keeps only the necessary minimum in memory, e.g. the Xref and Trailer, and not pages, resources and so on. On top of that there will be a 'caching' parser which keeps everything that has been parsed. I don't know yet whether this is doable, but the idea is that applications such as merging or splitting PDFs could benefit from a 'non-caching' parser.

The pure COS level parsing is done (e.g. generating a COS dictionary from tokens), but some additional work is needed around higher level structures, e.g. linearized PDFs.

Initially the parser reuses the existing classes wherever possible. Unfortunately the COS level classes, for example, don't share a common set of methods for instantiating them.

Question: Can we agree on how objects are instantiated, e.g. Obj.getInstance(token) or new Obj(token) ...?

All of this only makes sense if objects such as pages or resources can be fully cloned, so that cloned or imported objects no longer depend on the original object. This could benefit PDF merging, as one could close a PDF that is no longer needed. I think this will affect the current PD model.

Question: Can we already clone objects, and if not, what needs to be done to get there? Could we provide an importPage() so that the imported page is completely independent (and stored in memory or in a file-based cache)?

While the parser parses the PDF, I'm thinking about firing events, e.g. to react to malformed PDFs. I consider this a better approach than overriding methods or putting workarounds into the core code.

I think I could also use the PDFLexer to create, e.g., an FDF parser to fix the current importFDF() issues, and maybe use that as a test suite for the PDFLexer.
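
To make the getInstance(token) convention and the event idea a bit more concrete, here is a rough sketch. It is only an illustration under assumptions: SketchInteger, SketchParser and ParseEventListener are hypothetical names, not existing PDFBox classes, and the "tokens" are simply space-separated strings rather than real PDFLexer output.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical listener for problems found while parsing (not a PDFBox interface). */
interface ParseEventListener {
    void onMalformed(long offset, String message);
}

/** Hypothetical minimal COS-like integer that is only created through a static factory. */
final class SketchInteger {
    private final long value;

    private SketchInteger(long value) {
        this.value = value;
    }

    /** One possible common convention: every object type offers getInstance(token). */
    static SketchInteger getInstance(String token) {
        return new SketchInteger(Long.parseLong(token));
    }

    long longValue() {
        return value;
    }
}

/** Hypothetical parser that reports malformed tokens as events instead of aborting. */
final class SketchParser {
    private final List<ParseEventListener> listeners = new ArrayList<>();

    void addListener(ParseEventListener listener) {
        listeners.add(listener);
    }

    /** Parses space-separated integer tokens; a bad token fires an event and is skipped. */
    List<SketchInteger> parseIntegers(String input) {
        List<SketchInteger> result = new ArrayList<>();
        long offset = 0;
        for (String token : input.split(" ")) {
            if (!token.isEmpty()) {
                try {
                    result.add(SketchInteger.getInstance(token));
                } catch (NumberFormatException e) {
                    // Let the application decide how to react to the malformed input.
                    for (ParseEventListener listener : listeners) {
                        listener.onMalformed(offset, "not an integer: " + token);
                    }
                }
            }
            offset += token.length() + 1; // token plus the separating space
        }
        return result;
    }
}
```

An application would register a listener (e.g. one that logs, repairs or aborts) and the core parser would stay free of workarounds. Whether a static factory or a constructor is preferable is exactly the open question above; the factory is used here only because it leaves room for returning shared instances later.
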
What about setting up a sandbox to share some initial code without cluttering the current trunk? WDYT?

BR
Maruan Sahyoun
