Hi there, here is a rough summary of some ideas I have for a potential PDFBox 2.0 release. Maybe we could capture these in a wiki page or a Jira ticket so we can add to them and agree on which ones we want to pursue. Once we have agreement, we could open individual tickets for them.
WDYT?

# Rearchitect PDF parsing into a lexer, an incremental (non-caching) parser and a caching parser
  o The lexer would be the low-level component delivering tokens to the parser. A sample implementation exists as part of PDFBOX-1000. The benefit would be clean low-level handling of tokens. Although I proposed the lexer, I'm not totally happy with the current implementation. That's something for another mail/ticket ...
  o The incremental (non-caching) parser would allow page-by-page, forward-only processing to support text extraction, merging, splitting …
    - The benefit would be lower memory consumption as well as potentially faster processing.
  o The caching parser would support applications such as PDFDebugger or PDFReader.

# Handling of PDF versions
The current implementation is a mix of PDF 1.4 and some ad hoc additions, without a clear distinction between what is and is not supported. We could add explicit support for handling versions in PDFBox, e.g. by marking certain methods and properties with the PDF version they support. This could also be a good basis for PDF/A and other compliance checks.

# Handle large PDF files
In addition to the PDF parsing, PDFBox does not always handle large PDF files well, as some of the references are implemented as int instead of long.

# Split PDFBox into modules
This would support use cases such as text extraction and merging with the minimum number of classes needed. More app-like tools such as the PDFDebugger or PDFReader could be additional modules.

With kind regards
Maruan Sahyoun
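
P.S. To make the int vs. long point about large files a bit more concrete, here is a minimal sketch. It is not actual PDFBox code: the class OffsetDemo and its two methods are made up for illustration, and the offset value is just an example of a position beyond 2 GiB. It only shows why a byte offset that is read correctly as a long becomes unusable once it is stored in an int.

```java
public class OffsetDemo {

    // Hypothetical helper: parses a byte offset given as decimal digits,
    // such as an object position in a large file.
    public static long parseOffset(String digits) {
        return Long.parseLong(digits.trim());
    }

    // Shows what happens when the same offset is squeezed into an int:
    // values above Integer.MAX_VALUE wrap around and become negative.
    public static int truncateToInt(long offset) {
        return (int) offset;
    }

    public static void main(String[] args) {
        long offset = parseOffset("3000000000"); // a position past the 2 GiB mark
        System.out.println("as long:     " + offset);
        System.out.println("as int:      " + truncateToInt(offset));
        System.out.println("fits in int: " + (offset <= Integer.MAX_VALUE));
    }
}
```

Any reference stored this way would point to a negative position for objects located past Integer.MAX_VALUE bytes, which is why keeping such values as long throughout would be needed.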