Hi there,

here is a rough summary of some ideas I have for a potential PDFBox 2.0 
release. Maybe we could capture these in a wiki page or a JIRA ticket so we 
can add to them and agree on the ones we want. Once we have agreement, 
we could open individual tickets for them.

WDYT?


# rearchitect PDF parsing into a lexer, an incremental (non-caching) parser 
and a caching parser
o the lexer would be the low-level component delivering tokens to the parser. A 
sample implementation exists as part of PDFBOX-1000. The benefit would be 
clean low-level handling of tokens. Although I proposed the lexer, I'm not 
totally happy with the current implementation. That's something for another 
mail/ticket ...
o the incremental (non-caching) parser would allow page-by-page processing, 
moving forward only, to support text extraction, merging, splitting … The 
benefit would be lower memory consumption as well as potentially faster 
processing.
o the caching parser would support applications such as PDFDebugger or PDFReader 
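
To make the lexer/parser split a bit more concrete, here is a minimal sketch 
of what "delivering tokens" could mean. It is not the PDFBOX-1000 code; the 
class name SimpleLexerSketch, the token types and the tokenize method are 
all made up for illustration, and strings, hex strings and comments are 
deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: not the PDFBOX-1000 lexer and not a PDFBox API.
final class SimpleLexerSketch {
    enum Type { NUMBER, NAME, KEYWORD, DICT_OPEN, DICT_CLOSE, ARRAY_OPEN, ARRAY_CLOSE }

    static final class Token {
        final Type type;
        final String text;
        Token(Type type, String text) { this.type = type; this.text = text; }
        @Override public String toString() { return type + "(" + text + ")"; }
    }

    // PDF whitespace characters: NUL, TAB, LF, FF, CR and SPACE.
    private static boolean isWhitespace(char c) {
        return c == 0 || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
    }

    // PDF delimiter characters.
    private static boolean isDelimiter(char c) {
        return "()<>[]{}/%".indexOf(c) >= 0;
    }

    private static boolean isNumber(String text) {
        return text.matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)");
    }

    // Returns the index of the first whitespace or delimiter at or after 'from'.
    private static int endOfRegularChars(String input, int from) {
        int j = from;
        while (j < input.length() && !isWhitespace(input.charAt(j)) && !isDelimiter(input.charAt(j))) {
            j++;
        }
        return j;
    }

    // Turns the input into a flat list of tokens; the parser never sees raw characters.
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (isWhitespace(c)) {
                i++;
            } else if (input.startsWith("<<", i)) {
                tokens.add(new Token(Type.DICT_OPEN, "<<"));
                i += 2;
            } else if (input.startsWith(">>", i)) {
                tokens.add(new Token(Type.DICT_CLOSE, ">>"));
                i += 2;
            } else if (c == '[') {
                tokens.add(new Token(Type.ARRAY_OPEN, "["));
                i++;
            } else if (c == ']') {
                tokens.add(new Token(Type.ARRAY_CLOSE, "]"));
                i++;
            } else if (c == '/') {
                int end = endOfRegularChars(input, i + 1);
                tokens.add(new Token(Type.NAME, input.substring(i + 1, end)));
                i = end;
            } else if (isDelimiter(c)) {
                // Strings, hex strings and comments are out of scope for this sketch.
                throw new IllegalArgumentException("Not handled in this sketch: " + c);
            } else {
                int end = endOfRegularChars(input, i);
                String text = input.substring(i, end);
                tokens.add(new Token(isNumber(text) ? Type.NUMBER : Type.KEYWORD, text));
                i = end;
            }
        }
        return tokens;
    }
}
```

An incremental parser could pull tokens like these one object at a time and 
drop them once a page is done, while a caching parser would keep the 
resulting objects around for tools that need random access.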

# handling of pdf versions
the current implementation is a mix of PDF 1.4 and some ad hoc additions without 
a clear distinction between what is and is not supported. We could add some 
support for explicitly handling versions in PDFBox, e.g. by marking certain 
methods and properties with the PDF version they require. This could in 
addition be a good basis for PDF/A and other compliance checks. 
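
One possible way to do the marking would be a Java annotation. The sketch 
below is only an assumption about how that might look: the annotation 
RequiresPdfVersion, the example class VersionSketchDocument and the helper 
methodsNewerThan are hypothetical names, none of them exist in PDFBox. The 
version numbers in the example reflect that transparency arrived with PDF 1.4 
and object streams with PDF 1.5.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical annotation: records the minimum PDF version a member relies on.
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.METHOD, ElementType.FIELD})
@interface RequiresPdfVersion {
    int major() default 1;
    int minor();
}

// Hypothetical stand-in for a document class with version-dependent features.
final class VersionSketchDocument {
    @RequiresPdfVersion(minor = 4)
    public void setTransparency() { }

    @RequiresPdfVersion(minor = 5)
    public void useObjectStreams() { }
}

final class VersionCheckSketch {
    // Lists the annotated methods that need a newer PDF version than major.minor.
    static List<String> methodsNewerThan(Class<?> type, int major, int minor) {
        List<String> result = new ArrayList<>();
        for (Method m : type.getDeclaredMethods()) {
            RequiresPdfVersion v = m.getAnnotation(RequiresPdfVersion.class);
            if (v == null) {
                continue; // unmarked methods are ignored by this sketch
            }
            boolean newer = v.major() > major || (v.major() == major && v.minor() > minor);
            if (newer) {
                result.add(m.getName());
            }
        }
        Collections.sort(result); // reflection order is unspecified
        return result;
    }
}
```

A compliance check that targets PDF 1.4 could then call 
VersionCheckSketch.methodsNewerThan(VersionSketchDocument.class, 1, 4) to find 
the features it has to reject.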

# handle large pdf files
apart from the parsing itself, PDFBox does not always handle large PDF files 
well, as some of the references are implemented as int instead of long.
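
To show why int is a problem here: a classic cross-reference table entry 
stores the byte offset as a 10-digit field, so it can hold values well beyond 
Integer.MAX_VALUE (2147483647, roughly 2 GB). The following sketch uses 
made-up names (LargeOffsetSketch and its methods are not PDFBox code) to show 
what happens when such an offset is squeezed into an int.

```java
// Hypothetical sketch: illustrates int vs. long for byte offsets in large files.
final class LargeOffsetSketch {
    // Reads the 10-digit offset field of an xref entry as a long.
    static long parseOffset(String tenDigitField) {
        return Long.parseLong(tenDigitField);
    }

    // Silent narrowing: values above Integer.MAX_VALUE wrap around.
    static int narrowUnsafe(long offset) {
        return (int) offset;
    }

    // Checked narrowing: throws ArithmeticException instead of wrapping.
    static int narrowChecked(long offset) {
        return Math.toIntExact(offset);
    }
}
```

For an offset of 3000000000 the unchecked cast yields -1294967296, which would 
send a seek to a nonsensical position, while the checked variant at least 
fails loudly. Keeping offsets as long throughout avoids both.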

# split PDFBox into modules to support use cases such as text extraction and 
merge with the minimum number of classes needed. More app-like tools such as 
the PDFDebugger or PDFReader could be additional modules.

With kind regards


Maruan Sahyoun
