On Tue, Mar 26, 2013 at 2:21 PM, Maruan Sahyoun <[email protected]> wrote:
> Hi there,
>
> here is a rough summary of some ideas I have for a potential pdfbox 2.0
> release. Maybe we could capture these as part of a wiki or jira ticket so we
> can add and agree on some of these if we want to. As soon as we have
> agreement we could have individual tickets for them.
>
> WDYT?
>
>
> # rearchitect PDF parsing into lexing, incremental (non caching) parser and
> caching parser
> o the lexer would be the low level component delivering tokens to the parser.
> A sample implementation exists as part of PDFBOX-1000. The benefit would be a
> clean low level handling of tokens. Although I proposed the lexer I'm not
> totally happy with the current implementation. That's something for another
> mail/ticket ...
> o the incremental (non caching) parser would allow for page by page
> processing, moving forward only, to support text extraction, merging,
> splitting … - the benefit would be lower memory consumption as well as
> potentially faster processing
> o the caching parser would support applications such as PDFDebugger or
> PDFReader
>
> # handling of pdf versions
> the current implementation is a mix of PDF 1.4 and some ad hoc additions
> without a clear distinction of what is and is not supported. We could add some
> support for explicitly handling versions in pdfbox, e.g. by marking certain
> methods and properties with their pdf version support level. This could in
> addition be a good basis for PDF/A and other compliance checks.
>
> # handle large pdf files
> in addition to the pdf parsing, pdfbox does not always handle large pdf files
> well, as some of the references are implemented as int instead of long
>
> # split pdfbox into modules to support use cases such as text extraction and
> merge with the minimum amount of classes needed. more app like tools such as
> the PDFDebugger or PDFReader could be additional modules.
>
> With kind regards
>
>
> Maruan Sahyoun
>
Hi Maruan,

I think some wiki pages would be good. This discussion has already started,
but only as mails on the list or perhaps as jira tickets that got lost in the
flow. There is an Apache wiki [1], but I found nothing on PDFBox there, so
this is a good occasion to start.

I do not have many more ideas. In my opinion, having separate modules for PDF
parsers, PDF makers and PDF viewers is an important one.

[1] http://wiki.apache.org/general/

Guillaume Bailleul
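
As a footnote to the "handle large pdf files" item quoted above, here is a
minimal Java sketch of why int references become a problem. The class and
method names (OffsetWidthDemo, asIntOffset, parseOffset) are hypothetical and
are not PDFBox API. The sketch only assumes that a byte offset is read as
decimal digits and then stored in either an int or a long.

```java
public class OffsetWidthDemo {

    // Hypothetical helper: mimics keeping a byte offset in an int field.
    // The narrowing cast silently drops the high 32 bits of the value.
    public static int asIntOffset(long offset) {
        return (int) offset;
    }

    // Hypothetical helper: parses a decimal byte offset into a long,
    // which can represent positions well beyond 2 GB.
    public static long parseOffset(String digits) {
        return Long.parseLong(digits.trim());
    }

    public static void main(String[] args) {
        // An object located roughly 3 GB into the file.
        long offset = parseOffset("3000000000");

        System.out.println("largest int offset: " + Integer.MAX_VALUE);
        System.out.println("offset as long:     " + offset);
        // The int version wraps around to a negative position.
        System.out.println("offset as int:      " + asIntOffset(offset));
    }
}
```

Any offset above Integer.MAX_VALUE (2147483647, just under 2 GB) cannot be
held in an int, so a parser that stores offsets this way would seek to a wrong
position in larger files. Using long throughout avoids the wrap-around.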
