On Tue, Mar 26, 2013 at 2:21 PM, Maruan Sahyoun <[email protected]> wrote:
> Hi there,
>
> here is a rough summary of some ideas I have for a potential pdfbox 2.0 
> release. Maybe we could capture these as part of a wiki or jira ticket so we 
> can add and agree on some of these if we want to. As soon as we have 
> agreement we could have individual tickets for them.
>
> WDYT?
>
>
> # rearchitect PDF parsing into a lexer, an incremental (non-caching) parser 
> and a caching parser
> o the lexer would be the low level component delivering tokens to the parser. 
> A sample implementation exists as part of PDFBOX-1000. The benefit would be a 
> clean low level handling of tokens. Although I proposed the lexer I'm not 
> totally happy with the current implementation. That's something for another 
> mail/ticket ...
> o the incremental (non-caching) parser would allow page-by-page, 
> forward-only processing to support text extraction, merging, splitting 
> … - the benefit would be lower memory consumption as well as potentially 
> faster processing
> o the caching parser would support applications such as PDFDebugger or 
> PDFReader
>
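
To make the lexer/parser split above concrete, here is a minimal sketch of a
forward-only lexer that hands tokens to whatever parser sits on top of it.
All names (SketchLexer, Kind, Token, tokenize) are hypothetical, and this is
not the PDFBOX-1000 implementation; it only covers dictionaries, arrays,
names, numbers and keywords, and rejects everything else (strings, comments,
streams).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not PDFBox code: a forward-only lexer for a small
// subset of PDF syntax (no strings, hex strings, comments or streams).
final class SketchLexer {
    enum Kind { NUMBER, NAME, KEYWORD, DICT_OPEN, DICT_CLOSE, ARRAY_OPEN, ARRAY_CLOSE }

    static final class Token {
        final Kind kind;
        final String text;
        Token(Kind kind, String text) { this.kind = kind; this.text = text; }
        @Override public String toString() { return kind + "(" + text + ")"; }
    }

    private final String input;
    private int pos = 0;

    SketchLexer(String input) { this.input = input; }

    /** Returns the next token, or null at end of input. Never moves backwards. */
    Token next() {
        while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) pos++;
        if (pos >= input.length()) return null;
        char c = input.charAt(pos);
        if (c == '<' && peek(1) == '<') { pos += 2; return new Token(Kind.DICT_OPEN, "<<"); }
        if (c == '>' && peek(1) == '>') { pos += 2; return new Token(Kind.DICT_CLOSE, ">>"); }
        if (c == '[') { pos++; return new Token(Kind.ARRAY_OPEN, "["); }
        if (c == ']') { pos++; return new Token(Kind.ARRAY_CLOSE, "]"); }
        int start = pos;
        if (c == '/') {
            // A name: '/' followed by regular characters; the slash is dropped.
            pos++;
            while (pos < input.length() && isRegular(input.charAt(pos))) pos++;
            return new Token(Kind.NAME, input.substring(start + 1, pos));
        }
        while (pos < input.length() && isRegular(input.charAt(pos))) pos++;
        if (pos == start) {
            // A delimiter this sketch does not handle, e.g. '(' or a single '<'.
            throw new IllegalArgumentException("unsupported character '" + c + "' at " + start);
        }
        String text = input.substring(start, pos);
        return new Token(isNumber(text) ? Kind.NUMBER : Kind.KEYWORD, text);
    }

    private char peek(int offset) {
        int i = pos + offset;
        return i < input.length() ? input.charAt(i) : '\0';
    }

    private static boolean isRegular(char c) {
        return !Character.isWhitespace(c) && "()<>[]{}/%".indexOf(c) < 0;
    }

    private static boolean isNumber(String s) {
        return s.matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)");
    }

    /** Convenience helper: drains the lexer into a list. */
    static List<Token> tokenize(String input) {
        SketchLexer lexer = new SketchLexer(input);
        List<Token> tokens = new ArrayList<>();
        for (Token t = lexer.next(); t != null; t = lexer.next()) tokens.add(t);
        return tokens;
    }
}
```

An incremental parser would call next() as it goes and keep nothing behind
it, while a caching parser could keep the objects it builds from the same
token stream.
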
> # handling of pdf versions
> the current implementation is a mix of PDF 1.4 and some ad hoc additions 
> without a clear distinction between what is and is not supported. We could 
> add some support for explicitly handling versions in pdfbox, e.g. by marking 
> certain methods and properties with the PDF version they require. This could 
> in addition be a good basis for PDF/A and other compliance checks.
>
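
One way the version marking above could look is a runtime annotation that a
compliance check reads via reflection. This is only a sketch: the annotation
@SincePdf, the class VersionSketch and its method names are all made up for
illustration and do not exist in PDFBox.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

// Hypothetical annotation: the PDF version that introduced a feature.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface SincePdf {
    String value();
}

// Hypothetical class with two annotated placeholder methods.
final class VersionSketch {
    @SincePdf("1.5") void readObjectStream() {}       // object streams arrived in PDF 1.5
    @SincePdf("1.4") void readTransparencyGroup() {}  // transparency arrived in PDF 1.4

    /** True if the named method is allowed for a document of the given version. */
    static boolean supportedIn(String methodName, String documentVersion) {
        try {
            Method m = VersionSketch.class.getDeclaredMethod(methodName);
            SincePdf since = m.getAnnotation(SincePdf.class);
            // Unmarked methods are treated as always supported.
            return since == null || compare(since.value(), documentVersion) <= 0;
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("unknown method: " + methodName, e);
        }
    }

    /** Compares two "major.minor" version strings numerically. */
    static int compare(String a, String b) {
        String[] x = a.split("\\.");
        String[] y = b.split("\\.");
        int byMajor = Integer.compare(Integer.parseInt(x[0]), Integer.parseInt(y[0]));
        return byMajor != 0 ? byMajor : Integer.compare(Integer.parseInt(x[1]), Integer.parseInt(y[1]));
    }
}
```

A PDF/A-1 style check, which is based on PDF 1.4, could then flag any use of
a method whose marker is newer than 1.4.
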
> # handle large pdf files
> in addition to the pdf parsing issues, pdfbox does not always handle large 
> pdf files well, as some of the references are implemented as int instead of 
> long
>
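
To illustrate the int vs. long point above: a cross-reference table entry
stores its byte offset as 10 decimal digits, so offsets beyond
Integer.MAX_VALUE (2147483647, about 2 GB) are legal in a file but do not fit
into an int. The sketch below uses a hypothetical class OffsetSketch; which
PDFBox fields are actually affected is not stated in the mail, so the xref
offset is only an assumed example.

```java
// Hypothetical sketch, not PDFBox code.
final class OffsetSketch {
    /** Parses the 10-digit byte offset at the start of an xref entry as a long. */
    static long parseOffset(String xrefEntry) {
        return Long.parseLong(xrefEntry.substring(0, 10));
    }

    /** Shows what a narrowing cast to int does to an offset above 2 GB. */
    static int truncatedToInt(long offset) {
        return (int) offset; // silently wraps around for values above Integer.MAX_VALUE
    }

    /** True if the offset can be stored in an int without losing information. */
    static boolean fitsInInt(long offset) {
        return offset >= Integer.MIN_VALUE && offset <= Integer.MAX_VALUE;
    }
}
```

With the entry "3000000000 00000 n", parseOffset returns 3000000000, while
the int cast yields -1294967296, which would send a seek to a nonsensical
position.
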
> # split pdfbox into modules to support use cases such as text extraction and 
> merge with the minimum number of classes needed. more app-like tools such as 
> the PDFDebugger or PDFReader could be additional modules.
>
> With kind regards
>
>
> Maruan Sahyoun
>

Hi Maruan,

I think some wiki pages would be good. This discussion has already
started, but only as mails on the list or maybe as jira tickets that
get lost in the flow.

There is an Apache wiki [1], but I found nothing on PDFBox there, so
this is a good occasion to start.

I do not have many more ideas. In my opinion, having separate
modules for PDF parsers, PDF makers and PDF viewers is an important
one.



[1] http://wiki.apache.org/general/

Guillaume Bailleul
