Re: [PDFBox 2.0] Ideas

Andreas Lehmkuehler Fri, 29 Mar 2013 04:28:16 -0700

Hi,

On 28.03.2013 21:04, Guillaume Bailleul wrote:

On Tue, Mar 26, 2013 at 2:21 PM, Maruan Sahyoun <[email protected]> wrote:

Hi there,


here is a rough summary of some ideas I have for a potential pdfbox 2.0 
release. Maybe we could capture these as part of a wiki or jira ticket so we 
can add and agree on some of these if we want to. As soon as we have agreement 
we could have individual tickets for them.

WDYT?


# rearchitect PDF parsing into a lexer, an incremental (non-caching) parser and 
a caching parser
o the lexer would be the low-level component delivering tokens to the parser. A 
sample implementation exists as part of PDFBOX-1000. The benefit would be 
clean low-level handling of tokens. Although I proposed the lexer, I'm not 
totally happy with the current implementation. That's something for another 
mail/ticket ...
o the incremental (non-caching) parser would allow page-by-page, forward-only 
processing to support text extraction, merging, splitting … - the benefit 
would be lower memory consumption as well as potentially faster processing
o the caching parser would support applications such as PDFDebugger or PDFReader
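
To make the lexer/parser split concrete, here is a minimal sketch of a 
forward-only lexer that hands tokens to whatever parser sits on top. It is 
not the PDFBOX-1000 implementation; the names LexerSketch, Token and Type are 
made up for illustration, and strings, hex strings, comments and streams are 
deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the PDFBOX-1000 lexer and not a PDFBox API.
class LexerSketch {
    enum Type { NUMBER, NAME, KEYWORD, DICT_OPEN, DICT_CLOSE, ARRAY_OPEN, ARRAY_CLOSE }

    static final class Token {
        final Type type;
        final String text;
        Token(Type type, String text) { this.type = type; this.text = text; }
        @Override public String toString() { return type + ":" + text; }
    }

    private final String in;
    private int pos = 0;

    LexerSketch(String in) { this.in = in; }

    private static boolean isWhitespace(char c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n' || c == '\f' || c == 0;
    }

    private static boolean isDelimiter(char c) {
        return "()<>[]{}/%".indexOf(c) >= 0;
    }

    // Reads a run of regular characters (anything up to whitespace or a delimiter).
    private String readRegular() {
        int start = pos;
        while (pos < in.length() && !isWhitespace(in.charAt(pos)) && !isDelimiter(in.charAt(pos))) {
            pos++;
        }
        return in.substring(start, pos);
    }

    /** Returns the next token, or null at end of input. The position only moves forward. */
    Token next() {
        while (pos < in.length() && isWhitespace(in.charAt(pos))) pos++;
        if (pos >= in.length()) return null;
        char c = in.charAt(pos);
        if (in.startsWith("<<", pos)) { pos += 2; return new Token(Type.DICT_OPEN, "<<"); }
        if (in.startsWith(">>", pos)) { pos += 2; return new Token(Type.DICT_CLOSE, ">>"); }
        if (c == '[') { pos++; return new Token(Type.ARRAY_OPEN, "["); }
        if (c == ']') { pos++; return new Token(Type.ARRAY_CLOSE, "]"); }
        if (c == '/') { pos++; return new Token(Type.NAME, readRegular()); }
        String word = readRegular();
        if (word.isEmpty()) {
            // Strings, hex strings and comments are out of scope for this sketch.
            throw new IllegalStateException("unsupported character '" + c + "' at " + pos);
        }
        boolean number = word.matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)");
        return new Token(number ? Type.NUMBER : Type.KEYWORD, word);
    }

    /** Convenience helper: drains the lexer into a list. */
    static List<Token> tokenize(String input) {
        LexerSketch lexer = new LexerSketch(input);
        List<Token> tokens = new ArrayList<>();
        for (Token t = lexer.next(); t != null; t = lexer.next()) tokens.add(t);
        return tokens;
    }
}
```

For example, LexerSketch.tokenize("1 0 obj") yields NUMBER:1, NUMBER:0, 
KEYWORD:obj. An incremental parser could call next() directly and discard 
tokens as it goes, while a caching parser could keep what it has built.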

# handling of pdf versions
the current implementation is a mix of PDF 1.4 and some ad hoc additions without 
a clear distinction between what is and is not supported. We could add some 
support for explicitly handling versions in pdfbox, e.g. by marking certain 
methods and properties with their PDF version support level. This could in 
addition be a good basis for PDF/A and other compliance checks.
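
One possible way to mark methods with a version, sketched here under 
assumptions: the @SincePdf annotation, the DocumentFeatures class and its 
method names are all hypothetical and do not exist in pdfbox. The helper 
lists the marked methods that need a newer PDF version than a given target, 
which is the kind of lookup a compliance check could build on.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; none of these names are part of pdfbox.
class VersionSketch {
    /** Marks the PDF version in which the feature behind a method was introduced. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface SincePdf { String value(); }

    /** Illustrative class only: transparency arrived in PDF 1.4, object streams in PDF 1.5. */
    static class DocumentFeatures {
        @SincePdf("1.4") public void setTransparency() {}
        @SincePdf("1.5") public void useObjectStreams() {}
    }

    // Compares "major.minor" version strings numerically.
    static int compare(String a, String b) {
        String[] x = a.split("\\.");
        String[] y = b.split("\\.");
        int major = Integer.compare(Integer.parseInt(x[0]), Integer.parseInt(y[0]));
        return major != 0 ? major : Integer.compare(Integer.parseInt(x[1]), Integer.parseInt(y[1]));
    }

    /** Returns the sorted names of marked methods that require a version above targetVersion. */
    static List<String> featuresNewerThan(Class<?> type, String targetVersion) {
        List<String> names = new ArrayList<>();
        for (Method m : type.getDeclaredMethods()) {
            SincePdf since = m.getAnnotation(SincePdf.class);
            if (since != null && compare(since.value(), targetVersion) > 0) {
                names.add(m.getName());
            }
        }
        Collections.sort(names); // getDeclaredMethods() gives no ordering guarantee
        return names;
    }
}
```

With a target of "1.4", featuresNewerThan(DocumentFeatures.class, "1.4") 
reports only useObjectStreams.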

# handle large pdf files
apart from the parsing itself, pdfbox does not always handle large pdf files 
well, as some of the references are implemented as int instead of long
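
To illustrate why int is a problem here: a classic cross-reference entry 
stores the byte offset as 10 decimal digits, so it can exceed 
Integer.MAX_VALUE (2147483647) once a file grows past 2 GB. The sketch below 
is not pdfbox code; OffsetSketch and its methods are hypothetical names.

```java
// Hypothetical sketch; not taken from pdfbox.
class OffsetSketch {
    /** Parses the 10-digit byte offset of a classic xref entry such as "3221225472 00000 n". */
    static long parseOffset(String xrefEntry) {
        return Long.parseLong(xrefEntry.substring(0, 10));
    }

    /** Plain narrowing cast: silently wraps around for offsets above Integer.MAX_VALUE. */
    static int narrow(long offset) {
        return (int) offset;
    }

    /** Checked narrowing: throws ArithmeticException instead of wrapping. */
    static int narrowChecked(long offset) {
        return Math.toIntExact(offset);
    }
}
```

An object at the 3 GB mark has offset 3221225472; narrow() turns that into 
-1073741824, whereas keeping the value as long preserves it.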

# split pdfbox into modules to support use cases such as text extraction and 
merge with the minimum number of classes needed. More app-like tools such as 
the PDFDebugger or PDFReader could be additional modules.

With kind regards


Maruan Sahyoun


Hi Maruan,

I think some wiki pages would be good. This discussion has already
started, but only as mails on the list or maybe jira tickets that got lost in
the flow.

There is an Apache wiki [1], but I found nothing on PDFBox there, so this
is a good occasion to start.

Once we have migrated our site to the Apache CMS we'll have some sort of wiki,
so that we IMHO don't have to ask for another one.

I do not have many more ideas. In my opinion, having different
modules for PDF parsers, PDF makers and PDF viewers is an important
one.

This is one of my favourites, too. Let's see what comes up. After all, we need
not only people who are interested in some features but also people who are
interested in implementing them ;-)

[1] http://wiki.apache.org/general/

Guillaume Bailleul


BR
Andreas Lehmkühler
