Hi,
On 28.03.2013 21:04, Guillaume Bailleul wrote:
On Tue, Mar 26, 2013 at 2:21 PM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
Hi there,
here is a rough summary of some ideas I have for a potential pdfbox 2.0
release. Maybe we could capture these as part of a wiki or jira ticket so we
can add and agree on some of these if we want to. As soon as we have agreement
we could have individual tickets for them.
WDYT?
# rearchitect PDF parsing into lexing, incremental (non caching) parser and
caching parser
o the lexer would be the low level component delivering tokens to the parser. A
sample implementation exists as part of PDFBOX-1000. The benefit would be a
clean low level handling of tokens. Although I proposed the lexer I'm not
totally happy with the current implementation. That's something for another
mail/ticket ...
o the incremental (non caching) parser would allow for page by page processing
moving forward only to support text extraction, merging, splitting … - the
benefit would be lower memory consumption as well as potentially faster
processing
o the caching parser would support applications such as PDFDebugger or PDFReader
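To make the lexer idea a bit more concrete, here is a minimal sketch of a
component that only delivers tokens and knows nothing about objects or pages.
All names (LexerSketch, Token, TokenType, tokenize) are hypothetical and are not
taken from PDFBox or from the PDFBOX-1000 patch, and the token rules are heavily
simplified.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: these names are illustrative and are not taken from
// PDFBox or from the PDFBOX-1000 patch.
public class LexerSketch {

    enum TokenType { NUMBER, NAME, KEYWORD }

    static final class Token {
        final TokenType type;
        final String text;

        Token(TokenType type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    // Splits a simplified PDF fragment into tokens. Strings, comments,
    // arrays and dictionaries are deliberately left out to keep this short.
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char first = input.charAt(i);
            if (Character.isWhitespace(first)) {
                i++;
                continue;
            }
            int start = i++;
            // A token ends at whitespace or at the '/' that starts a new name.
            while (i < input.length()
                    && !Character.isWhitespace(input.charAt(i))
                    && input.charAt(i) != '/') {
                i++;
            }
            tokens.add(new Token(classify(first), input.substring(start, i)));
        }
        return tokens;
    }

    // Classifies a token by its first character only (an assumption that
    // is good enough for this sketch, not for real PDF syntax).
    private static TokenType classify(char first) {
        if (first == '/') {
            return TokenType.NAME;
        }
        if (Character.isDigit(first) || first == '-' || first == '+' || first == '.') {
            return TokenType.NUMBER;
        }
        return TokenType.KEYWORD;
    }
}
```

A parser on top of this would consume the token list (or, better, a token
stream) and never touch raw bytes itself, which is the separation described
above.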
# handling of pdf versions
the current implementation is a mix of PDF 1.4 and some ad hoc additions without
a clear distinction of what is and is not supported. We could add some support
for explicitly handling versions in pdfbox, e.g. by marking certain methods and
properties with their pdf version support level. This could in addition be a
good basis for PDF/A and other compliance checks.
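One possible way to mark methods with a version would be a runtime annotation.
The sketch below is only an illustration of that idea: the annotation SincePdf,
the class PageSketch and its methods are all made up for this mail and do not
exist in pdfbox. The version numbers assume that transparency came with PDF 1.4
and optional content with PDF 1.5.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

// Hypothetical sketch: neither the annotation nor the class exists in PDFBox.
public class PdfVersionSketch {

    // Records the first PDF version in which a feature is available.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface SincePdf {
        String value();
    }

    static class PageSketch {
        @SincePdf("1.4")
        public void setTransparencyGroup() { }

        @SincePdf("1.5")
        public void setOptionalContent() { }

        // Unmarked methods are treated as available in every version.
        public void setMediaBox() { }
    }

    // Returns true if the named method may be used for the target version.
    static boolean isSupported(Class<?> type, String methodName, String targetVersion) {
        for (Method m : type.getDeclaredMethods()) {
            if (m.getName().equals(methodName)) {
                SincePdf since = m.getAnnotation(SincePdf.class);
                return since == null || toNumber(since.value()) <= toNumber(targetVersion);
            }
        }
        throw new IllegalArgumentException("No such method: " + methodName);
    }

    // Converts "major.minor" into a comparable number, e.g. "1.4" to 104.
    private static int toNumber(String version) {
        String[] parts = version.split("\\.");
        return Integer.parseInt(parts[0]) * 100 + Integer.parseInt(parts[1]);
    }
}
```

A compliance check could then walk the marked methods and report everything
that is newer than the version the document claims to be.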
# handle large pdf files
in addition to the pdf parsing, pdfbox does not always handle large pdf files
well, as some of the references are implemented as int instead of long
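To illustrate why int is too small: the byte offset field of a cross-reference
table entry has 10 digits, so it can hold values far beyond Integer.MAX_VALUE
(2147483647). The class and method names below are hypothetical and only show
the difference between the two types.

```java
// Hypothetical sketch: OffsetSketch is not part of PDFBox.
public class OffsetSketch {

    // Parses a byte offset as long, so offsets beyond 2 GB stay intact.
    static long parseOffset(String field) {
        return Long.parseLong(field.trim());
    }

    // Tells whether an offset would still fit into an int.
    static boolean fitsInInt(long offset) {
        return offset >= 0 && offset <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        long offset = parseOffset("3000000000");
        System.out.println(fitsInInt(offset)); // false
        // Narrowing to int silently wraps around to a negative value.
        System.out.println((int) offset);      // -1294967296
    }
}
```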
# split pdfbox into modules to support use cases such as text extraction and
merge with the minimum amount of classes needed. more app like tools such as
the PDFDebugger or PDFReader could be additional modules.
With kind regards
Maruan Sahyoun
Hi Maruan,
I think some wiki pages would be good. This discussion has already
started, but only as mails on the list or maybe as jira tickets that
get lost in the flow.
There is an apache wiki [1], but I found nothing on PDFBox there, so
this is a good occasion to start.
Once we have migrated our site to the Apache CMS we'll have some sort of wiki,
so IMHO we don't have to ask for another one.
I do not have many more ideas. In my opinion, having different
modules for PDF parsers, PDF makers and PDF viewers is an important
one.
This is one of my favourites, too. Let's see what'll come up. After all, we
don't only need people who are interested in some features but also people who
are interested in implementing them ;-)
[1] http://wiki.apache.org/general/
Guillaume Bailleul
BR
Andreas Lehmkühler