Re: [PDFBox 2.0] Ideas

Maruan Sahyoun Fri, 29 Mar 2013 04:54:31 -0700

Hi,

On 29.03.2013 at 12:27, Andreas Lehmkuehler <[email protected]> wrote:


> Hi,
> 
> On 28.03.2013 at 21:04, Guillaume Bailleul wrote:
>> On Tue, Mar 26, 2013 at 2:21 PM, Maruan Sahyoun <[email protected]> 
>> wrote:
>>> Hi there,
>>> 
>>> here is a rough summary of some ideas I have for a potential pdfbox 2.0 
>>> release. Maybe we could capture these as part of a wiki or jira ticket so 
>>> we can add and agree on some of these if we want to. As soon as we have 
>>> agreement we could have individual tickets for them.
>>> 
>>> WDYT?
>>> 
>>> 
>>> # rearchitect PDF parsing into lexing, incremental (non caching) parser and 
>>> caching parser
>>> o the lexer would be the low level component delivering tokens to the 
>>> parser. A sample implementation exists as part of PDFBOX-1000. The benefit 
>>> would be a clean low level handling of tokens. Although I proposed the 
>>> lexer I'm not totally happy with the current implementation. That's 
>>> something for another mail/ticket ...
>>> o the incremental (non caching) parser would allow forward-only, page by 
>>> page processing to support text extraction, merging, splitting … - the 
>>> benefit would be lower memory consumption as well as potentially faster 
>>> processing
>>> o the caching parser would support applications such as PDFDebugger or 
>>> PDFReader
>>> 
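A rough illustration of the lexer idea above, not the PDFBOX-1000 code: a minimal Java sketch in which a lexer turns raw text into typed tokens that a parser could consume. TinyLexerSketch, Kind, Token and lex are all hypothetical names, and the sketch splits on whitespace only, so it ignores the delimiters, strings, comments and streams that a real PDF lexer has to handle.

```java
import java.util.ArrayList;
import java.util.List;

public class TinyLexerSketch {
    // Hypothetical token kinds; a real lexer would need many more.
    enum Kind { NUMBER, NAME, KEYWORD }

    static final class Token {
        final Kind kind;
        final String text;
        Token(Kind kind, String text) { this.kind = kind; this.text = text; }
    }

    // Splits on whitespace only and classifies each word.
    static List<Token> lex(String input) {
        List<Token> tokens = new ArrayList<>();
        for (String word : input.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // empty input yields one empty word
            }
            Kind kind;
            if (word.startsWith("/")) {
                kind = Kind.NAME;
            } else if (word.matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)")) {
                kind = Kind.NUMBER;
            } else {
                kind = Kind.KEYWORD;
            }
            tokens.add(new Token(kind, word));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The parser side would consume these tokens in order.
        for (Token t : lex("12 0 obj /Type endobj")) {
            System.out.println(t.kind + " " + t.text);
        }
    }
}
```

A forward-only parser would more likely pull tokens one at a time from a stream than from a list; the list is used here only to keep the sketch short.
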
>>> # handling of pdf versions
>>> the current implementation is a mix of PDF 1.4 and some ad hoc additions 
>>> without a clear distinction what is and is not supported. We could add some 
>>> support for explicitly handling versions in pdfbox, e.g. by marking certain 
>>> methods and properties with the pdf version support level. This could in 
>>> addition be a good basis for PDF/A and other compliance checks.
>>> 
>>> # handle large pdf files
>>> in addition to the pdf parsing issues, pdfbox does not always handle large 
>>> pdf files well, as some of the references are implemented as int instead 
>>> of long
>>> 
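To illustrate the int versus long point above: a minimal Java sketch, assuming a hypothetical helper parseOffset that reads a decimal byte offset such as the one in a cross-reference entry. It is not PDFBox code; it only shows what happens when an offset beyond 2 GiB is squeezed into an int.

```java
public class OffsetWidthSketch {
    // Hypothetical helper: parse a decimal byte offset into a long.
    static long parseOffset(String digits) {
        return Long.parseLong(digits.trim());
    }

    public static void main(String[] args) {
        // 3,000,000,000 bytes, which is larger than Integer.MAX_VALUE.
        long offset = parseOffset("3000000000");

        // A narrowing cast silently wraps around instead of failing.
        int truncated = (int) offset;

        System.out.println(offset > Integer.MAX_VALUE); // prints true
        System.out.println(truncated < 0);              // prints true
    }
}
```

Keeping offsets as long from the point where they are parsed avoids this wrap-around; whether that is enough on its own for large files is a separate question.
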
>>> # split pdfbox into modules to support use cases such as text extraction 
>>> and merge with the minimum amount of classes needed. more app-like tools 
>>> such as the PDFDebugger or PDFReader could be additional modules.
>>> 
>>> With kind regards
>>> 
>>> 
>>> Maruan Sahyoun
>>> 
>> 
>> Hi Maruan,
>> 
>> I think some wiki pages would be good. This discussion has already
>> started, but only as mails on the list or maybe jira tickets lost in the
>> flow.
>> 
>> There is an apache wiki [1], but I found nothing on PDFBox, so this is a
>> good occasion to start.
> Once we have migrated our site to the Apache CMS we'll have some sort of
> wiki, so IMHO we don't have to ask for another one.
> 
>> I do not have many more ideas. In my opinion, having different
>> modules for PDF parsers, PDF makers and PDF viewers is an important
>> one.
> This is one of my favourites, too. Let's see what'll come up. At least we
> need people who are interested not only in some features but also in
> implementing them ;-)

We might be able to split into modules based on the current code and 
rearchitect the individual parts later. E.g. the command line tools could 
easily be separated, as well as PDFDebugger and PDFReader. One thing to 
consider is how we handle releases afterwards. Will we always release all 
modules as part of a release (like Apache Camel does) or release them 
separately (as Apache Sling does)?

I'm happy to help with implementation/rearrangement as soon as the transition 
to the CMS is done.

> 
> 
>> [1] http://wiki.apache.org/general/
>> 
>> Guillaume Bailleul
> 
> BR
> Andreas Lehmkühler
>
