[jira] [Updated] (PDFBOX-1256) [PATCH] Split PDFStreamEngine, moving functionality to simpler stream processor base class

John Hewson (JIRA) Sun, 09 Feb 2014 17:06:06 -0800

     [ 
https://issues.apache.org/jira/browse/PDFBOX-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


John Hewson updated PDFBOX-1256:
--------------------------------

    Component/s:     (was: Text extraction)
                     (was: Parsing)
                 PDModel

> [PATCH] Split PDFStreamEngine, moving functionality to simpler stream 
> processor base class
> ------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1256
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1256
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: PDModel
>    Affects Versions: 1.7.0, 2.0.0
>         Environment: N/A
>            Reporter: Craig Ringer
>            Priority: Minor
>              Labels: api, refactoring, streams
>         Attachments: 
> 0002-New-PDFStreamProcessor-base-of-PDFStreamEngine-adds-.patch
>
>
> The attached patch restructures PDFStreamEngine to move the basic 
> functionality of invoking callbacks for each operator in a stream into a 
> parent class. The parent class knows nothing about the meaning of operators, 
> it just invokes handlers with accumulated arguments whenever it encounters an 
> operator. PDFStreamEngine retains all the "knowledge" of what those operators 
> mean, the state of the graphics state stack, etc.
> The purpose of the change is to make it simpler and easier to use PDFBox's 
> PDF stream processor/parser code without dealing with the full features of 
> PDFStreamEngine with its built-in operator handlers, awareness of the 
> graphics stack, etc when that functionality isn't required. Specifically, I 
> needed to write a tool that copies a PDF stream, renaming resource references 
> as it goes but otherwise leaving it unchanged. I wanted to handle all 
> operators including future or unknown ones, and only needed to special-case a 
> couple of them. PDFStreamEngine was poorly suited to that because it doesn't 
> support a default handler fallback, tries to "understand" the stream, etc. 
> Rather than write a new class that duplicated much of PDFStreamEngine I 
> thought I'd try to factor the required functionality out, so others could use 
> it too.
> The changes should be backward compatible with existing code that uses 
> PDFStreamEngine. No changes in any PDFStreamEngine clients in PDFBox were 
> required for the test suite to pass, text extraction tool to work, etc. 
> Nonetheless, it's possible you'll only consider these changes for inclusion 
> in PDFBox 2.0, in which case they can be cleaned up to remove some of the 
> backward compatibility crap that's currently in them. Let me know.
> In terms of open issues or TODOs, the class naming could probably use work. I 
> can't rename PDFStreamEngine or OperatorProcessor for backward compatibility 
> reasons, so I've had to come up with more contrived names than I'd like.
> The logic of the changes is:
> - Move content stream argument accumulation and operator callback 
> functionality into new PDFStreamProcessor class
> - Add support for a default (fallback) handler to PDFStreamProcessor so 
> operators not explicitly matched may be handled
> - Modify PDFStreamEngine to extend PDFStreamProcessor, retaining all its 
> existing methods though some are now inherited.
> - Deprecate the properties-map based configuration of PDFStreamEngine because 
> it'll be fragile whenever more than one classloader is in use. Add 
> PDFStreamProcessor.replaceOperatorProcessors(...) for equivalent 
> functionality using a type-safe, multi-classloader-safe HashMap of operator 
> names to handler instances. This isn't added as a ctor override because 
> operator handler registration/unregistration methods are not final (to 
> preserve compatibility with PDFStreamEngine) and if overridden, they might 
> use data from a not-yet-initialized derived class. If a ctor override is 
> required then registerOperatorProcessor must be made final, breaking BC with 
> PDFStreamEngine.
> - Deprecate OperatorProcessor (the PDFStreamEngine operator handler class). 
> Instances of this are bound to a particular PDFStreamEngine via the `context' 
> property and they carry state when they don't have to. They're also an 
> abstract class, so handlers can't extend any other class. OperatorProcessor 
> based handlers continue to be supported just fine via a simple wrapper that's 
> used automatically where required.
> - Introduce new PDFStreamProcessor.OperatorHandler interface to replace 
> OperatorProcessor . It's a simple one-method interface that passes the 
> PDFStreamProcessor as an argument, so application designers are free to 
> choose whether to tie their OperationProcessorHandler implementations to 
> PDFStreamProcessor instances or whether they want to re-use the same handler 
> on many different processors. This change is useful for my app and removes 
> unnecessary stateful API, but isn't strictly necessary and can be dropped 
> while retaining the PDFStreamEngine / PDFStreamProcessor split. As part of 
> the API change, new-interface handlers are passed the original arguments 
> array rather than a copy; if they want a copy of the arguments array they 
> have to take it themselves, so that resources aren't wasted copying the array 
> when handlers don't actually need it copied.
> - Add compatibility code to PDFStreamEngine to ensure that OperatorProcessor 
> implementations are wrapped in a helper that translates 
> OperatorProcessorHandler interface usage to the usage required by 
> OperatorProcessor. All the wrapper does is set the context (which 
> PDFStreamEngine seems to do before every handler call) then pass a copy of 
> the arguments array.
> I'm aware that this is a non-trivial change I'm proposing, but I think it 
> significantly improves the API (especially once the BC stuff can be removed 
> for PDFBox 2.0) and makes it easier to use this functionality.
> Prior patch in series (should be independent of this one): 
> https://issues.apache.org/jira/browse/PDFBOX-1255
> Next patch in series: https://issues.apache.org/jira/browse/PDFBOX-1263



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Updated] (PDFBOX-1256) [PATCH] Split PDFStreamEngine, moving functionality to simpler stream processor base class

Reply via email to