[
https://issues.apache.org/jira/browse/PDFBOX-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
John Hewson updated PDFBOX-1256:
--------------------------------
Component/s: (was: Text extraction)
(was: Parsing)
PDModel
> [PATCH] Split PDFStreamEngine, moving functionality to simpler stream
> processor base class
> ------------------------------------------------------------------------------------------
>
> Key: PDFBOX-1256
> URL: https://issues.apache.org/jira/browse/PDFBOX-1256
> Project: PDFBox
> Issue Type: Improvement
> Components: PDModel
> Affects Versions: 1.7.0, 2.0.0
> Environment: N/A
> Reporter: Craig Ringer
> Priority: Minor
> Labels: api, refactoring, streams
> Attachments:
> 0002-New-PDFStreamProcessor-base-of-PDFStreamEngine-adds-.patch
>
>
> The attached patch restructures PDFStreamEngine to move the basic
> functionality of invoking callbacks for each operator in a stream into a
> parent class. The parent class knows nothing about the meaning of operators;
> it just invokes handlers with the accumulated arguments whenever it encounters
> an operator. PDFStreamEngine retains all the "knowledge" of what those
> operators mean, the state of the graphics state stack, etc.
>
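A minimal sketch of the operator-agnostic dispatch described above, in plain Java with no PDFBox dependency. The names `SimpleStreamProcessor` and `OperatorHandler`, and the string-based token model, are assumptions made for illustration; they are not the classes from the attached patch, and real content streams are parsed into typed objects rather than plain strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical handler interface for this sketch (not the PDFBox API).
interface OperatorHandler {
    void handle(String operator, List<String> operands);
}

// Collects operands and calls a handler whenever an operator token appears.
// It attaches no meaning to any operator.
class SimpleStreamProcessor {
    private final Map<String, OperatorHandler> handlers = new HashMap<>();
    // Fallback for operators with no registered handler; does nothing by default.
    private OperatorHandler defaultHandler = (operator, operands) -> { };

    void register(String operator, OperatorHandler handler) {
        handlers.put(operator, handler);
    }

    void setDefaultHandler(OperatorHandler handler) {
        defaultHandler = handler;
    }

    void process(List<String> tokens) {
        List<String> operands = new ArrayList<>();
        for (String token : tokens) {
            if (isOperand(token)) {
                operands.add(token);
            } else {
                // Operator reached: hand over the accumulated operands, then start afresh.
                handlers.getOrDefault(token, defaultHandler).handle(token, operands);
                operands = new ArrayList<>();
            }
        }
    }

    // Simplifying assumption: tokens are non-empty, and names, strings, arrays
    // and numbers are operands; everything else is treated as an operator.
    static boolean isOperand(String token) {
        char c = token.charAt(0);
        return c == '/' || c == '(' || c == '[' || c == '+' || c == '-'
                || c == '.' || Character.isDigit(c);
    }
}
```

A subclass playing the role of PDFStreamEngine would register handlers that understand specific operators, while a simpler tool could register only a default handler.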
> The purpose of the change is to make it simpler and easier to use PDFBox's
> PDF stream processor/parser code without dealing with the full features of
> PDFStreamEngine with its built-in operator handlers, awareness of the
> graphics stack, etc. when that functionality isn't required. Specifically, I
> needed to write a tool that copies a PDF stream, renaming resource references
> as it goes but otherwise leaving it unchanged. I wanted to handle all
> operators including future or unknown ones, and only needed to special-case a
> couple of them. PDFStreamEngine was poorly suited to that because it doesn't
> support a default handler fallback and tries to "understand" the stream.
> Rather than write a new class that duplicated much of PDFStreamEngine, I
> thought I'd try to factor the required functionality out so others could use
> it too.
>
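The copy-and-rename use case described above might look roughly like the sketch below. It is self-contained plain Java, not the reporter's tool: `ResourceRenamer`, the pre-tokenized string input, and the choice of `Tf`, `Do` and `gs` as the special-cased operators are all assumptions for illustration. Every other operator, including unknown ones, takes the default path and is copied unchanged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ResourceRenamer {
    // Operators whose name operand refers to a resource:
    // Tf (font), Do (XObject), gs (extended graphics state).
    private static final Set<String> RESOURCE_OPERATORS =
            new HashSet<>(Arrays.asList("Tf", "Do", "gs"));

    // Copies the token stream, renaming resource names per the given map.
    static String rename(List<String> tokens, Map<String, String> renames) {
        StringBuilder out = new StringBuilder();
        List<String> operands = new ArrayList<>();
        for (String token : tokens) {
            if (isOperand(token)) {
                operands.add(token);
                continue;
            }
            boolean special = RESOURCE_OPERATORS.contains(token);
            for (String operand : operands) {
                if (special && operand.startsWith("/")) {
                    String name = operand.substring(1);
                    out.append('/').append(renames.getOrDefault(name, name));
                } else {
                    out.append(operand); // default path: copy as-is
                }
                out.append(' ');
            }
            out.append(token).append('\n');
            operands.clear();
        }
        return out.toString();
    }

    // Simplifying assumption: tokens are non-empty, and names, strings, arrays
    // and numbers are operands; everything else is treated as an operator.
    static boolean isOperand(String token) {
        char c = token.charAt(0);
        return c == '/' || c == '(' || c == '[' || c == '+' || c == '-'
                || c == '.' || Character.isDigit(c);
    }
}
```

Because the fallback branch never inspects the operator, an operator the code has never heard of is still written out with its operands intact.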
> The changes should be backward compatible with existing code that uses
> PDFStreamEngine. No changes in any PDFStreamEngine clients in PDFBox were
> required for the test suite to pass, text extraction tool to work, etc.
> Nonetheless, it's possible you'll only consider these changes for inclusion
> in PDFBox 2.0, in which case they can be cleaned up to remove some of the
> backward compatibility crap that's currently in them. Let me know.
>
> In terms of open issues or TODOs, the class naming could probably use work. I
> can't rename PDFStreamEngine or OperatorProcessor for backward compatibility
> reasons, so I've had to come up with more contrived names than I'd like.
>
> The logic of the changes is:
> - Move content stream argument accumulation and operator callback
> functionality into new PDFStreamProcessor class
> - Add support for a default (fallback) handler to PDFStreamProcessor so
> operators not explicitly matched may be handled
> - Modify PDFStreamEngine to extend PDFStreamProcessor, retaining all its
> existing methods though some are now inherited.
> - Deprecate the properties-map based configuration of PDFStreamEngine because
> it'll be fragile whenever more than one classloader is in use. Add
> PDFStreamProcessor.replaceOperatorProcessors(...) for equivalent
> functionality using a type-safe, multi-classloader-safe HashMap of operator
> names to handler instances. This isn't added as a constructor overload because
> operator handler registration/unregistration methods are not final (to
> preserve compatibility with PDFStreamEngine) and, if overridden, they might
> use data from a not-yet-initialized derived class. If a constructor overload
> is required then registerOperatorProcessor must be made final, breaking BC
> with PDFStreamEngine.
> - Deprecate OperatorProcessor (the PDFStreamEngine operator handler class).
> Instances of this are bound to a particular PDFStreamEngine via the `context'
> property and they carry state when they don't have to. OperatorProcessor is
> also an abstract class, so handlers can't extend any other class.
> OperatorProcessor-based handlers continue to be supported just fine via a
> simple wrapper that's used automatically where required.
> - Introduce new PDFStreamProcessor.OperatorHandler interface to replace
> OperatorProcessor. It's a simple one-method interface that passes the
> PDFStreamProcessor as an argument, so application designers are free to
> choose whether to tie their OperatorHandler implementations to
> PDFStreamProcessor instances or whether they want to re-use the same handler
> on many different processors. This change is useful for my app and removes
> unnecessary stateful API, but isn't strictly necessary and can be dropped
> while retaining the PDFStreamEngine / PDFStreamProcessor split. As part of
> the API change, new-interface handlers are passed the original arguments
> array rather than a copy; if they want a copy of the arguments array they
> have to take it themselves, so that resources aren't wasted copying the array
> when handlers don't actually need it copied.
> - Add compatibility code to PDFStreamEngine to ensure that OperatorProcessor
> implementations are wrapped in a helper that translates OperatorHandler
> interface usage to the usage required by OperatorProcessor. All the wrapper
> does is set the context (which PDFStreamEngine seems to do before every
> handler call) and then pass a copy of the arguments array.
>
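The handler-related items in the list above can be sketched as follows. This is plain Java with stand-in types, not the code from the patch: only the names `OperatorHandler`, `replaceOperatorProcessors` and the `context` property come from the description, while `StreamProcessor`, `LegacyOperatorProcessor`, `LegacyAdapter` and all signatures are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for the proposed one-method interface. The processor is passed in,
// so a single handler instance can serve many processors.
interface OperatorHandler {
    void process(StreamProcessor processor, String operator, List<Object> arguments);
}

// Stand-in for the legacy abstract handler, bound to one engine via 'context'.
abstract class LegacyOperatorProcessor {
    protected StreamProcessor context;

    void setContext(StreamProcessor context) {
        this.context = context;
    }

    abstract void process(String operator, List<Object> arguments);
}

// Wrapper: sets the context, then passes a copy of the arguments, so legacy
// handlers keep the behaviour they had before.
final class LegacyAdapter implements OperatorHandler {
    private final LegacyOperatorProcessor legacy;

    LegacyAdapter(LegacyOperatorProcessor legacy) {
        this.legacy = legacy;
    }

    @Override
    public void process(StreamProcessor processor, String operator, List<Object> arguments) {
        legacy.setContext(processor);
        legacy.process(operator, new ArrayList<>(arguments));
    }
}

class StreamProcessor {
    private final Map<String, OperatorHandler> handlers = new HashMap<>();
    private OperatorHandler defaultHandler; // may be null: unmatched operators are skipped

    // Typed map of operator names to handler instances; no class names are
    // looked up reflectively, so no classloader is involved.
    void replaceOperatorProcessors(Map<String, OperatorHandler> newHandlers) {
        handlers.clear();
        handlers.putAll(newHandlers);
    }

    void setDefaultHandler(OperatorHandler handler) {
        defaultHandler = handler;
    }

    // New-interface handlers receive the original arguments list, not a copy.
    void dispatch(String operator, List<Object> arguments) {
        OperatorHandler handler = handlers.getOrDefault(operator, defaultHandler);
        if (handler != null) {
            handler.process(this, operator, arguments);
        }
    }
}
```

In this sketch the copy is made only inside the adapter, so handlers written against the new interface pay nothing for it, which is the trade-off the description argues for.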
> I'm aware that this is a non-trivial change I'm proposing, but I think it
> significantly improves the API (especially once the BC stuff can be removed
> for PDFBox 2.0) and makes it easier to use this functionality.
>
> Prior patch in series (should be independent of this one):
> https://issues.apache.org/jira/browse/PDFBOX-1255
> Next patch in series: https://issues.apache.org/jira/browse/PDFBOX-1263