Re: [DISCUSS] PDFBox and Exception handling

John Hewson Thu, 13 Feb 2014 12:59:05 -0800

Maruan

I’ve given this some considerable thought and realised that while it may 
technically work, it won’t be maintainable.

>  the user is making use of PDFBox because they don’t want to have to parse 
> the PDF file themselves. 
>> 
>> But in case a file is invalid for whatever reason the only options are to 
>> either wait until we include a workaround or put it in themselves. And if 
>> you don’t want to write your handler you are not enforced to do so. 

It seems to me that any user who knows enough about the internals of the PDFBox 
parser to write an error handler should know how to make those changes in 
PDFBox and submit a patch. If their patch is good and comes with an example PDF 
then it will be in the latest snapshot within a day or so. Users who don’t know 
what they’re doing shouldn’t be writing error handlers, and users who know what 
they’re doing don’t need to.

>> From a technical standpoint, exposing the internal parser context to the 
>> user seems particularly problematic: the internal implementation details 
>> which are part of the context now become part of PDFBox’s public API which 
>> needs to be kept stable between major releases. How is the user to resolve a 
>> non-trivial exception and allow parsing to continue in a manner which leaves 
>> the internals of the parser in a consistent state? If we don’t know how 
>> users are resolving exceptions out in the real world, how can we be sure 
>> that changes we make to the parser later won’t break their code?
> 
> One can only assume that a documented API is stable. As long as this is the 
> case why should it break their code. Of course if a different file is causing 
> a similar exception which will be dealt with by the exception handler and the 
> code is not able to deal with it ...

You’ve missed the major point here: any “context” would necessarily expose the 
*internal* implementation details of the parser, which breaks the principle of 
encapsulation and will make PDFBox harder to maintain without introducing 
breaking changes. This is because every internal detail of a context becomes a 
fixed part of the public API, as does when and where it is called, i.e. the 
*behaviour* is part of the public API, not just the type signature. Changes to 
when and where error contexts are used and exactly how, will be part of the 
public behaviour and thus must be kept stable. The end result is that it will 
be incredibly hard to make internal changes to the parser without breaking 
major version compatibility. 

Imagine a scenario where a user has a number of PDFs some of which are known to 
contain errors which require manual review of the files after they are 
processed. The user discovers that all such erroneous PDFs trigger one of the 
parser's error handlers for invalid date parsing, and so they write a handler 
callback which correctly parses the date and flags the PDF in their system as 
being in need of manual review. Imagine that at some later point, a helpful 
person comes along and fixes the underling date parsing issue in the core 
PDFBox parser, looks good, right? Boom, that was a breaking change because it 
altered the public behaviour of the parser. The invalid date handler in our 
user’s code will no longer be called because it is now handled internally by 
PDFBox, so the PDF won’t be flagged for review, their customers receive broken 
PDFs, and our user’s boss is very unhappy.

Here’s another scenario: a user adds a handler to recover from a parsing error 
reading a corrupted stream. They have access to lots of example PDF files and 
do an amazing job and handle all sorts of strange edge cases. Then a PDFBox 
committer comes along and fixes the same issue in the core parser but they did 
not have as many example files which contain all the weird edge cases handled 
by our user. The committer’s change doesn’t alter any type signatures on the 
public API so it’s not a breaking change right? Wrong, it changed the public 
behaviour of the class. The user’s handler will now never be called, and 
because we didn’t fix the same edge cases that they did they will now silently 
start getting corrupt streams.

> It’s not about supporting different standards […] It’s about having a core 
> which handles conformant files and an extension which handles workarounds for 
> nonconformant files. 

A commonly used approach to parsing programming languages is to have a core 
language which is small, easily parsed and with an AST which is easy to 
manipulate. On top of that is another parser which handles all of the syntactic 
sugar of the language, transforming a complex concrete AST into a simple core 
AST. Perhaps PDFBox could take a similar approach with ConformingParser having 
a NonConformingParser subclass which is capable of pre-processing bad PDF files 
before they reach the core parser. The actual implementation may be more subtle 
than this, perhaps with some back-and-forth between the conforming and 
non-conforming parsers, so that when the conforming parser encounters an error 
it can call a protected method which in ConformingParser would throw an error 
but in NonConformingParser would perform a recovery, as you proposed. But by 
using protected methods we avoid the maintainability problem caused by making 
the error recovery mechanism public.

What do you think?

-- John

On 13 Feb 2014, at 10:57, Maruan Sahyoun <[email protected]> wrote:

> John
> 
> Am 13.02.2014 um 18:50 schrieb John Hewson <[email protected]>:
> 
>> Maruan,
>> 
>>> Now let’s assume there is a situation where an object is not at a certain 
>>> location, or a specific string is missing …. what if we throw an exception 
>>> where one could register a handler. We pass some kind of context e.g. 
>>> lexer, file position, token …. and the user can handle the exception and 
>>> „enrich“ the content or pass the correct information.
>> 
>> The idea sounds reasonable in theory, but the more I reflect on in the more 
>> I think that we should assume that the user is making use of PDFBox because 
>> they don’t want to have to parse the PDF file themselves. I can’t think of 
>> an example where the knowledge of how to correct some invalid PDF would’t be 
>> better off existing within PDFBox itself, rather than in user code.
> 
> Of course they don’t want to parse it themselves. They can expect that PDFBox 
> can handle a valid PDF file. But in case a file is invalid for whatever 
> reason the only options are to either wait until we include a workaround or 
> put it in themselves. The idea is to have an entry point. What’s the benefit 
> of an exception when one can’t do anything about it.  And if you don’t want 
> to write your handler you are not enforced to do so. 
> 
>> 
>> From a technical standpoint, exposing the internal parser context to the 
>> user seems particularly problematic: the internal implementation details 
>> which are part of the context now become part of PDFBox’s public API which 
>> needs to be kept stable between major releases. How is the user to resolve a 
>> non-trivial exception and allow parsing to continue in a manner which leaves 
>> the internals of the parser in a consistent state? If we don’t know how 
>> users are resolving exceptions out in the real world, how can we be sure 
>> that changes we make to the parser later won’t break their code?
> 
> One can only assume that a documented API is stable. As long as this is the 
> case why should it break their code. Of course if a different file is causing 
> a similar exception which will be dealt with by the exception handler and the 
> code is not able to deal with it ...
> 
>> 
>>> In addition to that we are able to extend from a strictly conformant 
>>> parsing to a relaxed parsing by using the same mechanism thus having the 
>>> workarounds not in the ‚core‘ parser.
>> 
>> 
>> My suggestion would be to either subclass the core parser or pass it a 
>> “conformance level” argument, e.g. PDF_1_5 or PDF_X. I don’t think any 
>> external error handling/recovery mechanism is going to work in practice, 
>> especially if that means generating thousands of exceptions when given a bad 
>> content stream.
>> 
> 
> It’s not about supporting different standards - that’s different thing 
> (currently PDFBox doesn’t have concept of applying standards or versions - 
> functions are either available or not, regardless of when they became part of 
> the PDF spec). It’s about having a core which handles conformant files and an 
> extension which handles workarounds for nonconformant files. Currently that’s 
> all within the code - sometimes marked, sometimes not - which makes it 
> difficult to rewrite the parser. As you already found out sometimes a fix was 
> made to handle a single occurrence of a file and the file itself might no 
> longer exist.
> 
> 
>> -- John
>> 
>> On 13 Feb 2014, at 03:24, Maruan Sahyoun <[email protected]> wrote:
>> 
>>> Hi John,
>>> 
>>> currently pdfbox mostly throws IOExceptions where the user of the lib is 
>>> not able to do something about it. 
>>> 
>>> Some of these exceptions could occur because a file was not found etc. So 
>>> that’s ok. Others might occur because objects are not at a certain 
>>> position. There are workarounds for some of these in pdfbox e.g. if %%EOF 
>>> ist not the last entry in a PDF. Thus users are dependent on us putting in 
>>> the workarounds to handle such situations. 
>>> 
>>> Now let’s assume there is a situation where an object is not at a certain 
>>> location, or a specific string is missing …. what if we throw an exception 
>>> where one could register a handler. We pass some kind of context e.g. 
>>> lexer, file position, token …. and the user can handle the exception and 
>>> „enrich“ the content or pass the correct information. The exception is than 
>>> resolved and the process can continue.
>>> 
>>> In addition to that we are able to extend from a strictly conformant 
>>> parsing to a relaxed parsing by using the same mechanism thus having the 
>>> workarounds not in the ‚core‘ parser.
>>> 
>>> BR
>>> Maruan Sahyoun
>>> 
>>> Am 13.02.2014 um 09:44 schrieb John Hewson <[email protected]>:
>>> 
>>>> I'm not sure in understand what you mean, the Camel examples are very 
>>>> complex indeed. A quick concrete example of what you're after would help 
>>>> greatly.
>>>> 
>>>> -- John
>>>> 
>>>>> On 13 Feb 2014, at 00:20, Maruan Sahyoun <[email protected]> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> what do you think of having an exception handling in pdfbox where people 
>>>>> could define their own handlers. Something similar to
>>>>> 
>>>>> https://camel.apache.org/exception-clause.html
>>>>> 
>>>>> The benefit would be that we could pass the context e.g. during PDF 
>>>>> parsing and the handler could return something which is than taken as the 
>>>>> input. In addition to that maybe we can think about having some 
>>>>> additional types of exceptions instead of mostly IOException to support 
>>>>> that.  
>>>>> 
>>>>> BR
>>>>> Maruan Sahyoun
>>>>> 
>>> 
>> 
>

Re: [DISCUSS] PDFBox and Exception handling

Reply via email to