Re: [DISCUSS] PDFBox and Exception handling

Maruan Sahyoun Thu, 13 Feb 2014 11:08:54 -0800

John

On 13.02.2014 at 18:50, John Hewson <johnahew...@yahoo.co.uk> wrote:


> Maruan,
> 
>> Now let’s assume there is a situation where an object is not at a certain 
>> location, or a specific string is missing …. what if we throw an exception 
>> where one could register a handler. We pass some kind of context e.g. lexer, 
>> file position, token …. and the user can handle the exception and „enrich“ 
>> the content or pass the correct information.
> 
> The idea sounds reasonable in theory, but the more I reflect on it, the more I 
> think that we should assume that the user is making use of PDFBox because 
> they don’t want to have to parse the PDF file themselves. I can’t think of an 
> example where the knowledge of how to correct some invalid PDF wouldn’t be 
> better off existing within PDFBox itself, rather than in user code.

Of course they don’t want to parse it themselves. They can expect that PDFBox 
can handle a valid PDF file. But if a file is invalid for whatever reason, 
their only options are to wait until we include a workaround or to put one in 
themselves. The idea is to have an entry point. What’s the benefit of an 
exception when one can’t do anything about it? And if you don’t want to write 
a handler, you are not forced to do so.
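
To make the entry-point idea concrete, here is a minimal sketch of a parser that consults a registered handler before giving up. Every name in it (ParseContext, ParseErrorHandler, ToyParser, expect) is hypothetical and chosen for illustration only; none of it is existing PDFBox API, and a real context would likely carry more than a position and two tokens.

```java
import java.io.IOException;
import java.util.Optional;

// Hypothetical types for illustration only; none of these exist in PDFBox.
public class RecoverableParsingSketch {

    /** Read-only snapshot of where the parser got stuck. */
    static final class ParseContext {
        final long filePosition;
        final String expectedToken;
        final String actualToken;

        ParseContext(long filePosition, String expectedToken, String actualToken) {
            this.filePosition = filePosition;
            this.expectedToken = expectedToken;
            this.actualToken = actualToken;
        }
    }

    /** User-supplied callback: return a replacement token, or empty to decline. */
    interface ParseErrorHandler {
        Optional<String> handle(ParseContext context);
    }

    /** Stand-in for a parser; it only knows how to check a single token. */
    static final class ToyParser {
        // Default handler declines, so behaviour without a handler is unchanged.
        private ParseErrorHandler handler = context -> Optional.empty();

        void setErrorHandler(ParseErrorHandler handler) {
            this.handler = handler;
        }

        String expect(String expected, String actual, long position) throws IOException {
            if (expected.equals(actual)) {
                return actual;
            }
            // Give the registered handler a chance to supply the correct token.
            Optional<String> replacement =
                    handler.handle(new ParseContext(position, expected, actual));
            if (replacement.isPresent()) {
                return replacement.get();
            }
            throw new IOException("Expected '" + expected + "' at offset " + position
                    + " but found '" + actual + "'");
        }
    }

    public static void main(String[] args) throws IOException {
        ToyParser parser = new ToyParser();
        // This handler only repairs a damaged end-of-file marker.
        parser.setErrorHandler(ctx -> "%%EOF".equals(ctx.expectedToken)
                ? Optional.of("%%EOF")
                : Optional.empty());
        System.out.println(parser.expect("%%EOF", "garbage", 1024L));
    }
}
```

Without a registered handler the sketch still throws an IOException, which matches the point that nobody is forced to write one.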
 
> 
> From a technical standpoint, exposing the internal parser context to the user 
> seems particularly problematic: the internal implementation details which are 
> part of the context now become part of PDFBox’s public API which needs to be 
> kept stable between major releases. How is the user to resolve a non-trivial 
> exception and allow parsing to continue in a manner which leaves the 
> internals of the parser in a consistent state? If we don’t know how users are 
> resolving exceptions out in the real world, how can we be sure that changes 
> we make to the parser later won’t break their code?

One can only assume that a documented API is stable. As long as this is the 
case, why should it break their code? Of course, if a different file causes a 
similar exception that is routed to the same exception handler, and the 
handler’s code is not able to deal with it ...

> 
>> In addition to that we are able to extend from a strictly conformant parsing 
>> to a relaxed parsing by using the same mechanism thus having the workarounds 
>> not in the ‚core‘ parser.
> 
> 
> My suggestion would be to either subclass the core parser or pass it a 
> “conformance level” argument, e.g. PDF_1_5 or PDF_X. I don’t think any 
> external error handling/recovery mechanism is going to work in practice, 
> especially if that means generating thousands of exceptions when given a bad 
> content stream.
> 

It’s not about supporting different standards - that’s a different thing 
(currently PDFBox doesn’t have a concept of applying standards or versions - 
functions are either available or not, regardless of when they became part of 
the PDF spec). It’s about having a core which handles conformant files and an 
extension which handles workarounds for nonconformant files. Currently that’s 
all within the code - sometimes marked, sometimes not - which makes it 
difficult to rewrite the parser. As you already found out, sometimes a fix was 
made to handle a single file, and the file itself might no longer exist.
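
One possible shape for that split is sketched below, using the %%EOF case mentioned earlier in the thread. The class and method names (StrictTrailerCheck, LenientTrailerCheck, onMissingEof) are hypothetical and not taken from PDFBox, and the "strict" rule here is a simplification for illustration rather than a statement of what the PDF specification requires.

```java
import java.io.IOException;

// Hypothetical sketch: the core check knows only the conformant case,
// and the workaround lives in a separate extension class.
public class StrictVsLenientSketch {

    /** Core behaviour: the file must end with the %%EOF marker. */
    static class StrictTrailerCheck {
        void checkEof(String fileTail) throws IOException {
            // Trailing whitespace is tolerated; anything else after the marker is not.
            if (!fileTail.trim().endsWith("%%EOF")) {
                onMissingEof(fileTail);
            }
        }

        /** Extension point; the core simply reports the problem. */
        protected void onMissingEof(String fileTail) throws IOException {
            throw new IOException("Missing %%EOF marker at end of file");
        }
    }

    /** Extension: accept a marker that is followed by trailing junk. */
    static class LenientTrailerCheck extends StrictTrailerCheck {
        @Override
        protected void onMissingEof(String fileTail) throws IOException {
            if (!fileTail.contains("%%EOF")) {
                // No marker at all: fall back to the core behaviour.
                super.onMissingEof(fileTail);
            }
            // Otherwise the workaround applies and parsing may continue.
        }
    }

    public static void main(String[] args) throws IOException {
        String tail = "startxref\n1234\n%%EOF\ntrailing junk";
        new LenientTrailerCheck().checkEof(tail);  // accepted by the extension
        new StrictTrailerCheck().checkEof(tail);   // throws IOException
    }
}
```

With this arrangement each workaround sits in one clearly marked place, so the core can be rewritten without hunting for unmarked special cases.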


> -- John
> 
> On 13 Feb 2014, at 03:24, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
> 
>> Hi John,
>> 
>> currently pdfbox mostly throws IOExceptions that the user of the lib is not 
>> able to do anything about. 
>> 
>> Some of these exceptions could occur because a file was not found etc. So 
>> that’s ok. Others might occur because objects are not at a certain position. 
>> There are workarounds for some of these in pdfbox e.g. if %%EOF is not the 
>> last entry in a PDF. Thus users are dependent on us putting in the 
>> workarounds to handle such situations. 
>> 
>> Now let’s assume there is a situation where an object is not at a certain 
>> location, or a specific string is missing …. what if we throw an exception 
>> where one could register a handler. We pass some kind of context e.g. lexer, 
>> file position, token …. and the user can handle the exception and „enrich“ 
>> the content or pass the correct information. The exception is then resolved 
>> and the process can continue.
>> 
>> In addition to that we are able to extend from a strictly conformant parsing 
>> to a relaxed parsing by using the same mechanism thus having the workarounds 
>> not in the ‚core‘ parser.
>> 
>> BR
>> Maruan Sahyoun
>> 
>> On 13.02.2014 at 09:44, John Hewson <j...@jahewson.com> wrote:
>> 
>>> I'm not sure I understand what you mean, the Camel examples are very 
>>> complex indeed. A quick concrete example of what you're after would help 
>>> greatly.
>>> 
>>> -- John
>>> 
>>>> On 13 Feb 2014, at 00:20, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> what do you think of having exception handling in pdfbox where people 
>>>> could define their own handlers? Something similar to
>>>> 
>>>> https://camel.apache.org/exception-clause.html
>>>> 
>>>> The benefit would be that we could pass the context e.g. during PDF 
>>>> parsing and the handler could return something which is then taken as the 
>>>> input. In addition to that maybe we can think about having some additional 
>>>> types of exceptions instead of mostly IOException to support that.  
>>>> 
>>>> BR
>>>> Maruan Sahyoun
>>>> 
>> 
>
