Maruan I’ve given this some considerable thought and realised that while it may technically work, it won’t be maintainable.
> the user is making use of PDFBox because they don’t want to have to parse > the PDF file themselves. >> >> But in case a file is invalid for whatever reason the only options are to >> either wait until we include a workaround or put it in themselves. And if >> you don’t want to write your handler you are not enforced to do so. It seems to me that any user who knows enough about the internals of the PDFBox parser to write an error handler should know how to make those changes in PDFBox and submit a patch. If their patch is good and comes with an example PDF then it will be in the latest snapshot within a day or so. Users who don’t know what they’re doing shouldn’t be writing error handlers, and users who know what they’re doing don’t need to. >> From a technical standpoint, exposing the internal parser context to the >> user seems particularly problematic: the internal implementation details >> which are part of the context now become part of PDFBox’s public API which >> needs to be kept stable between major releases. How is the user to resolve a >> non-trivial exception and allow parsing to continue in a manner which leaves >> the internals of the parser in a consistent state? If we don’t know how >> users are resolving exceptions out in the real world, how can we be sure >> that changes we make to the parser later won’t break their code? > > One can only assume that a documented API is stable. As long as this is the > case why should it break their code. Of course if a different file is causing > a similar exception which will be dealt with by the exception handler and the > code is not able to deal with it ... You’ve missed the major point here: any “context” would necessarily expose the *internal* implementation details of the parser, which breaks the principle of encapsulation and will make PDFBox harder to maintain without introducing breaking changes. This is because every internal detail of a context becomes a fixed part of the public API, as does when and where it is called, i.e. the *behaviour* is part of the public API, not just the type signature. Changes to when and where error contexts are used and exactly how, will be part of the public behaviour and thus must be kept stable. The end result is that it will be incredibly hard to make internal changes to the parser without breaking major version compatibility. Imagine a scenario where a user has a number of PDFs some of which are known to contain errors which require manual review of the files after they are processed. The user discovers that all such erroneous PDFs trigger one of the parser's error handlers for invalid date parsing, and so they write a handler callback which correctly parses the date and flags the PDF in their system as being in need of manual review. Imagine that at some later point, a helpful person comes along and fixes the underling date parsing issue in the core PDFBox parser, looks good, right? Boom, that was a breaking change because it altered the public behaviour of the parser. The invalid date handler in our user’s code will no longer be called because it is now handled internally by PDFBox, so the PDF won’t be flagged for review, their customers receive broken PDFs, and our user’s boss is very unhappy. Here’s another scenario: a user adds a handler to recover from a parsing error reading a corrupted stream. They have access to lots of example PDF files and do an amazing job and handle all sorts of strange edge cases. Then a PDFBox committer comes along and fixes the same issue in the core parser but they did not have as many example files which contain all the weird edge cases handled by our user. The committer’s change doesn’t alter any type signatures on the public API so it’s not a breaking change right? Wrong, it changed the public behaviour of the class. The user’s handler will now never be called, and because we didn’t fix the same edge cases that they did they will now silently start getting corrupt streams. > It’s not about supporting different standards […] It’s about having a core > which handles conformant files and an extension which handles workarounds for > nonconformant files. A commonly used approach to parsing programming languages is to have a core language which is small, easily parsed and with an AST which is easy to manipulate. On top of that is another parser which handles all of the syntactic sugar of the language, transforming a complex concrete AST into a simple core AST. Perhaps PDFBox could take a similar approach with ConformingParser having a NonConformingParser subclass which is capable of pre-processing bad PDF files before they reach the core parser. The actual implementation may be more subtle than this, perhaps with some back-and-forth between the conforming and non-conforming parsers, so that when the conforming parser encounters an error it can call a protected method which in ConformingParser would throw an error but in NonConformingParser would perform a recovery, as you proposed. But by using protected methods we avoid the maintainability problem caused by making the error recovery mechanism public. What do you think? -- John On 13 Feb 2014, at 10:57, Maruan Sahyoun <sahy...@fileaffairs.de> wrote: > John > > Am 13.02.2014 um 18:50 schrieb John Hewson <johnahew...@yahoo.co.uk>: > >> Maruan, >> >>> Now let’s assume there is a situation where an object is not at a certain >>> location, or a specific string is missing …. what if we throw an exception >>> where one could register a handler. We pass some kind of context e.g. >>> lexer, file position, token …. and the user can handle the exception and >>> „enrich“ the content or pass the correct information. >> >> The idea sounds reasonable in theory, but the more I reflect on in the more >> I think that we should assume that the user is making use of PDFBox because >> they don’t want to have to parse the PDF file themselves. I can’t think of >> an example where the knowledge of how to correct some invalid PDF would’t be >> better off existing within PDFBox itself, rather than in user code. > > Of course they don’t want to parse it themselves. They can expect that PDFBox > can handle a valid PDF file. But in case a file is invalid for whatever > reason the only options are to either wait until we include a workaround or > put it in themselves. The idea is to have an entry point. What’s the benefit > of an exception when one can’t do anything about it. And if you don’t want > to write your handler you are not enforced to do so. > >> >> From a technical standpoint, exposing the internal parser context to the >> user seems particularly problematic: the internal implementation details >> which are part of the context now become part of PDFBox’s public API which >> needs to be kept stable between major releases. How is the user to resolve a >> non-trivial exception and allow parsing to continue in a manner which leaves >> the internals of the parser in a consistent state? If we don’t know how >> users are resolving exceptions out in the real world, how can we be sure >> that changes we make to the parser later won’t break their code? > > One can only assume that a documented API is stable. As long as this is the > case why should it break their code. Of course if a different file is causing > a similar exception which will be dealt with by the exception handler and the > code is not able to deal with it ... > >> >>> In addition to that we are able to extend from a strictly conformant >>> parsing to a relaxed parsing by using the same mechanism thus having the >>> workarounds not in the ‚core‘ parser. >> >> >> My suggestion would be to either subclass the core parser or pass it a >> “conformance level” argument, e.g. PDF_1_5 or PDF_X. I don’t think any >> external error handling/recovery mechanism is going to work in practice, >> especially if that means generating thousands of exceptions when given a bad >> content stream. >> > > It’s not about supporting different standards - that’s different thing > (currently PDFBox doesn’t have concept of applying standards or versions - > functions are either available or not, regardless of when they became part of > the PDF spec). It’s about having a core which handles conformant files and an > extension which handles workarounds for nonconformant files. Currently that’s > all within the code - sometimes marked, sometimes not - which makes it > difficult to rewrite the parser. As you already found out sometimes a fix was > made to handle a single occurrence of a file and the file itself might no > longer exist. > > >> -- John >> >> On 13 Feb 2014, at 03:24, Maruan Sahyoun <sahy...@fileaffairs.de> wrote: >> >>> Hi John, >>> >>> currently pdfbox mostly throws IOExceptions where the user of the lib is >>> not able to do something about it. >>> >>> Some of these exceptions could occur because a file was not found etc. So >>> that’s ok. Others might occur because objects are not at a certain >>> position. There are workarounds for some of these in pdfbox e.g. if %%EOF >>> ist not the last entry in a PDF. Thus users are dependent on us putting in >>> the workarounds to handle such situations. >>> >>> Now let’s assume there is a situation where an object is not at a certain >>> location, or a specific string is missing …. what if we throw an exception >>> where one could register a handler. We pass some kind of context e.g. >>> lexer, file position, token …. and the user can handle the exception and >>> „enrich“ the content or pass the correct information. The exception is than >>> resolved and the process can continue. >>> >>> In addition to that we are able to extend from a strictly conformant >>> parsing to a relaxed parsing by using the same mechanism thus having the >>> workarounds not in the ‚core‘ parser. >>> >>> BR >>> Maruan Sahyoun >>> >>> Am 13.02.2014 um 09:44 schrieb John Hewson <j...@jahewson.com>: >>> >>>> I'm not sure in understand what you mean, the Camel examples are very >>>> complex indeed. A quick concrete example of what you're after would help >>>> greatly. >>>> >>>> -- John >>>> >>>>> On 13 Feb 2014, at 00:20, Maruan Sahyoun <sahy...@fileaffairs.de> wrote: >>>>> >>>>> Hi, >>>>> >>>>> what do you think of having an exception handling in pdfbox where people >>>>> could define their own handlers. Something similar to >>>>> >>>>> https://camel.apache.org/exception-clause.html >>>>> >>>>> The benefit would be that we could pass the context e.g. during PDF >>>>> parsing and the handler could return something which is than taken as the >>>>> input. In addition to that maybe we can think about having some >>>>> additional types of exceptions instead of mostly IOException to support >>>>> that. >>>>> >>>>> BR >>>>> Maruan Sahyoun >>>>> >>> >> >