Maruan Sahyoun

FileAffairs GmbH
Josef-Schappe-Straße 21
40882 Ratingen

Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
[email protected]
www.fileaffairs.de

Managing Director: Maruan Sahyoun
Commercial Register: AG Düsseldorf, HRB 53837
VAT ID: DE248275827

On 21.05.2013 at 08:00, Andreas Lehmkuehler <[email protected]> wrote:

> Hi,
> 
> On 15.05.2013 14:56, Maruan Sahyoun wrote:
>> Hi,
>> 
>> currently PDFBox has a number of workarounds "hidden" in the code for real
>> world PDFs (e.g. PDFBOX-1172) which are not in line with the spec. There are
>> several options to deal with that
>> 
>> e.g.
>> a) keep the workarounds in the core code
> IMO we can't drop them. Whenever a parsing issue arises people often
> argue that all PDF readers but PDFBox are able to handle the PDF in
> question. So people expect that a PDF reader works in any situation,
> whether the PDF follows the spec or not. That's sad but that's life :-(

I agree that as long as Adobe Reader or e.g. Firefox (pdf.js) can handle a
PDF we should be able to handle it too.

> 
>> b) throw an exception and stop working
> We should add some (special) logging, so that one can detect such glitches.
> 

OK
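
A minimal sketch of what such special logging could look like, assuming a
dedicated logger name so that spec deviations can be filtered separately from
other messages. The class SpecDeviationLog, the logger name and the rule id
are hypothetical and not part of PDFBox; java.util.logging is used here only
to keep the example self-contained.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper: one place to report deviations from the PDF spec. */
final class SpecDeviationLog {
    /** Dedicated logger name (an assumption) so these messages can be filtered. */
    static final Logger LOG = Logger.getLogger("pdf.spec.deviation");

    private SpecDeviationLog() {
    }

    /**
     * Logs one deviation at WARNING level.
     *
     * @param ruleId short identifier of the workaround that was applied
     * @param offset byte offset in the file where the problem was seen
     * @param detail free-text description
     */
    static void report(String ruleId, long offset, String detail) {
        LOG.log(Level.WARNING, "{0} at offset {1}: {2}",
                new Object[] {ruleId, offset, detail});
    }
}
```

A workaround would then call something like
SpecDeviationLog.report("missing-close-paren", 1024L, "literal string not terminated")
right before repairing the input.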

>> c) handle it through a pluggable extension
> I'm not sure if there is one solution for every use case. Sometimes it's just
> a question of the used format (e.g. PDFBOX-1172) and sometimes there are bigger
> differences.
> 

It wouldn't be a solution for every use case. I was thinking of PDFs that
cause parsing exceptions. Currently there are workarounds of different kinds
in the code for real world PDFs.

Some are handled by calling specialized routines 
# e.g. checkForMissingCloseParen in BaseParser

Some are handled inline
# line 483 in PDFParser for %%EOF handling
# line 548 in PDFParser for handling 'obj'
# line 733 in PDFParser for incorrect xref table entry


So the extension was meant to 
a) have a clean conforming pdf parser and
b) handle these exceptions to the PDF spec in specialized routines. 

Thinking about these routines, we could do this within the parser, similar
to checkForMissingCloseParen, or by registering handlers for such situations.
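
A minimal sketch of the handler-registration variant, under stated
assumptions: ParseRecoveryHandler, LenientTokenReader and
MissingCloseParenHandler are hypothetical names and not existing PDFBox
classes, and the "strict rule" is reduced to a toy check on a literal string
token. The example handler only borrows its idea from the name
checkForMissingCloseParen; it does not reproduce what BaseParser does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical extension point: repairs one kind of spec violation. */
interface ParseRecoveryHandler {
    /** Returns a repaired token, or empty if this handler does not apply. */
    Optional<String> tryRecover(String token);
}

/** Hypothetical strict core that hands violations to registered handlers. */
final class LenientTokenReader {
    private final List<ParseRecoveryHandler> handlers = new ArrayList<>();

    void register(ParseRecoveryHandler handler) {
        handlers.add(handler);
    }

    /** Accepts conforming tokens as-is; otherwise asks the handlers in order. */
    String readLiteralString(String token) {
        if (isConforming(token)) {
            return token;
        }
        for (ParseRecoveryHandler handler : handlers) {
            Optional<String> repaired = handler.tryRecover(token);
            if (repaired.isPresent() && isConforming(repaired.get())) {
                return repaired.get();
            }
        }
        // No handler could repair the input: behave like the strict parser.
        throw new IllegalArgumentException("Malformed literal string: " + token);
    }

    /** Toy rule for this sketch: a literal string is wrapped in parentheses. */
    private static boolean isConforming(String token) {
        return token.length() >= 2 && token.startsWith("(") && token.endsWith(")");
    }
}

/** Example handler: appends a missing closing parenthesis. */
final class MissingCloseParenHandler implements ParseRecoveryHandler {
    @Override
    public Optional<String> tryRecover(String token) {
        if (token.startsWith("(") && !token.endsWith(")")) {
            return Optional.of(token + ")");
        }
        return Optional.empty();
    }
}
```

Without any registered handler the reader rejects "(abc" with an exception;
after register(new MissingCloseParenHandler()) the same input is returned as
"(abc)". The core stays strict and each workaround lives in its own class,
at the cost of one extra indirection on the error path.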

Benefit:
# core objects/methods are clean from a conforming PDF perspective
# extensions stand out clearly
# easier to add handling of special situations
# developers could add their own special handling

Drawback:
# more complex architecture
# no single place where real world parsing is handled
# runtime performance impact


>> WDYT?
>> 
>> Maruan Sahyoun
> 
> BR
> Andreas Lehmkühler
> 
