Re: Parsing issues & duplicate objects

Andreas Lehmkühler Thu, 19 Aug 2010 00:49:09 -0700

Hi

> I'm trying to find some solution to the problem of documents which have 
> multiple objects with the same object ID and revision number.  I have some 
> documents which cause NPE and hence the documents can not be merged.  I 
> realize these are out of spec, but when the files are opened with Adobe 
> Reader, they are rendered just fine.  So (non-technical) people figure if 
> Adobe Reader can read it, why can't our software deal with it?
That's a very popular argument among (non-technical) people ...


> I found some code in COSObject::setObject() which seems to take a crack at 
> solving this, but it's all commented out.  I uncommented it hoping it 
> would magically solve my problems, but there was no such luck.  Does 
> anyone know who wrote that code so I can collaborate with them (SVN 
> history didn't have anything)?
It seems to be pre-Apache. I guess Ben wrote that piece of code.

> According to Neil[1], the best thing to do would be to rewrite the parser. 
>  I'm not beyond rewriting the parser if that will solve my issue.  But I 
> need to understand how it currently works and how it should work before I 
> can take on something like that.  I noticed that section 7.5.5 (File 
> Trailer) of the PDF spec says "Conforming readers should read a PDF file 
> from its end." and I'm pretty sure PDFParser::parse() doesn't do that.
I guess those PDFs aren't out of spec; they just contain incremental updates.
Those updates (added, deleted or changed content) are appended to the end
of the document. Each update brings its own XRef section and trailer, which
tell a reader where the newer versions of the objects are. If you want to get
the most recent version of your PDF, you have to start at the end to determine
whether there are any updates. Have a look at section 7.5.6 (Incremental
Updates) for further details.
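
To illustrate what "start at the end" means in practice, here is a minimal
sketch in plain Java. It is not PDFBox code: the class StartXrefSketch and the
method findLastStartXref are hypothetical names made up for this example, and
it assumes the whole file fits in memory. It only looks for the last
"startxref" keyword and reads the byte offset that follows it, which is where
the most recent XRef section begins. A real parser would then follow the /Prev
entries in the trailers to reach the older XRef sections.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of PDFBox.
public class StartXrefSketch {

    /**
     * Returns the number that follows the LAST "startxref" keyword,
     * or -1 if the keyword or the number is missing.
     */
    public static long findLastStartXref(byte[] pdf) {
        // ISO-8859-1 maps each byte to exactly one char, so string
        // indices are the same as byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);

        int keyword = text.lastIndexOf("startxref");
        if (keyword < 0) {
            return -1;
        }

        // Skip the whitespace between the keyword and the offset.
        int pos = keyword + "startxref".length();
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }

        // Collect the ASCII digits of the offset.
        int start = pos;
        while (pos < text.length()
                && text.charAt(pos) >= '0' && text.charAt(pos) <= '9') {
            pos++;
        }
        if (start == pos) {
            return -1;
        }
        return Long.parseLong(text.substring(start, pos));
    }

    public static void main(String[] args) {
        // Fake file with an original revision and one appended update.
        String twoRevisions =
                "%PDF-1.4\n...original body...\nstartxref\n120\n%%EOF\n"
              + "...appended update...\nstartxref\n480\n%%EOF\n";
        byte[] data = twoRevisions.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(findLastStartXref(data)); // prints 480
    }
}
```

Reading from the front would stop at the first offset (120) and miss the
update; searching backwards picks up the offset written by the latest
revision.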

> Anyone think looking at the COSObject will be any faster than rewriting 
> the parser?
A few weeks ago I invested some time in incremental updates, but I didn't
find a solution. I think our parser isn't that bad, and it should be possible
to improve it to handle incremental updates.

> The documents I have are all confidential, so unfortunately I can't share 
> them, but there are some other[1] issues[2] which seem to be somewhat 
> related.  I'm going to keep looking for a file I can get approved for 
> release so I can upload it to JIRA with an exact stacktrace and 
> everything.
> 
> [1] https://issues.apache.org/jira/browse/PDFBOX-569
> [2] https://issues.apache.org/jira/browse/PDFBOX-720
Especially [2] gives some good pointers.

BR
Andreas Lehmkühler
