On 20.06.2011 19:00, [email protected] wrote:
What is the proposed solution for this?  According to the PDF spec, there
will never be two objects with the same object number and revision.
However, this is the real world, not a world of conforming PDF documents,
so I completely understand that this does occur.  My question is mainly:
how do you plan on telling which object is the "right" one, and which one
should be overwritten?

First of all, with incremental updates you can overwrite existing objects. The revision (generation number) only increases if the previous object was marked as deleted. This is a little strange: in the normal case the revision never increments. A good parser can mark older, no longer used objects as garbage.
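To make the entry format concrete, here is a small sketch (Python, purely illustrative; the function name and the sample lines are mine, not from any particular library) of how the fixed 20-byte entries of a classic xref table could be read:

    # Each entry of a classic xref table is a fixed 20-byte line:
    # a 10-digit byte offset, a 5-digit generation (revision), and "n" (in use) or "f" (free).
    def parse_xref_entry(line):
        offset = int(line[0:10])        # byte offset of the object ("n") or number of the next free object ("f")
        generation = int(line[11:16])   # the revision; for an "f" entry, the revision a re-use of the number would get
        in_use = line[17:18] == b"n"
        return offset, generation, in_use

    print(parse_xref_entry(b"0000001234 00000 n \n"))   # a live object, revision 0
    print(parse_xref_entry(b"0000000000 00001 f \n"))   # a freed object; the next writer would use revision 1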

Example:
you have a page with some pictures. Now you want to drop some of the images but add some text to the rest of the page. If you want to update this page incrementally, you need to get the page object and change it (remove some images and add some text). Once you write the new page object and update the xref table, the old page object isn't visible any more. The images are still in the document and still reserve their object numbers.
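To picture it, here is a minimal sketch of the bytes such an incremental update could append after the original %%EOF. Everything here (object numbers, offsets, /Prev and the other trailer values) is invented, and the new content stream object itself is left out:

    # Rough sketch of an incremental update: a replacement page object, a new xref section and a trailer.
    incremental_update = b"""4 0 obj                                % same object number as the old page object
    << /Type /Page /Parent 2 0 R
       /Resources << /XObject << /Im1 6 0 R >> >>  % some of the old images are gone
       /Contents 9 0 R >>                          % 9 0 R would be a new content stream with the added text
    endobj
    xref
    4 1
    0000020000 00000 n 
    9 1
    0000020150 00000 n 
    trailer
    << /Size 10 /Root 1 0 R /Prev 19000 >>         % /Prev = byte offset of the previous xref section
    startxref
    20200
    %%EOF
    """
    # (the new content stream object "9 0 obj" is omitted from this sketch)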

The next PDF writer doesn't know that those objects aren't used any more. If you add another page via an incremental update, you use new objects with new object numbers. Each PDF has a limited amount of object numbers (65k), and you can update existing objects but not replace them with different ones.

So if you want to save some object numbers, you can try to mark unused objects as deleted. A good PDF writer sees those objects and can reuse the same number with an incremented revision.
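In the classic table format that is just another appended xref section with an "f" entry (again a made-up sketch, offsets invented):

    # Sketch: an appended xref section that marks object 7 as deleted (free).
    # The generation field of the "f" entry, 00001 here, is the revision the
    # object number would carry if a later writer re-creates object 7.
    free_marker = b"""xref
    7 1
    0000000000 00001 f 
    trailer
    << /Size 10 /Root 1 0 R /Prev 20200 >>
    startxref
    20450
    %%EOF
    """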

That's the theory; in the real world no one does this :) 65k objects are enough, and if someone plans to use up that many objects, they can use xref streams instead. ;)


-------

The normal way to overwrite an object is to use the same object number and the same revision. To make the new object the visible one, only the xref table needs to be updated, so it's very important to process the tables from the old one to the new one.

The new one should set the Prev entry to the offset of the older one. So a parser reads the newest xref table, checks the Prev value and jumps to that one. It's a kind of recursive parsing of the tables: the first one is the newest and the last one in the chain is the oldest. A conforming parser should process the oldest one first.

But if you parse the xref tables from the beginning of the file to the end of the file, you also read them in the right order.
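In code, the "old to new" rule is just a merge where newer entries overwrite older ones; a minimal sketch (the function name and the numbers are mine):

    def merge_xref_sections(sections_oldest_first):
        """Build the live xref map from per-section entry dicts.

        Each dict maps object number -> (offset, generation). Applying the sections
        from oldest to newest means an entry written by a later incremental update
        simply overwrites the entry of the same object number from an earlier table."""
        live = {}
        for section in sections_oldest_first:
            live.update(section)
        return live

    # Example: object 4 was overwritten by an incremental update.
    original = {1: (100, 0), 4: (500, 0)}
    update   = {4: (20000, 0), 9: (20150, 0)}
    print(merge_xref_sections([original, update]))   # object 4 now points at offset 20000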

If you need to handle xref streams, the easy way (from the beginning to the end of the file) doesn't work for most documents. Documents that use xref streams are mostly called "web optimized" or "linearized".

Web-optimized files are optimized for reading the document from beginning to end. This is meant for big documents: if you try to load a non-optimized PDF, the application needs to read the whole stream before it can handle the file. So even if you only need the first page of the PDF, you have to parse the whole PDF. For a slow web connection this isn't a good way.

Web-optimized documents contain all the needed information right at the front of the file, so the reader can read a small amount of data and already knows how many pages the document has and where the first page is, without loading the rest of the file. Try to load the PDF specification from the web: you will see the first page and a progress bar (depending on the speed of your connection) while the rest of the document loads.
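A rough way to see this for yourself (a heuristic sketch, not a real linearization check; the function name is mine):

    def looks_linearized(path, probe_size=2048):
        """Heuristic: a linearized ("web optimized") file starts with a small first
        object whose dictionary contains /Linearized, the total file length /L and
        the page count /N, so a reader that has only the first few KB of the file
        already knows how many pages there are and where page one starts."""
        with open(path, "rb") as fh:
            head = fh.read(probe_size)
        return b"/Linearized" in head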

For example:
a web-optimized document can contain more than one xref stream. The first one contains only a few objects, the last one the rest of them.


The patch I tried yesterday sorts the xref streams from the smallest Prev offset to the biggest, but in my example file the xref streams are mixed, so the Prev offset doesn't help. The best way is to jump directly to the offsets and read the streams recursively. Maybe I've got an idea.
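Something along these lines is what I mean by "recursively" (a sketch only; read_section_at stands for whatever routine parses the classic table or xref stream found at a byte offset, it's not a real function from the code):

    def collect_xref_sections(read_section_at, offset, seen=None):
        """Follow the /Prev chain starting at the newest xref section.

        read_section_at(offset) is assumed to return (entries_dict, prev_offset_or_None).
        The result is the list of entry dicts ordered oldest first, ready to be fed
        into merge_xref_sections() above; tracking visited offsets guards against
        broken files whose /Prev values form a loop."""
        if seen is None:
            seen = set()
        if offset is None or offset in seen:
            return []
        seen.add(offset)
        entries, prev_offset = read_section_at(offset)
        older = collect_xref_sections(read_section_at, prev_offset, seen)
        return older + [entries]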

Sorry for the long text; I tried to explain it for every user who has a basic knowledge of the PDF structure.



Best regards
Thomas
