Re: [Podofo-users] PdfContentsTokenizer position is reset with multiple streams

Mike Slegeir Thu, 27 Aug 2009 10:47:49 -0700

I'm Cc-ing the mailing list on this one because I'm not sure how best to implement the carrying over of the fragment between the streams; hopefully someone else will.

I'm attaching the same PDF with the BT in the second stream removed. This is how the PDFs that inspired this issue report do things: it's as if the streams are split at arbitrary points, with the expectation that they will simply be concatenated before they're rendered.

I agree that the first solution is better, but it certainly comes with strange implications for the user (they'll have to hold onto the fragment themselves to pass to the next tokenizer). Perhaps the fragment could be passed up in the exception and then passed as an optional argument (defaulting to NULL) to PdfTokenizer/PdfContentsTokenizer.
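
To make the idea concrete, here is a rough sketch of what I mean. Everything in it is made up for illustration (UnexpectedEofError, FragmentTokenizer and ReadArray are hypothetical names, not PoDoFo classes or methods), and it only understands a flat "[...]" array:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical exception that carries the unparsed tail of a stream.
class UnexpectedEofError : public std::runtime_error {
public:
    explicit UnexpectedEofError(std::string fragment)
        : std::runtime_error("unexpected end of stream"),
          m_fragment(std::move(fragment)) {}
    const std::string& Fragment() const { return m_fragment; }

private:
    std::string m_fragment;
};

// Hypothetical tokenizer: takes a buffer and length, plus an optional
// fragment (defaulting to null) left over from the previous stream.
class FragmentTokenizer {
public:
    FragmentTokenizer(const char* buffer, std::size_t length,
                      const std::string* fragment = nullptr)
        : m_data(fragment ? *fragment : std::string()) {
        assert(buffer != nullptr || length == 0);
        if (length > 0) m_data.append(buffer, length);
    }

    // Returns the next "[...]" array as one token. If the data ends before
    // the closing bracket, the unparsed tail travels up in the exception.
    std::string ReadArray() {
        const std::size_t open = m_data.find('[', m_pos);
        if (open == std::string::npos)
            throw UnexpectedEofError(m_data.substr(m_pos));
        const std::size_t close = m_data.find(']', open);
        if (close == std::string::npos)
            throw UnexpectedEofError(m_data.substr(open));
        m_pos = close + 1;
        return m_data.substr(open, close - open + 1);
    }

private:
    std::string m_data;
    std::size_t m_pos = 0;
};
```

The caller would catch the exception on the first stream, keep e.Fragment(), and pass its address as the third argument when constructing the tokenizer for the next stream. Callers that never pass a fragment get the old behaviour.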


- Mike Slegeir

A. Massad wrote:

Hi Mike,
3) No, I'm not really sure. I based the concept off of a PDF I found in the wild. The BT in the second stream was a mistake on my part; the original PDF which I based this off of doesn't have a BT in its second stream.
Could you prepare a PDF without a BT in the second stream? Then I would like to test it with the Adobe tools. Acrobat, for example, has syntax validation. That way we could find out whether Adobe accepts the split array as valid PDF.
4) Good question. I'm not really sure how that should be handled. I'm doing something similar to you in my app, but rather than preserve all the streams, I just replace them with one concatenated stream. I'd like to be able to preserve the structure as well, but I really don't know how that'd be possible in such a case without some weird hacks to the tokenizer.
I agree, the structure of the streams should be maintained as well as possible. I do not trust output that has been reordered into one single stream.
My suggestion would be for the UnexpectedEOF exception to hold onto the state of the tokenizer which could be fed to the next stream so that the streams would only be disturbed to the extent that the interrupted structure moved into the second stream as a whole, but I'm not sure how feasible that would be. Another possibility is to use the other kind of PdfContentsTokenizer (entire page contents) and have it somehow indicate that it has moved to another stream (either by throwing an exception or some other means).
OK, I prefer your first suggestion. I think of a strategy as follows:
1. I have to find a way to remember the "last good" position in the stream, i.e. the point up to which the stream has been parsed without errors.
2. Upon "UnexpectedEOF", we could store the remainder of the incomplete stream in a string, i.e. the substring of the stream from lastpos+1 to the end of the stream.
3. Before parsing the next stream, simply prepend the remainder to the next stream and start parsing.
To implement this solution, step 1 requires a means to determine the position of the tokenizer in the stream. Any ideas? Possibly, it's already in the source.
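
The three steps above could be sketched roughly as follows. This is a toy tokenizer written only for illustration, not PoDoFo code: ParseResult and TokenizeStream are hypothetical names, arrays are not nested, and brackets inside strings are not handled.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Result of tokenizing one stream: the complete tokens, plus the trailing
// text that could not be parsed because the stream ended too early.
struct ParseResult {
    std::vector<std::string> tokens;
    std::string remainder;  // empty when the stream ended cleanly
};

// Toy tokenizer: whitespace-separated words, with a bracketed array
// "[...]" treated as a single token.
ParseResult TokenizeStream(const std::string& carried, const std::string& stream) {
    // Step 3: prepend the remainder of the previous stream.
    const std::string data = carried + stream;
    ParseResult result;
    std::size_t pos = 0;
    std::size_t lastGood = 0;  // Step 1: offset just past the last complete token

    while (pos < data.size()) {
        if (std::isspace(static_cast<unsigned char>(data[pos]))) {
            ++pos;
            continue;
        }
        std::size_t end = pos;
        if (data[pos] == '[') {
            end = data.find(']', pos);
            if (end == std::string::npos) break;  // unexpected EOF inside the array
            ++end;                                // include the closing bracket
        } else {
            while (end < data.size() && data[end] != '[' &&
                   !std::isspace(static_cast<unsigned char>(data[end]))) {
                ++end;
            }
        }
        result.tokens.push_back(data.substr(pos, end - pos));
        pos = end;
        lastGood = end;
    }

    // Step 2: on unexpected EOF, keep everything after the last good position.
    if (pos < data.size()) {
        assert(lastGood <= data.size());
        result.remainder = data.substr(lastGood);
    }
    return result;
}
```

With the two streams from the sample, the first call yields the tokens "0" and "Tw" plus a remainder holding the opened array; feeding that remainder into the second call yields the complete array followed by "TJ" and "ET". Note that a keyword cut in half exactly at a stream boundary would still be treated as complete by this sketch.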
Greetings,
Amin
- Mike Slegeir

A. Massad wrote:
Hi Mike,
let's briefly switch to personal email (feel free to fwd to the list, if you think it is of interest for all):
1) Yes, in my application I use the constructor PoDoFo::PdfContentsTokenizer(char *, pdf_long), as you assumed. So, it does not seem to be affected by your patch (although I haven't tested it yet).
2) You are right, my application fails to parse your sample "split-array.pdf". It throws an exception at the end of the stream with the split array, and cannot reassemble the array when reading the next stream. In fact, the parser confuses the closing square bracket "]" with a keyword.
3) Are you sure that your sample is valid PDF? It displays correctly in Mac OS X Preview.app, but looks strange in Adobe Acroread and Adobe Acrobat (v. 9.1.2). Maybe split arrays are allowed by the spec, but the injected BT before the closing of the array looks suspicious to me.
0 Tw
*[(Hello )*
endstream
endobj

6 0 obj
81
endobj

%% Contents for page 1
7 0 obj
<<
 /Length 8 0 R
>>
stream
*BT*
*10(World!)]TJ*
ET
4) If this splitting is in accordance with the PDF spec, how can we fix the tokenizer to work correctly for the second constructor, too (i.e. PoDoFo::PdfContentsTokenizer(char *, pdf_long)) without losing stream boundaries?
Thank you for your efforts. Great work!

Greetings,
Amin

On 26.08.2009, at 20:01, Mike Slegeir wrote:
Hey Amin,
Good point, but I don't think the change I'm suggesting would affect your usage. I assume in order to handle streams one at a time, you're using the PdfContentsTokenizer which accepts a const char* and a length rather than the PdfCanvas* constructor. If that's not the case, I'm not sure how you're able to detect stream boundaries as is. My suggestion is just to move the code at the top of PdfContentsTokenizer::ReadNext (the if(!gotToken) block) into a virtual PdfContentsTokenizer::GetNextToken method. If you're using the first constructor, m_lstContents will be empty and PdfContentsTokenizer will behave as before. Otherwise, if you construct with a PdfCanvas*, the stream transitions will be seamless (as they have been), but it will now behave correctly when an object is split across Content streams. I'm also curious if your application could handle the PDF that I previously posted where an array is split across the streams. Unfortunately, though, I don't think that it would work nor that it's really fixable in that case: you'd just have to use the PdfCanvas* PdfContentsTokenizer which could be fixed by my suggested change.
- Mike Slegeir

A. Massad wrote:
Hi Mike,
If you change the behavior of PdfContentsTokenizer::GetNextToken() to span across streams, could you please provide a flag to toggle this behavior? For some users (like me) it might be important to change back to the "old" behavior which DOES NOT span across streams.
I have got an application which parses through streams and replaces the content of each single stream without changing the overall structure of the streams. I think that this wouldn't be possible any longer if PdfContentsTokenizer::GetNextToken() did not detect stream boundaries anymore.
Thanks in advance!

Greetings,
Amin

On 26.08.2009, at 17:17, Mike Slegeir wrote:
I've discovered another related issue. PdfTokenizer is unable to reach into the next content stream in order to get a token, so any object which is split across Contents streams causes an UnexpectedEOF to be raised. My suggested solution is either to concatenate all the Contents streams before doing any tokenization, or to make PdfTokenizer::GetNextToken virtual and move the stream-switching logic into PdfContentsTokenizer::GetNextToken, such that it will try the parent's version, attempt to move to the next stream (if it exists) on failure, then retry. Attached is a very basic example of an array split between two streams.
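
Roughly, the second option would look like the sketch below. These are toy stand-ins (ToyTokenizer and ToyContentsTokenizer are hypothetical, not the real PdfTokenizer/PdfContentsTokenizer), and a stream boundary is treated like whitespace, so a single word cut in half would not be rejoined:

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <deque>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Toy base tokenizer working on a single buffer.
class ToyTokenizer {
public:
    explicit ToyTokenizer(std::string buffer) : m_buffer(std::move(buffer)) {}
    virtual ~ToyTokenizer() = default;

    // Reads one token; returns false at the end of the buffer.
    // '[' and ']' are single-character tokens, everything else is a word.
    virtual bool GetNextToken(std::string& token) {
        assert(m_pos <= m_buffer.size());
        while (m_pos < m_buffer.size() &&
               std::isspace(static_cast<unsigned char>(m_buffer[m_pos]))) ++m_pos;
        if (m_pos >= m_buffer.size()) return false;
        const std::size_t start = m_pos;
        if (m_buffer[m_pos] == '[' || m_buffer[m_pos] == ']') {
            ++m_pos;
        } else {
            while (m_pos < m_buffer.size() && m_buffer[m_pos] != '[' &&
                   m_buffer[m_pos] != ']' &&
                   !std::isspace(static_cast<unsigned char>(m_buffer[m_pos]))) ++m_pos;
        }
        token = m_buffer.substr(start, m_pos - start);
        return true;
    }

    // Reads array items up to ']' (the '[' has already been consumed).
    // It goes through the virtual GetNextToken, so an override can refill.
    std::vector<std::string> ReadArray() {
        std::vector<std::string> items;
        std::string token;
        while (GetNextToken(token)) {
            if (token == "]") return items;
            items.push_back(token);
        }
        throw std::runtime_error("UnexpectedEOF inside array");
    }

protected:
    void Reset(std::string buffer) { m_buffer = std::move(buffer); m_pos = 0; }

private:
    std::string m_buffer;
    std::size_t m_pos = 0;
};

// Toy contents tokenizer: try the parent's version, move to the next
// stream (if one exists) on failure, then retry.
class ToyContentsTokenizer : public ToyTokenizer {
public:
    explicit ToyContentsTokenizer(std::deque<std::string> streams)
        : ToyTokenizer(std::string()), m_streams(std::move(streams)) {}

    bool GetNextToken(std::string& token) override {
        for (;;) {
            if (ToyTokenizer::GetNextToken(token)) return true;
            if (m_streams.empty()) return false;
            Reset(std::move(m_streams.front()));
            m_streams.pop_front();
        }
    }

private:
    std::deque<std::string> m_streams;
};
```

Because ReadArray calls GetNextToken virtually, an array that starts in one stream and closes in the next is read as one array, while a tokenizer built on a single buffer behaves as before.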
- Mike Slegeir

Mike Slegeir wrote:
I've resolved this issue in an admittedly hacky way. This may be sufficient for this problem though. Attached is a patch which fixes the issue. I've only done limited testing, but it does at least correct the issue.
- Mike Slegeir
When using PdfContentsTokenizer with a PDF that has an array for Contents rather than a single stream, the tokenizer will reset its position to the beginning of the first stream upon exhausting a stream. A Contents array with contents X Y Z will appear as X X Y X Y Z to a user of the PdfContentsTokenizer. Attached is a PDF which has a Contents array. I can provide example code and output if necessary.
Podofo-users mailing list
[email protected] <mailto:[email protected]>
https://lists.sourceforge.net/lists/listinfo/podofo-users

split-array.pdf
Description: Adobe PDF document
