And that is correct - content streams are treated as logically concatenated. (And I've seen them split in the middle of an operation!)

Leonard
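To make that concrete, here are the operator fragments from the split-array sample quoted further down the thread (with the stray BT removed, as in Mike's updated attachment), joined the way a conforming reader has to treat them. This is only a trivial sketch; nothing beyond the operator text comes from the sample:

#include <string>

// The tail of the first content stream and the body of the second, as shown
// in the sample further down: the TJ array is opened in one stream and only
// closed in the next, so neither piece is parseable on its own.
static const char* kFirstStreamTail = "0 Tw [(Hello ) ";  // '[' opened, never closed here
static const char* kSecondStream    = "10(World!)]TJ ET"; // ']' and the TJ operator arrive here

// What a consumer must effectively parse: the logical concatenation.
std::string JoinedContents()
{
    return std::string( kFirstStreamTail ) + kSecondStream;
}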
-----Original Message-----
From: Mike Slegeir [mailto:[email protected]]
Sent: Thursday, August 27, 2009 1:46 PM
To: A. Massad
Cc: podofo-users
Subject: Re: [Podofo-users] PdfContentsTokenizer position is reset with multiple streams

I'm Cc-ing the mailing list on this one because I'm not sure how best to
implement carrying the fragment over between the streams; hopefully someone
else will have an idea.

I'm attaching the same PDF with the BT in the second stream removed. This is
how the PDF that inspired this issue report does things: it's as if the
streams were split at arbitrary points, expecting them to simply be
concatenated before they're rendered.

I agree that the first solution is better, but it certainly comes with
strange implications for the user (they'd have to hold onto the fragment
themselves to pass to the next tokenizer). Perhaps the fragment could be
passed up in the exception and then accepted as an argument (defaulting to
NULL) by PdfTokenizer/PdfContentsTokenizer.

- Mike Slegeir

A. Massad wrote:
> Hi Mike,
>
>> 3) No, I'm not really sure. I based the concept off of a PDF I found in
>> the wild. The BT in the second stream was a mistake on my part; the
>> original PDF which I based this off of doesn't have a BT in its second
>> stream.
>
> Could you prepare a PDF without a BT in the second stream? Then I would
> like to test it with the Adobe tools. Acrobat, for example, has syntax
> validation. That way we could find out whether Adobe accepts the split
> array as valid PDF.
>
>> 4) Good question. I'm not really sure how that should be handled. I'm
>> doing something similar to you in my app, but rather than preserve all
>> the streams, I just replace them with one concatenated stream. I'd like
>> to be able to preserve the structure as well, but I really don't know how
>> that would be possible in such a case without some weird hacks to the
>> tokenizer.
>
> I agree, the structure of the streams should be maintained as well as
> possible. I do not trust output that reorders everything into one single
> stream.
>
>> My suggestion would be for the UnexpectedEOF exception to hold onto the
>> state of the tokenizer, which could be fed to the next stream so that the
>> streams would only be disturbed to the extent that the interrupted
>> structure moves into the second stream as a whole - but I'm not sure how
>> feasible that would be. Another possibility is to use the other kind of
>> PdfContentsTokenizer (entire page contents) and have it somehow indicate
>> that it has moved to another stream (either by throwing an exception or
>> by some other means).
>
> OK, I prefer your first suggestion. I'm thinking of a strategy as follows:
> 1. Find a way to remember the "last good" position in the stream, i.e. up
>    to where it has already been parsed without errors.
> 2. Upon "UnexpectedEOF", store the remainder of the incomplete stream in a
>    string - i.e. the substring of the stream from lastpos+1 to the end of
>    the stream.
> 3. Before parsing the next stream, simply prepend the remainder to it and
>    start parsing.
>
> To implement this, step 1 requires a means of determining the position of
> the tokenizer in the stream. Any ideas? Possibly it's already in the source.
>
> Greetings,
> Amin
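For what it's worth, a minimal sketch of steps 1-3 above, combined with Mike's idea of carrying the fragment forward. The parse callback and the "last good position" it reports are hypothetical - PoDoFo does not expose the tokenizer's position today, which is exactly the open question in step 1 - so this illustrates only the control flow, not a real API:

#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Thrown by the (hypothetical) per-stream parser when the buffer ends in the
// middle of an object; lastGoodPos is the offset just past the last object
// that was parsed completely.
struct UnexpectedEof : std::runtime_error
{
    std::size_t lastGoodPos;
    explicit UnexpectedEof( std::size_t pos )
        : std::runtime_error( "unexpected EOF in content stream" ), lastGoodPos( pos ) {}
};

// parse is any callable that tokenizes one buffer (for example by feeding it
// to a PdfContentsTokenizer built from buffer and length) and throws
// UnexpectedEof as above when the buffer ends mid-object.
template <typename ParseFn>
void ParseStreamsWithCarry( const std::vector<std::string>& streams, ParseFn parse )
{
    std::string fragment;                       // unparsed tail of the previous stream

    for( std::vector<std::string>::const_iterator it = streams.begin();
         it != streams.end(); ++it )
    {
        std::string buffer = fragment + *it;    // step 3: prepend the remainder
        fragment.clear();

        try
        {
            parse( buffer.c_str(), buffer.size() );
        }
        catch( const UnexpectedEof& eof )
        {
            // steps 1 + 2: keep everything after the last good position
            fragment = buffer.substr( eof.lastGoodPos );
        }
    }

    // A non-empty fragment here means even the final stream was truncated.
}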
>
>> - Mike Slegeir
>>
>> A. Massad wrote:
>>> Hi Mike,
>>>
>>> Let's briefly switch to personal email (feel free to forward to the
>>> list if you think it is of interest for all):
>>>
>>> 1) Yes, in my application I use the constructor
>>> PoDoFo::PdfContentsTokenizer(char *, pdf_long), as you assumed. So it
>>> does not seem to be affected by your patch (although I haven't tested
>>> it yet).
>>>
>>> 2) You are right, my application fails to parse your sample
>>> "split-array.pdf". It throws an exception at the end of the stream with
>>> the split array and cannot reassemble the array when reading the next
>>> stream. In fact, the parser mistakes the closing square bracket "]" for
>>> a keyword.
>>>
>>> 3) Are you sure that your sample is valid PDF? It displays correctly in
>>> Mac OS X Preview.app, but looks strange with Adobe Acroread and Adobe
>>> Acrobat (v. 9.1.2). Maybe split arrays are allowed by the spec - but the
>>> injected BT before closing the array looks suspicious to me.
>>>
>>>> 0 Tw
>>>> *[(Hello )*
>>>> endstream
>>>> endobj
>>>>
>>>> 6 0 obj
>>>> 81
>>>> endobj
>>>>
>>>> %% Contents for page 1
>>>> 7 0 obj
>>>> <<
>>>> /Length 8 0 R
>>>> >>
>>>> stream
>>>> *BT*
>>>> *10(World!)]TJ*
>>>> ET
>>>
>>> 4) If this splitting is in accordance with the PDF spec, how can we fix
>>> the tokenizer to work correctly for the second constructor, too - i.e.
>>> PoDoFo::PdfContentsTokenizer(char *, pdf_long) - without losing stream
>>> boundaries?
>>>
>>> Thank you for your efforts. Great work!
>>>
>>> Greetings,
>>> Amin
>>>
>>> On 26.08.2009, at 20:01, Mike Slegeir wrote:
>>>
>>>> Hey Amin,
>>>>
>>>> Good point, but I don't think the change I'm suggesting would affect
>>>> your usage. I assume that in order to handle streams one at a time,
>>>> you're using the PdfContentsTokenizer which accepts a const char* and a
>>>> length, rather than the PdfCanvas* constructor. If that's not the case,
>>>> I'm not sure how you're able to detect stream boundaries as is. My
>>>> suggestion is just to move the code at the top of
>>>> PdfContentsTokenizer::ReadNext (the if(!gotToken) block) into a virtual
>>>> PdfContentsTokenizer::GetNextToken method. If you're using the first
>>>> constructor, m_lstContents will be empty and PdfContentsTokenizer will
>>>> behave as before. Otherwise, if you construct with a PdfCanvas*, the
>>>> stream transitions will be seamless (as they have been), but it will
>>>> now behave correctly when an object is split across Content streams.
>>>>
>>>> I'm also curious whether your application could handle the PDF that I
>>>> previously posted where an array is split across the streams.
>>>> Unfortunately, though, I don't think it would work, nor that it's
>>>> really fixable in that case: you'd just have to use the PdfCanvas*
>>>> PdfContentsTokenizer, which could be fixed by my suggested change.
>>>>
>>>> - Mike Slegeir
>>>>
>>>> A. Massad wrote:
>>>>> Hi Mike,
>>>>>
>>>>> If you change the behavior of PdfContentsTokenizer::GetNextToken() to
>>>>> span across streams, could you please provide a flag to toggle this
>>>>> behavior? For some users (like me) it might be important to be able to
>>>>> switch back to the "old" behavior which does NOT span across streams.
>>>>>
>>>>> I have an application which parses through the streams and replaces
>>>>> the content of each single stream without changing the overall
>>>>> structure of the streams. I think this wouldn't be possible any longer
>>>>> if PdfContentsTokenizer::GetNextToken() no longer detected stream
>>>>> boundaries.
>>>>>
>>>>> Thanks in advance!
>>>>>
>>>>> Greetings,
>>>>> Amin
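As a rough mock-up of the change Mike describes above - a virtual GetNextToken() that the multi-stream tokenizer overrides to switch to the next stream and retry - here is the general shape. The classes below are simplified stand-ins, not the real PdfTokenizer/PdfContentsTokenizer (the real method reads PDF tokens, not whitespace-separated words, and the actual signatures differ), but the override pattern is the point:

#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-in for PdfTokenizer: tokenizes a single buffer and exposes
// a *virtual* GetNextToken(), which is the crux of the proposed change.
class Tokenizer
{
public:
    Tokenizer( const char* data, std::size_t len ) : m_data( data ), m_len( len ), m_pos( 0 ) {}
    virtual ~Tokenizer() {}

    // Returns false when the current buffer is exhausted.
    virtual bool GetNextToken( std::string& token )
    {
        SkipWhitespace();
        if( m_pos >= m_len )
            return false;

        std::size_t start = m_pos;
        while( m_pos < m_len && !IsWhitespace( m_data[m_pos] ) )
            ++m_pos;

        token.assign( m_data + start, m_pos - start );
        return true;
    }

protected:
    void SetBuffer( const char* data, std::size_t len ) { m_data = data; m_len = len; m_pos = 0; }
    void SkipWhitespace() { while( m_pos < m_len && IsWhitespace( m_data[m_pos] ) ) ++m_pos; }
    static bool IsWhitespace( char c ) { return c == ' ' || c == '\n' || c == '\r' || c == '\t'; }

private:
    const char* m_data;
    std::size_t m_len;
    std::size_t m_pos;
};

// Simplified stand-in for the PdfCanvas-style tokenizer: the stream-switching
// logic lives in the GetNextToken() override, so running out of data in one
// stream silently continues in the next.
class MultiStreamTokenizer : public Tokenizer
{
public:
    // Assumes at least one stream.
    explicit MultiStreamTokenizer( const std::vector<std::string>& streams )
        : Tokenizer( streams.front().data(), streams.front().size() ),
          m_streams( streams ), m_current( 0 ) {}

    virtual bool GetNextToken( std::string& token )
    {
        while( !Tokenizer::GetNextToken( token ) )
        {
            if( ++m_current >= m_streams.size() )
                return false;                     // no more streams: a real EOF
            SetBuffer( m_streams[m_current].data(), m_streams[m_current].size() );
        }
        return true;
    }

private:
    const std::vector<std::string>& m_streams;
    std::size_t                     m_current;
};

With this shape, a tokenizer constructed from a single buffer never switches streams (the current behavior Amin relies on), while the multi-stream variant carries token reading across stream boundaries.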
>>>>> On 26.08.2009, at 17:17, Mike Slegeir wrote:
>>>>>
>>>>>> I've discovered another related issue: PdfTokenizer is unable to
>>>>>> reach into the next content stream in order to get a token, so any
>>>>>> object which is split across Contents raises an UnexpectedEOF. My
>>>>>> suggested solution is either to concatenate all the Content streams
>>>>>> before doing any tokenization, or to make PdfTokenizer::GetNextToken
>>>>>> virtual and move the stream-switching logic into
>>>>>> PdfContentsTokenizer::GetNextToken, such that it tries the parent's
>>>>>> version, attempts to move to the next stream (if one exists) on
>>>>>> failure, and then retries. Attached is a very basic example of an
>>>>>> array split between two streams.
>>>>>>
>>>>>> - Mike Slegeir
>>>>>>
>>>>>> Mike Slegeir wrote:
>>>>>>
>>>>>>> I've resolved this issue in an admittedly hacky way. This may be
>>>>>>> sufficient for this problem, though. Attached is a patch which fixes
>>>>>>> the issue. I've only done limited testing, but it does at least
>>>>>>> correct the behavior.
>>>>>>>
>>>>>>> - Mike Slegeir
>>>>>>>
>>>>>>>> When using PdfContentsTokenizer with a PDF that has an array for
>>>>>>>> Contents rather than a single stream, the tokenizer will reset its
>>>>>>>> position to the beginning of the first stream upon exhausting a
>>>>>>>> stream. A Contents array with contents X Y Z will appear as
>>>>>>>> X X Y X Y Z to a user of the PdfContentsTokenizer. Attached is a
>>>>>>>> PDF which has a Contents array. I can provide example code and
>>>>>>>> output if necessary.
>>>>>>
>>>>>> <split-array.pdf>
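For completeness, a sketch of the "concatenate everything first" workaround that both Amin and Mike mention, using the (const char*, pdf_long) constructor discussed above. How the decoded stream bodies are obtained is left to the caller, and the ReadNext() signature below is written from memory of the PoDoFo examples, so treat the details as assumptions rather than a tested recipe:

#include <string>
#include <vector>

#include <podofo/podofo.h>

using namespace PoDoFo;

// decodedStreams: the already-decompressed bodies of the page's /Contents
// streams, in document order. Joining them first means an object that is
// split across stream boundaries (like the TJ array in split-array.pdf)
// parses normally.
void TokenizeConcatenated( const std::vector<std::string>& decodedStreams )
{
    std::string buffer;
    for( std::vector<std::string>::const_iterator it = decodedStreams.begin();
         it != decodedStreams.end(); ++it )
    {
        // Raw join: the files discussed in this thread are split even
        // mid-object, so no separator is inserted at the boundaries.
        buffer += *it;
    }

    if( buffer.empty() )
        return;

    PdfContentsTokenizer tokenizer( &buffer[0], static_cast<pdf_long>( buffer.size() ) );

    EPdfContentsType eType;
    const char*      pszKeyword = NULL;
    PdfVariant       var;

    while( tokenizer.ReadNext( eType, pszKeyword, var ) )
    {
        if( eType == ePdfContentsType_Keyword )
        {
            // pszKeyword is the operator, e.g. "Tf", "TJ", "ET"
        }
        else
        {
            // var is an operand for the next operator
        }
    }
}

The obvious trade-off, as noted above, is that the original stream boundaries are lost, which is exactly what Amin's application needs to preserve.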
