Hi Amin,

Thanks a lot for your patch. I tested it and applied it to SVN today.

Cheers,
Dominik
On Tuesday, 31 August 2010, A. Massad wrote:
> Hi,
>
> I have encountered two problems in
> PdfContentsTokenizer::ReadInlineImgData():
>
> 1) Parsing expects a whitespace *before* the EI operator (end of image
> data) whereas it should expect a whitespace *after* the EI.
> 2) Buffer for image data has a fixed size of 4096 bytes.
>
> The patch (against svn rev. 1298) included in this E-Mail provides a
> solution for both issues.
>
> Some further details:
>
> Regarding 1) Unfortunately, the PDF spec does not clearly define how the
> EI operator should be detected in the data following the ID operator. The
> size of the data is not specified, and there seems to be no "escaping"
> mechanism if the sequence EI should occur in the image data. However,
> other PDF parsers take a heuristic approach and expect a
> whitespace *after* the EI operator. See here for such a discussion:
> http://www.planetpdf.com/forumarchive/134376.asp
>
> > Topic: Re: parsing inline images (Via Email)
> > Conf: (P-PDF) Developers, Msg: 134376
> > From: LeonardR
> > Date: 6/13/2005 10:58 PM
> >
> > At 06:38 PM 6/13/2005, p-pdf-developers Listmanager wrote:
> > >The image data contains "EI " where the
> > >white space is a space (0x20).
> >
> > The actual image data, or the encoded version of the data? Are
> > you decoding and then looking or grabbing the inline image data till you
> > find the "EI" and then decoding?
> >
> > >our parser detects either a space or cr lf.
> >
> > I've looked at the sources to a few content stream parsers (my
> > own, Xpdf, Multivalent, etc.) and they all also support "EI" followed by
> > at least one whitespace character (specifically space, CR or LF).
>
> Prior to the patch, PoDoFo expected to find a whitespace *before* the EI
> operator and failed to detect the end of image data for some PDFs created
> by a common PDF workflow software.
>
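
The heuristic described above can be sketched in isolation. This is a minimal illustration, not the PoDoFo code: it works on an in-memory std::string instead of the device, and both IsPdfWhitespace and FindEndOfInlineImage are hypothetical names introduced only for this example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a PoDoFo function): true for the six PDF
// whitespace bytes NUL, HT, LF, FF, CR and SP.
static bool IsPdfWhitespace(unsigned char ch)
{
    return ch == 0x00 || ch == 0x09 || ch == 0x0A ||
           ch == 0x0C || ch == 0x0D || ch == 0x20;
}

// Hypothetical helper: returns the offset of the first "EI" that is
// followed by whitespace or by the end of the input, or std::string::npos
// if there is none. Whatever precedes "EI" is deliberately ignored.
static std::size_t FindEndOfInlineImage(const std::string& data)
{
    for (std::size_t pos = 0; pos + 1 < data.size(); ++pos)
    {
        if (data[pos] != 'E' || data[pos + 1] != 'I')
            continue; // not an "EI" candidate

        const bool atEnd = (pos + 2 == data.size());
        if (atEnd || IsPdfWhitespace(static_cast<unsigned char>(data[pos + 2])))
            return pos; // image data is data[0 .. pos)
    }
    return std::string::npos;
}
```

Being a heuristic, this can still stop too early if the raw image bytes happen to contain "EI" followed by a whitespace byte.
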
> Regarding 2) The PDF spec states that inline images *should* not be larger
> than 4K. However, it does not forbid larger images. Again, some common
> PDF outputs contained inline images larger than 4K. In that case, PoDoFo
> should not fail but rather resize the buffer.
>
> Hopefully, this patch will be helpful for other users, too. Many thanks to
> all developers for this great project!
>
> Best regards,
> Amin
>
> > Index: podofo-src-r1298/src/PdfContentsTokenizer.cpp
> > ===================================================================
> > --- podofo-src-r1298/src/PdfContentsTokenizer.cpp (revision 1298)
> > +++ podofo-src-r1298/src/PdfContentsTokenizer.cpp (working copy)
> > @@ -202,40 +202,43 @@
> > PODOFO_RAISE_ERROR( ePdfError_InvalidHandle );
> > }
> >
> > - // cosume the only whitespace between ID and data
> > + // consume the only whitespace between ID and data
> > c = m_device.Device()->Look();
> > if( PdfTokenizer::IsWhitespace( c ) )
> > {
> > c = m_device.Device()->GetChar();
> > }
> >
> > - while( (c = m_device.Device()->Look()) != EOF
> > - && counter < static_cast<long long>(m_buffer.GetSize()) )
> > - {
> > - if (PdfTokenizer::IsWhitespace(c))
> > - {
> > - // test if end-of-image-data is reached (hit EI keyword)
> > - c = m_device.Device()->GetChar(); // skip the white space
> > - char e = m_device.Device()->GetChar();
> > - char i = m_device.Device()->GetChar();
> > - m_device.Device()->Seek(-2, std::ios::cur);
> > - if (e == 'E' && i == 'I')
> > - {
> > - m_buffer.GetBuffer()[counter] = '\0';
> > - rVariant = PdfData(m_buffer.GetBuffer(), static_cast<size_t>(counter));
> > - reType = ePdfContentsType_ImageData;
> > - m_readingInlineImgData = false;
> > - return true;
> > - }
> > - m_buffer.GetBuffer()[counter] = c;
> > - ++counter;
> > - }
> > - else
> > - {
> > - c = m_device.Device()->GetChar();
> > - m_buffer.GetBuffer()[counter] = c;
> > - ++counter;
> > - }
> > + while((c = m_device.Device()->Look()) != EOF) {
> > + c = m_device.Device()->GetChar();
> > + if (c=='E' && m_device.Device()->Look()=='I') {
> > + char i = m_device.Device()->GetChar();
> > + char w = m_device.Device()->Look();
> > + if (w==EOF || PdfTokenizer::IsWhitespace(w)) {
> > + // EI is followed by whitespace => stop
> > + m_device.Device()->Seek(-2, std::ios::cur); // put back "EI"
> > + m_buffer.GetBuffer()[counter] = '\0';
> > + rVariant = PdfData(m_buffer.GetBuffer(), static_cast<size_t>(counter));
> > + reType = ePdfContentsType_ImageData;
> > + m_readingInlineImgData = false;
> > + return true;
> > + }
> > + else {
> > + // no whitespace after EI => do not stop
> > + m_device.Device()->Seek(-1, std::ios::cur); // put back "I"
> > + m_buffer.GetBuffer()[counter] = c;
> > + ++counter;
> > + }
> > + }
> > + else {
> > + m_buffer.GetBuffer()[counter] = c;
> > + ++counter;
> > + }
> > +
> > + if (counter == static_cast<long long>(m_buffer.GetSize())) {
> > + // image is larger than buffer => resize buffer
> > + m_buffer.Resize(m_buffer.GetSize()*2);
> > + }
> > }
> > return false;
> > }
>
--
**********************************************************************
Dominik Seichter - [email protected]
KRename - http://www.krename.net - Powerful batch renamer for KDE
KBarcode - http://www.kbarcode.net - Barcode and label printing
PoDoFo - http://podofo.sf.net - PDF generation and parsing library
SchafKopf - http://schafkopf.berlios.de - Schafkopf, a card game, for KDE
Alan - http://alan.sf.net - A Turing Machine in Java
**********************************************************************
