Hi,
I have encountered two problems in PdfContentsTokenizer::ReadInlineImgData():
1) Parsing expects a whitespace *before* the EI operator (end of image data)
whereas it should expect a whitespace *after* the EI.
2) Buffer for image data has a fixed size of 4096 bytes.
The patch (against svn rev. 1298) included in this email provides a solution
for both issues.
Some further details:
To 1) Unfortunately, the PDF spec does not clearly define how the EI operator
should be detected in the data following the ID operator. The size of the data
is not specified, and there seems to be no "escaping" mechanism for the case
that the sequence EI occurs in the image data. However, other PDF parsers use
a "heuristic" approach and expect a whitespace *after* the EI operator. See
here for such a discussion:
http://www.planetpdf.com/forumarchive/134376.asp
> Topic: Re: parsing inline images (Via Email)
> Conf: (P-PDF) Developers, Msg: 134376
> From: LeonardR
> Date: 6/13/2005 10:58 PM
>
> At 06:38 PM 6/13/2005, p-pdf-developers Listmanager wrote:
> >The image data contains "EI " where the
> >white space is a space (0x20).
>
> The actual image data, or the encoded version of the data? Are
> you decoding and then looking or grabbing the inline image data till you
> find the "EI" and then decoding?
>
>
> >our parser detects either a space or cr lf.
>
> I've looked at the sources to a few content stream parsers (my
> own, Xpdf, Multivalent, etc.) and they all also support "EI" followed by at
> least one whitespace character (specifically space, CR or LF).
>
Without the patch, PoDoFo expects to find a whitespace *before* the EI
operator and fails to detect the end of image data for some PDFs created by a
common PDF workflow software.
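To illustrate the heuristic outside of PoDoFo, here is a minimal standalone
sketch. It scans an in-memory byte string for "EI" followed by whitespace or
by the end of the input. The helper names IsPdfWhitespace and
FindEndOfInlineImage are made up for this example and are not part of the
PoDoFo API; the real patch works on the input device instead of a string.

```cpp
#include <cstddef>
#include <string>

// The six whitespace characters of the PDF syntax:
// NUL, TAB, LF, FF, CR and SPACE.
static bool IsPdfWhitespace(unsigned char ch)
{
    return ch == 0x00 || ch == 0x09 || ch == 0x0A ||
           ch == 0x0C || ch == 0x0D || ch == 0x20;
}

// Hypothetical helper (not PoDoFo API): returns the offset of the first
// "EI" that is followed by whitespace or by the end of the input, or
// std::string::npos if there is none. Everything before that offset is
// treated as image data.
static std::size_t FindEndOfInlineImage(const std::string& data)
{
    for (std::size_t pos = 0; pos + 1 < data.size(); ++pos)
    {
        if (data[pos] != 'E' || data[pos + 1] != 'I')
            continue;

        const std::size_t after = pos + 2;
        if (after == data.size() ||
            IsPdfWhitespace(static_cast<unsigned char>(data[after])))
        {
            return pos; // EI is followed by whitespace or EOF => stop
        }
        // no whitespace after EI => it is part of the image data
    }
    return std::string::npos;
}
```

Note that this is still only a heuristic: binary image data may contain "EI"
followed by a whitespace byte by chance, and the sketch would stop there.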
To 2) The PDF spec states that inlined images *should* not be larger than 4K.
However, it does not forbid images to be larger. Again, some common PDF outputs
contained inlined images larger than 4K. In that case, PoDoFo should not fail
but rather resize the buffer.
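The resizing idea can be sketched in isolation as follows. This is only an
illustration with a std::vector standing in for PoDoFo's buffer; the names
GrownBuffer and CopyWithGrowingBuffer are hypothetical. As in the patch, the
size is doubled as soon as the buffer is full, so there is always room for
one more byte (and for the terminating '\0').

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type for this sketch (not PoDoFo API).
struct GrownBuffer
{
    std::vector<char> bytes; // storage, possibly larger than 'used'
    std::size_t used;        // number of data bytes actually stored
};

// Copies 'input' byte by byte into a buffer that starts with
// 'initialSize' bytes and doubles whenever it becomes full.
static GrownBuffer CopyWithGrowingBuffer(const std::string& input,
                                         std::size_t initialSize)
{
    GrownBuffer result;
    result.bytes.resize(initialSize > 0 ? initialSize : 1);
    result.used = 0;

    for (char ch : input)
    {
        result.bytes[result.used] = ch;
        ++result.used;

        if (result.used == result.bytes.size())
        {
            // data is larger than buffer => resize buffer
            result.bytes.resize(result.bytes.size() * 2);
        }
    }

    // used < size holds here, so the terminator always fits
    result.bytes[result.used] = '\0';
    return result;
}
```

With an initial size of 4096, an image of 5000 bytes triggers exactly one
resize to 8192 bytes.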
Hopefully, this patch will be helpful for other users, too. Many thanks to all
developers for this great project!
Best regards,
Amin
> Index: podofo-src-r1298/src/PdfContentsTokenizer.cpp
> ===================================================================
> --- podofo-src-r1298/src/PdfContentsTokenizer.cpp (revision 1298)
> +++ podofo-src-r1298/src/PdfContentsTokenizer.cpp (working copy)
> @@ -202,40 +202,43 @@
> PODOFO_RAISE_ERROR( ePdfError_InvalidHandle );
> }
>
> - // cosume the only whitespace between ID and data
> + // consume the only whitespace between ID and data
> c = m_device.Device()->Look();
> if( PdfTokenizer::IsWhitespace( c ) )
> {
> c = m_device.Device()->GetChar();
> }
>
> - while( (c = m_device.Device()->Look()) != EOF
> - && counter < static_cast<long long>(m_buffer.GetSize()) )
> - {
> - if (PdfTokenizer::IsWhitespace(c))
> - {
> - // test if end-of-image-data is reached (hit EI keyword)
> - c = m_device.Device()->GetChar(); // skip the white space
> - char e = m_device.Device()->GetChar();
> - char i = m_device.Device()->GetChar();
> - m_device.Device()->Seek(-2, std::ios::cur);
> - if (e == 'E' && i == 'I')
> - {
> - m_buffer.GetBuffer()[counter] = '\0';
> - rVariant = PdfData(m_buffer.GetBuffer(), static_cast<size_t>(counter));
> - reType = ePdfContentsType_ImageData;
> - m_readingInlineImgData = false;
> - return true;
> - }
> - m_buffer.GetBuffer()[counter] = c;
> - ++counter;
> - }
> - else
> - {
> - c = m_device.Device()->GetChar();
> - m_buffer.GetBuffer()[counter] = c;
> - ++counter;
> - }
> + while((c = m_device.Device()->Look()) != EOF) {
> + c = m_device.Device()->GetChar();
> + if (c=='E' && m_device.Device()->Look()=='I') {
> + char i = m_device.Device()->GetChar();
> + int w = m_device.Device()->Look(); // int, so that EOF compares reliably
> + if (w==EOF || PdfTokenizer::IsWhitespace(w)) {
> + // EI is followed by whitespace => stop
> + m_device.Device()->Seek(-2, std::ios::cur); // put back "EI"
> + m_buffer.GetBuffer()[counter] = '\0';
> + rVariant = PdfData(m_buffer.GetBuffer(), static_cast<size_t>(counter));
> + reType = ePdfContentsType_ImageData;
> + m_readingInlineImgData = false;
> + return true;
> + }
> + else {
> + // no whitespace after EI => do not stop
> + m_device.Device()->Seek(-1, std::ios::cur); // put back "I"
> + m_buffer.GetBuffer()[counter] = c;
> + ++counter;
> + }
> + }
> + else {
> + m_buffer.GetBuffer()[counter] = c;
> + ++counter;
> + }
> +
> + if (counter == static_cast<long long>(m_buffer.GetSize())) {
> + // image is larger than buffer => resize buffer
> + m_buffer.Resize(m_buffer.GetSize()*2);
> + }
> }
> return false;
> }
>
_______________________________________________
Podofo-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/podofo-users