[Podofo-users] Reading inline image data in PdfContentsTokenizer is incorrectly implemented

Michal Sudolsky Fri, 06 Sep 2019 09:33:13 -0700

Function PdfContentsTokenizer::ReadInlineImgData in
PdfContentsTokenizer.cpp:


```

        c = m_device.Device()->GetChar();
        if (c=='E' &&  m_device.Device()->Look()=='I')
        {
            // Consume character
            m_device.Device()->GetChar();
            int w = m_device.Device()->Look();
            if (w==EOF || PdfTokenizer::IsWhitespace(w))
            {
                // EI is followed by whitespace => stop

                ...

                m_readingInlineImgData = false;

```


The loop stops as soon as it finds the byte sequence "EI " (EI followed by
whitespace), but for inline image data this is not sufficient, because the
same bytes can also occur inside the image data itself. The size of the
decoded image data has to be calculated from parameters such as width,
height, bits per component and color space. The data then has to be decoded
until either an EOD marker is seen or exactly that number of decoded bytes
has been produced, because not every filter has an EOD marker.
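
A minimal sketch of that size calculation, assuming a device color space
given by name and no image mask (/IM). The helper names are hypothetical and
not part of PoDoFo; color spaces named in the page resources and Indexed
arrays would need extra handling. Each row of samples is padded to a whole
number of bytes.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Components per sample for the color space names handled by this sketch
// (abbreviated and full forms). Hypothetical helper, not a PoDoFo API.
inline std::size_t InlineImageComponents(const std::string& cs)
{
    if (cs == "G" || cs == "DeviceGray") return 1;
    if (cs == "RGB" || cs == "DeviceRGB") return 3;
    if (cs == "CMYK" || cs == "DeviceCMYK") return 4;
    throw std::invalid_argument("unsupported inline image color space: " + cs);
}

// Size in bytes of the decoded sample data of an inline image.
// Every row starts on a byte boundary, so the row size is rounded up.
inline std::size_t InlineImageDecodedSize(std::size_t width,
                                          std::size_t height,
                                          std::size_t bitsPerComponent,
                                          const std::string& cs)
{
    const std::size_t bitsPerRow =
        width * InlineImageComponents(cs) * bitsPerComponent;
    const std::size_t bytesPerRow = (bitsPerRow + 7) / 8;
    return bytesPerRow * height;
}
```

For a 4x4 /RGB image with 8 bits per component this gives 48 bytes.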


As the PDF reference states:
"Entries other than those listed are ignored; in particular, the Type,
Subtype, and Length entries normally found in a stream or image dictionary
are unnecessary"
That is unfortunate, because with "Length" this parsing would be simpler.
Without it, EOD could be used, but since not all filters have one, the
decompressed size of the image data has to be calculated.
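
To illustrate why a known size makes the split unambiguous, here is a rough
sketch for the unfiltered case only (no /F entry). It is not PoDoFo code:
SplitInlineImage and IsPdfWhitespace are hypothetical helpers, and filtered
data would have to be run through the decoder instead of being copied.

```cpp
#include <cstddef>
#include <string>

// The six whitespace characters of PDF syntax.
inline bool IsPdfWhitespace(char c)
{
    return c == '\0' || c == '\t' || c == '\n' ||
           c == '\f' || c == '\r' || c == ' ';
}

// stream:   content stream bytes
// idPos:    offset of the "ID" operator
// dataSize: expected number of image bytes (computed from W, H, BPC, CS)
// On success, data receives the image bytes, eiPos the offset of the
// closing "EI", and true is returned.
inline bool SplitInlineImage(const std::string& stream, std::size_t idPos,
                             std::size_t dataSize,
                             std::string& data, std::size_t& eiPos)
{
    if (idPos > stream.size() || stream.compare(idPos, 2, "ID") != 0)
        return false;
    std::size_t pos = idPos + 2;
    // Exactly one whitespace character separates ID from the data.
    if (pos >= stream.size() || !IsPdfWhitespace(stream[pos]))
        return false;
    ++pos;
    if (stream.size() - pos < dataSize)
        return false;
    // Take the bytes by count, so an "EI " inside the data is harmless.
    data = stream.substr(pos, dataSize);
    pos += dataSize;
    while (pos < stream.size() && IsPdfWhitespace(stream[pos]))
        ++pos;
    if (stream.compare(pos, 2, "EI") != 0)
        return false;
    // EI must be followed by whitespace or the end of the stream.
    if (pos + 2 < stream.size() && !IsPdfWhitespace(stream[pos + 2]))
        return false;
    eiPos = pos;
    return true;
}
```

Because the bytes are taken by count rather than by searching for "EI ",
image data that happens to contain that sequence is returned intact.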

Sample code to demonstrate this bug:

```

PdfMemDocument doc;
PdfStream *stm = doc.CreatePage({0, 0, 100, 100})
                    ->GetContentsForAppending()->GetStream();
stm->BeginAppend(TVecFilters());
stm->Append("100 0 0 100 0 0 cm\n");
stm->Append("BI /W 4 /H 4 /CS /RGB /BPC 8\n");
stm->Append("ID\n");
stm->Append("00000z0z00zzz00z0zzz0zzzEI aazazaazzzaazazzzazzz\n");
stm->Append("EI\n");
stm->EndAppend();
doc.Write("bi.pdf");

PdfMemDocument pdf("bi.pdf");
PdfContentsTokenizer tok(pdf.GetPage(0));
EPdfContentsType type;
const char *key;
PdfVariant var;
while(tok.ReadNext(type, key, var))
{
  switch(type)
  {
    case ePdfContentsType_Keyword: printf("keyword: %s\n", key); break;
    case ePdfContentsType_Variant: printf("variant: %s\n", var.GetDataTypeString()); break;
    case ePdfContentsType_ImageData: printf("image: %s\n", var.GetDataTypeString()); break;
  }
}

```


Partial output:

```

...

keyword: ID
image: RawData
keyword: EI
keyword: aazazaazzzaazazzzazzz
keyword: EI

```


Which should instead be:


```

...
keyword: ID
image: RawData
keyword: EI

```


Resulting pdf file (also attached):

```
%PDF-1.3
%âãÏÓ
1 0 obj<</Type/Catalog/Pages 3 0 R>>
endobj
2 0 obj<</CreationDate(D:20190906183146+02'00')/Producer(PoDoFo - http://podofo.sf.net)>>
endobj
3 0 obj<</Type/Pages/Count 1/Kids[ 4 0 R]>>
endobj
4 0 obj<</Type/Page/Contents 5 0 R/MediaBox[ 0 0 100 100]/Parent 3 0 R/Resources<</ProcSet[/PDF/Text/ImageB/ImageC/ImageI]>>>>
endobj
5 0 obj<</Length 103>>
stream
100 0 0 100 0 0 cm
BI /W 4 /H 4 /CS /RGB /BPC 8
ID
00000z0z00zzz00z0zzz0zzzEI aazazaazzzaazazzzazzz
EI

endstream
endobj
xref
0 6
0000000000 65535 f
0000000015 00000 n
0000000059 00000 n
0000000156 00000 n
0000000207 00000 n
0000000341 00000 n
trailer
<</ID[<D047079C2B662F2617BF6BC31251DAB1><D047079C2B662F2617BF6BC31251DAB1>]/Info 2 0 R/Root 1 0 R/Size 6>>
startxref
492
%%EOF
```

bi.pdf
Description: Adobe PDF document

