[Podofo-users] Some patches for podofo.

Xavier Trochu Thu, 10 Mar 2011 01:55:57 -0800

Hello everyone,

I have recently been using podofo for a tool that we use internally. The
idea of the tool is to parse the first page of a PDF to find its
"bounding box" of text so that it can add a line of text at the bottom
of the page without writing over something else.


Because our PDFs come from various sources, this has been, I think, an
important test of the PdfParser implementation, and it has uncovered a
few bugs. I'm happy to share my fixes here.

The first bug has already been described in PoDoFo's Mantis tracker:
http://sourceforge.net/apps/mantisbt/podofo/view.php?id=5

When the content of a page is composed of multiple streams, each
preceding stream is re-parsed whenever a new stream is appended. The
patch attached to that bug report fixes this by correctly keeping the
current offset in the tokenizer. I have chosen another solution: rather
than concatenating everything into one buffer that would contain the
whole stream, the tokenizer simply forgets the already parsed content
and starts over from a new buffer for each stream part.

There is another issue related to multiple streams. We had an example of
a PDF where the streams were split inside an expression. The
implementation of GetNextToken in PdfTokenizer is not aware of multiple
streams and thus returns an error at the end of a stream instead of
moving on to parse the next one.

The patch multiple-streams.patch attached to this message fixes both
these issues.
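
To illustrate the idea outside of PoDoFo, here is a minimal sketch. It
is not the code from the patch: MultiStreamTokenizer and its members are
hypothetical names, each std::string stands in for one decoded content
stream, tokens are simply whitespace-delimited, and it assumes that a
stream boundary falls between two tokens rather than inside one.

```cpp
#include <cctype>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

// Hypothetical stand-in for a tokenizer over a page's content streams.
// Each std::string plays the role of one decoded content stream.
class MultiStreamTokenizer {
public:
    explicit MultiStreamTokenizer(std::deque<std::string> streams)
        : m_pending(std::move(streams)) {}

    // Fills `token` with the next whitespace-delimited token and returns
    // true; returns false only once every stream has been exhausted.
    bool GetNextToken(std::string& token) {
        while (!ReadFromCurrent(token)) {
            if (m_pending.empty())
                return false;
            // Forget the already parsed buffer and start over from the
            // next stream instead of appending to one growing buffer.
            m_current = std::move(m_pending.front());
            m_pending.pop_front();
            m_pos = 0;
        }
        return true;
    }

private:
    // Reads one token from the current buffer; false at end of buffer.
    bool ReadFromCurrent(std::string& token) {
        while (m_pos < m_current.size() &&
               std::isspace(static_cast<unsigned char>(m_current[m_pos])))
            ++m_pos;
        if (m_pos >= m_current.size())
            return false;
        const std::size_t begin = m_pos;
        while (m_pos < m_current.size() &&
               !std::isspace(static_cast<unsigned char>(m_current[m_pos])))
            ++m_pos;
        token.assign(m_current, begin, m_pos - begin);
        return true;
    }

    std::deque<std::string> m_pending;  // streams not yet opened
    std::string m_current;              // buffer of the current stream only
    std::size_t m_pos = 0;              // read offset inside m_current
};
```

With the streams "BT /F1 12" and "Tf (x) Tj", a caller that has just
read "12" gets "Tf" from the next call rather than an end-of-stream
error, even though the Tf expression is split across the two streams.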

The next issue is related to objects stored in object streams, for PDFs
that have multiple revisions. If a later revision contains an object
stream that redefines an object that was contained in an object stream
from an earlier revision, then the object that PoDoFo actually ends up
with depends on the respective object numbers of the two object streams.
Let's take an example:

Imagine we have a PDF that has two revisions. Revision 1 contains an
object stream (with id 113) for objects 72, 75, 76 and 77. Revision 2
contains another object stream (with id 140) for objects 72 and 75.
PoDoFo loads objects in the order of their ids. If an object is
contained in an object stream, then the stream is fully loaded at the
time the first object within it is loaded. In our example, this means
that when PoDoFo tries to load object 72 (the first one) it reads stream
140 and initializes objects 72 and 75. But when it tries to load 76, it
loads stream 113, and that overrides 72 and 75 from stream 140 with the
stale versions from revision 1, which is not what we want.

The patch object-streams.patch attached fixes this issue by generating a
list of object ids that are valid for the object stream as it is loaded.
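
The filtering can be sketched as follows. This is a simplified model
and not the patch itself: the ObjectStream, XRef and ObjectTable types
and both function names are hypothetical, object values are plain
strings, and the cross-reference table is assumed to already map every
compressed object to the stream holding its latest revision.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical simplified model of the data involved.
using ObjectStream = std::vector<std::pair<long long, std::string>>; // (object no, value)
using XRef         = std::map<long long, long long>;    // object no -> owning stream no
using ObjectTable  = std::map<long long, std::string>;  // loaded objects

// Collect the object numbers that the xref attributes to `streamNo`.
std::vector<long long> ValidIdsFor(const XRef& xref, long long streamNo) {
    std::vector<long long> ids;
    for (const auto& entry : xref)
        if (entry.second == streamNo)
            ids.push_back(entry.first);
    return ids;
}

// Load a whole object stream, but keep only the objects in `validIds`;
// anything else in the stream is a stale copy from an older revision.
void LoadObjectStream(const ObjectStream& stream,
                      const std::vector<long long>& validIds,
                      ObjectTable& objects) {
    for (const auto& obj : stream) {
        const bool shouldRead =
            std::find(validIds.begin(), validIds.end(), obj.first) != validIds.end();
        if (shouldRead)
            objects[obj.first] = obj.second;
    }
}
```

In the example above, the xref maps 72 and 75 to stream 140 and 76 and
77 to stream 113, so loading stream 113 after stream 140 only adds 76
and 77 and leaves the newer 72 and 75 untouched.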

Finally, I have implemented PdfFontMetricsObject support for
CIDFontType0 and CIDFontType2. I also removed some parts of PdfFontCID
that were breaking font information when trying to instantiate a CID
font from a PDF. These fixes are in cidfont.patch.
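
For readers unfamiliar with CID font widths: the W array of a CID font
mixes two entry forms, "c [w1 ... wn]" and "cfirst clast w", and the DW
entry gives the width of every CID that W does not mention. Below is a
minimal sketch of how the two forms expand into a flat width table. It
is independent of PoDoFo's types, all function names are hypothetical,
and the entries are assumed to be already decoded into ints and doubles.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Grow `table` with the default width so that index `lastCid` is valid.
static void EnsureSize(std::vector<double>& table, std::size_t lastCid,
                       double defaultWidth) {
    if (table.size() <= lastCid)
        table.resize(lastCid + 1, defaultWidth);
}

// Form 1: "c [w1 w2 ... wn]" assigns consecutive widths starting at CID c.
void ApplyWidthList(std::vector<double>& table, int first,
                    const std::vector<double>& widths, double defaultWidth) {
    assert(first >= 0);
    if (widths.empty())
        return;
    const std::size_t start = static_cast<std::size_t>(first);
    EnsureSize(table, start + widths.size() - 1, defaultWidth);
    for (std::size_t i = 0; i < widths.size(); ++i)
        table[start + i] = widths[i];
}

// Form 2: "cfirst clast w" assigns w to every CID in [cfirst, clast].
void ApplyWidthRange(std::vector<double>& table, int first, int last,
                     double width, double defaultWidth) {
    assert(first >= 0 && last >= first);
    // The table must hold index `last`, i.e. at least last + 1 entries.
    EnsureSize(table, static_cast<std::size_t>(last), defaultWidth);
    for (int cid = first; cid <= last; ++cid)
        table[static_cast<std::size_t>(cid)] = width;
}

// CIDs beyond the end of the table fall back to the default width (DW).
double WidthOf(const std::vector<double>& table, std::size_t cid,
               double defaultWidth) {
    return cid < table.size() ? table[cid] : defaultWidth;
}
```

Gaps between two entries are filled with the default width when the
table grows, so a lookup never needs to distinguish "not listed" from
"listed with the default value".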

Greetings, and thanks for this excellent library.

Xavier Trochu
EDP Sciences

Index: src/doc/PdfFontMetricsObject.h
===================================================================
--- src/doc/PdfFontMetricsObject.h	(revision 1430)
+++ src/doc/PdfFontMetricsObject.h	(working copy)
@@ -229,6 +229,7 @@
     double        m_dStrikeOutPosition;

     bool          m_bSymbol;  ///< Internal member to singnal a symbol font
+	double m_dDefWidth; ///< default width
 };

 };
Index: src/doc/PdfFontMetricsObject.cpp
===================================================================
--- src/doc/PdfFontMetricsObject.cpp	(revision 1430)
+++ src/doc/PdfFontMetricsObject.cpp	(working copy)
@@ -31,16 +31,19 @@

 PdfFontMetricsObject::PdfFontMetricsObject( PdfObject* pFont, PdfObject* pDescriptor, const PdfEncoding* const pEncoding )
     : PdfFontMetrics( ePdfFontType_Unknown, "", NULL ),
-      m_pEncoding( pEncoding )
+      m_pEncoding( pEncoding ), m_dDefWidth(0.0)
 {
     if( !pDescriptor )
     {
         PODOFO_RAISE_ERROR( ePdfError_InvalidHandle );
     }

-    m_sName        = pDescriptor->GetDictionary().GetKey( "FontName" )->GetName();
-    m_bbox         = pDescriptor->GetDictionary().GetKey( "FontBBox" )->GetArray();
+	const PdfName & rSubType = pFont->GetDictionary().GetKey( PdfName::KeySubtype )->GetName();
+
     // OC 15.08.2010 BugFix: /FirstChar /LastChar /Widths are in the Font dictionary and not in the FontDescriptor
+	if ( rSubType == PdfName("Type1") || rSubType == PdfName("TrueType") ) {
+		m_sName        = pDescriptor->GetIndirectKey( "FontName" )->GetName();
+		m_bbox         = pDescriptor->GetIndirectKey( "FontBBox" )->GetArray();
     m_nFirst       = static_cast<int>(pFont->GetDictionary().GetKeyAsLong( "FirstChar", 0L ));
     m_nLast        = static_cast<int>(pFont->GetDictionary().GetKeyAsLong( "LastChar", 0L ));
 	 // OC 15.08.2010 BugFix: GetIndirectKey() instead of GetDictionary().GetKey() and "Widths" instead of "Width"
@@ -59,7 +62,60 @@
             m_missingWidth = widths;
         }
     }
+	} else if ( rSubType == PdfName("CIDFontType0") || rSubType == PdfName("CIDFontType2") ) {
+		PdfObject *pObj = pDescriptor->GetIndirectKey( "FontName" );
+		if (pObj) {
+			m_sName = pObj->GetName();
+		}
+		pObj = pDescriptor->GetIndirectKey( "FontBBox" );
+		if (pObj) {
+			m_bbox = pObj->GetArray();
+		}
+		m_nFirst = 0;
+		m_nLast = 0;

+		m_dDefWidth = pFont->GetDictionary().GetKeyAsLong( "DW", 1000L );
+		PdfVariant default_width(m_dDefWidth);
+		PdfObject * pw = pFont->GetIndirectKey( "W" );
+
+		for (int i = m_nFirst; i <= m_nLast; ++i) {
+			m_width.push_back(default_width);
+		}
+		if (pw) {
+			PdfArray w = pw->GetArray();
+			int pos = 0;
+			while (pos < static_cast<int>(w.size())) {
+				int start = static_cast<int>(w[pos++].GetNumber());
+				PODOFO_ASSERT (start >= 0);
+				if (w[pos].IsArray()) {
+					PdfArray widths = w[pos++].GetArray();
+					int length = start + static_cast<int>(widths.size());
+					PODOFO_ASSERT (length >= start);
+					if (length > m_width.size()) {
+						m_width.resize(length, default_width);
+					}
+					for (int i = 0; i < static_cast<int>(widths.size()); ++i) {
+						m_width[start + i] = widths[i];
+					}
+				} else {
+					int end = static_cast<int>(w[pos++].GetNumber());
+					int length = end + 1; // table must hold index 'end', also when start == 0
+					PODOFO_ASSERT (length >= start);
+					if (length > m_width.size()) {
+						m_width.resize(length, default_width);
+					}
+					pdf_int64 width = w[pos++].GetNumber();
+					for (int i = start; i <= end; ++i)
+						m_width[i] = PdfVariant(width);
+				}
+			}
+		}
+		m_nLast = m_width.size() - 1;
+	} else {
+        PODOFO_RAISE_ERROR_INFO( ePdfError_UnsupportedFontFormat, rSubType.GetEscapedName().c_str() );
+	}
+
+
     m_nWeight      = static_cast<unsigned int>(pDescriptor->GetDictionary().GetKeyAsLong( "FontWeight", 400L ));
     m_nItalicAngle = static_cast<int>(pDescriptor->GetDictionary().GetKeyAsLong( "ItalicAngle", 0L ));

@@ -94,7 +150,7 @@

 double PdfFontMetricsObject::CharWidth( unsigned char c ) const
 {
-    if( c >= m_nFirst && c < m_nLast
+    if( c >= m_nFirst && c <= m_nLast
        && c - m_nFirst < m_width.size () )
     {
         double dWidth = m_width[c - m_nFirst].GetReal();
@@ -107,14 +163,21 @@
     if( m_missingWidth != NULL )
         return m_missingWidth->GetReal ();
     else
-        return 0.0;
+        return m_dDefWidth;
 }

Index: src/doc/PdfFontCID.cpp
===================================================================
--- src/doc/PdfFontCID.cpp	(revision 1430)
+++ src/doc/PdfFontCID.cpp	(working copy)
@@ -53,7 +53,7 @@
 PdfFontCID::PdfFontCID( PdfFontMetrics* pMetrics, const PdfEncoding* const pEncoding, PdfObject* pObject, bool bEmbed )
     : PdfFont( pMetrics, pEncoding, pObject )
 {
-    this->Init( bEmbed );
+    /* this->Init( bEmbed ); */
 }

 void PdfFontCID::Init( bool bEmbed )
@@ -224,6 +224,7 @@
         }
     }

+	if (nMax >= nMin) {
     // Now compact the array
     std::ostringstream oss;
     PdfArray array;
@@ -270,18 +271,21 @@
     }

     pFontDict->GetDictionary().AddKey( PdfName("W"), array );
+	}

     free( pdWidth );
 }

 void PdfFontCID::CreateCMap( PdfObject* pUnicode ) const
 {
+    PdfFontMetricsFreetype* pFreetype = dynamic_cast<PdfFontMetricsFreetype*>(m_pMetrics);
+	if (!pFreetype) return;
+
     int  nFirstChar = m_pEncoding->GetFirstChar();
     int  nLastChar  = m_pEncoding->GetLastChar();

     std::ostringstream oss;

-    PdfFontMetricsFreetype* pFreetype = dynamic_cast<PdfFontMetricsFreetype*>(m_pMetrics);
     FT_Face   face = pFreetype->GetFace();
     FT_ULong  charcode;
     FT_UInt   gindex;
Index: src/doc/PdfFontCache.cpp
===================================================================
--- src/doc/PdfFontCache.cpp	(revision 1430)
+++ src/doc/PdfFontCache.cpp	(working copy)
@@ -630,7 +630,7 @@
         sPath = reinterpret_cast<const char*>(v.u.s);
 #ifdef PODOFO_VERBOSE_DEBUG
         PdfError::LogMessage( eLogSeverity_Debug,
-                              "Got Font %s for for %s\n", sPath.c_str(), pszFontname );
+                              "Got Font %s for for %s\n", sPath.c_str(), pszFontName );
 #endif // PODOFO_DEBUG
     }

Index: src/base/PdfArray.h
===================================================================
--- src/base/PdfArray.h	(revision 1430)
+++ src/base/PdfArray.h	(working copy)
@@ -205,6 +205,7 @@
     inline void erase( const iterator& pos );
     inline void erase( const iterator& first, const iterator& last );

+    inline void resize(size_type __n, PdfObject const & = PdfObject());
     inline void reserve(size_type __n);

     /**
@@ -445,6 +446,11 @@
 // -----------------------------------------------------
 //
 // -----------------------------------------------------
+void PdfArray::resize(size_type __n, PdfObject const & o)
+{
+    PdfArrayBaseClass::resize( __n, o );
+}
+
 void PdfArray::reserve(size_type __n)
 {
     PdfArrayBaseClass::reserve( __n );

Index: src/base/PdfContentsTokenizer.cpp
===================================================================
--- src/base/PdfContentsTokenizer.cpp	(revision 1430)
+++ src/base/PdfContentsTokenizer.cpp	(working copy)
@@ -95,12 +95,28 @@

     PdfStream* pStream = pObject->GetStream();

-    PdfBufferOutputStream stream( &m_curBuffer );
+	PdfRefCountedBuffer buffer;
+    PdfBufferOutputStream stream( &buffer );
     pStream->GetFilteredCopy( &stream );

-    m_device = PdfRefCountedInputDevice( m_curBuffer.GetBuffer(), m_curBuffer.GetSize() );
+    m_device = PdfRefCountedInputDevice( buffer.GetBuffer(), buffer.GetSize() );
 }

+bool PdfContentsTokenizer::GetNextToken( const char*& pszToken , EPdfTokenType* peType )
+{
+	bool result = PdfTokenizer::GetNextToken(pszToken, peType);
+	while (!result) {
+		if( !m_lstContents.size() )
+			return false;
+
+		SetCurrentContentsStream( m_lstContents.front() );
+		m_lstContents.pop_front();
+		result = PdfTokenizer::GetNextToken(pszToken, peType);
+	}
+	return result;
+}
+
+
 bool PdfContentsTokenizer::ReadNext( EPdfContentsType& reType, const char*& rpszKeyword, PdfVariant & rVariant )
 {
     if (m_readingInlineImgData)
Index: src/base/PdfContentsTokenizer.h
===================================================================
--- src/base/PdfContentsTokenizer.h	(revision 1430)
+++ src/base/PdfContentsTokenizer.h	(working copy)
@@ -99,6 +99,7 @@
      *
      */
     bool ReadNext( EPdfContentsType& reType, const char*& rpszKeyword, PoDoFo::PdfVariant & rVariant );
+    bool GetNextToken( const char *& pszToken, EPdfTokenType* peType = NULL);

  private:
     /** Set another objects stream as the current stream for parsing
@@ -109,7 +110,6 @@
     bool ReadInlineImgData(EPdfContentsType& reType, const char*& rpszKeyword, PoDoFo::PdfVariant & rVariant);

  private:
-    PdfRefCountedBuffer       m_curBuffer;    ///< A copy of the current contents stream
     std::list<PdfObject*>     m_lstContents;  ///< A list containing pointers to all contents objects
     bool                      m_readingInlineImgData;  ///< A state of reading inline image data
 };
Index: src/base/PdfTokenizer.h
===================================================================
--- src/base/PdfTokenizer.h	(revision 1430)
+++ src/base/PdfTokenizer.h	(working copy)
@@ -77,7 +77,7 @@
      *
      *  \see GetBuffer
      */
-    bool GetNextToken( const char *& pszToken, EPdfTokenType* peType = NULL);
+    virtual bool GetNextToken( const char *& pszToken, EPdfTokenType* peType = NULL);

     /** Reads the next token from the current file position
      *  ignoring all comments and compare the passed token

Index: src/base/PdfParser.cpp
===================================================================
--- src/base/PdfParser.cpp	(revision 1430)
+++ src/base/PdfParser.cpp	(working copy)
@@ -908,6 +908,13 @@
     // Read objects
     for( i=0; i < m_nNumObjects; i++ )
     {
+#ifdef PODOFO_VERBOSE_DEBUG
+		std::cerr << "ReadObjectsInternal\t" << i << " "
+			<< (m_offsets[i].bParsed ? "parsed" : "unparsed") << " "
+			<< m_offsets[i].cUsed << " "
+			<< m_offsets[i].lOffset << " "
+			<< m_offsets[i].lGeneration << std::endl;
+#endif
         if( m_offsets[i].bParsed && m_offsets[i].cUsed == 'n' && m_offsets[i].lOffset > 0 )
         {
             //printf("Reading object %i 0 R from %li\n", i, m_offsets[i].lOffset );
@@ -1079,8 +1086,16 @@
         PODOFO_RAISE_ERROR_INFO( ePdfError_NoObject, oss.str().c_str() );
     }

+	PdfObjectStreamParserObject::ObjectIdList list;
+    for(int i = 0; i < m_nNumObjects; i++ ) {
+        if( m_offsets[i].bParsed && m_offsets[i].cUsed == 's' &&
+			m_offsets[i].lGeneration == nObjNo) {
+				list.push_back(static_cast<long long>(i));
+		}
+	}
+
     PdfObjectStreamParserObject pParserObject( pStream, m_vecObjects, m_buffer, m_pEncrypt );
-    pParserObject.Parse();
+    pParserObject.Parse(list);
 }

 const char* PdfParser::GetPdfVersionString() const
Index: src/base/PdfObjectStreamParserObject.cpp
===================================================================
--- src/base/PdfObjectStreamParserObject.cpp	(revision 1430)
+++ src/base/PdfObjectStreamParserObject.cpp	(working copy)
@@ -27,6 +27,12 @@
 #include "PdfStream.h"
 #include "PdfVecObjects.h"

+#include <algorithm>
+
+#if defined(PODOFO_VERBOSE_DEBUG)
+#include <iostream>
+#endif
+
 namespace PoDoFo {

 PdfObjectStreamParserObject::PdfObjectStreamParserObject(PdfParserObject* pParser, PdfVecObjects* pVecObjects, const PdfRefCountedBuffer & rBuffer, PdfEncrypt* pEncrypt )
@@ -40,7 +46,7 @@

 }

-void PdfObjectStreamParserObject::Parse()
+void PdfObjectStreamParserObject::Parse(ObjectIdList const & list)
 {
     long long lNum   = m_pParser->GetDictionary().GetKeyAsLong( "N", 0 );
     long long lFirst = m_pParser->GetDictionary().GetKeyAsLong( "First", 0 );
@@ -50,20 +56,20 @@
     m_pParser->GetStream()->GetFilteredCopy( &pBuffer, &lBufferLen );

     try {
+        this->ReadObjectsFromStream( pBuffer, lBufferLen, lNum, lFirst, list );
+        free( pBuffer );
+
         // the object stream is not needed anymore in the final PDF
         delete m_vecObjects->RemoveObject( m_pParser->Reference() );
         m_pParser = NULL;

-        this->ReadObjectsFromStream( pBuffer, lBufferLen, lNum, lFirst );
-        free( pBuffer );
-
     } catch( const PdfError & rError ) {
         free( pBuffer );
         throw rError;
     }
 }

-void PdfObjectStreamParserObject::ReadObjectsFromStream( char* pBuffer, pdf_long lBufferLen, long long lNum, long long lFirst )
+void PdfObjectStreamParserObject::ReadObjectsFromStream( char* pBuffer, pdf_long lBufferLen, long long lNum, long long lFirst, ObjectIdList const & list)
 {
     PdfRefCountedInputDevice device( pBuffer, lBufferLen );
     PdfTokenizer             tokenizer( device, m_buffer );
@@ -82,13 +88,19 @@
 		// use a second tokenizer here so that anything that gets dequeued isn't left in the tokenizer that reads the offsets and lengths
 	    PdfTokenizer variantTokenizer( device, m_buffer );
         variantTokenizer.GetNextVariant( var, m_pEncrypt );
-
-        if(m_vecObjects->GetObject(PdfReference( static_cast<int>(lObj), 0LL )))
-        {
+		bool should_read = std::find(list.begin(), list.end(), lObj) != list.end();
+#if defined(PODOFO_VERBOSE_DEBUG)
+    std::cerr << "ReadObjectsFromStream STREAM=" << m_pParser->Reference().ToString() <<
+			", OBJ=" << lObj <<
+			", " << (should_read ? "read" : "skipped") << std::endl;
+#endif
+		if (should_read) {
+			if(m_vecObjects->GetObject(PdfReference( static_cast<int>(lObj), 0LL ))) {
             PdfError::LogMessage( eLogSeverity_Warning, "Object: %li 0 R will be deleted and loaded again.\n", lObj );
             delete m_vecObjects->RemoveObject(PdfReference( static_cast<int>(lObj), 0LL ),false);
         }
         m_vecObjects->push_back( new PdfObject( PdfReference( static_cast<int>(lObj), 0LL ), var ) );
+		}

         // move back to the position inside of the table of contents
         device.Device()->Seek( pos );
Index: src/base/PdfObjectStreamParserObject.h
===================================================================
--- src/base/PdfObjectStreamParserObject.h	(revision 1430)
+++ src/base/PdfObjectStreamParserObject.h	(working copy)
@@ -39,6 +39,7 @@
  */
 class PdfObjectStreamParserObject {
 public:
+	typedef std::vector<long long> ObjectIdList;
     /**
      * Create a new PdfObjectStreamParserObject from an existing
      * PdfParserObject. The PdfParserObject will be removed and deleted.
@@ -53,10 +54,10 @@

     ~PdfObjectStreamParserObject();

-    void Parse();
+    void Parse(ObjectIdList const &);

 private:
-    void ReadObjectsFromStream( char* pBuffer, pdf_long lBufferLen, long long lNum, long long lFirst );
+    void ReadObjectsFromStream( char* pBuffer, pdf_long lBufferLen, long long lNum, long long lFirst, ObjectIdList const &);

 private:
     PdfParserObject* m_pParser;
