Hello,
when I tried to use PoDoFo to parse large files (e.g. the PDF spec),
PdfMemDocument took forever to load. For example, when I ran
podofocountpages on the PDF spec, it still had not finished after
approx. ten minutes.
After looking into it, I found that most of the time
is spent in
PdfObjectStreamParserObject::ReadObjectsFromStream
in the while loop.
First, on line 98 (PdfObjectStreamParserObject.cpp),
GetObject is called. If the m_vecObjects vector is not sorted, the
method first calls sort to sort it.
Then, on line 102, a new object is appended to the vector with push_back,
which invalidates the sorting, so when GetObject is called again
it has to sort the vector again. This is quite time consuming.
I tried replacing the push_back on line 102 with a new
method, insert_sorted, which inserts the element at the position that
keeps the vector sorted (if it was sorted in the first place).
This led to a dramatic improvement: podofocountpages now
finishes in a reasonable amount of time:
$ time podofocountpages PDF32000_2008.pdf
PDF32000_2008.pdf: 756
real 0m20.235s
user 0m16.809s
sys 0m0.260s
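To illustrate the idea independently of PoDoFo, here is a minimal sketch of the two insertion strategies. Everything in it (Ref, RefLess, RefVec, simulate) is a hypothetical stand-in made up for this example, not PoDoFo code; it only assumes a container that sorts itself lazily on lookup, as described above.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an object reference: object and generation number.
struct Ref {
    long obj;
    int gen;
};

// Orders references by object number, then by generation number.
inline bool RefLess(const Ref& a, const Ref& b) {
    return a.obj != b.obj ? a.obj < b.obj : a.gen < b.gen;
}

// Hypothetical miniature of a vector that is sorted lazily on lookup.
class RefVec {
public:
    // Plain append: the vector is treated as unsorted afterwards.
    void push_back(const Ref& r) {
        m_vec.push_back(r);
        m_sorted = false;
    }

    // Insert at the position that keeps a sorted vector sorted.
    // Falls back to a plain append if the vector is not sorted.
    void insert_sorted(const Ref& r) {
        if (m_sorted) {
            m_vec.insert(std::lower_bound(m_vec.begin(), m_vec.end(), r, RefLess), r);
        } else {
            m_vec.push_back(r);
        }
    }

    // Lookup: sorts first if needed, then does a binary search.
    bool contains(const Ref& r) {
        if (!m_sorted) {
            std::sort(m_vec.begin(), m_vec.end(), RefLess);
            m_sorted = true;
            ++m_sortCalls;
        }
        return std::binary_search(m_vec.begin(), m_vec.end(), r, RefLess);
    }

    bool is_sorted() const { return std::is_sorted(m_vec.begin(), m_vec.end(), RefLess); }
    std::size_t size() const { return m_vec.size(); }
    std::size_t sort_calls() const { return m_sortCalls; }

private:
    std::vector<Ref> m_vec;
    bool m_sorted = true;         // an empty vector is trivially sorted
    std::size_t m_sortCalls = 0;  // number of full sorts triggered by lookups
};

// Mimics the lookup-then-insert loop for n objects and returns
// how many full sorts the lookups triggered.
inline std::size_t simulate(std::size_t n, bool useInsertSorted) {
    RefVec v;
    for (std::size_t i = 0; i < n; ++i) {
        Ref r{static_cast<long>(i), 0};
        (void)v.contains(r);  // lookup before insertion, as in the loop described above
        if (useInsertSorted) {
            v.insert_sorted(r);
        } else {
            v.push_back(r);
        }
    }
    return v.sort_calls();
}
```

In this sketch, simulate(n, false) triggers a full sort on every iteration after the first, while simulate(n, true) never sorts at all. Each sorted insert still costs a binary search plus shifting the elements behind the insertion point, which is cheap when objects arrive in roughly ascending order.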
Attached is my proposed patch. What do you think about it?
Best,
Jonathan Verner
Index: src/base/PdfObjectStreamParserObject.cpp
===================================================================
--- src/base/PdfObjectStreamParserObject.cpp (revision 1456)
+++ src/base/PdfObjectStreamParserObject.cpp (working copy)
@@ -99,7 +99,7 @@
PdfError::LogMessage( eLogSeverity_Warning, "Object: %li 0 R will be deleted and loaded again.\n", lObj );
delete m_vecObjects->RemoveObject(PdfReference( static_cast<int>(lObj), 0LL ),false);
}
- m_vecObjects->push_back( new PdfObject( PdfReference( static_cast<int>(lObj), 0LL ), var ) );
+ m_vecObjects->insert_sorted( new PdfObject( PdfReference( static_cast<int>(lObj), 0LL ), var ) );
}
// move back to the position inside of the table of contents
Index: src/base/PdfVecObjects.h
===================================================================
--- src/base/PdfVecObjects.h (revision 1456)
+++ src/base/PdfVecObjects.h (working copy)
@@ -273,6 +273,19 @@
*/
void push_back( PdfObject* pObj );
+ /** Insert an object into this vector so that
+ * the vector remains sorted w.r.t.
+ * the ordering based on object and generation numbers
+ * m_bObjectCount will be increased for the object.
+ *
+ * Note: Assumes the vector is sorted, otherwise
+ * equivalent to push_back
+ *
+ * \param pObj pointer to the object you want to insert
+ */
+ void insert_sorted( PdfObject *pObj );
+
+
/**
* Sort the objects in the vector based on their object and generation numbers
*/
Index: src/base/PdfVecObjects.cpp
===================================================================
--- src/base/PdfVecObjects.cpp (revision 1456)
+++ src/base/PdfVecObjects.cpp (working copy)
@@ -277,6 +277,18 @@
m_vector.push_back( pObj );
}
+void PdfVecObjects::insert_sorted( PdfObject* pObj )
+{
+ SetObjectCount( pObj->Reference() );
+ pObj->SetOwner( this );
+
+ if ( m_bSorted ) {
+ TVecObjects::iterator i_pos = std::lower_bound(m_vector.begin(),m_vector.end(),pObj,ObjectLittle);
+ m_vector.insert(i_pos, pObj );
+ } else m_vector.push_back( pObj );
+
+}
+
void PdfVecObjects::RenumberObjects( PdfObject* pTrailer, TPdfReferenceSet* pNotDelete, bool bDoGarbageCollection )
{
TVecReferencePointerList list;