Hi

A while back I posted about a problem loading a large PDF document into PoDoFo. 
The document in question was fairly unusual (it's a 700 page list of pharmacies 
in North America) but took 15 minutes to load and allocated 800MB of working 
set before throwing an out of memory error.

The problem is due to:

a) the large number of objects (about 450,000) in the document
b) short byte sequences in the source document turning into 40-100 byte
PdfObjects in memory (which turns a 20MB document on disk into 800MB in memory)
        
There's no easy fix without major refactoring, and the document in question is
pretty unusual, so a workaround seems in order. The workaround lets the caller
specify the maximum number of objects to load (an exception is thrown if the
object limit is exceeded when reading the header). If the caller doesn't specify
an object limit, the behaviour is unchanged from previous versions.
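
To make the intended behaviour concrete, here is a minimal self-contained
sketch of the limit from the caller's point of view. It does not use PoDoFo
itself: ParserSketch, CheckObjectCount() and WouldLoad() are hypothetical
stand-ins, and std::out_of_range stands in for the PdfError the real patch
raises. Only the Get/SetMaxObjectCount names and the LONG_MAX default come
from the patch below.

```cpp
#include <cassert>
#include <climits>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for PdfParser: only the object-limit logic is modelled.
class ParserSketch {
public:
    // LONG_MAX means "no limit", which is the default.
    static long GetMaxObjectCount() { return s_nMaxObjects; }

    static void SetMaxObjectCount(long nMaxObjects) {
        assert(nMaxObjects >= 0); // a negative limit would reject every document
        s_nMaxObjects = nMaxObjects;
    }

    // Stands in for the point where the parser knows the declared object
    // count and can reject the document before allocating any objects.
    static void CheckObjectCount(long nNumObjects) {
        if (nNumObjects > s_nMaxObjects)
            throw std::out_of_range("object count " + std::to_string(nNumObjects) +
                                    " exceeds limit " + std::to_string(s_nMaxObjects));
    }

private:
    static long s_nMaxObjects;
};

long ParserSketch::s_nMaxObjects = LONG_MAX;

// Caller-side helper: true if a document declaring nNumObjects objects
// would be accepted under the current limit.
bool WouldLoad(long nNumObjects) {
    try {
        ParserSketch::CheckObjectCount(nNumObjects);
        return true;
    } catch (const std::out_of_range&) {
        return false;
    }
}
```

With the default limit a 450,000 object document is accepted; after
SetMaxObjectCount(100000) the same document is rejected immediately instead of
spending minutes in the loader.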

PdfParser.h

.370 added

    /**
     * \return maximum object count to read (default is LONG_MAX
     *         which means no limit)
     */
    inline static long GetMaxObjectCount();

    /**
     * Specify the maximum number of objects the parser should
     * read. An exception is thrown if document contains more
     * objects than this. Use to avoid problems with very large
     * documents with millions of objects, which use 500MB of
     * working set and spend 15 mins in Load() before throwing
     * an out of memory exception.
     *
     * \param nMaxObjects set max number of objects
     */
    inline static void SetMaxObjectCount( long nMaxObjects );

.538 added
    static long   s_nMaxObjects;

.641 added
// -----------------------------------------------------
// 
// -----------------------------------------------------
long PdfParser::GetMaxObjectCount()
{
    return s_nMaxObjects;
}

// -----------------------------------------------------
// 
// -----------------------------------------------------
void PdfParser::SetMaxObjectCount( long nMaxObjects )
{
    s_nMaxObjects = nMaxObjects;
}

PdfParser.cpp

.51 added
long PdfParser::s_nMaxObjects = LONG_MAX;

.293 added
    // allow caller to specify a max object count to avoid very slow load times
    // on large documents
    if (s_nMaxObjects != LONG_MAX && m_nNumObjects > s_nMaxObjects)
        PODOFO_RAISE_ERROR_INFO( ePdfError_ValueOutOfRange,
            "m_nNumObjects is greater than s_nMaxObjects." );
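
Note that s_nMaxObjects is static, so a limit set by one caller applies to
every parser in the process until it is changed again. One way a caller might
cope with that is a small scope guard that restores the previous value. This
is only a sketch and not part of the patch: ScopedMaxObjectCount is a
hypothetical helper, and the sketch namespace stands in for the two PdfParser
accessors so the example compiles on its own.

```cpp
#include <cassert>
#include <climits>

// Hypothetical stand-in for the static accessors added to PdfParser.
namespace sketch {
    long g_nMaxObjects = LONG_MAX;
    long GetMaxObjectCount() { return g_nMaxObjects; }
    void SetMaxObjectCount(long nMaxObjects) { g_nMaxObjects = nMaxObjects; }
}

// Sets a temporary object limit and restores the previous one on scope exit,
// including when an exception propagates out of the scope.
class ScopedMaxObjectCount {
public:
    explicit ScopedMaxObjectCount(long nMaxObjects)
        : m_nPrevious(sketch::GetMaxObjectCount())
    {
        assert(nMaxObjects >= 0);
        sketch::SetMaxObjectCount(nMaxObjects);
    }

    ~ScopedMaxObjectCount() { sketch::SetMaxObjectCount(m_nPrevious); }

    // Non-copyable: two guards restoring the same value would be confusing.
    ScopedMaxObjectCount(const ScopedMaxObjectCount&) = delete;
    ScopedMaxObjectCount& operator=(const ScopedMaxObjectCount&) = delete;

private:
    long m_nPrevious; // limit in force before this guard was created
};
```

A caller would create the guard just before loading an untrusted document and
let it go out of scope afterwards. It is not thread safe, since the underlying
value is a plain static.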

Best Regards
Mark

Mark Rogers - mark.rog...@powermapper.com
PowerMapper Software Ltd - www.powermapper.com 
Registered in Scotland No 362274 Quartermile 2 Edinburgh EH3 9GL 



_______________________________________________
Podofo-users mailing list
Podofo-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/podofo-users
