The PDFBox library is _always_ going to be a problem because of its architecture: it insists on reading the entire PDF document, images included, into memory. That isn't necessary; PDF was explicitly designed so that a renderer can process one page at a time in limited memory. PDFBox could gain a lot by adding a mode that skips images entirely. For text extraction, for example, reading image streams into memory is a complete waste, since no text will ever come out of them.
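For context, this is roughly what a PDFBox-based extraction looks like (a minimal sketch, not DSpace's actual media filter; the class names are from the PDFBox 1.x API and the package layout differs in later versions):

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class PdfBoxTextExtract {
    public static void main(String[] args) throws Exception {
        // load() parses the whole document -- pages, fonts, and embedded
        // image streams -- before any text can be extracted.
        PDDocument doc = PDDocument.load(new File(args[0]));
        try {
            String text = new PDFTextStripper().getText(doc);
            System.out.print(text);
        } finally {
            doc.close();
        }
    }
}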
I took a different approach that may be helpful to sites with a lot of PDF content that is pathological for PDFBox. I wrote a couple of filters that invoke the XPDF utilities as external OS-level command processes to do the dirty work. They are a bit more work to maintain, since they depend on outside programs that have to be installed, but I've found the xpdf tools simple to install and keep up to date. The XPDF-based text extractor is about three times as fast as PDFBox, and the only PDFs it failed on were corrupt. There were also no heap-space problems, since the extraction runs outside the JVM.
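The core of such a filter is just launching pdftotext and capturing its standard output. Here is a minimal sketch of that step (illustrative only, not the actual filter code; it assumes xpdf's pdftotext is installed and on the PATH, and asks for UTF-8 output):

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class XpdfTextExtract {

    // Run xpdf's pdftotext on the given file and return the extracted text.
    // "-" as the output argument tells pdftotext to write to stdout.
    public static String extract(String pdfPath) throws Exception {
        ProcessBuilder pb = new ProcessBuilder(
                "pdftotext", "-enc", "UTF-8", pdfPath, "-");
        pb.redirectErrorStream(true);
        Process p = pb.start();

        StringBuilder out = new StringBuilder();
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(p.getInputStream(), "UTF-8"));
        String line;
        while ((line = reader.readLine()) != null) {
            out.append(line).append('\n');
        }
        reader.close();

        int status = p.waitFor();
        if (status != 0) {
            throw new RuntimeException("pdftotext exited with status " + status);
        }
        return out.toString();
    }
}

The heavy lifting happens in the external process, so the JVM heap only ever holds the extracted text, never the parsed PDF.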
The actual filter code is in patch #2745393: https://sourceforge.net/tracker/?func=detail&aid=2745393&group_id=19984&atid=319984

-- Larry
