The PDFBox library is _always_ going to be a problem because of its architecture: it insists on reading the entire PDF document, images included, into memory. That isn't necessary; PDF was explicitly designed so that a renderer can process one page at a time in limited memory. PDFBox could gain a lot by adding a mode that skips images entirely. For text extraction, for example, reading image streams into memory is a complete waste, since no text will ever come out of them.
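For context, this is roughly what a PDFBox-based extraction looks like (a minimal sketch, not DSpace's actual media filter; the class names are from the PDFBox 1.x API and the package layout differs in later versions):

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class PdfBoxTextExtract {
    public static void main(String[] args) throws Exception {
        // load() parses the whole document -- pages, fonts, and embedded
        // image streams -- before any text can be extracted.
        PDDocument doc = PDDocument.load(new File(args[0]));
        try {
            String text = new PDFTextStripper().getText(doc);
            System.out.print(text);
        } finally {
            doc.close();
        }
    }
}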
I took a different approach that may be helpful to sites with a lot of PDF content that is pathological for PDFBox. I wrote a couple of filters that invoke the XPDF utilities as external OS-level command processes to do the dirty work. They are a bit more work to maintain, since they depend on outside programs that have to be installed, but I've found the xpdf tools simple to install and keep up to date. The XPDF-based text extractor is about three times as fast as PDFBox, and the only PDFs it failed on were corrupt. There were also no heap-space problems, since the extraction runs outside the JVM.
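The core of such a filter is just launching pdftotext and capturing its standard output. Here is a minimal sketch of that step (illustrative only, not the actual filter code; it assumes xpdf's pdftotext is installed and on the PATH, and asks for UTF-8 output):

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class XpdfTextExtract {

    // Run xpdf's pdftotext on the given file and return the extracted text.
    // "-" as the output argument tells pdftotext to write to stdout.
    public static String extract(String pdfPath) throws Exception {
        ProcessBuilder pb = new ProcessBuilder(
                "pdftotext", "-enc", "UTF-8", pdfPath, "-");
        pb.redirectErrorStream(true);
        Process p = pb.start();

        StringBuilder out = new StringBuilder();
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(p.getInputStream(), "UTF-8"));
        String line;
        while ((line = reader.readLine()) != null) {
            out.append(line).append('\n');
        }
        reader.close();

        int status = p.waitFor();
        if (status != 0) {
            throw new RuntimeException("pdftotext exited with status " + status);
        }
        return out.toString();
    }
}

The heavy lifting happens in the external process, so the JVM heap only ever holds the extracted text, never the parsed PDF.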
The actual filter code is in patch #2745393: https://sourceforge.net/tracker/?func=detail&aid=2745393&group_id=19984&atid=319984

-- Larry
