Nice work Larry,

I've replaced our PDF text extraction and thumbnail generation with this code.

Thankfully, running on Debian, adding the third party tools was as hard as 
"apt-get install xpdf" ;)

I actually ran into a few more difficulties with the ImageIO libraries - it's a 
pity that you don't get a simple ClassNotFoundException to be able to report 
this more clearly.

But aside from that, my limited tests seem to work quite well.

G 

-----Original Message-----
From: Larry Stone [mailto:[email protected]] 
Sent: 08 April 2009 22:21
To: Tim Donohue
Cc: DSpace Tech; Jeffrey Trimble
Subject: Re: [Dspace-tech] Java Heap dumps during Filter-Media

The PDFBox library is _always_ going to be a problem because of its 
architecture.  It insists on reading the entire PDF document, images included, 
into memory.  This is not necessary: PDF was explicitly designed to let 
renderers process a page at a time in limited memory.
Perhaps it could gain a lot by adding a "mode" where it ignores images (e.g. 
for text extraction, it is a complete waste of time to even read them into 
memory since it won't be getting any text out of them).

I took a different approach that may be helpful to sites with a lot of PDF 
content that is pathological to PDFBox.  I wrote a couple of filters that 
invoke the XPDF utilities as external OS-level command processes to do the 
dirty work.  They are a bit more complicated to maintain since they rely on 
outside programs that have to be installed, but I've found the xpdf tools to be 
simple to install and maintain.
The XPDF-based text extractor is about three times as fast as PDFBox, and the 
only inputs it failed on were corrupt PDFs.  There were also no issues with 
heap space since it runs outside of the JVM.
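
To illustrate the general idea of running text extraction as an external 
process, here is a minimal sketch.  It is not the code from the patch: the 
class and method names are hypothetical, it assumes Java 9 or later, and it 
assumes the xpdf "pdftotext" binary is installed and that passing "-" as the 
output file sends the text to stdout.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, for illustration only (not taken from the patch).
class XpdfTextSketch {

    // Build the pdftotext argument list: quiet mode, UTF-8 output,
    // input file, and "-" so the extracted text goes to stdout.
    static List<String> buildCommand(String pdftotextPath, Path pdf) {
        return List.of(pdftotextPath, "-q", "-enc", "UTF-8", pdf.toString(), "-");
    }

    // Run pdftotext as a child process and return its stdout as a string.
    // The PDF is parsed inside the child process, not in this JVM; only
    // the extracted text is held in this JVM's heap.
    static String extractText(String pdftotextPath, Path pdf)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(buildCommand(pdftotextPath, pdf))
                // Discard stderr so a chatty child cannot block on a full pipe.
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        String text;
        try (InputStream in = p.getInputStream()) {
            // Drain stdout completely before waiting for the exit status.
            text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
        int status = p.waitFor();
        if (status != 0) {
            throw new IOException("pdftotext exited with status " + status);
        }
        return text;
    }
}
```

A caller would use something like 
XpdfTextSketch.extractText("pdftotext", Path.of("/tmp/in.pdf")) and treat an 
IOException as "could not extract", which is how a corrupt PDF would likely 
surface in this sketch.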

See patch #2745393 for the code:
https://sourceforge.net/tracker/?func=detail&aid=2745393&group_id=19984&atid=319984

    -- Larry


------------------------------------------------------------------------------
This SF.net email is sponsored by:
High Quality Requirements in a Collaborative Environment.
Download a free trial of Rational Requirements Composer Now!
http://p.sf.net/sfu/www-ibm-com
_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech
