Larry, I assume this is a donation to DSpace? If so, I'll commit it so it's
available for testing/use in the 1.5.2 release.
Mark
On Thu, Apr 9, 2009 at 10:56 AM, Graham Triggs <[email protected]> wrote:
> Nice work Larry,
>
> I've replaced our PDF text extraction and thumbnail generation with this
> code.
>
> Thankfully, running on Debian, adding the third party tools was as hard as
> "apt-get install xpdf" ;)
>
> I actually ran into a few more difficulties with the ImageIO libraries -
> it's a pity that you don't get a simple ClassNotFoundException to be able to
> report this more clearly.
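
The message above does not say what the ImageIO difficulties were. One common case, assumed here, is a missing reader plugin: `ImageIO.read` returns null instead of throwing when no registered reader can handle the input. A minimal Java sketch of an up-front check follows; the class name `ImageIoCheckSketch` and the format names probed in `main` (including "pnm") are illustrative assumptions, not taken from the patch.

```java
import java.util.Arrays;
import javax.imageio.ImageIO;

/** Hypothetical helper for illustration only; not part of DSpace or the patch. */
final class ImageIoCheckSketch {

    /** Returns true if some registered ImageIO plugin can read the given format name. */
    static boolean hasReader(String formatName) {
        return ImageIO.getImageReadersByFormatName(formatName).hasNext();
    }

    public static void main(String[] args) {
        // "pnm" is an assumed example of a format that needs an extra plugin.
        for (String format : new String[] {"png", "jpeg", "pnm"}) {
            System.out.println(format + " reader available: " + hasReader(format));
        }
        // List every format name the installed plugins claim to read.
        System.out.println("Readable formats: "
                + Arrays.toString(ImageIO.getReaderFormatNames()));
    }
}
```

Running a check like this before a thumbnail filter starts would let it report a missing plugin by name rather than failing later on a null image.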
>
> But aside from that, my limited tests seem to work quite well.
>
> G
>
> -----Original Message-----
> From: Larry Stone [mailto:[email protected]]
> Sent: 08 April 2009 22:21
> To: Tim Donohue
> Cc: DSpace Tech; Jeffrey Trimble
> Subject: Re: [Dspace-tech] Java Heap dumps during Filter-Media
>
> The PDFBox library is _always_ going to be a problem because of its
> architecture. It insists on reading the entire PDF document, images
> included, into memory. This is not necessary; PDF was explicitly designed
> to let renderers process a page at a time in limited memory.
> Perhaps it could gain a lot by adding a "mode" where it ignores images
> (e.g. for text extraction, it is a complete waste of time to even read them
> into memory since it won't be getting any text out of them).
>
> I took a different approach that may be helpful to sites with a lot of PDF
> content that is pathological to PDFBox. I wrote a couple of filters that
> invoke the XPDF utilities as external OS-level command processes to do the
> dirty work. They are a bit more complicated to maintain since they rely on
> outside programs that have to be installed, but I've found the xpdf tools to
> be simple to install and maintain.
> The XPDF-based text extractor is about three times as fast as PDFBox, and
> the only PDFs it failed on were corrupt. There were also no issues
> with heap space since it runs outside of the JVM.
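
For readers who want a feel for the external-process approach described above, here is a minimal Java sketch. It is not the code from patch #2745393. It assumes Java 11 or later and that the xpdf `pdftotext` binary is on the PATH; the class and method names (`XpdfTextSketch`, `runAndCapture`, `extractText`) are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical helper for illustration only; not the code from the patch. */
final class XpdfTextSketch {

    /** Runs an external command and returns its standard output decoded as UTF-8. */
    static String runAndCapture(List<String> command)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        // Discard stderr so the child cannot block on a full, unread pipe.
        pb.redirectError(ProcessBuilder.Redirect.DISCARD);
        Process process = pb.start();
        String output;
        try (InputStream in = process.getInputStream()) {
            output = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException(command.get(0) + " exited with status " + exit);
        }
        return output;
    }

    /** Extracts text from a PDF; a text-file argument of "-" makes pdftotext write to stdout. */
    static String extractText(Path pdf) throws IOException, InterruptedException {
        return runAndCapture(List.of(
                "pdftotext", "-q", "-enc", "UTF-8", pdf.toString(), "-"));
    }

    public static void main(String[] args) throws Exception {
        System.out.print(extractText(Path.of(args[0])));
    }
}
```

In this sketch the PDF is parsed entirely in the child process, so only the extracted text reaches the JVM heap. A production filter would more likely stream the output instead of buffering it, and would add a timeout.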
>
> See patch #2745393 for the code:
>
> https://sourceforge.net/tracker/?func=detail&aid=2745393&group_id=19984&atid=319984
>
> -- Larry
>
>
>
> ------------------------------------------------------------------------------
> This SF.net email is sponsored by:
> High Quality Requirements in a Collaborative Environment.
> Download a free trial of Rational Requirements Composer Now!
> http://p.sf.net/sfu/www-ibm-com
> _______________________________________________
> DSpace-tech mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dspace-tech
>
>
>
--
Mark R. Diggory
http://purl.org/net/mdiggory/homepage - Bio
http://www.atmire.com - Institutional Repository Solutions
http://www.togather.eu - Before getting together, get t...@ther