> Larry, I assume this is a donation to DSpace? If so I'll commit it so it's
> available for testing/use in the 1.5.2 release.

Sure, go ahead, although I won't have time to provide better documentation
(for a while at least, maybe ever). My time on the FACADE project, which
produced this code, is ending tomorrow, April 10; that's also the end of my
time at MIT. I'm working desperately to finish other parts of the project
and do not have any time to spend on this. That's why I just threw it over
the wall: it looked like it could be useful right now.

Eventually all of the code I produced for FACADE will be made available as
open source; keep an eye on http://facade.mit.edu/ . Not sure when this
will happen, though.

I'm not looking at any of the JIRA stuff (don't even have access yet), so if
there's anything there that needs my attention, please send me personal
mail. I'm deleting anything with "JIRA" in the subject.

Thanks, and enjoy.

    -- Larry

> On Thu, Apr 9, 2009 at 10:56 AM, Graham Triggs
> <[email protected]> wrote:
> > Nice work Larry,
> >
> > I've replaced our PDF text extraction and thumbnail generation with this
> > code.
> >
> > Thankfully, running on Debian, adding the third-party tools was as hard as
> > "apt-get install xpdf" ;)
> >
> > I actually ran into a few more difficulties with the ImageIO libraries -
> > it's a pity that you don't get a simple ClassNotFoundException to be able to
> > report this more clearly.
> >
> > But aside from that, my limited tests seem to work quite well.
> >
> > G
> >
> > -----Original Message-----
> > From: Larry Stone [mailto:[email protected]]
> > Sent: 08 April 2009 22:21
> > To: Tim Donohue
> > Cc: DSpace Tech; Jeffrey Trimble
> > Subject: Re: [Dspace-tech] Java Heap dumps during Filter-Media
> >
> > The PDFBox library is _always_ going to be a problem because of its
> > architecture. It insists on reading the entire PDF document, images
> > included, into memory. This is not necessary; PDF was explicitly designed
> > to let renderers process a page at a time in limited memory.
> >
> > Perhaps it could gain a lot by adding a "mode" where it ignores images
> > (e.g. for text extraction, it is a complete waste of time to even read them
> > into memory since it won't be getting any text out of them).
> >
> > I took a different approach that may be helpful to sites with a lot of PDF
> > content that is pathological to PDFBox. I wrote a couple of filters that
> > invoke the XPDF utilities as external OS-level command processes to do the
> > dirty work. They are a bit more complicated to maintain since they rely on
> > outside programs that have to be installed, but I've found the xpdf tools to
> > be simple to install and maintain.
> >
> > The XPDF-based text extractor is about three times as fast as PDFBox, and
> > the only inputs it failed on were corrupt PDFs. There were also no issues
> > with heap space since it runs outside of the JVM.
> >
> > See patch #2745393 for the code:
> >
> > https://sourceforge.net/tracker/?func=detail&aid=2745393&group_id=19984&atid=319984
> >
> > -- Larry
> >
> > ------------------------------------------------------------------------------
> > This SF.net email is sponsored by:
> > High Quality Requirements in a Collaborative Environment.
> > Download a free trial of Rational Requirements Composer Now!
> > http://p.sf.net/sfu/www-ibm-com
> > _______________________________________________
> > DSpace-tech mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/dspace-tech
>
> --
> Mark R. Diggory
> http://purl.org/net/mdiggory/homepage - Bio
> http://www.atmire.com - Institutional Repository Solutions
> http://www.togather.eu - Before getting together, get t...@ther
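
For readers who have not looked at the patch: the approach described in the quoted message (handing the PDF to an XPDF tool running as a separate OS process, so the document is never loaded into the JVM heap) might look roughly like the sketch below. This is only an illustration under stated assumptions, not the code from patch #2745393. The class name `XpdfTextSketch` and the methods `buildCommand` and `extractText` are hypothetical, and the sketch assumes the `pdftotext` program from xpdf is installed and on the PATH.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: NOT the filter from the patch, just the general idea.
public class XpdfTextSketch {

    // Build the command line. "-" as the output file tells pdftotext to write
    // the extracted text to stdout; "-q" suppresses messages; "-enc UTF-8"
    // selects the output encoding.
    static List<String> buildCommand(Path pdf) {
        return Arrays.asList("pdftotext", "-q", "-enc", "UTF-8", pdf.toString(), "-");
    }

    // Run pdftotext as an external process and return the extracted text.
    // Only the text output passes through the JVM; the PDF itself (images
    // included) is parsed by the external program.
    static String extractText(Path pdf) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(pdf));
        // Let the child write errors to our stderr so an unread pipe cannot block it.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = pb.start();

        String text;
        try (InputStream out = process.getInputStream()) {
            // readAllBytes requires Java 9+. For very large outputs, redirecting
            // stdout to a temporary file would keep even the text off the heap.
            text = new String(out.readAllBytes(), StandardCharsets.UTF_8);
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            // A non-zero status usually means the tool could not process the file.
            throw new IOException("pdftotext exited with status " + exitCode);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java XpdfTextSketch <file.pdf>");
            System.exit(2);
        }
        System.out.print(extractText(Paths.get(args[0])));
    }
}
```

The trade-off is the one noted in the thread: the external program has to be installed and maintained on the server, but a pathological PDF can at worst make the child process fail rather than exhaust the Java heap.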

