Larry,
Thanks, I posted a patch that fits into DSpace 1.5.x cleanly and only
requires imageio if you enable the build profile for using it.
I also altered the documentation... (anyone out there any good at inserting
this stuff into docBook? We could use a hand).
Note, the JIRA issue is public and viewable ...
http://jira.dspace.org/jira/browse/DS-183
We are switching away from S.F. and will eventually shut down the
trackers there, so I expect you will eventually have to post things in
JIRA instead. We hope it will be a big improvement over the SF Tracker mess.
Cheers,
Mark
On Thu, Apr 9, 2009 at 12:24 PM, Larry Stone <[email protected]> wrote:
> > Larry, I assume this is a donation to DSpace? If so I'll commit it so it's
> > available for testing/use in the 1.5.2 release.
>
> Sure, go ahead, although I won't have time to provide better documentation
> (for a while at least, maybe ever). My time on the FACADE project which
> produced this code is ending tomorrow, April 10; that's also the end of
> my time at MIT. I'm working desperately to finish other parts of the
> project and do not have any time to spend on this, that's why I just
> threw it over the wall because it looked like it could be useful right now.
>
> Eventually all of the code I produced for FACADE will be made available
> as open source; keep an eye on http://facade.mit.edu/ .. Not sure when
> this will happen, though.
>
> I'm not looking at any of the JIRA stuff (don't even have access yet)
> so if there's anything there that needs my attention, please send me
> personal mail -- I'm deleting anything with "JIRA" in the subject.
> Thanks, and enjoy..
>
> -- Larry
>
> > On Thu, Apr 9, 2009 at 10:56 AM, Graham Triggs <[email protected]> wrote:
> > > Nice work Larry,
> > >
> > > > I've replaced our PDF text extraction and thumbnail generation with
> > > > this code.
> > > >
> > > > Thankfully, running on Debian, adding the third party tools was as hard
> > > > as "apt-get install xpdf" ;)
> > > >
> > > > I actually ran into a few more difficulties with the ImageIO libraries -
> > > > it's a pity that you don't get a simple ClassNotFoundException to be
> > > > able to report this more clearly.
> > >
> > > But aside from that, my limited tests seem to work quite well.
> > >
> > > G
> > >
> > > -----Original Message-----
> > > From: Larry Stone [mailto:[email protected]]
> > > Sent: 08 April 2009 22:21
> > > To: Tim Donohue
> > > Cc: DSpace Tech; Jeffrey Trimble
> > > Subject: Re: [Dspace-tech] Java Heap dumps during Filter-Media
> > >
> > > > The PDFBox library is _always_ going to be a problem because of its
> > > > architecture. It insists on reading the entire PDF document, images
> > > > included, into memory. This is not necessary; PDF was explicitly
> > > > designed to let renderers process a page at a time in limited memory.
> > > > Perhaps it could gain a lot by adding a "mode" where it ignores images
> > > > (e.g. for text extraction, it is a complete waste of time to even read
> > > > them into memory since it won't be getting any text out of them).
> > > >
> > > > I took a different approach that may be helpful to sites with a lot of
> > > > PDF content that is pathological to PDFBox. I wrote a couple of filters
> > > > that invoke the XPDF utilities as external OS-level command processes
> > > > to do the dirty work. They are a bit more complicated to maintain since
> > > > they rely on outside programs that have to be installed, but I've found
> > > > the xpdf tools to be simple to install and maintain.
> > > > The XPDF-based text extractor is about three times as fast as PDFBox,
> > > > and the only PDFs it failed on were corrupt. There were also no issues
> > > > with heap space since it runs outside of the JVM.
> > > >
> > > > See patch #2745393 for the code:
> > > >
> > > > https://sourceforge.net/tracker/?func=detail&aid=2745393&group_id=19984&atid=319984
> > >
> > > -- Larry
> > >
> > >
> > >
> > >
> ------------------------------------------------------------------------------
> > > This SF.net email is sponsored by:
> > > High Quality Requirements in a Collaborative Environment.
> > > Download a free trial of Rational Requirements Composer Now!
> > > http://p.sf.net/sfu/www-ibm-com
> > > _______________________________________________
> > > DSpace-tech mailing list
> > > [email protected]
> > > https://lists.sourceforge.net/lists/listinfo/dspace-tech
> > >
> > >
> > >
> ------------------------------------------------------------------------------
> > > This SF.net email is sponsored by:
> > > High Quality Requirements in a Collaborative Environment.
> > > Download a free trial of Rational Requirements Composer Now!
> > > http://p.sf.net/sfu/www-ibm-com
> > > _______________________________________________
> > > DSpace-tech mailing list
> > > [email protected]
> > > https://lists.sourceforge.net/lists/listinfo/dspace-tech
> > >
> >
> >
> >
> > --
> > Mark R. Diggory
> > http://purl.org/net/mdiggory/homepage - Bio
> > http://www.atmire.com - Institutional Repository Solutions
> > http://www.togather.eu - Before getting together, get t...@ther
> >
>
>
--
Mark R. Diggory
http://purl.org/net/mdiggory/homepage - Bio
http://www.atmire.com - Institutional Repository Solutions
http://www.togather.eu - Before getting together, get t...@ther
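
For readers who have not looked at the patch: the approach Larry describes in the quoted message (running the XPDF tools as external OS-level processes so that PDF parsing happens outside the JVM heap) could look roughly like the sketch below. This is not the code from the patch. The class name XpdfTextExtractor and its methods are hypothetical, and the sketch assumes a pdftotext binary is installed and on the PATH. It relies on pdftotext accepting "-" as the output file to write the extracted text to standard output.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not the filter class from the patch.
final class XpdfTextExtractor {
    // Assumption: name or full path of the pdftotext binary from the xpdf tools.
    private final String executable;

    XpdfTextExtractor(String executable) {
        this.executable = executable;
    }

    // Builds the command line. "-q" suppresses messages, "-enc UTF-8" selects
    // the output encoding, and "-" as the text file means "write to stdout".
    List<String> buildCommand(Path pdf) {
        return Arrays.asList(executable, "-q", "-enc", "UTF-8", pdf.toString(), "-");
    }

    // Runs pdftotext as a child process and returns the extracted text.
    // Only the extracted text is held in the JVM; the PDF itself (images
    // included) is parsed by the external process.
    String extract(Path pdf) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(buildCommand(pdf));
        // Send the child's stderr to our stderr so it cannot fill a pipe and block.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();

        ByteArrayOutputStream text = new ByteArrayOutputStream();
        try (InputStream in = process.getInputStream()) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                text.write(buffer, 0, read);
            }
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with status " + exitCode + " for " + pdf);
        }
        return new String(text.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Calling new XpdfTextExtractor("pdftotext").extract(Paths.get("some.pdf")) would then return the text of that file, or throw an IOException if the tool is missing or exits with a non-zero status. A real filter would also want a timeout on the child process.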