At MIT we came up with a similar approach, which takes some of the grunt work out of managing the skips. We extended MediaFilter to detect PDFBox (or other) exceptions and automatically record the handles of the failing items in a skip list, which is then used on any subsequent runs. We'd be glad to give you the code or just put it into the next 1.5.X release.

Thanks,

Richard R

Quoting Tim Donohue <[email protected]>:

> Jeffrey,
>
> I've seen this same issue all too many times to count. From what I've
> noticed, it seems that the PDFBox software (which DSpace uses)
> occasionally has difficulties with larger PDFs (usually 7MB or larger)
> that include OCRed, scanned images. I've never encountered this
> problem with PDFs created directly from digital files (like Word, etc.)...
>
> From what I've seen, occasionally recreating the PDF will resolve the
> problem...but, more often than not, even that doesn't help. The problem
> seems to be more of an issue with how PDFBox loads the content into memory.
>
> Locally, I've only come up with two possible solutions:
>
> (1) Increase the memory available to the 'filter-media' script (by
> bumping up the -Xmx value in the '[dspace]/bin/dsrun' script). This
> works for some PDFs, but others will continue to have problems (as
> PDFBox seems to use up enormous amounts of memory for some PDFs).
>
> (2) Force those problematic PDFs to be skipped over by the
> 'filter-media' script (by using the -s flag):
>
> To make this easier on myself, I've started maintaining a
> "filter-skiplist" file which lists all the handles of the problematic
> PDFs (so far we've encountered 35 of them), with a separate handle on
> each line. Then, I pass this "filter-skiplist" file to the cronjob
> which runs 'filter-media' like so:
>
> 0 2 * * * filter-media -s `less filter-skiplist | tr '\n' ','`
>
> The above command translates all the newlines (\n) to commas (,) in the
> 'filter-skiplist' file and passes the result to the 'filter-media' -s
> (skip) flag. So, in the end, filter-media receives a comma-separated
> list of handles of PDFs which it should no longer process.
> (Obviously this means any PDFs belonging to items in your
> 'filter-skiplist' cannot be full-text searched in DSpace.)
>
> I'm hoping that in the longer term PDFBox will resolve its memory issues
> as it comes out of the "incubation" stage under Apache.
>
> If anyone else has potential solutions, I'd love to hear them, as I'm in
> a similar situation to Jeffrey.
>
> - Tim
>
>
> Jeffrey Trimble wrote:
>> I've run into a funky situation. After using the distributed PDFBOX....and
>> the associated jars (bouncy castle) the filter media works really,
>> really well, until--
>>
>> We have one pdf that has caused the filter-media to produce a memory dump/
>> java heap dump. The errors are reported by the IBM flavor of the JVM.
>> We removed the offending PDF from the database, and the filter-media
>> went on its way merrily.
>>
>> Has anyone seen anything like this? I have a copy of the heap dump and
>> trace. I can reproduce it on demand by placing this PDF back into the IR.
>>
>> If you have seen this, and were able to resolve it, please let me know.
>> The only thing I can think of doing is to rescan the PDF file from the
>> original and see if there is something that resolves itself with the
>> new scan.
>>
>> Thanks in advance,
>>
>>
>> Jeffrey Trimble
>> System Librarian
>> William F. Maag Library
>> Youngstown State University
>> 330.941.2483 (Office)
>> [email protected] <mailto:[email protected]>
>> http://www.maag.ysu.edu
>> http://digital.maag.ysu.edu
>>
>> ------------------------------------------------------------------------
>>
>> ------------------------------------------------------------------------------
>> This SF.net email is sponsored by:
>> High Quality Requirements in a Collaborative Environment.
>> Download a free trial of Rational Requirements Composer Now!
>> http://p.sf.net/sfu/www-ibm-com
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> DSpace-tech mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/dspace-tech
>
> --
> Tim Donohue
> Research Programmer, IDEALS
> http://www.ideals.uiuc.edu/
> University of Illinois
> [email protected] | (217) 333-4648
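
For anyone maintaining a skip list by hand as described in this thread, here is a minimal sketch of the newline-to-comma step. It is only an illustration, not the MIT MediaFilter code and not part of DSpace. The function name skiplist_to_csv is hypothetical, the handles in the example are made up, and allowing blank lines and '#' comment lines in the file is an added assumption. The one thing it changes from the `tr` one-liner is that it does not leave a trailing comma after the last handle.

```shell
#!/bin/sh
# Hypothetical helper (not part of DSpace): print the handles listed in a
# skip-list file as one comma-separated string.
# Assumption: one handle per line; blank lines and '#' comment lines may
# appear in the file and are ignored.
skiplist_to_csv() {
    # grep -v drops blank lines and comment lines;
    # paste -s -d, joins the remaining lines with commas, with no
    # trailing comma after the last handle.
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1" | paste -s -d, -
}

# Example with made-up handles (not real items):
tmp=$(mktemp)
printf '%s\n' '123456789/10' '' '# large scanned PDF' '123456789/42' > "$tmp"
skiplist_to_csv "$tmp"
rm -f "$tmp"
```

In a small wrapper script run from cron, the call would then look something like `filter-media -s "$(skiplist_to_csv filter-skiplist)"`, using the same -s flag and file name mentioned above.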

