At MIT we came up with a similar approach, which takes some of the
grunt work out of managing the skips. We extended MediaFilter to detect PDFBox
(or other) exceptions, then automatically record their handles to a skip list,
which is used for any subsequent runs. We'd be glad to give you the code or
just put it into the next 1.5.X release.
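
For anyone who wants the general idea before the code is available, here is a
minimal shell sketch of the same pattern. It is not the MIT MediaFilter change
itself (that lives in Java inside DSpace), and process_item, is_skipped,
run_with_skiplist, and the default file name are hypothetical stand-ins: any
handle whose processing step fails is appended to the skip list, and later
runs pass over every handle already on that list.

```shell
#!/bin/sh
# Sketch only: process_item is a hypothetical stand-in for whatever filters
# a single item; it is not a real DSpace command.
SKIPLIST="${SKIPLIST:-filter-skiplist}"

# Succeed if the handle is already on the skip list (exact whole-line match).
is_skipped() {
    [ -f "$SKIPLIST" ] && grep -Fxq -- "$1" "$SKIPLIST"
}

# Process each handle given as an argument; record failures in the skip list.
run_with_skiplist() {
    for handle in "$@"; do
        if is_skipped "$handle"; then
            echo "skipping $handle"
            continue
        fi
        if ! process_item "$handle"; then
            echo "failed $handle, adding to skip list"
            echo "$handle" >> "$SKIPLIST"
        fi
    done
}
```

Because failures are only ever appended, a handle stays skipped until someone
removes it from the file by hand.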

Thanks,

Richard R

Quoting Tim Donohue <[email protected]>:

> Jeffrey,
>
> I've seen this same issue all too many times to count.  From what I've
> noticed, it seems that the PDFBox software (which DSpace uses)
> occasionally has difficulties with larger PDFs (usually 7MB or larger)
> which include OCRed, scanned images.  I've never encountered this
> problem with PDFs created directly from digital files (like Word, etc.).
>
> From what I've seen, occasionally recreating the PDF will resolve the
> problem...but, more often than not even that doesn't help.  The problem
> seems to be more of an issue with how PDFBox loads the content into memory.
>
> Locally, I've only come up with two possible solutions:
>
> (1) Increase the memory available to the 'filter-media' script (by
> bumping up the -Xmx value in the '[dspace]/bin/dsrun' script).  This
> works for some PDFs, but others will continue to have problems (as
> PDFBox seems to use up enormous amounts of memory for some PDFs).
>
> (2) Force those problematic PDFs to be skipped over by the
> 'filter-media' script (by using the -s flag):
>
> To make this easier on myself, I've started maintaining a
> "filter-skiplist" file which lists all the handles of the problematic
> PDFs (so far we've encountered 35 of them), with a separate handle on
> each line.  Then, I pass this "filter-skiplist" file to the cronjob
> which runs 'filter-media' like so:
>
> 0 2 * * * filter-media -s `less filter-skiplist | tr '\n' ','`
>
> The above script translates all the newlines (\n) to commas (,) in the
> 'filter-skiplist' file and passes the result to the 'filter-media' -s
> (skip) flag.  So, in the end, filter-media receives a comma-separated
> list of handles of PDFs which it should no longer process.  (Obviously,
> this means any PDFs belonging to items in your 'filter-skiplist' cannot
> be full-text searched in DSpace.)
>
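
A possible variation on the one-liner above, offered as a sketch rather than
a tested DSpace recipe: tr '\n' ',' also turns the file's final newline into
a trailing comma, and this thread does not say whether filter-media tolerates
that. The hypothetical helper skip_arg below joins the handles with paste
instead, dropping blank lines and leaving no trailing comma.

```shell
#!/bin/sh
# Print the handles in a skip-list file as one comma-separated line,
# ignoring blank lines and without a trailing comma.
skip_arg() {
    grep -v '^[[:space:]]*$' "$1" | paste -s -d , -
}
```

A wrapper script called from cron could then run something like
filter-media -s "$(skip_arg /path/to/filter-skiplist)", where the path is a
placeholder for wherever the list is kept.
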
> I'm hoping that in the longer term PDFBox will resolve its memory issues
> as it comes out of the "incubation" stage under Apache.
>
> If anyone else has potential solutions, I'd love to hear them, as I'm in
> a similar situation as Jeffrey.
>
> - Tim
>
>
> Jeffrey Trimble wrote:
>> I've run into a funky situation.  After using the distributed PDFBox and
>> the associated jars (Bouncy Castle), filter-media works really, really
>> well, until--
>>
>> We have one PDF that has caused filter-media to produce a memory dump/
>> Java heap dump.  The errors are reported first by the IBM flavor of JVM.
>> When we removed the offending PDF from the database, filter-media went on
>> its way merrily.
>>
>> Has anyone seen anything like this?  I have a copy of the heap dump and
>> trace.  I can reproduce it on demand by placing this PDF back into the IR.
>>
>> If you have seen this and were able to resolve it, please let me know.
>> The only thing I can think of doing is to rescan the PDF file from the
>> original and see if the problem resolves itself with the new scan.
>>
>> Thanks in advance,
>>
>>
>> Jeffrey Trimble
>> System Librarian
>> William F.  Maag Library
>> Youngstown State University
>> 330.941.2483 (Office)
>> [email protected] <mailto:[email protected]>
>> http://www.maag.ysu.edu
>> http://digital.maag.ysu.edu
>>
>> _______________________________________________
>> DSpace-tech mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/dspace-tech
>
> --
> Tim Donohue
> Research Programmer, IDEALS
> http://www.ideals.uiuc.edu/
> University of Illinois
> [email protected] | (217) 333-4648


