We've also experienced the Java heap errors in filter-media.  What I did was
create a PostgreSQL table that holds the bitstream_id of each document that
will not filter.  I modified MediaFilterManager.java to write a row to this
table whenever it encounters an "unfilterable" document (because of a Java heap
error or any other error) and to query this table for each bitstream_id
*BEFORE* it attempts to filter it.  If the bitstream_id *is* found in this
table, the document is skipped.  Essentially we're accomplishing the same thing
as Tim, except that we also record the date, the time, and the number of times
a document has been skipped, and we can report this list of "unfilterable"
documents to our users.  They can then open the problematic .pdf file and save
it as a .txt file, and we "import -update" them back into DSpace.
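
For anyone who wants to try something similar, here is a rough sketch of the
check-then-record logic described above.  It is not the actual patch to
MediaFilterManager.java: every class and method name below is hypothetical, and
an in-memory map stands in for the PostgreSQL table so that the example runs on
its own.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch only: none of these names come from DSpace.
final class UnfilterableRegistry {

    /** One row of the assumed skip table. */
    static final class Entry {
        final String reason;   // what went wrong on the failed attempt
        int timesSkipped;      // how often the document has been skipped since
        Instant lastSkipped;   // date and time of the most recent skip

        Entry(String reason) {
            this.reason = reason;
        }
    }

    /** Stand-in for whatever actually filters a bitstream. */
    interface Filter {
        void apply(int bitstreamId) throws Exception;
    }

    // Stands in for the PostgreSQL table, keyed by bitstream_id.
    private final Map<Integer, Entry> rows = new LinkedHashMap<>();

    /** Look the bitstream up BEFORE filtering; count the skip if it is known bad. */
    boolean shouldSkip(int bitstreamId) {
        Entry row = rows.get(bitstreamId);
        if (row == null) {
            return false;
        }
        row.timesSkipped++;
        row.lastSkipped = Instant.now();
        return true;
    }

    /** Remember a bitstream that failed to filter. */
    void recordFailure(int bitstreamId, Throwable cause) {
        rows.putIfAbsent(bitstreamId, new Entry(String.valueOf(cause)));
    }

    int timesSkipped(int bitstreamId) {
        Entry row = rows.get(bitstreamId);
        return row == null ? 0 : row.timesSkipped;
    }

    /** Returns true only if the filter actually ran and succeeded. */
    boolean filterUnlessKnownBad(int bitstreamId, Filter filter) {
        if (shouldSkip(bitstreamId)) {
            return false;
        }
        try {
            filter.apply(bitstreamId);
            return true;
        } catch (Exception | OutOfMemoryError e) {
            // Catching OutOfMemoryError is best effort; the JVM may not
            // always be in a state to continue after one.
            recordFailure(bitstreamId, e);
            return false;
        }
    }

    public static void main(String[] args) {
        UnfilterableRegistry registry = new UnfilterableRegistry();
        Filter alwaysFails = id -> {
            throw new OutOfMemoryError("simulated heap exhaustion");
        };
        registry.filterUnlessKnownBad(42, alwaysFails); // fails and is recorded
        registry.filterUnlessKnownBad(42, alwaysFails); // skipped, not retried
        System.out.println("bitstream 42 skipped "
                + registry.timesSkipped(42) + " time(s)");
    }
}
```

In a real installation the map would be replaced by queries against the table,
and the rows would double as the report of "unfilterable" documents.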



Sue



-----Original Message-----
From: Tim Donohue [mailto:[email protected]]
Sent: Wednesday, April 08, 2009 10:37 AM
To: Jeffrey Trimble
Cc: DSpace Tech
Subject: Re: [Dspace-tech] Java Heap dumps during Filter-Media



Jeffrey,



I've seen this same issue all too many times to count.  From what I've
noticed, it seems that the PDFBox software (which DSpace uses)
occasionally has difficulties with larger PDFs (usually 7MB or larger)
which include OCRed, scanned images.  I've never encountered this
problem with PDFs created directly from digital files (like Word, etc.)...



From what I've seen, occasionally recreating the PDF will resolve the
problem...but, more often than not even that doesn't help.  The problem
seems to be more of an issue with how PDFBox loads the content into memory.



Locally, I've only come up with two possible solutions:



(1) Increase the memory available to the 'filter-media' script (by
bumping up the -Xmx value in the '[dspace]/bin/dsrun' script).  This
works for some PDFs, but others will continue to have problems (as
PDFBox seems to use up enormous amounts of memory for some PDFs).

(2) Force those problematic PDFs to be skipped over by the
'filter-media' script (by using the -s flag):



To make this easier on myself, I've started maintaining a
"filter-skiplist" file which lists all the handles of the problematic
PDFs (so far we've encountered 35 of them), with a separate handle on
each line.  Then, I pass this "filter-skiplist" file to the cronjob
which runs 'filter-media' like so:



0 2 * * * filter-media -s `cat filter-skiplist | tr '\n' ','`



The above script translates all the newlines (\n) to commas (,) in the
'filter-skiplist' file and passes the result to the 'filter-media' -s
(skip) flag.  So, in the end, filter-media receives a comma-separated
list of handles of PDFs which it should no longer process.  (Obviously
this means any PDFs belonging to items in your 'filter-skiplist' cannot
be full-text searched in DSpace.)
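
If you would rather build that argument in code, here is a small sketch of the
same newline-to-comma step.  It also trims whitespace and drops blank lines, so
the result has no trailing comma.  The class and method names are made up for
illustration, and the handles in the usage note are placeholders, not real
items.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of DSpace.
final class SkipListArgs {

    /** Joins one-handle-per-line input into a comma-separated list. */
    static String toSkipArgument(List<String> lines) {
        return lines.stream()
                .map(String::trim)            // tolerate stray whitespace
                .filter(s -> !s.isEmpty())    // ignore blank lines
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the skip-list file, one handle per line.
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        System.out.println(toSkipArgument(lines));
    }
}
```

For example, a file containing the lines "123456789/12" and "123456789/34"
would produce "123456789/12,123456789/34", which can then be handed to the -s
flag.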



I'm hoping that in the longer term PDFBox will resolve its memory issues
as it comes out of the "incubation" stage under Apache.

If anyone else has potential solutions, I'd love to hear them, as I'm in
a similar situation to Jeffrey.



- Tim





Jeffrey Trimble wrote:

> I've run into a funky situation.  After using the distributed PDFBOX....and
> the associated jars (bouncy castle) the filter media works really,
> really well,
> until--
>
> We have one pdf that has caused the filter-media to produce a memory dump/
> java heap dump.  The errors are reports first the IBM flavor of JVM.
> We removed the offending PDF from the database, and the filter-media
> went on its way merrily.
>
> Has anyone seen anything like this?  I have a copy of the heap dump and
> trace.  I can reproduce it on demand by placing this PDF back into the IR.
>
> If you have seen this, and were able to resolve it, please let me know.
> The only thing I can think of doing is to rescan the PDF file from the
> original and see if something resolves itself with the new scan.
>
> Thanks in advance,
>
>
> Jeffrey Trimble
> System Librarian
> William F. Maag Library
> Youngstown State University
> 330.941.2483 (Office)
> [email protected] <mailto:[email protected]>
> http://www.maag.ysu.edu
> http://digital.maag.ysu.edu
>
> ------------------------------------------------------------------------
>
> ------------------------------------------------------------------------------
> This SF.net email is sponsored by:
> High Quality Requirements in a Collaborative Environment.
> Download a free trial of Rational Requirements Composer Now!
> http://p.sf.net/sfu/www-ibm-com
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> DSpace-tech mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dspace-tech



--
Tim Donohue
Research Programmer, IDEALS
http://www.ideals.uiuc.edu/
University of Illinois
[email protected] | (217) 333-4648


