Just to follow up briefly: Thanks for all the great suggestions from everyone! I've tried the latest development version of PDFBox today and unfortunately that didn't seem to resolve anything :(
I've also noticed that it *seems* to be related to the size of the PDFs. We just received a bulk load into our DSpace of about 100+ PDFs, and I've now run into about 3-4 which cause the OutOfMemory errors (all of which are between 11MB and 15MB). The only other thing in common is that all of these PDFs were initially image-based, and were OCRed before ingesting them into DSpace (not sure if that could be "confusing" PDFBox).

In any case, I've logged a bug with PDFBox on SourceForge and referenced a few of the PDFs which have these issues. I'm hoping they'll be able to help debug it :)

http://sourceforge.net/tracker/index.php?func=detail&aid=1805929&group_id=78314&atid=552832

I'll post back to this thread once there is a resolution to the issue in case others run across this problem as well.

- Tim

Dan Scott wrote:
> I should note to the list that the latest development version of
> PDFBox claims to have solved an Out of Memory Exception error. Sounds
> familiar :)
>
> I had suggested to Tim privately that maybe he could test it out and let us
> know if it resolves the problem:
>
> http://www.pdfbox.org/changes.html#version_0.7.4-dev
>
> On 22/09/2007, Mark Diggory <[EMAIL PROTECTED]> wrote:
>> Yes, I recall some issues in the past which we addressed by upgrading that
>> jar to the latest. DSpace 1.4.2/1.5 should have the latest PDFBox jar
>> (0.7.3) while DSpace 1.4.1 has an older version.
>>
>> Unfortunately, PDFBox is a more full featured PDF editor and not just a text
>> extractor, it pulls large portions of the PDF into memory when processing
>> it. If we could find a more stream based text extractor for pdf files, it
>> would make the memory footprint much more fixed for FilterMedia.
>>
>> -Mark
>>
>> On Sep 22, 2007, at 5:57 AM, Jimmy Zhang wrote:
>>
>> The responsibility of PDFBox is to extract the full text of the PDF file. I
>> am wondering whether it may have to do with the PDF file itself. Do you mean
>> that any PDF file larger than 10MB can cause the problem, or only that one
>> PDF file?
>>
>> --
>> Website: www.drepository.com
>>
>> On 9/21/07, Mark Diggory <[EMAIL PROTECTED]> wrote:
>>> I would also then recommend trying to get the latest PDFBox and
>>> replace the jar in your lib directory.
>>>
>>> http://sourceforge.net/project/showfiles.php?group_id=78314&package_id=79377
>>>
>>> On Sep 21, 2007, at 9:46 AM, Tim Donohue wrote:
>>>
>>>> Jayan & Mark,
>>>>
>>>> Thanks for the suggestions. But, our problem is that we're
>>>> currently running Java & dsrun using:
>>>>
>>>> JAVA_OPTS=-Xmx1024M -Xms1024M -XX:NewRatio=2 -Dfile.encoding=UTF-8
>>>>
>>>> (I've modified our local dsrun script to read from the JAVA_OPTS
>>>> environment variable).
>>>>
>>>> So, even setting a maximum heap size of 1GB, we don't seem to be
>>>> able to full text index a 15MB PDF without encountering
>>>> "OutOfMemory: Java heap space" errors. Strange, I know. My
>>>> current theory is that there may be a memory leak in the PDFBox
>>>> tools. I'm still working on a definite diagnosis though. If no
>>>> one else out there has noticed this with DSpace 1.4.2, then I guess
>>>> it's possible there's something in our local settings (or
>>>> customizations of DSpace) which could be causing this issue.
>>>>
>>>> - Tim
>>>>
>>>> Mark Diggory wrote:
>>>>> We should consider adding more sane defaults. Most machines that
>>>>> DSpace is running on have well over 1Gig of memory available, and
>>>>> it's important to remember this is a maximum heap size and is not
>>>>> taken unless required. I think setting dsrun and the other
>>>>> commandline scripts to be 512m (1/2 * 1Gig) would eliminate most
>>>>> outlying cases where PDF docs need to be held in memory.
>>>>> -Mark Diggory
>>>>> On Sep 21, 2007, at 2:10 AM, Jayan Chirayath Kurian wrote:
>>>>>> Hi! Tim,
>>>>>>
>>>>>> Here we faced similar errors while trying out full-text indexing on
>>>>>> DSpace 1.4.1/Windows 2003 Standard Edition. We had roughly 100,000
>>>>>> records. This was rectified once dsrun.bat was given 1000m (in place
>>>>>> of 256m) at java -Xmx256m -classpath ........
>>>>>> http://repositorydev.ntu.edu.sg
>>>>>>
>>>>>> Jayan
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: [EMAIL PROTECTED]
>>>>>> [mailto:[EMAIL PROTECTED] On Behalf Of Tim Donohue
>>>>>> Sent: Friday, September 21, 2007 1:58 AM
>>>>>> To: dspace-tech
>>>>>> Subject: [Dspace-tech] OutOfMemory errors during large PDF indexing
>>>>>>
>>>>>> All,
>>>>>>
>>>>>> I'm curious if anyone out there has run into strange OutOfMemory
>>>>>> errors while full-text indexing larger (>10MB) PDF files in DSpace.
>>>>>>
>>>>>> It usually appears as either:
>>>>>>
>>>>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>>>>>
>>>>>> OR
>>>>>>
>>>>>> Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>>>>
>>>>>> I've located the main "problem" PDF in our DSpace instance:
>>>>>> http://hdl.handle.net/2142/2050
>>>>>>
>>>>>> I've also done a large amount of searching/testing based on
>>>>>> recommendations from various sites. In particular, I've done a memory
>>>>>> dump using JHat
>>>>>> (http://java.sun.com/javase/6/docs/technotes/tools/share/jhat.html), and
>>>>>> it looks like the problem may reside with a potential memory leak in the
>>>>>> 3rd party PDFBox tool used by DSpace 1.4.2.
>>>>>> (In particular, it *looks* like PDFBox is attempting to load most/all
>>>>>> of the textual content into a giant HashMap)
>>>>>>
>>>>>> Here's the latest settings I've been testing on:
>>>>>>
>>>>>> RHEL 4
>>>>>> Java 1.6.0_02
>>>>>> Postgres 8.1.9
>>>>>> DSpace 1.4.2
>>>>>>
>>>>>> We also have the following JAVA_OPTS settings in place for our JVM:
>>>>>>
>>>>>> JAVA_OPTS=-Xmx1024M -Xms1024M -XX:NewRatio=2 -Dfile.encoding=UTF-8
>>>>>>
>>>>>> (We initially had Xmx and Xms at 512MB, but I bumped it up and we're
>>>>>> still getting the OutOfMemory exception at 1GB!)
>>>>>>
>>>>>> Anyone have any hints/tips or JVM settings to share? I personally don't
>>>>>> see why PDFBox would need so much JVM memory to parse a 15MB PDF. But,
>>>>>> the JHat analysis seemed to be pointing to PDFBox.
>>>>>>
>>>>>> - Tim
>>>>>>
>>>>>> P.S. an example of the full error stack trace is below:
>>>>>>
>>>>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>>>>>     at java.util.HashMap.resize(Unknown Source)
>>>>>>     at java.util.HashMap.addEntry(Unknown Source)
>>>>>>     at java.util.HashMap.put(Unknown Source)
>>>>>>     at org.fontbox.cmap.CMap.addMapping(CMap.java:132)
>>>>>>     at org.fontbox.cmap.CMapParser.parse(CMapParser.java:153)
>>>>>>     at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:535)
>>>>>>     at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:387)
>>>>>>     at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
>>>>>>     at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
>>>>>>     at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
>>>>>>     at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
>>>>>>     at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
>>>>>>     at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
>>>>>>     at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
>>>>>>     at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
>>>>>>     at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
>>>>>>     at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:114)
>>>>>>     at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:602)
>>>>>>     at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:513)
>>>>>>     at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:461)
>>>>>>     at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:428)
>>>>>>     at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersAllItems(MediaFilterManager.java:391)
>>>>>>     at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:342)
>>>>>>
>>>>>> -------------------------------------------------------------------------
>>>>>> This SF.net email is sponsored by: Microsoft
>>>>>> Defy all challenges. Microsoft(R) Visual Studio 2005.
>>>>>> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
>>>>>> _______________________________________________
>>>>>> DSpace-tech mailing list
>>>>>> [email protected]
>>>>>> https://lists.sourceforge.net/lists/listinfo/dspace-tech
>>>>> ~~~~~~~~~~~~~
>>>>> Mark R. Diggory - DSpace Systems Manager
>>>>> MIT Libraries, Systems and Technology Services
>>>>> Massachusetts Institute of Technology
>>>> --
>>>>
>>>> ========================================
>>>> Tim Donohue
>>>> Research Programmer, Illinois Digital Environment for
>>>> Access to Learning and Scholarship (IDEALS)
>>>> 135 Grainger Engineering Library
>>>> University of Illinois at Urbana-Champaign
>>>>
>>>> email: [EMAIL PROTECTED]
>>>> web: http://www.ideals.uiuc.edu
>>>> phone: (217) 333-4648
>>>> fax: (217) 244-7764
>>>> ========================================

--
========================================
Tim Donohue
Research Programmer, Illinois Digital Environment for
Access to Learning and Scholarship (IDEALS)
135 Grainger Engineering Library
University of Illinois at Urbana-Champaign

email: [EMAIL PROTECTED]
web: http://www.ideals.uiuc.edu
phone: (217) 333-4648
fax: (217) 244-7764
========================================
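
P.S. For anyone else who has modified dsrun to read JAVA_OPTS, it may be worth confirming that the -Xmx value actually reaches the JVM before blaming PDFBox. Below is a minimal sketch that only uses the standard java.lang.Runtime API; the class name HeapCheck and the helper toMegabytes are made up for illustration and are not part of DSpace or PDFBox.

```java
// Hypothetical helper (not part of DSpace or PDFBox): prints the heap limits
// that the running JVM reports, so they can be compared with JAVA_OPTS.
final class HeapCheck {

    // Converts a byte count to whole megabytes (1 MB = 1024 * 1024 bytes).
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory(): the most heap the JVM will attempt to use (tracks -Xmx).
        System.out.println("max heap (MB):   " + toMegabytes(rt.maxMemory()));
        // totalMemory(): heap currently reserved by the JVM (starts near -Xms).
        System.out.println("total heap (MB): " + toMegabytes(rt.totalMemory()));
        // freeMemory(): unused portion of the currently reserved heap.
        System.out.println("free heap (MB):  " + toMegabytes(rt.freeMemory()));
    }
}
```

Running it with the same options (for example: java $JAVA_OPTS HeapCheck) should report a max heap close to 1024 with the settings quoted above; the figure can come out somewhat lower than the -Xmx value depending on the garbage collector. If it reports something much smaller, the script is probably not passing the options through.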

