I would also recommend getting the latest PDFBox release and replacing the jar in your lib directory:
http://sourceforge.net/project/showfiles.php?group_id=78314&package_id=79377

On Sep 21, 2007, at 9:46 AM, Tim Donohue wrote:

> Jayan & Mark,
>
> Thanks for the suggestions. But, our problem is that we're
> currently running Java & dsrun using:
>
> JAVA_OPTS=-Xmx1024M -Xms1024M -XX:NewRatio=2 -Dfile.encoding=UTF-8
>
> (I've modified our local dsrun script to read from the JAVA_OPTS
> environment variable).
>
> So, even setting a maximum heap size of 1GB, we don't seem to be
> able to full text index a 15MB PDF without encountering
> "OutOfMemory: Java heap space" errors. Strange, I know. My
> current theory is that there may be a memory leak in the PDFBox
> tools. I'm still working on a definite diagnosis though. If no
> one else out there has noticed this with DSpace 1.4.2, then I guess
> it's possible there's something in our local settings (or
> customizations of DSpace) which could be causing this issue.
>
> - Tim
>
> Mark Diggory wrote:
>> We should consider adding more sane defaults, most machines that
>> DSpace is running on have well over 1Gig of memory available and
>> it's important to remember this is a maximum heap size and is not
>> taken unless required. I think setting dsrun and the other
>> commandline scripts to be 512m (1/2 * 1Gig) would eliminate most
>> outlying cases where PDF docs need to be held in memory.
>> -Mark Diggory
>> On Sep 21, 2007, at 2:10 AM, Jayan Chirayath Kurian wrote:
>>> Hi! Tim,
>>>
>>> Here we faced similar errors while trying out full-text indexing on
>>> DSpace 1.4.1/windows 2003 standard edition. We had roughly 100,000
>>> records. This was rectified once dsrun.bat was given 1000m at java
>>> -Xmx256m -classpath ........
>>> http://repositorydev.ntu.edu.sg
>>>
>>> Jayan
>>>
>>>
>>> -----Original Message-----
>>> From: [EMAIL PROTECTED]
>>> [mailto:[EMAIL PROTECTED] On Behalf Of Tim Donohue
>>> Sent: Friday, September 21, 2007 1:58 AM
>>> To: dspace-tech
>>> Subject: [Dspace-tech] OutOfMemory errors during large PDF indexing
>>>
>>> All,
>>>
>>> I'm curious if anyone out there has run into strange OutOfMemory
>>> errors while full-text indexing larger (>10MB) PDF files in DSpace.
>>>
>>> It usually appears as either:
>>>
>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>>
>>> OR
>>>
>>> Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>
>>> I've located the main "problem" PDF in our DSpace instance:
>>> http://hdl.handle.net/2142/2050
>>>
>>> I've also done a large amount of searching/testing based on
>>> recommendations from various sites. In particular, I've done a memory
>>> dump using JHat
>>> (http://java.sun.com/javase/6/docs/technotes/tools/share/jhat.html),
>>> and it looks like the problem may reside with a potential memory leak
>>> in the 3rd party PDFBox tool used by DSpace 1.4.2. (In particular, it
>>> *looks* like PDFBox is attempting to load most/all of the textual
>>> content into a giant HashMap)
>>>
>>> Here's the latest settings I've been testing on:
>>>
>>> RHEL 4
>>> Java 1.6.0_02
>>> Postgres 8.1.9
>>> DSpace 1.4.2
>>>
>>> We also have the following JAVA_OPTS settings in place for our JVM:
>>>
>>> JAVA_OPTS=-Xmx1024M -Xms1024M -XX:NewRatio=2 -Dfile.encoding=UTF-8
>>>
>>> (We initially had Xmx and Xms at 512MB, but I bumped it up and we're
>>> still getting the OutOfMemory exception at 1GB!)
>>>
>>> Anyone have any hints/tips or JVM settings to share? I personally
>>> don't see why PDFBox would need so much JVM memory to parse a 15MB
>>> PDF. But, the JHat analysis seemed to be pointing to PDFBox.
>>>
>>> - Tim
>>>
>>> P.S. an example of the full error stack trace is below:
>>>
>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>>     at java.util.HashMap.resize(Unknown Source)
>>>     at java.util.HashMap.addEntry(Unknown Source)
>>>     at java.util.HashMap.put(Unknown Source)
>>>     at org.fontbox.cmap.CMap.addMapping(CMap.java:132)
>>>     at org.fontbox.cmap.CMapParser.parse(CMapParser.java:153)
>>>     at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:535)
>>>     at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:387)
>>>     at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
>>>     at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
>>>     at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
>>>     at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
>>>     at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
>>>     at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
>>>     at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
>>>     at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
>>>     at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
>>>     at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:114)
>>>     at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:602)
>>>     at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:513)
>>>     at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:461)
>>>     at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:428)
>>>     at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersAllItems(MediaFilterManager.java:391)
>>>     at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:342)
>>>
>>> -------------------------------------------------------------------------
>>> This SF.net email is sponsored by: Microsoft
>>> Defy all challenges. Microsoft(R) Visual Studio 2005.
>>> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
>>> _______________________________________________
>>> DSpace-tech mailing list
>>> DSpace-tech@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/dspace-tech
>> ~~~~~~~~~~~~~
>> Mark R. Diggory - DSpace Systems Manager
>> MIT Libraries, Systems and Technology Services
>> Massachusetts Institute of Technology
>
> --
>
> ========================================
> Tim Donohue
> Research Programmer, Illinois Digital Environment for
> Access to Learning and Scholarship (IDEALS)
> 135 Grainger Engineering Library
> University of Illinois at Urbana-Champaign
>
> email: [EMAIL PROTECTED]
> web: http://www.ideals.uiuc.edu
> phone: (217) 333-4648
> fax: (217) 244-7764
> ========================================
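
Since the thread turns on whether the JAVA_OPTS heap settings really reach the JVM that dsrun starts, here is a minimal sketch for checking that. It only uses the standard java.lang.Runtime API; the class name HeapCheck and its helper methods are hypothetical and are not part of DSpace or PDFBox. It assumes you launch it with the same options your dsrun script passes to java.

```java
import java.util.Locale;

/**
 * Hypothetical helper (not part of DSpace or PDFBox): prints the heap
 * limits the running JVM actually received, so they can be compared
 * against the -Xmx/-Xms values set in JAVA_OPTS.
 */
final class HeapCheck {

    /** Converts a byte count to whole megabytes (1024 * 1024 bytes), rounding down. */
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    /** Builds a one-line report from the given byte counts. */
    static String report(long maxBytes, long totalBytes, long freeBytes) {
        return String.format(Locale.ROOT, "max=%dM total=%dM free=%dM",
                toMegabytes(maxBytes), toMegabytes(totalBytes), toMegabytes(freeBytes));
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() is the ceiling the JVM will try to use (driven by -Xmx),
        // totalMemory() is the heap currently reserved, and freeMemory() is
        // the unused part of that reserved heap.
        System.out.println(report(rt.maxMemory(), rt.totalMemory(), rt.freeMemory()));
    }
}
```

Running it as java $JAVA_OPTS HeapCheck should show a maximum close to the configured -Xmx value; the reported figure may come out somewhat lower than the flag, depending on the collector. A maximum far below what you configured would suggest the script is not passing the options through.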