We should consider adding saner defaults. Most machines that DSpace runs on have well over 1 GB of memory available, and it's important to remember that this is a maximum heap size, which is not taken unless required. I think setting dsrun and the other command-line scripts to 512m (half of 1 GB) would eliminate most outlying cases where PDF docs need to be held in memory.
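For anyone who wants to check the "maximum, not taken up front" point on their own box, here is a minimal sketch. The class name HeapCheck and the helper toMiB are made up for illustration and are not part of DSpace; the only real API used is the standard java.lang.Runtime.

```java
// Hypothetical helper, not part of DSpace: reports how much heap the JVM
// may use (the -Xmx ceiling) versus how much it has reserved so far.
class HeapCheck {

    /** Converts a byte count to whole mebibytes (rounds down). */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();

        // maxMemory(): the most heap the JVM will attempt to use (set by -Xmx).
        System.out.println("max heap (MiB):       " + toMiB(rt.maxMemory()));

        // totalMemory(): heap the JVM has reserved right now.
        System.out.println("committed heap (MiB): " + toMiB(rt.totalMemory()));

        // Used heap is what is reserved minus what is still free inside it.
        System.out.println("used heap (MiB):      "
                + toMiB(rt.totalMemory() - rt.freeMemory()));
    }
}
```

Running it as "java -Xmx512m HeapCheck" should report a max heap of roughly 512 MiB, while the committed figure will normally start out lower unless -Xms is set to the same value. The exact numbers depend on the JVM and garbage collector in use.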
-Mark Diggory

On Sep 21, 2007, at 2:10 AM, Jayan Chirayath Kurian wrote:

> Hi! Tim,
>
> Here we faced similar errors while trying out full-text indexing on
> DSpace 1.4.1/windows 2003 standard edition. We had roughly 100,000
> records. This was rectified once dsrun.bat was given 1000m at java
> -Xmx256m -classpath ........
> http://repositorydev.ntu.edu.sg
>
> Jayan
>
>
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Tim Donohue
> Sent: Friday, September 21, 2007 1:58 AM
> To: dspace-tech
> Subject: [Dspace-tech] OutOfMemory errors during large PDF indexing
>
> All,
>
> I'm curious if anyone out there has run into strange OutOfMemory errors
> while full-text indexing larger (>10MB) PDF files in DSpace.
>
> It usually appears as either:
>
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>
> OR
>
> Exception in thread "main" java.lang.OutOfMemoryError: GC Overhead limit exceeded
>
> I've located the main "problem" PDF in our DSpace instance:
> http://hdl.handle.net/2142/2050
>
> I've also done a large amount of searching/testing based on
> recommendations from various sites. In particular, I've done a memory
> dump using JHat
> (http://java.sun.com/javase/6/docs/technotes/tools/share/jhat.html), and
> it looks like the problem may reside with a potential memory leak in the
> 3rd party PDFBox tool used by DSpace 1.4.2. (In particular, it *looks*
> like PDFBox is attempting to load most/all of the textual content into a
> giant HashMap)
>
> Here's the latest settings I've been testing on:
>
> RHEL 4
> Java 1.6.0_02
> Postgres 8.1.9
> DSpace 1.4.2
>
> We also have the following JAVA_OPTS settings in place for our JVM:
>
> JAVA_OPTS=-Xmx1024M -Xms1024M -XX:NewRatio=2 -Dfile.encoding=UTF-8
>
> (We initially had Xmx and Xms at 512MB, but I bumped it up and we're
> still getting the OutOfMemory exception at 1GB!)
>
> Anyone have any hints/tips or JVM settings to share? I personally don't
> see why PDFBox would need so much JVM memory to parse a 15MB PDF. But,
> the JHat analysis seemed to be pointing to PDFBox.
>
> - Tim
>
> P.S. an example of the full error stack trace is below:
>
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>     at java.util.HashMap.resize(Unknown Source)
>     at java.util.HashMap.addEntry(Unknown Source)
>     at java.util.HashMap.put(Unknown Source)
>     at org.fontbox.cmap.CMap.addMapping(CMap.java:132)
>     at org.fontbox.cmap.CMapParser.parse(CMapParser.java:153)
>     at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:535)
>     at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:387)
>     at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
>     at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
>     at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
>     at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
>     at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
>     at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
>     at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
>     at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
>     at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
>     at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:114)
>     at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:602)
>     at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:513)
>     at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:461)
>     at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:428)
>     at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersAllItems(MediaFilterManager.java:391)
>     at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:342)
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by: Microsoft
> Defy all challenges. Microsoft(R) Visual Studio 2005.
> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
> _______________________________________________
> DSpace-tech mailing list
> DSpace-tech@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dspace-tech

~~~~~~~~~~~~~
Mark R. Diggory - DSpace Systems Manager
MIT Libraries, Systems and Technology Services
Massachusetts Institute of Technology