Jayan & Mark,

Thanks for the suggestions. But our problem is that we're currently running Java & dsrun using:

JAVA_OPTS=-Xmx1024M -Xms1024M -XX:NewRatio=2 -Dfile.encoding=UTF-8

(I've modified our local dsrun script to read from the JAVA_OPTS environment variable.)

So, even with a maximum heap size of 1GB, we don't seem to be able to full-text index a 15MB PDF without encountering "OutOfMemoryError: Java heap space" errors. Strange, I know.

My current theory is that there may be a memory leak in the PDFBox tools. I'm still working on a definite diagnosis, though.

If no one else out there has noticed this with DSpace 1.4.2, then I guess it's possible there's something in our local settings (or customizations of DSpace) which could be causing this issue.

- Tim

Mark Diggory wrote:
> We should consider adding more sane defaults. Most machines that DSpace
> is running on have well over 1Gig of memory available, and it's important
> to remember this is a maximum heap size and is not taken unless required.
> I think setting dsrun and the other command-line scripts to 512m (1/2
> * 1Gig) would eliminate most outlying cases where PDF docs need to be
> held in memory.
>
> -Mark Diggory
>
> On Sep 21, 2007, at 2:10 AM, Jayan Chirayath Kurian wrote:
>
>> Hi! Tim,
>>
>> Here we faced similar errors while trying out full-text indexing on
>> DSpace 1.4.1/windows 2003 standard edition. We had roughly 100,000
>> records. This was rectified once dsrun.bat was given 1000m at java
>> -Xmx256m -classpath ........
>> http://repositorydev.ntu.edu.sg
>>
>> Jayan
>>
>>
>> -----Original Message-----
>> From: [EMAIL PROTECTED]
>> [mailto:[EMAIL PROTECTED] On Behalf Of Tim Donohue
>> Sent: Friday, September 21, 2007 1:58 AM
>> To: dspace-tech
>> Subject: [Dspace-tech] OutOfMemory errors during large PDF indexing
>>
>> All,
>>
>> I'm curious if anyone out there has run into strange OutOfMemory errors
>> while full-text indexing larger (>10MB) PDF files in DSpace.
>>
>> It usually appears as either:
>>
>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>
>> OR
>>
>> Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
>>
>> I've located the main "problem" PDF in our DSpace instance:
>> http://hdl.handle.net/2142/2050
>>
>> I've also done a large amount of searching/testing based on
>> recommendations from various sites. In particular, I've done a memory
>> dump using JHat
>> (http://java.sun.com/javase/6/docs/technotes/tools/share/jhat.html), and
>> it looks like the problem may reside with a potential memory leak in the
>> 3rd party PDFBox tool used by DSpace 1.4.2. (In particular, it *looks*
>> like PDFBox is attempting to load most/all of the textual content into a
>> giant HashMap.)
>>
>> Here's the latest settings I've been testing on:
>>
>> RHEL 4
>> Java 1.6.0_02
>> Postgres 8.1.9
>> DSpace 1.4.2
>>
>> We also have the following JAVA_OPTS settings in place for our JVM:
>>
>> JAVA_OPTS=-Xmx1024M -Xms1024M -XX:NewRatio=2 -Dfile.encoding=UTF-8
>>
>> (We initially had Xmx and Xms at 512MB, but I bumped it up and we're
>> still getting the OutOfMemory exception at 1GB!)
>>
>> Anyone have any hints/tips or JVM settings to share? I personally don't
>> see why PDFBox would need so much JVM memory to parse a 15MB PDF. But,
>> the JHat analysis seemed to be pointing to PDFBox.
>>
>> - Tim
>>
>> P.S. An example of the full error stack trace is below:
>>
>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>     at java.util.HashMap.resize(Unknown Source)
>>     at java.util.HashMap.addEntry(Unknown Source)
>>     at java.util.HashMap.put(Unknown Source)
>>     at org.fontbox.cmap.CMap.addMapping(CMap.java:132)
>>     at org.fontbox.cmap.CMapParser.parse(CMapParser.java:153)
>>     at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:535)
>>     at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:387)
>>     at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
>>     at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
>>     at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
>>     at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
>>     at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
>>     at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
>>     at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
>>     at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
>>     at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
>>     at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:114)
>>     at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:602)
>>     at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:513)
>>     at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:461)
>>     at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:428)
>>     at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersAllItems(MediaFilterManager.java:391)
>>     at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:342)
>>
>> -------------------------------------------------------------------------
>> This SF.net email is sponsored by: Microsoft
>> Defy all challenges. Microsoft(R) Visual Studio 2005.
>> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
>> _______________________________________________
>> DSpace-tech mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/dspace-tech
>
> ~~~~~~~~~~~~~
> Mark R. Diggory - DSpace Systems Manager
> MIT Libraries, Systems and Technology Services
> Massachusetts Institute of Technology

--
========================================
Tim Donohue
Research Programmer,
Illinois Digital Environment for Access to
Learning and Scholarship (IDEALS)
135 Grainger Engineering Library
University of Illinois at Urbana-Champaign
email: [EMAIL PROTECTED]
web: http://www.ideals.uiuc.edu
phone: (217) 333-4648
fax: (217) 244-7764
========================================
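
Since the dsrun script in this thread was locally modified to read JAVA_OPTS, one thing worth ruling out is that the heap settings never reach the JVM. The following is only a minimal sketch for checking that, not part of DSpace or PDFBox: the class name HeapCheck and the helper toMegabytes are hypothetical names chosen for illustration. It relies only on the standard java.lang.Runtime methods.

```java
// Hypothetical diagnostic class (not part of DSpace): prints the heap limits
// the running JVM actually sees, so they can be compared with JAVA_OPTS.
class HeapCheck {

    /** Converts a byte count to whole megabytes (1 MB = 1024 * 1024 bytes). */
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory(): the most heap the JVM will try to use (governed by -Xmx)
        System.out.println("max heap (MB):   " + toMegabytes(rt.maxMemory()));
        // totalMemory(): heap currently reserved by the JVM (starts near -Xms)
        System.out.println("total heap (MB): " + toMegabytes(rt.totalMemory()));
        // freeMemory(): unused portion of the currently reserved heap
        System.out.println("free heap (MB):  " + toMegabytes(rt.freeMemory()));
    }
}
```

Run it with the same options the script passes, for example "java $JAVA_OPTS HeapCheck". If the reported maximum is far below the configured 1024M (a value slightly under it is normal), the options are probably not being picked up by the script. If it matches, the heap really is being exhausted during PDF text extraction.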

