Hi Tim,

We ran into similar errors here while trying out full-text indexing on DSpace 1.4.1 on Windows 2003 Standard Edition, with roughly 100,000 records. The problem went away once dsrun.bat was given 1000m of heap, in place of the 256m on its line:

java -Xmx256m -classpath ........

http://repositorydev.ntu.edu.sg
Jayan

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Tim Donohue
Sent: Friday, September 21, 2007 1:58 AM
To: dspace-tech
Subject: [Dspace-tech] OutOfMemory errors during large PDF indexing

All,

I'm curious if anyone out there has run into strange OutOfMemory errors while full-text indexing larger (>10MB) PDF files in DSpace. It usually appears as either:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

OR

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

I've located the main "problem" PDF in our DSpace instance: http://hdl.handle.net/2142/2050

I've also done a large amount of searching/testing based on recommendations from various sites. In particular, I've done a memory dump using JHat (http://java.sun.com/javase/6/docs/technotes/tools/share/jhat.html), and it looks like the problem may reside with a potential memory leak in the 3rd party PDFBox tool used by DSpace 1.4.2. (In particular, it *looks* like PDFBox is attempting to load most/all of the textual content into a giant HashMap.)

Here are the latest settings I've been testing on:

RHEL 4
Java 1.6.0_02
Postgres 8.1.9
DSpace 1.4.2

We also have the following JAVA_OPTS settings in place for our JVM:

JAVA_OPTS=-Xmx1024M -Xms1024M -XX:NewRatio=2 -Dfile.encoding=UTF-8

(We initially had Xmx and Xms at 512MB, but I bumped it up and we're still getting the OutOfMemory exception at 1GB!)

Anyone have any hints/tips or JVM settings to share? I personally don't see why PDFBox would need so much JVM memory to parse a 15MB PDF. But, the JHat analysis seemed to be pointing to PDFBox.

- Tim

P.S.
an example of the full error stack trace is below:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.HashMap.resize(Unknown Source)
    at java.util.HashMap.addEntry(Unknown Source)
    at java.util.HashMap.put(Unknown Source)
    at org.fontbox.cmap.CMap.addMapping(CMap.java:132)
    at org.fontbox.cmap.CMapParser.parse(CMapParser.java:153)
    at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:535)
    at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:387)
    at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
    at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
    at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
    at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
    at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
    at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
    at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
    at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
    at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:114)
    at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:602)
    at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:513)
    at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:461)
    at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:428)
    at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersAllItems(MediaFilterManager.java:391)
    at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:342)
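
Since the thread mentions two different places where the heap limit is set (the -Xmx value in dsrun.bat and the JAVA_OPTS variable), it may be worth confirming which limit the command-line JVM actually receives. The following is only a minimal sketch: HeapCheck and its methods are hypothetical names, not part of DSpace or PDFBox, and whether JAVA_OPTS reaches the JVM started by dsrun is an assumption to verify, not a confirmed cause.

```java
// Hypothetical diagnostic helper (not part of DSpace or PDFBox).
// It reports the heap limits of the JVM it runs in, using only the
// standard java.lang.Runtime API.
class HeapCheck {

    // Convert a byte count to whole megabytes (integer division).
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    // Maximum heap the JVM will attempt to use, in megabytes.
    // This reflects the effective -Xmx setting, if one was given.
    static long maxHeapMegabytes() {
        return toMegabytes(Runtime.getRuntime().maxMemory());
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.println("max heap (MB):   " + toMegabytes(rt.maxMemory()));
        System.out.println("total heap (MB): " + toMegabytes(rt.totalMemory()));
        System.out.println("free heap (MB):  " + toMegabytes(rt.freeMemory()));
    }
}
```

Compiling this and running it with the same java options that the dsrun script uses (for example, java -Xmx1000m HeapCheck) should report a maximum heap close to the requested value, typically slightly below it. If the reported maximum is far smaller than the JAVA_OPTS value, the script's own -Xmx setting is probably the one in effect.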
_______________________________________________
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech