Hi Tim,

We ran into similar errors while trying out full-text indexing on
DSpace 1.4.1 on Windows 2003 Standard Edition, with roughly 100,000
records. The problem went away once we raised the heap size in
dsrun.bat to 1000m, replacing the 256m in its java command line:
java -Xmx256m -classpath ........
http://repositorydev.ntu.edu.sg

Jayan


-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Tim Donohue
Sent: Friday, September 21, 2007 1:58 AM
To: dspace-tech
Subject: [Dspace-tech] OutOfMemory errors during large PDF indexing

All,

I'm curious if anyone out there has run into strange OutOfMemory errors 
while full-text indexing larger (>10MB) PDF files in DSpace.

It usually appears as either:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

OR

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

I've located the main "problem" PDF in our DSpace instance:
http://hdl.handle.net/2142/2050

I've also done a large amount of searching/testing based on
recommendations from various sites. In particular, I've analyzed a
memory dump using jhat
(http://java.sun.com/javase/6/docs/technotes/tools/share/jhat.html), and
it looks like the problem may be a memory leak in the third-party
PDFBox tool used by DSpace 1.4.2. (In particular, it *looks* like
PDFBox is attempting to load most/all of the textual content into a
giant HashMap.)
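
For a rough feel of how quickly a large map of character-code mappings
can consume heap, here is a minimal sketch. It does not use PDFBox at
all: the class name MapFootprint, the helper methods, and the entry
count of one million are all made up for illustration, and the heap
figure it prints is only approximate because GC timing varies.

```java
import java.util.HashMap;
import java.util.Map;

public class MapFootprint {
    // Fill a map with n code -> string mappings, loosely imitating a
    // character mapping table (hypothetical stand-in, not PDFBox code).
    static Map<Integer, String> fill(int n) {
        Map<Integer, String> map = new HashMap<Integer, String>();
        for (int i = 0; i < n; i++) {
            // Stay below the surrogate range so each value is one valid char.
            map.put(i, String.valueOf((char) (i % 0xD800)));
        }
        return map;
    }

    // Approximate used heap in bytes; a rough figure only.
    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        long before = usedHeap();
        Map<Integer, String> map = fill(1000000);
        long after = usedHeap();
        System.out.println(map.size() + " entries, approx. "
                + (after - before) / (1024 * 1024) + " MB of heap");
    }
}
```

If a font's mapping table were rebuilt for every string shown on a page,
this kind of cost would be paid repeatedly, but whether PDFBox actually
does that is an assumption that the heap dump would need to confirm.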

Here are the latest settings I've been testing with:

RHEL 4
Java 1.6.0_02
Postgres 8.1.9
DSpace 1.4.2

We also have the following JAVA_OPTS settings in place for our JVM:

JAVA_OPTS=-Xmx1024M -Xms1024M -XX:NewRatio=2 -Dfile.encoding=UTF-8

(We initially had Xmx and Xms at 512MB, but I bumped it up and we're 
still getting the OutOfMemory exception at 1GB!)
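
One sanity check that may be worth doing is confirming the heap limit
the failing JVM actually received, since the command-line process does
not necessarily pick up the same options as the web application. A
minimal sketch follows; the class name HeapCheck and the method
maxHeapMb are hypothetical, while Runtime.maxMemory() is standard Java.

```java
public class HeapCheck {
    // Maximum heap the JVM will attempt to use, in megabytes.
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap (MB): " + maxHeapMb());
    }
}
```

Run it with the same options as the failing process, for example
java -Xmx1024M HeapCheck. The reported value can be somewhat lower than
the -Xmx figure, but a number near 256 rather than 1024 would suggest
the setting is not reaching that JVM.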

Anyone have any hints/tips or JVM settings to share? I personally don't
see why PDFBox would need so much JVM memory to parse a 15MB PDF. But
the JHat analysis seemed to be pointing to PDFBox.

- Tim

P.S. An example of the full error stack trace is below:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
         at java.util.HashMap.resize(Unknown Source)
         at java.util.HashMap.addEntry(Unknown Source)
         at java.util.HashMap.put(Unknown Source)
         at org.fontbox.cmap.CMap.addMapping(CMap.java:132)
         at org.fontbox.cmap.CMapParser.parse(CMapParser.java:153)
         at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:535)
         at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:387)
         at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
         at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
         at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
         at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
         at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
         at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
         at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
         at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
         at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
         at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:114)
         at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:602)
         at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:513)
         at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:461)
         at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:428)
         at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersAllItems(MediaFilterManager.java:391)
         at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:342)

-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2005.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
_______________________________________________
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
