We should consider shipping saner defaults. Most machines that
DSpace runs on have well over 1 GB of memory available, and it's
important to remember that this setting is a maximum heap size: the
memory is not taken unless it is required. I think setting dsrun and
the other command-line scripts to 512m (half of 1 GB) would eliminate
most of the outlying cases where PDF documents need to be held in memory.
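
To illustrate the point that the maximum heap is a ceiling rather than an up-front allocation, here is a minimal sketch that prints both figures for the running JVM. The class name HeapReport and its helper methods are made up for illustration and are not part of DSpace; only the standard Runtime calls are real.

```java
// Hypothetical helper, not part of DSpace: compares the heap ceiling
// with the heap the JVM has actually reserved so far.
class HeapReport {

    /** Converts a byte count to whole megabytes (rounds down). */
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    /** The most heap the JVM will try to use (governed by -Xmx or a JVM default). */
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** The heap currently reserved by the JVM; it can grow toward the ceiling. */
    static long committedHeapBytes() {
        return Runtime.getRuntime().totalMemory();
    }

    public static void main(String[] args) {
        System.out.println("Max heap (ceiling):   " + toMegabytes(maxHeapBytes()) + " MB");
        System.out.println("Committed heap (now): " + toMegabytes(committedHeapBytes()) + " MB");
    }
}
```

Running it as "java -Xmx512m HeapReport" should typically report a committed heap well below the ceiling. Note that setting -Xms to the same value as -Xmx, as in the JAVA_OPTS quoted below, asks the JVM to reserve the full amount from the start.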

-Mark Diggory

On Sep 21, 2007, at 2:10 AM, Jayan Chirayath Kurian wrote:

> Hi! Tim,
>
> Here we faced similar errors while trying out full-text indexing on
> DSpace 1.4.1/windows 2003 standard edition. We had roughly 100,000
> records. This was rectified once the heap size in dsrun.bat was raised
> to 1000m, in place of the 256m in: java -Xmx256m -classpath ........
> http://repositorydev.ntu.edu.sg
>
> Jayan
>
>
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Tim
> Donohue
> Sent: Friday, September 21, 2007 1:58 AM
> To: dspace-tech
> Subject: [Dspace-tech] OutOfMemory errors during large PDF indexing
>
> All,
>
> I'm curious if anyone out there has run into strange OutOfMemory errors
> while full-text indexing larger (>10MB) PDF files in DSpace.
>
> It usually appears as either:
>
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>
> OR
>
> Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
>
> I've located the main "problem" PDF in our DSpace instance:
> http://hdl.handle.net/2142/2050
>
> I've also done a large amount of searching/testing based on
> recommendations from various sites. In particular, I've done a memory
> dump using JHat
> (http://java.sun.com/javase/6/docs/technotes/tools/share/jhat.html), and
> it looks like the problem may reside with a potential memory leak in the
> 3rd party PDFBox tool used by DSpace 1.4.2. (In particular, it *looks*
> like PDFBox is attempting to load most/all of the textual content into a
> giant HashMap)
>
> Here's the latest settings I've been testing on:
>
> RHEL 4
> Java 1.6.0_02
> Postgres 8.1.9
> DSpace 1.4.2
>
> We also have the following JAVA_OPTS settings in place for our JVM:
>
> JAVA_OPTS=-Xmx1024M -Xms1024M -XX:NewRatio=2 -Dfile.encoding=UTF-8
>
> (We initially had Xmx and Xms at 512MB, but I bumped it up and we're
> still getting the OutOfMemory exception at 1GB!)
>
> Anyone have any hints/tips or JVM settings to share? I personally don't
> see why PDFBox would need so much JVM memory to parse a 15MB PDF. But,
> the JHat analysis seemed to be pointing to PDFBox.
>
> - Tim
>
> P.S.  an example of the full error stack trace is below:
>
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>          at java.util.HashMap.resize(Unknown Source)
>          at java.util.HashMap.addEntry(Unknown Source)
>          at java.util.HashMap.put(Unknown Source)
>          at org.fontbox.cmap.CMap.addMapping(CMap.java:132)
>          at org.fontbox.cmap.CMapParser.parse(CMapParser.java:153)
>          at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:535)
>          at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:387)
>          at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
>          at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
>          at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
>          at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
>          at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
>          at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
>          at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
>          at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
>          at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
>          at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:114)
>          at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:602)
>          at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:513)
>          at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:461)
>          at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:428)
>          at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersAllItems(MediaFilterManager.java:391)
>          at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:342)
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by: Microsoft
> Defy all challenges. Microsoft(R) Visual Studio 2005.
> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
> _______________________________________________
> DSpace-tech mailing list
> DSpace-tech@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dspace-tech
>
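
One thing worth checking about the JAVA_OPTS settings quoted above: the stack trace starts in MediaFilterManager.main, so the filter was run from the command line, and (as Jayan's dsrun.bat line suggests) the command-line scripts may pass their own -Xmx to java. Whether JAVA_OPTS reaches that JVM depends on the script, so treat this as an assumption to verify rather than a diagnosis. Below is a minimal sketch for seeing which arguments a JVM actually received. The class name JvmArgs and the helper findMaxHeapFlag are made up for illustration; only the standard java.lang.management calls are real.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper, not part of DSpace: reports the -Xmx flag
// (if any) that the running JVM was started with.
class JvmArgs {

    /**
     * Returns the last argument starting with "-Xmx", or null if there is none.
     * Assumption: when a flag is repeated, the later occurrence is the one
     * that takes effect, so the last match is reported.
     */
    static String findMaxHeapFlag(List<String> jvmArgs) {
        String found = null;
        for (String arg : jvmArgs) {
            if (arg.startsWith("-Xmx")) {
                found = arg;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM itself, not to the main class.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        String flag = findMaxHeapFlag(jvmArgs);
        System.out.println("JVM arguments: " + jvmArgs);
        System.out.println("-Xmx flag:     " + (flag == null ? "(none, JVM default applies)" : flag));
    }
}
```

Starting it with the same java command line that the script uses would show whether the 1024M setting, or a smaller value from the script, is in effect for that process.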

~~~~~~~~~~~~~
Mark R. Diggory - DSpace Systems Manager
MIT Libraries, Systems and Technology Services
Massachusetts Institute of Technology


