Jayan & Mark,

Thanks for the suggestions.  But, our problem is that we're currently 
running Java & dsrun using:

JAVA_OPTS=-Xmx1024M -Xms1024M -XX:NewRatio=2 -Dfile.encoding=UTF-8

(I've modified our local dsrun script to read from the JAVA_OPTS 
environment variable).

So, even with a maximum heap size of 1GB, we don't seem to be able to 
full-text index a 15MB PDF without encountering "OutOfMemoryError: Java 
heap space" errors.  Strange, I know.  My current theory is that there 
may be a memory leak in the PDFBox tools.  I'm still working on a 
definite diagnosis though.  If no one else out there has noticed this 
with DSpace 1.4.2, then I guess it's possible there's something in our 
local settings (or customizations of DSpace) which could be causing this 
issue.
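
One way to double-check that the JAVA_OPTS values really reach the JVM is 
to print the heap limits from inside Java.  Below is a minimal sketch, 
assuming a throwaway class named HeapCheck (a hypothetical name, not part 
of DSpace) that is started with the same options as dsrun.  Note that the 
reported maximum may come out somewhat lower than the -Xmx value, 
depending on the garbage collector in use.

```java
// HeapCheck is a hypothetical helper for illustration; it is not part of DSpace.
public class HeapCheck {

    /** Converts a byte count to whole megabytes (1 MB = 1024 * 1024 bytes). */
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    /** Returns true if the JVM reports a maximum heap of at least expectedMb. */
    static boolean heapAtLeast(long expectedMb) {
        return toMegabytes(Runtime.getRuntime().maxMemory()) >= expectedMb;
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory(): upper bound the JVM will try to use (driven by -Xmx)
        System.out.println("max heap (MB):   " + toMegabytes(rt.maxMemory()));
        // totalMemory(): heap currently reserved (starts near -Xms)
        System.out.println("total heap (MB): " + toMegabytes(rt.totalMemory()));
        // freeMemory(): unused portion of the currently reserved heap
        System.out.println("free heap (MB):  " + toMegabytes(rt.freeMemory()));
    }
}
```

Running it as "java $JAVA_OPTS HeapCheck" from the same shell that 
launches dsrun should show a max heap close to 1024 if the options are 
being picked up.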

- Tim

Mark Diggory wrote:
> We should consider adding saner defaults.  Most machines that DSpace 
> is running on have well over 1GB of memory available, and it's important 
> to remember that this is a maximum heap size and is not taken unless 
> required.  I think setting dsrun and the other command-line scripts to 
> 512m (1/2 * 1GB) would eliminate most outlying cases where PDF docs need 
> to be held in memory.
> 
> -Mark Diggory
> 
> On Sep 21, 2007, at 2:10 AM, Jayan Chirayath Kurian wrote:
> 
>> Hi! Tim,
>>
>> We faced similar errors here while trying out full-text indexing on
>> DSpace 1.4.1 / Windows 2003 Standard Edition. We had roughly 100,000
>> records. This was rectified once the heap size in dsrun.bat was raised
>> to 1000m (at java -Xmx256m -classpath ........).
>> http://repositorydev.ntu.edu.sg
>>
>> Jayan
>>
>>
>> -----Original Message-----
>> From: [EMAIL PROTECTED]
>> [mailto:[EMAIL PROTECTED] On Behalf Of Tim
>> Donohue
>> Sent: Friday, September 21, 2007 1:58 AM
>> To: dspace-tech
>> Subject: [Dspace-tech] OutOfMemory errors during large PDF indexing
>>
>> All,
>>
>> I'm curious if anyone out there has run into strange OutOfMemory errors
>> while full-text indexing larger (>10MB) PDF files in DSpace.
>>
>> It usually appears as either:
>>
>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>
>> OR
>>
>> Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
>>
>> I've located the main "problem" PDF in our DSpace instance:
>> http://hdl.handle.net/2142/2050
>>
>> I've also done a large amount of searching/testing based on
>> recommendations from various sites.  In particular, I've done a memory
>> dump using JHat
>> (http://java.sun.com/javase/6/docs/technotes/tools/share/jhat.html), and
>> it looks like the problem may reside with a potential memory leak in the
>> 3rd party PDFBox tool used by DSpace 1.4.2.  (In particular, it *looks*
>> like PDFBox is attempting to load most/all of the textual content into a
>> giant HashMap.)
>>
>> Here are the latest settings I've been testing with:
>>
>> RHEL 4
>> Java 1.6.0_02
>> Postgres 8.1.9
>> DSpace 1.4.2
>>
>> We also have the following JAVA_OPTS settings in place for our JVM:
>>
>> JAVA_OPTS=-Xmx1024M -Xms1024M -XX:NewRatio=2 -Dfile.encoding=UTF-8
>>
>> (We initially had Xmx and Xms at 512MB, but I bumped it up and we're
>> still getting the OutOfMemory exception at 1GB!)
>>
>> Anyone have any hints/tips or JVM settings to share?  I personally don't
>> see why PDFBox would need so much JVM memory to parse a 15MB PDF.  But,
>> the JHat analysis seemed to be pointing to PDFBox.
>>
>> - Tim
>>
>> P.S.  an example of the full error stack trace is below:
>>
>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>          at java.util.HashMap.resize(Unknown Source)
>>          at java.util.HashMap.addEntry(Unknown Source)
>>          at java.util.HashMap.put(Unknown Source)
>>          at org.fontbox.cmap.CMap.addMapping(CMap.java:132)
>>          at org.fontbox.cmap.CMapParser.parse(CMapParser.java:153)
>>          at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:535)
>>          at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:387)
>>          at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
>>          at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
>>          at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
>>          at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
>>          at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
>>          at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
>>          at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
>>          at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
>>          at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
>>          at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:114)
>>          at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:602)
>>          at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:513)
>>          at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:461)
>>          at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:428)
>>          at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersAllItems(MediaFilterManager.java:391)
>>          at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:342)
>>
>> -------------------------------------------------------------------------
>> This SF.net email is sponsored by: Microsoft
>> Defy all challenges. Microsoft(R) Visual Studio 2005.
>> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
>> _______________________________________________
>> DSpace-tech mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/dspace-tech
> 
> ~~~~~~~~~~~~~
> Mark R. Diggory - DSpace Systems Manager
> MIT Libraries, Systems and Technology Services
> Massachusetts Institute of Technology
> 
> 
> 

-- 

========================================
Tim Donohue
Research Programmer, Illinois Digital Environment for
Access to Learning and Scholarship (IDEALS)
135 Grainger Engineering Library
University of Illinois at Urbana-Champaign

email: [EMAIL PROTECTED]
web:   http://www.ideals.uiuc.edu
phone: (217) 333-4648
fax:   (217) 244-7764
========================================
