I would also recommend getting the latest PDFBox and replacing the
jar in your lib directory.

http://sourceforge.net/project/showfiles.php?group_id=78314&package_id=79377
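
Since the dsrun script was modified locally to read JAVA_OPTS, it may
also be worth confirming that the heap settings actually reach the JVM
before blaming the jar. The following is only a minimal sketch: the
HeapCheck class is a hypothetical helper, not part of DSpace or PDFBox,
and it uses nothing beyond the standard java.lang.Runtime calls.

```java
// Hypothetical helper (not part of DSpace): reports the heap limits
// of the JVM it is running in, so the effect of -Xmx/-Xms can be checked.
final class HeapCheck {
    private static final long BYTES_PER_MB = 1024L * 1024L;

    // Converts a byte count to whole megabytes (rounds down).
    static long toMegabytes(long bytes) {
        return bytes / BYTES_PER_MB;
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // Upper bound the heap may grow to (governed by -Xmx).
        System.out.println("max heap (MB):          " + toMegabytes(rt.maxMemory()));
        // Heap currently reserved by the JVM (starts near -Xms).
        System.out.println("committed heap (MB):    " + toMegabytes(rt.totalMemory()));
        // Unused portion of the currently reserved heap.
        System.out.println("free in committed (MB): " + toMegabytes(rt.freeMemory()));
    }
}
```

Running it the same way the script launches Java, for example
"java $JAVA_OPTS HeapCheck", should report a maximum heap of roughly
1024 MB with the settings quoted below. The JVM may report slightly
less than the -Xmx value. A much smaller number would suggest the
options are not being picked up.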

On Sep 21, 2007, at 9:46 AM, Tim Donohue wrote:

> Jayan & Mark,
>
> Thanks for the suggestions.  But, our problem is that we're  
> currently running Java & dsrun using:
>
> JAVA_OPTS=-Xmx1024M -Xms1024M -XX:NewRatio=2 -Dfile.encoding=UTF-8
>
> (I've modified our local dsrun script to read from the JAVA_OPTS  
> environment variable).
>
> So, even setting a maximum heap size of 1GB, we don't seem to be  
> able to full text index a 15MB PDF without encountering  
> "OutOfMemory: Java heap space" errors.  Strange, I know.  My  
> current theory is that there may be a memory leak in the PDFBox  
> tools.   I'm still working on a definite diagnosis though.  If no  
> one else out there has noticed this with DSpace 1.4.2, then I guess  
> it's possible there's something in our local settings (or  
> customizations of DSpace) which could be causing this issue.
>
> - Tim
>
> Mark Diggory wrote:
>> We should consider adding more sane defaults. Most machines that
>> DSpace is running on have well over 1Gig of memory available, and
>> it's important to remember this is a maximum heap size and is not
>> taken unless required. I think setting dsrun and the other
>> command-line scripts to 512m (1/2 * 1Gig) would eliminate most
>> outlying cases where PDF docs need to be held in memory.
>> -Mark Diggory
>> On Sep 21, 2007, at 2:10 AM, Jayan Chirayath Kurian wrote:
>>> Hi! Tim,
>>>
>>> We faced similar errors here while trying out full-text indexing on
>>> DSpace 1.4.1/Windows 2003 Standard Edition. We had roughly 100,000
>>> records. This was rectified once the heap setting in dsrun.bat
>>> (java -Xmx256m -classpath ........) was raised to 1000m.
>>> http://repositorydev.ntu.edu.sg
>>>
>>> Jayan
>>>
>>>
>>> -----Original Message-----
>>> From: [EMAIL PROTECTED]
>>> [mailto:[EMAIL PROTECTED] On Behalf Of Tim
>>> Donohue
>>> Sent: Friday, September 21, 2007 1:58 AM
>>> To: dspace-tech
>>> Subject: [Dspace-tech] OutOfMemory errors during large PDF indexing
>>>
>>> All,
>>>
>>> I'm curious if anyone out there has run into strange OutOfMemory errors
>>> while full-text indexing larger (>10MB) PDF files in DSpace.
>>>
>>> It usually appears as either:
>>>
>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>>
>>> OR
>>>
>>> Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>
>>> I've located the main "problem" PDF in our DSpace instance:
>>> http://hdl.handle.net/2142/2050
>>>
>>> I've also done a large amount of searching/testing based on
>>> recommendations from various sites. In particular, I've done a memory
>>> dump using JHat
>>> (http://java.sun.com/javase/6/docs/technotes/tools/share/jhat.html), and
>>> it looks like the problem may reside with a potential memory leak in the
>>> 3rd party PDFBox tool used by DSpace 1.4.2. (In particular, it *looks*
>>> like PDFBox is attempting to load most/all of the textual content into a
>>> giant HashMap)
>>>
>>> Here's the latest settings I've been testing on:
>>>
>>> RHEL 4
>>> Java 1.6.0_02
>>> Postgres 8.1.9
>>> DSpace 1.4.2
>>>
>>> We also have the following JAVA_OPTS settings in place for our JVM:
>>>
>>> JAVA_OPTS=-Xmx1024M -Xms1024M -XX:NewRatio=2 -Dfile.encoding=UTF-8
>>>
>>> (We initially had Xmx and Xms at 512MB, but I bumped it up and we're
>>> still getting the OutOfMemory exception at 1GB!)
>>>
>>> Anyone have any hints/tips or JVM settings to share? I personally don't
>>> see why PDFBox would need so much JVM memory to parse a 15MB PDF. But,
>>> the JHat analysis seemed to be pointing to PDFBox.
>>>
>>> - Tim
>>>
>>> P.S.  an example of the full error stack trace is below:
>>>
>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>>          at java.util.HashMap.resize(Unknown Source)
>>>          at java.util.HashMap.addEntry(Unknown Source)
>>>          at java.util.HashMap.put(Unknown Source)
>>>          at org.fontbox.cmap.CMap.addMapping(CMap.java:132)
>>>          at org.fontbox.cmap.CMapParser.parse(CMapParser.java:153)
>>>          at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:535)
>>>          at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:387)
>>>          at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
>>>          at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
>>>          at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
>>>          at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
>>>          at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
>>>          at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
>>>          at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
>>>          at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
>>>          at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
>>>          at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:114)
>>>          at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:602)
>>>          at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:513)
>>>          at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:461)
>>>          at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:428)
>>>          at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersAllItems(MediaFilterManager.java:391)
>>>          at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:342)
>>>
>>> -------------------------------------------------------------------------
>>> This SF.net email is sponsored by: Microsoft
>>> Defy all challenges. Microsoft(R) Visual Studio 2005.
>>> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
>>> _______________________________________________
>>> DSpace-tech mailing list
>>> DSpace-tech@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/dspace-tech
>>>
>> ~~~~~~~~~~~~~
>> Mark R. Diggory - DSpace Systems Manager
>> MIT Libraries, Systems and Technology Services
>> Massachusetts Institute of Technology
>
> -- 
>
> ========================================
> Tim Donohue
> Research Programmer, Illinois Digital Environment for
> Access to Learning and Scholarship (IDEALS)
> 135 Grainger Engineering Library
> University of Illinois at Urbana-Champaign
>
> email: [EMAIL PROTECTED]
> web:   http://www.ideals.uiuc.edu
> phone: (217) 333-4648
> fax:   (217) 244-7764
> ========================================

