Yes, I recall some issues in the past which we addressed by upgrading that jar to the latest. DSpace 1.4.2/1.5 should have the latest PDFBox jar (0.7.3) while DSpace 1.4.1 has an older version.

Unfortunately, PDFBox is a full-featured PDF editor rather than just a text extractor, so it pulls large portions of the PDF into memory when processing it. If we could find a more stream-based text extractor for PDF files, the memory footprint of FilterMedia would be much closer to fixed.
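
To illustrate the stream-based idea: the sketch below is not PDFBox or DSpace code, only a minimal Java illustration of handling text in fixed-size chunks so that memory use does not grow with the size of the input. The class name StreamingCopy and its copy method are made up for this example, and a real extractor would also have to produce the text incrementally in the first place.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical illustration only: not part of PDFBox or DSpace.
final class StreamingCopy {

    // Copies characters from 'in' to 'out' through one reusable buffer, so this
    // method holds at most 'bufferSize' chars at a time, whatever the input size.
    static long copy(Reader in, Writer out, int bufferSize) {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        char[] buffer = new char[bufferSize];
        long total = 0;
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                total += n;
            }
        } catch (IOException e) {
            // Rethrow unchecked to keep the example's call sites short.
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        StringWriter sink = new StringWriter();
        long copied = copy(new StringReader("extracted text would flow through here"), sink, 8);
        System.out.println(copied + " chars copied");
    }
}
```

In a real filter the Writer would be backed by a file or the index rather than a StringWriter, which itself accumulates everything in memory.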

-Mark


On Sep 22, 2007, at 5:57 AM, Jimmy Zhang wrote:

PDFBox's job is to extract the full text of the PDF file. I wonder whether the problem has to do with that particular PDF. Do you mean that any PDF file larger than 10MB causes the problem, or only that one file?

--
Website: www.drepository.com

On 9/21/07, Mark Diggory <[EMAIL PROTECTED]> wrote:
I would also recommend getting the latest PDFBox and
replacing the jar in your lib directory.

http://sourceforge.net/project/showfiles.php?group_id=78314&package_id=79377
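
After swapping the jar, it may be worth confirming which jar the JVM actually loads PDFBox from, since an older copy elsewhere on the classpath could still be picked up first. A minimal sketch, assuming it is run with the same classpath that dsrun uses; the class JarLocator is hypothetical, and org.pdfbox.util.PDFTextStripper is simply the class named in the stack trace in this thread.

```java
import java.security.CodeSource;

// Hypothetical helper: reports where a class was loaded from.
final class JarLocator {

    // Returns the jar or directory URL for an already-loaded class.
    static String locationOfClass(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            // Typical for JDK core classes loaded by the bootstrap loader.
            return "(bootstrap or unknown location)";
        }
        return source.getLocation().toString();
    }

    // Looks a class up by name without running its static initializers.
    static String locationOfName(String className) {
        try {
            Class<?> type = Class.forName(className, false, JarLocator.class.getClassLoader());
            return locationOfClass(type);
        } catch (ClassNotFoundException | LinkageError e) {
            return "(class not found on classpath)";
        }
    }

    public static void main(String[] args) {
        // Default class name taken from the stack trace in this thread.
        String name = args.length > 0 ? args[0] : "org.pdfbox.util.PDFTextStripper";
        System.out.println(name + " -> " + locationOfName(name));
    }
}
```

If the printed location still points at the old jar, the replacement did not take effect for that classpath.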

On Sep 21, 2007, at 9:46 AM, Tim Donohue wrote:

> Jayan & Mark,
>
> Thanks for the suggestions.  But, our problem is that we're
> currently running Java & dsrun using:
>
> JAVA_OPTS=-Xmx1024M -Xms1024M -XX:NewRatio=2 -Dfile.encoding=UTF-8
>
> (I've modified our local dsrun script to read from the JAVA_OPTS
> environment variable).
>
> So, even setting a maximum heap size of 1GB, we don't seem to be
> able to full text index a 15MB PDF without encountering
> "OutOfMemory: Java heap space" errors.  Strange, I know.  My
> current theory is that there may be a memory leak in the PDFBox
> tools.   I'm still working on a definite diagnosis though.  If no
> one else out there has noticed this with DSpace 1.4.2, then I guess
> it's possible there's something in our local settings (or
> customizations of DSpace) which could be causing this issue.
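
One way to double-check that a modified dsrun script really passes the JAVA_OPTS values through is to print what the JVM itself reports. A minimal sketch, assuming it is launched the same way dsrun launches Java; HeapCheck is a hypothetical helper, not part of DSpace.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper: prints the heap limit and JVM flags actually in effect.
final class HeapCheck {

    // Maximum heap the JVM will attempt to use, in MiB.
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Flags such as -Xmx1024M should appear here if JAVA_OPTS was applied.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("JVM arguments: " + jvmArgs);

        // The reported maximum may differ slightly from the -Xmx value.
        System.out.println("Max heap (MiB): " + maxHeapMiB());
    }
}
```

If -Xmx1024M is missing from the printed arguments, the script is probably not reading the environment variable as intended.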
>
> - Tim
>
> Mark Diggory wrote:
>> We should consider adding more sane defaults. Most machines that
>> DSpace is running on have well over 1Gig of memory available, and
>> it's important to remember this is a maximum heap size and is not
>> taken unless required. I think setting dsrun and the other
>> commandline scripts to 512m (1/2 * 1Gig) would eliminate most
>> outlying cases where PDF docs need to be held in memory.
>> -Mark Diggory
>> On Sep 21, 2007, at 2:10 AM, Jayan Chirayath Kurian wrote:
>>> Hi! Tim,
>>>
>>> Here we faced similar errors while trying out full-text indexing on
>>> DSpace 1.4.1/Windows 2003 Standard Edition. We had roughly 100,000
>>> records. This was rectified once dsrun.bat was given 1000m in place of
>>> the 256m at: java -Xmx256m -classpath ........
>>> http://repositorydev.ntu.edu.sg
>>>
>>> Jayan
>>>
>>>
>>> -----Original Message-----
>>> From: [EMAIL PROTECTED]
>>> [mailto:[EMAIL PROTECTED] On Behalf Of Tim
>>> Donohue
>>> Sent: Friday, September 21, 2007 1:58 AM
>>> To: dspace-tech
>>> Subject: [Dspace-tech] OutOfMemory errors during large PDF indexing
>>>
>>> All,
>>>
>>> I'm curious if anyone out there has run into strange OutOfMemory errors
>>> while full-text indexing larger (>10MB) PDF files in DSpace.
>>>
>>> It usually appears as either:
>>>
>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>>
>>> OR
>>>
>>> Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>
>>> I've located the main "problem" PDF in our DSpace instance:
>>> http://hdl.handle.net/2142/2050
>>>
>>> I've also done a large amount of searching/testing based on
>>> recommendations from various sites.  In particular, I've analyzed a
>>> memory dump using JHat
>>> (http://java.sun.com/javase/6/docs/technotes/tools/share/jhat.html), and
>>> it looks like the problem may reside with a potential memory leak in the
>>> 3rd party PDFBox tool used by DSpace 1.4.2.  (In particular, it *looks*
>>> like PDFBox is attempting to load most/all of the textual content into a
>>> giant HashMap)
>>>
>>> Here's the latest settings I've been testing on:
>>>
>>> RHEL 4
>>> Java 1.6.0_02
>>> Postgres 8.1.9
>>> DSpace 1.4.2
>>>
>>> We also have the following JAVA_OPTS settings in place for our JVM:
>>>
>>> JAVA_OPTS=-Xmx1024M -Xms1024M -XX:NewRatio=2 -Dfile.encoding=UTF-8
>>>
>>> (We initially had Xmx and Xms at 512MB, but I bumped it up and we're
>>> still getting the OutOfMemory exception at 1GB!)
>>>
>>> Anyone have any hints/tips or JVM settings to share?  I personally don't
>>> see why PDFBox would need so much JVM memory to parse a 15MB PDF.  But,
>>> the JHat analysis seemed to be pointing to PDFBox.
>>>
>>> - Tim
>>>
>>> P.S.  an example of the full error stack trace is below:
>>>
>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>>          at java.util.HashMap.resize(Unknown Source)
>>>          at java.util.HashMap.addEntry(Unknown Source)
>>>          at java.util.HashMap.put(Unknown Source)
>>>          at org.fontbox.cmap.CMap.addMapping(CMap.java:132)
>>>          at org.fontbox.cmap.CMapParser.parse(CMapParser.java:153)
>>>          at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:535)
>>>          at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:387)
>>>          at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
>>>          at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
>>>          at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
>>>          at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
>>>          at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
>>>          at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
>>>          at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
>>>          at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
>>>          at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
>>>          at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:114)
>>>          at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:602)
>>>          at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:513)
>>>          at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:461)
>>>          at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:428)
>>>          at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersAllItems(MediaFilterManager.java:391)
>>>          at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:342)
>>>
>>> --------------------------------------------------------------------
>>> ----
>>> -
>>> This SF.net email is sponsored by: Microsoft
>>> Defy all challenges. Microsoft(R) Visual Studio 2005.
>>> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
>>> _______________________________________________
>>> DSpace-tech mailing list
>>> DSpace-tech@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/dspace-tech
>>>
>> ~~~~~~~~~~~~~
>> Mark R. Diggory - DSpace Systems Manager
>> MIT Libraries, Systems and Technology Services
>> Massachusetts Institute of Technology
>
> --
>
> ========================================
> Tim Donohue
> Research Programmer, Illinois Digital Environment for
> Access to Learning and Scholarship (IDEALS)
> 135 Grainger Engineering Library
> University of Illinois at Urbana-Champaign
>
> email: [EMAIL PROTECTED]
> web:   http://www.ideals.uiuc.edu
> phone: (217) 333-4648
> fax:   (217) 244-7764
> ========================================

