Just to follow up briefly:

Thanks for all the great suggestions from everyone!   I've tried the 
latest development version of PDFBox today and unfortunately that didn't 
seem to resolve anything :(

I've also noticed that it *seems* to be related to the size of the PDFs. 
We just received a bulk load of 100+ PDFs into our DSpace, and 
I've now run into 3-4 which cause the OutOfMemory errors (all of 
which are between 11MB and 15MB).  The only other thing in common is 
that all of these PDFs were initially image-based, and were OCRed before 
being ingested into DSpace (not sure if that could be "confusing" PDFBox).
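
For anyone who wants to check whether their own batch contains PDFs in the
same size range before running FilterMedia, here is a minimal sketch. It is
not part of DSpace or PDFBox. The class name LargePdfFinder, the 10MB-20MB
window used in main, and the idea of pointing it at the directory holding the
bulk load are all assumptions for illustration. It also assumes the files
still carry a .pdf extension (so it is meant for the source directory, not
the DSpace assetstore) and that a JDK with java.nio.file and streams (Java 8
or later) is available.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.stream.Stream;

// Hypothetical helper, not part of DSpace or PDFBox.
public class LargePdfFinder {

    public static final long MB = 1024L * 1024L;

    /**
     * Returns every regular file under root whose name ends in ".pdf"
     * (case-insensitive) and whose size is within [minBytes, maxBytes],
     * sorted by path.
     */
    public static List<Path> findPdfsInRange(Path root, long minBytes, long maxBytes)
            throws IOException {
        List<Path> matches = new ArrayList<>();
        // Files.walk must be closed, hence the try-with-resources.
        try (Stream<Path> paths = Files.walk(root)) {
            for (Path p : (Iterable<Path>) paths::iterator) {
                if (!Files.isRegularFile(p)) {
                    continue; // skip directories and special files
                }
                String name = p.getFileName().toString().toLowerCase(Locale.ROOT);
                if (!name.endsWith(".pdf")) {
                    continue;
                }
                long size = Files.size(p);
                if (size >= minBytes && size <= maxBytes) {
                    matches.add(p);
                }
            }
        }
        Collections.sort(matches);
        return matches;
    }

    public static void main(String[] args) throws IOException {
        // The directory to scan is an assumption: pass the bulk-load folder as the first argument.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        // The 10MB-20MB window is only a guess bracketing the 11MB-15MB files seen so far.
        for (Path p : findPdfsInRange(root, 10 * MB, 20 * MB)) {
            System.out.println((Files.size(p) / MB) + " MB  " + p);
        }
    }
}
```

Running it against the bulk-load directory only lists the candidates. It
says nothing about whether PDFBox will actually fail on them, so each file
would still need to be tested individually.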

In any case, I've logged a bug with PDFBox on SourceForge and referenced 
a few of the PDFs which have these issues.  I'm hoping they'll be able 
to help debug it :)

http://sourceforge.net/tracker/index.php?func=detail&aid=1805929&group_id=78314&atid=552832

I'll post back to this thread once there is a resolution to the issue in 
case others run across this problem as well.

- Tim

Dan Scott wrote:
> I should note to the list that the latest development version of
> PDFBox claims to have solved an Out of Memory Exception error. Sounds
> familiar :)
> 
> I had suggested to Tim privately that maybe he could test it out and let us
> know if it resolves the problem:
> 
> http://www.pdfbox.org/changes.html#version_0.7.4-dev
> 
> On 22/09/2007, Mark Diggory <[EMAIL PROTECTED]> wrote:
>> Yes, I recall some issues in the past which we addressed by upgrading that
>> jar to the latest. DSpace 1.4.2/1.5 should have the latest PDFBox jar
>> (0.7.3) while DSpace 1.4.1 has an older version.
>>
>> Unfortunately, PDFBox is a full-featured PDF editor and not just a text
>> extractor, so it pulls large portions of the PDF into memory when processing
>> it.  If we could find a more stream-based text extractor for PDF files, it
>> would make the memory footprint of FilterMedia much more predictable.
>>
>> -Mark
>>
>>
>>
>> On Sep 22, 2007, at 5:57 AM, Jimmy Zhang wrote:
>>
>> The responsibility of PDFBox is to extract the full text of the PDF file. I
>> wonder whether the problem has to do with that particular PDF file. Do you
>> mean that any PDF file larger than 10MB can cause the problem, or only that
>> one file?
>>
>>  --
>> Website: www.drepository.com
>>
>> On 9/21/07, Mark Diggory <[EMAIL PROTECTED] > wrote:
>>> I would also then recommend trying to get the latest PDFBox and
>>> replace the jar in your lib directory.
>>>
>>> http://sourceforge.net/project/showfiles.php?group_id=78314&package_id=79377
>>>
>>> On Sep 21, 2007, at 9:46 AM, Tim Donohue wrote:
>>>
>>>> Jayan & Mark,
>>>>
>>>> Thanks for the suggestions.  But, our problem is that we're
>>>> currently running Java & dsrun using:
>>>>
>>>> JAVA_OPTS=-Xmx1024M -Xms1024M -XX:NewRatio=2 -Dfile.encoding=UTF-8
>>>>
>>>> (I've modified our local dsrun script to read from the JAVA_OPTS
>>>> environment variable).
>>>>
>>>> So, even setting a maximum heap size of 1GB, we don't seem to be
>>>> able to full text index a 15MB PDF without encountering
>>>> "OutOfMemory: Java heap space" errors.  Strange, I know.  My
>>>> current theory is that there may be a memory leak in the PDFBox
>>>> tools.   I'm still working on a definite diagnosis though.  If no
>>>> one else out there has noticed this with DSpace 1.4.2, then I guess
>>>> it's possible there's something in our local settings (or
>>>> customizations of DSpace) which could be causing this issue.
>>>>
>>>> - Tim
>>>>
>>>> Mark Diggory wrote:
>>>>> We should consider adding more sane defaults. Most machines that
>>>>> DSpace is running on have well over 1Gig of memory available, and
>>>>> it's important to remember this is a maximum heap size and is not
>>>>> taken unless required. I think setting dsrun and the other
>>>>> commandline scripts to 512m (1/2 * 1Gig) would eliminate most
>>>>> outlying cases where PDF docs need to be held in memory.
>>>>> -Mark Diggory
>>>>> On Sep 21, 2007, at 2:10 AM, Jayan Chirayath Kurian wrote:
>>>>>> Hi! Tim,
>>>>>>
>>>>>> Here we faced similar errors while trying out full-text indexing on
>>>>>> DSpace 1.4.1/Windows 2003 Standard Edition. We had roughly 100,000
>>>>>> records. This was rectified once dsrun.bat was given 1000m in place
>>>>>> of the 256m at java -Xmx256m -classpath ........
>>>>>> http://repositorydev.ntu.edu.sg
>>>>>>
>>>>>> Jayan
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: [EMAIL PROTECTED]
>>>>>> [mailto:[EMAIL PROTECTED] On Behalf Of Tim Donohue
>>>>>> Sent: Friday, September 21, 2007 1:58 AM
>>>>>> To: dspace-tech
>>>>>> Subject: [Dspace-tech] OutOfMemory errors during large PDF indexing
>>>>>>
>>>>>> All,
>>>>>>
>>>>>> I'm curious if anyone out there has run into strange OutOfMemory
>>>>>> errors while full-text indexing larger (>10MB) PDF files in DSpace.
>>>>>>
>>>>>> It usually appears as either:
>>>>>>
>>>>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>>>>>
>>>>>> OR
>>>>>>
>>>>>> Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>>>>
>>>>>> I've located the main "problem" PDF in our DSpace instance:
>>>>>> http://hdl.handle.net/2142/2050
>>>>>>
>>>>>> I've also done a large amount of searching/testing based on
>>>>>> recommendations from various sites.  In particular, I've done a
>>>>>> memory dump using JHat
>>>>>> (http://java.sun.com/javase/6/docs/technotes/tools/share/jhat.html),
>>>>>> and it looks like the problem may reside with a potential memory
>>>>>> leak in the 3rd party PDFBox tool used by DSpace 1.4.2.  (In
>>>>>> particular, it *looks* like PDFBox is attempting to load most/all
>>>>>> of the textual content into a giant HashMap)
>>>>>>
>>>>>> Here are the latest settings I've been testing on:
>>>>>>
>>>>>> RHEL 4
>>>>>> Java 1.6.0_02
>>>>>> Postgres 8.1.9
>>>>>> DSpace 1.4.2
>>>>>>
>>>>>> We also have the following JAVA_OPTS settings in place for our JVM:
>>>>>>
>>>>>> JAVA_OPTS=-Xmx1024M -Xms1024M -XX:NewRatio=2 -Dfile.encoding=UTF-8
>>>>>>
>>>>>> (We initially had Xmx and Xms at 512MB, but I bumped it up and we're
>>>>>> still getting the OutOfMemory exception at 1GB!)
>>>>>>
>>>>>> Anyone have any hints/tips or JVM settings to share?  I
>>>>>> personally don't see why PDFBox would need so much JVM memory to
>>>>>> parse a 15MB PDF.  But, the JHat analysis seemed to be pointing to
>>>>>> PDFBox.
>>>>>>
>>>>>> - Tim
>>>>>>
>>>>>> P.S.  an example of the full error stack trace is below:
>>>>>>
>>>>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>>>>>          at java.util.HashMap.resize(Unknown Source)
>>>>>>          at java.util.HashMap.addEntry(Unknown Source)
>>>>>>          at java.util.HashMap.put(Unknown Source)
>>>>>>          at org.fontbox.cmap.CMap.addMapping(CMap.java:132)
>>>>>>          at org.fontbox.cmap.CMapParser.parse(CMapParser.java:153)
>>>>>>          at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:535)
>>>>>>          at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:387)
>>>>>>          at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
>>>>>>          at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
>>>>>>          at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
>>>>>>          at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
>>>>>>          at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
>>>>>>          at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
>>>>>>          at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
>>>>>>          at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
>>>>>>          at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
>>>>>>          at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:114)
>>>>>>          at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:602)
>>>>>>          at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:513)
>>>>>>          at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:461)
>>>>>>          at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:428)
>>>>>>          at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersAllItems(MediaFilterManager.java:391)
>>>>>>          at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:342)
>>>>>>
>>>>>>
>>>>>> -------------------------------------------------------------------------
>>>>>> This SF.net email is sponsored by: Microsoft
>>>>>> Defy all challenges. Microsoft(R) Visual Studio 2005.
>>>>>> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
>>>>>> _______________________________________________
>>>>>> DSpace-tech mailing list
>>>>>> [email protected]
>>>>>> https://lists.sourceforge.net/lists/listinfo/dspace-tech
>>>>> ~~~~~~~~~~~~~
>>>>> Mark R. Diggory - DSpace Systems Manager
>>>>> MIT Libraries, Systems and Technology Services
>>>>> Massachusetts Institute of Technology
>>>> --
>>>>
>>>> ========================================
>>>> Tim Donohue
>>>> Research Programmer, Illinois Digital Environment for
>>>> Access to Learning and Scholarship (IDEALS)
>>>> 135 Grainger Engineering Library
>>>> University of Illinois at Urbana-Champaign
>>>>
>>>> email: [EMAIL PROTECTED]
>>>> web:   http://www.ideals.uiuc.edu
>>>> phone: (217) 333-4648
>>>> fax:   (217) 244-7764
>>>> ========================================
>>>
>>>
>>>
>>
>>
>>
>>
>>
> 
> 

-- 

========================================
Tim Donohue
Research Programmer, Illinois Digital Environment for
Access to Learning and Scholarship (IDEALS)
135 Grainger Engineering Library
University of Illinois at Urbana-Champaign

email: [EMAIL PROTECTED]
web:   http://www.ideals.uiuc.edu
phone: (217) 333-4648
fax:   (217) 244-7764
========================================

