I should note to the list that the latest development version of
PDFBox claims to have solved an Out of Memory Exception error. Sounds
familiar :)

I had suggested to Tim privately that maybe he could test it out and let us
know if it resolves the problem:

http://www.pdfbox.org/changes.html#version_0.7.4-dev
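
Before retesting with a newer jar, it may also be worth confirming that the
heap settings quoted further down actually reach the JVM that dsrun starts,
since the local dsrun script was modified to read JAVA_OPTS. A minimal sketch
(HeapCheck and toMiB are names I made up for illustration, not part of DSpace
or PDFBox) that prints what the running JVM thinks its limits are:

```java
public class HeapCheck {
    /** Converts a byte count to whole mebibytes (hypothetical helper). */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() is the most the JVM will try to use for the heap;
        // it normally tracks -Xmx, though it can report slightly less.
        System.out.println("max heap   (MiB): " + toMiB(rt.maxMemory()));
        // totalMemory() is the heap currently reserved by the JVM.
        System.out.println("total heap (MiB): " + toMiB(rt.totalMemory()));
        // freeMemory() is the unused part of totalMemory().
        System.out.println("free heap  (MiB): " + toMiB(rt.freeMemory()));
    }
}
```

Launching it the same way dsrun launches Java (for example
java $JAVA_OPTS HeapCheck, assuming the options are passed on the command
line) should show a max heap near the configured -Xmx value. If it shows the
JVM default instead, the options are not being picked up.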

On 22/09/2007, Mark Diggory <[EMAIL PROTECTED]> wrote:
> Yes, I recall some issues in the past which we addressed by upgrading that
> jar to the latest. DSpace 1.4.2/1.5 should have the latest PDFBox jar
> (0.7.3) while DSpace 1.4.1 has an older version.
>
> Unfortunately, PDFBox is a full-featured PDF editor and not just a text
> extractor, so it pulls large portions of the PDF into memory when processing
> it.  If we could find a more stream-based text extractor for PDF files, it
> would keep the memory footprint of FilterMedia much closer to constant.
>
> -Mark
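
On Mark's point about a stream-based extractor: the appeal is that memory
stays bounded by a buffer rather than by the size of the document. A minimal
sketch of that pattern, not tied to PDFBox or any PDF library (StreamingCopy
and copy are hypothetical names for illustration), might look like this. It
only shows the bounded-buffer idea; it does not parse PDFs, and on its own it
would not avoid the HashMap growth in the CMap parsing shown in Tim's stack
trace below.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

public class StreamingCopy {
    /**
     * Copies characters from in to out through a fixed-size buffer.
     * Memory used by the copy itself is bounded by bufferSize,
     * regardless of how much text flows through.
     * Returns the number of characters copied.
     */
    static long copy(Reader in, Writer out, int bufferSize) throws IOException {
        char[] buf = new char[bufferSize];
        long total = 0;
        int n;
        // read() returns -1 at end of stream
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
            total += n;
        }
        out.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-ins for an extractor's output and the index's input.
        Reader in = new StringReader("extracted text would arrive here in pieces");
        Writer out = new StringWriter();
        long copied = copy(in, out, 8);
        System.out.println(copied + " chars copied");
    }
}
```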
>
>
>
> On Sep 22, 2007, at 5:57 AM, Jimmy Zhang wrote:
>
> The responsibility of PDFBox is to extract the full text of the PDF file. I
> wonder whether this has to do with that particular PDF file. Do you mean that
> any PDF file larger than 10 MB can cause the problem, or only that one file?
>
>  --
> Website: www.drepository.com
>
> On 9/21/07, Mark Diggory <[EMAIL PROTECTED] > wrote:
> > I would also then recommend trying to get the latest PDFBox and
> > replace the jar in your lib directory.
> >
> > http://sourceforge.net/project/showfiles.php?group_id=78314&package_id=79377
> >
> > On Sep 21, 2007, at 9:46 AM, Tim Donohue wrote:
> >
> > > Jayan & Mark,
> > >
> > > Thanks for the suggestions.  But, our problem is that we're
> > > currently running Java & dsrun using:
> > >
> > > JAVA_OPTS=-Xmx1024M -Xms1024M -XX:NewRatio=2 -Dfile.encoding=UTF-8
> > >
> > > (I've modified our local dsrun script to read from the JAVA_OPTS
> > > environment variable).
> > >
> > > So, even with a maximum heap size of 1GB, we don't seem to be
> > > able to full-text index a 15MB PDF without encountering
> > > "OutOfMemory: Java heap space" errors.  Strange, I know.  My
> > > current theory is that there may be a memory leak in the PDFBox
> > > tools.   I'm still working on a definite diagnosis though.  If no
> > > one else out there has noticed this with DSpace 1.4.2, then I guess
> > > it's possible there's something in our local settings (or
> > > customizations of DSpace) which could be causing this issue.
> > >
> > > - Tim
> > >
> > > Mark Diggory wrote:
> > >> We should consider adding more sane defaults. Most machines that
> > >> DSpace is running on have well over 1 GB of memory available, and
> > >> it's important to remember this is a maximum heap size and is not
> > >> taken unless required. I think setting dsrun and the other
> > >> command-line scripts to 512m (1/2 * 1 GB) would eliminate most
> > >> outlying cases where PDF docs need to be held in memory.
> > >> -Mark Diggory
> > >> On Sep 21, 2007, at 2:10 AM, Jayan Chirayath Kurian wrote:
> > >>> Hi! Tim,
> > >>>
> > >>> Here we faced similar errors while trying out full-text indexing on
> > >>> DSpace 1.4.1 / Windows 2003 Standard Edition. We had roughly 100,000
> > >>> records. This was rectified once dsrun.bat was given 1000m in place of
> > >>> the 256m at java -Xmx256m -classpath ........
> > >>> http://repositorydev.ntu.edu.sg
> > >>>
> > >>> Jayan
> > >>>
> > >>>
> > >>> -----Original Message-----
> > >>> From: [EMAIL PROTECTED]
> > >>> [mailto:[EMAIL PROTECTED] On Behalf Of Tim Donohue
> > >>> Sent: Friday, September 21, 2007 1:58 AM
> > >>> To: dspace-tech
> > >>> Subject: [Dspace-tech] OutOfMemory errors during large PDF indexing
> > >>>
> > >>> All,
> > >>>
> > >>> I'm curious if anyone out there has run into strange OutOfMemory errors
> > >>> while full-text indexing larger (>10MB) PDF files in DSpace.
> > >>>
> > >>> It usually appears as either:
> > >>>
> > >>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> > >>>
> > >>> OR
> > >>>
> > >>> Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
> > >>>
> > >>> I've located the main "problem" PDF in our DSpace instance:
> > >>> http://hdl.handle.net/2142/2050
> > >>>
> > >>> I've also done a large amount of searching/testing based on
> > >>> recommendations from various sites.  In particular, I've done a memory
> > >>> dump using JHat
> > >>> (http://java.sun.com/javase/6/docs/technotes/tools/share/jhat.html), and
> > >>> it looks like the problem may reside with a potential memory leak in the
> > >>> 3rd party PDFBox tool used by DSpace 1.4.2.  (In particular, it *looks*
> > >>> like PDFBox is attempting to load most/all of the textual content into a
> > >>> giant HashMap)
> > >>>
> > >>> Here are the latest settings I've been testing on:
> > >>>
> > >>> RHEL 4
> > >>> Java 1.6.0_02
> > >>> Postgres 8.1.9
> > >>> DSpace 1.4.2
> > >>>
> > >>> We also have the following JAVA_OPTS settings in place for our JVM:
> > >>>
> > >>> JAVA_OPTS=-Xmx1024M -Xms1024M -XX:NewRatio=2 -Dfile.encoding=UTF-8
> > >>>
> > >>> (We initially had Xmx and Xms at 512MB, but I bumped it up and we're
> > >>> still getting the OutOfMemory exception at 1GB!)
> > >>>
> > >>> Anyone have any hints/tips or JVM settings to share?  I personally don't
> > >>> see why PDFBox would need so much JVM memory to parse a 15MB PDF.  But,
> > >>> the JHat analysis seemed to be pointing to PDFBox.
> > >>>
> > >>> - Tim
> > >>>
> > >>> P.S.  an example of the full error stack trace is below:
> > >>>
> > >>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> > >>>          at java.util.HashMap.resize(Unknown Source)
> > >>>          at java.util.HashMap.addEntry(Unknown Source)
> > >>>          at java.util.HashMap.put(Unknown Source)
> > >>>          at org.fontbox.cmap.CMap.addMapping(CMap.java:132)
> > >>>          at org.fontbox.cmap.CMapParser.parse(CMapParser.java:153)
> > >>>          at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:535)
> > >>>          at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:387)
> > >>>          at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
> > >>>          at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
> > >>>          at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
> > >>>          at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
> > >>>          at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
> > >>>          at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
> > >>>          at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
> > >>>          at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
> > >>>          at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
> > >>>          at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:114)
> > >>>          at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:602)
> > >>>          at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:513)
> > >>>          at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:461)
> > >>>          at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:428)
> > >>>          at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersAllItems(MediaFilterManager.java:391)
> > >>>          at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:342)
> > >>>
> > >>>
> > >>> -------------------------------------------------------------------------
> > >>> This SF.net email is sponsored by: Microsoft
> > >>> Defy all challenges. Microsoft(R) Visual Studio 2005.
> > >>> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
> > >>> _______________________________________________
> > >>> DSpace-tech mailing list
> > >>> [email protected]
> > >>> https://lists.sourceforge.net/lists/listinfo/dspace-tech
> > >>>
> > >> ~~~~~~~~~~~~~
> > >> Mark R. Diggory - DSpace Systems Manager
> > >> MIT Libraries, Systems and Technology Services
> > >> Massachusetts Institute of Technology
> > >
> > > --
> > >
> > > ========================================
> > > Tim Donohue
> > > Research Programmer, Illinois Digital Environment for
> > > Access to Learning and Scholarship (IDEALS)
> > > 135 Grainger Engineering Library
> > > University of Illinois at Urbana-Champaign
> > >
> > > email: [EMAIL PROTECTED]
> > > web:   http://www.ideals.uiuc.edu
> > > phone: (217) 333-4648
> > > fax:   (217) 244-7764
> > > ========================================
> >
> >
> >
> >
>


-- 
Dan Scott
Laurentian University

