I should note to the list that the latest development version of PDFBox claims to have solved an Out of Memory Exception error. Sounds familiar :)
I had suggested to Tim privately that maybe he could test it out and let us know if it resolves the problem:

http://www.pdfbox.org/changes.html#version_0.7.4-dev

On 22/09/2007, Mark Diggory <[EMAIL PROTECTED]> wrote:
> Yes, I recall some issues in the past which we addressed by upgrading that
> jar to the latest. DSpace 1.4.2/1.5 should have the latest PDFBox jar
> (0.7.3) while DSpace 1.4.1 has an older version.
>
> Unfortunately, PDFBox is a more full featured PDF editor and not just a text
> extractor, it pulls large portions of the PDF into memory when processing
> it. If we could find a more stream based text extractor for pdf files, it
> would make the memory footprint much more fixed for FilterMedia.
>
> -Mark
>
> On Sep 22, 2007, at 5:57 AM, Jimmy Zhang wrote:
>
> The responsibility of PDFBox is to extract the full text of the pdf file. I
> am wondering it maybe has to do with the pdf file. Do you mean any pdf files
> whose size more than 10M can cause problem or only that pdf file?
>
> --
> Website: www.drepository.com
>
> On 9/21/07, Mark Diggory <[EMAIL PROTECTED]> wrote:
> > I would also then recommend trying to get the latest PDFBox and
> > replace the jar in your lib directory.
> >
> > http://sourceforge.net/project/showfiles.php?group_id=78314&package_id=79377
> >
> > On Sep 21, 2007, at 9:46 AM, Tim Donohue wrote:
> >
> > > Jayan & Mark,
> > >
> > > Thanks for the suggestions. But, our problem is that we're
> > > currently running Java & dsrun using:
> > >
> > > JAVA_OPTS=-Xmx1024M -Xms1024M -XX:NewRatio=2 -Dfile.encoding=UTF-8
> > >
> > > (I've modified our local dsrun script to read from the JAVA_OPTS
> > > environment variable).
> > >
> > > So, even setting a maximum heap size of 1GB, we don't seem to be
> > > able to full text index a 15MB PDF without encountering
> > > "OutOfMemory: Java heap space" errors. Strange, I know. My
> > > current theory is that there may be a memory leak in the PDFBox
> > > tools. I'm still working on a definite diagnosis though. If no
> > > one else out there has noticed this with DSpace 1.4.2, then I guess
> > > it's possible there's something in our local settings (or
> > > customizations of DSpace) which could be causing this issue.
> > >
> > > - Tim
> > >
> > > Mark Diggory wrote:
> > >> We should consider adding more sane defaults, most machines that
> > >> DSpace is running on have well over 1Gig of memory available and
> > >> its important to remember this is a maximum heap size and is not
> > >> take unless required. I think setting dsrun and the other
> > >> commandline scripts to be 512m (1/2 * 1Gig) would eliminate most
> > >> outlying cases where PDF docs need to be held in memory.
> > >> -Mark Diggory
> > >> On Sep 21, 2007, at 2:10 AM, Jayan Chirayath Kurian wrote:
> > >>> Hi! Tim,
> > >>>
> > >>> Here we faced similar errors while trying out full-text indexing on
> > >>> DSpace 1.4.1/windows 2003 standard edition. We had roughly 100,000
> > >>> records. This was rectified once dsrun.bat was given 1000m at java
> > >>> -Xmx256m -classpath ........
> > >>> http://repositorydev.ntu.edu.sg
> > >>>
> > >>> Jayan
> > >>>
> > >>> -----Original Message-----
> > >>> From: [EMAIL PROTECTED]
> > >>> [mailto:[EMAIL PROTECTED] On Behalf Of Tim Donohue
> > >>> Sent: Friday, September 21, 2007 1:58 AM
> > >>> To: dspace-tech
> > >>> Subject: [Dspace-tech] OutOfMemory errors during large PDF indexing
> > >>>
> > >>> All,
> > >>>
> > >>> I'm curious if anyone out there has run into strange OutOfMemory errors
> > >>> while full-text indexing larger (>10MB) PDF files in DSpace.
> > >>>
> > >>> It usually appears as either:
> > >>>
> > >>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> > >>>
> > >>> OR
> > >>>
> > >>> Exception in thread "main" java.lang.OutOfMemoryError: GC Overhead limit exceeded
> > >>>
> > >>> I've located the main "problem" PDF in our DSpace instance:
> > >>> http://hdl.handle.net/2142/2050
> > >>>
> > >>> I've also done a large amount of searching/testing based on
> > >>> recommendations from various sites. In particular, I've done a memory
> > >>> dump using JHat
> > >>> (http://java.sun.com/javase/6/docs/technotes/tools/share/jhat.html), and
> > >>> it looks like the problem may reside with a potential memory leak in the
> > >>> 3rd party PDFBox tool used by DSpace 1.4.2. (In particular, it *looks*
> > >>> like PDFBox is attempting to load most/all of the textual content into a
> > >>> giant HashMap)
> > >>>
> > >>> Here's the latest settings I've been testing on:
> > >>>
> > >>> RHEL 4
> > >>> Java 1.6.0_02
> > >>> Postgres 8.1.9
> > >>> DSpace 1.4.2
> > >>>
> > >>> We also have the following JAVA_OPTS settings in place for our JVM:
> > >>>
> > >>> JAVA_OPTS=-Xmx1024M -Xms1024M -XX:NewRatio=2 -Dfile.encoding=UTF-8
> > >>>
> > >>> (We initially had Xmx and Xms at 512MB, but I bumped it up and we're
> > >>> still getting the OutOfMemory exception at 1GB!)
> > >>>
> > >>> Anyone have any hints/tips or JVM settings to share? I personally don't
> > >>> see why PDFBox would need so much JVM memory to parse a 15MB PDF. But,
> > >>> the JHat analysis seemed to be pointing to PDFBox.
> > >>>
> > >>> - Tim
> > >>>
> > >>> P.S. an example of the full error stack trace is below:
> > >>>
> > >>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> > >>>     at java.util.HashMap.resize(Unknown Source)
> > >>>     at java.util.HashMap.addEntry(Unknown Source)
> > >>>     at java.util.HashMap.put(Unknown Source)
> > >>>     at org.fontbox.cmap.CMap.addMapping(CMap.java:132)
> > >>>     at org.fontbox.cmap.CMapParser.parse(CMapParser.java:153)
> > >>>     at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:535)
> > >>>     at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:387)
> > >>>     at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
> > >>>     at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
> > >>>     at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
> > >>>     at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
> > >>>     at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
> > >>>     at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
> > >>>     at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
> > >>>     at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
> > >>>     at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
> > >>>     at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:114)
> > >>>     at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:602)
> > >>>     at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:513)
> > >>>     at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:461)
> > >>>     at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:428)
> > >>>     at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersAllItems(MediaFilterManager.java:391)
> > >>>     at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:342)
> > >>>
> > >>> -------------------------------------------------------------------------
> > >>> This SF.net email is sponsored by: Microsoft
> > >>> Defy all challenges. Microsoft(R) Visual Studio 2005.
> > >>> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
> > >>> _______________________________________________
> > >>> DSpace-tech mailing list
> > >>> [email protected]
> > >>> https://lists.sourceforge.net/lists/listinfo/dspace-tech
> > >> ~~~~~~~~~~~~~
> > >> Mark R. Diggory - DSpace Systems Manager
> > >> MIT Libraries, Systems and Technology Services
> > >> Massachusetts Institute of Technology
> > >
> > > --
> > >
> > > ========================================
> > > Tim Donohue
> > > Research Programmer, Illinois Digital Environment for
> > > Access to Learning and Scholarship (IDEALS)
> > > 135 Grainger Engineering Library
> > > University of Illinois at Urbana-Champaign
> > >
> > > email: [EMAIL PROTECTED]
> > > web: http://www.ideals.uiuc.edu
> > > phone: (217) 333-4648
> > > fax: (217) 244-7764
> > > ========================================
>
--
Dan Scott
Laurentian University
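
P.S. Since part of this thread turns on whether the -Xmx value in JAVA_OPTS actually reaches the JVM that dsrun starts (Tim had to modify his local dsrun script for that), here is a minimal sketch of a stand-alone check. The class name HeapCheck and its helper methods are my own illustration, not part of DSpace or PDFBox; the only library call used is the standard java.lang.Runtime API. The 10% slack is an assumption, on the basis that a JVM can report a maximum somewhat below the -Xmx figure.

```java
/**
 * Hypothetical helper (not part of DSpace or PDFBox): prints the heap limits
 * the current JVM is really running with, so you can confirm that the
 * -Xmx/-Xms values in JAVA_OPTS were picked up.
 */
public class HeapCheck {

    /** Converts a byte count to whole mebibytes. */
    public static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    /**
     * Returns true when the reported maximum heap is at least 90% of the
     * expected size. The 10% slack is an assumption to tolerate a JVM
     * reporting slightly less than the configured -Xmx.
     */
    public static boolean heapAtLeast(long maxBytes, long expectedMiB) {
        return toMiB(maxBytes) >= expectedMiB * 9 / 10;
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long max = rt.maxMemory();

        System.out.println("Max heap (MiB):       " + toMiB(max));
        System.out.println("Allocated heap (MiB): " + toMiB(rt.totalMemory()));
        System.out.println("Free in heap (MiB):   " + toMiB(rt.freeMemory()));

        // Expected size in MiB may be passed as the first argument; default 1024.
        long expected = args.length > 0 ? Long.parseLong(args[0]) : 1024L;
        if (heapAtLeast(max, expected)) {
            System.out.println("Heap limit looks consistent with -Xmx" + expected + "M");
        } else {
            System.out.println("Heap limit is below " + expected
                    + " MiB; check that JAVA_OPTS is passed to the java command");
        }
    }
}
```

Compile it and run it with the same options the indexing job gets, for example `java $JAVA_OPTS HeapCheck 1024`. If the reported maximum is far below what JAVA_OPTS asks for, the script is probably not passing the options through, which would be worth ruling out before blaming PDFBox.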

