[ https://issues.apache.org/jira/browse/PDFBOX-313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619610#action_12619610 ]
Jukka Zitting commented on PDFBOX-313:
--------------------------------------

[Comment on SourceForge]
Date: 2008-06-09 19:16
Sender: nobody
Logged In: NO

I'm getting the exact same exception:

java.lang.OutOfMemoryError: Java heap space
    at java.util.HashMap.resize(HashMap.java:462)
    at java.util.HashMap.addEntry(HashMap.java:755)
    at java.util.HashMap.put(HashMap.java:385)
    at org.fontbox.cmap.CMap.addMapping(CMap.java:131)
    at org.fontbox.cmap.CMapParser.parse(CMapParser.java:202)
    at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:510)
    at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:381)
    at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:345)
    at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
    at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:506)
    at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:219)
    at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:178)
    at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:339)
    at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:263)
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:219)
    at us.fed.nmcourt.common.pdfbox.NmdLucenePDFDocument.addContent(NmdLucenePDFDocument.java:456)

I see that the problem is in FontBox. Is it an infinite loop, or is there just too much data to parse?

Please let me know where I can upload the PDF so that you can test this out.

James

> OutOfMemoryError for larger PDF text extraction
> -----------------------------------------------
>
>                 Key: PDFBOX-313
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-313
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1805929
> Originally submitted by tdonohue on 2007-10-01 13:51.
>
> Hello,
>
> I'm using PDFBox 0.7.3, which is distributed with DSpace (www.dspace.org)
> version 1.4.2.
> Currently, I'm running into OutOfMemoryError exceptions
> whenever I attempt text extraction from a few larger PDFs (>10MB). I've also
> just tried replacing PDFBox 0.7.3 with your latest nightly build (from Oct
> 1), and the error still seems to be happening.
>
> My JVM options are currently set to:
> -Xmx1024M -Xms1024M -XX:NewRatio=2 -Dfile.encoding=UTF-8
>
> Here are a few of the problem PDFs:
> 15MB PDF:
> https://test.ideals.uiuc.edu/bitstream/2142/2050/1/tr05.pdf
> 13MB PDF:
> https://test.ideals.uiuc.edu/bitstream/2142/1936/1/RRE06.PDF
>
> Here's an example error stacktrace:
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>     at java.util.HashMap.addEntry(HashMap.java:753)
>     at java.util.HashMap.put(HashMap.java:385)
>     at org.fontbox.cmap.CMap.addMapping(CMap.java:131)
>     at org.fontbox.cmap.CMapParser.parse(CMapParser.java:202)
>     at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:509)
>     at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:380)
>     at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:343)
>     at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
>     at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:497)
>     at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:218)
>     at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:177)
>     at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:339)
>     at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:263)
>     at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:219)
>     at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:152)
>     at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:114)
>     at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:602)
>     at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:513)
>     at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:461)
>     at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:428)
>     at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersCollection(MediaFilterManager.java:417)
>     at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:359)
>
> Finally, here's how the DSpace API is calling PDFBox:
> PDFTextStripper pts = new PDFTextStripper();
> PDFParser parser = null;
> String extractedText = null;
> try
> {
>     parser = new PDFParser(source);
>     parser.parse();
>     extractedText = pts.getText(new PDDocument(parser.getDocument()));
> }
> finally
> {
>     try
>     {
>         parser.getDocument().close();
>     }
>     catch(Exception e)
>     {
>         log.error("Error closing temporary PDF file: " + e.getMessage(), e);
>     }
> }
>
> [comment on SourceForge]
> Originally sent by tdonohue.
> Logged In: YES
> user_id=1320825
> Originator: YES
> I neglected to mention that both of these PDFs were initially image-based and were
> recently OCRed using Adobe Acrobat 8 Pro. I'm not sure that would matter for
> PDFBox to perform text extraction, but it's another commonality between these
> PDFs.
> Thanks in advance for any help you can provide!
> - Tim

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.