[ https://issues.apache.org/jira/browse/PDFBOX-5606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728911#comment-17728911 ]
Tilman Hausherr commented on PDFBOX-5606:
-----------------------------------------

The attached file [^819127-p1.pdf] throws "IOException: Stream closed". The file has a messy content stream. parseNextToken() closes the content stream when an error occurs, but it sometimes calls itself. Because the content stream is then already closed, the method returns null, which is reported together with the position. Getting the position of a closed stream throws the exception.

> PDFTextStripper runs out of memory in 2.0.28 but not in 2.0.27 same code
> ------------------------------------------------------------------------
>
>                 Key: PDFBOX-5606
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5606
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.28
>            Reporter: Joe Li
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>              Labels: memory-bug
>             Fix For: 2.0.29
>
>         Attachments: 590031dc-2131-4a00-a936-d1175b7b926c.pdf, 819127-p1.pdf,
>                      pdfbox-2.0.27.png, pdfbox-2.0.28.png, screenshot-1.png, screenshot-2.png
>
>
> Given the following simplified Groovy code (for succinctness over Java)
>
> {code:java}
> // Groovy 4.0.12
> import org.apache.pdfbox.pdmodel.PDDocument
> import org.apache.pdfbox.pdmodel.PDPage
> import org.apache.pdfbox.text.PDFTextStripperByArea
> import java.awt.geom.Rectangle2D
>
> int GRID_WIDTH = 10
> int GRID_HEIGHT = 10
>
> PDDocument.load(new File('./test.pdf')).withCloseable { doc ->
>     doc.pages.eachWithIndex { PDPage page, int pageIndex ->
>         int rows = Math.ceil((page.mediaBox.height as int) / GRID_HEIGHT)
>         int columns = Math.ceil((page.mediaBox.width as int) / GRID_WIDTH)
>         println "processing page $pageIndex, rows = $rows, columns = $columns"
>         def rectangles = [:]
>         (0..<rows).each { rowIndex ->
>             (0..<columns).each { colIndex ->
>                 rectangles["${rowIndex * columns + colIndex}"] = new Rectangle2D.Float(colIndex * GRID_WIDTH, rowIndex * GRID_HEIGHT, GRID_WIDTH, GRID_HEIGHT)
>             }
>         }
>         rectangles.each { key, rect ->
>             PDFTextStripperByArea textStripper = new PDFTextStripperByArea()
>             textStripper.addRegion(key, rect)
>             textStripper.extractRegions(page)
>         }
>     }
> }{code}
>
>
> PDFBox version 2.0.28 uses ever increasing memory, but version 2.0.27 does not.
> The test.pdf file I am using can be downloaded from Apple's SEC filings page,
> `8-K` from [https://investor.apple.com/sec-filings/default.aspx], but any PDF
> with 10+ pages and a lot of text will work.
> I have attached profiler screenshots of the difference.
> Thanks in advance for your help.
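
The failure chain described in the comment at the top of this message (a nested parseNextToken() call closes the stream, the outer call then sees null and asks for the position) can be illustrated with a toy model. This is only a sketch under assumptions: `ToySource`, the `'!'` and `'['` conventions, and this `parseNextToken` are hypothetical stand-ins and are not PDFBox code or the PDFBox API.

```java
import java.io.IOException;

public class ClosedStreamSketch {

    /** Toy stand-in for a content stream source; NOT a PDFBox class. */
    static final class ToySource {
        private final String data;
        private int pos = 0;
        private boolean closed = false;

        ToySource(String data) {
            this.data = data;
        }

        /** Returns the next char, or -1 at end of data or once closed. */
        int read() {
            return (closed || pos >= data.length()) ? -1 : data.charAt(pos++);
        }

        /** Refuses to report a position once the source is closed. */
        long getPosition() throws IOException {
            if (closed) {
                throw new IOException("Stream closed");
            }
            return pos;
        }

        void close() {
            closed = true;
        }
    }

    /** Hypothetical parser step: '!' is malformed input, '[' triggers a nested call. */
    static String parseNextToken(ToySource src) throws IOException {
        int c = src.read();
        if (c == -1) {
            return null;                       // nothing left to read
        }
        if (c == '!') {
            src.close();                       // error handling closes the source
            return null;
        }
        if (c == '[') {
            String inner = parseNextToken(src); // the method calls itself
            if (inner == null) {
                // Reporting the null token needs the position; this call
                // throws if the nested call has already closed the source.
                throw new IOException("null token at offset " + src.getPosition());
            }
            return "[" + inner + "]";
        }
        return String.valueOf((char) c);
    }

    /** Parses one token and returns either the token or the error text. */
    static String run(String content) {
        try {
            return parseNextToken(new ToySource(content));
        } catch (IOException e) {
            return "IOException: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(run("[a")); // well-formed nested token
        System.out.println(run("["));  // truncated input, source still open
        System.out.println(run("[!")); // nested call closed the source first
    }
}
```

In this toy model the truncated input still yields a useful message containing the offset, while the malformed input loses it: the intended diagnostic is replaced by the "Stream closed" error raised while building that diagnostic.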