[jira] [Commented] (PDFBOX-5606) PDFTextStripper runs out of memory in 2.0.28 but not in 2.0.27 same code

Tilman Hausherr (Jira) Fri, 02 Jun 2023 20:53:10 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-5606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728911#comment-17728911
 ]


Tilman Hausherr commented on PDFBOX-5606:
-----------------------------------------

The attached file  [^819127-p1.pdf] throws "IOException: Stream closed". The 
file has a messy content stream. parseNextToken() is closing the content stream 
if an error occurs, but it sometimes calls itself. Because of the closed 
content stream the method returns null, which is reported with the position. 
Trying to get the position on a closed stream throws the exception. 

> PDFTextStripper runs out of memory in 2.0.28 but not in 2.0.27 same code
> ------------------------------------------------------------------------
>
>                 Key: PDFBOX-5606
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5606
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.28
>            Reporter: Joe Li
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>              Labels: memory-bug
>             Fix For: 2.0.29
>
>         Attachments: 590031dc-2131-4a00-a936-d1175b7b926c.pdf, 819127-p1.pdf, 
> pdfbox-2.0.27.png, pdfbox-2.0.28.png, screenshot-1.png, screenshot-2.png
>
>
> Given the follwing simplified Groovy code (for succinctness over Java)
>  
> {code:java}
> // Groovy 4.0.12
> import org.apache.pdfbox.pdmodel.PDDocument
> import org.apache.pdfbox.pdmodel.PDPage
> import org.apache.pdfbox.text.PDFTextStripperByArea
> import java.awt.geom.Rectangle2D
> int GRID_WIDTH = 10
> int GRID_HEIGHT = 10
> PDDocument.load(new File('./test.pdf')).withCloseable { doc ->
>     doc.pages.eachWithIndex { PDPage page, int pageIndex ->
>         int rows = Math.ceil((page.mediaBox.height as int) /GRID_HEIGHT)
>         int columns = Math.ceil((page.mediaBox.width as int) /GRID_WIDTH)
>         println "processing page $pageIndex, rows = $rows, columns = $columns"
>         def rectangles = [:]
>         (0..<rows).each {rowIndex ->
>             (0..<columns).each { colIndex ->
>                 rectangles["${rowIndex * columns + colIndex}"] = new 
> Rectangle2D.Float(colIndex * GRID_WIDTH, rowIndex * GRID_HEIGHT, GRID_WIDTH, 
> GRID_HEIGHT)
>             }
>         }
>         rectangles.each { key, rect ->
>             PDFTextStripperByArea textStripper = new PDFTextStripperByArea()
>             textStripper.addRegion(key, rect)
>             textStripper.extractRegions(page)
>         }
>     }
> }{code}
>  
>  
> PDFBox version 2.0.28 uses ever increasing memory, but version 2.0.27 does 
> not. 
> The test.pdf file I am using can be downloaded from Apple SEC filings page, 
> `8-K` from [https://investor.apple.com/sec-filings/default.aspx], but any 10+ 
> page pdf with a lot of text will work. 
> I have attached profiler screenshots of the difference. 
> Thanks in advance for your help. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (PDFBOX-5606) PDFTextStripper runs out of memory in 2.0.28 but not in 2.0.27 same code

Reply via email to