[jira] [Updated] (PDFBOX-5606) PDFTextStripper runs out of memory in 2.0.28 but not in 2.0.27 same code

Joe Li (Jira) Fri, 19 May 2023 08:58:00 -0700


     [ https://issues.apache.org/jira/browse/PDFBOX-5606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Joe Li updated PDFBOX-5606:
---------------------------
    Description: 
Given the following simplified Groovy code (used instead of Java for succinctness):

 
{code:java}
// Groovy 4.0.12
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.pdmodel.PDPage
import org.apache.pdfbox.text.PDFTextStripperByArea
import java.awt.geom.Rectangle2D

int GRID_WIDTH = 10
int GRID_HEIGHT = 10

PDDocument.load(new File('./test.pdf')).withCloseable { doc ->
    doc.pages.eachWithIndex { PDPage page, int pageIndex ->
        int rows = Math.ceil((page.mediaBox.height as int) /GRID_HEIGHT)
        int columns = Math.ceil((page.mediaBox.width as int) /GRID_WIDTH)
        println "processing page $pageIndex, rows = $rows, columns = $columns"
        def rectangles = [:]
        (0..<rows).each {rowIndex ->
            (0..<columns).each { colIndex ->
                rectangles["${rowIndex * columns + colIndex}"] = new Rectangle2D.Float(
                        colIndex * GRID_WIDTH, rowIndex * GRID_HEIGHT, GRID_WIDTH, GRID_HEIGHT)
            }
        }
        rectangles.each { key, rect ->
            PDFTextStripperByArea textStripper = new PDFTextStripperByArea()
            textStripper.addRegion(key, rect)
            textStripper.extractRegions(page)
        }
    }
}{code}
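
For readers who prefer plain Java, the grid-building step of the snippet above can be sketched on its own with only the JDK. This is a minimal illustration, not code from the report: the class name `GridRegions`, the method `build`, and the US Letter page size (612 x 792 points) are assumptions, and the PDFBox calls are left out so the block runs without the library.

```java
import java.awt.geom.Rectangle2D;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; it is not part of PDFBox.
final class GridRegions {
    private GridRegions() {}

    /** Builds the same keyed grid of cells as the nested Groovy loops. */
    static Map<String, Rectangle2D> build(float pageWidth, float pageHeight,
                                          int cellWidth, int cellHeight) {
        // Mirror the Groovy code: truncate the page size to int, then round the cell count up.
        int rows = (int) Math.ceil((int) pageHeight / (double) cellHeight);
        int columns = (int) Math.ceil((int) pageWidth / (double) cellWidth);

        Map<String, Rectangle2D> regions = new LinkedHashMap<>();
        for (int row = 0; row < rows; row++) {
            for (int col = 0; col < columns; col++) {
                // Region name is the cell's row-major index, as in the report.
                String key = String.valueOf(row * columns + col);
                regions.put(key, new Rectangle2D.Float(
                        col * cellWidth, row * cellHeight, cellWidth, cellHeight));
            }
        }
        return regions;
    }

    public static void main(String[] args) {
        // Assumed example size: a US Letter media box of 612 x 792 points.
        Map<String, Rectangle2D> regions = build(612f, 792f, 10, 10);
        System.out.println("regions per page: " + regions.size());
    }
}
```

Under that assumed page size the grid has 80 rows and 62 columns, so the extraction loop would create a stripper and call `extractRegions` 4,960 times per page.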
 

 

With PDFBox 2.0.28, memory use grows continuously while this code runs; with 2.0.27 it does not. 

The test.pdf file I am using is an `8-K` filing downloaded from Apple's SEC filings page, 
[https://investor.apple.com/sec-filings/default.aspx], but any PDF with 10 or more 
pages and a lot of text will work. 

I have attached profiler screenshots of the difference. 
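
Without a profiler, the same growth can be checked roughly from code by sampling the used heap after each page. The following is a minimal sketch using only the JDK; the class `HeapSampler` and its methods are hypothetical names introduced here, and the readings are approximate because `Runtime.gc()` is only a hint to the JVM.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only; it is not part of PDFBox.
final class HeapSampler {
    private final List<Long> samples = new ArrayList<>();

    /** Records and returns the heap bytes in use right now. */
    long sample() {
        Runtime rt = Runtime.getRuntime();
        rt.gc(); // only a hint to the JVM, so readings stay approximate
        long used = rt.totalMemory() - rt.freeMemory();
        samples.add(used);
        return used;
    }

    /** Read-only view of all samples taken so far, in order. */
    List<Long> samples() {
        return Collections.unmodifiableList(samples);
    }

    /** Returns true when each value is larger than the one before it. */
    static boolean strictlyIncreasing(List<Long> values) {
        for (int i = 1; i < values.size(); i++) {
            if (values.get(i) <= values.get(i - 1)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        HeapSampler sampler = new HeapSampler();
        for (int pageIndex = 0; pageIndex < 3; pageIndex++) {
            // Placeholder: the per-page region extraction from the report would run here.
            System.out.printf("page %d: %d KiB in use%n", pageIndex, sampler.sample() / 1024);
        }
        System.out.println("grew on every page: " + strictlyIncreasing(sampler.samples()));
    }
}
```

Calling `sample()` once per page inside the extraction loop and comparing the output under 2.0.27 and 2.0.28 would give a text record of the difference alongside the screenshots.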

Thanks in advance for your help. 

  was:
Given the following simplified Groovy code (for succinctness over Java)



 
{code:java}
// Groovy 4.0.12
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.pdmodel.PDPage
import org.apache.pdfbox.text.PDFTextStripperByArea
import java.awt.geom.Rectangle2D

int GRID_WIDTH = 10
int GRID_HEIGHT = 10

PDDocument.load(new File('./test.pdf')).withCloseable { doc ->
    doc.pages.eachWithIndex { PDPage page, int pageIndex ->
        int rows = Math.ceil((page.mediaBox.height as int) /GRID_HEIGHT)
        int columns = Math.ceil((page.mediaBox.width as int) /GRID_WIDTH)
        println "processing page $pageIndex, rows = $rows, columns = $columns"
        def rectangles = [:]
        (0..<rows).each {rowIndex ->
            (0..<columns).each { colIndex ->
                rectangles["${rowIndex * columns + colIndex}"] = new Rectangle2D.Float(
                        colIndex * GRID_WIDTH, rowIndex * GRID_HEIGHT, GRID_WIDTH, GRID_HEIGHT)
            }
        }
        rectangles.each { key, rect ->
            PDFTextStripperByArea textStripper = new PDFTextStripperByArea()
            textStripper.addRegion(key, rect)
            textStripper.extractRegions(page)
        }
    }
}{code}
 

 

PDFBox version 2.0.28 uses ever increasing memory, but version 2.0.27 does not. 

The test.pdf file I am using can be downloaded from Apple SEC filings page, 
`8-K` from [here |[https://investor.apple.com/sec-filings/default.aspx],] but 
any 10+ page pdf with a lot of text will work. 

I have attached profiler screenshots of the difference. 

Thanks in advance for your help. 


> PDFTextStripper runs out of memory in 2.0.28 but not in 2.0.27 same code
> ------------------------------------------------------------------------
>
>                 Key: PDFBOX-5606
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5606
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.28
>            Reporter: Joe Li
>            Priority: Major
>              Labels: memory-bug
>         Attachments: pdfbox-2.0.27.png, pdfbox-2.0.28.png
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
