[
https://issues.apache.org/jira/browse/PDFBOX-5606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17725037#comment-17725037
]
Joe Li commented on PDFBOX-5606:
--------------------------------
[~tilman] Below is the Java code. Please change the PDF file path to the actual
location before running it. Thanks!
{code:java}
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripperByArea;

import java.awt.geom.Rectangle2D;
import java.io.File;

public class App {
    private static final int GRID_WIDTH = 10;
    private static final int GRID_HEIGHT = 10;

    public static void main(String[] args) {
        try {
            PDDocument doc = PDDocument.load(new File("/590031dc-2131-4a00-a936-d1175b7b926c.pdf"));
            for (int pageIndex = 0; pageIndex < doc.getNumberOfPages(); pageIndex++) {
                PDPage page = doc.getPage(pageIndex);
                int rows = (int) Math.ceil(page.getMediaBox().getHeight() / GRID_HEIGHT);
                int columns = (int) Math.ceil(page.getMediaBox().getWidth() / GRID_WIDTH);
                System.out.println("processing page " + (pageIndex + 1)
                        + ", rows = " + rows + ", columns = " + columns);
                for (int rowIndex = 0; rowIndex < rows; rowIndex++) {
                    for (int colIndex = 0; colIndex < columns; colIndex++) {
                        PDFTextStripperByArea textStripper = new PDFTextStripperByArea();
                        textStripper.addRegion(
                                Integer.toString(rowIndex * columns + colIndex),
                                new Rectangle2D.Float(colIndex * GRID_WIDTH, rowIndex * GRID_HEIGHT,
                                        GRID_WIDTH, GRID_HEIGHT));
                        textStripper.extractRegions(page);
                    }
                }
            }
            doc.close();
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}
{code}
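In case it helps to compare 2.0.27 and 2.0.28 without a profiler, here is a rough sketch of a heap probe that could be called at the end of each page iteration in the code above. The class and method names (HeapProbe, usedHeapMiB, describe) are hypothetical and not part of PDFBox, and the figures it prints are only approximate because System.gc() is just a hint to the JVM.

```java
// Hypothetical helper, not part of PDFBox: logs approximate used heap per page.
public class HeapProbe {

    /** Returns the heap currently in use (total minus free), in mebibytes. */
    public static long usedHeapMiB() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / (1024L * 1024L);
    }

    /** Builds one log line; pageNumber is 1-based. */
    public static String describe(int pageNumber, long usedMiB) {
        return "page " + pageNumber + ": used heap ~" + usedMiB + " MiB";
    }

    public static void main(String[] args) {
        // Stand-in loop over three "pages"; in the reproducer the two calls
        // below would go at the end of the per-page loop body instead.
        for (int pageNumber = 1; pageNumber <= 3; pageNumber++) {
            System.gc(); // only a hint, so the readings remain approximate
            System.out.println(describe(pageNumber, usedHeapMiB()));
        }
    }
}
```

If the used-heap figure keeps climbing from page to page on one version and stays roughly flat on the other, that would be consistent with the attached profiler screenshots, though a profiler or heap dump is still the more reliable evidence.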
> PDFTextStripper runs out of memory in 2.0.28 but not in 2.0.27 same code
> ------------------------------------------------------------------------
>
> Key: PDFBOX-5606
> URL: https://issues.apache.org/jira/browse/PDFBOX-5606
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 2.0.28
> Reporter: Joe Li
> Priority: Major
> Labels: memory-bug
> Attachments: 590031dc-2131-4a00-a936-d1175b7b926c.pdf,
> pdfbox-2.0.27.png, pdfbox-2.0.28.png
>
>
> Given the following simplified Groovy code (for succinctness over Java)
>
> {code:java}
> // Groovy 4.0.12
> import org.apache.pdfbox.pdmodel.PDDocument
> import org.apache.pdfbox.pdmodel.PDPage
> import org.apache.pdfbox.text.PDFTextStripperByArea
> import java.awt.geom.Rectangle2D
>
> int GRID_WIDTH = 10
> int GRID_HEIGHT = 10
>
> PDDocument.load(new File('./test.pdf')).withCloseable { doc ->
>     doc.pages.eachWithIndex { PDPage page, int pageIndex ->
>         int rows = Math.ceil((page.mediaBox.height as int) / GRID_HEIGHT)
>         int columns = Math.ceil((page.mediaBox.width as int) / GRID_WIDTH)
>         println "processing page $pageIndex, rows = $rows, columns = $columns"
>         def rectangles = [:]
>         (0..<rows).each { rowIndex ->
>             (0..<columns).each { colIndex ->
>                 rectangles["${rowIndex * columns + colIndex}"] =
>                     new Rectangle2D.Float(colIndex * GRID_WIDTH, rowIndex * GRID_HEIGHT, GRID_WIDTH, GRID_HEIGHT)
>             }
>         }
>         rectangles.each { key, rect ->
>             PDFTextStripperByArea textStripper = new PDFTextStripperByArea()
>             textStripper.addRegion(key, rect)
>             textStripper.extractRegions(page)
>         }
>     }
> }
> {code}
>
>
> PDFBox version 2.0.28 uses ever-increasing memory, but version 2.0.27 does
> not.
> The test.pdf file I am using can be downloaded from Apple's SEC filings page
> (an `8-K` from [https://investor.apple.com/sec-filings/default.aspx]), but any
> PDF with 10+ pages and a lot of text will work.
> I have attached profiler screenshots of the difference.
> Thanks in advance for your help.