[ https://issues.apache.org/jira/browse/PDFBOX-3284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15206862#comment-15206862 ]
Tilman Hausherr commented on PDFBOX-3284:
-----------------------------------------
PDFBox doesn't parse on demand. As a result, many structures that aren't
needed for text extraction are still parsed, expanded and held in memory,
e.g. the many annotations (links) in the file. We've had complaints about
higher memory usage than yours :-) Btw, you can save a little memory by
loading from a File instead of a stream, and by using a scratch file.
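
A minimal sketch of the File plus scratch file suggestion, assuming the
PDFBox 2.0.x API (PDDocument.load(File, MemoryUsageSetting) and
MemoryUsageSetting.setupTempFileOnly()). The class name
LowMemoryTextExtraction and the helper extractText are made up for
illustration, and the input and output paths are taken from the command
line. This only moves stream buffering out of the heap; the parsed object
structures described above still live in memory, so expect a modest saving
rather than a fix.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical helper class, not part of PDFBox.
class LowMemoryTextExtraction {

    // Loads the PDF from a File (not an InputStream) and lets PDFBox buffer
    // stream data in a temporary scratch file instead of main memory.
    static void extractText(File pdf, Writer out) throws IOException {
        try (PDDocument document =
                PDDocument.load(pdf, MemoryUsageSetting.setupTempFileOnly())) {
            // Text is written straight to the Writer, page by page.
            new PDFTextStripper().writeText(document, out);
        }
    }

    public static void main(String[] args) throws IOException {
        File pdf = new File(args[0]);   // input PDF path
        try (Writer out = new BufferedWriter(new FileWriter(args[1]))) {
            extractText(pdf, out);      // args[1] is the output text file
        }
    }
}
```

The try-with-resources block closes the document, which also releases the
scratch file.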
> Big Pdf parsing to text - Out of memory
> ---------------------------------------
>
> Key: PDFBOX-3284
> URL: https://issues.apache.org/jira/browse/PDFBOX-3284
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 1.8.10, 1.8.11, 2.0.0, 2.1.0
> Reporter: Nicolas Daniels
>
> I'm trying to parse a fairly large PDF (26MB) and convert it to text, but
> memory consumption is so high that it ends in an out of memory error.
> Running my test with -Xmx768M always fails; I have to increase it to 1500M
> to make it work.
> The resulting text is only 3MB, so I don't understand why it takes so much
> memory.
> I've tested this code with 1.8.10, 1.8.11 and 2.0.0, with the same result.
> The pdf can be found
> [here|https://www2.swift.com/uhbonline/books/public/en_uk/clr_3_0_stdsmx_msg_def_rpt_sch/sr2015_mx_clearing_3dot0_mdr2_solution.pdf]
> My code:
> {code:title=Test.java|borderStyle=solid}
> @Test
> public void testParsePdf_Content_Memory() throws Exception {
>     // Assumes the PDF is a plain file on disk; getResourceAsStream() only
>     // resolves classpath resources and returns null otherwise.
>     InputStream inputStream = new FileInputStream("c:/tmp/sr2015_mx_clearing_3dot0_mdr2_solution.pdf");
>     FileWriter fileWriter = new FileWriter(new File("c:/tmp/test.txt"));
>     PDDocument document = null;
>     try {
>         document = PDDocument.load(inputStream);
>         PDFTextStripper pdfTextStripper = new PDFTextStripper();
>         pdfTextStripper.writeText(document, fileWriter);
>     } finally {
>         // Release the parsed document, the output file and the input stream.
>         if (document != null) {
>             document.close();
>         }
>         fileWriter.close();
>         inputStream.close();
>     }
> }
> {code}