[
https://issues.apache.org/jira/browse/PDFBOX-3284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212079#comment-15212079
]
John Hewson edited comment on PDFBOX-3284 at 3/25/16 5:01 PM:
--------------------------------------------------------------
First of all, -Xmx768M isn't _that_ much memory. I'd recommend 1-2GB. I've
parsed 100MB+ PDFs with PDFBox with that amount of memory. As Tilman says,
memory usage is often due to how you're opening files, and sometimes it's due to
a particular PDF (e.g. a file which includes a single giant image). Remember
that while the compressed PDF file may only be 23MB, PDFBox has to handle its
uncompressed contents, parse that into various data structures, and load all
the fonts from disk and parse them into in-memory structures too, which
can start using up quite a bit of memory.
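To get a feel for the gap between compressed and uncompressed size, here is a minimal sketch using only java.util.zip. It assumes the streams are Flate-compressed (PDF's FlateDecode filter uses DEFLATE); the class name, helper names and the repetitive sample content stream are made up for illustration and have nothing to do with the attached file or with PDFBox internals.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class FlateExpansionSketch {

    /** Compresses data with DEFLATE, the algorithm behind PDF's FlateDecode filter. */
    public static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Decompresses DEFLATE data back into the original bytes. */
    public static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated input; stop instead of looping forever
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("not valid DEFLATE data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Builds a hypothetical, highly repetitive page content stream. */
    public static byte[] sampleContentStream(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            int y = 700 - (i % 60) * 10;
            sb.append("BT /F1 10 Tf 72 ").append(y).append(" Td (Sample line) Tj ET\n");
        }
        return sb.toString().getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] raw = sampleContentStream(100_000);
        byte[] packed = deflate(raw);
        byte[] unpacked = inflate(packed);
        System.out.printf("compressed: %d bytes, uncompressed: %d bytes%n",
                packed.length, unpacked.length);
    }
}
```

The ratio you see depends entirely on the content, and this only covers the raw bytes. The parsed objects and fonts come on top of that.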
Personally I've had the best results using a 32-bit JVM and opening the PDF
directly from a File with no scratch file. Feel free to upload the problem PDF
and we can see if there's something specific about that file which is causing
the problem.
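When comparing runs, it can help to confirm which heap limit and JVM data model the test actually ran with. A small sketch follows, with one assumption: the sun.arch.data.model system property is HotSpot-specific and may be missing on other JVMs, so the code falls back to "unknown". The class and method names are made up for illustration.

```java
public class JvmMemoryReport {

    /** Maximum heap the JVM will attempt to use, in MiB (roughly the -Xmx value). */
    public static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    /** "32" or "64" on HotSpot JVMs; "unknown" if the property is not set. */
    public static String dataModel() {
        return System.getProperty("sun.arch.data.model", "unknown");
    }

    public static void main(String[] args) {
        System.out.println("max heap (MiB): " + maxHeapMiB());
        System.out.println("data model (bits): " + dataModel());
        System.out.println("os.arch: " + System.getProperty("os.arch"));
    }
}
```

Run it with the same JVM and the same -Xmx setting as the failing test. The reported maximum can be slightly lower than the -Xmx value.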
> Big Pdf parsing to text - Out of memory
> ---------------------------------------
>
> Key: PDFBOX-3284
> URL: https://issues.apache.org/jira/browse/PDFBOX-3284
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 1.8.10, 1.8.11, 2.0.0, 2.1.0
> Reporter: Nicolas Daniels
>
> I'm trying to parse a fairly big PDF (26MB) and convert it to text, but
> I'm seeing huge memory consumption that leads to an out-of-memory error. Running
> my test with -Xmx768M always fails. I have to increase it to 1500M to make it
> work.
> The resulting text is only 3MB, so I don't understand why it takes so much
> memory.
> I've tested this code on 1.8.10, 1.8.11 & 2.0.0 with the same result.
> The pdf can be found
> [here|https://www2.swift.com/uhbonline/books/public/en_uk/clr_3_0_stdsmx_msg_def_rpt_sch/sr2015_mx_clearing_3dot0_mdr2_solution.pdf]
> My code:
> {code:title=Test.java|borderStyle=solid}
> @Test
> public void testParsePdf_Content_Memory() throws Exception {
>     // The PDF is opened through an InputStream rather than directly from a File.
>     InputStream inputStream = new FileInputStream("c:/tmp/sr2015_mx_clearing_3dot0_mdr2_solution.pdf");
>     try {
>         StringWriter writer = new StringWriter(); // unused
>         FileWriter fileWriter = new FileWriter(new File("c:/tmp/test.txt"));
>         PDFTextStripper pdfTextStripper = new PDFTextStripper();
>         // Note: the PDDocument returned by load() is never closed here.
>         pdfTextStripper.writeText(PDDocument.load(inputStream), fileWriter);
>         fileWriter.close();
>     } finally {
>         inputStream.close();
>     }
> }
> {code}