[https://issues.apache.org/jira/browse/PDFBOX-3284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15213265#comment-15213265]
Timo Boehme commented on PDFBOX-3284:
-------------------------------------
While doing memory profiling on the provided PDF I found that a huge number
of COSDictionary objects are created (more than 1 million), and more than 90%
of them contain only 2 or 3 items. After adding instance logging I got the
following figures for COSDictionary instances per item count:
{code}9, 887416, 77608, 136, 879, 7332, 61326, 106, 805, 2534, 1, 8{code} (9
instances with 1 item, 887,416 with 2 items, etc.).
The LinkedHashMap used in COSDictionary is therefore far from memory
efficient for this case.
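To make the overhead tangible, here is a rough, standalone measurement sketch
(mine, not code from the commit): every mapping in a LinkedHashMap costs a
dedicated Entry object (key, value, hash, next, before, after) on top of the
backing table of at least 16 slots, which adds up quickly for millions of
2-entry maps. Exact numbers vary by JVM.
{code}
import java.util.LinkedHashMap;
import java.util.Map;

// Rough illustration only -- allocates many 2-entry LinkedHashMaps and
// prints the approximate heap cost per map. Run with a fixed heap size
// for more stable numbers, e.g. -Xmx512m.
public class LinkedHashMapOverheadDemo
{
    private static long usedHeap()
    {
        System.gc(); // best-effort hint, good enough for a rough figure
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args)
    {
        int n = 1000000;
        Object[] keep = new Object[n]; // keep maps reachable during measurement
        long before = usedHeap();
        for (int i = 0; i < n; i++)
        {
            Map<String, String> m = new LinkedHashMap<String, String>();
            m.put("Type", "Page");
            m.put("Parent", "1 0 R");
            keep[i] = m;
        }
        long after = usedHeap();
        System.out.println("approx. bytes per 2-entry map: " + (after - before) / n);
    }
}
{code}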
To resolve this, in revision 1736709 I've added an
{code}o.a.p.util.SmallMap{code} implementation which is memory-optimized for
maps with few entries (the smallest footprint I can think of) and used it
instead of the LinkedHashMap in COSDictionary.
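For readers who haven't looked at the commit: the core idea of such a small
map is to store keys and values in one flat array. The sketch below is
illustrative only (class name and details are mine, not the committed
SmallMap); a 2-entry map then costs a single 4-slot array instead of a hash
table plus one linked Entry object per mapping, at the price of O(n) lookups.
{code}
import java.util.AbstractMap;
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative sketch only -- the committed o.a.p.util.SmallMap differs in
// detail. One flat Object[] holds alternating keys and values:
// [k0, v0, k1, v1, ...]. Keys are assumed non-null (as COSName keys are).
@SuppressWarnings("unchecked")
public class ArrayBackedMap<K, V> extends AbstractMap<K, V>
{
    private Object[] kv = new Object[0];

    @Override
    public V get(Object key)
    {
        for (int i = 0; i < kv.length; i += 2)
        {
            if (kv[i].equals(key))
            {
                return (V) kv[i + 1];
            }
        }
        return null;
    }

    @Override
    public V put(K key, V value)
    {
        for (int i = 0; i < kv.length; i += 2)
        {
            if (kv[i].equals(key))
            {
                V old = (V) kv[i + 1];
                kv[i + 1] = value;
                return old;
            }
        }
        // grow by exactly one pair -- no spare capacity, minimal footprint
        Object[] grown = new Object[kv.length + 2];
        System.arraycopy(kv, 0, grown, 0, kv.length);
        grown[kv.length] = key;
        grown[kv.length + 1] = value;
        kv = grown;
        return null;
    }

    @Override
    public Set<Entry<K, V>> entrySet()
    {
        // snapshot view, sufficient for iteration in this sketch
        Set<Entry<K, V>> entries = new LinkedHashSet<Entry<K, V>>();
        for (int i = 0; i < kv.length; i += 2)
        {
            entries.add(new SimpleEntry<K, V>((K) kv[i], (V) kv[i + 1]));
        }
        return entries;
    }
}
{code}
The linear scan is exactly why the item count distribution matters: for 2-3
entries it beats hashing, but past some size a hash-based map wins again.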
As a result, text extraction of the PDF in question is down from a minimum of
1.4GB of heap space to 1.2GB, and even runs with 1.1GB (with more garbage
collection); the new map thus saves approx. 250MB of heap space, and it is
even a bit faster (3 seconds). While there may be other parts that should
also be memory-optimized, this should be a good start.
In the current version I've kept the code for collecting the COSDictionary
instance counts in this class (enable it by setting the boolean
DO_DEBUG_INSTANCE_COUNT to true). It would be good to check other PDFs with
regard to the item count distribution in COSDictionary. If we find examples
with a large number of items, we may have to switch the map implementation
once a certain item count is reached, for performance reasons.
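For anyone who wants to gather the same distribution from other PDFs, the
counting boils down to something like the following sketch (names and shape
are mine; the actual debug code in COSDictionary may differ):
{code}
// Illustrative sketch only. Bucket i counts dictionaries holding i+1 items;
// call countInstance() once per dictionary, e.g. when it is fully parsed,
// and print dump() at the end of the run.
public class InstanceCountSketch
{
    private static final boolean DO_DEBUG_INSTANCE_COUNT = true;
    private static final int[] INSTANCES_PER_ITEM_COUNT = new int[64];

    static synchronized void countInstance(int itemCount)
    {
        if (DO_DEBUG_INSTANCE_COUNT && itemCount >= 1
                && itemCount <= INSTANCES_PER_ITEM_COUNT.length)
        {
            INSTANCES_PER_ITEM_COUNT[itemCount - 1]++;
        }
    }

    static synchronized String dump()
    {
        StringBuilder sb = new StringBuilder();
        for (int count : INSTANCES_PER_ITEM_COUNT)
        {
            if (sb.length() > 0)
            {
                sb.append(", ");
            }
            sb.append(count);
        }
        return sb.toString();
    }
}
{code}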
> Big Pdf parsing to text - Out of memory
> ---------------------------------------
>
> Key: PDFBOX-3284
> URL: https://issues.apache.org/jira/browse/PDFBOX-3284
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 1.8.10, 1.8.11, 2.0.0, 2.1.0
> Reporter: Nicolas Daniels
>
> I'm trying to parse a fairly big PDF (26MB) and transform it to text;
> however, I'm facing huge memory consumption leading to an out-of-memory
> error. Running my test with -Xmx768M always fails. I have to increase it to
> 1500M to make it work.
> The resulting text is only 3MB, so I don't understand why it takes so much
> memory.
> I've tested this code with 1.8.10, 1.8.11 & 2.0.0, with the same result.
> The PDF can be found
> [here|https://www2.swift.com/uhbonline/books/public/en_uk/clr_3_0_stdsmx_msg_def_rpt_sch/sr2015_mx_clearing_3dot0_mdr2_solution.pdf]
> My code:
> {code:title=Test.java|borderStyle=solid}
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.FileWriter;
> import java.io.InputStream;
>
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.text.PDFTextStripper; // org.apache.pdfbox.util.PDFTextStripper in 1.8.x
>
> import org.junit.Test;
>
> @Test
> public void testParsePdf_Content_Memory() throws Exception {
>     InputStream inputStream = new FileInputStream("c:/tmp/sr2015_mx_clearing_3dot0_mdr2_solution.pdf");
>     try {
>         FileWriter fileWriter = new FileWriter(new File("c:/tmp/test.txt"));
>         PDFTextStripper pdfTextStripper = new PDFTextStripper();
>         pdfTextStripper.writeText(PDDocument.load(inputStream), fileWriter);
>         fileWriter.close();
>     } finally {
>         inputStream.close();
>     }
> }
> {code}