[ https://issues.apache.org/jira/browse/PDFBOX-3284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15213265#comment-15213265 ]

Timo Boehme commented on PDFBOX-3284:
-------------------------------------

While doing memory profiling on the provided PDF I found that a huge number 
of COSDictionary objects are created (more than 1 million), while more than 90% 
of them contain only 2 or 3 items. After adding instance logging I get the 
following figures for COSDictionary instances per item count:
{code}9, 887416, 77608, 136, 879, 7332, 61326, 106, 805, 2534, 1, 8{code}
(9 instances with 1 item, 887,416 with 2 items, etc.).
Thus the LinkedHashMap used in COSDictionary is far from memory-efficient for 
this case.
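For illustration, a minimal sketch of such per-item-count instance logging (the names are hypothetical, not the actual debug code in COSDictionary):
{code}
// Hypothetical sketch of per-item-count instance logging; not the actual
// PDFBox debug code. record() would be called once per dictionary, e.g.
// when parsing of the dictionary has finished.
import java.util.concurrent.atomic.AtomicLongArray;

public final class InstanceCountLogger
{
    private static final int MAX_TRACKED = 32;
    // slot i counts instances seen with i items; the last slot collects the rest
    private static final AtomicLongArray COUNTS = new AtomicLongArray(MAX_TRACKED + 1);

    public static void record(int itemCount)
    {
        COUNTS.incrementAndGet(Math.min(itemCount, MAX_TRACKED));
    }

    public static void dump()
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i <= MAX_TRACKED; i++)
        {
            sb.append(i > 0 ? ", " : "").append(COUNTS.get(i));
        }
        System.err.println("instances per item count: " + sb);
    }
}
{code}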
In order to resolve this, in revision 1736709 I've added an 
{code}o.a.p.util.SmallMap{code} implementation which is memory-optimized for 
maps with few entries (the smallest footprint I can think of) and used it 
instead of the LinkedHashMap in COSDictionary.
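For illustration, a minimal sketch of the idea behind such a map (a sketch only, not the actual SmallMap code): all entries live in one flat Object[] of interleaved key/value pairs, so a 2-entry map costs a single 4-element array instead of a hash table plus a node object per mapping, and lookups are linear scans, which is cheap for 2-3 entries:
{code}
// Illustrative sketch of the idea only, not the actual SmallMap code:
// keys and values interleave in one flat array: [k0, v0, k1, v1, ...].
public class SmallMapSketch<K, V>
{
    private Object[] mapArr; // null for the empty map

    public int size()
    {
        return mapArr == null ? 0 : mapArr.length / 2;
    }

    @SuppressWarnings("unchecked")
    public V get(Object key)
    {
        if (mapArr != null)
        {
            // linear scan instead of hashing; cheap for 2-3 entries
            for (int i = 0; i < mapArr.length; i += 2)
            {
                if (mapArr[i].equals(key))
                {
                    return (V) mapArr[i + 1];
                }
            }
        }
        return null;
    }

    @SuppressWarnings("unchecked")
    public V put(K key, V value)
    {
        if (mapArr == null)
        {
            mapArr = new Object[] { key, value };
            return null;
        }
        for (int i = 0; i < mapArr.length; i += 2)
        {
            if (mapArr[i].equals(key))
            {
                V old = (V) mapArr[i + 1];
                mapArr[i + 1] = value;
                return old;
            }
        }
        // new key: grow by exactly one pair, preserving insertion order
        Object[] grown = new Object[mapArr.length + 2];
        System.arraycopy(mapArr, 0, grown, 0, mapArr.length);
        grown[grown.length - 2] = key;
        grown[grown.length - 1] = value;
        mapArr = grown;
        return null;
    }

    /** Copy all entries into the given map, in insertion order. */
    @SuppressWarnings("unchecked")
    public void copyInto(java.util.Map<K, V> target)
    {
        if (mapArr != null)
        {
            for (int i = 0; i < mapArr.length; i += 2)
            {
                target.put((K) mapArr[i], (V) mapArr[i + 1]);
            }
        }
    }
}
{code}
Growing by exactly one pair trades some insertion cost for footprint, which is the right trade-off when most maps never grow past 3 entries.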
As a result, text extraction of the PDF in question now needs a minimum of 
1.2 GB of heap space instead of 1.4 GB, and even runs with 1.1 GB (at the cost 
of more garbage collection); the new map thus saves approx. 250 MB of heap 
space - and it is even a bit faster (by 3 seconds).
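(As a rough plausibility check, assuming a 64-bit JVM with compressed oops: a 
LinkedHashMap with two entries carries the map object, a pre-sized table array 
and a linked entry node of roughly 40 bytes per mapping - around 200 bytes in 
total - while a flat 4-slot key/value array is on the order of 50 bytes. 
Across roughly one million dictionaries that difference is in the right 
ballpark for the observed savings.)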
While there may be other parts which should also be memory-optimized, this 
should be a good start.
In the current version I've kept the code for counting COSDictionary 
instances in this class (enable it by setting the boolean 
DO_DEBUG_INSTANCE_COUNT to true). It would be good to check other PDFs with 
regard to the item-count distribution in COSDictionary. If we find examples 
with a large number of items, we may have to switch to a different map 
implementation once a certain item count is reached, for performance reasons.
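If such PDFs turn up, one possible shape of that cut-over, reusing the SmallMapSketch class from the sketch above (the threshold value is made up and would need benchmarking against real PDFs):
{code}
// Illustrative sketch only, reusing SmallMapSketch from above; the threshold
// is hypothetical and would need benchmarking.
import java.util.LinkedHashMap;

public class AdaptiveMapSketch<K, V>
{
    private static final int PROMOTE_THRESHOLD = 16; // hypothetical cut-over point

    private SmallMapSketch<K, V> small = new SmallMapSketch<K, V>();
    private LinkedHashMap<K, V> big; // non-null once promoted

    public V get(Object key)
    {
        return big != null ? big.get(key) : small.get(key);
    }

    public V put(K key, V value)
    {
        if (big == null && small.size() < PROMOTE_THRESHOLD)
        {
            return small.put(key, value);
        }
        if (big == null)
        {
            // threshold reached: move everything into a LinkedHashMap so
            // lookups become O(1) instead of linear scans
            big = new LinkedHashMap<K, V>();
            small.copyInto(big);
            small = null;
        }
        return big.put(key, value);
    }
}
{code}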

> Big Pdf parsing to text - Out of memory
> ---------------------------------------
>
>                 Key: PDFBOX-3284
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3284
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.8.10, 1.8.11, 2.0.0, 2.1.0
>            Reporter: Nicolas Daniels
>
> I'm trying to parse a fairly big PDF (26 MB) and transform it to text; however, 
> I'm facing huge memory consumption leading to an out-of-memory error. Running 
> my test with -Xmx768M always fails; I have to increase it to 1500M to make it 
> work. 
> The resulting text is only 3 MB, so I don't understand why it takes so much 
> memory.
> I've tested this code on 1.8.10, 1.8.11 & 2.0.0 with the same result.
> The PDF can be found 
> [here|https://www2.swift.com/uhbonline/books/public/en_uk/clr_3_0_stdsmx_msg_def_rpt_sch/sr2015_mx_clearing_3dot0_mdr2_solution.pdf]
> My code:
> {code:title=Test.java|borderStyle=solid}
> @Test
> public void testParsePdf_Content_Memory() throws Exception {
>     InputStream inputStream = new FileInputStream("c:/tmp/sr2015_mx_clearing_3dot0_mdr2_solution.pdf");
>     try {
>         FileWriter fileWriter = new FileWriter(new File("c:/tmp/test.txt"));
>         PDFTextStripper pdfTextStripper = new PDFTextStripper();
>         pdfTextStripper.writeText(PDDocument.load(inputStream), fileWriter);
>         fileWriter.close();
>     } finally {
>         inputStream.close();
>     }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
