[jira] Commented: (PDFBOX-778) OutOfMemory when extracting text from pdf

David Wright (JIRA) Thu, 02 Sep 2010 06:03:23 -0700

    [ https://issues.apache.org/jira/browse/PDFBOX-778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905493#action_12905493 ]


David Wright commented on PDFBOX-778:
-------------------------------------

Hi Jukka

I found this thread because I also had a bad memory leak with 1.1.0.
Upgrading to 1.2.1 has improved matters a lot, but there still seems to
be a small leak.

I'm extracting text from ~1000 PDFs (~700MB in total, containing ~40MB of
text). Some are old and poorly constructed, and with 1.1.0 they generated
a lot of complaints from PDFontFactory et al. Memory use changes as follows:

added=0 before gc MB=2.88, after MB=1.88
added=1 before gc MB=13.5, after MB=9.38
added=501 before gc MB=23.0, after MB=9.75
added=551 before gc MB=25.7, after MB=15.3
added=926 before gc MB=42.6, after MB=23.6
added=1076 before gc MB=45.4, after MB=18.9
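
For anyone wanting to collect comparable figures, here is a minimal sketch of
one way to log used heap before and after a requested garbage collection. It
is not the code that produced the numbers above. The class name HeapLog, the
reporting interval and the document count are all assumptions made for
illustration, and the PDF extraction step is left as a placeholder comment.

```java
import java.util.Locale;

public class HeapLog {

    /** Heap currently in use by this JVM, in MB (1 MB = 1024 * 1024 bytes). */
    public static double usedMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / (1024.0 * 1024.0);
    }

    /** Formats one report line in a shape similar to the figures above. */
    public static String formatLine(int added, double beforeMb, double afterMb) {
        return String.format(Locale.ROOT,
                "added=%d before gc MB=%.2f, after MB=%.2f", added, beforeMb, afterMb);
    }

    /** Measures used heap, requests a collection, then measures again. */
    public static String snapshot(int added) {
        double before = usedMb();
        System.gc(); // only a request; the JVM is free to ignore it
        double after = usedMb();
        return formatLine(added, before, after);
    }

    public static void main(String[] args) {
        int total = 1000;      // hypothetical document count
        int reportEvery = 250; // hypothetical reporting interval
        for (int added = 0; added <= total; added++) {
            // Placeholder: the real loop would extract text from one PDF here.
            if (added % reportEvery == 0) {
                System.out.println(snapshot(added));
            }
        }
    }
}
```

Because System.gc() is only a hint, the "after" value is an approximation of
retained memory; a steady rise in that value over many documents is what
would suggest a leak.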

Hope this info is of some use. Generally I'm delighted with PDFBox; it
'does what it says on the tin'.

Kind regards

David Wright
Technical Author
LDS Test and Measurement, Royston Herts UK SG8 5BQ
Direct Dial +44 1763 255235



> OutOfMemory when extracting text from pdf
> -----------------------------------------
>
>                 Key: PDFBOX-778
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-778
>             Project: PDFBox
>          Issue Type: Bug
>         Environment: Mac OS X
>            Reporter: Mario Sangiorgio
>         Attachments: 92.pdf
>
>
> I have to extract text from hundreds of documents, but at a certain point I 
> get an OutOfMemoryError.
> The memory leak seems to be related to a single file, which I have attached.
> Please let me know if you need more details.
> This is the stack trace of the exception:
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>       at java.util.Arrays.copyOf(Arrays.java:2734)
>       at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
>       at java.util.ArrayList.add(ArrayList.java:351)
>       at org.apache.pdfbox.pdfparser.PDFStreamParser.parse(PDFStreamParser.java:103)
>       at org.apache.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:119)
>       at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>       at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:207)
>       at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:367)
>       at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:291)
>       at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:247)
>       at it.polimi.utils.TextStripper.getFullText(TextStripper.java:57)
>       at it.polimi.utils.TextStripper.getFullText(TextStripper.java:72)
>       at it.polimi.utils.TextStripper.getContent(TextStripper.java:30)
>       at applications.ExtractAbstracts.convert(ExtractAbstracts.java:47)
>       at applications.ExtractAbstracts.convert(ExtractAbstracts.java:36)
>       at applications.ExtractAbstracts.main(ExtractAbstracts.java:17)

