[
https://issues.apache.org/jira/browse/PDFBOX-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179784#comment-14179784
]
Maruan Sahyoun commented on PDFBOX-2445:
----------------------------------------
The file uses a lot of small images that are duplicated on each page, which
leads to the memory issue.
[~jahewson] could we change PDFTextStripper so that it does not use
document.getDocumentCatalog().getAllPages()? As I understand it, that call
loads everything. Or has that already changed?
> Out of Memory - Extract text for Apache_Solr_4.7_Ref_Guide.pdf
> --------------------------------------------------------------
>
> Key: PDFBOX-2445
> URL: https://issues.apache.org/jira/browse/PDFBOX-2445
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 1.8.7, 2.0.0
> Reporter: Maruan Sahyoun
>