[ https://issues.apache.org/jira/browse/TIKA-1907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208274#comment-15208274 ]
Tim Allison commented on TIKA-1907:
-----------------------------------
Thank you for raising this issue. As [~tilman] pointed out, there may be some
areas for memory optimization within PDFBox. To be fair, though, Acrobat Reader
consumed 500MB of memory when saving this file as text, and when you decode the
document with the PDFBox app's WriteDecodedDoc, the file expands to 190MB.
pdftotext appears to use less memory for this file.

If there's anything you can recommend we do on the Tika side to decrease the
memory footprint, please let us know.

I plan to parameterize the scratch file usage, but as you found, that doesn't
offer enormous savings.
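
For readers unfamiliar with the term, a scratch file trades heap for disk:
bytes are buffered in memory up to a limit and spilled to a temporary file
beyond it. The sketch below illustrates that general idea using only the JDK.
It is not PDFBox's or Tika's implementation; the class name SpillBuffer, its
methods, and the threshold parameter are all hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Hypothetical illustration of a scratch-file buffer: keeps bytes on the heap
 * until a threshold is exceeded, then moves everything to a temp file.
 */
public class SpillBuffer extends OutputStream {
    private final int maxInMemoryBytes;
    private ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private Path scratchFile;          // null until the buffer spills to disk
    private OutputStream scratchOut;   // null until the buffer spills to disk
    private long size;

    public SpillBuffer(int maxInMemoryBytes) {
        this.maxInMemoryBytes = maxInMemoryBytes;
    }

    @Override
    public void write(int b) throws IOException {
        write(new byte[] {(byte) b}, 0, 1);
    }

    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        // Spill once, the first time a write would exceed the in-memory limit.
        if (scratchOut == null && size + len > maxInMemoryBytes) {
            spill();
        }
        if (scratchOut != null) {
            scratchOut.write(buf, off, len);
        } else {
            memory.write(buf, off, len);
        }
        size += len;
    }

    private void spill() throws IOException {
        scratchFile = Files.createTempFile("scratch", ".bin");
        scratchOut = Files.newOutputStream(scratchFile);
        memory.writeTo(scratchOut);    // copy what was buffered so far
        memory = null;                 // release the heap copy
    }

    public boolean isOnDisk() {
        return scratchFile != null;
    }

    public long size() {
        return size;
    }

    @Override
    public void close() throws IOException {
        if (scratchOut != null) {
            scratchOut.close();
            Files.deleteIfExists(scratchFile);
        }
    }
}
```

A buffer like this only moves raw bytes off the heap. Any object structures a
parser builds on top of those bytes still live in memory, which may be one
reason the savings reported here were modest.
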
> Big Pdf parsing to text - Out of memory
> ---------------------------------------
>
> Key: TIKA-1907
> URL: https://issues.apache.org/jira/browse/TIKA-1907
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.12
> Reporter: Nicolas Daniels
>
> Linked to PDFBox issue: [https://issues.apache.org/jira/browse/PDFBOX-3284]
> I'm duplicating it here to make sure it will be fixed in Tika as well. Maybe
> PDFBox is not the appropriate library to use in such a case.
> Trying to read the same PDF using Tika leads to the same problem:
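> {code:title=Test.java|borderStyle=solid}
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.FileWriter;
> import java.io.InputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.pdf.PDFParser;
> import org.apache.tika.sax.BodyContentHandler;
> import org.junit.Test;
>
> @Test
> public void testParsePdf_Content_Memory() throws Exception {
>     // try-with-resources closes both streams even if parsing fails
>     try (InputStream inputStream =
>              new FileInputStream("c:/tmp/sr2015_mx_clearing_3dot0_mdr2_solution.pdf");
>          FileWriter fileWriter = new FileWriter(new File("c:/tmp/test.txt"))) {
>         // Write extracted text straight to the output file
>         BodyContentHandler handler = new BodyContentHandler(fileWriter);
>         Metadata metadata = new Metadata();
>         new PDFParser().parse(inputStream, handler, metadata, new ParseContext());
>     }
> }
> {code}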