[
https://issues.apache.org/jira/browse/TIKA-818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Pearcy updated TIKA-818:
-----------------------------
Attachment: choose_inmemory_vs_temp_file_pdf_passes_tests.patch
Here's a version that should pass all tests. Please ignore the first one.
> Allow PDFBox to be used with RandomAccessFile vs RandomAccessBuffer to allow
> for a memory vs performance tradeoff
> -----------------------------------------------------------------------------------------------------------------
>
> Key: TIKA-818
> URL: https://issues.apache.org/jira/browse/TIKA-818
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 0.10, 1.0
> Reporter: Paul Pearcy
> Attachments: choose_inmemory_vs_temp_file_pdf.patch,
> choose_inmemory_vs_temp_file_pdf_passes_tests.patch
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> After upgrading to Tika 0.10, I began seeing OOM errors when processing large
> numbers of PDFs in parallel. The heap dump indicated that all the memory was
> being used up by PDFBox RandomAccessBuffers. After digging around, it looks
> like PDFBox now defaults to using RAM rather than temporary files for PDF
> extraction. This can be overridden to use RandomAccessFile instead.
> I propose that Tika choose file vs. buffer based on the type of input stream
> it receives. If the TikaInputStream is backed by a file, RandomAccessFile
> should be used; for other stream types, RandomAccessBuffer can be used.
> I believe the code to control this is here:
> https://github.com/apache/tika/blob/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
> At ~line 87:
> PDDocument pdfDocument =
> PDDocument.load(new CloseShieldInputStream(stream), true);
> I'm not sure this is the best approach, and I'm curious whether there are
> other ideas on how to control this while keeping the interface clean.
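
For illustration, the file-vs-buffer decision proposed in the description could be sketched roughly as below. This is only a sketch under assumptions: `ScratchChoice`, `Scratch`, and `choose` are hypothetical names, not Tika or PDFBox APIs, and the boolean argument stands in for whatever check Tika would use to tell whether the incoming stream is backed by a file.

```java
final class ScratchChoice {

    /** Where the PDF parser would keep its working data (hypothetical type). */
    enum Scratch { TEMP_FILE, MEMORY }

    /**
     * Hypothetical helper: if the input already lives on disk, prefer a
     * temp-file-backed scratch area to keep heap usage low; otherwise
     * stay in memory, which is faster but costs heap.
     */
    static Scratch choose(boolean inputIsFileBacked) {
        return inputIsFileBacked ? Scratch.TEMP_FILE : Scratch.MEMORY;
    }

    public static void main(String[] args) {
        // A file-backed stream selects the temp-file strategy.
        System.out.println(choose(true));
        // Any other stream type selects the in-memory strategy.
        System.out.println(choose(false));
    }
}
```

In the parser, the result would decide whether a RandomAccessFile or a RandomAccessBuffer is handed to PDFBox; the exact PDFBox call that accepts it is deliberately left out of this sketch.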