Nicolas Daniels created TIKA-1907:
-------------------------------------
Summary: Big Pdf parsing to text - Out of memory
Key: TIKA-1907
URL: https://issues.apache.org/jira/browse/TIKA-1907
Project: Tika
Issue Type: Bug
Affects Versions: 1.12
Reporter: Nicolas Daniels
Linked to PDFBox issue: [https://issues.apache.org/jira/browse/PDFBOX-3284]
I'm duplicating it here to make sure it gets fixed in Tika as well. Perhaps
PDFBox is not the appropriate library to use in such a case.
Trying to read the same PDF using Tika leads to the same out-of-memory problem:
{code:title=Test.java|borderStyle=solid}
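import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.junit.Test;

@Test
public void testParsePdf_Content_Memory() throws Exception {
    // try-with-resources closes both the input stream and the writer, even if parsing fails
    try (InputStream inputStream = new FileInputStream("c:/tmp/sr2015_mx_clearing_3dot0_mdr2_solution.pdf");
         FileWriter fileWriter = new FileWriter(new File("c:/tmp/test.txt"))) {
        // Write the extracted text straight to a file rather than collecting it in a String
        BodyContentHandler handler = new BodyContentHandler(fileWriter);
        Metadata metadata = new Metadata();
        new PDFParser().parse(inputStream, handler, metadata, new ParseContext());
    }
}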
{code}