yuying zhang created TIKA-4492:
----------------------------------

             Summary: Large file parsing fails (RecordFormatException): using 
FileInputStream throws an exception, but using 
TikaInputStream.get(Path) parses successfully
                 Key: TIKA-4492
                 URL: https://issues.apache.org/jira/browse/TIKA-4492
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 3.0.0
            Reporter: yuying zhang


I encountered a {{org.apache.tika.exception.TikaException: 
org.apache.poi.ooxml.util.RecordFormatException}} exception when using 
{{AutoDetectParser}} to parse a 20MB full text {{docx}} file.

Using the following code snippet for parsing (throws the exception):

 
{code:java}
FileInputStream fileInputStream = new FileInputStream(file);
autoDetectParser.parse(fileInputStream,handler,metadata,context);{code}
Using {{TikaInputStream}} to wrap the input instead (parses successfully):
{code:java}
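// TikaInputStream is obtained via its static get() factory; assumes file is a java.io.File
TikaInputStream tikaInputStream = TikaInputStream.get(file.toPath());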
autoDetectParser.parse(tikaInputStream,handler,metadata,context); {code}
I looked at the source code of {{AutoDetectParser.parse(InputStream, 
ContentHandler, Metadata, ParseContext)}} and found that it internally calls 
{{TikaInputStream tis = TikaInputStream.get(stream, tmp, metadata)}}.

Why does passing a {{FileInputStream}} directly cause parsing of a 20MB 
{{docx}} file to fail, while wrapping the file with {{TikaInputStream.get()}} 
before calling {{parse()}} succeeds?
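
In case it is relevant to the questions above: a {{docx}} file is a ZIP container, and a plain {{InputStream}} can only be read sequentially, while a file-backed source also allows lookup by entry name. Whether this difference is what triggers the exception is only an assumption on my part, not a confirmed cause. The sketch below uses only {{java.util.zip}} (no Tika or POI classes); the class and method names ({{ZipAccessDemo}}, {{writeSampleZip}}, {{readViaStream}}, {{readViaFile}}) are hypothetical and exist only for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipAccessDemo {

    // Writes a small two-entry ZIP to a temp file (a stand-in for a docx).
    static Path writeSampleZip() throws IOException {
        Path zip = Files.createTempFile("sample", ".zip");
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zip))) {
            out.putNextEntry(new ZipEntry("[Content_Types].xml"));
            out.write("<Types/>".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
            out.putNextEntry(new ZipEntry("word/document.xml"));
            out.write("<document/>".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        return zip;
    }

    // Stream access: entries are walked in order until the name matches.
    static String readViaStream(Path zip, String name) throws IOException {
        try (ZipInputStream in = new ZipInputStream(Files.newInputStream(zip))) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                if (entry.getName().equals(name)) {
                    // Reads only up to the end of the current entry.
                    return new String(in.readAllBytes(), StandardCharsets.UTF_8);
                }
            }
        }
        return null;
    }

    // File access: the ZIP central directory allows direct lookup by name.
    static String readViaFile(Path zip, String name) throws IOException {
        try (ZipFile zipFile = new ZipFile(zip.toFile())) {
            ZipEntry entry = zipFile.getEntry(name);
            if (entry == null) {
                return null;
            }
            try (InputStream in = zipFile.getInputStream(entry)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path zip = writeSampleZip();
        try {
            System.out.println(readViaStream(zip, "word/document.xml"));
            System.out.println(readViaFile(zip, "word/document.xml"));
        } finally {
            Files.deleteIfExists(zip);
        }
    }
}
```

Both methods return the same content for this tiny sample; the point is only that the two access paths are different code paths, which may behave differently for a large archive.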

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)