[ 
https://issues.apache.org/jira/browse/TIKA-4492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18021363#comment-18021363
 ] 

Tilman Hausherr commented on TIKA-4492:
---------------------------------------

Please include a few lines from the stack trace and attach a file that 
triggers the problem or, if it is too large, upload it to a file hosting service.
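
One way to get a short excerpt of the trace is sketched below. It uses only the JDK; the class name StackTraceExcerpt and the method excerpt are hypothetical helpers, not part of Tika, and the exception built in main is just a stand-in for whatever the parse call throws.

```java
import java.util.ArrayList;
import java.util.List;

public class StackTraceExcerpt {

    // Returns the exception summary plus the first maxFrames frames for the
    // throwable and for each of its causes.
    public static List<String> excerpt(Throwable t, int maxFrames) {
        List<String> lines = new ArrayList<>();
        String prefix = "";
        // Walk the cause chain; the depth limit guards against cyclic causes.
        for (int depth = 0; t != null && depth < 10; depth++) {
            lines.add(prefix + t);
            StackTraceElement[] frames = t.getStackTrace();
            int shown = Math.min(maxFrames, frames.length);
            for (int i = 0; i < shown; i++) {
                lines.add("\tat " + frames[i]);
            }
            if (frames.length > shown) {
                lines.add("\t... " + (frames.length - shown) + " more");
            }
            prefix = "Caused by: ";
            t = t.getCause();
        }
        return lines;
    }

    public static void main(String[] args) {
        // Stand-in exception; in practice this would be the exception caught
        // around the parse call.
        Exception e = new RuntimeException("outer failure",
                new IllegalStateException("inner cause"));
        excerpt(e, 5).forEach(System.out::println);
    }
}
```

Calling excerpt(e, 5) inside the catch block around the parse call and pasting the result into the issue is usually enough to show which parser and which check raised the exception.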

> Large file parsing fails (RecordFormatException), using FileInputStream 
> throws exception, but using TikaInputStream.get(Path) successfully parses
> ------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-4492
>                 URL: https://issues.apache.org/jira/browse/TIKA-4492
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 3.0.0
>            Reporter: yuying zhang
>            Priority: Major
>
> I encountered a {{org.apache.tika.exception.TikaException: 
> org.apache.poi.ooxml.util.RecordFormatException}} exception when using 
> {{AutoDetectParser}} to parse a 20 MB full-text {{docx}} file.
> Using the following code snippet for parsing throws the exception:
>  
> {code:java}
> FileInputStream fileInputStream = new FileInputStream(file);
> autoDetectParser.parse(fileInputStream, handler, metadata, context);{code}
> Using {{TikaInputStream}} to wrap the input instead (parses successfully):
> {code:java}
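> // Assumes file is a java.io.File; TikaInputStream is obtained through its static get(Path) factory.
> TikaInputStream tikaInputStream = TikaInputStream.get(file.toPath());
> autoDetectParser.parse(tikaInputStream, handler, metadata, context); {code}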
> I looked at the source code of TikaInputStream.parse(InputStream, 
> ContentHandler, Metadata, ParseContext) and found that it internally calls 
> {{TikaInputStream tis = TikaInputStream.get(stream, tmp, metadata)}}.
> Why does directly using {{FileInputStream}} cause the parsing of a 20MB 
> {{docx}} file to fail? Why does using {{TikaInputStream.get()}} or calling 
> {{TikaInputStream.parse()}} succeed?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
