[
https://issues.apache.org/jira/browse/TIKA-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949551#comment-16949551
]
Tilman Hausherr edited comment on TIKA-2963 at 10/11/19 3:27 PM:
-----------------------------------------------------------------
"For docx and pptx type files, Tika can configure the SAX parser to improve
decimation performance. However, Tika still has an OOM error when extracting
large files of type .xlsx. I have not found a solution from the official. I
have attached my own code below. It is also a solution based on SAX parser. The
code can be adjusted according to the actual situation. Excellent, there are
many shortcomings, everyone criticizes and corrects, thank you"
I wonder what is meant by "decimation performance". After deleting single
words in Google Translate, I suspect it is "extraction". So what she/he means
is that the current solution uses too much memory and that the proposed
SAX-based solution is better.
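To illustrate the memory argument, here is a minimal sketch of SAX-style
streaming over worksheet XML, using only JDK classes. It is not the attached
demo.java and not Tika's actual XLSX parser; the class name SheetTextSketch
and the method extract are hypothetical. It assumes the worksheet part of the
.xlsx zip has already been opened as a stream, and it ignores shared strings,
so cells that reference the shared string table would yield index numbers
rather than text.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only: not demo.java and not Tika's parser.
class SheetTextSketch {

    // Streams worksheet XML and appends the text of every <v> element to out,
    // one value per line. No DOM is built, so the parser itself holds only the
    // current event; a real extractor would write to a stream, not a buffer.
    static void extract(InputStream sheetXml, StringBuilder out) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        factory.newSAXParser().parse(sheetXml, new DefaultHandler() {
            private boolean inValue;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("v".equals(localName)) {
                    inValue = true;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times for one element, so append each chunk.
                if (inValue) {
                    out.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("v".equals(localName)) {
                    inValue = false;
                    out.append('\n');
                }
            }
        });
    }

    public static void main(String[] args) throws Exception {
        String xml = "<worksheet xmlns=\"http://schemas.openxmlformats.org/"
                + "spreadsheetml/2006/main\"><sheetData><row r=\"1\">"
                + "<c r=\"A1\"><v>42</v></c></row></sheetData></worksheet>";
        StringBuilder text = new StringBuilder();
        extract(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), text);
        System.out.print(text);
    }
}
```

The point of the design is that memory use of the parsing step does not grow
with the number of rows, which is what the reporter is asking for.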
> Tika throws an OOM error when extracting large .xlsx files
> ----------------------------------------------------------
>
> Key: TIKA-2963
> URL: https://issues.apache.org/jira/browse/TIKA-2963
> Project: Tika
> Issue Type: Improvement
> Components: core
> Affects Versions: 1.20
> Reporter: Feng Jiao Jiang
> Priority: Major
> Attachments: demo.java
>
>
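> For docx and pptx files, Tika can be configured to use a SAX parser to improve extraction performance. However, Tika still runs into OOM errors when extracting large .xlsx files. I have not yet found an official solution, so I am attaching my own code below, which is also based on a SAX parser. Its parameters can be tuned to the actual situation. It has many shortcomings; criticism and corrections are welcome. Thank you.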