[jira] [Comment Edited] (TIKA-2963) Tika hits an OOM error when extracting large .xlsx files

Tilman Hausherr (Jira) Fri, 11 Oct 2019 08:28:17 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949551#comment-16949551
 ]


Tilman Hausherr edited comment on TIKA-2963 at 10/11/19 3:27 PM:
-----------------------------------------------------------------

"For docx and pptx type files, Tika can configure the SAX parser to improve 
decimation performance. However, Tika still has an OOM error when extracting 
large files of type .xlsx. I have not found a solution from the official. I 
have attached my own code below. It is also a solution based on SAX parser. The 
code can be adjusted according to the actual situation. Excellent, there are 
many shortcomings, everyone criticizes and corrects, thank you"

I wonder what is meant by "decimation performance". After deleting single 
words in Google Translate, I suspect it means "extraction". So what she/he 
means is that the current solution uses too much memory and that the proposed 
SAX-based solution is better.
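
To illustrate why a SAX-based approach needs less memory, here is a minimal 
sketch using only the JDK's SAX API. It is not the attached demo.java and not 
Tika's own parser; the class name SheetTextSax and the simplified sheet XML are 
assumptions made for illustration. It handles only <v> and inline <t> elements 
and skips the shared-strings lookup a real .xlsx needs. The point is that cell 
text is handled as events arrive, so no document tree is ever built.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; not part of Tika or of the attached demo.java.
public class SheetTextSax {

    /** Collects cell text as SAX events arrive; no document tree is built. */
    static final class CellHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();
        private final StringBuilder cell = new StringBuilder();
        private boolean inValue;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // <v> holds a raw cell value, <t> holds inline string text.
            if (qName.equals("v") || qName.equals("t")) {
                inValue = true;
                cell.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times per element, so append rather than assign.
            if (inValue) {
                cell.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("v") || qName.equals("t")) {
                inValue = false;
                out.append(cell).append('\t');   // one tab after each cell
            } else if (qName.equals("row")) {
                out.append('\n');                // one newline after each row
            }
        }
    }

    /** Streams the sheet XML through the handler and returns tab-separated text. */
    public static String extract(InputStream sheetXml) throws Exception {
        CellHandler handler = new CellHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(sheetXml, handler);
        return handler.out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<worksheet><sheetData>"
                + "<row><c t=\"inlineStr\"><is><t>name</t></is></c><c><v>42</v></c></row>"
                + "<row><c><v>3.5</v></c></row>"
                + "</sheetData></worksheet>";
        System.out.print(extract(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))));
    }
}
```

For a really large sheet the handler would write each cell to a Writer instead 
of accumulating everything in a StringBuilder, so that the output does not 
have to fit in memory either.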



> Tika hits an OOM error when extracting large .xlsx files
> ---------------------------------------------------------
>
>                 Key: TIKA-2963
>                 URL: https://issues.apache.org/jira/browse/TIKA-2963
>             Project: Tika
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 1.20
>            Reporter: Feng Jiao Jiang
>            Priority: Major
>         Attachments: demo.java
>
>
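> For docx and pptx files, Tika can be configured to use a SAX parser to 
> improve extraction performance. However, Tika still hits an OOM error when 
> extracting large .xlsx files. I have not yet found an official solution, so 
> I am attaching my own code below, which is also based on a SAX parser. Its 
> parameters can be tuned to the actual situation. It has many shortcomings; 
> criticism and corrections are welcome. Thank you.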



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
