[ https://issues.apache.org/jira/browse/TIKA-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17499194#comment-17499194 ]
Tim Allison commented on TIKA-3668:
-----------------------------------
I reviewed our parsers to make sure ParseContext objects aren't shared
across parses for embedded documents, and I didn't find anything obvious.
I ran a multithreaded test with testOCR.pdf and testOCR.pptx, randomly turning
OCR on and off via a request header, and I got the expected output.
I did this with both the main development branch and the 2.2.0 server, and I
was not able to reproduce the problem with either.
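
For anyone who wants to try something similar, here is a rough sketch of that kind of test. It is not the test I ran. The function names (put_to_tika, run_mixed_ocr_test) are made up for illustration, the server URL assumes a local tika-server on the default port, and the X-Tika-OCRskipOcr header name is an assumption that should be checked against the server documentation for your version.

```python
import random
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def put_to_tika(path, skip_ocr, url="http://localhost:9998/tika"):
    """PUT one file to a running tika-server and return the extracted text.

    The URL and the header name below are assumptions, not verified settings.
    """
    with open(path, "rb") as f:
        data = f.read()
    req = urllib.request.Request(url, data=data, method="PUT")
    req.add_header("Accept", "text/plain")
    # Assumed per-request OCR toggle; confirm the header name for your version.
    req.add_header("X-Tika-OCRskipOcr", "true" if skip_ocr else "false")
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")


def run_mixed_ocr_test(files, parse, threads=8, requests_per_thread=20, seed=0):
    """Return the (path, skip_ocr) jobs whose threaded output differs from
    a single-threaded baseline for the same file and OCR setting."""
    # Baseline: parse every file once with OCR skipped and once with OCR on.
    baseline = {(p, s): parse(p, s) for p in files for s in (True, False)}

    # Build a reproducible random mix of files and OCR settings.
    rng = random.Random(seed)
    jobs = [
        (rng.choice(files), rng.choice((True, False)))
        for _ in range(threads * requests_per_thread)
    ]

    def check(job):
        path, skip_ocr = job
        return job if parse(path, skip_ocr) != baseline[job] else None

    with ThreadPoolExecutor(max_workers=threads) as pool:
        return [job for job in pool.map(check, jobs) if job is not None]
```

Against a live server this would be called as run_mixed_ocr_test(["testOCR.pdf", "testOCR.pptx"], put_to_tika). An empty list means every concurrent response matched its baseline, so no OCR setting leaked from one request into another.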
Do you have any custom settings on your parsers? Is there any way that you'd
be able to tell whether the time to process specific file types has gone up?
Or, can you identify problematic file types, or is this across the board
(IIRC, you said across the board)?
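
If it helps, one simple way to answer the file-type question is to time a sample of documents under each version and compare averages per extension. The sketch below is only an illustration: time_by_extension is a hypothetical helper, and parse stands for whatever callable you use to send a file through Tika. It measures wall-clock time per call, not CPU utilization, so treat it as a rough first signal only.

```python
import os
import time
from collections import defaultdict


def time_by_extension(paths, parse):
    """Return {extension: (file_count, average_seconds)} for the given files.

    `parse` is any callable taking a path; it is a placeholder for your own
    Tika invocation. Timing is wall-clock, not CPU time.
    """
    totals = defaultdict(lambda: [0, 0.0])  # extension -> [count, total seconds]
    for path in paths:
        # Group case-insensitively; files without an extension share one bucket.
        ext = os.path.splitext(path)[1].lower() or "(none)"
        start = time.perf_counter()
        parse(path)
        elapsed = time.perf_counter() - start
        totals[ext][0] += 1
        totals[ext][1] += elapsed
    return {ext: (count, total / count) for ext, (count, total) in totals.items()}
```

Running this over the same sample with 1.26 and with 2.2.0 and comparing the two results per extension should show whether the slowdown is concentrated in a few file types or spread across all of them.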
> High CPU utilization in Tika 2.2.0
> ----------------------------------
>
> Key: TIKA-3668
> URL: https://issues.apache.org/jira/browse/TIKA-3668
> Project: Tika
> Issue Type: Bug
> Reporter: Manjunath Dhongadi
> Priority: Major
>
> We recently upgraded Tika from version 1.26 to 2.2.0.
> CPU utilization has gone up drastically (6 to 8 times higher), both with
> Tesseract enabled and with Tesseract disabled.
> We are using tika-parsers-standard-package 2.2.0.
> Is this normal behavior for Tika 2.2.0?
> Are any tuning parameters available for this?