[
https://issues.apache.org/jira/browse/TIKA-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500874#comment-17500874
]
Tim Allison commented on TIKA-3668:
-----------------------------------
Thank you. I tried three things this morning.
1) Manually reviewed and re-tested the image-rendering and
inline-image-extraction code in the PDFParser. With debugging and custom
logging, I could see that, even when running multi-threaded, the code works as
expected: if the header says no-ocr, pages aren't rendered in the PDFParser
and inline images are not extracted.
2) In a single thread, I ran all the files in our unit tests with custom
logging to detect if the TesseractOCRParser was being called on any of the file
types when the header was set to no_ocr. I couldn't find any problems. The
TesseractOCRParser was never called to parse.
3) I ran pidstat with three settings; the client was single-threaded. The
results all basically look the same to me. The f
{noformat}
disable ocr parser via tika-config and do not include "no-ocr header"
~$ pidstat -p 254595 -u -T ALL
Linux 5.13.0-30-generic () 03/03/2022 _x86_64_ (8 CPU)
11:31:47 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:31:47 AM  1000    254595    0.16    0.00    0.00    0.00    0.17     2  java
11:31:47 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:31:47 AM  1000    254595    442080     11820         0  java
disable ocr parser via tika-config and include "no-ocr header"
~$ pidstat -p 250033 -u -T ALL
Linux 5.13.0-30-generic () 03/03/2022 _x86_64_ (8 CPU)
11:08:39 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:08:39 AM  1000    250033    0.16    0.00    0.00    0.00    0.17     5  java
11:08:39 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:08:39 AM  1000    250033    439390     11780         0  java
disable ocr via header (do not disable tesseract via tika config)
$ pidstat -p 252228 -u -T ALL
Linux 5.13.0-30-generic () 03/03/2022 _x86_64_ (8 CPU)
11:16:50 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:16:50 AM  1000    252228    0.16    0.00    0.00    0.00    0.17     5  java
11:16:50 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:16:50 AM  1000    252228    437250     12380         0  java
{noformat}
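As a rough check on how close the three pidstat runs above are, the usr-ms and system-ms figures can be summed and compared. This is only a sketch: the run labels and helper names are made up for illustration, and it assumes the three runs processed the same workload, which the output itself does not show.

```python
# CPU time per run in milliseconds, copied from the pidstat output above.
# The dictionary keys are illustrative labels, not Tika or pidstat terms.
runs = {
    "tika-config disabled, no header": {"usr_ms": 442080, "system_ms": 11820},
    "tika-config disabled, with header": {"usr_ms": 439390, "system_ms": 11780},
    "header only": {"usr_ms": 437250, "system_ms": 12380},
}


def total_cpu_ms(run):
    """User plus system CPU time for one run."""
    return run["usr_ms"] + run["system_ms"]


def relative_spread(values):
    """Gap between the largest and smallest value, relative to the smallest."""
    lowest, highest = min(values), max(values)
    return (highest - lowest) / lowest


totals = {name: total_cpu_ms(run) for name, run in runs.items()}
spread = relative_spread(list(totals.values()))

for name, total in totals.items():
    print(f"{name}: {total} ms")
print(f"relative spread: {spread:.2%}")
```

Under that assumption, the totals differ by less than 1% across the three settings.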
> High CPU utilization in Tika 2.2.0
> ----------------------------------
>
> Key: TIKA-3668
> URL: https://issues.apache.org/jira/browse/TIKA-3668
> Project: Tika
> Issue Type: Bug
> Reporter: Manjunath Dhongadi
> Priority: Major
>
> We recently upgraded Tika from version 1.26 to 2.2.0.
> CPU utilization has increased drastically (6 to 8 times higher), both
> with Tesseract enabled and with Tesseract disabled.
> We are using tika-parsers-standard-package 2.2.0.
> Is this normal behavior for Tika 2.2.0?
> Are any tuning parameters available for this?