[ https://issues.apache.org/jira/browse/TIKA-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500874#comment-17500874 ]
Tim Allison edited comment on TIKA-3668 at 3/3/22, 4:46 PM:
------------------------------------------------------------

Thank you. I tried three things this morning.

1) Manually reviewed and re-tested the image rendering and inline image extraction code in the PDFParser. With debugging and custom logging, I could see that even running multi-threaded, the code works as expected. If the header says no-ocr, pages aren't rendered in the PDFParser and inline images are not extracted.

2) In a single thread, I ran all the files in our unit tests with custom logging to detect if the TesseractOCRParser was being called on any of the file types when the header was set to no_ocr. I couldn't find any problems. The TesseractOCRParser was never called to parse.

3) I ran pidstat with three settings against all of our test files 10 times. The client was single threaded. I ran pidstat against the forked process, not the primary watcher process. The results all basically look the same to me.

{noformat}
disable ocr parser via tika-config and do not include "no-ocr header"

~$ pidstat -p 254595 -u -T ALL
Linux 5.13.0-30-generic ()   03/03/2022   _x86_64_   (8 CPU)

11:31:47 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:31:47 AM  1000    254595    0.16    0.00    0.00    0.00    0.17     2  java

11:31:47 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:31:47 AM  1000    254595    442080     11820         0  java

disable ocr parser via tika-config and include "no-ocr header"

~$ pidstat -p 250033 -u -T ALL
Linux 5.13.0-30-generic ()   03/03/2022   _x86_64_   (8 CPU)

11:08:39 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:08:39 AM  1000    250033    0.16    0.00    0.00    0.00    0.17     5  java

11:08:39 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:08:39 AM  1000    250033    439390     11780         0  java

disable ocr via header (do not disable tesseract via tika config)

$ pidstat -p 252228 -u -T ALL
Linux 5.13.0-30-generic ()   03/03/2022   _x86_64_   (8 CPU)

11:16:50 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:16:50 AM  1000    252228    0.16    0.00    0.00    0.00    0.17     5  java

11:16:50 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:16:50 AM  1000    252228    437250     12380         0  java
{noformat}

was (Author: talli...@mitre.org):
Thank you. I tried three things this morning.

1) Manually reviewed and re-tested the image rendering and inline image extraction code in the PDFParser. With debugging and custom logging, I could see that even running multi-threaded, the code works as expected. If the header says no-ocr, pages aren't rendered in the PDFParser and inline images are not extracted.

2) In a single thread, I ran all the files in our unit tests with custom logging to detect if the TesseractOCRParser was being called on any of the file types when the header was set to no_ocr. I couldn't find any problems. The TesseractOCRParser was never called to parse.

3) I ran pidstat with three settings; the client was single threaded. I ran pidstat against the forked process, not the primary watcher process. The results all basically look the same to me.

{noformat}
disable ocr parser via tika-config and do not include "no-ocr header"

~$ pidstat -p 254595 -u -T ALL
Linux 5.13.0-30-generic ()   03/03/2022   _x86_64_   (8 CPU)

11:31:47 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:31:47 AM  1000    254595    0.16    0.00    0.00    0.00    0.17     2  java

11:31:47 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:31:47 AM  1000    254595    442080     11820         0  java

disable ocr parser via tika-config and include "no-ocr header"

~$ pidstat -p 250033 -u -T ALL
Linux 5.13.0-30-generic ()   03/03/2022   _x86_64_   (8 CPU)

11:08:39 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:08:39 AM  1000    250033    0.16    0.00    0.00    0.00    0.17     5  java

11:08:39 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:08:39 AM  1000    250033    439390     11780         0  java

disable ocr via header (do not disable tesseract via tika config)

$ pidstat -p 252228 -u -T ALL
Linux 5.13.0-30-generic ()   03/03/2022   _x86_64_   (8 CPU)

11:16:50 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:16:50 AM  1000    252228    0.16    0.00    0.00    0.00    0.17     5  java

11:16:50 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:16:50 AM  1000    252228    437250     12380         0  java
{noformat}

> High CPU utilization in Tika 2.2.0
> ----------------------------------
>
>                 Key: TIKA-3668
>                 URL: https://issues.apache.org/jira/browse/TIKA-3668
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Manjunath Dhongadi
>            Priority: Major
>         Attachments: tika-config-no-tess.xml, tika-config.xml
>
>
> Recently we upgraded Tika from version 1.26 to 2.2.0.
> We see that CPU utilization has gone up drastically (6 to 8 times higher),
> both with Tesseract enabled and with Tesseract disabled.
> We are using tika-parsers-standard-package 2.2.0.
> Is this normal behavior for the newer version, Tika 2.2.0?
> Are any fine-tuning parameters available for this?

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
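
As a rough check on the remark that the three pidstat results "all basically look the same", the cumulative CPU times (usr-ms plus system-ms) reported in the comment can be compared directly. The sketch below only does arithmetic on the numbers quoted above; the dictionary labels and function names are illustrative shorthand, not anything from Tika or pidstat.

```python
# Cumulative CPU times in milliseconds (usr-ms, system-ms), copied from the
# pidstat output quoted in the comment. The labels are hypothetical shorthand
# for the three settings described there.
RUNS = {
    "tika-config disables OCR, no header": (442080, 11820),
    "tika-config disables OCR, header sent": (439390, 11780),
    "OCR disabled via header only": (437250, 12380),
}


def total_cpu_ms(usr_ms, system_ms):
    """Total CPU time consumed by the process: user plus system."""
    return usr_ms + system_ms


def relative_spread(values):
    """Gap between the largest and smallest value, relative to the smallest."""
    return (max(values) - min(values)) / min(values)


totals = {name: total_cpu_ms(*times) for name, times in RUNS.items()}
spread = relative_spread(list(totals.values()))

for name, total in totals.items():
    print(f"{name}: {total} ms")
print(f"spread between runs: {spread:.2%}")
```

With these figures the totals are 453900 ms, 451170 ms, and 449630 ms, so the three settings differ by just under 1% in total CPU time, which is consistent with the conclusion that the no-ocr setting makes no measurable difference in these runs.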