[ 
https://issues.apache.org/jira/browse/TIKA-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500874#comment-17500874
 ] 

Tim Allison edited comment on TIKA-3668 at 3/3/22, 4:46 PM:
------------------------------------------------------------

Thank you.  I tried three things this morning.

1) Manually reviewed and re-tested the image-rendering and inline-image 
extraction code in the PDFParser.  With debugging and custom logging, I could 
see that the code works as expected, even when running multi-threaded.  If the 
header says no-ocr, pages aren't rendered in the PDFParser and inline images 
are not extracted.

2) In a single thread, I ran all the files in our unit tests with custom 
logging to detect if the TesseractOCRParser was being called on any of the file 
types when the header was set to no_ocr.  I couldn't find any problems.  The 
TesseractOCRParser was never called to parse.

3) I ran pidstat with three settings against all of our test files 10 times. 
The client was single threaded.  I ran pidstat against the forked process, not 
the primary watcher process.  The results look essentially the same across all 
three settings.

{noformat}
disable ocr parser via tika-config and do not include "no-ocr" header
~$ pidstat -p 254595 -u -T ALL 
Linux 5.13.0-30-generic ()      03/03/2022      _x86_64_        (8 CPU)

11:31:47 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:31:47 AM  1000    254595    0.16    0.00    0.00    0.00    0.17     2  java

11:31:47 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:31:47 AM  1000    254595    442080     11820         0  java

disable ocr parser via tika-config and include "no-ocr" header

~$ pidstat -p 250033 -u -T ALL
Linux 5.13.0-30-generic ()      03/03/2022      _x86_64_        (8 CPU)

11:08:39 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:08:39 AM  1000    250033    0.16    0.00    0.00    0.00    0.17     5  java

11:08:39 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:08:39 AM  1000    250033    439390     11780         0  java


disable ocr via header (do not disable tesseract via tika config)
$ pidstat -p 252228 -u -T ALL 
Linux 5.13.0-30-generic ()      03/03/2022      _x86_64_        (8 CPU)

11:16:50 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:16:50 AM  1000    252228    0.16    0.00    0.00    0.00    0.17     5  java

11:16:50 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:16:50 AM  1000    252228    437250     12380         0  java
{noformat}
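
To put a number on how close the three runs are, here is a minimal sketch that 
sums the usr-ms and system-ms figures from the pidstat output above and 
compares each total with the lowest one.  The run labels and the helper name 
total_cpu_ms are made up for illustration; only the millisecond values come 
from the output.

```python
# CPU times in milliseconds, copied from the three pidstat runs above.
# The dictionary keys are illustrative labels, not pidstat output.
runs = {
    "tika-config disabled, no header": (442080, 11820),
    "tika-config disabled, with header": (439390, 11780),
    "header only": (437250, 12380),
}


def total_cpu_ms(usr_ms, system_ms):
    """Total CPU time charged to the process (user + system)."""
    return usr_ms + system_ms


totals = {label: total_cpu_ms(*times) for label, times in runs.items()}
lowest = min(totals.values())

# Percentage by which each run exceeds the cheapest run.
spread_pct = {
    label: 100.0 * (total - lowest) / lowest for label, total in totals.items()
}

for label in runs:
    print(f"{label}: {totals[label]} ms (+{spread_pct[label]:.2f}% vs lowest)")
```

By this measure the three totals differ by less than 1%.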



> High CPU utilization in Tika 2.2.0
> ----------------------------------
>
>                 Key: TIKA-3668
>                 URL: https://issues.apache.org/jira/browse/TIKA-3668
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Manjunath Dhongadi
>            Priority: Major
>         Attachments: tika-config-no-tess.xml, tika-config.xml
>
>
> We recently upgraded Tika from 1.26 to 2.2.0.
> CPU utilization has gone up drastically (6 to 8 times higher), both with 
> Tesseract enabled and with Tesseract disabled.
> We are using tika-parsers-standard-package 2.2.0.
> Is this normal behavior for Tika 2.2.0? 
> Are any tuning parameters available to address it?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
