[jira] [Comment Edited] (TIKA-3103) Tesseract fails to respect timeouts and clean up after itself

Radim Rehurek (Jira) Wed, 20 May 2020 08:38:09 -0700


    [ https://issues.apache.org/jira/browse/TIKA-3103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17112356#comment-17112356 ]


Radim Rehurek edited comment on TIKA-3103 at 5/20/20, 3:37 PM:
---------------------------------------------------------------

I take it back. There are still Tesseract processes that have been up for 4+ 
minutes, even with `tika-config.xml` and `X-Tika-OCRTimeout` set to 30:

 

```
Tasks: 84 total, 11 running, 72 sleeping, 0 stopped, 1 zombie
%Cpu(s): 99.9 us, 0.1 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 64344.3 total, 7654.1 free, 45423.2 used, 11267.0 buff/cache
MiB Swap: 32736.0 total, 17849.2 free, 14886.8 used. 11932.5 avail Mem

  PID USER  PR  NI    VIRT     RES    SHR  S   %CPU  %MEM       TIME+  COMMAND
11403 root  20   0  232336  165780  11912  R  162.0   0.3  *12:56.57*  tesseract
27935 root  20   0  173224  106872  12000  R  150.0   0.2   *9:06.77*  tesseract
28508 root  20   0  150164   83960  12024  R  148.0   0.1   *8:57.43*  tesseract
11663 root  20   0  182396  116284  12028  R  121.0   0.2  *12:28.29*  tesseract
24297 root  20   0  139424   72448  11564  R  109.0   0.1   *9:04.79*  tesseract
11659 root  20   0  182276  115956  12064  R  104.0   0.2  *12:44.52*  tesseract
28519 root  20   0  181128  114368  11964  R  104.0   0.2   *8:48.41*  tesseract
11944 root  20   0  180840  114300  12100  R  102.0   0.2  *12:29.86*  tesseract
23419 root  20   0  174876  107872  11932  R  101.0   0.2   *9:21.44*  tesseract
23426 root  20   0  144064   77112  11464  R  100.0   0.1   *9:12.65*  tesseract
```

 

I don't understand how this works. Both the CPU and disk are still a mess.
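
For anyone trying to reproduce this, a request carrying the timeout header could be built roughly as in the minimal Python sketch below. The host, the `/tika` endpoint path, the `Accept` header and the helper name `build_ocr_request` are assumptions for illustration; port 9998 matches the `-p 9998` flag in the issue description, and the header name and value are the ones quoted in this comment. The sketch only constructs the request and says nothing about how Tika enforces the timeout.

```python
import urllib.request

# Assumed server location: port 9998 matches the "-p 9998" flag in the report;
# the host and the /tika endpoint path are illustrative assumptions.
TIKA_URL = "http://localhost:9998/tika"


def build_ocr_request(document_bytes, ocr_timeout_seconds=30):
    """Build (but do not send) a PUT request asking for a per-request OCR timeout."""
    request = urllib.request.Request(TIKA_URL, data=document_bytes, method="PUT")
    # Header named in the comment above; the value is assumed to be in seconds.
    request.add_header("X-Tika-OCRTimeout", str(ocr_timeout_seconds))
    request.add_header("Accept", "text/plain")
    return request


if __name__ == "__main__":
    req = build_ocr_request(b"placeholder document bytes")
    print(req.get_method(), req.full_url, req.header_items())
```

Passing the request to `urllib.request.urlopen(req, timeout=...)` would add a client-side limit as well, but that only bounds the HTTP call; it does nothing about tesseract processes still running on the server.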


was (Author: piskvorky):
I take it back. There are still Tesseract processes that have been up for 4+ 
minutes, even with `tika-config.xml` and `X-Tika-OCRTimeout` set to 30:

 

```
Tasks: 84 total, 11 running, 72 sleeping, 0 stopped, 1 zombie
%Cpu(s): 99.9 us, 0.1 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 64344.3 total, 7654.1 free, 45423.2 used, 11267.0 buff/cache
MiB Swap: 32736.0 total, 17849.2 free, 14886.8 used. 11932.5 avail Mem

  PID USER  PR  NI    VIRT     RES    SHR  S   %CPU  %MEM       TIME+  COMMAND
11403 root  20   0  232336  165780  11912  R  162.0   0.3  *12:56.57*  tesseract
27935 root  20   0  173224  106872  12000  R  150.0   0.2   *9:06.77*  tesseract
28508 root  20   0  150164   83960  12024  R  148.0   0.1   *8:57.43*  tesseract
11663 root  20   0  182396  116284  12028  R  121.0   0.2  *12:28.29*  tesseract
24297 root  20   0  139424   72448  11564  R  109.0   0.1   *9:04.79*  tesseract
11659 root  20   0  182276  115956  12064  R  104.0   0.2  *12:44.52*  tesseract
28519 root  20   0  181128  114368  11964  R  104.0   0.2   *8:48.41*  tesseract
11944 root  20   0  180840  114300  12100  R  102.0   0.2  *12:29.86*  tesseract
23419 root  20   0  174876  107872  11932  R  101.0   0.2   *9:21.44*  tesseract
23426 root  20   0  144064   77112  11464  R  100.0   0.1   *9:12.65*  tesseract
```

 

I don't understand how this works.

> Tesseract fails to respect timeouts and clean up after itself
> -------------------------------------------------------------
>
>                 Key: TIKA-3103
>                 URL: https://issues.apache.org/jira/browse/TIKA-3103
>             Project: Tika
>          Issue Type: Bug
>          Components: ocr
>    Affects Versions: 1.24.1
>            Reporter: Radim Rehurek
>            Priority: Critical
>
> We're using the Tika Server with OCR:
> _java -jar /opt/tika/tika-server-1.24.1.jar -p 9998 -spawnChild -JXmx500m_
>  
> Two undesirable things happen:
> h3. 1. The CPU runs at 100% for >10 minutes, long after any Tika requests have finished.
> These processes show in _top_ as "tesseract" (OCR) and consume all CPU cores 
> at 100%.
> They eventually die (or finish?) but the machine is unusable in the meantime.
> *Expected behaviour:* Tika cleans up spawned processes after itself: at most 
> after its timeout limit (which is 2 minutes I believe?)
> h3. 2. The temp directory is full of files like:
> {{root@38acd588ee22:/# ll /tmp/}}
> {{total 197320}}
> {{drwxrwxrwt 1 root root 24576 May 20 09:35 ./}}
> {{drwxr-xr-x 1 root root 4096 May 20 08:40 ../}}
> {{-rw-r--r-- 1 root root 9273920 May 20 08:56 TIKA_streamstore_11144988934311367241.tmp}}
> {{-rw-r--r-- 1 root root 8938048 May 20 08:57 TIKA_streamstore_11649337406504198407.tmp}}
> {{-rw-r--r-- 1 root root 9478720 May 20 08:56 TIKA_streamstore_13551529918743702933.tmp}}
> {{-rw-r--r-- 1 root root 9151040 May 20 08:57 TIKA_streamstore_13568226047805501311.tmp}}
> {{-rw-r--r-- 1 root root 7701056 May 20 08:56 TIKA_streamstore_13908373602714189455.tmp}}
> {{…}}
> {{-rw-r--r-- 1 root root 33367 May 20 08:55 apache-tika-11167866320029165062.tmp}}
> {{-rw-r--r-- 1 root root 44353 May 20 08:54 apache-tika-1152515137515755865.tmp}}
> {{-rw-r--r-- 1 root root 245279 May 20 08:52 apache-tika-12106368488659105236.tmp}}
> {{-rw-r--r-- 1 root root 1759 May 20 08:47 apache-tika-12291680472524021463.tmp}}
> {{…}}
>  
> slowly filling up the disk.
> *Expected behaviour*: Tika cleans up disk space after itself.
>  
> These bugs are critical for us. What's the best way to avoid them?
>  
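
The expected behaviour in point 1 amounts to a hard deadline on each spawned process. A minimal Python sketch of that general pattern follows. It is not Tika's implementation (Tika is written in Java), `run_with_deadline` is a hypothetical helper, and a sleeping Python child stands in for a long-running OCR job.

```python
import subprocess
import sys


def run_with_deadline(command, timeout_seconds):
    """Run a child process and kill it if it exceeds the deadline.

    Returns the exit code, or None when the child had to be killed.
    """
    process = subprocess.Popen(command)
    try:
        return process.wait(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        process.kill()  # forcibly stop the child (SIGKILL on POSIX)
        process.wait()  # reap it so no zombie entry is left behind
        return None


if __name__ == "__main__":
    # Stand-in for a long OCR job: a child that sleeps for 60 seconds.
    slow_job = [sys.executable, "-c", "import time; time.sleep(60)"]
    print(run_with_deadline(slow_job, timeout_seconds=1))  # prints None
```

Note that this only stops the direct child. If the child starts further processes of its own, those would need separate handling, for example by killing the whole process group.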


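As a stopgap for point 2, leftover temp files could be located by name pattern and age. The Python sketch below is one possible approach under stated assumptions: the two name patterns are taken from the `/tmp` listing in the report, while the 10-minute age threshold and the helper name `find_stale_tika_temp_files` are illustrative choices. It only lists candidates, since deleting files that a running server still has open could break in-flight requests.

```python
import fnmatch
import os
import time

# Name patterns taken from the /tmp listing in the report.
TIKA_TEMP_PATTERNS = ("TIKA_streamstore_*.tmp", "apache-tika-*.tmp")


def find_stale_tika_temp_files(directory="/tmp", max_age_seconds=600, now=None):
    """Return sorted paths of Tika-looking temp files older than max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for entry in os.scandir(directory):
        if not entry.is_file(follow_symlinks=False):
            continue  # skip directories, symlinks and other non-regular entries
        if not any(fnmatch.fnmatch(entry.name, p) for p in TIKA_TEMP_PATTERNS):
            continue  # name does not look like a Tika temp file
        age = now - entry.stat(follow_symlinks=False).st_mtime
        if age > max_age_seconds:
            stale.append(entry.path)
    return sorted(stale)


if __name__ == "__main__":
    for path in find_stale_tika_temp_files():
        print(path)  # review the list before deleting anything
```
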

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
