[ 
https://issues.apache.org/jira/browse/TIKA-3103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Radim Rehurek updated TIKA-3103:
--------------------------------
    Description: 
We're using the Tika Server with OCR:

_java -jar /opt/tika/tika-server-1.24.1.jar -p 9998 -spawnChild -JXmx500m_

 

Two undesirable things happen:
h3. 1. The CPU runs at 100% for >10 minutes, long after any Tika requests should have finished.

These (zombie?) processes show in _top_ as Tesseract and consume all CPU cores 
at 100%.

They eventually die, but the machine is unusable in the meantime.

*Expected behaviour:* Tika cleans up the processes it spawns, at the latest 
when its timeout limit expires (2 minutes, I believe).
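
As an illustration of the expected behaviour (a sketch in Python, not Tika's actual implementation): the child is started in its own process group, and the whole group is killed once the limit expires, so helper processes cannot outlive the request. The name run_with_timeout and the one-second limit are made up for this example, and the approach assumes a POSIX system.

```python
import os
import signal
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; if it exceeds timeout_s, kill its whole process group.

    Returns the exit code, or None if the command was killed on timeout.
    Hypothetical helper for illustration; POSIX only (uses process groups).
    """
    # start_new_session=True makes the child a session and group leader, so
    # anything it spawns can be signalled together with it.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        try:
            # The group id equals the child's pid because it leads the group.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the child exited between the timeout and the kill
        proc.wait()  # reap the child so it does not linger as a zombie
        return None


if __name__ == "__main__":
    # Stand-in for a long OCR job: a Python child that sleeps for 60 seconds.
    slow = [sys.executable, "-c", "import time; time.sleep(60)"]
    print(run_with_timeout(slow, timeout_s=1))
```

Killing the process group rather than only the direct child is the point of the sketch: a plain kill of the parent can leave grandchildren running at full CPU.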
h3. 2. The temp directory is full of files like:

{{root@38acd588ee22:/# ll /tmp/}}
 {{total 197320}}
 {{drwxrwxrwt 1 root root 24576 May 20 09:35 ./}}
 {{drwxr-xr-x 1 root root 4096 May 20 08:40 ../}}
 {{-rw-r--r-- 1 root root 9273920 May 20 08:56 TIKA_streamstore_11144988934311367241.tmp}}
 {{-rw-r--r-- 1 root root 8938048 May 20 08:57 TIKA_streamstore_11649337406504198407.tmp}}
 {{-rw-r--r-- 1 root root 9478720 May 20 08:56 TIKA_streamstore_13551529918743702933.tmp}}
 {{-rw-r--r-- 1 root root 9151040 May 20 08:57 TIKA_streamstore_13568226047805501311.tmp}}
 {{-rw-r--r-- 1 root root 7701056 May 20 08:56 TIKA_streamstore_13908373602714189455.tmp}}
 {{…}}
 {{-rw-r--r-- 1 root root 33367 May 20 08:55 apache-tika-11167866320029165062.tmp}}
 {{-rw-r--r-- 1 root root 44353 May 20 08:54 apache-tika-1152515137515755865.tmp}}
 {{-rw-r--r-- 1 root root 245279 May 20 08:52 apache-tika-12106368488659105236.tmp}}
 {{-rw-r--r-- 1 root root 1759 May 20 08:47 apache-tika-12291680472524021463.tmp}}

{{…}}

 

slowly filling up the disk.

*Expected behaviour*: Tika cleans up disk space after itself.
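
Until the cleanup happens inside Tika, an external sweep could serve as a stopgap. The following is a minimal Python sketch, not part of Tika: the function name remove_stale_tmp and the age threshold are assumptions, and the two file-name prefixes are simply the ones visible in the listing above. The threshold should be longer than the longest expected request, since a younger file may still be in use.

```python
import os
import tempfile
import time
from pathlib import Path

# Prefixes copied from the /tmp listing in this report.
PREFIXES = ("TIKA_streamstore_", "apache-tika-")


def remove_stale_tmp(tmp_dir, max_age_s, now=None):
    """Delete matching *.tmp files older than max_age_s seconds.

    Returns the sorted names of the files that were removed.
    Hypothetical helper for illustration only.
    """
    now = time.time() if now is None else now
    removed = []
    for path in Path(tmp_dir).iterdir():
        matches = path.name.startswith(PREFIXES) and path.suffix == ".tmp"
        if not (matches and path.is_file()):
            continue
        try:
            if now - path.stat().st_mtime > max_age_s:
                path.unlink()
                removed.append(path.name)
        except FileNotFoundError:
            pass  # the file vanished in the meantime, nothing left to do
    return sorted(removed)


if __name__ == "__main__":
    # Demo on a throwaway directory instead of the real /tmp.
    with tempfile.TemporaryDirectory() as d:
        old = Path(d, "apache-tika-123.tmp")
        old.write_text("x")
        os.utime(old, (0, 0))  # backdate the file to the epoch
        print(remove_stale_tmp(d, max_age_s=600))
```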

 

These bugs are critical for us. What's the best way to avoid these issues?

 

  was:
We're using the Tika Server with OCR:

_java -jar /opt/tika/tika-server-1.24.1.jar -p 9998 -spawnChild -JXmx500m_

 

This used to work fine in previous versions (1.22, without _-spawnChild_).

But after upgrading to 1.24.1 with _-spawnChild_, two undesirable things 
happen:
h3. 1. The CPU runs at 100% for >10 minutes, long after any Tika requests should have finished.

These (zombie?) processes show in _top_ as Tesseract and consume all CPU cores 
at 100%.

They eventually die, but the machine is unusable in the meantime.

*Expected behaviour:* Tika cleans up the processes it spawns, at the latest 
when its timeout limit expires (2 minutes, I believe).
h3. 2. The temp directory is full of files like:

{{root@38acd588ee22:/# ll /tmp/}}
 {{total 197320}}
 {{drwxrwxrwt 1 root root 24576 May 20 09:35 ./}}
 {{drwxr-xr-x 1 root root 4096 May 20 08:40 ../}}
 {{-rw-r--r-- 1 root root 9273920 May 20 08:56 TIKA_streamstore_11144988934311367241.tmp}}
 {{-rw-r--r-- 1 root root 8938048 May 20 08:57 TIKA_streamstore_11649337406504198407.tmp}}
 {{-rw-r--r-- 1 root root 9478720 May 20 08:56 TIKA_streamstore_13551529918743702933.tmp}}
 {{-rw-r--r-- 1 root root 9151040 May 20 08:57 TIKA_streamstore_13568226047805501311.tmp}}
 {{-rw-r--r-- 1 root root 7701056 May 20 08:56 TIKA_streamstore_13908373602714189455.tmp}}
 {{…}}
 {{-rw-r--r-- 1 root root 33367 May 20 08:55 apache-tika-11167866320029165062.tmp}}
 {{-rw-r--r-- 1 root root 44353 May 20 08:54 apache-tika-1152515137515755865.tmp}}
 {{-rw-r--r-- 1 root root 245279 May 20 08:52 apache-tika-12106368488659105236.tmp}}
 {{-rw-r--r-- 1 root root 1759 May 20 08:47 apache-tika-12291680472524021463.tmp}}

{{…}}

 

slowly filling up the disk.

*Expected behaviour*: Tika cleans up disk space after itself.

 

These bugs are critical for us, so we had to revert to 1.22. What's the best 
way to avoid these issues?

 


> Tesseract fails to respect timeouts and clean up after itself
> -------------------------------------------------------------
>
>                 Key: TIKA-3103
>                 URL: https://issues.apache.org/jira/browse/TIKA-3103
>             Project: Tika
>          Issue Type: Bug
>          Components: ocr
>    Affects Versions: 1.24.1
>            Reporter: Radim Rehurek
>            Priority: Critical
>
> We're using the Tika Server with OCR:
> _java -jar /opt/tika/tika-server-1.24.1.jar -p 9998 -spawnChild -JXmx500m_
>  
> Two undesirable things happen:
> h3. 1. The CPU runs at 100% for >10 minutes, long after any Tika requests should have finished.
> These (zombie?) processes show in _top_ as Tesseract and consume all CPU 
> cores at 100%.
> They eventually die, but the machine is unusable in the meantime.
> *Expected behaviour:* Tika cleans up the processes it spawns, at the latest 
> when its timeout limit expires (2 minutes, I believe).
> h3. 2. The temp directory is full of files like:
> {{root@38acd588ee22:/# ll /tmp/}}
>  {{total 197320}}
>  {{drwxrwxrwt 1 root root 24576 May 20 09:35 ./}}
>  {{drwxr-xr-x 1 root root 4096 May 20 08:40 ../}}
>  {{-rw-r--r-- 1 root root 9273920 May 20 08:56 TIKA_streamstore_11144988934311367241.tmp}}
>  {{-rw-r--r-- 1 root root 8938048 May 20 08:57 TIKA_streamstore_11649337406504198407.tmp}}
>  {{-rw-r--r-- 1 root root 9478720 May 20 08:56 TIKA_streamstore_13551529918743702933.tmp}}
>  {{-rw-r--r-- 1 root root 9151040 May 20 08:57 TIKA_streamstore_13568226047805501311.tmp}}
>  {{-rw-r--r-- 1 root root 7701056 May 20 08:56 TIKA_streamstore_13908373602714189455.tmp}}
>  {{…}}
>  {{-rw-r--r-- 1 root root 33367 May 20 08:55 apache-tika-11167866320029165062.tmp}}
>  {{-rw-r--r-- 1 root root 44353 May 20 08:54 apache-tika-1152515137515755865.tmp}}
>  {{-rw-r--r-- 1 root root 245279 May 20 08:52 apache-tika-12106368488659105236.tmp}}
>  {{-rw-r--r-- 1 root root 1759 May 20 08:47 apache-tika-12291680472524021463.tmp}}
> {{…}}
>  
> slowly filling up the disk.
> *Expected behaviour*: Tika cleans up disk space after itself.
>  
> These bugs are critical for us. What's the best way to avoid these issues?
>  



