[ https://issues.apache.org/jira/browse/TIKA-3103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17112356#comment-17112356 ]
Radim Rehurek edited comment on TIKA-3103 at 5/20/20, 3:44 PM:
---------------------------------------------------------------

I take it back. There are still Tesseract processes that have been up for 2+ minutes, even with `tika-config.xml` and `X-Tika-OCRTimeout` set to `30`:

```
top - 15:39:13 up 490 days, 20:21, 0 users, load average: 34.39, 36.20, 28.37
Tasks:  84 total,   9 running,  74 sleeping,   0 stopped,   1 zombie
%Cpu(s): 99.8 us,  0.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  64344.3 total,   6492.5 free,  45222.7 used,  12629.2 buff/cache
MiB Swap:  32736.0 total,  17848.7 free,  14887.3 used.  12133.1 avail Mem

  PID USER  PR  NI   VIRT    RES   SHR S  %CPU %MEM    TIME+ COMMAND
11944 root  20   0 180840 114300 12100 R 192.0  0.2 16:08.89 tesseract
23426 root  20   0 144064  77112 11464 R 187.0  0.1 12:46.41 tesseract
11663 root  20   0 182396 116284 12028 R 164.0  0.2 15:58.92 tesseract
23419 root  20   0 175532 108916 11932 R 160.0  0.2 13:00.50 tesseract
11659 root  20   0 182276 115956 12064 R 141.0  0.2 16:13.39 tesseract
24297 root  20   0 141420  74780 11564 R 129.0  0.1 12:43.09 tesseract
28508 root  20   0 151148  84472 12024 R 114.0  0.1 12:39.77 tesseract
28519 root  20   0 181868 115412 11964 R 108.0  0.2 12:23.54 tesseract
```

I don't understand how this works. Both the CPU and disk are still a mess. Maybe it's something to do with concurrency in Tika server instead? We're sending up to 8 requests at a time – can that break Tika's timeout logic or configuration logic somehow?
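Until the timeout is honoured, a process watchdog can serve as a stopgap. The sketch below is only a possible workaround, not anything Tika provides: the `stale_pids` helper and the 30-second cutoff are illustrative choices, and it assumes a Linux `ps` that supports the `etimes` keyword.

```shell
#!/bin/sh
# Hypothetical watchdog sketch (not a Tika feature): find processes of a given
# name whose elapsed runtime exceeds a cutoff, so they can be killed manually
# or from cron.

stale_pids() {
    name="$1"
    max_age_seconds="$2"
    # ps -eo pid=,etimes=,comm= lists every process as: PID ELAPSED-SECONDS NAME
    ps -eo pid=,etimes=,comm= | awk -v n="$name" -v t="$max_age_seconds" \
        '$3 == n && $2 + 0 > t { print $1 }'
}

# Example: kill any tesseract process older than 30 s, matching the
# X-Tika-OCRTimeout value of 30 set above (illustrative, verify before using):
# for pid in $(stale_pids tesseract 30); do kill "$pid"; done
```

This only treats the symptom; the underlying question of why the configured timeout is not applied under concurrent requests remains open.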
> Tesseract fails to respect timeouts and clean up after itself
> -------------------------------------------------------------
>
>                 Key: TIKA-3103
>                 URL: https://issues.apache.org/jira/browse/TIKA-3103
>             Project: Tika
>          Issue Type: Bug
>          Components: ocr
>    Affects Versions: 1.24.1
>            Reporter: Radim Rehurek
>            Priority: Critical
>
> We're using the Tika Server with OCR:
> _java -jar /opt/tika/tika-server-1.24.1.jar -p 9998 -spawnChild -JXmx500m_
>
> Two undesirable things happen:
>
> h3. 1. The CPU runs at 100% for >10 minutes, long after any Tika requests have finished.
>
> These processes show in _top_ as "tesseract" (OCR) and consume all CPU cores at 100%.
> They eventually die (or finish?) but the machine is unusable in the meantime.
>
> *Expected behaviour:* Tika cleans up spawned processes after itself, at most after its timeout limit (which is 2 minutes, I believe?).
>
> h3. 2. The temp dir is full of files like:
>
> {{root@38acd588ee22:/# ll /tmp/}}
> {{total 197320}}
> {{drwxrwxrwt 1 root root    24576 May 20 09:35 ./}}
> {{drwxr-xr-x 1 root root     4096 May 20 08:40 ../}}
> {{-rw-r--r-- 1 root root  9273920 May 20 08:56 TIKA_streamstore_11144988934311367241.tmp}}
> {{-rw-r--r-- 1 root root  8938048 May 20 08:57 TIKA_streamstore_11649337406504198407.tmp}}
> {{-rw-r--r-- 1 root root  9478720 May 20 08:56 TIKA_streamstore_13551529918743702933.tmp}}
> {{-rw-r--r-- 1 root root  9151040 May 20 08:57 TIKA_streamstore_13568226047805501311.tmp}}
> {{-rw-r--r-- 1 root root  7701056 May 20 08:56 TIKA_streamstore_13908373602714189455.tmp}}
> {{…}}
> {{-rw-r--r-- 1 root root    33367 May 20 08:55 apache-tika-11167866320029165062.tmp}}
> {{-rw-r--r-- 1 root root    44353 May 20 08:54 apache-tika-1152515137515755865.tmp}}
> {{-rw-r--r-- 1 root root   245279 May 20 08:52 apache-tika-12106368488659105236.tmp}}
> {{-rw-r--r-- 1 root root     1759 May 20 08:47 apache-tika-12291680472524021463.tmp}}
> {{…}}
>
> slowly filling up the disk.
>
> *Expected behaviour*: Tika cleans up disk space after itself.
>
> These bugs are critical for us. What's the best way to avoid them?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
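Until the temp-file leak is fixed upstream, one stopgap is to sweep stale Tika temp files from cron. This is only a sketch of a possible workaround, not a Tika feature: the `sweep_tika_tmp` name and the 10-minute cutoff are illustrative, and GNU find's `-mmin`/`-delete` options are assumed.

```shell
#!/bin/sh
# Hypothetical cleanup sketch: delete Tika temp files older than N minutes.
# Restricted to the naming patterns seen in the /tmp listing above, so nothing
# else in the directory is touched.

sweep_tika_tmp() {
    dir="$1"
    min_age_minutes="$2"
    # -mmin +N matches files last modified more than N minutes ago
    find "$dir" -maxdepth 1 -type f \
        \( -name 'apache-tika-*.tmp' -o -name 'TIKA_streamstore_*.tmp' \) \
        -mmin +"$min_age_minutes" -delete
}

# Example invocation (assumption: anything Tika left behind for >10 min is dead):
# sweep_tika_tmp /tmp 10
```

The age cutoff should stay comfortably above the longest expected parse, otherwise the sweep could delete a temp file that a live request is still reading.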