[ https://issues.apache.org/jira/browse/TIKA-3103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Radim Rehurek updated TIKA-3103:
--------------------------------
    Description:
We're using the Tika Server with OCR:

_java -jar /opt/tika/tika-server-1.24.1.jar -p 9998 -spawnChild -JXmx500m_

Two undesirable things happen:
h3. 1. The CPU runs at 100% for >10 minutes, long after any Tika requests should have finished.

These (zombie?) processes show in _top_ as Tesseract and consume all CPU cores at 100%.

They eventually die but the machine is unusable in the meantime.

*Expected behaviour:* Tika cleans up spawned processes after itself: at most after its timeout limit (which is 2 minutes I believe?)
h3. 2. The temp directory is full of files like:

{{root@38acd588ee22:/# ll /tmp/}}
{{total 197320}}
{{drwxrwxrwt 1 root root 24576 May 20 09:35 ./}}
{{drwxr-xr-x 1 root root 4096 May 20 08:40 ../}}
{{-rw-r--r-- 1 root root 9273920 May 20 08:56 TIKA_streamstore_11144988934311367241.tmp}}
{{-rw-r--r-- 1 root root 8938048 May 20 08:57 TIKA_streamstore_11649337406504198407.tmp}}
{{-rw-r--r-- 1 root root 9478720 May 20 08:56 TIKA_streamstore_13551529918743702933.tmp}}
{{-rw-r--r-- 1 root root 9151040 May 20 08:57 TIKA_streamstore_13568226047805501311.tmp}}
{{-rw-r--r-- 1 root root 7701056 May 20 08:56 TIKA_streamstore_13908373602714189455.tmp}}
{{…}}
{{-rw-r--r-- 1 root root 33367 May 20 08:55 apache-tika-11167866320029165062.tmp}}
{{-rw-r--r-- 1 root root 44353 May 20 08:54 apache-tika-1152515137515755865.tmp}}
{{-rw-r--r-- 1 root root 245279 May 20 08:52 apache-tika-12106368488659105236.tmp}}
{{-rw-r--r-- 1 root root 1759 May 20 08:47 apache-tika-12291680472524021463.tmp}}
{{…}}

slowly filling up the disk.

*Expected behaviour*: Tika cleans up disk space after itself.

These bugs are critical for us. What's the best way to avoid these issues?

  was:
We're using the Tika Server with OCR:

_java -jar /opt/tika/tika-server-1.24.1.jar -p 9998 -spawnChild -JXmx500m_

This used to work fine in previous versions (1.22, without _-spawnChild_).
But after upgrading to 1.24.1 with _-spawnChild_, two undesirable things happen:
h3. 1. The CPU runs at 100% for >10 minutes, long after any Tika requests should have finished.

These (zombie?) processes show in _top_ as Tesseract and consume all CPU cores at 100%.

They eventually die but the machine is unusable in the meantime.

*Expected behaviour:* Tika cleans up spawned processes after itself: at most after its timeout limit (which is 2 minutes I believe?)
h3. 2. The temp directory is full of files like:

{{root@38acd588ee22:/# ll /tmp/}}
{{total 197320}}
{{drwxrwxrwt 1 root root 24576 May 20 09:35 ./}}
{{drwxr-xr-x 1 root root 4096 May 20 08:40 ../}}
{{-rw-r--r-- 1 root root 9273920 May 20 08:56 TIKA_streamstore_11144988934311367241.tmp}}
{{-rw-r--r-- 1 root root 8938048 May 20 08:57 TIKA_streamstore_11649337406504198407.tmp}}
{{-rw-r--r-- 1 root root 9478720 May 20 08:56 TIKA_streamstore_13551529918743702933.tmp}}
{{-rw-r--r-- 1 root root 9151040 May 20 08:57 TIKA_streamstore_13568226047805501311.tmp}}
{{-rw-r--r-- 1 root root 7701056 May 20 08:56 TIKA_streamstore_13908373602714189455.tmp}}
{{…}}
{{-rw-r--r-- 1 root root 33367 May 20 08:55 apache-tika-11167866320029165062.tmp}}
{{-rw-r--r-- 1 root root 44353 May 20 08:54 apache-tika-1152515137515755865.tmp}}
{{-rw-r--r-- 1 root root 245279 May 20 08:52 apache-tika-12106368488659105236.tmp}}
{{-rw-r--r-- 1 root root 1759 May 20 08:47 apache-tika-12291680472524021463.tmp}}
{{…}}

slowly filling up the disk.

*Expected behaviour*: Tika cleans up disk space after itself.

These bugs are critical for us, so we had to revert to 1.22. What's the best way to avoid these issues?

> Tesseract fails to respect timeouts and clean up after itself
> -------------------------------------------------------------
>
>                 Key: TIKA-3103
>                 URL: https://issues.apache.org/jira/browse/TIKA-3103
>             Project: Tika
>          Issue Type: Bug
>          Components: ocr
>    Affects Versions: 1.24.1
>            Reporter: Radim Rehurek
>            Priority: Critical
>
> We're using the Tika Server with OCR:
> _java -jar /opt/tika/tika-server-1.24.1.jar -p 9998 -spawnChild -JXmx500m_
>
> Two undesirable things happen:
> h3. 1. The CPU runs at 100% for >10 minutes, long after any Tika requests should have finished.
> These (zombie?) processes show in _top_ as Tesseract and consume all CPU cores at 100%.
> They eventually die but the machine is unusable in the meantime.
> *Expected behaviour:* Tika cleans up spawned processes after itself: at most after its timeout limit (which is 2 minutes I believe?)
> h3. 2. The temp directory is full of files like:
> {{root@38acd588ee22:/# ll /tmp/}}
> {{total 197320}}
> {{drwxrwxrwt 1 root root 24576 May 20 09:35 ./}}
> {{drwxr-xr-x 1 root root 4096 May 20 08:40 ../}}
> {{-rw-r--r-- 1 root root 9273920 May 20 08:56 TIKA_streamstore_11144988934311367241.tmp}}
> {{-rw-r--r-- 1 root root 8938048 May 20 08:57 TIKA_streamstore_11649337406504198407.tmp}}
> {{-rw-r--r-- 1 root root 9478720 May 20 08:56 TIKA_streamstore_13551529918743702933.tmp}}
> {{-rw-r--r-- 1 root root 9151040 May 20 08:57 TIKA_streamstore_13568226047805501311.tmp}}
> {{-rw-r--r-- 1 root root 7701056 May 20 08:56 TIKA_streamstore_13908373602714189455.tmp}}
> {{…}}
> {{-rw-r--r-- 1 root root 33367 May 20 08:55 apache-tika-11167866320029165062.tmp}}
> {{-rw-r--r-- 1 root root 44353 May 20 08:54 apache-tika-1152515137515755865.tmp}}
> {{-rw-r--r-- 1 root root 245279 May 20 08:52 apache-tika-12106368488659105236.tmp}}
> {{-rw-r--r-- 1 root root 1759 May 20 08:47 apache-tika-12291680472524021463.tmp}}
> {{…}}
>
> slowly filling up the disk.
> *Expected behaviour*: Tika cleans up disk space after itself.
>
> These bugs are critical for us. What's the best way to avoid these issues?

--
This message was sent by Atlassian Jira (v8.3.4#803005)