Radim Rehurek created TIKA-3103:
-----------------------------------

             Summary: Tesseract fails to respect timeouts and clean up after itself
                 Key: TIKA-3103
                 URL: https://issues.apache.org/jira/browse/TIKA-3103
             Project: Tika
          Issue Type: Bug
          Components: ocr
    Affects Versions: 1.24.1
            Reporter: Radim Rehurek

We're using the Tika Server with OCR:

_java -jar /pii_tools/tika/tika-server-1.24.1.jar -p 9998 -spawnChild -JXmx500m_

This used to work fine in previous versions (1.22, without _-spawnChild_). But after upgrading to 1.24.1 with _-spawnChild_, two undesirable things happen:

# The CPU runs at 100% for >10 minutes, long after any Tika requests should have finished. The processes show in _top_ as Tesseract. They eventually die, but the machine is unusable in the meantime. *Expected behaviour:* Tika cleans up after itself, at the latest once its timeout limit (which is 2 minutes, I believe?) has passed.
# The temp directory is full of files like:
_-rw------- 1 root root 0 May 20 08:47 /tmp/apache-tika-6183308518561170276.tmp_
_-rw-r--r-- 1 root root 140 May 20 08:48 /tmp/apache-tika-6183308518561170276.tmp.txt_
_-rw-r--r-- 1 root root 208416 May 20 08:53 /tmp/apache-tika-6262109250322677208.tmp_
_-rw-r--r-- 1 root root 399550 May 20 08:49 /tmp/apache-tika-6358810719289028940.tmp_
_-rw------- 1 root root 0 May 20 08:55 /tmp/apache-tika-6452032540225217628.tmp_
_-rw-r--r-- 1 root root 368 May 20 09:02 /tmp/apache-tika-6452032540225217628.tmp.txt_
_-rw------- 1 root root 0 May 20 08:46 /tmp/apache-tika-6874839592996549275.tmp_
_-rw-r--r-- 1 root root 3700 May 20 08:48 /tmp/apache-tika-6874839592996549275.tmp.txt_
slowly filling up the disk. *Expected behaviour:* Tika cleans up after itself.

These bugs are critical for us, so we had to revert to 1.22.

What's the best way to avoid these issues?
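
As a stopgap for the leftover temp files (it does nothing about the runaway Tesseract processes), a periodic cleanup along the following lines might help. This is only a sketch under assumptions, not anything Tika provides: it assumes the leftovers always match _/tmp/apache-tika-*.tmp*_ as in the listing above, and that any such file older than 10 minutes is no longer in use. The function name _remove_stale_tika_temp_files_ and the 600-second threshold are hypothetical choices.

```python
import glob
import os
import time


def remove_stale_tika_temp_files(tmp_dir="/tmp", max_age_seconds=600, now=None):
    """Delete apache-tika-* temp files older than max_age_seconds.

    Returns the sorted list of paths that were removed.
    The name pattern and the age threshold are assumptions based on the
    file listing in this report, not documented Tika behaviour.
    """
    now = time.time() if now is None else now
    removed = []
    # Matches both "apache-tika-<id>.tmp" and "apache-tika-<id>.tmp.txt".
    pattern = os.path.join(tmp_dir, "apache-tika-*.tmp*")
    for path in glob.glob(pattern):
        try:
            age = now - os.path.getmtime(path)
            if age > max_age_seconds:
                os.remove(path)
                removed.append(path)
        except OSError:
            # The file vanished in the meantime or cannot be removed; skip it.
            continue
    return sorted(removed)
```

Calling _remove_stale_tika_temp_files()_ from a cron job every few minutes would keep the disk from filling up. The age threshold should stay comfortably above the longest expected OCR run, so that files still in use by a live request are left alone.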