[ 
https://issues.apache.org/jira/browse/TIKA-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14115299#comment-14115299
 ] 

Nick Burch commented on TIKA-1404:
----------------------------------

Also, since you mention production - you might be better off using the Tika 
Server JaxRS server instead of the Tika App in server mode. As of 1.6 the Tika 
Server should have all the features that the app does, plus additional ones, 
and is more suited to heavy use

> tika-server leaking temporary files when converting Word97 (doc)
> ----------------------------------------------------------------
>
>                 Key: TIKA-1404
>                 URL: https://issues.apache.org/jira/browse/TIKA-1404
>             Project: Tika
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 1.5
>         Environment: Linux (observed on CentOS 6.5 and SuSE SLES 11)
>            Reporter: Lukas Graf
>            Assignee: Nick Burch
>         Attachments: simple_word97.doc
>
>
> When converting Word97 documents (*.doc), tika-server reproducibly leaves 
> behind temporary files.
> Steps to reproduce:
> - Start {{tika-app-1.5.jar}} in {{--server}} mode
> - Send a {{*.doc}} file to server for conversion
> - Stop tika-server using CTRL+C or {{kill -15}}
> For example:
> {code}
> lukas@host:~> java -jar tika-app-1.5.jar -v --server --port 8077 --text
> # ...
> lukas@host:/tmp> ls -lah apache-tika-*
> ls: cannot access apache-tika-*: No such file or directory
> lukas@host:/tmp>
> lukas@host:/tmp> netcat 127.0.0.1 8077 < simple_word97.doc
> Simple Word-97 Document
> Lorem Ipsum.
> lukas@host:/tmp> ls -lah apache-tika-*
> -rw-r--r-- 1 lukas users 22K 2014-08-29 15:48 
> apache-tika-2457738389388821864.tmp
> # after conversion is done, tmp file handles are still open
> lukas@host:/tmp> lsof | grep tika
> java   29857   lukas   32r   REG   104,2  28628386  4571740 
> /home/lukas/tika-app-1.5.jar
> java   29857   lukas   85r   REG   104,2     22528  8604717 
> /tmp/apache-tika-2457738389388821864.tmp
> java   29857   lukas   86r   REG   104,2     22528  8604717 
> /tmp/apache-tika-2457738389388821864.tmp
> # stop tika-server...
> ^C
> lukas@host:~>
> # ...
> lukas@host:/tmp> lsof | grep tika
> lukas@host:/tmp>
> {code}
> No exceptions are thrown, and the plaintext is being extracted correctly from 
> the document, but temporary files are still left behind every single time.
> This obviously is a major issue in a production environment when converting 
> thousands of documents a day. Our temp directories are filling up rapidly, 
> and we had to configure cron jobs to clean up after Tika on most of our 
> production servers. I wasn't able to reproduce this issue using 
> {{tika-app-1.5.jar}} in non-server mode. However, booting up a JVM for every 
> single conversion is just too slow.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to