[ 
https://issues.apache.org/jira/browse/TIKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196046#comment-14196046
 ] 

Tim Barrett commented on TIKA-1464:
-----------------------------------

I did suspect that the file types might be the culprit. Another of our 
projects contains only PDF and MSOffice files, so no MSG files. That one runs 
without problems, although it is not as large as the set which eventually 
errors out. So I cannot be 100% sure, but I have a sneaking suspicion that the 
MSG files *are* the culprit. Many of our MSG files contain embedded MSG files 
and/or PDF, MSOffice, image files etc. I am 100% confident we are not leaking 
unclosed input streams: as I have already pointed out, Tika versions before 
1.6 run smoothly without any build-up of open files.

> Too many open files in system when parsing thousands of files
> -------------------------------------------------------------
>
>                 Key: TIKA-1464
>                 URL: https://issues.apache.org/jira/browse/TIKA-1464
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.6
>         Environment: OS X 10.10, Windows 8.1 (probably all operating systems)
>            Reporter: Tim Barrett
>            Priority: Blocker
>              Labels: TooManyOpenFilesInSystem
>
> Our big data project parses many thousands of different kinds of files 
> sequentially. Up to and including Tika 1.5 this has been trouble free and 
> Tika has been a pleasure to use. The files parsed are PDF, MSOffice and MSG 
> files in roughly equal measure.
> We switched to Tika 1.6 last week and this was a good enhancement for us as a 
> number of files (MSOffice) that previously failed to parse do now parse 
> correctly under Tika 1.6.
> However, we have seen that a "Too many open files in system" exception is 
> raised once somewhere above 10000 files have been parsed. On a Windows 
> server this exception is not raised, but the system eventually begins to 
> crawl.
> Watching the system's behaviour with the Apache tmp files, we see that the 
> Apache Tika files *are* being deleted from the file system, but lsof 
> shows all these files as remaining open by the running process using Tika. 
> It would appear that the files are being deleted but the handles to these 
> files are not being released.
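
The behaviour described above (temp files gone from the file system while 
lsof still lists them as open) is what a Unix-like system does when a file is 
deleted while a stream on it is still open: the directory entry disappears, 
but the descriptor stays allocated until the stream is closed. The following 
is a minimal, stdlib-only sketch of that effect. It is not Tika code and not a 
diagnosis of TIKA-1464. The class and method names (OpenHandleProbe, 
openDescriptors, readAfterDelete) are hypothetical, and the descriptor count 
assumes com.sun.management.UnixOperatingSystemMXBean is available 
(HotSpot/OpenJDK on a Unix-like system); elsewhere the helper returns -1.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical demo class; nothing here comes from the Tika code base.
class OpenHandleProbe {

    /** Open descriptor count of this JVM, or -1 when the platform bean is unavailable. */
    static long openDescriptors() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            return ((com.sun.management.UnixOperatingSystemMXBean) os)
                    .getOpenFileDescriptorCount();
        }
        return -1; // assumption: non-Unix JVMs do not expose the count this way
    }

    /** Deletes a temp file while a stream on it is open, then reads through that stream. */
    static int readAfterDelete() throws IOException {
        File tmp = File.createTempFile("probe-", ".tmp");
        try (FileOutputStream out = new FileOutputStream(tmp)) {
            out.write(42);
        }
        try (FileInputStream in = new FileInputStream(tmp)) {
            // On Unix-like systems this normally succeeds although 'in' is open;
            // on Windows it typically fails and the file stays in place.
            boolean deleted = tmp.delete();
            System.out.println("deleted while open: " + deleted);
            System.out.println("descriptors while open: " + openDescriptors());
            // The open stream can still deliver the data.
            return in.read();
        } finally {
            // Runs after 'in' is closed; removes the file if the earlier delete failed.
            tmp.delete();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("descriptors before: " + openDescriptors());
        System.out.println("byte read after delete: " + readAfterDelete());
        System.out.println("descriptors after close: " + openDescriptors());
    }
}
```

Sampling openDescriptors() every few hundred files during a long parsing run 
is one way to see whether the count climbs steadily, which would complement 
the lsof observation without depending on an external tool.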



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
