[ https://issues.apache.org/jira/browse/TIKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196765#comment-14196765 ]

Tim Allison edited comment on TIKA-1464 at 11/4/14 8:35 PM:
------------------------------------------------------------

On Windows 7 with Tika 1.7-SNAPSHOT, on a batch of 3k .msg files that have many 
attachments, the most I can get with a 4-thread process is 12 descriptors open 
at a time according to the leak detector (three per worker thread in the dump 
below: the .msg file itself, plus a FileInputStream and a RandomAccessFile on 
the same Tika temp file).

The Windows task manager shows no more than 300 files open at a time for the 
full process.

{noformat}
12 descriptors are open
#1 ...file1.msg by thread:pool-1-thread-7 on Tue Nov 04 15:21:03 EST 2014
        at java.io.FileInputStream.<init>(FileInputStream.java:147)
        at org.mitre.tallison.complexfiles.tika.TikaDefaultTextExtractor.extract(Unknown Source)
...
#2 ...\Local\Temp\apache-tika-7913996149639657350.tmp by thread:pool-1-thread-7 on Tue Nov 04 15:21:03 EST 2014
        at java.io.FileInputStream.<init>(FileInputStream.java:147)
        at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:542)
        at org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:378)
        at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:168)
        at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
...
#3 ...Local\Temp\apache-tika-7913996149639657350.tmp by thread:pool-1-thread-7 on Tue Nov 04 15:21:03 EST 2014
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:242)
        at org.apache.poi.poifs.nio.FileBackedDataSource.newSrcFile(FileBackedDataSource.java:130)
        at org.apache.poi.poifs.nio.FileBackedDataSource.<init>(FileBackedDataSource.java:46)
        at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:192)
        at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:163)
        at org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:381)
        at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:168)
        at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
...
#4 ...file2.msg by thread:pool-1-thread-4 on Tue Nov 04 15:21:03 EST 2014
        at java.io.FileInputStream.<init>(FileInputStream.java:147)
        at org.mitre.tallison.complexfiles.tika.TikaDefaultTextExtractor.extract(Unknown Source)
....

#5 ...Local\Temp\apache-tika-4841525150295414987.tmp by thread:pool-1-thread-4 on Tue Nov 04 15:21:03 EST 2014
        at java.io.FileInputStream.<init>(FileInputStream.java:147)
        at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:542)
        at org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:378)
        at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:168)
        at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
        at org.mitre.tallison.complexfiles.tika.RecursiveMetadataParser.parse(Unknown Source)

....
#6 ...Local\Temp\apache-tika-4841525150295414987.tmp by thread:pool-1-thread-4 on Tue Nov 04 15:21:03 EST 2014
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:242)
        at org.apache.poi.poifs.nio.FileBackedDataSource.newSrcFile(FileBackedDataSource.java:130)
        at org.apache.poi.poifs.nio.FileBackedDataSource.<init>(FileBackedDataSource.java:46)
        at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:192)
        at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:163)
        at org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:381)
        at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:168)
        at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)

....
#7 ...file3.msg by thread:pool-1-thread-6 on Tue Nov 04 15:21:03 EST 2014
        at java.io.FileInputStream.<init>(FileInputStream.java:147)
        at org.mitre.tallison.complexfiles.tika.TikaDefaultTextExtractor.extract(Unknown Source)

#8 ...Local\Temp\apache-tika-877604211166291703.tmp by thread:pool-1-thread-6 on Tue Nov 04 15:21:03 EST 2014
        at java.io.FileInputStream.<init>(FileInputStream.java:147)
        at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:542)
        at org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:378)
        at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:168)
        at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
        at org.mitre.tallison.complexfiles.tika.RecursiveMetadataParser.parse(Unknown Source)
        at org.mitre.tallison.complexfiles.tika.TikaDefaultTextExtractor.extract(Unknown Source)

....

#9 ...Local\Temp\apache-tika-877604211166291703.tmp by thread:pool-1-thread-6 on Tue Nov 04 15:21:03 EST 2014
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:242)
        at org.apache.poi.poifs.nio.FileBackedDataSource.newSrcFile(FileBackedDataSource.java:130)
        at org.apache.poi.poifs.nio.FileBackedDataSource.<init>(FileBackedDataSource.java:46)
        at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:192)
        at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:163)
        at org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:381)
        at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:168)
        at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
        at org.mitre.tallison.complexfiles.tika.RecursiveMetadataParser.parse(Unknown Source)
...

#10 ...file4.msg by thread:pool-1-thread-5 on Tue Nov 04 15:21:03 EST 2014
        at java.io.FileInputStream.<init>(FileInputStream.java:147)
        at org.mitre.tallison.complexfiles.tika.TikaDefaultTextExtractor.extract(Unknown Source)
...

#11 ...Temp\apache-tika-5646769580530000299.tmp by thread:pool-1-thread-5 on Tue Nov 04 15:21:03 EST 2014
        at java.io.FileInputStream.<init>(FileInputStream.java:147)
        at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:542)
        at org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:378)
        at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:168)
        at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
        at org.mitre.tallison.complexfiles.tika.RecursiveMetadataParser.parse(Unknown Source)
...

#12 ...\Temp\apache-tika-5646769580530000299.tmp by thread:pool-1-thread-5 on Tue Nov 04 15:21:03 EST 2014
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:242)
        at org.apache.poi.poifs.nio.FileBackedDataSource.newSrcFile(FileBackedDataSource.java:130)
        at org.apache.poi.poifs.nio.FileBackedDataSource.<init>(FileBackedDataSource.java:46)
        at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:192)
        at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:163)
        at org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:381)
        at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:168)
        at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
        at org.mitre.tallison.complexfiles.tika.RecursiveMetadataParser.parse(Unknown Source)

----


{noformat}
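
For readers unfamiliar with how a report like "12 descriptors are open" can be produced, here is a minimal sketch of the bookkeeping idea. It is not the leak detector used above and not Tika code; the class name TrackedFileInputStream and its methods are hypothetical, and it only tracks streams created through this wrapper.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Date;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical wrapper: remembers who opened each stream, forgets it on close. */
public class TrackedFileInputStream extends FileInputStream {
    // Streams that are currently open, mapped to a description of their origin.
    private static final Map<TrackedFileInputStream, String> OPEN =
            new ConcurrentHashMap<>();

    public TrackedFileInputStream(File file) throws IOException {
        super(file);
        OPEN.put(this, file + " by thread:" + Thread.currentThread().getName()
                + " on " + new Date());
    }

    @Override
    public void close() throws IOException {
        try {
            super.close();
        } finally {
            // Drop the record even if the underlying close fails.
            OPEN.remove(this);
        }
    }

    /** Number of wrapped streams that have been opened but not yet closed. */
    public static int openCount() {
        return OPEN.size();
    }

    /** Builds a summary in the spirit of the dump above (format is an assumption). */
    public static String report() {
        StringBuilder sb = new StringBuilder();
        sb.append(OPEN.size()).append(" descriptors are open\n");
        int i = 1;
        for (String origin : OPEN.values()) {
            sb.append('#').append(i++).append(' ').append(origin).append('\n');
        }
        return sb.toString();
    }
}
```

If openCount() keeps growing across a batch, something is opening streams without closing them; a bounded count that tracks the number of worker threads, as in the dump above, suggests handles are being released.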


was (Author: [email protected]):
On Windows 7, on a batch of 3k msg files that have many of attachments, the 
most I can get with a 4 thread process is 12 descriptors open at a time 
according to the leak detector.

The Windows task manager shows no more than 300 files open at a time for the 
full process.


> Too many open files in system when parsing thousands of files
> -------------------------------------------------------------
>
>                 Key: TIKA-1464
>                 URL: https://issues.apache.org/jira/browse/TIKA-1464
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.6
>         Environment: Os-X 10.10, Windows 8.1 (probably all op systems)
>            Reporter: Tim Barrett
>            Priority: Blocker
>              Labels: TooManyOpenFilesInSystem
>
> Our big data project parses many thousands of different kinds of files 
> sequentially. Up to and including Tika 1.5 this has been trouble free and 
> Tika has been a pleasure to use. The files parsed are PDF, MSOffice and MSG 
> files in roughly equal measure.
> We switched to Tika 1.6 last week and this was a good enhancement for us as a 
> number of files (MSOffice) that previously failed to parse do now parse 
> correctly under Tika 1.6.
> However, we have seen that a "Too many open files in system" exception is 
> raised somewhere above 10000 files having been parsed. On a Windows server 
> this exception is not raised, but the system eventually begins to crawl.
> Watching the system's behaviour with the apache tmp files we see that the 
> apache tika files *are* being deleted from the file system, but lsof is 
> showing all these files as remaining open by the running process using Tika. 
> It would appear that the files are being deleted but handles to these files 
> are not being cleared.
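
The symptom described above (temp files deleted while their handles stay open) is what happens when a file is removed before every stream on it has been closed. The following is a minimal sketch of the usual guard, using only the JDK; it is not Tika's or POI's code, and the class and method names are hypothetical. It opens the same temp file through a FileInputStream and a RandomAccessFile, closes both via try-with-resources, and only then deletes the file.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: close every handle on a temp file before deleting it. */
public class TempFileCleanup {

    /**
     * Writes data to a temp file, reads it through two handles, and cleans up.
     * Returns true if the temp file no longer exists afterwards.
     */
    public static boolean readThenDelete(byte[] data) throws IOException {
        Path tmp = Files.createTempFile("sketch-", ".tmp");
        try {
            Files.write(tmp, data);
            // Both resources are closed automatically, in reverse order,
            // before the finally block below runs.
            try (FileInputStream in = new FileInputStream(tmp.toFile());
                 RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "r")) {
                in.read();
                raf.length();
            }
        } finally {
            // No handle is left open at this point, so the delete releases
            // the file for good instead of leaving an unlinked-but-open file.
            Files.deleteIfExists(tmp);
        }
        return !Files.exists(tmp);
    }
}
```

Ordering is the point of the sketch: if the delete ran while either handle was still open, a Unix-like system would unlink the name but keep the descriptor alive until the process closed it, which matches what lsof reports in the description above.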



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
