[
https://issues.apache.org/jira/browse/FLUME-2918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15309339#comment-15309339
]
Attila Simon commented on FLUME-2918:
-------------------------------------
Comparing the different ways the same functionality could be implemented showed
that using java.nio.file.DirectoryStream to list the files gives the best overall
performance (only the very first invocation, which carries JIT overhead, performs
slightly worse than the proper FileFilter). Please see the attachments.
- PerfHugeDir.java generated the execution times
- test.csv captures the result of executing PerfHugeDir.main()
- perftest.png charts the csv data (execution time in milliseconds,
comparing the different implementations)
I started with a directory of 59k files in which only a single file matched the
pattern; there were also a couple of subdirs. After ~230 rounds I started
removing the non-matching files in bulk, reducing the parent dir to ~20 files
altogether; this reduction is responsible for the fade-out. (I also ran the same
test starting with an empty dir and adding 300 files/sec up to 59k; DirectoryStream
won that one as well. No attachment for this.)
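
For illustration, the two listing approaches compared above might look roughly
like the sketch below. This is not the attached PerfHugeDir.java: the class name
ListingSketch, the method names and the regex-based file name matching are
assumptions made for this example, and it assumes Java 8 or later for the
lambdas.

```java
import java.io.File;
import java.io.FileFilter;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class; all names are illustrative only.
class ListingSketch {

    // Approach 1: java.io.File.listFiles with a FileFilter.
    static List<File> listWithFileFilter(File parentDir, Pattern fileNamePattern) {
        // Check the name first so non-matching entries skip the isFile() call.
        FileFilter filter = candidate ->
            fileNamePattern.matcher(candidate.getName()).matches() && candidate.isFile();
        File[] matched = parentDir.listFiles(filter);
        List<File> result = new ArrayList<>();
        if (matched != null) { // listFiles returns null if parentDir cannot be listed
            for (File f : matched) {
                result.add(f);
            }
        }
        return result;
    }

    // Approach 2: java.nio.file.DirectoryStream with a DirectoryStream.Filter.
    static List<File> listWithDirectoryStream(File parentDir, Pattern fileNamePattern)
            throws IOException {
        DirectoryStream.Filter<Path> filter = entry ->
            fileNamePattern.matcher(entry.getFileName().toString()).matches()
                && Files.isRegularFile(entry);
        List<File> result = new ArrayList<>();
        // The stream holds an open directory handle, so close it via try-with-resources.
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(parentDir.toPath(), filter)) {
            for (Path entry : stream) {
                result.add(entry.toFile());
            }
        }
        return result;
    }

    // Times one call of each approach: java ListingSketch <dir> <file-name-regex>
    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: ListingSketch <dir> <file-name-regex>");
            return;
        }
        File dir = new File(args[0]);
        Pattern pattern = Pattern.compile(args[1]);

        long t0 = System.nanoTime();
        int viaFilter = listWithFileFilter(dir, pattern).size();
        long t1 = System.nanoTime();
        int viaStream = listWithDirectoryStream(dir, pattern).size();
        long t2 = System.nanoTime();

        System.out.printf("FileFilter:      %d matches, %d ms%n", viaFilter, (t1 - t0) / 1_000_000);
        System.out.printf("DirectoryStream: %d matches, %d ms%n", viaStream, (t2 - t1) / 1_000_000);
    }
}
```

A single timed call like the one in main() is dominated by JIT warm-up and
file system caching, so it says little on its own; the numbers in test.csv
come from repeated rounds.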
> TaildirSource is underperforming with huge parent directories
> -------------------------------------------------------------
>
> Key: FLUME-2918
> URL: https://issues.apache.org/jira/browse/FLUME-2918
> Project: Flume
> Issue Type: Improvement
> Components: Sinks+Sources
> Reporter: Attila Simon
> Labels: performance
> Fix For: v1.7.0
>
> Attachments: profiling_after.png, profiling_before.png
>
>
> The TailDir source causes high CPU utilization when a large number of files are
> sitting in the target directory. The file pattern matches only a single file, but
> the parent directory contains about 50,000 other files.