[
https://issues.apache.org/jira/browse/FLUME-2918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15308404#comment-15308404
]
Attila Simon commented on FLUME-2918:
-------------------------------------
After checking the control flow, it turned out that
ReliableTaildirEventReader.getMatchFiles, the function responsible for checking
whether files have been added to or removed from the parent directory of the
file pattern, is called every time PollableSourceRunner$PollingRunner instructs
the TaildirSource to harvest new data, even when nothing has changed in that
directory. The check lists all files in the directory and filters them with a
pattern match and an isDirectory check inside a single if statement, with the
directory check evaluated first. Profiling showed that isDirectory is a much
more expensive call than the pattern match on the filename, so swapping the
order of the two expressions would speed up the evaluation (Java short-circuits
boolean expressions) and therefore the directory listing. In addition, caching
the last modification time of the parent directory and the list of matched
files for each file pattern would prevent unnecessary rechecks.
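
To illustrate the two ideas above, here is a minimal sketch and not the actual
Flume patch. The class CachedDirMatcher, its fields, and the listingCount
counter are hypothetical names made up for this example; only the java.io.File
and java.util.regex calls are real APIs. It evaluates the cheap name match
before isDirectory, and it relists the directory only when the directory's
modification time differs from the one seen at the previous listing.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the actual Flume class or patch.
final class CachedDirMatcher {
    private final File parentDir;
    private final Pattern fileNamePattern;
    private long lastSeenDirMTime = -1L;   // -1 means "never listed yet"
    private List<File> lastMatchedFiles = Collections.emptyList();
    private int listingCount = 0;          // counts real listings, for demonstration only

    CachedDirMatcher(File parentDir, String fileNameRegex) {
        this.parentDir = parentDir;
        this.fileNamePattern = Pattern.compile(fileNameRegex);
    }

    List<File> getMatchingFiles() {
        // Read the mtime before listing, so a change that races with the
        // listing shows up as a newer mtime on the next call.
        long currentMTime = parentDir.lastModified();
        if (currentMTime == lastSeenDirMTime) {
            return lastMatchedFiles;       // directory unchanged: reuse cached result
        }
        listingCount++;
        List<File> result = new ArrayList<>();
        File[] children = parentDir.listFiles();
        if (children != null) {
            for (File child : children) {
                // Cheap in-memory name match first; because && short-circuits,
                // isDirectory() runs only for names that match the pattern.
                if (fileNamePattern.matcher(child.getName()).matches()
                        && !child.isDirectory()) {
                    result.add(child);
                }
            }
        }
        lastSeenDirMTime = currentMTime;
        lastMatchedFiles = result;
        return result;
    }

    int getListingCount() {
        return listingCount;
    }
}
```

One caveat with this sketch: it relies on the file system updating the
directory's modification time when entries are added or removed, and on that
timestamp being fine-grained enough. On a file system with coarse timestamps, a
file created within the same tick as the previous listing could be missed
until the next change to the directory.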
> TaildirSource is underperforming with huge parent directories
> -------------------------------------------------------------
>
> Key: FLUME-2918
> URL: https://issues.apache.org/jira/browse/FLUME-2918
> Project: Flume
> Issue Type: Improvement
> Components: Sinks+Sources
> Reporter: Attila Simon
> Labels: performance
> Fix For: v1.7.0
>
>
> TaildirSource causes high CPU utilization when a large number of files are
> sitting in the target directory. The file pattern matches only a single file,
> but the parent directory contains about 50,000 other files.