[ 
https://issues.apache.org/jira/browse/FLUME-2918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15308404#comment-15308404
 ] 

Attila Simon commented on FLUME-2918:
-------------------------------------

After checking the control flow, it turned out that the function 
ReliableTaildirEventReader.getMatchFiles, which is responsible for checking 
whether files have been added to or removed from the parent directory of the 
file pattern, is called every time the PollableSourceRunner$PollingRunner 
instructs the TaildirSource to harvest new data, even when nothing has changed 
in that directory. This check lists all of the files and filters them with a 
pattern match and an isDirectory check inside a single if statement, with the 
directory check evaluated first. Profiling showed that isDirectory is a much 
more expensive call than the pattern match on the filename, so swapping the 
order of the two expressions would speed up the evaluation (Java 
short-circuits boolean expressions) and therefore the directory listing. In 
addition, caching the last modification time of the parent directory and the 
list of matched files for each file pattern would prevent unnecessary rechecks.
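
The two ideas above can be illustrated with a minimal sketch. This is not the 
actual Flume code or the patch for this issue: the class CachedDirMatcher, its 
fields, and its method names are hypothetical, and only standard java.io and 
java.util.regex calls are used. It evaluates the cheap filename match before 
isDirectory, and it skips the listing entirely while the parent directory's 
modification time is unchanged.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of Flume. */
class CachedDirMatcher {
    private final File parentDir;
    private final Pattern fileNamePattern;

    // Cached state from the previous listing of parentDir.
    private long lastSeenDirMTime = -1L;
    private List<File> lastMatchedFiles = Collections.emptyList();

    CachedDirMatcher(File parentDir, String fileNameRegex) {
        this.parentDir = parentDir;
        this.fileNamePattern = Pattern.compile(fileNameRegex);
    }

    List<File> getMatchingFiles() {
        long currentMTime = parentDir.lastModified();
        // Adding or removing an entry normally updates the directory's mtime,
        // so an unchanged mtime lets us reuse the previous result.
        // Assumption: the filesystem's timestamp granularity is fine enough;
        // a change made within the same tick as the last listing could be
        // missed, so a real implementation would need to guard against that.
        if (currentMTime == lastSeenDirMTime) {
            return lastMatchedFiles;
        }

        List<File> matched = new ArrayList<>();
        File[] children = parentDir.listFiles();
        if (children != null) {
            for (File child : children) {
                // Cheap name check first: && short-circuits, so isDirectory()
                // only runs for the few entries whose name matches.
                if (fileNamePattern.matcher(child.getName()).matches()
                        && !child.isDirectory()) {
                    matched.add(child);
                }
            }
        }

        lastSeenDirMTime = currentMTime;
        lastMatchedFiles = matched;
        return matched;
    }
}
```

With a pattern that matches a single file in a directory of about 50,000 
entries, the reordered condition calls isDirectory once per listing instead of 
about 50,000 times, and repeated polls with no directory change do no listing 
at all.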

> TaildirSource is underperforming with huge parent directories
> -------------------------------------------------------------
>
>                 Key: FLUME-2918
>                 URL: https://issues.apache.org/jira/browse/FLUME-2918
>             Project: Flume
>          Issue Type: Improvement
>          Components: Sinks+Sources
>            Reporter: Attila Simon
>              Labels: performance
>             Fix For: v1.7.0
>
>
> TailDir source causes high CPU utilization when a large number of files are 
> sitting in the target directory. The file pattern matches only a single file, 
> but the parent directory contains about 50,000 other files. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
