[
https://issues.apache.org/jira/browse/FLUME-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13888860#comment-13888860
]
Muhammad Ehsan ul Haque commented on FLUME-2309:
------------------------------------------------
No, it doesn't maintain a sorted list. Before consuming a file, it lists the
candidate files and, if the list is not empty, sorts it.
{code}
List<File> candidateFiles = Arrays.asList(spoolDirectory.listFiles(filter));
if (candidateFiles.isEmpty()) {
  return Optional.absent();
} else {
  Collections.sort(candidateFiles, new Comparator<File>() {
    ....
{code}
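For illustration, here is a self-contained sketch of that list-then-sort step. The comparator body is elided in the excerpt above, so ordering by last-modified time is an assumption taken from the issue description ("sorting file list based on timestamp"). The class name {{OldestFirstSketch}} and the method {{nextOldest}} are hypothetical, and {{java.util.Optional}} stands in for the Guava {{Optional}} used in the excerpt. This is not the actual Flume code.

```java
import java.io.File;
import java.io.FileFilter;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch, not the Flume implementation.
final class OldestFirstSketch {

  /** Returns the matching file with the smallest last-modified time, if any. */
  static Optional<File> nextOldest(File spoolDirectory, FileFilter filter) {
    // listFiles returns null if the path is not a directory or cannot be read.
    File[] listed = spoolDirectory.listFiles(filter);
    if (listed == null || listed.length == 0) {
      return Optional.empty();
    }
    List<File> candidateFiles = Arrays.asList(listed);
    // Assumed ordering: oldest modification time first.
    candidateFiles.sort(Comparator.comparingLong(File::lastModified));
    return Optional.of(candidateFiles.get(0));
  }
}
```

Every call lists and sorts the whole directory, which is the cost the issue is about.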
We can add a boolean parameter, let's say *??consumeOldestFirst??*.
If its value is *true*:
* use a sorted buffer and consume from it. Before consuming a buffered file we
will need to check that it still exists and has not been deleted. However, if
there are very many files (of the order of millions), this will still be very
resource-consuming, because we need to list all the files before sorting them
and buffering the top N.
Otherwise (*false*):
* use [Java
Files.newDirectoryStream|http://docs.oracle.com/javase/7/docs/api/java/nio/file/Files.html#newDirectoryStream(java.nio.file.Path,
java.nio.file.DirectoryStream.Filter)], which returns entries in no particular
order, but as mentioned by [~hshreedharan] this
may cause older files to be sent very late.
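A rough sketch of how the two branches could look, to make the proposal concrete. Everything here is hypothetical: the class {{SpoolFilePicker}}, its {{bufferSize}} parameter and the method names are invented for illustration, and the sketch assumes the caller deletes or moves each returned file out of the directory before asking for the next one. It also ignores Flume's own file filter (for example, already completed files).

```java
import java.io.File;
import java.io.FileFilter;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Deque;
import java.util.Iterator;
import java.util.Optional;

// Hypothetical sketch of the proposed consumeOldestFirst switch.
final class SpoolFilePicker {
  private final Path spoolDirectory;
  private final boolean consumeOldestFirst;
  private final int bufferSize;                        // the "top N" to buffer
  private final Deque<File> buffer = new ArrayDeque<>();

  SpoolFilePicker(Path spoolDirectory, boolean consumeOldestFirst, int bufferSize) {
    this.spoolDirectory = spoolDirectory;
    this.consumeOldestFirst = consumeOldestFirst;
    this.bufferSize = bufferSize;
  }

  Optional<File> next() throws IOException {
    return consumeOldestFirst ? nextOldest() : nextArbitrary();
  }

  // true branch: serve from a sorted buffer, skipping files that vanished.
  private Optional<File> nextOldest() {
    while (!buffer.isEmpty() && !buffer.peekFirst().exists()) {
      buffer.pollFirst();
    }
    if (buffer.isEmpty()) {
      refill();
    }
    return Optional.ofNullable(buffer.pollFirst());
  }

  // Still lists and sorts the whole directory, but only once per N files.
  private void refill() {
    FileFilter regularFiles = File::isFile;
    File[] listed = spoolDirectory.toFile().listFiles(regularFiles);
    if (listed == null) {
      return;
    }
    Arrays.sort(listed, Comparator.comparingLong(File::lastModified));
    for (int i = 0; i < Math.min(bufferSize, listed.length); i++) {
      buffer.addLast(listed[i]);
    }
  }

  // false branch: take the first entry the directory stream yields, unsorted.
  private Optional<File> nextArbitrary() throws IOException {
    try (DirectoryStream<Path> stream =
        Files.newDirectoryStream(spoolDirectory, p -> Files.isRegularFile(p))) {
      Iterator<Path> it = stream.iterator();
      return it.hasNext() ? Optional.of(it.next().toFile()) : Optional.empty();
    }
  }
}
```

With {{new SpoolFilePicker(dir, true, 100)}} the directory is listed and sorted once per 100 files rather than once per file. With {{false}} nothing is sorted, so an old file can keep being passed over for as long as new files arrive.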
I can produce a patch if someone accepts this proposal or proposes an
alternative.
> Spooling directory should not always consume the oldest file first.
> -------------------------------------------------------------------
>
> Key: FLUME-2309
> URL: https://issues.apache.org/jira/browse/FLUME-2309
> Project: Flume
> Issue Type: New Feature
> Reporter: Muhammad Ehsan ul Haque
> Priority: Minor
>
> The ReliableSpoolingFileEventReader reads the oldest file in the spooling
> directory first. This is done by listing the directory contents and then
> sorting the file list by timestamp. This may be very slow if there are a
> lot of files (of the order of 100K or more) in the directory.
> However, this is not always needed; there are simple cases in which the
> order in which the files are consumed is not important.
> There should be an option to consume the files in arbitrary order, allowing
> the files to be consumed quickly without any delay.