[ https://issues.apache.org/jira/browse/FLINK-10518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Flink Jira Bot updated FLINK-10518:
-----------------------------------
    Labels: Source:FileSystem auto-unassigned stale-major  (was: Source:FileSystem auto-unassigned)

I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help the community manage its development. I see this issue has been marked as Major but is unassigned and neither itself nor its Sub-Tasks have been updated for 30 days. I have gone ahead and added a "stale-major" label to the issue. If this ticket is a Major, please either assign yourself or give an update. Afterwards, please remove the label or in 7 days the issue will be deprioritized.

> Inefficient design in ContinuousFileMonitoringFunction
> ------------------------------------------------------
>
>                 Key: FLINK-10518
>                 URL: https://issues.apache.org/jira/browse/FLINK-10518
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / FileSystem
>    Affects Versions: 1.5.2
>            Reporter: Huyen Levan
>            Priority: Major
>              Labels: Source:FileSystem, auto-unassigned, stale-major
>
> The ContinuousFileMonitoringFunction class keeps track of the latest file
> modification time to rule out all files it has processed in the previous
> cycles. For a long-running job, the list of eligible files will be much
> smaller than the list of all files in the folder being monitored.
> In the current implementation of the getInputSplitsSortedByModTime method, a
> (big) list of all available splits is created first, and then every single
> split is checked against the list of eligible files.
> {quote}
> for (FileInputSplit split: format.createInputSplits(readerParallelism)) {
>     FileStatus fileStatus = eligibleFiles.get(split.getPath());
>     if (fileStatus != null) {
> {quote}
> The improvement can be done as:
> * Listing of all files should be done once in
> _ContinuousFileMonitoringFunction.listEligibleFiles()_ (as of now it is done
> a second time in _FileInputFormat.createInputSplits()_ )
> * The list of file-splits should then be created from the list of paths in
> eligibleFiles.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
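
The improvement proposed in the issue can be illustrated with a small, Flink-free model. This is only a sketch under stated assumptions, not the actual Flink code or a patch for it: SplitPlanningSketch, FileInfo, Split, splitsFor, splitAllThenFilter, splitEligibleOnly and the splitsCreated counter are hypothetical names invented for the illustration, and the fixed-size splitting rule is a simplification. Only listEligibleFiles mirrors a method name mentioned in the issue, and its body here is a guess at the behaviour described (keep files modified after the last processed modification time).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Flink-free model of the split-planning step. Every type and helper here is hypothetical.
final class SplitPlanningSketch {

    // Stand-in for a file status: path, length in bytes, modification time.
    static final class FileInfo {
        final String path;
        final long length;
        final long modTime;

        FileInfo(String path, long length, long modTime) {
            this.path = path;
            this.length = length;
            this.modTime = modTime;
        }
    }

    // Stand-in for a file input split: a byte range inside one file.
    static final class Split {
        final String path;
        final long start;
        final long length;

        Split(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }
    }

    // Counts how many Split objects were built, to make the wasted work visible.
    static int splitsCreated = 0;

    // Keeps only files modified after the last processed modification time.
    static Map<String, FileInfo> listEligibleFiles(List<FileInfo> listing, long lastModTime) {
        Map<String, FileInfo> eligible = new LinkedHashMap<>();
        for (FileInfo file : listing) {
            if (file.modTime > lastModTime) {
                eligible.put(file.path, file);
            }
        }
        return eligible;
    }

    // Cuts one file into byte ranges of at most splitSize bytes (simplified rule).
    static List<Split> splitsFor(FileInfo file, long splitSize) {
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < file.length; start += splitSize) {
            splits.add(new Split(file.path, start, Math.min(splitSize, file.length - start)));
            splitsCreated++;
        }
        return splits;
    }

    // Behaviour described in the issue: build splits for every file, then filter.
    static List<Split> splitAllThenFilter(
            List<FileInfo> listing, Map<String, FileInfo> eligible, long splitSize) {
        List<Split> result = new ArrayList<>();
        for (FileInfo file : listing) {
            for (Split split : splitsFor(file, splitSize)) {
                if (eligible.get(split.path) != null) {
                    result.add(split);
                }
            }
        }
        return result;
    }

    // Proposed behaviour: build splits only for the files already known to be eligible.
    static List<Split> splitEligibleOnly(Map<String, FileInfo> eligible, long splitSize) {
        List<Split> result = new ArrayList<>();
        for (FileInfo file : eligible.values()) {
            result.addAll(splitsFor(file, splitSize));
        }
        return result;
    }

    public static void main(String[] args) {
        List<FileInfo> listing = Arrays.asList(
                new FileInfo("a.log", 250, 100),
                new FileInfo("b.log", 100, 200),
                new FileInfo("c.log", 120, 300));
        Map<String, FileInfo> eligible = listEligibleFiles(listing, 150);

        splitsCreated = 0;
        int kept = splitAllThenFilter(listing, eligible, 100).size();
        System.out.println("split-all-then-filter: kept " + kept + ", created " + splitsCreated);

        splitsCreated = 0;
        kept = splitEligibleOnly(eligible, 100).size();
        System.out.println("split-eligible-only:   kept " + kept + ", created " + splitsCreated);
    }
}
```

Both strategies return the same splits in this model; the difference is that the second one never builds splits for files that were already processed, so its cost follows the number of eligible files rather than the size of the monitored folder. The model ignores the second directory listing that the issue also points out, as well as the sorting by modification time.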