[
https://issues.apache.org/jira/browse/FLINK-19221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stephan Ewen closed FLINK-19221.
--------------------------------
> Exploit LocatableFileStatus from Hadoop
> ---------------------------------------
>
> Key: FLINK-19221
> URL: https://issues.apache.org/jira/browse/FLINK-19221
> Project: Flink
> Issue Type: Improvement
> Components: Connectors / Hadoop Compatibility
> Affects Versions: 1.11.1
> Reporter: Stephan Ewen
> Assignee: Stephan Ewen
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.12.0
>
>
> When the HDFS Client returns a {{FileStatus}} (description of a file) it
> sometimes returns a {{LocatedFileStatus}} which already contains all the
> {{BlockLocation}} information.
> We should expose this on the Flink side, because it may save us a lot of RPC
> calls to the NameNode. The file enumerators often request block locations
> for all files, currently doing an RPC call for each file.
> When the FileStatus obtained from listing the directory (or getting details
> for a file) already has all the block locations, we can save the extra RPC
> call per file.
> The suggested implementation is as follows:
> 1. We introduce a {{LocatedFileStatus}} in Flink that we integrate with the
> built-in LocalFileSystem
> 2. We integrate this with the HadoopFileSystems by creating a Flink
> {{LocatedFileStatus}} whenever the underlying file system created a Hadoop
> {{LocatedFileStatus}}
> 3. As a safety net, the FS methods to access block information check
> whether the presented file status already contains the block information and
> return that information directly.
> Steps one and two are for simplification of FileSystem users (no need to ask
> for extra info if it is available).
> Step three is the transparent shortcut that all applications get even if they
> do not explicitly use the {{LocatedFileStatus}} and keep calling
> {{FileSystem.getBlockLocations()}}.
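
The safety net in step 3 could look roughly like the following minimal sketch. Every type here (BlockLocation, FileStatus, LocatedFileStatus, PlainStatus, LocatedStatus, SketchFileSystem) is a hypothetical local stand-in written for illustration, not the actual Flink or Hadoop class, and the NameNode RPC is only simulated by a counter. The idea it shows: if the presented file status already carries block locations, answer from those and skip the per-file RPC.

```java
import java.util.ArrayList;
import java.util.List;

public class LocatedStatusSketch {

    /** Hypothetical stand-in: a byte range of a file plus the hosts storing it. */
    public static final class BlockLocation {
        public final long offset;
        public final long length;
        public final String[] hosts;

        public BlockLocation(long offset, long length, String... hosts) {
            this.offset = offset;
            this.length = length;
            this.hosts = hosts;
        }
    }

    /** Hypothetical stand-in for a plain file status. */
    public interface FileStatus {
        String getPath();
        long getLen();
    }

    /** Hypothetical stand-in: a file status that already carries its block locations. */
    public interface LocatedFileStatus extends FileStatus {
        BlockLocation[] getBlockLocations();
    }

    public static class PlainStatus implements FileStatus {
        private final String path;
        private final long len;

        public PlainStatus(String path, long len) {
            this.path = path;
            this.len = len;
        }

        @Override public String getPath() { return path; }
        @Override public long getLen() { return len; }
    }

    public static final class LocatedStatus extends PlainStatus implements LocatedFileStatus {
        private final BlockLocation[] blocks;

        public LocatedStatus(String path, long len, BlockLocation... blocks) {
            super(path, len);
            this.blocks = blocks;
        }

        @Override public BlockLocation[] getBlockLocations() { return blocks; }
    }

    /** Hypothetical file system; rpcCalls counts the simulated NameNode round trips. */
    public static final class SketchFileSystem {
        public int rpcCalls = 0;

        public BlockLocation[] getBlockLocations(FileStatus file, long start, long len) {
            if (file instanceof LocatedFileStatus) {
                // Shortcut: the status already has the locations, so no RPC is needed.
                // Keep only the blocks that overlap the range [start, start + len).
                List<BlockLocation> hits = new ArrayList<>();
                for (BlockLocation b : ((LocatedFileStatus) file).getBlockLocations()) {
                    if (b.offset < start + len && b.offset + b.length > start) {
                        hits.add(b);
                    }
                }
                return hits.toArray(new BlockLocation[0]);
            }
            // Fallback: simulate the per-file RPC with one block covering the whole file.
            rpcCalls++;
            return new BlockLocation[] {new BlockLocation(0, file.getLen(), "simulated-host")};
        }
    }

    public static void main(String[] args) {
        SketchFileSystem fs = new SketchFileSystem();
        FileStatus plain = new PlainStatus("/data/a", 256);
        FileStatus located = new LocatedStatus("/data/b", 256,
                new BlockLocation(0, 128, "host-1"),
                new BlockLocation(128, 128, "host-2"));

        fs.getBlockLocations(plain, 0, 256);                          // takes the simulated RPC path
        BlockLocation[] blocks = fs.getBlockLocations(located, 0, 100); // answered from the status
        System.out.println("blocks in range: " + blocks.length
                + ", simulated RPC calls: " + fs.rpcCalls);
    }
}
```

Callers that keep passing plain statuses behave as before, while callers that happen to hold a located status get the shortcut without changing their code, which is the "transparent" property step 3 asks for.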
--
This message was sent by Atlassian Jira
(v8.3.4#803005)