[
https://issues.apache.org/jira/browse/IMPALA-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
bharath v resolved IMPALA-4847.
-------------------------------
Resolution: Fixed
Fix Version/s: Impala 2.11.0
https://gerrit.cloudera.org/#/c/7652/
> Simplify the code for file/block metadata loading by manually calling
> listLocatedStatus() for each partition.
> -------------------------------------------------------------------------------------------------------------
>
> Key: IMPALA-4847
> URL: https://issues.apache.org/jira/browse/IMPALA-4847
> Project: IMPALA
> Issue Type: Improvement
> Components: Catalog
> Affects Versions: Impala 2.8.0
> Reporter: Alexander Behm
> Assignee: bharath v
> Priority: Critical
> Fix For: Impala 2.11.0
>
>
> The fix for IMPALA-4172/IMPALA-3653 uses Hadoop's Filesystem.listFiles() API
> to recursively list all files under an HDFS table's parent directory. We then
> map each file to its corresponding partition. However, the use of listFiles()
> and the associated code for doing the file-to-partition mapping does not
> really make sense because listFiles() is just a recursive wrapper around
> listLocatedStatus(). So for a table with 10k partitions there will be 10k
> RPCs doing listLocatedStatus().
> We should simplify our code to just loop over all partitions and call
> listLocatedStatus(). This has the following benefits:
> * Simper code. Would have avoided bugs like IMPALA-4789.
> * Faster code. No need to map files to partitions.
> * Easier to parallelize in the future.
> * Easier to decouple table and partition loading in the future.
> Keep in mind that for S3 tables we do want to use the listFiles() API to
> avoid being throttled by S3.
> Relevant links:
> https://github.com/apache/hadoop/blob/branch-2.6.0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L1720
> https://github.com/apache/hadoop/blob/branch-2.6.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DistributedFileSystem.java#L766
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)