[ 
https://issues.apache.org/jira/browse/IMPALA-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joe McDonnell resolved IMPALA-8630.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: Impala 3.3.0

> Consistent remote placement should include partition information when 
> calculating placement
> -------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-8630
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8630
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 3.2.0
>            Reporter: Joe McDonnell
>            Assignee: Joe McDonnell
>            Priority: Blocker
>             Fix For: Impala 3.3.0
>
>
> For partitioned tables, the filenames within partitions may have very little 
> entropy. Impala includes information in its filenames that differs across 
> partitions, but low-entropy filenames are common for tables written by the 
> current CDH version of Hive. For example, in our minicluster, the TPC-DS 
> store_sales table has many partitions, but the filenames within each 
> partition are very simple:
> {noformat}
> hdfs dfs -ls /test-warehouse/tpcds.store_sales/ss_sold_date_sk=2452642
> Found 1 items
> -rwxr-xr-x 3 joe supergroup 379535 2019-06-05 15:16 /test-warehouse/tpcds.store_sales/ss_sold_date_sk=2452642/000000_0
> hdfs dfs -ls /test-warehouse/tpcds.store_sales/ss_sold_date_sk=2452640
> Found 1 items
> -rwxr-xr-x 3 joe supergroup 412959 2019-06-05 15:16 /test-warehouse/tpcds.store_sales/ss_sold_date_sk=2452640/000000_0
> {noformat}
> Right now, consistent remote placement uses the filename+offset without the 
> partition id.
> {code:java}
> uint32_t hash = HashUtil::Hash(hdfs_file_split->relative_path.data(),
>       hdfs_file_split->relative_path.length(), 0);
> {code}
> This produces a poor balance of files across nodes when filenames have low 
> entropy, because files with the same name in different partitions hash to 
> the same value. The hash should be amended to include the partition id, 
> which is already accessible on the THdfsFileSplit.
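
The effect described above can be illustrated with a minimal, standalone sketch. It is not the Impala patch: SeededHash is a stand-in FNV-1a hash rather than Impala's HashUtil::Hash, and FileSplit, HashPathOnly, and HashWithPartition are hypothetical names that only mirror the relative_path, partition id, and offset mentioned in the issue. The assumption is that the fix chains the partition id (and offset) into the hash by using the previous hash value as the seed.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Stand-in seeded 32-bit FNV-1a hash. This is NOT Impala's HashUtil::Hash;
// it only plays the same role (bytes + seed -> 32-bit hash) for the sketch.
inline uint32_t SeededHash(const void* data, size_t len, uint32_t seed) {
  const unsigned char* bytes = static_cast<const unsigned char*>(data);
  uint32_t h = 2166136261u ^ seed;
  for (size_t i = 0; i < len; ++i) {
    h ^= bytes[i];
    h *= 16777619u;
  }
  return h;
}

// Hypothetical, simplified stand-in for the fields of a file split.
struct FileSplit {
  std::string relative_path;  // e.g. "000000_0"
  int64_t partition_id;
  int64_t offset;
};

// Mirrors the snippet quoted in the issue: only the relative path is hashed,
// so identical filenames in different partitions produce identical hashes.
uint32_t HashPathOnly(const FileSplit& s) {
  return SeededHash(s.relative_path.data(), s.relative_path.length(), 0);
}

// Assumed shape of the fix: feed the partition id and offset into the hash,
// using the previous hash value as the seed for the next step.
uint32_t HashWithPartition(const FileSplit& s) {
  uint32_t h = SeededHash(s.relative_path.data(), s.relative_path.length(), 0);
  h = SeededHash(&s.partition_id, sizeof(s.partition_id), h);
  h = SeededHash(&s.offset, sizeof(s.offset), h);
  return h;
}
```

With two splits that share the filename 000000_0 and offset 0 but have different partition ids, HashPathOnly returns the same value for both, while HashWithPartition returns different values, which lets a consistent placement scheme spread them across nodes.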



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
