Joe McDonnell created IMPALA-8630:
-------------------------------------
Summary: Consistent remote placement should include partition id
when calculating placement
Key: IMPALA-8630
URL: https://issues.apache.org/jira/browse/IMPALA-8630
Project: IMPALA
Issue Type: Bug
Components: Backend
Affects Versions: Impala 3.2.0
Reporter: Joe McDonnell
For partitioned tables, the filenames within partitions may have low entropy.
Files written by Impala include information in their names that differs across
partitions, but low-entropy filenames are common for tables written by the
current CDH version of Hive. For example, in our minicluster, the TPC-DS
store_sales table has many partitions, but the filenames within each partition
are very simple:
{noformat}
hdfs dfs -ls /test-warehouse/tpcds.store_sales/ss_sold_date_sk=2452642
Found 1 items
-rwxr-xr-x 3 joe supergroup 379535 2019-06-05 15:16 /test-warehouse/tpcds.store_sales/ss_sold_date_sk=2452642/000000_0
hdfs dfs -ls /test-warehouse/tpcds.store_sales/ss_sold_date_sk=2452640
Found 1 items
-rwxr-xr-x 3 joe supergroup 412959 2019-06-05 15:16 /test-warehouse/tpcds.store_sales/ss_sold_date_sk=2452640/000000_0
{noformat}
Right now, consistent remote placement uses the filename+offset without the
partition id.
{code:java}
uint32_t hash = HashUtil::Hash(hdfs_file_split->relative_path.data(),
hdfs_file_split->relative_path.length(), 0);
{code}
This produces a poor balance of files across nodes when filenames have low
entropy, because identically named files in different partitions hash to the
same value. The hash should be amended to include the partition id, which is
already accessible on the THdfsFileSplit.
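A minimal sketch of the idea, not the actual Impala change: the FileSplit struct, the Fnv1a function (a stand-in for HashUtil::Hash), and both helper names are hypothetical and exist only for illustration. It assumes relative_path holds just the filename within the partition directory, and it leaves the offset out for brevity. Chaining the first hash as the seed of a second hash over the partition id is one possible way to mix the two values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Stand-in for Impala's HashUtil::Hash (assumption: 32-bit FNV-1a with the
// seed mixed into the initial state; the real hash function differs).
uint32_t Fnv1a(const void* data, size_t len, uint32_t seed) {
  const unsigned char* bytes = static_cast<const unsigned char*>(data);
  uint32_t hash = 2166136261u ^ seed;
  for (size_t i = 0; i < len; ++i) {
    hash ^= bytes[i];
    hash *= 16777619u;
  }
  return hash;
}

// Hypothetical, simplified view of the fields of interest on a file split.
struct FileSplit {
  std::string relative_path;  // assumed to be the filename, e.g. "000000_0"
  int64_t partition_id;
};

// Current behavior as described above: only the path feeds the hash.
uint32_t HashPathOnly(const FileSplit& split) {
  return Fnv1a(split.relative_path.data(), split.relative_path.length(), 0);
}

// Proposed behavior: feed the partition id in as well, using the path hash
// as the seed so both values influence the result.
uint32_t HashWithPartition(const FileSplit& split) {
  uint32_t hash =
      Fnv1a(split.relative_path.data(), split.relative_path.length(), 0);
  return Fnv1a(&split.partition_id, sizeof(split.partition_id), hash);
}
```

With two splits that share the filename 000000_0 but carry different (made-up) partition ids, HashPathOnly returns the same value for both, while HashWithPartition returns different values, so the splits no longer all map to the same point in the placement.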