[
https://issues.apache.org/jira/browse/IMPALA-9805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Quanlong Huang updated IMPALA-9805:
-----------------------------------
Summary: Unnecessary reloading partitions with inconsistent name strings
between Impala and Hive (was: Alter table recover partitions will repeatedly
drop and load partitions with inconsistent partition name strings between
Impala and Hive)
> Unnecessary reloading partitions with inconsistent name strings between
> Impala and Hive
> ---------------------------------------------------------------------------------------
>
> Key: IMPALA-9805
> URL: https://issues.apache.org/jira/browse/IMPALA-9805
> Project: IMPALA
> Issue Type: Bug
> Components: Catalog
> Reporter: Quanlong Huang
> Assignee: Quanlong Huang
> Priority: Critical
>
> For ALTER TABLE RECOVER PARTITIONS, Impala compares the Hive partition names
> with the names of cached partitions to drop partitions that no longer exist
> and add new ones. However, comparing name strings is not sufficient, because
> some partitions have inconsistent names between Hive and Impala. This usually
> happens when the partition directory is created by a non-Hive application.
> Let's say external table my_part_tbl is partitioned by (year int, month
> *int*, day *int*). A user creates the HDFS dir
> year=2020/month=*01*/day=*01*, uploads data to it,
> and then triggers an ALTER TABLE RECOVER PARTITIONS command in Impala. Impala
> will create partition (year=2020/month=*1*/day=*1*)
> in Hive using the location
> ".../year=2020/month=*01*/day=*01*".
> The next time ALTER TABLE RECOVER PARTITIONS runs (e.g. after more
> partition dirs are created), the partition name list obtained from Hive is
> [year=2020/month=01/day=01]. However, the name list of cached partitions is
> [year=2020/month=1/day=1]. Impala will drop this partition and load it as a
> new partition.
> This impacts the performance of ALTER TABLE RECOVER PARTITIONS on
> partitioned tables whose partition directories are all named this way. Many
> partitions will be reloaded again and again.
> *Reproduction*
> {code:java}
> impala> create external table my_part_tbl (id int) partitioned by (year int,
> month int, day int);
> impala> describe formatted my_part_tbl;{code}
> The table location is found to be
> hdfs://localhost:20500/test-warehouse/my_part_tbl.
> Create partition dirs and upload data to them using the HDFS CLI:
> {code:java}
> $ cat data.txt
> 1
> 2
> 3
> $ hdfs dfs -mkdir -p hdfs://localhost:20500/test-warehouse/my_part_tbl/year=2020/month=01/day=01
> $ hdfs dfs -mkdir -p hdfs://localhost:20500/test-warehouse/my_part_tbl/year=2020/month=01/day=02
> $ hdfs dfs -mkdir -p hdfs://localhost:20500/test-warehouse/my_part_tbl/year=2020/month=09/day=01
> $ hdfs dfs -mkdir -p hdfs://localhost:20500/test-warehouse/my_part_tbl/year=2020/month=09/day=02
> $ hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/my_part_tbl/year=2020/month=01/day=01
> $ hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/my_part_tbl/year=2020/month=01/day=02
> $ hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/my_part_tbl/year=2020/month=09/day=01
> $ hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/my_part_tbl/year=2020/month=09/day=02{code}
> Let Impala detect these partitions.
> {code:java}
> impala> alter table my_part_tbl recover partitions;
> {code}
> After that, every run of ALTER TABLE RECOVER PARTITIONS reloads these 4
> partitions again. The catalogd logs reflect this:
> {code:java}
> I0531 11:51:45.037181 27878 HdfsTable.java:1001] Reloading metadata for all
> partition(s) of default.my_part_tbl (ALTER TABLE RECOVER_PARTITIONS)
> I0531 11:51:45.037286 27878 HdfsTable.java:2095] Load Valid Write Id List
> Done. Time taken: 4.456us
> I0531 11:51:45.049304 27878 ParallelFileMetadataLoader.java:144] Loading file
> and block metadata for 4 paths for table default.my_part_tbl using a thread
> pool of size 4
> I0531 11:51:45.053562 27878 HdfsTable.java:697] Loaded file and block
> metadata for default.my_part_tbl partitions: year=2020/month=1/day=1,
> year=2020/month=1/day=2, year=2020/month=9/day=1, and 1 others. Time taken:
> 4.537ms
> I0531 11:51:45.053689 27878 HdfsTable.java:1032] Incrementally loaded table
> metadata for: default.my_part_tbl
> I0531 11:51:45.279531 26899 catalog-server.cc:735] Collected update:
> 1:TABLE:default.my_part_tbl, version=1512, original size=3989, compressed
> size=1294
> I0531 11:51:45.281633 26899 catalog-server.cc:735] Collected update:
> 1:CATALOG_SERVICE_ID, version=1512, original size=60, compressed size=58
> I0531 11:51:47.277819 26906 catalog-server.cc:340] A catalog update with 2
> entries is assembled. Catalog version: 1512 Last sent catalog version: 1511
> {code}
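
The name mismatch described in the issue can be illustrated with a small standalone sketch. This is not Impala's catalog code (which is Java) and not the actual fix; it is a Python illustration in which all function names (parse_partition_name, normalize, diff_by_string, diff_by_value) are hypothetical. It assumes partition names have the form key=value/key=value and that the set of integer-typed partition columns is known. It contrasts a diff on raw name strings with a diff on names whose integer values have been canonicalised first.

```python
def parse_partition_name(name):
    """Split 'year=2020/month=01/day=01' into [('year', '2020'), ...]."""
    return [tuple(part.split("=", 1)) for part in name.split("/")]


def normalize(name, int_cols):
    """Canonicalise values of integer partition columns, e.g. '01' -> '1'."""
    parts = []
    for key, value in parse_partition_name(name):
        if key in int_cols:
            value = str(int(value))  # drops leading zeros
        parts.append(f"{key}={value}")
    return "/".join(parts)


def diff_by_string(listed_names, cached_names):
    """Return (to_drop, to_add) by comparing raw name strings."""
    listed, cached = set(listed_names), set(cached_names)
    return sorted(cached - listed), sorted(listed - cached)


def diff_by_value(listed_names, cached_names, int_cols):
    """Return (to_drop, to_add) after normalising integer-typed values."""
    listed = {normalize(n, int_cols) for n in listed_names}
    cached = {normalize(n, int_cols) for n in cached_names}
    return sorted(cached - listed), sorted(listed - cached)


# Hypothetical inputs mirroring the example in the issue description.
int_cols = {"year", "month", "day"}
listed = ["year=2020/month=01/day=01"]   # name with zero-padded values
cached = ["year=2020/month=1/day=1"]     # name as cached by Impala

# The string comparison sees one partition to drop and one to add.
print(diff_by_string(listed, cached))
# The value-based comparison sees no difference, so nothing is reloaded.
print(diff_by_value(listed, cached, int_cols))
```

With string comparison the same partition appears in both the drop list and the add list, which matches the repeated reloads visible in the catalogd logs above. Comparing normalised values leaves both lists empty.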