[
https://issues.apache.org/jira/browse/HIVE-25800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sourabh Goyal updated HIVE-25800:
---------------------------------
Description:
HIVE-20661 added an improvement in loadDynamicPartitions() api in Hive.java to
not add partitions one by one in HMS. As part of that improvement, following
code was introduced:
{code:java}
// fetch all the partitions matching the part spec using the partition iterable
// this way the maximum batch size configuration parameter is considered
PartitionIterable partitionIterable = new PartitionIterable(Hive.get(), tbl,
partSpec,
conf.getInt(MetastoreConf.ConfVars.BATCH_RETRIEVE_MAX.getVarname(),
300));
Iterator<Partition> iterator = partitionIterable.iterator();
// Match valid partition path to partitions
while (iterator.hasNext()) {
Partition partition = iterator.next();
partitionDetailsMap.entrySet().stream()
.filter(entry ->
entry.getValue().fullSpec.equals(partition.getSpec()))
.findAny().ifPresent(entry -> {
entry.getValue().partition = partition;
entry.getValue().hasOldPartition = true;
});
} {code}
The above code fetches all the existing partitions for a table from HMS and
compare that dynamic partitions list to decide old and new partitions to be
added to HMS (in batches). The call to fetch all partitions has introduced a
performance regression for tables with large number of partitions (of the order
of 100K).
This is fixed for external tables in
https://issues.apache.org/jira/browse/HIVE-25178. However for ACID tables
there is an open Jirahttps://issues.apache.org/jira/browse/HIVE-25187. Until we
have an appropriate fix in HIVE-25187, we can apply the following:
Skip fetching all partitions. Instead, in the threadPool which loads each
partition individually, call get_partition() to check if the partition already
exists in HMS or not.
This will introduce additional getPartition() call for every partition to be
loaded dynamically but removes fetching all existing partitions for a table.
I believe this is fine since for tables with small number of existing
partitions in HMS - getPartitions() won't add too much overhead but for tables
with large number of existing partitions, it will certainly avoid getting all
partitions from HMS
cc - [~lpinter] [~ngangam]
was:
HIVE-20661 added an improvement in loadDynamicPartitions() api in Hive.java to
not add partitions one by one in HMS. As part of that improvement, following
code was introduced:
{code:java}
// fetch all the partitions matching the part spec using the partition iterable
// this way the maximum batch size configuration parameter is considered
PartitionIterable partitionIterable = new PartitionIterable(Hive.get(), tbl,
partSpec,
conf.getInt(MetastoreConf.ConfVars.BATCH_RETRIEVE_MAX.getVarname(),
300));
Iterator<Partition> iterator = partitionIterable.iterator();
// Match valid partition path to partitions
while (iterator.hasNext()) {
Partition partition = iterator.next();
partitionDetailsMap.entrySet().stream()
.filter(entry ->
entry.getValue().fullSpec.equals(partition.getSpec()))
.findAny().ifPresent(entry -> {
entry.getValue().partition = partition;
entry.getValue().hasOldPartition = true;
});
} {code}
The above code fetches all the existing partitions for a table from HMS and
compare that dynamic partitions list to decide old and new partitions to be
added to HMS (in batches). The call to fetch all partitions has introduced a
performance regression for tables with large number of partitions (of the order
of 100K).
The fix is to skip fetching all partitions. Instead, in the threadPool which
loads each partition individually, call get_partition() to check if the
partition already exists in HMS or not.
This will introduce additional getPartition() call for every partition to be
loaded dynamically but removes fetching all existing partitions for a table.
I believe this is fine since for tables with small number of existing
partitions in HMS - getPartitions() won't add too much overhead but for tables
with large number of existing partitions, it will certainly avoid getting all
partitions from HMS
cc - [~lpinter] [~ngangam]
> loadDynamicPartitions in Hive.java should not load all partitions of a
> managed table
> -------------------------------------------------------------------------------------
>
> Key: HIVE-25800
> URL: https://issues.apache.org/jira/browse/HIVE-25800
> Project: Hive
> Issue Type: Improvement
> Components: Hive
> Reporter: Sourabh Goyal
> Assignee: Sourabh Goyal
> Priority: Major
>
> HIVE-20661 added an improvement in loadDynamicPartitions() api in Hive.java
> to not add partitions one by one in HMS. As part of that improvement,
> following code was introduced:
> {code:java}
> // fetch all the partitions matching the part spec using the partition
> iterable
> // this way the maximum batch size configuration parameter is considered
> PartitionIterable partitionIterable = new PartitionIterable(Hive.get(), tbl,
> partSpec,
> conf.getInt(MetastoreConf.ConfVars.BATCH_RETRIEVE_MAX.getVarname(),
> 300));
> Iterator<Partition> iterator = partitionIterable.iterator();
> // Match valid partition path to partitions
> while (iterator.hasNext()) {
> Partition partition = iterator.next();
> partitionDetailsMap.entrySet().stream()
> .filter(entry ->
> entry.getValue().fullSpec.equals(partition.getSpec()))
> .findAny().ifPresent(entry -> {
> entry.getValue().partition = partition;
> entry.getValue().hasOldPartition = true;
> });
> } {code}
> The above code fetches all the existing partitions for a table from HMS and
> compare that dynamic partitions list to decide old and new partitions to be
> added to HMS (in batches). The call to fetch all partitions has introduced a
> performance regression for tables with large number of partitions (of the
> order of 100K).
>
> This is fixed for external tables in
> https://issues.apache.org/jira/browse/HIVE-25178. However for ACID tables
> there is an open Jirahttps://issues.apache.org/jira/browse/HIVE-25187. Until
> we have an appropriate fix in HIVE-25187, we can apply the following:
> Skip fetching all partitions. Instead, in the threadPool which loads each
> partition individually, call get_partition() to check if the partition
> already exists in HMS or not.
> This will introduce additional getPartition() call for every partition to be
> loaded dynamically but removes fetching all existing partitions for a table.
> I believe this is fine since for tables with small number of existing
> partitions in HMS - getPartitions() won't add too much overhead but for
> tables with large number of existing partitions, it will certainly avoid
> getting all partitions from HMS
> cc - [~lpinter] [~ngangam]
--
This message was sent by Atlassian Jira
(v8.20.1#820001)