[ 
https://issues.apache.org/jira/browse/HIVE-25800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Goyal updated HIVE-25800:
---------------------------------
    Description: 
HIVE-20661 added an improvement in loadDynamicPartitions() api in Hive.java to 
not add partitions one by one in HMS. As part of that improvement, following 
code was introduced: 
{code:java}
// fetch all the partitions matching the part spec using the partition iterable
// this way the maximum batch size configuration parameter is considered
PartitionIterable partitionIterable = new PartitionIterable(Hive.get(), tbl, 
partSpec,
          conf.getInt(MetastoreConf.ConfVars.BATCH_RETRIEVE_MAX.getVarname(), 
300));
Iterator<Partition> iterator = partitionIterable.iterator();

// Match valid partition path to partitions
while (iterator.hasNext()) {
  Partition partition = iterator.next();
  partitionDetailsMap.entrySet().stream()
          .filter(entry -> 
entry.getValue().fullSpec.equals(partition.getSpec()))
          .findAny().ifPresent(entry -> {
            entry.getValue().partition = partition;
            entry.getValue().hasOldPartition = true;
          });
} {code}
The above code fetches all the existing partitions for a table from HMS and 
compare that dynamic partitions list to decide old and new partitions to be 
added to HMS (in batches). The call to fetch all partitions has introduced a 
performance regression for tables with large number of partitions (of the order 
of 100K). 

 

This is fixed for external tables in 
https://issues.apache.org/jira/browse/HIVE-25178.  However for ACID tables 
there is an open Jira(HIVE-25187). Until we have an appropriate fix in 
HIVE-25187, we can apply the following: 

Skip fetching all partitions. Instead, in the threadPool which loads each 
partition individually,  call get_partition() to check if the partition already 
exists in HMS or not.  

This will introduce additional getPartition() call for every partition to be 
loaded dynamically but removes fetching all existing partitions for a table. 

I believe this is fine since for tables with small number of existing 
partitions in HMS - getPartitions() won't add too much overhead but for tables 
with large number of existing partitions, it will certainly avoid getting all 
partitions from HMS 

cc - [~lpinter] [~ngangam] 

  was:
HIVE-20661 added an improvement in loadDynamicPartitions() api in Hive.java to 
not add partitions one by one in HMS. As part of that improvement, following 
code was introduced: 
{code:java}
// fetch all the partitions matching the part spec using the partition iterable
// this way the maximum batch size configuration parameter is considered
PartitionIterable partitionIterable = new PartitionIterable(Hive.get(), tbl, 
partSpec,
          conf.getInt(MetastoreConf.ConfVars.BATCH_RETRIEVE_MAX.getVarname(), 
300));
Iterator<Partition> iterator = partitionIterable.iterator();

// Match valid partition path to partitions
while (iterator.hasNext()) {
  Partition partition = iterator.next();
  partitionDetailsMap.entrySet().stream()
          .filter(entry -> 
entry.getValue().fullSpec.equals(partition.getSpec()))
          .findAny().ifPresent(entry -> {
            entry.getValue().partition = partition;
            entry.getValue().hasOldPartition = true;
          });
} {code}
The above code fetches all the existing partitions for a table from HMS and 
compare that dynamic partitions list to decide old and new partitions to be 
added to HMS (in batches). The call to fetch all partitions has introduced a 
performance regression for tables with large number of partitions (of the order 
of 100K). 

 

This is fixed for external tables in 
https://issues.apache.org/jira/browse/HIVE-25178.  However for ACID tables 
there is an open Jirahttps://issues.apache.org/jira/browse/HIVE-25187. Until we 
have an appropriate fix in HIVE-25187, we can apply the following: 

Skip fetching all partitions. Instead, in the threadPool which loads each 
partition individually,  call get_partition() to check if the partition already 
exists in HMS or not.  

This will introduce additional getPartition() call for every partition to be 
loaded dynamically but removes fetching all existing partitions for a table. 

I believe this is fine since for tables with small number of existing 
partitions in HMS - getPartitions() won't add too much overhead but for tables 
with large number of existing partitions, it will certainly avoid getting all 
partitions from HMS 

cc - [~lpinter] [~ngangam] 


> loadDynamicPartitions in Hive.java should not load all partitions of a 
> managed table 
> -------------------------------------------------------------------------------------
>
>                 Key: HIVE-25800
>                 URL: https://issues.apache.org/jira/browse/HIVE-25800
>             Project: Hive
>          Issue Type: Improvement
>          Components: Hive
>            Reporter: Sourabh Goyal
>            Assignee: Sourabh Goyal
>            Priority: Major
>
> HIVE-20661 added an improvement in loadDynamicPartitions() api in Hive.java 
> to not add partitions one by one in HMS. As part of that improvement, 
> following code was introduced: 
> {code:java}
> // fetch all the partitions matching the part spec using the partition 
> iterable
> // this way the maximum batch size configuration parameter is considered
> PartitionIterable partitionIterable = new PartitionIterable(Hive.get(), tbl, 
> partSpec,
>           conf.getInt(MetastoreConf.ConfVars.BATCH_RETRIEVE_MAX.getVarname(), 
> 300));
> Iterator<Partition> iterator = partitionIterable.iterator();
> // Match valid partition path to partitions
> while (iterator.hasNext()) {
>   Partition partition = iterator.next();
>   partitionDetailsMap.entrySet().stream()
>           .filter(entry -> 
> entry.getValue().fullSpec.equals(partition.getSpec()))
>           .findAny().ifPresent(entry -> {
>             entry.getValue().partition = partition;
>             entry.getValue().hasOldPartition = true;
>           });
> } {code}
> The above code fetches all the existing partitions for a table from HMS and 
> compare that dynamic partitions list to decide old and new partitions to be 
> added to HMS (in batches). The call to fetch all partitions has introduced a 
> performance regression for tables with large number of partitions (of the 
> order of 100K). 
>  
> This is fixed for external tables in 
> https://issues.apache.org/jira/browse/HIVE-25178.  However for ACID tables 
> there is an open Jira(HIVE-25187). Until we have an appropriate fix in 
> HIVE-25187, we can apply the following: 
> Skip fetching all partitions. Instead, in the threadPool which loads each 
> partition individually,  call get_partition() to check if the partition 
> already exists in HMS or not.  
> This will introduce additional getPartition() call for every partition to be 
> loaded dynamically but removes fetching all existing partitions for a table. 
> I believe this is fine since for tables with small number of existing 
> partitions in HMS - getPartitions() won't add too much overhead but for 
> tables with large number of existing partitions, it will certainly avoid 
> getting all partitions from HMS 
> cc - [~lpinter] [~ngangam] 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to