zhangbutao commented on code in PR #5063: URL: https://github.com/apache/hive/pull/5063#discussion_r1520802015
########## standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java ##########
@@ -742,14 +742,8 @@ public List<Partition> getPartitionsViaPartNames(final String catName, final Str
     return Batchable.runBatched(batchSize, partNames, new Batchable<String, Partition>() {
       @Override
       public List<Partition> run(List<String> input) throws MetaException {
-        String filter = "" + PARTITIONS + ".\"PART_NAME\" in (" + makeParams(input.size()) + ")";
-        List<Long> partitionIds = getPartitionIdsViaSqlFilter(catName, dbName, tblName,

Review Comment:
> Skipping get partition ids in MetaStoreDirectSql#getPartitionsViaPartNames to reduce the Database calls.

It seems you merged the two SQL queries (getPartitionIdsViaSqlFilter and `getPartitionsFromPartitionIds`) into a single query, which does reduce the number of database calls. Your change may well be reasonable, but there may also have been other reasons why the original code issued two queries. It would be good if you could find any clues in the history that explain the original design. :)
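For context on the removed code path: the deleted filter string used a makeParams-style helper to emit one `?` placeholder per partition name, so each name was bound as a JDBC parameter rather than inlined as a literal. A minimal standalone sketch of that pattern (the `makeParams` body here is an assumption mirroring its apparent behavior, not Hive's actual implementation):

```java
import java.util.Collections;
import java.util.List;

public class InClauseSketch {
  // Stand-in for MetaStoreDirectSql#makeParams (assumption): emits "?,?,?"
  // so each value is bound as a JDBC parameter instead of being inlined.
  static String makeParams(int n) {
    return String.join(",", Collections.nCopies(n, "?"));
  }

  public static void main(String[] args) {
    List<String> input = List.of("ds=2024-01-01", "ds=2024-01-02");
    // Mirrors the shape of the removed filter string.
    String filter = "\"PARTITIONS\".\"PART_NAME\" in (" + makeParams(input.size()) + ")";
    System.out.println(filter); // "PARTITIONS"."PART_NAME" in (?,?)
  }
}
```

With this shape, the two-step version first resolved `PART_ID`s via the bound filter and then fetched the full rows by id, at the cost of an extra round trip per batch.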
########## ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java ##########
@@ -3199,19 +3199,24 @@ public Map<Map<String, String>, Partition> loadDynamicPartitions(final LoadTable
         partitionNames.add(Warehouse.makeDynamicPartNameNoTrailingSeperator(details.fullSpec));
       }
     }
-    List<Partition> partitions = Hive.get().getPartitionsByNames(tbl, partitionNames);
-    for(Partition partition : partitions) {
-      LOG.debug("HMS partition spec: {}", partition.getSpec());
-      partitionDetailsMap.entrySet().parallelStream()
-          .filter(entry -> entry.getValue().fullSpec.equals(partition.getSpec()))
-          .findAny().ifPresent(entry -> {
-            entry.getValue().partition = partition;
-            entry.getValue().hasOldPartition = true;
-          });
-    }
-    // no need to fetch partition again in tasks since we have already fetched partitions
-    // info in getPartitionsByNames()
-    fetchPartitionInfo = false;
+    try {
+      List<Partition> partitions = Hive.get().getPartitionsByNames(tbl, partitionNames);
+      for(Partition partition : partitions) {
+        LOG.debug("HMS partition spec: {}", partition.getSpec());
+        partitionDetailsMap.entrySet().parallelStream()
+            .filter(entry -> entry.getValue().fullSpec.equals(partition.getSpec()))
+            .findAny().ifPresent(entry -> {
+              entry.getValue().partition = partition;
+              entry.getValue().hasOldPartition = true;
+            });
+      }
+      // no need to fetch partition again in tasks since we have already fetched partitions
+      // info in getPartitionsByNames()
+      fetchPartitionInfo = false;
+    } catch (HiveException e) {
+      // Failed to fetch partitions in some cases (e.g., partition number limit)
+      LOG.warn("Fetching partitions by names failed.", e);

Review Comment:
IMO, this exception should be thrown directly rather than swallowed: the engines (HS2 and Spark) need to know that the `load_partition` failed so they can roll back based on the exception, and users need to know that they should adjust the limit via configuration.
BTW, I think we had better not change this client code (I know you are trying to fix the failing tests). The PR should not have any effect on the client. Maybe we can fix this on the server side instead, to minimize potential incompatibilities?

########## standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java ##########
@@ -1027,15 +1021,42 @@ public <T> List<T> getPartitionFieldsViaSqlFilter(
     }
   }

+  /** Should be called with the list short enough to not trip up Oracle/etc. */
+  private List<Partition> getPartitionsFromPartitionNames(String catName, String dbName,
+      String tblName, Boolean isView, List<String> partNameList, List<String> projectionFields,
+      boolean isAcidTable, GetPartitionsArgs args) throws MetaException {
+    // Get most of the fields for the partNames provided.
+    // Assume db and table names are the same for all partition, as provided in arguments.
+    String partNames = partNameList.stream()
+        .map(name -> "'" + name + "'")
+        .collect(Collectors.joining(","));
+    String queryText =

Review Comment:
Can we ensure that other backend databases, such as Oracle and PostgreSQL, can run this SQL without incompatibilities?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
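The portability concern in the last comment above can be made concrete: inlining quoted partition names breaks as soon as a value contains a quote character, and literal-quoting and escaping rules vary across backends, whereas placeholder binding leaves quoting to the JDBC driver. A standalone sketch contrasting the two (class and method names are hypothetical, for illustration only, not Hive's code):

```java
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

public class PartNameQuoting {
  // Inlined-literal approach, as in the new getPartitionsFromPartitionNames:
  // concatenates each name into a quoted SQL literal.
  static String inlineLiterals(List<String> partNames) {
    return partNames.stream()
        .map(name -> "'" + name + "'")
        .collect(Collectors.joining(","));
  }

  // Placeholder approach: one "?" per value, bound through JDBC,
  // so the driver applies the backend's own quoting rules.
  static String placeholders(int n) {
    return String.join(",", Collections.nCopies(n, "?"));
  }

  public static void main(String[] args) {
    // A partition value containing a quote corrupts the inlined SQL text,
    // while the placeholder form stays well-formed on any backend.
    List<String> names = List.of("city=O'Brien", "city=Paris");
    System.out.println("in (" + inlineLiterals(names) + ")");
    System.out.println("in (" + placeholders(names.size()) + ")");
  }
}
```

Note also that some backends cap the size of an IN list (Oracle rejects expression lists of more than 1000 items), which is presumably why the new method's javadoc warns that the list must be kept short; the placeholder form has the same length limit but avoids the quoting hazard.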