kasakrisz commented on code in PR #5498:
URL: https://github.com/apache/hive/pull/5498#discussion_r1808463369
##########
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java:
##########
@@ -471,6 +483,22 @@ public Map<String, String> getBasicStatistics(Partish partish) {
     Map<String, String> summary = table.currentSnapshot().summary();
     if (summary != null) {
+      if (Boolean.parseBoolean(summary.get(SnapshotSummary.PARTITION_SUMMARY_PROP)) &&
+          partish.getPartition() != null) {
+        String key = SnapshotSummary.CHANGED_PARTITION_PREFIX + partish.getPartition().getName();
+        Map<String, String> map = Maps.newHashMap();
+
+        if (summary.containsKey(key)) {
+          StringTokenizer tokenizer = new StringTokenizer(summary.get(key), ",");
+          while (tokenizer.hasMoreTokens()) {
+            String token = tokenizer.nextToken();
+            String[] keyValue = token.split("=");
+            map.put(keyValue[0], keyValue[1]);
+          }
+        }
+        summary = map;
Review Comment:
This can be extracted to a method that parses the summary entry into a `Map`.
That may help reduce the cyclomatic complexity of the `getBasicStatistics`
method.
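One possible shape for the extracted helper — a sketch only: `parseSummaryEntry` is a hypothetical name, and a plain `HashMap` replaces the Guava `Maps.newHashMap()` call from the diff so the example stays self-contained. It mirrors the inline tokenizer loop from the hunk above, with a defensive length check added for malformed tokens.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class SummaryParser {
  // Hypothetical helper: parses one snapshot-summary entry of the form
  // "k1=v1,k2=v2" into a Map, as the inline loop in getBasicStatistics does.
  public static Map<String, String> parseSummaryEntry(String entry) {
    Map<String, String> map = new HashMap<>();
    if (entry == null) {
      return map;
    }
    StringTokenizer tokenizer = new StringTokenizer(entry, ",");
    while (tokenizer.hasMoreTokens()) {
      // split with limit 2 so values containing '=' are kept whole
      String[] keyValue = tokenizer.nextToken().split("=", 2);
      if (keyValue.length == 2) {
        map.put(keyValue[0], keyValue[1]);
      }
    }
    return map;
  }
}
```

With the loop behind a named method, `getBasicStatistics` keeps only the partition-summary branch itself.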
##########
ql/src/test/results/clientpositive/llap/vector_bucket.q.out:
##########
@@ -61,10 +61,11 @@ STAGE PLANS:
             Execution mode: llap
             LLAP IO: no inputs
             Map Vectorization:
-                enabled: false
+                enabled: true
                 enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true
-                enabledConditionsNotMet: Could not enable vectorization due to partition column names size 1 is greater than the number of table column names size 0 IS false
                 inputFileFormats: org.apache.hadoop.hive.ql.io.NullRowsInputFormat
+                notVectorizedReason: UDTF Operator (UDTF) not supported
+                vectorized: false
Review Comment:
What is the cause of these changes? This is not an Iceberg-related test.
##########
ql/src/java/org/apache/hadoop/hive/ql/exec/ColumnInfo.java:
##########
@@ -166,9 +170,13 @@ public String getTabAlias() {
   }

   public boolean getIsVirtualCol() {
-    return isVirtualCol;
+    return isVirtualCol || isPartitionCol;
Review Comment:
Please don't mix virtual columns and partition columns. These are very
different things. `getIsVirtualCol()` should depend on `isVirtualCol` only.
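A minimal sketch of the point, using a hypothetical stand-in class (the real `ColumnInfo` carries many more fields): each flag gets its own accessor, so virtual-column and partition-column checks stay independent and callers opt into the distinction explicitly.

```java
// Hypothetical stand-in for ColumnInfo's two flags; illustrative only.
public class ColumnFlags {
  private final boolean isVirtualCol;
  private final boolean isPartitionCol;

  public ColumnFlags(boolean isVirtualCol, boolean isPartitionCol) {
    this.isVirtualCol = isVirtualCol;
    this.isPartitionCol = isPartitionCol;
  }

  // Depends on isVirtualCol only, as the review suggests.
  public boolean getIsVirtualCol() {
    return isVirtualCol;
  }

  // Partition columns are queried through their own accessor.
  public boolean getIsPartitionCol() {
    return isPartitionCol;
  }
}
```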
##########
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java:
##########
@@ -445,9 +444,9 @@ public static List<DeleteFile> getDeleteFiles(Table table, String partitionPath)
             t -> ((PositionDeletesScanTask) t).file()));
   }

-  public static Expression generateExpressionFromPartitionSpec(Table table, Map<String, String> partitionSpec)
-      throws SemanticException {
-    Map<String, PartitionField> partitionFieldMap = getPartitionFields(table).stream()
+  public static Expression generateExpressionFromPartitionSpec(Table table, Map<String, String> partitionSpec,
+      boolean latestSpecOnly) throws SemanticException {
+    Map<String, PartitionField> partitionFieldMap = getPartitionFields(table, latestSpecOnly).stream()
Review Comment:
The method `generateExpressionFromPartitionSpec` is called from two places,
and the actual parameter value of `latestSpecOnly` is always `true`. How
about calling
```
icebergTable.spec().fields()
```
instead of
```
getPartitionFields(table, latestSpecOnly)
```
##########
ql/src/java/org/apache/hadoop/hive/ql/metadata/DummyPartition.java:
##########
@@ -51,11 +48,15 @@ public DummyPartition(Table tbl, String name) {
     this.name = name;
   }

-  public DummyPartition(Table tbl, String name,
-      Map<String, String> partSpec) {
-    setTable(tbl);
+  public DummyPartition(Table tbl, String name, Map<String, String> partSpec) throws HiveException {
Review Comment:
Is the functionality of `DummyPartition` actually used? I saw that where we
create `DummyPartition` instances we return a `Partition`-typed reference and
never call any method defined on `DummyPartition`. If that is not the case then
`DummyPartition` is fine. The name is misleading though, since these objects
represent real partitions, don't they?
##########
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java:
##########
@@ -129,19 +129,19 @@ public static Table getTable(Configuration configuration, org.apache.hadoop.hive
    */
   static Table getTable(Configuration configuration, Properties properties, boolean skipCache) {
     String metaTable = properties.getProperty(IcebergAcidUtil.META_TABLE_PROPERTY);
-    String tableName = properties.getProperty(Catalogs.NAME);
-    String location = properties.getProperty(Catalogs.LOCATION);
+    Properties props = (metaTable != null) ? new Properties() : properties;
+
     if (metaTable != null) {
+      props.computeIfAbsent(InputFormatConfig.CATALOG_NAME, properties::get);
       // HiveCatalog, HadoopCatalog uses NAME to identify the metadata table
-      properties.setProperty(Catalogs.NAME, tableName + "." + metaTable);
+      props.computeIfAbsent(Catalogs.NAME, k -> properties.get(k) + "." + metaTable);
       // HadoopTable uses LOCATION to identify the metadata table
-      properties.setProperty(Catalogs.LOCATION, location + "#" + metaTable);
+      props.computeIfAbsent(Catalogs.LOCATION, k -> properties.get(k) + "#" + metaTable);
     }
Review Comment:
Is this a necessary change? The original code had one `if` statement; the new
one has five conditional branches.
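A sketch of what a single-branch version might look like — hypothetical: plain string keys stand in for `Catalogs.NAME` / `Catalogs.LOCATION`, and the copy-then-adjust of the properties happens entirely inside one `if`.

```java
import java.util.Properties;

public class MetaTableProps {
  // Hypothetical single-branch version: only when a metadata table is
  // requested do we copy the properties and rewrite the name/location keys.
  public static Properties forMetaTable(Properties properties, String metaTable) {
    if (metaTable == null) {
      return properties;
    }
    Properties props = new Properties();
    props.putAll(properties);
    // stand-ins for Catalogs.NAME and Catalogs.LOCATION
    props.setProperty("name", properties.getProperty("name") + "." + metaTable);
    props.setProperty("location", properties.getProperty("location") + "#" + metaTable);
    return props;
  }
}
```

The early return keeps the common (non-metadata-table) path untouched, and the single branch holds all of the metadata-table adjustments.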
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]