kasakrisz commented on code in PR #6374:
URL: https://github.com/apache/hive/pull/6374#discussion_r3027804243


##########
ql/src/java/org/apache/hadoop/hive/ql/parse/ColumnStatsAutoGatherContext.java:
##########
@@ -256,14 +256,32 @@ private void replaceSelectOperatorProcess(SelectOperator 
operator, Operator<? ex
     //                                |
 
     // 1. deal with non-partition columns
+    Map<String, Integer> columnNameToIndex = new HashMap<>();
+    List<ColumnInfo> selRSSig = selRS.getSignature();
+    for (int i = 0; i < selRSSig.size(); i++) {
+      columnNameToIndex.putIfAbsent(selRSSig.get(i).getAlias(), i);
+    }
     for (int i = 0; i < this.columns.size(); i++) {
       ColumnInfo col = columns.get(i);
+      ObjectInspector objectInspector = col.getObjectInspector();
+      if (objectInspector == null) {
+        continue;
+      }
+      boolean columnSupported = 
isColumnSupported(objectInspector.getCategory(), col::getType);
+      if (!columnSupported) {
+        continue;
+      }
+
+      Integer selRSIdx = columnNameToIndex.get(this.columns.get(i).getName());
+      if (selRSIdx == null) {
+        continue;
+      }

Review Comment:
   What was the original behavior? Did we simply exit this loop when a column 
that didn't support autogather was found?"



##########
ql/src/java/org/apache/hadoop/hive/ql/parse/ColumnStatsSemanticAnalyzer.java:
##########
@@ -103,11 +102,27 @@ private boolean shouldRewrite(ASTNode tree) {
     return rwt;
   }
 
+  /**
+   * Get the names of the columns that support column statistics.
+   */
+  private static List<String> getColumnNames(Table tbl) {

Review Comment:
   How about `getColumnNamesSupportStats` ?



##########
ql/src/test/queries/clientpositive/stats_unsupported_type.q:
##########
@@ -0,0 +1,11 @@
+set hive.stats.autogather=true;

Review Comment:
   I went through the golden file changes in this PR, and many of them 
indicates that   column stats autogathering is enabled.
   
   Do we need this additional test? I have the impression that we already have 
extensive coverage.



##########
ql/src/test/results/clientpositive/llap/lateral_view.q.out:
##########
@@ -671,23 +671,23 @@ STAGE PLANS:
             Map Operator Tree:
                 TableScan
                   alias: tmp_pyang_src_rcfile
-                  Statistics: Num rows: 20 Data size: 42080 Basic stats: 
COMPLETE Column stats: NONE
+                  Statistics: Num rows: 20 Data size: 40140 Basic stats: 
COMPLETE Column stats: PARTIAL

Review Comment:
   `Partial` is still better than `None` but do you know what is missing for 
`Complete`?



##########
ql/src/test/results/clientpositive/llap/cast_null_to_complex.q.out:
##########
@@ -87,7 +87,7 @@ Retention:            0
 #### A masked pattern was here ####
 Table Type:            MANAGED_TABLE            
 Table Parameters:               
-       COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
+       COLUMN_STATS_ACCURATE   
{\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"_c2\":\"true\"}}

Review Comment:
   Although this test wasn't specifically intended for column stats 
autogathering, it tests that functionality as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to