[GitHub] [hive] jcamachor commented on a change in pull request #1324: HIVE-23939: SharedWorkOptimizer: take the union of columns in mergeable TableScans

GitBox Thu, 30 Jul 2020 19:35:43 -0700


jcamachor commented on a change in pull request #1324:
URL: https://github.com/apache/hive/pull/1324#discussion_r463040481




##########
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/SharedWorkOptimizer.java
##########
@@ -247,6 +249,18 @@ public ParseContext transform(ParseContext pctx) throws 
SemanticException {
       }
     }
 
+    if (LOG.isDebugEnabled()) {

Review comment:
       Can we move the new call before the 
`if(pctx.getConf().getBoolVar(ConfVars.HIVE_SHARED_WORK_REUSE_MAPJOIN_CACHE)) 
{` block? It makes sense to trigger that block at the very end in case we 
continue adding phases.

##########
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/SharedWorkOptimizer.java
##########
@@ -247,6 +249,18 @@ public ParseContext transform(ParseContext pctx) throws 
SemanticException {
       }
     }
 
+    if (LOG.isDebugEnabled()) {
+      LOG.debug("Before SharedWorkOptimizer #2:\n" + 
Operator.toString(pctx.getTopOps().values()));
+    }
+
+    // Execute shared work optimization
+    new BaseSharedWorkOptimizer().sharedWorkOptimization(

Review comment:
       Can we put this additional step under a new flag (`true` by default)?

##########
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/SharedWorkOptimizer.java
##########
@@ -273,258 +287,332 @@ public ParseContext transform(ParseContext pctx) throws 
SemanticException {
     return pctx;
   }
 
-  private static boolean sharedWorkOptimization(ParseContext pctx, 
SharedWorkOptimizerCache optimizerCache,
-      ArrayListMultimap<String, TableScanOperator> tableNameToOps, 
List<Entry<String, Long>> sortedTables,
-      boolean removeSemijoin) throws SemanticException {
-    // Boolean to keep track of whether this method actually merged any TS 
operators
-    boolean mergedExecuted = false;
-
-    Multimap<String, TableScanOperator> existingOps = 
ArrayListMultimap.create();
-    Set<Operator<?>> removedOps = new HashSet<>();
-    for (Entry<String, Long> tablePair : sortedTables) {
-      String tableName = tablePair.getKey();
-      for (TableScanOperator discardableTsOp : tableNameToOps.get(tableName)) {
-        if (removedOps.contains(discardableTsOp)) {
-          LOG.debug("Skip {} as it has already been removed", discardableTsOp);
-          continue;
-        }
-        Collection<TableScanOperator> prevTsOps = existingOps.get(tableName);
-        for (TableScanOperator retainableTsOp : prevTsOps) {
-          if (removedOps.contains(retainableTsOp)) {
-            LOG.debug("Skip {} as it has already been removed", 
retainableTsOp);
+  private static class BaseSharedWorkOptimizer {

Review comment:
       Can we add a clarifying comment to the new internal classes with the 
difference between both of them?

##########
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/SharedWorkOptimizer.java
##########
@@ -247,6 +249,18 @@ public ParseContext transform(ParseContext pctx) throws 
SemanticException {
       }
     }
 
+    if (LOG.isDebugEnabled()) {
+      LOG.debug("Before SharedWorkOptimizer #2:\n" + 
Operator.toString(pctx.getTopOps().values()));

Review comment:
       This is the same as `After SharedWorkSJOptimizer`, no need to print it 
again.

##########
File path: 
ql/src/test/results/clientpositive/llap/annotate_stats_join_pkfk.q.out
##########
@@ -1191,14 +1191,6 @@ STAGE PLANS:
                         sort order: +
                         Map-reduce partition columns: _col0 (type: int)
                         Statistics: Num rows: 12 Data size: 48 Basic stats: 
COMPLETE Column stats: COMPLETE
-            Execution mode: vectorized, llap

Review comment:
       I do not see projected columns, so it is difficult to confirm whether 
this is because of the schema. Is it?

##########
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/SharedWorkOptimizer.java
##########
@@ -247,6 +249,18 @@ public ParseContext transform(ParseContext pctx) throws 
SemanticException {
       }
     }
 
+    if (LOG.isDebugEnabled()) {
+      LOG.debug("Before SharedWorkOptimizer #2:\n" + 
Operator.toString(pctx.getTopOps().values()));
+    }
+
+    // Execute shared work optimization
+    new BaseSharedWorkOptimizer().sharedWorkOptimization(
+        pctx, optimizerCache, tableNameToOps, sortedTables, false);
+
+    if (LOG.isDebugEnabled()) {
+      LOG.debug("After SharedWorkOptimizer #2:\n" + 
Operator.toString(pctx.getTopOps().values()));

Review comment:
       `After SharedWorkOptimizer merging TS schema`?

##########
File path: 
ql/src/test/results/clientpositive/llap/auto_join_reordering_values.q.out
##########
@@ -144,122 +144,30 @@ STAGE PLANS:
                         tag: 0
                         value expressions: _col0 (type: int), _col2 (type: 
int), _col3 (type: int)
                         auto parallelism: true
-            Execution mode: vectorized, llap
-            LLAP IO: no inputs
-            Path -> Alias:
-#### A masked pattern was here ####
-            Path -> Partition:
-#### A masked pattern was here ####
-                Partition
-                  base file name: orderpayment_small
-                  input format: org.apache.hadoop.mapred.TextInputFormat
-                  output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
-                  properties:
-                    bucket_count -1
-                    bucketing_version 2
-                    column.name.delimiter ,
-                    columns dealid,date,time,cityid,userid
-                    columns.types int:string:string:int:int
-#### A masked pattern was here ####
-                    name default.orderpayment_small
-                    serialization.format 1
-                    serialization.lib 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
-                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
-                
-                    input format: org.apache.hadoop.mapred.TextInputFormat
-                    output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
-                    properties:
-                      bucketing_version 2
-                      column.name.delimiter ,
-                      columns dealid,date,time,cityid,userid
-                      columns.comments 
-                      columns.types int:string:string:int:int
-#### A masked pattern was here ####
-                      name default.orderpayment_small
-                      serialization.format 1
-                      serialization.lib 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
-                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
-                    name: default.orderpayment_small
-                  name: default.orderpayment_small
-            Truncated Path -> Alias:
-              /orderpayment_small [orderpayment]
-        Map 6 
-            Map Operator Tree:
-                TableScan
-                  alias: dim_pay_date
-                  filterExpr: date is not null (type: boolean)
-                  Statistics: Num rows: 1 Data size: 94 Basic stats: COMPLETE 
Column stats: COMPLETE
-                  GatherStats: false
                   Filter Operator
                     isSamplingPred: false
-                    predicate: date is not null (type: boolean)
-                    Statistics: Num rows: 1 Data size: 94 Basic stats: 
COMPLETE Column stats: COMPLETE
+                    predicate: dealid is not null (type: boolean)
+                    Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE 
Column stats: COMPLETE
                     Select Operator
-                      expressions: date (type: string)

Review comment:
       Why is this projection changing? Is this correct? I do not see the 
branch with `date` column.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [hive] jcamachor commented on a change in pull request #1324: HIVE-23939: SharedWorkOptimizer: take the union of columns in mergeable TableScans

Reply via email to