jcamachor commented on a change in pull request #1878:
URL: https://github.com/apache/hive/pull/1878#discussion_r564886525



##########
File path: ql/src/test/results/clientpositive/llap/subquery_in.q.out
##########
@@ -408,12 +408,18 @@ STAGE PLANS:
                     expressions: (UDFToDouble(_col0) / _col1) (type: double)
                     outputColumnNames: _col0
                     Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
-                    Reduce Output Operator
-                      key expressions: _col0 (type: double)
-                      null sort order: z
-                      sort order: +
-                      Map-reduce partition columns: _col0 (type: double)
+                    Group By Operator

Review comment:
       @vineetgarg02, I was checking this. In the previous plan we were 
executing an inner join; in this plan we are executing a semijoin. From 
looking at the code, it seems that for a semijoin we unconditionally create a 
map-side Group By operator, without considering whether that Group By would 
actually reduce the input data: 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L9406
 . That may not be too bad, since the Group By can internally switch to 
streaming mode if it is not reducing the input size.
   From your comment, though, I understand that some optimization may have 
kicked in to remove that Group By? Could you elaborate?
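
   To make the "switch to streaming mode" point concrete, here is a toy 
sketch of the idea. It is not the GroupByOperator code: the class name, the 
two constructor parameters, and the exact check are assumptions for 
illustration only (in Hive, I believe the corresponding settings are 
hive.groupby.mapaggr.checkinterval and hive.map.aggr.hash.min.reduction). The 
operator deduplicates keys in a hash set, and if the set is not shrinking the 
input after a fixed number of rows, it flushes and forwards rows unchanged.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/**
 * Toy model of a map-side Group By with no aggregates (a distinct on the key).
 * All names and thresholds here are hypothetical, not Hive's implementation.
 */
public class MapSideGroupBySketch {
    private final int checkInterval;    // assumed: rows to observe before deciding
    private final double minReduction;  // assumed: max tolerated distinctKeys / rowsSeen
    private final Set<Double> hashed = new LinkedHashSet<>();
    private final List<Double> output = new ArrayList<>();
    private long rowsSeen = 0;
    private boolean streaming = false;

    public MapSideGroupBySketch(int checkInterval, double minReduction) {
        this.checkInterval = checkInterval;
        this.minReduction = minReduction;
    }

    /** Consume one input row, identified by its grouping key. */
    public void process(double key) {
        rowsSeen++;
        if (streaming) {
            // Streaming mode: forward the row as-is; the reducer still deduplicates.
            output.add(key);
            return;
        }
        hashed.add(key);
        // One-time check: is hashing reducing the input enough to be worth it?
        if (rowsSeen == checkInterval && hashed.size() > minReduction * rowsSeen) {
            output.addAll(hashed);  // flush what was buffered so far
            hashed.clear();
            streaming = true;
        }
    }

    /** Flush any keys still buffered in hash mode and return everything emitted. */
    public List<Double> close() {
        output.addAll(hashed);
        hashed.clear();
        return output;
    }

    public boolean isStreaming() {
        return streaming;
    }
}
```

   With few distinct keys the sketch stays in hash mode and emits one row per 
key; with mostly unique keys it gives up after the check and passes rows 
through, so the extra Group By costs little beyond the initial hashing.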



