siddharthteotia opened a new issue #4372: Support for single phase and 
two-phase distributed hash aggregation
URL: https://github.com/apache/incubator-pinot/issues/4372
 
 
   I am not entirely sure to what extent this is already supported -- so 
apologies if this is something that is already done or something not applicable
   
   Idea:
   
   GROUP BY processing using hash based aggregation in a distributed query 
engine can be done  using two different methods and each one has an impact on 
performance depending on cardinality
   
   (1) **Two-phase hash aggregation:** In the first phase, each node will do a 
local group by from the data (segment) on that node -- essentially computing 
partial resultset from a single thread query execution for a given segment at a 
given node.
   
   The physical plan or execution plan will have a shuffle above the 
aggregation group by operator -- the shuffle operator will use the partial 
aggregates and send them around such that each node now ends up with all the 
data (partial aggregates) corresponding to group by key(s). 
   
   The second phase of hash agg will now work on the partial aggregates and 
compute the final resultset for a given set of key(s) per node. Each node can 
now send its result set to broker. Broker only needs to union as opposed to 
doing any aggregation related processing.
   
   (2) **Single-phase hash aggregation**: The above described two phase method 
proves to be useful if the cardinality of group by key columns is low -- since 
doing local hash agg first achieves a higher degree of reduction and thus 
reduces the data sent around. However, if cardinality is very high then it is 
better to do single phase aggregation. In this case, below the aggregation 
operator, there will be a shuffle operator. Again, after hash agg, each node 
can send the final result set to broker and it just needs to union.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to