kaijianding edited a comment on pull request #11379:
URL: https://github.com/apache/druid/pull/11379#issuecomment-876371771


   > I don't think the number of rows should be different in those cases. This optimization modifies not just the granularity but also the dimensions. When granularity is set to ALL, the dimensions should include the virtual column for the timefloor function. When granularity is set to HOUR, the dimensions should not include that virtual column. In these cases, the actual keys used for grouping are effectively the same because the ALL granularity groups all rows into one bucket. As a result, the number of rows must be the same in both cases.
   
   > select floor(__time to hour),dim1 from a_table group by floor(__time to 
hour),dim1 limit 10
   
   When granularity=ALL, `cursors` contains only 1 element, and only 10 results are produced on the compute node.
   When granularity=HOUR, `cursors` contains 24 elements for 1 day, and 10 results are produced per `cursor` on the compute node. That is 24 times the cost for 1 day; if the time interval is 1 year, it is 8,760 times.
   This is the main reason I think it's a bad idea to change the granularity from ALL to HOUR.
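   The cost difference can be sketched with a toy model (Python; this is an illustration, not Druid's actual cursor code, and all names here are assumptions): each time bucket produced by the query granularity becomes one "cursor", and the limit is applied per cursor on the compute node.

```python
from collections import defaultdict

# Toy model: one "cursor" per time bucket; the limit is applied per cursor.
def partial_results(rows, bucket_fn, limit):
    cursors = defaultdict(list)
    for ts, dim1 in rows:
        cursors[bucket_fn(ts)].append(dim1)
    out = []
    for bucket, values in cursors.items():
        # per-cursor grouping on (bucket, dim1), truncated at the limit
        groups = sorted(set(values))[:limit]
        out.extend((bucket, v) for v in groups)
    return out

# One day of data: hours 0..23, 50 distinct dim1 values per hour.
rows = [(h, f"dim{i:02d}") for h in range(24) for i in range(50)]

all_gran  = partial_results(rows, lambda ts: 0, limit=10)   # granularity=ALL
hour_gran = partial_results(rows, lambda ts: ts, limit=10)  # granularity=HOUR

# granularity=ALL emits 10 partial rows; granularity=HOUR emits
# 24 * 10 = 240, i.e. 24 times the work for one day of data.
```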
   
   
   > What logic should we modify? The optimization you added is modifying the 
native query that is passed all the way down to the data nodes which use the 
BufferGrouper. Why does pushing down the limit make any difference in the 
result signature of the data nodes?
   
   > select dim1,floor(__time to hour),dim2 from a_table group by 
dim1,floor(__time to hour),dim2 limit 10
   
   When granularity=ALL, the keys in the `grouper` are dim1, time_to_hour_dim, dim2, with limit=10.
   When granularity=HOUR, the keys in the `grouper` are dim1, dim2, with limit=10, so the overall ordered result sequence (considering the time) becomes different from the one produced when granularity=ALL.
   Something is needed to make the order correct again on the compute node, like the final sorting done on the broker.
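   The ordering difference can be illustrated with a small sketch (Python; a toy model of the merged row order, not Druid internals):

```python
# GROUP BY dim1, floor(__time TO HOUR), dim2 over a tiny dataset.
rows = [(d1, h, d2) for d1 in ("a", "b")
                    for h in (0, 1)
                    for d2 in ("x", "y")]

# granularity=ALL: one grouper keyed on (dim1, hour, dim2), so rows come
# out sorted on the full key, time column included.
all_order = sorted(set(rows))

# granularity=HOUR: one grouper per hour keyed on (dim1, dim2) only; the
# per-cursor results are concatenated in time-bucket order.
hour_order = []
for h in (0, 1):
    for d1, d2 in sorted({(d1, d2) for d1, hh, d2 in rows if hh == h}):
        hour_order.append((d1, h, d2))

# The two merged sequences differ, so a limit applied per cursor can keep
# different rows; a final sort (e.g. on the broker) is needed to restore
# the granularity=ALL ordering.
```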
   
   > And the last reason: if we don't push down the limit to the compute node, but apply this optimization and do the limit on the broker, I guess the total performance is degraded.
   
   > Hmm, I'm confused. Are you saying that we should push down the limit 
because the performance will be worse otherwise? If so, I agree.
   
   Yes, I mean the limit should be pushed down; otherwise the performance will be worse.
   
   
   > What do you mean by "the timestampResult optimization logic happens on 
broker"? The rewritten query will be passed down to the data nodes as far as I 
can tell.
   
   I mean the main change is to make the code on the broker compatible with the fact that the format of the result rows from the compute nodes has changed, while leaving the groupBy processing on the compute nodes unchanged.
   
   
   
   Based on these reasons, I think the timestampResult optimization is not necessary when `query.isApplyLimitPushDown()` is true.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


