jnturton commented on pull request #2432:
URL: https://github.com/apache/drill/pull/2432#issuecomment-1023528517


   @tdunning thank you for clarifying.  I speak under correction, but the 
application of the t-digest here will be to help the query planner estimate 
relation cardinalities, statistics which it will then use to decide, e.g., 
which of two relations should be the "build side" of a hash join.  
Optimally, the smaller relation goes on the build side in that example.
   
   To estimate a relation's cardinality after a range predicate has been 
applied, the planner can reach for quantiles that, with luck, were precomputed 
over the relevant column.  I can't see a reason why real-world range predicates 
would be more focused on the tails of distributions, the shapes of which we'd 
probably like to assume nothing about anyway.
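
   To make the estimate concrete, here is a minimal sketch, assuming the 
planner has equally spaced quantile boundaries for the column (in practice 
they would be read out of the t-digest).  The function names and the sample 
numbers are made up for illustration and are not Drill's planner API.

```python
from bisect import bisect_right


def cdf_from_quantiles(quantiles, x):
    """Approximate P(value <= x) from equally spaced quantile boundaries.

    quantiles[i] is the column value at rank fraction i / (len(quantiles) - 1).
    Assumes values are spread evenly inside each bucket (linear interpolation).
    """
    n = len(quantiles) - 1
    if x <= quantiles[0]:
        return 0.0
    if x >= quantiles[-1]:
        return 1.0
    # Index of the bucket [quantiles[i], quantiles[i + 1]) that contains x.
    i = bisect_right(quantiles, x) - 1
    lo, hi = quantiles[i], quantiles[i + 1]
    return (i + (x - lo) / (hi - lo)) / n


def estimate_rows(total_rows, quantiles, low, high):
    """Estimated row count surviving the predicate low <= col <= high."""
    selectivity = cdf_from_quantiles(quantiles, high) - cdf_from_quantiles(quantiles, low)
    return total_rows * max(0.0, selectivity)


def pick_build_side(rows_a, rows_b):
    """The smaller estimated relation goes on the build side of the hash join."""
    return "a" if rows_a <= rows_b else "b"


if __name__ == "__main__":
    # Hypothetical tables: quartile boundaries and row counts are invented.
    rows_a = estimate_rows(1000, [0, 10, 20, 30, 40], 5, 25)
    rows_b = estimate_rows(2000, [0, 100, 200, 300, 400], 0, 50)
    print(rows_a, rows_b, pick_build_side(rows_a, rows_b))
```

   Note that the error of this estimate depends on how well the quantiles 
are known wherever the predicate bounds happen to fall, which is why 
uniform accuracy across the distribution seems the thing to want here.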
   
   My take is that probably any t-digest scale function is accurate enough 
for a query planner, but of them all I'd agree that K_0 is probably the best 
choice under the assumptions here.  We can now either hard-code Drill to K_0 or 
default it to K_0 and add a config option so that users can override it.  I'm 
not convinced the config option would justify its inclusion, because it is one 
that users would need to vary from table to table depending on the specifics of 
the data there.  So I'm inclined to just hard-code a good general default.
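
   For reference, a small sketch of why K_0 fits the "assume nothing about 
the tails" case.  It uses the scale functions as I understand them from the 
t-digest paper, k_0(q) = (delta/2) q and k_1(q) = (delta/(2 pi)) asin(2q - 1), 
together with the rule that a cluster may span at most one unit of k.  The 
helper names and delta = 100 are my own assumptions, and the library's 
implementation may normalize differently.

```python
import math


def k0(q, delta):
    # Linear scale function: every cluster gets the same share of the data.
    return delta / 2.0 * q


def k0_inv(k, delta):
    return 2.0 * k / delta


def k1(q, delta):
    # Arcsine scale function: clusters shrink towards q = 0 and q = 1.
    return delta / (2.0 * math.pi) * math.asin(2.0 * q - 1.0)


def k1_inv(k, delta):
    return (math.sin(2.0 * math.pi * k / delta) + 1.0) / 2.0


def max_cluster_width(q_left, delta, k, k_inv):
    """Largest quantile span of a cluster that starts at q_left.

    A cluster may cover at most one unit on the k scale, capped at k(1).
    """
    k_right = min(k(q_left, delta) + 1.0, k(1.0, delta))
    return k_inv(k_right, delta) - q_left


if __name__ == "__main__":
    delta = 100.0  # hypothetical compression parameter
    for q in (0.001, 0.1, 0.5):
        w0 = max_cluster_width(q, delta, k0, k0_inv)
        w1 = max_cluster_width(q, delta, k1, k1_inv)
        print(f"q={q}: K_0 width={w0:.5f}  K_1 width={w1:.5f}")
```

   Under these formulas K_0 allows the same cluster width everywhere, 
while K_1 spends its resolution on the tails at the cost of coarser clusters 
near the median.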
   
   

