jnturton commented on pull request #2432: URL: https://github.com/apache/drill/pull/2432#issuecomment-1023528517
@tdunning thank you for clarifying. I speak under correction, but the application of the t-digest here will be to help the query planner estimate relation cardinalities, statistics which it will then use to decide, e.g., which of two relations should be the "build side" of a hash join. Optimally, the smaller relation goes on the build side in that example. To estimate a relation's cardinality after a range predicate has been applied, the planner can reach for quantiles that, with luck, were precomputed over the relevant column.

I can't see a reason why real-world range predicates would be more focused on the tails of distributions, the shapes of which we'd probably like to assume nothing about anyway. My take is that any t-digest scale function is probably accurate enough for a query planner, but of them all I'd agree that K_0 is probably the best choice under the assumptions here.

We can now either hard-code Drill to K_0, or default it to K_0 and add a config option so that users can override it. I'm not convinced the config option would justify its inclusion, because it is one that users would need to vary from table to table depending on the specifics of the data there. So I'm inclined to just hard-code a good general default.
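
To make the planner use case concrete, here is a rough sketch of how a range predicate's selectivity might be estimated from precomputed quantiles and then used to pick a build side. This is not Drill's planner code or the t-digest API: the function names (`cdf_from_quantiles`, `estimate_rows`, `choose_build_side`), the evenly spaced quantile layout, and the linear interpolation inside each bucket are all assumptions made for illustration. A real t-digest would answer the CDF query directly rather than via a fixed quantile array.

```python
from bisect import bisect_right


def cdf_from_quantiles(quantiles, x):
    """Approximate P(X <= x) from evenly spaced quantile boundaries.

    Assumption: quantiles[i] is the column value at probability
    i / (len(quantiles) - 1), so quantiles[0] is the minimum and
    quantiles[-1] is the maximum.
    """
    n = len(quantiles) - 1
    if x < quantiles[0]:
        return 0.0
    if x >= quantiles[-1]:
        return 1.0
    # Index of the bucket [quantiles[i], quantiles[i + 1]) containing x.
    i = bisect_right(quantiles, x) - 1
    left, right = quantiles[i], quantiles[i + 1]
    # Assume values are spread uniformly inside the bucket.
    frac = (x - left) / (right - left)
    return (i + frac) / n


def estimate_rows(row_count, quantiles, lo, hi):
    """Estimate how many rows satisfy lo <= column <= hi."""
    selectivity = cdf_from_quantiles(quantiles, hi) - cdf_from_quantiles(quantiles, lo)
    return row_count * max(0.0, selectivity)


def choose_build_side(left_rows, right_rows):
    """Put the relation with the smaller estimate on the hash join's build side."""
    return "left" if left_rows <= right_rows else "right"
```

For example, with quartiles `[0, 10, 20, 30, 40]` over a 1000-row table, the predicate `5 <= col <= 25` covers an estimated half of the rows. Nothing in this estimate favours the tails over the middle of the distribution, which is the reasoning behind preferring a scale function with uniform accuracy.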
