gianm commented on issue #6814: [Discuss] Replacing hyperUnique as 'default' 
distinct count sketch
URL: 
https://github.com/apache/incubator-druid/issues/6814#issuecomment-454881037
 
 
   > As such I do not think removing Druid-HLL is a valid path forward.
   
   100% agree. We do not ever want to break compatibility with existing on-disk 
segments, so we need to keep it.
   
   > I think the module system is pretty solid in Druid (despite a few hadoop 
related issues), so I would prefer if HLL were moved to a core module instead 
of the other way around.
   
   Druid core does need some kind of HLL implementation, so the HLL 
implementations can't all be moved to modules. It's used by 
DetermineHashedPartitionsJob and IndexTask to determine partitions. We might 
end up wanting to use it for other things too, like tracking dimension 
cardinality statistics on the broker to aid in query planning.
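
   To illustrate why partition planning wants a cardinality sketch in core, 
here is a minimal HyperLogLog in Python. It is a sketch under stated 
assumptions, not Druid's implementation: `TinyHLL`, `suggest_partitions`, and 
the `target_rows` parameter are hypothetical names, and the SHA-1 hash and 
register layout were picked for brevity.

```python
import hashlib
import math


class TinyHLL:
    """Illustrative HyperLogLog (hypothetical; not Druid's on-disk format)."""

    def __init__(self, p=12):
        self.p = p                      # number of bits used as register index
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        # Derive a 64-bit hash from the first 8 bytes of SHA-1.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                     # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Sketches merge by taking the register-wise maximum.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)        # bias constant for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


def suggest_partitions(sketch, target_rows):
    """Hypothetical helper: partitions needed for ~target_rows keys each."""
    return max(1, math.ceil(sketch.estimate() / target_rows))
```

   Because sketches merge by register-wise maximum, each worker can build its 
own sketch and a coordinator can combine them before choosing a partition 
count.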
   
   Another consideration is that HLL-based approximate distinct counting is a 
pretty standard feature in databases these days. Offering it in core is nearly 
universal.
   
   A final consideration is the interaction with Druid SQL. In Druid SQL, by 
default, `COUNT(DISTINCT x)` uses HLL. You can set 
`useApproximateCountDistinct = false`, but by default it's true. There have 
been objections raised to this 
(https://github.com/apache/incubator-druid/issues/6814#issuecomment-452056269), 
but I think it's a good choice for Druid, since it adheres to the Druid 
philosophy that fast is best, and approximations are how you get there. I feel 
that doing it in exact mode by default would be un-Druidy. This affects the 
decision of where to put HLL, since the strategy of count-distinct being 
approximate by default is only feasible if there's an HLL implementation in 
core Druid.
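
   As a rough sketch of what opting out looks like, the snippet below builds 
the JSON body for a Druid SQL request with `useApproximateCountDistinct` set 
in the query context. The helper name `build_sql_request` and the table 
`example_table` are hypothetical, and the snippet only constructs the payload 
(which would typically be POSTed to the SQL endpoint, `/druid/v2/sql`) 
without sending it.

```python
import json


def build_sql_request(sql, approximate=True):
    """Hypothetical helper: wrap a SQL string with a query context.

    The default mirrors the behavior described above, where
    COUNT(DISTINCT x) is approximate unless the flag is set to false.
    """
    return {
        "query": sql,
        "context": {"useApproximateCountDistinct": approximate},
    }


# `example_table` and `x` are placeholder names for illustration only.
sql = "SELECT COUNT(DISTINCT x) FROM example_table"

approx_request = build_sql_request(sql)                    # HLL-based (default)
exact_request = build_sql_request(sql, approximate=False)  # exact count

# Serialized request body for the exact variant.
exact_body = json.dumps(exact_request)
```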
