[GitHub] jon-wei opened a new issue #6869: [Proposal] Deprecating "approximate histogram" in favor of new sketches

GitBox Tue, 15 Jan 2019 19:59:25 -0800

URL: https://github.com/apache/incubator-druid/issues/6869
 
 
   Deprecating "approximate histogram" in favor of new sketches
   ==========================
   
   Motivation
   ----------------
   Druid's "approximate histogram" aggregator has several significant drawbacks:
   - No formal error bounds
   - Accuracy is heavily data-dependent
     - Doesn't handle sorted data well
     - Doesn't handle long tails well
   
   It's not uncommon for users to get bad results from that aggregator without any clear idea why, e.g. https://github.com/apache/incubator-druid/issues/6853.
   
   Druid now has better options for quantile/histogram approximations:
   - The quantiles sketch from the DataSketches extension (http://druid.io/docs/latest/development/extensions-core/datasketches-quantiles.html) is a better choice, with formal error bounds and a distribution-independent algorithm.
   - The upcoming moments sketch aggregator (https://github.com/apache/incubator-druid/pull/6581) is another option, using a distribution-dependent algorithm with better performance/accuracy characteristics than Druid's "approximate histogram".
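   
   To illustrate what a formal error bound provides, here is a minimal Python sketch. It assumes the bound is expressed as a maximum normalized rank error `eps`, meaning the value returned for quantile `q` has a true rank within `eps` of `q`. The helper names are hypothetical and are not part of Druid or DataSketches; the check runs against exact data, which a real sketch would not keep.

```python
import bisect

def normalized_rank(sorted_values, x):
    """Fraction of values in sorted_values that are <= x."""
    return bisect.bisect_right(sorted_values, x) / len(sorted_values)

def rank_error(sorted_values, estimate, q):
    """Absolute gap between the estimate's true normalized rank and the target q."""
    return abs(normalized_rank(sorted_values, estimate) - q)

def within_bound(sorted_values, estimate, q, eps):
    """True if the estimate for quantile q is within rank error eps (hypothetical check)."""
    return rank_error(sorted_values, estimate, q) <= eps

# Exact data: the integers 1..100, so the value v has normalized rank v / 100.
data = list(range(1, 101))

# An estimate of 94 for the 0.95 quantile is off by about 0.01 in rank.
print(within_bound(data, 94, 0.95, eps=0.02))
# An estimate of 80 is off by about 0.15 in rank and fails the same check.
print(within_bound(data, 80, 0.95, eps=0.02))
```

   An aggregator with no formal error bounds offers no `eps` to check against, so a result like the second one cannot be ruled out in advance.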
   
   Proposed Changes
   ----------------
   - Add a doc page containing the following guidance:
     - Document the advantages of the sketch algorithms over "approximate histogram" to encourage users to transition
     - Provide advice on how to choose between the quantiles sketch and the moments sketch
   - Update docs to replace any examples/recommendations of "approximate histogram" as needed, and mark "approximate histogram" as deprecated
   - In line with what's being discussed in https://github.com/apache/incubator-druid/issues/6814 re: APPROX_COUNT_DISTINCT:
     - Change the APPROX_QUANTILE Druid SQL function to use whichever aggregator type is stored in a segment, and to use a default option when applied to a numeric column
     - Add individual APPROX_QUANTILE_* functions for each quantile estimation option
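   
   The proposed APPROX_QUANTILE behavior can be sketched as a simple dispatch on the column's stored type. This is a minimal illustration in Python, not Druid's implementation: the enum labels, the function name, and the choice of the quantiles sketch as the default are all assumptions made for the example, since the proposal does not say which option would be the default.

```python
from enum import Enum

class ColumnKind(Enum):
    # Illustrative labels only; these are not Druid type identifiers.
    NUMERIC = "numeric"
    APPROX_HISTOGRAM = "approx_histogram"
    QUANTILES_SKETCH = "quantiles_sketch"
    MOMENTS_SKETCH = "moments_sketch"

# Assumption for this example: the proposal leaves the default unspecified.
DEFAULT_FOR_NUMERIC = ColumnKind.QUANTILES_SKETCH

def choose_quantile_aggregator(column_kind):
    """Pick the estimator APPROX_QUANTILE would use for a column (hypothetical)."""
    if column_kind is ColumnKind.NUMERIC:
        # Raw numeric column: nothing is pre-aggregated, so use the default option.
        return DEFAULT_FOR_NUMERIC
    # Pre-aggregated column: use whichever aggregator type the segment stores.
    return column_kind

print(choose_quantile_aggregator(ColumnKind.NUMERIC).value)
print(choose_quantile_aggregator(ColumnKind.MOMENTS_SKETCH).value)
```

   Under this reading, the individual APPROX_QUANTILE_* functions would bypass the dispatch and name one estimator explicitly.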
   
   Changed Interfaces
   ----------------
   The behavior of APPROX_QUANTILE in Druid SQL would change as described above.
   
   Migration
   ----------------
   Is it possible and valid for the newer sketch aggregators to operate on an old "approximate histogram"? If so, this would make migration easier for users.
   
   If such migration is not possible, then users will need to re-ingest existing data, accept the discontinuity, or continue using the old aggregator.
   
   Alternatives
   ----------------
   We could try to improve the "approximate histogram" aggregator, but I think there's little value in doing so, since better alternatives already exist.


