jon-wei opened a new issue #6869: [Proposal] Deprecating "approximate histogram" in favor of new sketches
URL: https://github.com/apache/incubator-druid/issues/6869

Deprecating "approximate histogram" in favor of new sketches
==========================

Motivation
----------------

Druid's "approximate histogram" aggregator has several significant drawbacks:

- No formal error bounds
- Accuracy is heavily data dependent
- Doesn't handle sorted data well
- Doesn't handle long tails well

It's not uncommon for users to get bad results using that aggregator, without any clear idea as to why, e.g. https://github.com/apache/incubator-druid/issues/6853.

Druid now has better options for quantile/histogram approximations:

- The quantiles sketch from the DataSketches extension (http://druid.io/docs/latest/development/extensions-core/datasketches-quantiles.html) is a better choice, with formal error bounds and a distribution independent algorithm.
- The upcoming moments sketch aggregator (https://github.com/apache/incubator-druid/pull/6581) is another option, using a distribution dependent algorithm with better performance/accuracy characteristics than Druid's "approximate histogram".
Proposed Changes
----------------

- Add a doc page containing the following guidance:
  - Document the advantages of the sketch algorithms over "approximate histogram" to encourage users to transition
  - Provide advice on how to choose between the quantiles sketch and the moments sketch
- Update docs to replace any examples/recommendations of "approximate histogram" as needed, and mark "approximate histogram" as deprecated
- In line with what's being discussed in https://github.com/apache/incubator-druid/issues/6814 re: APPROX_COUNT_DISTINCT:
  - Change the APPROX_QUANTILE Druid SQL function to use whatever aggregator type is stored in a segment, and use a default option when used on a numeric column
  - Add individual APPROX_QUANTILE_* functions for each quantile estimation option

Changed Interfaces
----------------

The behavior of APPROX_QUANTILE in Druid SQL would change as described above.

Migration
----------------

Is it possible/valid for the newer sketch aggs to operate on an old "approximate histogram"? If so, this would make migration easier for users.

If such migration is not possible, then users will need to reingest existing data, accept the discontinuity, or continue using the old aggregator.

Alternatives
----------------

We could try to make improvements to the "approximate histogram" aggregator, but I think there's little value in doing so since better alternatives already exist.
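To make the proposed APPROX_QUANTILE behavior concrete, here is a minimal sketch of the dispatch rule in Python. It is not Druid code. The estimator labels, the suffixes on the `APPROX_QUANTILE_*` names, and the choice of default for numeric columns are all placeholders, since the proposal leaves them open. The rule that a specific function rejects a column storing a different sketch type is also an assumption, tied to the open question under Migration.

```python
# Illustrative labels only; these are not Druid's real type identifiers.
NUMERIC_TYPES = {"LONG", "FLOAT", "DOUBLE"}
STORED_SKETCH_TYPES = {"approximate_histogram", "quantiles_sketch", "moments_sketch"}

# Hypothetical names for the per-estimator functions (the proposal only says
# "APPROX_QUANTILE_*"), each mapped to the one estimator it forces.
SPECIFIC_FUNCTIONS = {
    "APPROX_QUANTILE_HISTOGRAM": "approximate_histogram",
    "APPROX_QUANTILE_QUANTILES_SKETCH": "quantiles_sketch",
    "APPROX_QUANTILE_MOMENTS_SKETCH": "moments_sketch",
}

# Assumption: the proposal does not say which estimator is the default.
DEFAULT_FOR_NUMERIC = "quantiles_sketch"


def resolve_estimator(function_name, column_type):
    """Pick the quantile estimator for a SQL call on a column of the given type."""
    if function_name == "APPROX_QUANTILE":
        # Generic function: follow whatever sketch type the segment stores.
        if column_type in STORED_SKETCH_TYPES:
            return column_type
        # Raw numeric column: fall back to the default estimator.
        if column_type in NUMERIC_TYPES:
            return DEFAULT_FOR_NUMERIC
        raise ValueError(f"cannot compute quantiles on column type {column_type!r}")

    forced = SPECIFIC_FUNCTIONS.get(function_name)
    if forced is None:
        raise ValueError(f"unknown function {function_name!r}")

    # Specific function: accept numeric input or a matching stored sketch.
    if column_type in NUMERIC_TYPES or column_type == forced:
        return forced

    # Assumption: no conversion between sketch types (see Migration).
    raise ValueError(f"{function_name} cannot read a stored {column_type!r} column")


print(resolve_estimator("APPROX_QUANTILE", "moments_sketch"))
print(resolve_estimator("APPROX_QUANTILE", "DOUBLE"))
```

Under these assumptions, a query using plain APPROX_QUANTILE keeps working when a column is reingested with a different sketch type, while the specific functions let a user pin one estimator on raw numeric data.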
