tdunning commented on issue #409: URL: https://github.com/apache/datasketches-cpp/issues/409#issuecomment-1853215209
The need for stability comes from the need to preserve the t-digest invariant which is defined by a scale function. The scale function is a function that maps q (the quantile) to k (a scaled version. Scale functions look like kind of like this: [image: image.png] The t-digest invariant says that the k-size of any centroid must be less than one except if the weight of the centroid is 1. The k=size of a centroid is k(q_right) - k(q_left) where q_left is the value of q on the left side of the centroid and q_right is the value on the right. Because the scale function is steep near q=0 and q=1, the k-size of a cluster with constant weight is bigger if it is near the ends rather than near the middle. Defining the bound in terms of the scale function is what constrains the centroid sizes so that interpolation near these extremes will give good results. The invariant also bounds the size of the t-digest for suitable scale functions (see https://arxiv.org/pdf/1903.09921.pdf for details). This is all very simple when all of the samples that we use to create the t-digest are distinct because that means all of the centroids will have different means and sorting will work predictably. But what if we have 1000 samples all at the same point? The sizes of the centroids in the resulting t-digest near the center will be bigger than the ones at the edge. That's good. But if we sort those centroids purely by their means (which are all the same) with an unstable sort, they could get all jumbled. That could put the biggest centroids near the edge where their k-size would be much larger than it was near the center. Does this make sense? On Tue, Dec 12, 2023 at 10:42 AM Alexander Saydakov < ***@***.***> wrote: > As I understand, we should not even provide a public add() method with > weight. > Of course I am going to call std::sort or std::stable_sort. I am afraid I > do not quite get the importance of stable sort and what exactly the > invariant is. > > — > Reply to this email directly, view it on GitHub > <https://github.com/apache/datasketches-cpp/issues/409#issuecomment-1852608805>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/AAB5E6RBVBVGDRBCSZ6WVFDYJCQR7AVCNFSM6AAAAABANDLXNOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNJSGYYDQOBQGU> > . > You are receiving this because you commented.Message ID: > ***@***.***> > -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
