[GitHub] [druid] AlexanderSaydakov commented on pull request #14334: use the latest datasketches-java-4.0.0

via GitHub Tue, 23 May 2023 18:38:20 -0700


AlexanderSaydakov commented on PR #14334:
URL: https://github.com/apache/druid/pull/14334#issuecomment-1560339901


   We hesitated for some time, but finally decided that inclusive mode is 
slightly better. This is a major version change with some API incompatibility, 
so if there is ever a right time for this change, it is now.
   The difference is in the definition of rank. Suppose we are analyzing a 
distribution of some items exactly. The only thing required is a comparator of 
items (a "less than" operator). We sort the items and define the rank of an 
item as the fraction of the whole distribution that is strictly less than that 
item in exclusive mode, or less than or equal to that item in inclusive mode. 
Inclusive mode seems to be more common in the literature and is slightly 
better behaved in some edge cases.
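
   As a minimal sketch of the two definitions for the exact (non-sketch) case, 
something like the following could be used. The class and method names 
(`ExactRank`, `rank`) are made up for illustration and are not part of the 
datasketches-java API. It counts items directly instead of sorting, which gives 
the same fraction.

```java
// Hypothetical helper for illustration only; not part of datasketches-java.
final class ExactRank {

    /**
     * Exact normalized rank of {@code query} within {@code items}.
     * Exclusive mode: fraction of items strictly less than the query.
     * Inclusive mode: fraction of items less than or equal to the query.
     */
    static double rank(double[] items, double query, boolean inclusive) {
        if (items.length == 0) {
            throw new IllegalArgumentException("items must not be empty");
        }
        long count = 0;
        for (double x : items) {
            // The only difference between the modes is '<=' versus '<'.
            if (inclusive ? x <= query : x < query) {
                count++;
            }
        }
        return (double) count / items.length;
    }
}
```

For the items 1, 2, 3, 4 and the query 2, this gives 0.5 in inclusive mode and 
0.25 in exclusive mode.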
   To illustrate the difference, suppose we have just one item. Its rank is 1 
in inclusive mode but 0 in exclusive mode. With millions of items, the 
difference in rank is tiny and most likely negligible. If we build a histogram 
or partition the data, items that fall exactly on a bin or partition boundary 
can land on the left or on the right depending on the mode.
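
   The boundary effect can be sketched for exact data as follows. This is an 
illustration under assumptions, not the library's implementation: `EdgeBins` 
and `histogram` are hypothetical names, and the split points are assumed to be 
sorted in ascending order.

```java
// Hypothetical helper for illustration only; not part of datasketches-java.
final class EdgeBins {

    /**
     * Counts items per bin for ascending split points.
     * Inclusive mode: an item equal to a split point stays in the bin on the left.
     * Exclusive mode: an item equal to a split point moves to the bin on the right.
     */
    static long[] histogram(double[] items, double[] splits, boolean inclusive) {
        long[] counts = new long[splits.length + 1];
        for (double x : items) {
            int bin = 0;
            // Advance past every split point that lies to the left of x under the chosen mode.
            while (bin < splits.length
                    && (inclusive ? x > splits[bin] : x >= splits[bin])) {
                bin++;
            }
            counts[bin]++;
        }
        return counts;
    }
}
```

With the items 1, 2, 2, 3 and a single split point at 2, the counts are [3, 1] 
in inclusive mode and [1, 3] in exclusive mode: only the two items sitting 
exactly on the boundary change bins.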


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

