felipecrv commented on code in PR #37380:
URL: https://github.com/apache/arrow/pull/37380#discussion_r1312038163
##########
python/pyarrow/_compute.pyx:
##########
@@ -2145,7 +2145,8 @@ class QuantileOptions(_QuantileOptions):
Parameters
----------
q : double or sequence of double, default 0.5
- Quantiles to compute. All values must be in [0, 1].
+ Probability levels of the quantiles to compute. All values must be in
Review Comment:
"Probability level" is new terminology and could mean anything -- that is
why it seems to work as a replacement for the ambiguous "quantile" here.
"quantile" can refer to the calculated value and the desired quantile as a
number between `0.0` and `1.0`. Isn't the comment saying they are between 0.0
and 1.0 enough to disambiguate?
The original t-Digest paper refers to these as quantiles [1]
> With the t-digest, accuracy for estimating the q quantile is nearly
constant relative to q(1 − q). This is in contrast to earlier algorithms which
had errors independent of q. The relative error bound of the t-digest is
convenient when computing quantiles for q near 0 or 1 as is commonly required.
As with the Q-digest algorithm, the accuracy/size trade-off for the t-digest
can be controlled by setting a single compression parameter δ with the amount
of memory required proportional only to Θ(δ).
I see you are referring to R's `quantile` documentation that uses "probs".
I would argue that they only did it because passing "quantile" to a "quantile"
function would be weird. In the case of t-Digest, calling this argument
"quantile" is more informative than "probability level". Probability of what?
If you really want to rename this to something that diverges from the the
original t-Digest paper, I suggest the term "fractions" that has been adoped by
Redis [2] and Druid [3].
[1] https://arxiv.org/pdf/1902.04023.pdf
[2] https://redis.io/docs/data-types/probabilistic/t-digest/
[3]
https://druid.apache.org/docs/latest/development/extensions-contrib/tdigestsketch-quantiles/
##########
python/pyarrow/_compute.pyx:
##########
@@ -2145,7 +2145,8 @@ class QuantileOptions(_QuantileOptions):
Parameters
----------
q : double or sequence of double, default 0.5
- Quantiles to compute. All values must be in [0, 1].
+ Probability levels of the quantiles to compute. All values must be in
Review Comment:
"Probability level" is new terminology and could mean anything -- that is
why it seems to work as a replacement for the ambiguous "quantile" here.
"quantile" can refer to the calculated value and the desired quantile as a
number between `0.0` and `1.0`. Isn't the comment saying they are between 0.0
and 1.0 enough to disambiguate?
The original t-Digest paper refers to these as quantiles [1]
> With the t-digest, accuracy for estimating the q quantile is nearly
constant relative to q(1 − q). This is in contrast to earlier algorithms which
had errors independent of q. The relative error bound of the t-digest is
convenient when computing quantiles for q near 0 or 1 as is commonly required.
As with the Q-digest algorithm, the accuracy/size trade-off for the t-digest
can be controlled by setting a single compression parameter δ with the amount
of memory required proportional only to Θ(δ).
I see you are referring to R's `quantile` documentation that uses "probs".
I would argue that they only did it because passing "quantile" to a "quantile"
function would be weird. In the case of t-Digest, calling this argument
"quantile" is more informative than "probability level". Probability of what?
If you really want to rename this to something that diverges from the the
original t-Digest paper, I suggest the term "fractions" that has been adoped by
Redis [2] and Druid [3].
[1] https://arxiv.org/pdf/1902.04023.pdf
[2] https://redis.io/docs/data-types/probabilistic/t-digest/
[3]
https://druid.apache.org/docs/latest/development/extensions-contrib/tdigestsketch-quantiles/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]