icexelloss commented on issue #45190:
URL: https://github.com/apache/arrow/issues/45190#issuecomment-2580396466

   tl;dr: I agree with the Wikipedia definition.
   
   Full thought:
   
   In Python, I actually didn't find any library that implements the Wikipedia 
definition, which is quite surprising. The two reference implementations I found 
are pandas and SciPy; they agree with each other but not with the Wikipedia one:
   
   For the example input from Wikipedia:
   ```
   import pandas as pd
   s = pd.Series([1, 2, 3, 3, 3, 4, 4, 5, 5, 7])
   ```
   
   The Wikipedia result is
   ```
   0    0.05
   1    0.15
   2    0.35
   3    0.35
   4    0.35
   5    0.60
   6    0.60
   7    0.80
   8    0.80
   9    0.95
   dtype: float64
   ```
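
As a rough sketch of how those values can be computed: for each element, count the values strictly below it, add half the count of values equal to it, and divide by N. The helper name `wikipedia_percentile_rank` is hypothetical and not part of pandas, SciPy, or Arrow; the pairwise comparison is O(N^2) and only meant for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 3, 3, 4, 4, 5, 5, 7])


def wikipedia_percentile_rank(values: pd.Series) -> pd.Series:
    """Hypothetical helper: (count_below + 0.5 * count_equal) / N per element."""
    arr = values.to_numpy()
    # Row i compares element i against every element of the input.
    below = (arr[None, :] < arr[:, None]).sum(axis=1)
    equal = (arr[None, :] == arr[:, None]).sum(axis=1)
    return pd.Series((below + 0.5 * equal) / len(arr), index=values.index)


wiki_rank = wikipedia_percentile_rank(s)
print(wiki_rank)
```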
   Both pandas and SciPy give me
   ```
   0    0.10
   1    0.20
   2    0.40
   3    0.40
   4    0.40
   5    0.65
   6    0.65
   7    0.85
   8    0.85
   9    1.00
   dtype: float64
   ```
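
For reference, a minimal sketch of one way to reproduce these numbers in pandas, assuming the default average tie-breaking with `pct=True` (the comment does not say which exact calls were used). It also shows that subtracting half a position from the average rank before dividing by N gives the Wikipedia values.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 3, 3, 4, 4, 5, 5, 7])

# pct=True divides the average-tie rank by the number of valid observations,
# so the largest value maps to exactly 1.0.
pandas_rank = s.rank(method="average", pct=True)

# Shifting each average rank down by 0.5 before dividing by N
# yields the Wikipedia-style percentile rank instead.
shifted_rank = (s.rank(method="average") - 0.5) / len(s)

print(pd.DataFrame({"pandas": pandas_rank, "shifted": shifted_rank}))
```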
   However, the pandas rank is not ideal because its range is (0, 1], not the 
open interval (0, 1), which causes problems: for example, calling 
`norm.ppf(rank)` returns inf for the largest value. So internally we "adjust" 
the pandas rank scaling from (1/N, ..., N/N) to (1/(N+1), ..., N/(N+1)) to keep 
it strictly inside (0, 1). Looking at the Wikipedia definition, though, it is 
probably better to use that instead of tweaking the scaling.
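
To make the `norm.ppf` issue concrete, here is a small sketch comparing the three scalings on the same input. The variable names are illustrative, and the N+1 rescaling is only my reading of the adjustment described above, not the actual internal code.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

s = pd.Series([1, 2, 3, 3, 3, 4, 4, 5, 5, 7])
n = len(s)

avg_rank = s.rank(method="average").to_numpy()  # 1..N, ties averaged

pandas_pct = avg_rank / n            # range (0, 1]: the top value is exactly 1.0
rescaled_pct = avg_rank / (n + 1)    # assumed form of the "adjusted" scaling
midpoint_pct = (avg_rank - 0.5) / n  # Wikipedia-style values, strictly inside (0, 1)

# The inverse normal CDF is infinite at 1.0, so only the first result contains inf.
z_pandas = norm.ppf(pandas_pct)
z_rescaled = norm.ppf(rescaled_pct)
z_midpoint = norm.ppf(midpoint_pct)

print(np.isinf(z_pandas).any(), np.isfinite(z_rescaled).all(), np.isfinite(z_midpoint).all())
```

Both the rescaled and the midpoint variants stay finite; the midpoint form has the advantage of being symmetric around 0.5 for this kind of transform.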


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

