icexelloss commented on issue #45190: URL: https://github.com/apache/arrow/issues/45190#issuecomment-2580396466
tldr: I agree with the Wikipedia definition.

Full thought:

In Python, I didn't find any library that implements the Wikipedia definition, which is quite surprising. The two reference implementations I found are Pandas and SciPy; they agree with each other but not with Wikipedia.

For the example input from Wikipedia:

```
s = pd.Series([1, 2, 3, 3, 3, 4, 4, 5, 5, 7])
```

The Wikipedia result is

```
0    0.05
1    0.15
2    0.35
3    0.35
4    0.35
5    0.60
6    0.60
7    0.80
8    0.80
9    0.95
dtype: float64
```

Both Pandas and SciPy give me

```
0    0.10
1    0.20
2    0.40
3    0.40
4    0.40
5    0.65
6    0.65
7    0.85
8    0.85
9    1.00
dtype: float64
```

However, the Pandas rank is not ideal because its range is not exclusive of 0 and 1: the largest value gets a rank of exactly 1, which causes issues when, say, calling `norm.ppf(rank)` (it introduces inf values). So internally we "adjust" the Pandas rank scaling from (1/N, ..., N/N) to (1/(N+1), ..., N/(N+1)) to make it (0, 1) exclusive, but looking at the Wikipedia definition, it is probably better to use that instead of tweaking the scaling.
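The three scalings discussed in the comment can be compared with a small dependency-free sketch. It is not the Pandas, SciPy, or Arrow implementation. The function names are hypothetical, and it assumes that the Wikipedia definition is (CF - 0.5 * F) / N, which works out to (average rank - 0.5) / N when ties share their mean rank. Under the same assumption, the Pandas equivalent would be roughly `(s.rank() - 0.5) / len(s)`.

```python
from collections import Counter
from statistics import NormalDist


def average_ranks(values):
    """1-based ranks where tied values share the mean of their positions."""
    counts = Counter(values)
    ranks = {}
    below = 0  # number of values strictly smaller than the current one
    for v in sorted(counts):
        f = counts[v]
        # Tied values occupy positions below+1 .. below+f; use their mean.
        ranks[v] = below + (f + 1) / 2
        below += f
    return [ranks[v] for v in values]


def pct_rank_scaled(values):
    """rank / N: the scaling that produced the second result quoted above."""
    n = len(values)
    return [r / n for r in average_ranks(values)]


def pct_rank_adjusted(values):
    """rank / (N + 1): the internal adjustment described in the comment."""
    n = len(values)
    return [r / (n + 1) for r in average_ranks(values)]


def pct_rank_wikipedia(values):
    """(CF - 0.5 * F) / N, written here as (average rank - 0.5) / N."""
    n = len(values)
    return [(r - 0.5) / n for r in average_ranks(values)]


data = [1, 2, 3, 3, 3, 4, 4, 5, 5, 7]
scaled = pct_rank_scaled(data)
adjusted = pct_rank_adjusted(data)
wiki = pct_rank_wikipedia(data)

# rank / N reaches exactly 1.0 for the largest value, so an inverse normal
# CDF cannot map it to a finite number. The other two stay inside (0, 1).
assert max(scaled) == 1.0
assert all(0.0 < p < 1.0 for p in adjusted)
assert all(0.0 < p < 1.0 for p in wiki)

# The standard-library inverse CDF accepts every Wikipedia-style rank.
# (It raises StatisticsError for p == 1.0 instead of returning inf.)
z_scores = [NormalDist().inv_cdf(p) for p in wiki]
```

Both the rank / (N + 1) adjustment and the Wikipedia-style definition avoid the endpoints, but only the latter is symmetric around 0.5 for tied values at both ends, which is one reason to prefer it over rescaling.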