Luc <ouaga...@gmail.com> added the comment:

If we are trying to fix this, the median functions should behave the way mean 
and harmonic_mean in the statistics library do when there are missing values 
in the data: return nan.  At least that way, the library is consistent in how 
it handles NaNs in the data.  Either way, the behavior should be mentioned 
somewhere in the docs.

import statistics as stats
import numpy as np
import pandas as pd
data = [75, 90, 85, 92, 95, 80, np.nan]
stats.mean(data)
nan
stats.harmonic_mean(data)
nan
stats.stdev(data)
nan
As you can see, when there is a missing value, the mean, harmonic mean and 
sample standard deviation computed with the statistics library all return nan.
However, median, median_high and median_low compute incorrect results when 
missing values are present in the data.
It is better to return nan and let the user drop (or resolve) any missing 
values before computing.
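
A minimal sketch of the behavior proposed above, using only the standard library. The helper name median_or_nan is hypothetical and not part of the statistics module; it only checks for float NaNs (which includes np.nan) and is meant as an illustration, not as the eventual fix.

```python
import math
import statistics


def median_or_nan(values):
    """Hypothetical helper: return NaN if any float NaN is present, else the median."""
    values = list(values)
    # np.nan and float("nan") are both floats, so this check covers either.
    if any(isinstance(v, float) and math.isnan(v) for v in values):
        return math.nan
    return statistics.median(values)


data = [75, 90, 85, 92, 95, 80, float("nan")]
result = median_or_nan(data)                       # NaN is propagated
clean = median_or_nan([75, 90, 85, 92, 95, 80])    # 87.5
print(result, clean)
```

With this approach the caller sees nan and has to decide explicitly how to handle the missing value, which matches what mean, harmonic_mean and stdev already do.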
## Another example using a pandas Series
df = pd.DataFrame(data, columns=['data'])
df.head()
        data
0       75.0
1       90.0
2       85.0
3       92.0
4       95.0
5       80.0
6       NaN

### Use the statistics library to compute the median of the Series
stats.median(df['data'])
90
 
## Now use pandas to compute the median of the Series with the missing value
## Pandas returns the correct median by dropping the missing values
df['data'].median()
87.5
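
For comparison, the same result can be obtained with the statistics library alone by dropping the NaN first. This is only a sketch of the "drop, then compute" workaround; the variable names are illustrative.

```python
import math
import statistics

data = [75, 90, 85, 92, 95, 80, float("nan")]

# Drop missing values explicitly before computing the median.
cleaned = [x for x in data if not math.isnan(x)]
median_clean = statistics.median(cleaned)
print(median_clean)  # 87.5
```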

I did not test median_grouped in the statistics library, but I will let you 
know afterwards if it's affected as well.
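
One way to check all of the median functions at once, including median_grouped, is a small probe like the one below. The function probe_nan_handling is hypothetical, and no expected output is shown because the result may depend on the Python version and on where the NaN ends up after sorting.

```python
import math
import statistics


def probe_nan_handling(sample):
    """Report, per function, whether it returned NaN, a number, or raised."""
    funcs = [
        statistics.median,
        statistics.median_low,
        statistics.median_high,
        statistics.median_grouped,
    ]
    report = {}
    for func in funcs:
        try:
            value = func(sample)
        except Exception as exc:  # record the failure instead of stopping
            report[func.__name__] = f"raised {type(exc).__name__}"
        else:
            is_nan = isinstance(value, float) and math.isnan(value)
            report[func.__name__] = "nan" if is_nan else f"returned {value!r}"
    return report


report = probe_nan_handling([75, 90, 85, 92, 95, 80, float("nan")])
for name, outcome in report.items():
    print(f"{name}: {outcome}")
```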

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue33084>
_______________________________________