The statsmodels qqplot and scipy.stats.probplot functions plot expected values 
versus actual data values to visualize fit to a distribution.  First a 
one-dimensional array of expected percentiles is generated for a sample of size 
N; then that is passed to dist.ppf, the percent point function for the chosen 
distribution, to return an array of expected values.  The visualized data 
points are pairs of expected and actual values, and a linear regression is done 
on these to produce the line that data points from this distribution should lie on.

Where x is the input data array and dist is the chosen distribution, we have:

> osr = np.sort(x)
> osm_uniform = _calc_uniform_order_statistic_medians(len(x))
> osm = dist.ppf(osm_uniform)
> slope, intercept, r, prob, sterrest = stats.linregress(osm, osr)
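
The steps above can be reproduced with public calls only, as a rough sketch. 
The helper filliben_medians is a hypothetical stand-in for the private 
_calc_uniform_order_statistic_medians, assuming it implements the Filliben 
estimate described in the probplot documentation; the sample x is made up for 
illustration.

```python
import numpy as np
from scipy import stats


def filliben_medians(n):
    """Approximate medians of the order statistics of n uniform(0, 1) draws.

    Hypothetical helper: assumes the Filliben estimate that the probplot
    documentation describes.
    """
    v = np.empty(n)
    v[-1] = 0.5 ** (1.0 / n)              # largest order statistic
    v[0] = 1.0 - v[-1]                    # smallest order statistic
    i = np.arange(2, n)                   # ranks 2 .. n-1
    v[1:-1] = (i - 0.3175) / (n + 0.365)
    return v


rng = np.random.default_rng(0)
x = rng.normal(size=50)                   # made-up sample
dist = stats.norm                         # chosen distribution

osr = np.sort(x)                          # ordered sample values
osm = dist.ppf(filliben_medians(len(x)))  # expected values under dist
slope, intercept, r, prob, sterrest = stats.linregress(osm, osr)

# Reference values from the public function, for comparison.
(osm_ref, osr_ref), (slope_ref, intercept_ref, r_ref) = stats.probplot(x, dist=dist)
```

If the assumption about the helper holds, osm and osr should agree with 
osm_ref and osr_ref up to floating-point error.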

My question concerns the plot display.  

> ax.plot(osm, osr, 'bo', osm, slope*osm + intercept, 'r-')


The x-axis of the resulting plot is labeled quantiles, but the xticks and 
xticklabels produced by qqplot and probplot do not seem correct for 
their intended interpretation.  First, the numbers on the x-axis do not 
represent quantiles; the intervals between them do not in general contain equal 
numbers of points.  For a normal distribution with sigma=1, they represent 
standard deviations.  Changing the label on the x-axis does not seem like a 
very good solution, because the interpretation of the values on the x-axis will 
be different for different distributions.  Rather, the right solution seems to 
be to actually show quantiles on the x-axis.  The numbers on the x-axis can stay 
as they are, representing quantile indexes, but they need to be spaced so as to 
show the actual division points that carve the population up into groups of 
the same size.  This can be done in something like the following way.

> import numpy as np
> import matplotlib.pyplot as plt
> xt = np.arange(-3,3,dtype=int)

> # Find the 5 quantiles to divide the data into sixths;
> # .001 and .999 stand in for 0 and 1, where dist.ppf may be infinite
> percentiles = [k*.167 + .502 for k in xt]
> percentiles = np.array(percentiles + [.999])
> vals = dist.ppf(percentiles)
> ax.set_xticks(vals)
> xt = np.array(list(xt)+[3])
> ax.set_xticklabels(xt)
> ax.set_xlabel('Quantile')
> plt.show()
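
To make the equal-sized-groups point concrete, here is a small numerical 
check, assuming a standard normal for dist.  It compares the share of the 
population lying between consecutive integer ticks with the share lying 
between ticks placed by dist.ppf at sixths.  The variable names are mine and 
purely illustrative.

```python
import numpy as np
from scipy import stats

dist = stats.norm  # assumption: standard normal, as in the sigma=1 case

# Integer tick positions, as drawn for a standard normal.
integer_ticks = np.arange(-3, 4)
mass_between_integer_ticks = np.diff(dist.cdf(integer_ticks))

# Ticks at equally spaced cumulative probabilities (sixths); the outermost
# ticks are pulled in to 0.001 and 0.999 so that ppf stays finite.
probs = np.array([0.001, 1 / 6, 2 / 6, 3 / 6, 4 / 6, 5 / 6, 0.999])
quantile_ticks = dist.ppf(probs)
mass_between_quantile_ticks = np.diff(dist.cdf(quantile_ticks))

print(np.round(mass_between_integer_ticks, 3))
print(np.round(mass_between_quantile_ticks, 3))
```

The first array is far from uniform (the central intervals hold many times 
the mass of the outer ones), while the interior entries of the second are all 
one sixth.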



I’ve attached two images to show the difference between the current 
visualization and the suggested one.
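
For anyone reading without the attachments, a sketch along the following 
lines should produce a comparable side-by-side figure.  The sample, the panel 
titles, and the output file name are assumptions of mine, and the Agg backend 
is selected only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)  # made-up sample
dist = stats.norm

fig, (ax_cur, ax_new) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# Left panel: probplot with its default tick placement.
stats.probplot(x, dist=dist, plot=ax_cur)
ax_cur.set_title("Current ticks")

# Right panel: same plot, with ticks moved to equal-probability positions.
stats.probplot(x, dist=dist, plot=ax_new)
probs = np.array([0.001, 1 / 6, 2 / 6, 3 / 6, 4 / 6, 5 / 6, 0.999])
tick_positions = dist.ppf(probs)
ax_new.set_xticks(tick_positions)
ax_new.set_xticklabels([str(k) for k in range(-3, 4)])
ax_new.set_xlabel("Quantile")
ax_new.set_title("Suggested ticks")

fig.savefig("probplot_ticks.png")
```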

Mark Gawron



