[Numpy-discussion] Re: Proposal: Automatic estimation of number of histogram bins for weighted data

Kevin Sheppard Mon, 13 Dec 2021 01:05:42 -0800

To me, this feels like it might be a better fit for SciPy or possibly
statsmodels (but maybe not since neither have histogram functions
anymore).The challenge with weighted estimators is how the weights should
be interpreted. Stata covers the most important cases of weights
https://www.reed.edu/psychology/stata/gs/tutorials/weights.html.  Would
these be frequency weights?  Stata supports only frequency weights
https://www.stata.com/manuals/u11.pdf#u11.1.6weight.


Kevin


On Sun, Dec 12, 2021 at 9:45 AM Jonathan Crall <erote...@gmail.com> wrote:

> Hi all, this is my first post on this mailing list.
>
> I'm writing to propose a method for extending the histogram bandwidth
> estimators to work with weighted data. I originally submitted this proposal
> to seaborn: https://github.com/mwaskom/seaborn/issues/2710 and mwaskom
> suggested I take it here.
>
> Currently the unweighted auto heuristic is a combination of
> the Freedman-Diaconis and Sturges estimator. For reference, these rules are
> as follows:
>
> Sturges: return the peak-to-peak ptp=(i.e. x.max() - x.min()) and number
> of data points total=x.size. Then divide ptp by the log of one plus the
> number of data points.
>
> ptp / log2(total + 2)
>
> Freedman-Diaconis: Find the interquartile-range of the data
> iqr=(np.subtract(*np.percentile(x, [75, 25]))) and the number of data
> points total=x.size, then apply the formula:
>
> 2.0 * iqr * total ** (-1.0 / 3.0).
>
> Taking a look at these it seems (please correct me if I'm missing
> something that makes this not work) that there is a simple extension to
> weighted data. If we can find a weighted replacement for p2p, total, and
> iqr, the formulas should work exactly the same in the weighted case.
>
> The p2p case seems easy. Even if the data points are weighed, that doesn't
> change the min and max. Nothing changes here.
>
> For total, instead of taking the size of the array (which implicitly
> assumes each data point has a weight of 1), just sum the weight to get
> total=weights.sum().
>
> I believe the IQR is also computable in the weighted case.
>
> import numpy as np
> n = 10
> rng = np.random.RandomState(12554)
>
>
> x = rng.rand(n)
> w = rng.rand(n)
>
>
> sorted_idxs = x.argsort()
> x_sort = x[sorted_idxs]
> w_sort = w[sorted_idxs]
>
>
> cumtotal = w_sort.cumsum()
> quantiles = cumtotal / cumtotal[-1]
> idx2, idx1 = np.searchsorted(quantiles, [0.75, 0.25])
> iqr_weighted = x_sort[idx2] - x_sort[idx1]
> print('iqr_weighted = {!r}'.format(iqr_weighted))
>
>
> # test this is the roughtly the same for the "unweighted case"
> # (wont be exactly the same because this method does not have
> interpolation)
> w = np.ones_like(x)
>
>
> w_sort = w[sorted_idxs]
> cumtotal = w_sort.cumsum()
> quantiles = cumtotal / cumtotal[-1]
> idx2, idx1 = np.searchsorted(quantiles, [0.75, 0.25])
> iqr_weighted = x_sort[idx2] - x_sort[idx1]
> iqr_unweighted_repo = x_sort[idx2] - x_sort[idx1]
> print('iqr_unweighted_repo = {!r}'.format(iqr_unweighted_repo))
>
>
> iqr_unweighted_orig = np.subtract(*np.percentile(x, [75, 25]))
> print('iqr_unweighted_orig = {!r}'.format(iqr_unweighted_orig))
>
>
> This quick and dirty method if weighted quantiles give a close result
> (which is probably fine for a bandwidth estimator):
>
> iqr_weighted = 0.21964093625695036
> iqr_unweighted_repo = 0.36649977003903755
> iqr_unweighted_orig = 0.30888312408540963
>
> And I do see there is an open issue / PR for
> weighted quantiles/percentiles: https://github.com/numpy/numpy/issues/8935
>  https://github.com/numpy/numpy/pull/9211 so this code could make use of
> that after it lands.
>
> Lastly, I think the most common case (or at least my case) for using a
> weighted histogram is to combine multiple histograms. In this case the
> number of estimated bins might be greater than the number of weighted data
> points, and a simple min condition on that number and the estimated number
> of bins should take care of that.
>
> Please let me know: thoughts / opinions / ideas on this topic. I did do
> some searching for related discussion, but I may have missed it, so point
> me to that if I missed it. Also if the reason this feature does not exist
> is because there is some theoretical problem with estimating bandwidth for
> weighted data that I'm unaware of, I'd be interested to learn about that
> (although I can't see that being the case because these are just heuristics
> after all, and I have validated that this works well in my own use-cases).
>
> --
> -Dr. Jon Crall (him)
> _______________________________________________
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: kevin.k.shepp...@gmail.com
>

_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com

[Numpy-discussion] Re: Proposal: Automatic estimation of number of histogram bins for weighted data

Reply via email to