Greetings,

I have a PR that warrants discussion according to @seberg. See
https://github.com/numpy/numpy/pull/14278.

It is an enhancement that fixes a bug. The original bug is that when using
the 'fd' (Freedman-Diaconis) estimator on a dataset with a small
inter-quartile range and large outliers, the current codebase requests more
bins than memory allows. There are several related bug reports (see #11879,
#10297, #8203).
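
To make the failure mode concrete, here is a rough sketch of the arithmetic.
It is not the code from the PR. The helper name fd_bin_count is hypothetical,
and the width formula (2 * IQR * n**(-1/3)) is the textbook Freedman-Diaconis
rule, which I am assuming matches what the estimator computes. Nothing is
allocated here; the snippet only reports how many bins would be requested.

```python
import numpy as np

def fd_bin_count(x):
    """Bin count implied by the Freedman-Diaconis rule (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    q75, q25 = np.percentile(x, [75, 25])
    # Textbook FD bin width: 2 * IQR * n^(-1/3)
    width = 2.0 * (q75 - q25) * x.size ** (-1.0 / 3.0)
    if width == 0:
        # Degenerate IQR: fall back to a single bin for this sketch
        return 1
    return int(np.ceil((x.max() - x.min()) / width))

# Tightly clustered data (tiny IQR) plus a single large outlier
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.0, 1e-6, size=1000), [1e6]])

# Far more bins than could ever be allocated
n_bins = fd_bin_count(data)
print(n_bins)
```

The tiny IQR drives the bin width toward zero while the outlier stretches the
data range, so the ratio of the two explodes.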

In terms of scope, I restricted my changes to the conditions where
np.histogram(bins='auto') ends up using the 'fd' estimator. For the fix
itself, I enhanced the API: I used a suggestion from @eric-wieser to merge
empty histogram bins. In practice this solves the issue of the outsized
number of bins.
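
As an illustration of the idea only, merging runs of empty bins could look
something like the sketch below. The function merge_empty_bins is a
hypothetical name and is not taken from the PR. It works on a small,
already-computed histogram, so it does not show how to avoid the large
allocation in the first place.

```python
import numpy as np

def merge_empty_bins(counts, edges):
    """Collapse each run of consecutive empty bins into one wider empty bin.

    Hypothetical helper for illustration; not the implementation in the PR.
    """
    counts = np.asarray(counts)
    edges = np.asarray(edges)
    empty = counts == 0
    keep = np.ones(edges.size, dtype=bool)
    # Interior edge i separates bin i-1 from bin i; drop it when both are empty
    keep[1:-1] = ~(empty[:-1] & empty[1:])
    new_edges = edges[keep]
    # Each surviving left edge keeps the count of the bin that starts there
    new_counts = counts[keep[:-1]]
    return new_counts, new_edges

counts = np.array([3, 0, 0, 0, 2, 0, 1])
edges = np.arange(8.0)
new_counts, new_edges = merge_empty_bins(counts, edges)
print(new_counts)  # [3 0 2 0 1]
print(new_edges)   # [0. 1. 4. 5. 6. 7.]
```

The total count is unchanged, but the resulting bins are no longer of equal
width.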

However, @seberg is concerned that extending the API in this way may not be
the way to go. For example, if you use "auto" once and then re-use the
resulting bin edges, the uneven bins may not be what you want.
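
A small sketch of that scenario, with made-up edges standing in for whatever
a merged-bin "auto" might return (the edge values are an assumption for
illustration only):

```python
import numpy as np

# Hypothetical uneven edges, as might result from merging empty bins
edges = np.array([0.0, 1.0, 4.0, 5.0])

# A second, evenly spread dataset binned with the re-used edges
other = np.linspace(0.05, 4.95, 50)
counts, _ = np.histogram(other, bins=edges)
density, _ = np.histogram(other, bins=edges, density=True)

print(counts)   # [10 30 10]
print(density)  # approximately [0.2 0.2 0.2]
```

The raw counts suggest a peak in the middle bin even though the data are
evenly spread; only the density-normalized values reveal that. Someone
re-using the edges and comparing raw counts could be misled.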

Furthermore, @eric-wieser is concerned that there may be a floating-point
devil in the details. He advocates using Hypothesis, the property-based
testing package, to increase our confidence that the current implementation
adequately handles corner cases.

I would like to do my part in improving the code base. I don't have strong
opinions, but I have to admit that I would like to eventually land a PR that
resolves these bugs. After all, this PR has been half a year in the making.

Thoughts?

-areeves87
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion
