On Wed, Apr 15, 2015 at 8:06 AM, Neil Girdhar <mistersh...@gmail.com> wrote:
> You got it. I remember this from when I worked at Google and we would
> process (many many) logs. With enough bins, the approximation is still
> really close. It's great if you want to make an automatic plot of data.
> Calling numpy.partition a hundred times is probably slower than calling
> P^2 with n=100 bins. I don't think it does O(n) computations per point.
> I think it's more like O(log(n)).

Looking at it again, it probably is O(n) after all: it does a binary search, which is O(log n), but it then goes on to update all n bin counters and estimates, so O(n) I'm afraid.

So there is no algorithmic advantage over partition/percentile: with m samples and n bins, P^2 does that O(n) update m times, while partition does O(m) work n times, so both end up being O(m n).

It seems to me that the big selling point of P^2 is not having to hold the full dataset in memory. Online statistics (is that the name for this?), even if they only produce estimates, are a cool thing, but I am not sure numpy is the place for them. That's not to say that we couldn't eventually have P^2 implemented for histogram, but I would start off with a partition-based one. Would SciPy have a place for online statistics? Perhaps there's room for yet another scikit?

Jaime

--
(\__/)
( O.o)
( > <)
This is Bunny. Copy Bunny into your signature and help him with his plans for world domination.
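P.S. For concreteness, a minimal sketch of the partition-based cost model discussed above (not NumPy's actual histogram code, and the function name is made up): one O(m) selection per internal bin edge, hence O(m n) total, matching the estimate in the thread.

```python
import numpy as np

def approx_bin_edges(data, n_bins):
    """Percentile-style bin edges via repeated selection.

    Each np.partition call is O(m); doing one per internal edge gives
    the O(m * n) total discussed above.  (A single np.partition call
    with a sequence of ranks would share work across edges, but this
    loop mirrors the "calling partition a hundred times" cost model.)
    """
    m = len(data)
    edges = []
    for k in range(1, n_bins):
        rank = (m * k) // n_bins
        # np.partition places the rank-th smallest element at index
        # `rank`, without fully sorting the array.
        edges.append(np.partition(data, rank)[rank])
    return edges
```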
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion