Henry, the inputs and outputs are exactly what you had. Your function is much more clear than mine. Just goes to show that translating from another language is not often not the best approach for J. It also helps gain some intuition about what ecdf is doing when it's as clear as this
ecdf calculates the probability of the number falling within cumulative range. This can be used to calculate percentiles We can calculate the cumulative probability of a number in the empirical data if we know: 1. The interval index (a.k.a. rank) the number falls in (ties have the same rank) 2. The # items in the list The probability then becomes rank % # of items Given a list of numbers 1,3,3,5 what is the probability of the number being less than or equal to 3? We can easily see that it should be 75%. There are 3 numbers <= 3 out of 4. The first part of Henry's expression calculates the interval index (rank order) of each number, by comparing the input vector to a vector slightly less than the input. This forces the interval to start at 1 instead of 0 and takes the larger number for ties (I.~ (/:~@:(-&0.00001))) (1,3,3,5) 1 3 3 4 If ties weren't an issue, this would have been equivalent: (1&+@:I.~ (/:~))~ i.10 1 2 3 4 5 6 7 8 9 10 (I.~ (/:~@:(-&0.00001))) i.10 1 2 3 4 5 6 7 8 9 10 Simply adding one, chooses the smaller of the ranks for ties, which would not be correct: (1&+@:I.~ (/:~))~ (1,3,3,5) 1 2 2 4 vs: (I.~ (/:~@:(-&0.00001))) (1,3,3,5) 1 3 3 4 With the correct rank, the rest of the fork takes the rank and divides it by the count of items in the list ((I.~ % #@]) (/:~@:(-&0.00001)))~ (1,3,3,5) 0.25 0.75 0.75 1 Nicely done The alternate way of thinking of this is (which I had done) is Count the # of items in the list that are less than/equal to the current number. Take that number and divide it by the total # of items. The interval index/rank concept is easier to understand in my opinion On Wed, Aug 13, 2014 at 4:27 PM, Henry Rich <[email protected]> wrote: > I don't understand what the input and outputs of ecdf are supposed to be. > > But, if x is an empirical dataset and y is an array of values for which you > want the value of the ecdf, you could use > > ecdf =: ((I.~ % #@]) (/:~@:(-&0.00001)))~ > dataset ecdf 2 2 2 4 4 6 6 8 > > 0.375 0.375 0.375 0.625 0.625 0.875 0.875 1 > dataset ecdf i. 9 > 0 0 0.375 0.375 0.625 0.625 0.875 0.875 1 > > The 0.00001 bit is only if you really insist on P(X<=x) rather than P(X<x). > > > > Henry Rich > > > On 8/13/2014 3:06 PM, Joe Bogner wrote: >> >> I'm looking for any ideas to speed up this function. >> >> I patched together this ecdf function from a few different ideas: >> >> NB. > v<- c(2,2,2,4,4,6,6,8) >> NB. > ecdf(v)(v) >> NB. [1] 0.375 0.375 0.375 0.625 0.625 0.875 0.875 1.000 >> >> ecdf=: 3 : 0 >> valsct=. # y >> tbl=:y,.(valsct %~ #) \ y >> max=:(0{"1 tbl) (>./)/. tbl >> , 1{"1 (({."1 max) i. y) { max >> ) >> >> (0.375 0.375 0.375 0.625 0.625 0.875 0.875 1.000) -: ecdf >> (2,2,2,4,4,6,6,8) >> 1 >> >> >> timespacex 'ecdf i. 1e5' >> 8.84599 1.15392e7 >> >> The r function is nearly instantaneous >> >> I need to run this on a 1m+ array >> >> Thank you for any suggestions >> >> http://en.wikipedia.org/wiki/Empirical_distribution_function >> https://github.com/jstac/edtc-code/blob/master/python_code/ecdf.py >> https://github.com/dmbates/ecdfExample >> ---------------------------------------------------------------------- >> For information about J forums see http://www.jsoftware.com/forums.htm >> > ---------------------------------------------------------------------- > For information about J forums see http://www.jsoftware.com/forums.htm ---------------------------------------------------------------------- For information about J forums see http://www.jsoftware.com/forums.htm
