Henry, the inputs and outputs are exactly what you had. Your function
is much more clear than mine. Just goes to show that translating from
another language is not often not the best approach for J. It also
helps gain some intuition about what ecdf is doing when it's as clear
as this

ecdf calculates the probability of the number falling within
cumulative range. This can be used to calculate percentiles

We can calculate the cumulative probability of a number in the
empirical data if we know:

1. The interval index (a.k.a. rank) the number falls in (ties have the
same rank)
2. The # items in the list

The probability then becomes rank % # of items

Given a list of numbers 1,3,3,5 what is the probability of the number
being less than or equal to 3? We can easily see that it should be
75%. There are 3 numbers <= 3 out of 4.

The first part of Henry's expression calculates the interval index
(rank order) of each number, by comparing the input vector to a vector
slightly less than the input. This forces the interval to start at 1
instead of 0 and takes the larger number for ties

(I.~ (/:~@:(-&0.00001))) (1,3,3,5)
1 3 3 4

If ties weren't an issue, this would have been equivalent:
(1&+@:I.~  (/:~))~ i.10
1 2 3 4 5 6 7 8 9 10
(I.~ (/:~@:(-&0.00001))) i.10
1 2 3 4 5 6 7 8 9 10

Simply adding one, chooses the smaller of the ranks for ties, which
would not be correct:
(1&+@:I.~  (/:~))~ (1,3,3,5)
1 2 2 4

vs:
(I.~ (/:~@:(-&0.00001))) (1,3,3,5)
1 3 3 4

With the correct rank, the rest of the fork takes the rank and divides
it by the count of items in the list

((I.~ % #@]) (/:~@:(-&0.00001)))~ (1,3,3,5)
0.25 0.75 0.75 1

Nicely done

The alternate way of thinking of this is (which I had done) is

Count the # of items in the list that are less than/equal to the
current number. Take that number and divide it by the total # of
items.

The interval index/rank concept is easier to understand in my opinion



On Wed, Aug 13, 2014 at 4:27 PM, Henry Rich <[email protected]> wrote:
> I don't understand what the input and outputs of ecdf are supposed to be.
>
> But, if x is an empirical dataset and y is an array of values for which you
> want the value of the ecdf, you could use
>
>    ecdf =: ((I.~ % #@]) (/:~@:(-&0.00001)))~
>    dataset ecdf 2 2 2 4 4 6 6 8
>
> 0.375 0.375 0.375 0.625 0.625 0.875 0.875 1
>    dataset ecdf i. 9
> 0 0 0.375 0.375 0.625 0.625 0.875 0.875 1
>
> The 0.00001 bit is only if you really insist on P(X<=x) rather than P(X<x).
>
>
>
> Henry Rich
>
>
> On 8/13/2014 3:06 PM, Joe Bogner wrote:
>>
>> I'm looking for any ideas to speed up this function.
>>
>> I patched together this ecdf function from a few different ideas:
>>
>> NB. > v<- c(2,2,2,4,4,6,6,8)
>> NB. > ecdf(v)(v)
>> NB. [1] 0.375 0.375 0.375 0.625 0.625 0.875 0.875 1.000
>>
>> ecdf=: 3 : 0
>>    valsct=. # y
>>    tbl=:y,.(valsct %~ #) \ y
>>    max=:(0{"1 tbl) (>./)/. tbl
>>    , 1{"1 (({."1 max) i. y) { max
>> )
>>
>> (0.375 0.375 0.375 0.625 0.625 0.875 0.875 1.000) -: ecdf
>> (2,2,2,4,4,6,6,8)
>> 1
>>
>>
>> timespacex 'ecdf i. 1e5'
>> 8.84599 1.15392e7
>>
>> The r function is nearly instantaneous
>>
>> I need to run this on a 1m+ array
>>
>> Thank you for any suggestions
>>
>> http://en.wikipedia.org/wiki/Empirical_distribution_function
>> https://github.com/jstac/edtc-code/blob/master/python_code/ecdf.py
>> https://github.com/dmbates/ecdfExample
>> ----------------------------------------------------------------------
>> For information about J forums see http://www.jsoftware.com/forums.htm
>>
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Reply via email to