On Thursday, September 4, 2014 23:42:20 UTC-5, paul analyst wrote:
>
> julia> entropy(s) = -sum(x->x*log(2,x), [count(x->x==c,s)/length(s) for c in unique(s)]);
>
> julia> s=rand(10^3);
>
> julia> @time entropy(s)
> elapsed time: 0.167097546 seconds (20255140 bytes allocated)
> 9.965784284662059
>
> julia> s=rand(10^4);
>
> julia> @time entropy(s)
> elapsed time: 3.62008077 seconds (1602061320 bytes allocated, 21.81% gc 
> time)
> 13.287712379549843
>
> julia> s=rand(10^5);
>
> julia> @time entropy(s)
> elapsed time: 366.181311932 seconds (160021245832 bytes allocated, 21.89% 
> gc time)
> 16.609640474434073
>
>
You can see from the last two timings that the time is multiplied by 100 when the 
length of the vector is multiplied by 10, i.e. your method has O(n^2) 
complexity. This is due to the way you are counting repeats: for each unique 
element you rescan the whole vector with `count`, which is a quadratic way of 
building what is basically a histogram.

If your data are really floats, then some binning is required in any case. 
If they are ints, you could use a dictionary. I think there's even a 
counting function already implemented (but I don't remember what it's called).
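For the integer case, here is a minimal sketch of the dictionary idea (my own illustration, not from the thread; the name `entropy_ints` is made up). It makes one pass over the data, so it is O(n) with no binning:

```julia
# Sketch: O(n) entropy for integer-valued data using a Dict as a counter.
function entropy_ints(s)
    counts = Dict{eltype(s),Int}()
    for x in s
        counts[x] = get(counts, x, 0) + 1   # one O(1) update per element
    end
    N = length(s)
    # each distinct value contributes -(c/N) * log2(c/N)
    -sum(c -> (c / N) * log(2, c / N), values(counts))
end
```

For example, `entropy_ints(rand(1:10, 10^6))` should come out close to `log(2, 10)`, since the ten values are roughly equiprobable.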

How about this:

function entropy(s)
    N = length(s)
    num_bins = 10000
    h = hist(s, num_bins)   # returns (bin edges, counts)
    p = h[2] ./ N           # bin probabilities
    # skip empty bins: 0 * log(2, 0) would give NaN
    -sum([x * log(2, x) for x in filter(x -> x > 0, p)])
end

julia> @time entropy(rand(10^6))
elapsed time: 0.199634039 seconds (79424624 bytes allocated, 31.51% gc time)
13.28044771568381

julia> @time entropy(rand(10^7))
elapsed time: 1.710673571 seconds (792084208 bytes allocated, 26.20% gc 
time)
13.286992511965552

julia> @time entropy(rand(10^8))
elapsed time: 18.20088571 seconds (7918627344 bytes allocated, 24.03% gc 
time)
13.28764216804997

The calculation is now O(n) instead.


> julia> s=rand(10^6);
>
> julia> @time entropy(s)
> ................................
> After 12 h not yet counted :/
>
> Paul
>
