On Thursday, September 4, 2014 at 23:42:20 UTC-5, paul analyst wrote:
>
> julia> entropy(s)=-sum(x->x*log(2,x), [count(x->x==c,s)/length(s) for c in
> unique(s)]);
>
> julia> s=rand(10^3);
>
> julia> @time entropy(s)
> elapsed time: 0.167097546 seconds (20255140 bytes allocated)
> 9.965784284662059
>
> julia> s=rand(10^4);
>
> julia> @time entropy(s)
> elapsed time: 3.62008077 seconds (1602061320 bytes allocated, 21.81% gc
> time)
> 13.287712379549843
>
> julia> s=rand(10^5);
>
> julia> @time entropy(s)
> elapsed time: 366.181311932 seconds (160021245832 bytes allocated, 21.89%
> gc time)
> 16.609640474434073
>
>
You can see from the last two runs that the time is multiplied by 100 when the
length of the vector is multiplied by 10, i.e. your method has O(n^2)
complexity. This comes from the way you are counting repeats: for every unique
value you scan the whole vector again with `count`, so you do roughly n passes
over n elements. What you are basically computing is a histogram.
If your data are really floats, then some binning is required in any case.
If they are ints, you could use a dictionary. I think there's even a
counting object already implemented (but I don't remember what it's called).
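For integer (or otherwise exactly repeating) data, the dictionary idea might look like the sketch below; `entropy_dict` and its internals are my naming for illustration, not an existing function:

```julia
# Sketch of the dictionary approach for data with exact repeats (e.g. ints).
# One pass to count occurrences, then one pass over the distinct values: O(n).
function entropy_dict(s)
    counts = Dict{eltype(s),Int}()
    for x in s
        counts[x] = get(counts, x, 0) + 1   # increment, defaulting to 0
    end
    N = length(s)
    -sum([(c / N) * log(2, c / N) for c in values(counts)])
end
```

For four equally likely symbols this gives 2 bits, as expected.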
How about this:
function entropy(s)
    N = length(s)
    num_bins = 10000
    h = hist(s, num_bins)    # returns (bin edges, counts)
    p = h[2] ./ N            # bin probabilities
    # skip empty bins: 0 * log(2, 0) would give NaN
    -sum([x * log(2, x) for x in filter(x -> x > 0, p)])
end
julia> @time entropy(rand(10^6))
elapsed time: 0.199634039 seconds (79424624 bytes allocated, 31.51% gc time)
13.28044771568381
julia> @time entropy(rand(10^7))
elapsed time: 1.710673571 seconds (792084208 bytes allocated, 26.20% gc
time)
13.286992511965552
julia> @time entropy(rand(10^8))
elapsed time: 18.20088571 seconds (7918627344 bytes allocated, 24.03% gc
time)
13.28764216804997
The calculation is now O(n) instead.
> julia> s=rand(10^6);
>
> julia> @time entropy(s)
> ................................
> After 12 h not yet counted :/
>
> Paul
>