You get NaN because the binning has left some bin empty and x * log(2, x)
for x = 0.0 results in 0.0 * -Inf which is NaN. You can avoid this
particular problem by replacing x * log(2,x) with something that special
cases x = 0.0 to return 0.0 rather than NaN.
Note that the result you get from this operation will depend on the number
of bins and will typically be different from the result of your first
approach.
Den fredagen den 5:e september 2014 kl. 09:16:18 UTC+2 skrev paul analyst:
>
> THX, work but:
> julia> t=h5read("FMLo2_z_reversem.h5","FMLo2",(:,1));
>
> julia> eltype(t)
> Float64
>
> julia> t
> 5932868x1 Array{Float64,2}:
> 0.0181719
> 0.303473
> -0.526979
> ?
> -0.526979
> 0.912295
> -0.0281875
>
> julia> entropy(t)
> NaN
>
> julia> entropy(vec(t))
> NaN
> Why ?
> julia> s
>
> 1000000-element Array{Float64,1}:
> 0.737228
> 0.162308
> 0.688503
> 0.108727
> 0.40552
> 0.654883
> 0.618194
> 0.137147
> 0.934373
> 0.077236
> ?
> 0.413675
> 0.463914
> 0.504321
> 0.976408
> 0.998662
> 0.343927
> 0.477739
> 0.660533
> 0.918326
> 0.579264
>
> julia> entropy(s)
> 13.280438296634793
>
>
> julia>
>
> julia> entropy(t[:,1])
> NaN
>
> ???
> Paul
>
>
> W dniu 2014-09-05 07:34, David P. Sanders pisze:
>
>
>
> El jueves, 4 de septiembre de 2014 23:42:20 UTC-5, paul analyst escribió:
>>
>> julia> entropy(s)=-sum(x->x*log(2,x), [count(x->x==c,s)/length(s) for c
>> in unique(s)]);
>>
>> julia> s=rand(10^3);
>>
>> julia> @time entropy(s)
>> elapsed time: 0.167097546 seconds (20255140 bytes allocated)
>> 9.965784284662059
>>
>> julia> s=rand(10^4);
>>
>> julia> @time entropy(s)
>> elapsed time: 3.62008077 seconds (1602061320 bytes allocated, 21.81% gc
>> time)
>> 13.287712379549843
>>
>> julia> s=rand(10^5);
>>
>> julia> @time entropy(s)
>> elapsed time: 366.181311932 seconds (160021245832 bytes allocated, 21.89%
>> gc time)
>> 16.609640474434073
>>
>>
> You can see from these last two that the time is multiplied by 100 when
> the length of the vector is multiplied by 10, i.e. your method has O(n^2)
> complexity. This is due to the way that you are counting repeats. What you
> are basically doing is a histogram.
>
> If your data are really floats, then in any case some binning is
> required. If they are ints, you could use a dictionary. I think there's
> even a counting object already implemented (but I don't remember what it's
> called).
>
> How about this:
>
> function entropy(s)
> N = length(s)
> num_bins = 10000
> h = hist(s, num_bins)
> p = h[2] ./ N # probabilities
> -sum([x * log(2,x) for x in p])
> end
>
> julia> @time entropy(rand(10^6))
> elapsed time: 0.199634039 seconds (79424624 bytes allocated, 31.51% gc
> time)
> 13.28044771568381
>
> julia> @time entropy(rand(10^7))
> elapsed time: 1.710673571 seconds (792084208 bytes allocated, 26.20% gc
> time)
> 13.286992511965552
>
> julia> @time entropy(rand(10^8))
> elapsed time: 18.20088571 seconds (7918627344 bytes allocated, 24.03% gc
> time)
> 13.28764216804997
>
> The calculation is now O(n) instead.
>
>
>
>> julia> s=rand(10^6);
>>
>> julia> @time entropy(s)
>> ................................
>> After 12 h not yet counted :/
>>
>> Paul
>>
>
>