On Wednesday, 19 November 2014 at 17:05 -0800, Greg Lee wrote:
> Is there a faster way to do the following, which builds a dictionary
> of unique tokens and counts?
>         function unigrams(fn::String)
>             grams = Dict{String,Int32}()
>             f = open(fn)
>             for line in eachline(f)
>                 for t in split(line)
>                     i = get(grams,t,0)
>                     grams[t] = i+1
>                 end
>             end
>             close(f)
>             return grams 
>         end
> 
> 
> On a file with 1.9M unique tokens, this is 8x slower than Python
> written in the same style.  The big hit comes from string keys; using
> int keys is closer to Python's performance.  Timings:  Julia 1083s,
> Python 126s, c++ 80s. 
At least you can avoid doing the dict lookup twice (once to get the
value, once to set it) for keys that are already present in the
dictionary, by calling the unexported function Base.ht_keyindex(), like
this:
        for t in split(line)
            index = Base.ht_keyindex(grams, t)
            if index > 0
                grams.vals[index] += 1
            else
                grams[t] = 1
            end
        end

(BTW, this is a trick I'm using to compute frequency tables in
Tables.jl; you may also want to use that package directly.)

Of course this will break if the internal structure of dictionaries
changes. Work is going on to allow performing this kind of optimization
in a reliable way:
https://github.com/JuliaLang/julia/issues/8826#issuecomment-62062525
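
In the meantime, a similar effect can probably be had with exported
functions only. The following is a minimal sketch, not what Tables.jl
does, and it assumes a recent Julia 1.x. The names `Counter` and
`count_tokens` are hypothetical, introduced only for illustration. The
idea is to store a small mutable counter as the dictionary value, so that
`get!` fetches or creates the entry in one call and the count is then
updated in place without touching the dictionary again:

```julia
# Hypothetical helper type: a mutable box holding one count, so the
# value can be incremented in place after a single dictionary access.
mutable struct Counter
    n::Int
end

# Count whitespace-separated tokens over any iterable of lines
# (for a file, pass eachline(fn)). Uses only exported Base functions.
function count_tokens(lines)
    grams = Dict{String,Counter}()
    for line in lines
        for t in split(line)
            # get! returns the existing Counter for this key, or inserts
            # and returns a fresh Counter(0) when the key is absent.
            c = get!(() -> Counter(0), grams, String(t))
            c.n += 1
        end
    end
    return grams
end

counts = count_tokens(["a b a", "b a c"])
```

The trade-off is one small heap allocation per unique token, so whether
this beats the `get` plus assignment version should be measured on the
real data rather than assumed.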


Regards
