Re: [julia-users] Dict performance with String keys

Stefan Karpinski Thu, 20 Nov 2014 10:42:18 -0800

I'm currently working on an overhaul
<https://github.com/JuliaLang/julia/pull/8964> of byte vectors and strings,
which will be followed by an overhaul of I/O (how one typically gets byte
vectors). It will take a bit of time but all things string and I/O related
should be much more efficient once I'm done. There's a lot of work to do
though...


On Thu, Nov 20, 2014 at 12:36 PM, Greg Lee <[email protected]> wrote:

> I see from the analysis and comments on issue 8826
> <https://github.com/JuliaLang/julia/issues/8826#issuecomment-62062525> that
> I am not the first to stumble on this performance gap and that improvements
> are coming.  I understand that the initial emphasis for julia has been on
> numerical calculations, but it would be nice to Have It All and be able to
> use julia confidently in problems that mix numerics and string manipulation.
>
>
> On Thursday, November 20, 2014 2:11:57 AM UTC-8, Milan Bouchet-Valat wrote:
>>
>> On Wednesday, 19 November 2014 at 17:05 -0800, Greg Lee wrote:
>> > Is there a faster way to do the following, which builds a dictionary
>> > of unique tokens and counts?
>> >         function unigrams(fn::String)
>> >             grams = Dict{String,Int32}()
>> >             f = open(fn)
>> >             for line in eachline(f)
>> >                 for t in split(line)
>> >                     i = get(grams,t,0)
>> >                     grams[t] = i+1
>> >                 end
>> >             end
>> >             close(f)
>> >             return grams
>> >         end
>> >
>> >
>> > On a file with 1.9M unique tokens, this is 8x slower than Python
>> > written in the same style.  The big hit comes from string keys; using
>> > int keys is closer to Python's performance.  Timings: Julia 1083s,
>> > Python 126s, C++ 80s.
>> At least you can avoid doing the dict lookup twice (once to get the
>> value, once to set it) for values that are already present in the
>> dictionary, by calling the unexported function Base.ht_keyindex(), like
>> this:
>>         for t in split(line)
>>             index = Base.ht_keyindex(grams, t)
>>             if index > 0
>>                 grams.vals[index] += 1
>>             else
>>                 grams[t] = 1
>>             end
>>         end
>>
>> (BTW, this is a trick I'm using to compute frequency tables in
>> Tables.jl, you may also want to use that package directly.)
>>
>> Of course this will break if the internal structure of dictionaries
>> changes. Work is going on to allow performing this kind of optimization
>> in a reliable way:
>> https://github.com/JuliaLang/julia/issues/8826#issuecomment-62062525
>>
>>
>> Regards
>>
>
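
For reference, the double lookup described in the quoted message (one `get`, then one `setindex!`) can also be avoided without touching Dict internals. What follows is a minimal sketch, not code from this thread. It assumes a recent Julia (1.x), where `get!(f, dict, key)` is available. The names `Counter` and `count_tokens` are hypothetical, made up for illustration. Each value is a small mutable counter, so once `get!` has found or inserted the entry, the increment needs no second hash lookup.

```julia
# Sketch only: assumes Julia 1.x. `Counter` and `count_tokens` are
# hypothetical names, not part of Base or of any package mentioned above.
mutable struct Counter
    n::Int
end

function count_tokens(io::IO)
    counts = Dict{String,Counter}()
    for line in eachline(io)
        for t in split(line)
            # One hash lookup per token: get! returns the existing counter,
            # or inserts and returns a fresh one if the key is missing.
            c = get!(() -> Counter(0), counts, t)
            # Mutating the counter does not touch the hash table again.
            c.n += 1
        end
    end
    return counts
end

# Usage on an in-memory buffer instead of a file:
counts = count_tokens(IOBuffer("a b a\nc a b\n"))
```

The trade-off is one small heap-allocated object per unique token, and each token is still converted to a `String` key on lookup. Whether this beats the `ht_keyindex` trick on a given workload has not been measured here.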
