So I realized it is the same problem as
https://github.com/JuliaLang/julia/issues/1423
Python switched to SipHash for strings: http://legacy.python.org/dev/peps/pep-0456/
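A quick way to check that string hashing really is the bottleneck (a rough
sketch using the 0.2-era readall/symbol APIs; "words.txt" is a placeholder
input file):

words = split(readall("words.txt"))
syms  = [symbol(w) for w in words]

@time for w in words
    hash(w)    # hashes the string's bytes on every call
end
@time for s in syms
    hash(s)    # symbols are interned, so hashing is much cheaper
end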

On Wednesday, March 5, 2014 6:19:20 AM UTC-8, Pierre-Yves Gérardy wrote:
>
> On Wednesday, March 5, 2014 4:52:19 AM UTC+1, Keith Campbell wrote:
>
>> Using Symbols seems to help, e.g.:
>>
>
> Symbols are interned, making comparison and hashing very fast, but they
> are not garbage collected <https://github.com/JuliaLang/julia/issues/5880>.
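>
> A minimal illustration of interning (a REPL sketch; the strings are
> arbitrary examples):
>
> julia> symbol("foo") === symbol("foo")       # interned: identical object
> true
>
> julia> symbol("f" * "oo") === symbol("foo")  # even when built at runtime
> true
>
> Every distinct symbol stays interned for the life of the process.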
>
> Depending on your use case, that may be problematic.
>
> —Pierre-Yves
>   
>
>> using DataStructures
>>
>> # Count words as Symbols so that hashing and comparison use the interned
>> # symbol rather than re-hashing the string's bytes on every lookup.
>> function wordcounter_sym(filename)
>>     counts = counter(Symbol)
>>     delims = Set([' ', '\n', '\r', '\t', '-', '.', ',', ':', '_', '"', ';', '!'])
>>     words = split(readall(filename), delims, false)  # false: drop empty strings
>>     for w in words
>>         add!(counts, symbol(w))
>>     end
>>     return counts
>> end
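>>
>> For example (hypothetical file name; indexing the counter returns a count):
>>
>> counts = wordcounter_sym("hamlet.txt")
>> counts[symbol("the")]    # lookup hashes the interned symbol, not the string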
>>
>> On my system this cut the time from 0.67 sec to 0.48 sec, about 30% less.
>> Memory use is also quite a bit lower.
>>
>> On Tuesday, March 4, 2014 8:58:51 PM UTC-5, Roman Sinayev wrote:
>>>
>>> I updated the gist with times and code snippets
>>> https://gist.github.com/lqdc/9342237
>>>
>>> On Tuesday, March 4, 2014 5:15:29 PM UTC-8, Steven G. Johnson wrote:
>>>>
>>>> It's odd that the performance gain you see is so much smaller than the
>>>> gain on my machine.
>>>>
>>>> Try putting @time in front of "for w in words" and also in front of
>>>> "words=...". That will tell you how much time is being spent in each,
>>>> and whether the limitation is really hashing performance.
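>>>>
>>>> E.g., applied to the function above, something like this (a sketch; the
>>>> timed variant is just illustrative):
>>>>
>>>> function wordcounter_timed(filename)
>>>>     counts = counter(Symbol)
>>>>     delims = Set([' ', '\n', '\r', '\t', '-', '.', ',', ':', '_', '"', ';', '!'])
>>>>     @time words = split(readall(filename), delims, false)  # time the split
>>>>     @time for w in words                                   # time hashing + counting
>>>>         add!(counts, symbol(w))
>>>>     end
>>>>     return counts
>>>> end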
>>>>
>>>> On Tuesday, March 4, 2014 7:55:12 PM UTC-5, Roman Sinayev wrote:
>>>>>
>>>>> I got to about 0.55 seconds with the above suggestions. Still about 2x
>>>>> slower than Python, unfortunately.
>>>>> The reason I need hashing to be fast is that I rely on it heavily, both
>>>>> for NLP and for serving data from a Julia webserver.
>>>>>
>>>>
