Interesting. I've been inclined to randomize our hashing for quite some
time now and have been keeping that option in mind during my work on
numeric hashing. I know Marek, who wrote
csiphash <https://github.com/majek/csiphash/>, which is apparently the
standard C implementation of SipHash; he was at Hacker School at the
same time as Leah, Daniel and Chuck, when they originally wrote the
Julia web stack.
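
To make the first point concrete, here is a minimal sketch of
seed-randomized hashing (in current Julia syntax; an illustration, not
Julia's actual implementation): a seed drawn once per process is mixed
into every hash, so colliding keys can't be precomputed across runs.

    # Minimal sketch of randomized hashing (illustration only).
    # The seed is drawn once per process and kept secret, so a set of
    # keys that collide in one run won't collide in the next.
    const HASH_SEED = rand(UInt)

    # Mix the seed in via the standard two-argument hash(x, h::UInt).
    seeded_hash(s::AbstractString) = hash(s, HASH_SEED)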


On Mon, Mar 10, 2014 at 6:19 AM, Roman Sinayev <lqd...@gmail.com> wrote:

> So I realized it is the same problem as
> https://github.com/JuliaLang/julia/issues/1423
> Python switched to SipHash for strings:
> http://legacy.python.org/dev/peps/pep-0456/
>
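The problem both links describe is hash flooding: when the hash function
is predictable, an attacker can supply many keys that all land in the
same bucket, degrading dictionary operations from roughly O(1) to O(n)
each. A minimal sketch of the effect in current Julia syntax, using a
made-up key type whose hash is deliberately constant:

    # BadKey is a hypothetical type for illustration: every instance
    # hashes identically, so all entries collide in the Dict's table.
    struct BadKey
        x::Int
    end
    Base.hash(::BadKey, h::UInt) = h
    Base.:(==)(a::BadKey, b::BadKey) = a.x == b.x

    d = Dict{BadKey,Int}()
    @time for i in 1:10_000        # each insert probes past all
        d[BadKey(i)] = i           # earlier entries: quadratic overall
    end

A keyed hash like SipHash defeats this because the colliding key set
cannot be computed without the secret key.
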
> On Wednesday, March 5, 2014 6:19:20 AM UTC-8, Pierre-Yves Gérardy wrote:
>>
>> On Wednesday, March 5, 2014 4:52:19 AM UTC+1, Keith Campbell wrote:
>>
>>> Using Symbols seems to help, i.e.:
>>>
>>
>> Symbols are interned, making comparison and hashing very fast, but they
>> are not garbage collected <https://github.com/JuliaLang/julia/issues/5880>.
>>
>> Depending on your use case, that may be problematic.
>>
>> —Pierre-Yves
>>
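
A quick REPL check of the interning behavior described above (assumed
semantics, in current Julia syntax, easy to verify):

    a = "hello" * "!"          # two equal but separately built strings
    b = "hello" * "!"
    a == b                     # true:  contents are equal
    a === b                    # false: distinct String objects
    Symbol(a) === Symbol(b)    # true:  same name means the very same
                               # interned object, so comparison and
                               # hashing reduce to identity
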
>>
>>> using DataStructures
>>> function wordcounter_sym(filename)
>>>     counts = counter(Symbol)
>>>     words = split(readall(filename),
>>>                   Set([' ','\n','\r','\t','-','.',',',':','_','"',';','!']),
>>>                   false)
>>>     for w in words
>>>         add!(counts, symbol(w))
>>>     end
>>>     return counts
>>> end
>>>
>>> On my system this cut the time from 0.67 sec to 0.48 sec, about 30% less.
>>> Memory use is also quite a bit lower.
>>>
>>> On Tuesday, March 4, 2014 8:58:51 PM UTC-5, Roman Sinayev wrote:
>>>>
>>>> I updated the gist with times and code snippets
>>>> https://gist.github.com/lqdc/9342237
>>>>
>>>> On Tuesday, March 4, 2014 5:15:29 PM UTC-8, Steven G. Johnson wrote:
>>>>>
>>>>> It's odd that the performance gain you see is so much less than
>>>>> the gain on my machine.
>>>>>
>>>>> Try putting @time in front of "for w in words" and also in front of
>>>>> "words=...". That will tell you how much time is being spent in each,
>>>>> and whether the limitation is really hashing performance.
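
Concretely, the instrumented function might look like the sketch below
(same 2014-era API as the code earlier in the thread: readall, symbol,
and the counter/add! pair from DataStructures):

    using DataStructures

    function wordcounter_timed(filename)
        counts = counter(Symbol)
        delims = Set([' ','\n','\r','\t','-','.',',',':','_','"',';','!'])
        @time words = split(readall(filename), delims, false)  # split cost
        @time for w in words                                   # hash/count cost
            add!(counts, symbol(w))
        end
        return counts
    end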
>>>>>
>>>>> On Tuesday, March 4, 2014 7:55:12 PM UTC-5, Roman Sinayev wrote:
>>>>>>
>>>>>> I got to about 0.55 seconds with the above suggestions. Still about
>>>>>> 2x slower than Python, unfortunately.
>>>>>> I need hashing to be fast because I use it heavily both for NLP and
>>>>>> for serving data from a Julia webserver.
>>>>>>
>>>>>
