Interesting. I've been inclined to randomize our hashing for quite some time now and have been keeping that option in mind during my work on numeric hashing. I know Marek, who wrote csiphash<https://github.com/majek/csiphash/> which is apparently the standard C implementation of SipHash – he was at Hacker School at the same time as Leah, Daniel and Chuck, when they originally wrote the Julia web stack.
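To make the idea concrete, a minimal sketch of what "randomizing our hashing" could look like: draw one random seed per process and mix it into every hash call, so an attacker cannot precompute colliding keys across runs. The names `HASH_SEED` and `seeded_hash` are hypothetical, not anything in Base; this just piggybacks on the existing `hash(x, h::UInt)` two-argument form.

```julia
# Hypothetical per-process hash randomization sketch (not Base API).
# A fresh random seed each run means hash values differ between runs,
# so precomputed collision sets from one run are useless in the next.
const HASH_SEED = rand(UInt)

# Mix the process-wide seed into the standard two-argument hash.
seeded_hash(x) = hash(x, HASH_SEED)
```

Within a single process this is a valid `Dict` hash (equal keys still hash equally); the seed only varies across processes.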
On Mon, Mar 10, 2014 at 6:19 AM, Roman Sinayev <lqd...@gmail.com> wrote:
> So I realized it is the same problem as
> https://github.com/JuliaLang/julia/issues/1423
> Python switched to SipHash for strings:
> http://legacy.python.org/dev/peps/pep-0456/
>
> On Wednesday, March 5, 2014 6:19:20 AM UTC-8, Pierre-Yves Gérardy wrote:
>>
>> On Wednesday, March 5, 2014 4:52:19 AM UTC+1, Keith Campbell wrote:
>>
>>> Using Symbols seems to help, ie:
>>>
>>
>> Symbols are interned, making comparison and hashing very fast, but they
>> are not garbage collected <https://github.com/JuliaLang/julia/issues/5880>.
>>
>> Depending on your use case, that may be problematic.
>>
>> —Pierre-Yves
>>
>>
>>> using DataStructures
>>> function wordcounter_sym(filename)
>>>     counts = counter(Symbol)
>>>     words = split(readall(filename),
>>>                   Set([' ','\n','\r','\t','-','.',',',':','_','"',';','!']),
>>>                   false)
>>>     for w in words
>>>         add!(counts, symbol(w))
>>>     end
>>>     return counts
>>> end
>>>
>>> On my system this cut the time from 0.67 sec to 0.48 sec, about 30% less.
>>> Memory use is also quite a bit lower.
>>>
>>> On Tuesday, March 4, 2014 8:58:51 PM UTC-5, Roman Sinayev wrote:
>>>>
>>>> I updated the gist with times and code snippets:
>>>> https://gist.github.com/lqdc/9342237
>>>>
>>>> On Tuesday, March 4, 2014 5:15:29 PM UTC-8, Steven G. Johnson wrote:
>>>>>
>>>>> It's odd that the performance gain you see is so much smaller than
>>>>> the gain on my machine.
>>>>>
>>>>> Try putting @time in front of "for w in words" and also in front of
>>>>> "words=...". That will tell you how much time is being spent in each,
>>>>> and whether the limitation is really hashing performance.
>>>>>
>>>>> On Tuesday, March 4, 2014 7:55:12 PM UTC-5, Roman Sinayev wrote:
>>>>>>
>>>>>> I got to about 0.55 seconds with the above suggestions. Still about
>>>>>> 2x slower than Python, unfortunately.
>>>>>> The reason I find it necessary for hashing to work quickly is that I
>>>>>> use it heavily, both for NLP and for serving data from a Julia
>>>>>> webserver.
>>>>>
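For reference, the quoted `wordcounter_sym` relies on `DataStructures.counter` and the old lowercase `symbol`/`readall` names from Julia 0.2/0.3. A dependency-free sketch of the same Symbol-keyed counting idea, using a plain `Dict` and the current `Symbol` constructor (my adaptation, not code from the thread):

```julia
# Count word occurrences using Symbol keys. Symbols are interned,
# so equality checks and hashing on them are cheap, at the cost of
# the interned strings never being garbage collected.
function wordcount(words)
    counts = Dict{Symbol,Int}()
    for w in words
        s = Symbol(w)                       # intern the word
        counts[s] = get(counts, s, 0) + 1   # bump its count
    end
    return counts
end

counts = wordcount(split("the cat sat on the mat"))
```

As Steven suggests above, wrapping the `split` call and the counting loop separately in `@time` shows which phase actually dominates.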