So I realized it is the same problem as https://github.com/JuliaLang/julia/issues/1423. Python switched to SipHash for strings: http://legacy.python.org/dev/peps/pep-0456/
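For anyone curious what PEP 456 actually changed on the Python side: string hashes are computed with a keyed SipHash, seeded per process, so the same string hashes differently in different interpreter runs unless the seed is pinned. A small sketch (the string `'julia'` and the helper name `str_hash` are arbitrary, just for illustration):

```python
import os
import subprocess
import sys

def str_hash(seed):
    """Hash the same string in a fresh interpreter process with a given
    PYTHONHASHSEED, to observe per-process hash randomization."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.run(
        [sys.executable, "-c", "print(hash('julia'))"],
        capture_output=True, text=True, env=env, check=True,
    )
    return int(out.stdout)

# A fixed seed disables randomization, so separate runs agree:
print(str_hash("0") == str_hash("0"))
# With PYTHONHASHSEED=random (the default), two runs will almost
# certainly disagree -- that randomization is what defeats the
# collision-flooding attacks PEP 456 was written against.
```

The point for the Julia issue linked above is the same trade-off: a keyed hash like SipHash costs a bit more per call than a simple multiplicative hash, but makes hash values unpredictable to an attacker.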
On Wednesday, March 5, 2014 6:19:20 AM UTC-8, Pierre-Yves Gérardy wrote:
>
> On Wednesday, March 5, 2014 4:52:19 AM UTC+1, Keith Campbell wrote:
>
>> Using Symbols seems to help, i.e.:
>
> Symbols are interned, making comparison and hashing very fast, but they
> are not garbage collected <https://github.com/JuliaLang/julia/issues/5880>.
>
> Depending on your use case, that may be problematic.
>
> —Pierre-Yves
>
>> using DataStructures
>> function wordcounter_sym(filename)
>>     counts = counter(Symbol)
>>     words = split(readall(filename),
>>                   Set([' ','\n','\r','\t','-','.',',',':','_','"',';','!']),
>>                   false)
>>     for w in words
>>         add!(counts, symbol(w))
>>     end
>>     return counts
>> end
>>
>> On my system this cut the time from 0.67 sec to 0.48 sec, about 30% less.
>> Memory use is also quite a bit lower.
>>
>> On Tuesday, March 4, 2014 8:58:51 PM UTC-5, Roman Sinayev wrote:
>>>
>>> I updated the gist with times and code snippets:
>>> https://gist.github.com/lqdc/9342237
>>>
>>> On Tuesday, March 4, 2014 5:15:29 PM UTC-8, Steven G. Johnson wrote:
>>>>
>>>> It's odd that the performance gain you see is so much less than
>>>> the gain on my machine.
>>>>
>>>> Try putting @time in front of "for w in words" and also in front of
>>>> "words=...". That will tell you how much time is being spent in each,
>>>> and whether the limitation is really hashing performance.
>>>>
>>>> On Tuesday, March 4, 2014 7:55:12 PM UTC-5, Roman Sinayev wrote:
>>>>>
>>>>> I got to about 0.55 seconds with the above suggestions. Still about 2x
>>>>> slower than Python, unfortunately.
>>>>> The reason I find it necessary for hashing to work quickly is that I
>>>>> use it heavily both for NLP and when serving data on a Julia webserver.
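For reference, the Python baseline being compared against in the thread is presumably something like the following `collections.Counter` word count; this is a sketch, not the poster's actual script, with the regex character class mirroring the delimiter `Set` in the Julia code above:

```python
import re
from collections import Counter

# Same delimiters as the Julia version: space, newline, CR, tab,
# and the punctuation characters - . , : _ " ; !
DELIMS = re.compile(r'[ \n\r\t\-.,:_";!]+')

def wordcounter(text):
    # Split on runs of delimiters and drop empty tokens
    # (the `false` keep-empty flag in the Julia split call).
    words = [w for w in DELIMS.split(text) if w]
    return Counter(words)

counts = wordcounter("the quick,the;quick-fox")
print(counts.most_common())
```

Counter is backed by a plain dict, so its per-word cost is dominated by exactly the string hashing being discussed here, which is what makes it a fair point of comparison.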