I'm currently working on an overhaul <https://github.com/JuliaLang/julia/pull/8964> of byte vectors and strings, which will be followed by an overhaul of I/O (how one typically gets byte vectors). It will take a bit of time but all things string and I/O related should be much more efficient once I'm done. There's a lot of work to do though...
On Thu, Nov 20, 2014 at 12:36 PM, Greg Lee <[email protected]> wrote:

> I see from the analysis and comments on issue 8826
> <https://github.com/JuliaLang/julia/issues/8826#issuecomment-62062525> that
> I am not the first to stumble on this performance gap and that improvements
> are coming. I understand that the initial emphasis for julia has been on
> numerical calculations, but it would be nice to Have It All and be able to
> use julia confidently in problems that mix numerics and string manipulation.
>
>
> On Thursday, November 20, 2014 2:11:57 AM UTC-8, Milan Bouchet-Valat wrote:
>>
>> On Wednesday, November 19, 2014 at 17:05 -0800, Greg Lee wrote:
>> > Is there a faster way to do the following, which builds a dictionary
>> > of unique tokens and counts?
>> > function unigrams(fn::String)
>> >     grams = Dict{String,Int32}()
>> >     f = open(fn)
>> >     for line in eachline(f)
>> >         for t in split(line)
>> >             i = get(grams,t,0)
>> >             grams[t] = i+1
>> >         end
>> >     end
>> >     close(f)
>> >     return grams
>> > end
>> >
>> >
>> > On a file with 1.9M unique tokens, this is 8x slower than Python
>> > written in the same style. The big hit comes from string keys; using
>> > int keys is closer to Python's performance. Timings: Julia 1083s,
>> > Python 126s, C++ 80s.
>> At least you can avoid doing the dict lookup twice (once to get the
>> value, once to set it) for values that are already present in the
>> dictionary, by calling the unexported function Base.ht_keyindex(), like
>> this:
>> for t in split(line)
>>     index = Base.ht_keyindex(grams, t)
>>     if index > 0
>>         grams.vals[index] += 1
>>     else
>>         grams[t] = 1
>>     end
>> end
>>
>> (BTW, this is a trick I'm using to compute frequency tables in
>> Tables.jl, you may also want to use that package directly.)
>>
>> Of course this will break if the internal structure of dictionaries
>> changes. Work is going on to allow performing this kind of optimization
>> in a reliable way:
>> https://github.com/JuliaLang/julia/issues/8826#issuecomment-62062525
>>
>>
>> Regards
>>
>
