Hi, I'm the author of the repository. This whole thing started as a pet project and is still purely for fun. Here is a blog post about it if you're interested: http://juditacs.github.io/2015/11/26/wordcount.html
Although the results are not yet included in the repository's leaderboard, I do run tests on an even bigger dataset, namely the full Hungarian Wikipedia, which is 65 million lines, but I only include the fastest languages (C++, Go, Python2 and Java right now). I'm not sure that the current Julia script would fit into 16GB of memory, given that the Python2 version uses 8.7GB when run on the full huwiki.

I merged getzdan's pull request and the new wordcount.jl is twice as fast as the previous one (340 s -> 175 s). Thank you very much for the improved solution.

Best,
Judit

On Monday, November 30, 2015 at 5:54:59 PM UTC+1, Attila Zséder wrote:
>
> Hi,
>
> following your suggestions:
>
> - FastAnonymous without string types made my code slower
> - UTF8String for string types also made the code slower
> - using both of them resulted in a significant 30% improvement
> - using Dan's customlt instead of my (FastAnonymous-ized) lambda function
>   resulted in another 10% improvement
> - using Tim's docount2 did speed up the reading, but only by about 10%,
>   compared to the UTF8String-typed reading of docount1
>
> If you think it is useful, I can open an issue for this.
>
> Attila
>
> On Mon, Nov 30, 2015 at 4:38 PM, Dan <get...@gmail.com> wrote:
>
>> My suggestion would be to replace
>>
>> for t in sort(collect(wc), by=x -> (-x.second, x.first))
>>     println(t.first, "\t", t.second)
>> end
>>
>> with
>>
>> customlt(a,b) = (b.second < a.second) ? true : b.second == a.second ? a.first < b.first : false
>>
>> function main()
>>     :
>>     :
>>     for t in sort(collect(wc), lt=customlt)
>>         println(t.first, "\t", t.second)
>>     end
>> end
>>
>> On Monday, November 30, 2015 at 5:08:56 PM UTC+2, Dan wrote:
>>>
>>> Can you provide the comparable Python code? Perhaps even the data used
>>> for testing?
>>>
>>> Since you are evaluating Julia, there are two important points to
>>> remember:
>>> 1) Because Julia is fast enough to implement basic functionality in
>>> Julia itself, the distinction between Base Julia and additional
>>> packages is small. Opting to use 'just' the core makes less sense:
>>> the core is just a pre-compiled package.
>>> 2) The community is part of the language, so it should be taken into
>>> account in your evaluation.
>>>
>>> On Monday, November 30, 2015 at 4:21:51 PM UTC+2, Attila Zséder wrote:
>>>>
>>>> Hi,
>>>>
>>>> Thank you all for the responses.
>>>>
>>>> 1. I tried simple profiling, but its output was difficult for me to
>>>> interpret; it may become clearer if I put more time into it. I will
>>>> try ProfileView later.
>>>> 2. FastAnonymous gave me a serious speedup (20-30%). (But since it is
>>>> an external dependency, it's kind of cheating, given the purpose of
>>>> this small word count test.)
>>>> 3. Using ASCIIString is not a good option right now, since there are
>>>> Unicode characters in the data. I am trying both UTF8String and
>>>> AbstractString, and I don't see any difference in performance right now.
>>>> 4. Using ht_keyindex() is out of scope for me right now, because this
>>>> is a pet project. I just wanted to see how fast the current
>>>> implementation is without these kinds of solutions.
>>>>
>>>> I think I will keep trying with later versions of Julia, but while
>>>> sticking to the standard library only, without using any external packages.
>>>>
>>>> Attila
>>>>
>>>> On Sunday, November 29, 2015 at 5:59:42 PM UTC+1, Yichao Yu wrote:
>>>>>
>>>>> On Sun, Nov 29, 2015 at 11:42 AM, Milan Bouchet-Valat <nali...@club.fr>
>>>>> wrote:
>>>>> > On Sunday, November 29, 2015 at 08:28 -0800, Cedric St-Jean wrote:
>>>>> >> What I would try:
>>>>> >>
>>>>> >> 1. ProfileView to pinpoint the bottleneck further
>>>>> >> 2. FastAnonymous to fix the lambda
>>>>> >> 3.
>>>>> >> http://julia-demo.readthedocs.org/en/latest/manual/performance-tips.html
>>>>> >> In particular, you may check `code_typed`. I don't have
>>>>> >> experience with `split` and `eachline`. It's possible that they are
>>>>> >> not type stable (the compiler can't predict their output's type). I
>>>>> >> would try `for w::ASCIIString in ...`
>>>>> >> 4. Dict{ASCIIString, Int}()
>>>>> >> 5. Your loop will hash each string twice. I don't know how to fix
>>>>> >> that, anyone?
>>>>> > You can use the unexported Base.ht_keyindex() function like this:
>>>>> >
>>>>> > https://github.com/nalimilan/FreqTables.jl/blob/7884c000e6797d7ec621e07b8da58e7939e39867/src/freqtable.jl#L36
>>>>> >
>>>>> > But this is at your own risk, as it may change without warning in a
>>>>> > future Julia release.
>>>>> >
>>>>> > We really need a public API for it.
>>>>>
>>>>> IIUC, https://github.com/JuliaLang/julia/issues/12157
>>>>>
>>>>> >
>>>>> > Regards
>>>>> >
>>>>> >>
>>>>> >> Good luck,
>>>>> >>
>>>>> >> Cédric
>>>>> >>
>>>>> >> On Saturday, November 28, 2015 at 8:08:49 PM UTC-5, Lampkld wrote:
>>>>> >> > Maybe it's the lambda? These are slow in Julia right now.
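
For anyone reading this thread later, the suggestions quoted above (a concretely typed Dict plus a named comparison function instead of a lambda passed to `by`) can be put together roughly as follows. This is only a minimal sketch, not the repository's wordcount.jl. It assumes a current Julia release, where `String` stands in for the `UTF8String` type mentioned in the thread, and the function names `countwords` and `printcounts` are made up for illustration. It does not address the double hashing point: `get` followed by assignment still hashes each word twice.

```julia
# Count whitespace-separated words read from `io`.
# Assumption: modern Julia, where String replaces the 0.4-era UTF8String.
function countwords(io::IO)
    wc = Dict{String,Int}()          # concrete key and value types
    for line in eachline(io)
        for w in split(line)
            # One lookup to read, one to write: each word is hashed twice.
            wc[w] = get(wc, w, 0) + 1
        end
    end
    return wc
end

# Same ordering as the customlt quoted in the thread:
# higher counts first, ties broken by the word in ascending order.
customlt(a, b) = (b.second < a.second) ? true :
                 (b.second == a.second) ? a.first < b.first : false

# Print "word<TAB>count" lines in sorted order.
function printcounts(out::IO, wc::Dict{String,Int})
    for t in sort(collect(wc), lt=customlt)
        println(out, t.first, "\t", t.second)
    end
end
```

To use it on standard input, call `printcounts(stdout, countwords(stdin))`. The comparator is a named top-level function, which is what the thread found to be faster than the anonymous `by` function on the Julia version discussed; whether that still matters on newer releases is something to measure rather than assume.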