For me, docount2 is about 1.7x faster than docount1 (on my old laptop, 28s vs 
47s) for your "Hungarian Wikipedia" test dataset. We might want to implement 
some of these tweaks in the standard library.

--Tim

# extend Base's iteration protocol and readuntil with new methods
import Base: start, next, done, eltype, readuntil

function docount1(io)
    # baseline version: Base's eachline plus an abstract key type
    wc = Dict{AbstractString,Int64}()
    for l in eachline(io)
        for w in split(l)
            wc[w] = get(wc, w, 0) + 1
        end
    end
    wc
end

# minimal eachline replacement: a line iterator whose element type is the
# concrete string type T, so the strings it produces are type-stable
type EachLn{T}
    stream::IO
end

start(itr::EachLn) = nothing
done(itr::EachLn, nada) = eof(itr.stream)
next{T}(itr::EachLn{T}, nada) = (readuntil(T, itr.stream, '\n'), nothing)
eltype{T}(::Type{EachLn{T}}) = T

# read up to and including `delim`, returning the result as a string of type T
function readuntil{T}(::Type{T}, s::IO, delim::Char)
    if delim < Char(0x80)
        # ASCII delimiter: use Base's fast byte-oriented readuntil
        data = readuntil(s, delim%UInt8)
        return T(data)
    end
    # non-ASCII delimiter: fall back to reading one Char at a time
    out = IOBuffer()
    while !eof(s)
        c = read(s, Char)
        write(out, c)
        if c == delim
            break
        end
    end
    T(takebuf_array(out))
end


function docount2(io)
    # tweaked version: split on a UTF8String yields SubString{UTF8String},
    # so both the line iterator's eltype and the Dict key type are concrete
    wc = Dict{SubString{UTF8String},Int64}()
    for l in EachLn{UTF8String}(io)
        for w in split(l)
            wc[w] = get(wc, w, 0) + 1
        end
    end
    wc
end
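
For reference, a minimal timing harness in the spirit of the @time example
quoted further down (the filename is a placeholder for the Wikipedia dump;
run it twice so the second measurement excludes JIT compilation):

    open("huwiki.txt") do io
        @time docount1(io)
    end
    open("huwiki.txt") do io
        @time docount2(io)
    end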



On Monday, November 30, 2015 08:29:28 AM Dan wrote:
> and replacing
>         println("$(t.first)\t$(t.second)")
> 
> with
>         @printf("%s\t%d\n",t.first,t.second)
> 
> also halves the print time (which might or might not make a big difference,
> but definitely more than 1 second)
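
A minimal sketch of that change in context (assuming the `v = collect(wc)`
vector from the main() quoted below):

    for t in v
        @printf("%s\t%d\n", t.first, t.second)  # formatted print instead of per-line string interpolation
    end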
> 
> On Monday, November 30, 2015 at 6:20:54 PM UTC+2, Dan wrote:
> > Using the `customlt` function for the sort order cut the time in half on a
> > test file. So try:
> > 
> > # sort by descending count, breaking ties by ascending word
> > customlt(a, b) = (b.second < a.second) ? true :
> >                  (b.second == a.second) ? (a.first < b.first) : false
> > 
> > function main()
> >     wc = Dict{UTF8String,Int64}()
> >     for l in eachline(STDIN)
> >         for w in split(l)
> >             wc[w] = get(wc, w, 0) + 1
> >         end
> >     end
> >     v = collect(wc)
> >     sort!(v, lt=customlt) # in-place sort saves a memory copy
> >     for t in v
> >         println("$(t.first)\t$(t.second)")
> >     end
> > end
> > 
> > main()
> > 
> > On Monday, November 30, 2015 at 5:31:20 PM UTC+2, Attila Zséder wrote:
> >> Hi,
> >> 
> >> 
> >> The data I'm using is part of a (Hungarian) Wikipedia dump with 5M lines
> >> of text. On this data, Python runs for 65 seconds, C++ for 35 seconds,
> >> the Julia baseline for 340 seconds, and Julia with FastAnonymous.jl for
> >> 280 seconds.
> >> (See https://github.com/juditacs/wordcount#leaderboard for details.)
> >> 
> >> Dan:
> >> I can use external packages; it's not a big issue. However, FastAnonymous
> >> didn't give results comparable to Python's.
> >> The baseline Python code I compare to is here:
> >> https://github.com/juditacs/wordcount/blob/master/python/wordcount_py2.py
> >> 
> >> 2) The community is part of the language, so it should be taken into
> >> account when making comparisons.
> >> What do you mean by this?
> >> My (our) purpose is not to judge that this language or that one is
> >> better/faster/etc. just because it is faster at unoptimized word counting.
> >> So I don't want to make any judgements or comparisons like that. This is
> >> just for fun. And even though it looks like _my_ Julia implementation of
> >> wc is not fast right now, I haven't lost interest in following what's
> >> going on with this language.
> >> 
> >> 
> >> Your other points:
> >> 1) I do this with all the other languages as well. The test runs for
> >> about 30-300 seconds. If Julia load time or anything else takes a serious
> >> amount of time, then so be it. This test is not precise; I didn't include
> >> C++ compile time, for example, but it took less than a second. Still, I
> >> felt that my implementation was the naive part and that other things were
> >> taking the time, not Julia's load time.
> >> 2) What if my test is about IO + dictionary storage? Then I have to
> >> include the printouts in my test.
> >> 3) I think a 5M-line text file is enough to avoid this kind of noise.
> >> 
> >> 
> >> 
> >> Tim:
> >> Yes, I did this code split, and with larger files it looked like, after
> >> sorting, dictionary manipulation (including hashing) took most of the
> >> time, and printing was less of an issue. But I do have to analyze this
> >> more precisely, seeing your numbers.
> >> 
> >> 
> >> Thank you all for your help!
> >> 
> >> Attila
> >> 
> >> On Mon, Nov 30, 2015 at 4:20 PM, Tim Holy <tim....@gmail.com> wrote:
> >>> If you don't want to figure out how to use the profiler, your next best
> >>> bet is to split out the pieces so you can understand where the
> >>> bottleneck is. For example:
> >>> 
> >>> function docount(io)
> >>>     wc = Dict{AbstractString,Int64}()
> >>>     for l in eachline(io)
> >>>         for w in split(l)
> >>>             wc[w] = get(wc, w, 0) + 1
> >>>         end
> >>>     end
> >>>     wc
> >>> end
> >>> 
> >>> @time open("somefile.tex") do io
> >>>     docount(io)
> >>> end;
> >>>
> >>>   0.010617 seconds (27.70 k allocations: 1.459 MB)
> >>> 
> >>> vs
> >>> 
> >>> @time open("somefile.tex") do io
> >>>     main(io)
> >>> end;
> >>>
> >>> # < lots of printed output >
> >>>
> >>>   1.233154 seconds (330.59 k allocations: 10.829 MB, 1.53% gc time)
> >>> 
> >>> (I modified your `main` to take an io input.)
> >>> 
> >>> So it's the sorting and printing that are taking 99% of the time. Most
> >>> of that turns out to be the printing.
> >>> 
> >>> --Tim
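
A rough way to confirm that breakdown (a sketch: `docount` and `customlt` are
the definitions quoted above, and "somefile.tex" is the same placeholder
filename as in Tim's example):

    wc = open(docount, "somefile.tex")
    v = collect(wc)
    @time sort!(v, lt=customlt)   # time the sort alone
    @time for t in v              # time just the printing
        println("$(t.first)\t$(t.second)")
    end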
