Re: [julia-users] Slow reading of file

Jacob Quinn Sat, 14 May 2016 07:46:15 -0700

I'm actually one of the people trying to speed up things like integer
parsing overall. Currently in the CSV.jl package, we have some beta-mode fast
parsing functions. I generated a file with:


open("test_integers","w") do f
    for i = 1:7_068_650
        println(f, rand(Int16))
    end
end

And reading it with the CSV.jl package gives me:

julia> @time csv = CSV.csv("test_integers";types=[Int])
  1.156055 seconds (7.07 M allocations: 168.536 MB, 6.68% gc time)
Data.Table:
7068649x1 Data.Schema:
 19591
 Int64

Column Data:
[-2,-23575,-6091,-1421,-4229,27266,-15925,20891,19254,4630  …
 25060,-20681,2218,-16672,14473,-14427,4868,14841,7874,6445]
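
For illustration only (this is not the CSV.jl code, and `parse_ints` is a
made-up name): the general idea behind parsing without allocating a String
per line can be sketched as a single pass over the raw bytes. The sketch
assumes Julia 1.x syntax, ASCII input, and does no overflow or
malformed-input checking:

```julia
# Hypothetical helper (not part of CSV.jl): parse whitespace-separated
# decimal integers straight from raw bytes, with no String per token.
# Assumes ASCII input; no overflow or malformed-input checks.
function parse_ints(bytes::AbstractVector{UInt8})
    out = Int[]
    n = 0               # value of the integer being accumulated
    neg = false         # saw a leading '-'
    indigits = false    # currently inside a run of digits
    for b in bytes
        if b == UInt8('-')
            neg = true
        elseif UInt8('0') <= b <= UInt8('9')
            n = 10n + Int(b - UInt8('0'))
            indigits = true
        else
            # Any other byte (newline, space, '\r') ends the current token.
            indigits && push!(out, neg ? -n : n)
            n = 0
            neg = false
            indigits = false
        end
    end
    # Flush the last integer if the input does not end with a separator.
    indigits && push!(out, neg ? -n : n)
    return out
end
```

With the file generated above, this would be called as
`parse_ints(read("test_integers"))`, since `read(filename)` returns the
file contents as a `Vector{UInt8}` in Julia 1.x.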


On Sat, May 14, 2016 at 7:22 AM, Yichao Yu wrote:

> On Sat, May 14, 2016 at 8:31 AM, Ford Ox wrote:
> > type Tokenizer
> >     tokens::Array{ASCIIString, 1}
> >     index::Int
> >     Tokenizer(s::ASCIIString) = new(split(strip(s)), 0)
> > end
> >
> > Julia still runs 11 seconds...
>
> The main cost is not coming from the dynamic dispatch, but from the
> allocation of strings and arrays.
>
> The `new(split(strip(s)), 0)` above allocates a new string (`strip`),
> allocates an array of `SubString`s (`split`), and then converts
> it to (and therefore allocates) an array of `ASCIIString`. I believe
> the Java tokenizer is likely much more efficient than this.
>
> Changing it to something like
>
> ```
> using Compat
>
> ##    Tokenizer ##
>
> type Tokenizer
>     string::Compat.ASCIIString
>     index::Int
>     len::Int
>     function Tokenizer(s::Compat.ASCIIString)
>         i = 1
>         len = length(s)
>         while i <= len && isspace(s[i])
>             i += 1
>         end
>         new(s, i - 1, len)
>     end
> end
>
> isempty(t::Tokenizer) = t.len == t.index
>
> function next!(t::Tokenizer)
>     i = j = t.index + 1
>     len = t.len
>     s = t.string
>     while i <= len && !isspace(s[i])
>         i += 1
>     end
>     subs = SubString(s, j, i - 1)
>     while i <= len && isspace(s[i])
>         i += 1
>     end
>     t.index = i - 1
>     subs
> end
> ```
>
> reduces the time from 17s to 3s for me.
>
> Changing the `i += 1` to the more general version `i = nextind(s, i)`
> increases the runtime to ~4s, and I think improving this is one of the
> reasons Stefan is working on the String stuff.
>
> The integer parsing also needs some work; removing it reduces the
> runtime to 1.7s, and I believe at least @simonbyrne (and maybe many
> others) are working on that.
>
>
> >
> > On Saturday, May 14, 2016 at 14:08:48 UTC+2, Milan Bouchet-Valat wrote:
> >>
> >> On Saturday, May 14, 2016 at 05:01 -0700, Ford Ox wrote:
> >> > Fixed. Julia now takes 11 seconds to finish:
> >> > type Tokenizer
> >> >     tokens::Array{AbstractString, 1}
> >> >     index::Int
> >> >     Tokenizer(s::AbstractString) = new(split(strip(s)), 0)
> >> > end
> >> >
> >> > type Buffer
> >> >     stream::IOStream
> >> >     tokenizer::Tokenizer
> >> >     Buffer(stream) = new(stream, Tokenizer(""))
> >> > end
> >> AbstractString is still not a concrete type. Use
> >> UTF8String/ASCIIString, or do this instead:
> >>
> >> type Tokenizer{T<:AbstractString}
> >>      tokens::Array{T, 1}
> >>      index::Int
> >>      Tokenizer(s::AbstractString) = new(split(strip(s)), 0)
> >> end
> >>
> >> type Buffer{T<:AbstractString}
> >>     stream::IOStream
> >>     tokenizer::Tokenizer{T}
> >>     Buffer(stream) = new(stream, Tokenizer(""))
> >> end
> >>
> >> (Note that "" will create an ASCIIString, use UTF8String("") if you need
> >> to support non-ASCII chars.)
> >>
> >>
> >> Regards
> >>
> >> >
> >> >
> >> > > Your types have totally untyped fields – the compiler has to emit
> >> > > very pessimistic code about this. Rule of thumb: locations (fields,
> >> > > collections) should be as concretely typed as possible; parameters
> >> > > don't need to be.
> >> > >
> >> > > On Sat, May 14, 2016 at 1:36 PM, Ford Ox wrote:
> >> > > > I have written the exact same code in Java and Julia for reading
> >> > > > integers from a file.
> >> > > > The Julia code was A LOT slower (12 seconds vs 1.16 seconds).
> >> > > >
> >> > > > import Base.isempty, Base.close
> >> > > >
> >> > > > ##    Tokenizer ##
> >> > > >
> >> > > > type Tokenizer
> >> > > >     tokens
> >> > > >     index
> >> > > >     Tokenizer(s::AbstractString) = new(split(strip(s)), 0)
> >> > > > end
> >> > > >
> >> > > > isempty(t::Tokenizer) = length(t.tokens) == t.index
> >> > > >
> >> > > > function next!(t::Tokenizer)
> >> > > >     t.index += 1
> >> > > >     t.tokens[t.index]
> >> > > > end
> >> > > >
> >> > > > ## Buffer ##
> >> > > >
> >> > > > type Buffer
> >> > > >     stream
> >> > > >     tokenizer
> >> > > >     Buffer(stream) = new(stream, [])
> >> > > > end
> >> > > >
> >> > > > function next!(b::Buffer)
> >> > > >     if isempty(b.tokenizer)
> >> > > >         b.tokenizer = Tokenizer(readline(b.stream))
> >> > > >     end
> >> > > >     next!(b.tokenizer)
> >> > > > end
> >> > > >
> >> > > > close!(b::Buffer) = close(b.stream)
> >> > > > nexttype!(t, b::Buffer) = parse(t, next!(b))
> >> > > > nextint!(b::Buffer) = nexttype!(Int, b)
> >> > > >
> >> > > > cd("pathToMyFile")
> >> > > > b = Buffer(open("File"))
> >> > > >
> >> > > > function readall!(b::Buffer)
> >> > > >     for _ in 1:nextint!(b)
> >> > > >         nextint!(b)
> >> > > >     end
> >> > > >     close!(b)
> >> > > > end
> >> > > >
> >> > > > @time readall!(b)
> >> > > >
> >> > > >
> >> > > > > 12.314114 seconds (84.84 M allocations: 3.793 GB, 11.47% gc time)
> >> > > > package alg;
> >> > > >
> >> > > > import java.io.*;
> >> > > > import java.util.StringTokenizer;
> >> > > >
> >> > > > public class Try {
> >> > > >     StringTokenizer tokenizer;
> >> > > >     BufferedReader reader;
> >> > > >
> >> > > >     public static void main(String[] args) throws IOException {
> >> > > >         String name = "fileName";
> >> > > >         Try reader = new Try(new File(name));
> >> > > >
> >> > > >         long itime = System.nanoTime();
> >> > > >         int N = reader.nextInt();
> >> > > >         for(int n=0; n < N; n++)
> >> > > >             reader.nextInt();
> >> > > >         System.out.println((double) (System.nanoTime() - itime) / 1000000000);
> >> > > >
> >> > > >     }
> >> > > >
> >> > > >     Try(File f) throws FileNotFoundException {
> >> > > >         tokenizer = new StringTokenizer("");
> >> > > >         reader = new BufferedReader(new FileReader(f));
> >> > > >     }
> >> > > >
> >> > > >     String next() throws IOException {
> >> > > >         if(!tokenizer.hasMoreTokens()) tokenize();
> >> > > >         return tokenizer.nextToken();
> >> > > >     }
> >> > > >
> >> > > >     void tokenize() throws IOException {
> >> > > >         tokenizer = new StringTokenizer(reader.readLine());
> >> > > >     }
> >> > > >
> >> > > >     int nextInt() throws IOException {
> >> > > >         return Integer.parseInt(next());
> >> > > >     }
> >> > > > }
> >> > > > >  1.169884868
> >> > > >
> >> > > > The file has 7 068 650 lines. On each line is one integer that is
> >> > > > not bigger than 2^16.
> >> > > >
> >> > >
>
