Thanks everyone. The isna(DataArray, Integer) method will certainly be useful.
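
For the archives, below is a self-contained sketch of the pattern John describes in his reply, with made-up toy columns; it assumes all three columns share an eltype and that NAs may appear anywhere:

~~~
using DataFrames

# Toy columns (values made up); b has one missing entry.
b = DataArray([2.0, 3.0, 4.0])
b[2] = NA
df = DataFrame(a = [1.0, 2.0, 3.0], b = b, c = [0.5, 1.5, 2.5])

# (1) Cache the columns once, before the loop starts.
a, b, c = df[:a], df[:b], df[:c]
T = eltype(a)  # assumes eltype(a) == eltype(b) == eltype(c)

for i in 1:size(df, 1)
    # (2) The per-entry check isna(column, i) only ever returns a Bool.
    if !(isna(a, i) || isna(b, i) || isna(c, i))
        # (3) Type assertions make the arithmetic type-stable.
        x = a[i]::T * b[i]::T + c[i]::T
        println(x)
    end
end
~~~
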
On 02 Feb 2014, at 01:19, John Myles White <[email protected]> wrote:

> Hi Joosep,
>
> The current way to get the best performance for your example is going to
> look something like the following code:
>
> a, b, c = df[:a], df[:b], df[:c]
> T = eltype(a) # Assumes eltype(a) == eltype(b) == eltype(c)
> for i in 1:size(df, 1)
>     if !(isna(a, i) || isna(b, i) || isna(c, i))
>         x = a[i]::T * b[i]::T + c[i]::T
>     end
> end
>
> This should resolve all type inference problems and get you the highest
> performance possible given our current codebase.
>
> In general, there are three main bottlenecks when working with
> DataArrays/DataFrames that we're trying to find ways to work around:
>
> (1) Indexing into the full DataFrame on each iteration is slightly
> wasteful, since you can cache the identity of the ":a" column before
> starting the loop's body. This is a fairly minor speedup, but it's still
> worth doing.
>
> (2) Checking whether entries are NA is currently very wasteful because the
> entries, absent type assertions, are of uncertain type. The new
> isna(DataArray, Integer) method lets you check, per entry, whether a
> specific entry is NA. This is much faster because it is completely
> well-typed: only Bools ever result from this computation.
>
> (3) Indexing into a DataArray right now isn't type-stable, because the
> outcome might be a value of type T or it might be an NA of type NAType.
> Adding the type assertions mentioned by Ivar will help a lot with this,
> once you guarantee that there are no NAs in the DataArray.
>
> - John
>
> On Feb 1, 2014, at 6:28 AM, David van Leeuwen <[email protected]> wrote:
>
>> Hi,
>>
>> I saw you define a function f(::DataFrameRow) inside the timing loop. I
>> wonder whether the Julia JIT re-compiles this local function each time,
>> or whether it caches the compiled version. I don't really know.
>>
>> Apparently there is a performance penalty for anonymous functions, as in
>> map(x->x*x, 1:10), but I don't know whether this extends to locally
>> defined functions.
>>
>> Cheers,
>>
>> ---david
>>
>> On Saturday, February 1, 2014 3:08:18 PM UTC+1, Joosep Pata wrote:
>> Thanks!
>>
>> I wasn't aware of eachrow; this seems quite close to what I had in mind.
>> I ran some simplistic timing checks [1], and the eachrow method is 2-3x
>> faster. I also tried the type asserts, but they didn't seem to make a
>> difference. I forgot to mention earlier that my data can also be NA, so
>> it's not that easy for the compiler.
>>
>> [1]
>> http://nbviewer.ipython.org/urls/dl.dropbox.com/s/mj8g1s0ewmpd1b6/dataframe_iter_speed.ipynb?create=1
>>
>> Cheers,
>> Joosep
>>
>> On 01 Feb 2014, at 15:11, David van Leeuwen <[email protected]> wrote:
>>
>> > Hi,
>> >
>> > There now is the eachrow iterator, which might do what you want more
>> > efficiently.
>> >
>> > df = DataFrame(a=1:2, b=2:3)
>> > func(r::DataFrameRow) = r["a"] * r["b"]
>> > for r in eachrow(df)
>> >     println(func(r))
>> > end
>> >
>> > You can also use integer indices into the DataFrameRow r: r[1] * r[2].
>> >
>> > Cheers,
>> >
>> > ---david
>> >
>> > On Saturday, February 1, 2014 1:25:04 PM UTC+1, Joosep Pata wrote:
>> > I would like to do an explicit loop over a large DataFrame and evaluate
>> > a function which depends on a subset of the columns in an arbitrary
>> > way. What would be the fastest way to accomplish this? Presently, I'm
>> > doing something like
>> >
>> > ~~~
>> > f(df::DataFrame, i::Integer) = df[i, :a] * df[i, :b] + df[i, :c]
>> >
>> > for i = 1:nrow(df)
>> >     x = f(df, i)
>> > end
>> > ~~~
>> >
>> > which, according to Profile, creates a major bottleneck.
>> >
>> > Would it make sense to somehow pre-create an immutable type
>> > corresponding to a single row (my data are BitsKind), and run a
>> > compiled function on these row-objects with strong typing?
>> >
>> > Thanks in advance for any advice,
>> > Joosep
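
And for completeness, a rough sketch of the immutable-row idea from my original question (quoted above): the Row type is made up, nothing like it exists in DataFrames, and it reuses the toy df from the sketch at the top. The point is that rows are built only after the NA checks, so every field is a plain Float64 and the inner function is fully typed:

~~~
# Hypothetical hand-written row type for three Float64 columns.
immutable Row
    a::Float64
    b::Float64
    c::Float64
end

# Fully typed, so the JIT compiles a fast specialized method for it.
g(r::Row) = r.a * r.b + r.c

a, b, c = df[:a], df[:b], df[:c]
for i in 1:size(df, 1)
    if !(isna(a, i) || isna(b, i) || isna(c, i))
        # Constructing the row converts each entry to Float64 up front.
        x = g(Row(a[i], b[i], c[i]))
    end
end
~~~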
