Hi Joosep,
The current way to get the best performance for your example is going to look
something like the following code:
    a, b, c = df[:a], df[:b], df[:c]
    T = eltype(a) # Assumes eltype(a) == eltype(b) == eltype(c)
    for i in 1:size(df, 1)
        if !(isna(a, i) || isna(b, i) || isna(c, i))
            x = a[i]::T * b[i]::T + c[i]::T
        end
    end
This should resolve all type inference problems and get you the highest
performance possible given our current codebase.
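To make the effect of the ::T assertions concrete, here is a minimal sketch in
plain Julia with no DataFrames involved. It assumes a current Julia (1.x), uses
a Vector{Any} as a stand-in for a container whose element type the compiler
cannot know, and the function names are made up for illustration.
Base.return_types asks the compiler what it infers for a call: it should
report a non-concrete type for the unasserted version and Float64 for the
asserted one.

```julia
# Stand-in for a column whose entries have an uncertain type.
vals = Any[1.5, 2.5]

# Without an assertion the compiler cannot know what v[i] is.
double_unasserted(v, i) = v[i] * 2.0

# With an assertion the result type is pinned down (and a wrong
# type raises a TypeError at run time instead of slowing things down).
double_asserted(v, i) = (v[i]::Float64) * 2.0

# Ask the compiler what it infers for each version.
inferred_unasserted = Base.return_types(double_unasserted, (Vector{Any}, Int))
inferred_asserted = Base.return_types(double_asserted, (Vector{Any}, Int))

println(inferred_unasserted)
println(inferred_asserted)
println(double_asserted(vals, 2))
```

The assertion does not make the lookup itself cheaper; it only lets everything
downstream of the lookup compile to well-typed code.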
In general, there are three main bottlenecks when working with
DataArrays/DataFrames that we’re trying to find ways to work around:
(1) Indexing into the full DataFrame on each iteration is slightly wasteful,
since you can cache the identity of the “:a” column before starting the loop’s
body. This is a fairly minor speedup, but it’s still worth doing.
(2) Checking whether entries are NA is currently very wasteful because the
entries, absent type assertions, are of uncertain type. The new isna(DataArray,
Integer) method lets you check, per-entry, whether that specific entry is NA.
This is much faster because it is completely well-typed: the check only ever
returns a Bool.
(3) Indexing into a DataArray right now isn’t type-stable, because the outcome
might be a value of type T or it might be an NA of type NAType. Adding the type
assertions mentioned by Ivar will help a lot with this, once you guarantee that
there are no NAs in the DataArray.
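The three points above can be illustrated with a toy model. This is only a
sketch under stated assumptions: ToyDataArray, toy_isna and sum_complete_rows
are hypothetical names, not the DataArrays API, and the code assumes a current
Julia (1.x). It stores values in a typed vector next to a Bool mask, so the NA
check never has to look at a value of uncertain type. Unlike the loop earlier
in this message, it accumulates a sum so that there is a result to inspect.

```julia
# Hypothetical stand-in for a DataArray: typed values plus a parallel NA mask.
struct ToyDataArray{T}
    data::Vector{T}
    na::BitVector
end

# Point (2): the NA check reads only the mask, so it always returns a Bool.
toy_isna(d::ToyDataArray, i::Integer) = d.na[i]

# Point (1): the columns are looked up once and passed in, not per iteration.
# Point (3): values are read from typed storage only after the NA check passes.
function sum_complete_rows(a::ToyDataArray{T}, b::ToyDataArray{T},
                           c::ToyDataArray{T}) where {T}
    total = zero(T)
    for i in eachindex(a.data)
        if !(toy_isna(a, i) || toy_isna(b, i) || toy_isna(c, i))
            total += a.data[i] * b.data[i] + c.data[i]
        end
    end
    return total
end

# Row 2 is skipped (c is NA) and row 3 is skipped (a is NA).
a = ToyDataArray([1.0, 2.0, 3.0], BitVector([false, false, true]))
b = ToyDataArray([4.0, 5.0, 6.0], BitVector([false, false, false]))
c = ToyDataArray([0.5, 0.5, 0.5], BitVector([false, true, false]))

result = sum_complete_rows(a, b, c)
println(result)
```

Putting the loop inside a function that takes the columns as typed arguments
means the compiler sees concrete element types for the whole loop body.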
- John
On Feb 1, 2014, at 6:28 AM, David van Leeuwen <[email protected]>
wrote:
> Hi,
>
> I saw you define a function f(::DataFrameRow) inside the timing loop. I
> wonder whether the Julia JIT re-compiles this local function each time, or
> whether it caches the compiled version. I don't really know.
>
> Apparently there is a performance penalty for anonymous functions, as in
> map(x->x*x, 1:10), but I don't know if this extends to locally defined
> functions.
>
> Cheers,
>
> ---david
>
> On Saturday, February 1, 2014 3:08:18 PM UTC+1, Joosep Pata wrote:
> Thanks!
>
> I wasn’t aware of eachrow, this seems quite close to what I had in mind. I
> ran some simplistic timing checks [1], and the eachrow method is 2-3x faster.
> I also tried the type asserts, but they didn’t seem to make a difference. I
> forgot to mention earlier that my data can also be NA, so it’s not that easy
> for the compiler.
>
> [1]
> http://nbviewer.ipython.org/urls/dl.dropbox.com/s/mj8g1s0ewmpd1b6/dataframe_iter_speed.ipynb?create=1
>
>
> Cheers,
> Joosep
>
> On 01 Feb 2014, at 15:11, David van Leeuwen <[email protected]> wrote:
>
> > Hi,
> >
> > There now is the eachrow iterator which might do what you want more
> > efficiently.
> >
> > df = DataFrame(a=1:2, b=2:3)
> > func(r::DataFrameRow) = r["a"] * r["b"]
> > for r in eachrow(df)
> >     println(func(r))
> > end
> > You can also index the DataFrameRow r with integers, e.g. r[1] * r[2].
> >
> > Cheers,
> >
> > ---david
> >
> > On Saturday, February 1, 2014 1:25:04 PM UTC+1, Joosep Pata wrote:
> > I would like to do an explicit loop over a large DataFrame and evaluate a
> > function which depends on a subset of the columns in an arbitrary way. What
> > would be the fastest way to accomplish this? Presently, I’m doing something
> > like
> >
> > ~~~
> > f(df::DataFrame, i::Integer) = df[i, :a] * df[i, :b] + df[i, :c]
> >
> > for i=1:nrow(df)
> > x = f(df, i)
> > end
> > ~~~
> >
> > which according to Profile creates a major bottleneck.
> >
> > Would it make sense to somehow pre-create an immutable type corresponding
> > to a single row (my data are BitsKind), and run a compiled function on
> > these row-objects with strong typing?
> >
> > Thanks in advance for any advice,
> > Joosep
>