Hi Joosep,

Your question prompted a more extensive discussion on the DataFrames issue tracker:
https://github.com/JuliaStats/DataFrames.jl/issues/523

You might find some of the discussion there useful for figuring out ways to optimize performance going forward. The quick summary is that all of the approaches outlined in this thread are slower than the best available approaches, but those approaches require some non-obvious coding strategies and knowledge of DataFrames internals. Hopefully we’ll be able to expose some better interfaces for this kind of work in the near future.

— John

On Feb 2, 2014, at 7:24 AM, Joosep Pata <[email protected]> wrote:

> Thanks everyone. The isna(DataArray, Integer) method will certainly be useful.
>
> On 02 Feb 2014, at 01:19, John Myles White <[email protected]> wrote:
>
>> Hi Joosep,
>>
>> The current way to get the best performance for your example is going to
>> look something like the following code:
>>
>> a, b, c = df[:a], df[:b], df[:c]
>> T = eltype(a) # Assumes eltype(a) == eltype(b) == eltype(c)
>> for i in 1:size(df, 1)
>>     if !(isna(a, i) || isna(b, i) || isna(c, i))
>>         x = a[i]::T * b[i]::T + c[i]::T
>>     end
>> end
>>
>> This should resolve all type-inference problems and get you the highest
>> performance possible given our current codebase.
>>
>> In general, there are three main bottlenecks when working with
>> DataArrays/DataFrames that we’re trying to find ways to work around:
>>
>> (1) Indexing into the full DataFrame on each iteration is slightly wasteful,
>> since you can cache the identity of the :a column before starting the
>> loop’s body. This is a fairly minor speedup, but it’s still worth doing.
>>
>> (2) Checking whether entries are NA is currently very wasteful because the
>> entries, absent type assertions, are of uncertain type. The new
>> isna(DataArray, Integer) method lets you check, per entry, whether that
>> specific entry is NA. This is much faster because it’s completely well-typed:
>> only Bools ever result from this computation.
>>
>> (3) Indexing into a DataArray right now isn’t type-stable, because the
>> outcome might be a value of type T or it might be an NA of type NAType.
>> Adding the type assertions mentioned by Ivar will help a lot with this, once
>> you guarantee that there are no NA’s in the DataArray.
>>
>> — John
>>
>> On Feb 1, 2014, at 6:28 AM, David van Leeuwen <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I saw you define a function f(::DataFrameRow) inside the timing loop. I
>>> wonder whether the Julia JIT recompiles this local function each time, or
>>> whether it caches the compiled version. I don't really know.
>>>
>>> Apparently there is a performance penalty for anonymous functions, as in
>>> map(x -> x * x, 1:10), but I don't know whether this extends to locally
>>> defined functions.
>>>
>>> Cheers,
>>>
>>> ---david
>>>
>>> On Saturday, February 1, 2014 3:08:18 PM UTC+1, Joosep Pata wrote:
>>> Thanks!
>>>
>>> I wasn’t aware of eachrow; this seems quite close to what I had in mind. I
>>> ran some simplistic timing checks [1], and the eachrow method is 2-3x
>>> faster. I also tried the type asserts, but they didn’t seem to make a
>>> difference. I forgot to mention earlier that my data can also be NA, so
>>> it’s not that easy for the compiler.
>>>
>>> [1] http://nbviewer.ipython.org/urls/dl.dropbox.com/s/mj8g1s0ewmpd1b6/dataframe_iter_speed.ipynb?create=1
>>>
>>> Cheers,
>>> Joosep
>>>
>>> On 01 Feb 2014, at 15:11, David van Leeuwen <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> There now is the eachrow iterator, which might do what you want more
>>>> efficiently:
>>>>
>>>> df = DataFrame(a=1:2, b=2:3)
>>>> func(r::DataFrameRow) = r["a"] * r["b"]
>>>> for r in eachrow(df)
>>>>     println(func(r))
>>>> end
>>>>
>>>> You can also use integer indices for the DataFrameRow r: r[1] * r[2].
>>>>
>>>> Cheers,
>>>>
>>>> ---david
>>>>
>>>> On Saturday, February 1, 2014 1:25:04 PM UTC+1, Joosep Pata wrote:
>>>> I would like to do an explicit loop over a large DataFrame and evaluate a
>>>> function which depends on a subset of the columns in an arbitrary way.
>>>> What would be the fastest way to accomplish this? Presently, I’m doing
>>>> something like
>>>>
>>>> ~~~
>>>> f(df::DataFrame, i::Integer) = df[i, :a] * df[i, :b] + df[i, :c]
>>>>
>>>> for i = 1:nrow(df)
>>>>     x = f(df, i)
>>>> end
>>>> ~~~
>>>>
>>>> which according to Profile creates a major bottleneck.
>>>>
>>>> Would it make sense to somehow pre-create an immutable type corresponding
>>>> to a single row (my data are BitsKind), and run a compiled function on
>>>> these row-objects with strong typing?
>>>>
>>>> Thanks in advance for any advice,
>>>> Joosep
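[Editor's note: the isna(DataArray, Integer) API discussed above is from 2014-era DataArrays and no longer runs as written; in later Julia releases NA was replaced by missing. The following is a minimal, self-contained sketch of the same pattern John describes — cache the columns, check each entry for missingness with a well-typed predicate, and use type assertions in the loop body — using only Base Julia. The rowsum name and the Float64 element type are illustrative assumptions, not part of the thread.]

~~~
# Columns cached as plain Vectors, mirroring a, b, c = df[:a], df[:b], df[:c].
# Union{Float64, Missing} plays the role of a 2014-era DataArray{Float64}.
a = Union{Float64, Missing}[1.0, missing, 3.0]
b = Union{Float64, Missing}[2.0, 5.0, 4.0]
c = Union{Float64, Missing}[0.5, 1.0, missing]

function rowsum(a, b, c)
    T = Float64            # assumed common element type, as in the thread
    total = zero(T)
    for i in eachindex(a)
        # Per-entry missingness check (the counterpart of isna(a, i)):
        # returns only Bool, so this branch is completely well-typed.
        if !(ismissing(a[i]) || ismissing(b[i]) || ismissing(c[i]))
            # Type assertions keep the loop body type-stable.
            total += a[i]::T * b[i]::T + c[i]::T
        end
    end
    return total
end

rowsum(a, b, c)  # only row 1 is missing-free: 1.0 * 2.0 + 0.5 = 2.5
~~~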
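[Editor's note: Joosep's closing question — pre-creating an immutable type for a single row and running a compiled function on it — can be sketched as follows in modern Julia syntax (struct in place of 2014's immutable). Row and f are hypothetical names for illustration; this only works once the columns are guaranteed free of NA/missing values, as John notes for the type-assertion approach.]

~~~
# Strongly typed immutable row: field access is fully type-stable,
# so f compiles to a tight method with no type uncertainty.
struct Row
    a::Float64
    b::Float64
    c::Float64
end

f(r::Row) = r.a * r.b + r.c

# Columns as plain, missing-free vectors:
as = [1.0, 2.0]; bs = [3.0, 4.0]; cs = [5.0, 6.0]

# Build a Row per iteration and apply the compiled function to it.
results = [f(Row(as[i], bs[i], cs[i])) for i in eachindex(as)]
# 1.0*3.0 + 5.0 = 8.0 and 2.0*4.0 + 6.0 = 14.0
~~~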
