Thanks everyone. The isna(DataArray, Integer) method will certainly be useful.
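
For the archives, below is a self-contained sketch of the pattern John describes in his reply, with made-up toy columns; it assumes all three columns share an eltype and that NAs may appear anywhere:

~~~
using DataFrames

# Toy columns (values made up); b has one missing entry.
b = DataArray([2.0, 3.0, 4.0])
b[2] = NA
df = DataFrame(a = [1.0, 2.0, 3.0], b = b, c = [0.5, 1.5, 2.5])

# (1) Cache the columns once, before the loop starts.
a, b, c = df[:a], df[:b], df[:c]
T = eltype(a)  # assumes eltype(a) == eltype(b) == eltype(c)

for i in 1:size(df, 1)
    # (2) The per-entry check isna(column, i) only ever returns a Bool.
    if !(isna(a, i) || isna(b, i) || isna(c, i))
        # (3) Type assertions make the arithmetic type-stable.
        x = a[i]::T * b[i]::T + c[i]::T
        println(x)
    end
end
~~~
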
On 02 Feb 2014, at 01:19, John Myles White <[email protected]> wrote:

> Hi Joosep,
>
> The current way to get the best performance for your example is going to
> look something like the following code:
>
> a, b, c = df[:a], df[:b], df[:c]
> T = eltype(a) # Assumes eltype(a) == eltype(b) == eltype(c)
> for i in 1:size(df, 1)
>     if !(isna(a, i) || isna(b, i) || isna(c, i))
>         x = a[i]::T * b[i]::T + c[i]::T
>     end
> end
>
> This should resolve all type inference problems and get you the highest
> performance possible given our current codebase.
>
> In general, there are three main bottlenecks when working with
> DataArrays/DataFrames that we're trying to find ways to work around:
>
> (1) Indexing into the full DataFrame on each iteration is slightly
> wasteful, since you can cache the identity of the ":a" column before
> starting the loop's body. This is a fairly minor speedup, but it's still
> worth doing.
>
> (2) Checking whether entries are NA is currently very wasteful because the
> entries, absent type assertions, are of uncertain type. The new
> isna(DataArray, Integer) method lets you check, per entry, whether a
> specific entry is NA. This is much faster because it is completely
> well-typed: only Bools ever result from this computation.
>
> (3) Indexing into a DataArray right now isn't type-stable, because the
> outcome might be a value of type T or it might be an NA of type NAType.
> Adding the type assertions mentioned by Ivar will help a lot with this,
> once you guarantee that there are no NAs in the DataArray.
>
> - John
>
> On Feb 1, 2014, at 6:28 AM, David van Leeuwen <[email protected]> wrote:
>
>> Hi,
>>
>> I saw you define a function f(::DataFrameRow) inside the timing loop. I
>> wonder whether the Julia JIT re-compiles this local function each time,
>> or whether it caches the compiled version. I don't really know.
>>
>> Apparently there is a performance penalty for anonymous functions, as in
>> map(x->x*x, 1:10), but I don't know whether this extends to locally
>> defined functions.
>>
>> Cheers,
>>
>> ---david
>>
>> On Saturday, February 1, 2014 3:08:18 PM UTC+1, Joosep Pata wrote:
>> Thanks!
>>
>> I wasn't aware of eachrow; this seems quite close to what I had in mind.
>> I ran some simplistic timing checks [1], and the eachrow method is 2-3x
>> faster. I also tried the type asserts, but they didn't seem to make a
>> difference. I forgot to mention earlier that my data can also be NA, so
>> it's not that easy for the compiler.
>>
>> [1]
>> http://nbviewer.ipython.org/urls/dl.dropbox.com/s/mj8g1s0ewmpd1b6/dataframe_iter_speed.ipynb?create=1
>>
>> Cheers,
>> Joosep
>>
>> On 01 Feb 2014, at 15:11, David van Leeuwen <[email protected]> wrote:
>>
>> > Hi,
>> >
>> > There now is the eachrow iterator, which might do what you want more
>> > efficiently.
>> >
>> > df = DataFrame(a=1:2, b=2:3)
>> > func(r::DataFrameRow) = r["a"] * r["b"]
>> > for r in eachrow(df)
>> >     println(func(r))
>> > end
>> >
>> > You can also use integer indices into the DataFrameRow r: r[1] * r[2].
>> >
>> > Cheers,
>> >
>> > ---david
>> >
>> > On Saturday, February 1, 2014 1:25:04 PM UTC+1, Joosep Pata wrote:
>> > I would like to do an explicit loop over a large DataFrame and evaluate
>> > a function which depends on a subset of the columns in an arbitrary
>> > way. What would be the fastest way to accomplish this? Presently, I'm
>> > doing something like
>> >
>> > ~~~
>> > f(df::DataFrame, i::Integer) = df[i, :a] * df[i, :b] + df[i, :c]
>> >
>> > for i = 1:nrow(df)
>> >     x = f(df, i)
>> > end
>> > ~~~
>> >
>> > which, according to Profile, creates a major bottleneck.
>> >
>> > Would it make sense to somehow pre-create an immutable type
>> > corresponding to a single row (my data are BitsKind), and run a
>> > compiled function on these row-objects with strong typing?
>> >
>> > Thanks in advance for any advice,
>> > Joosep
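
And for completeness, a rough sketch of the immutable-row idea from my original question (quoted above): the Row type is made up, nothing like it exists in DataFrames, and it reuses the toy df from the sketch at the top. The point is that rows are built only after the NA checks, so every field is a plain Float64 and the inner function is fully typed:

~~~
# Hypothetical hand-written row type for three Float64 columns.
immutable Row
    a::Float64
    b::Float64
    c::Float64
end

# Fully typed, so the JIT compiles a fast specialized method for it.
g(r::Row) = r.a * r.b + r.c

a, b, c = df[:a], df[:b], df[:c]
for i in 1:size(df, 1)
    if !(isna(a, i) || isna(b, i) || isna(c, i))
        # Constructing the row converts each entry to Float64 up front.
        x = g(Row(a[i], b[i], c[i]))
    end
end
~~~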
