Thanks! I wasn’t aware of eachrow; this seems quite close to what I had in mind. I ran some simplistic timing checks [1], and the eachrow method is 2-3x faster. I also tried the type asserts, but they didn’t seem to make a difference. I forgot to mention earlier that my data can also be NA, so it’s not that easy for the compiler.
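For concreteness, the comparison I timed was along these lines (a simplified sketch, not the notebook itself: the column names and sizes are made up, and the isna check stands in for how I handle the NAs):

~~~
using DataFrames

# A toy frame with one NA entry; the real data is larger and wider.
n = 100000
df = DataFrame(a = rand(n), b = rand(n))
df[1, :a] = NA

# Indexed access: every df[i, :a] goes through the DataFrame, which the
# compiler cannot type-infer, so each iteration boxes the values.
f(df::DataFrame, i::Integer) = isna(df[i, :a]) ? 0.0 : df[i, :a] * df[i, :b]

function loop_indexed(df)
    s = 0.0
    for i = 1:nrow(df)
        s += f(df, i)
    end
    s
end

# Row iterator: the DataFrameRow avoids redoing the column lookup on
# every access; in my checks this came out 2-3x faster.
g(r) = isna(r["a"]) ? 0.0 : r["a"] * r["b"]

function loop_eachrow(df)
    s = 0.0
    for r in eachrow(df)
        s += g(r)
    end
    s
end

@time loop_indexed(df)
@time loop_eachrow(df)
~~~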
[1] http://nbviewer.ipython.org/urls/dl.dropbox.com/s/mj8g1s0ewmpd1b6/dataframe_iter_speed.ipynb?create=1

Cheers,
Joosep

On 01 Feb 2014, at 15:11, David van Leeuwen <[email protected]> wrote:

> Hi,
>
> There now is the eachrow iterator, which might do what you want more
> efficiently.
>
> df = DataFrame(a=1:2, b=2:3)
> func(r::DataFrameRow) = r["a"] * r["b"]
> for r in eachrow(df)
>     println(func(r))
> end
>
> You can also use integer indices for the DataFrameRow r: r[1] * r[2].
>
> Cheers,
>
> ---david
>
> On Saturday, February 1, 2014 1:25:04 PM UTC+1, Joosep Pata wrote:
>
> I would like to do an explicit loop over a large DataFrame and evaluate a
> function which depends on a subset of the columns in an arbitrary way. What
> would be the fastest way to accomplish this? Presently, I’m doing something
> like
>
> ~~~
> f(df::DataFrame, i::Integer) = df[i, :a] * df[i, :b] + df[i, :c]
>
> for i = 1:nrow(df)
>     x = f(df, i)
> end
> ~~~
>
> which according to Profile creates a major bottleneck.
>
> Would it make sense to somehow pre-create an immutable type corresponding to
> a single row (my data are BitsKind), and run a compiled function on these
> row-objects with strong typing?
>
> Thanks in advance for any advice,
> Joosep
