Link: http://stackoverflow.com/questions/23806758/julia-dataframes-problems-with-split-apply-combine-strategy
I definitely agree that having a greater presence on SO would be useful, so it might be best to answer there (sorry I can't be more directly helpful, OP).

On 22 May 2014 13:56, Paulo Castro <p.oliveira.cas...@gmail.com> wrote:
> *I asked this question on StackOverflow, but I think I will get better
> results posting it here. We should use that platform more, so Julia is more
> exposed to R/Python/Matlab users needing something like it.*
>
> I have some data (from an R course assignment, but that doesn't matter)
> that I want to apply the split-apply-combine strategy to, but I'm having
> some problems. The data is in a DataFrame called outcome, and each row
> represents a hospital. Each column holds a piece of information about that
> hospital, like name, location, rates, etc.
>
> *My objective is to obtain the hospital with the lowest "Mortality by
> Heart Attack Rate" in each state.*
>
> I was playing around with some strategies, and ran into a problem using the
> by function:
>
> best_heart_rate(df) = sort(df, cols = :Mortality)[end,:]
>
> best_hospitals = by(hospitals, :State, best_heart_rate)
>
> The idea was to split the hospitals DataFrame by State, sort each of the
> SubDataFrames by mortality rate, get the lowest one, and combine the rows
> into a new DataFrame.
>
> But when I used this strategy, I got:
>
> ERROR: no method nrow(SubDataFrame{Array{Int64,1}})
>  in sort at /home/paulo/.julia/v0.3/DataFrames/src/dataframe/sort.jl:311
>  in sort at /home/paulo/.julia/v0.3/DataFrames/src/dataframe/sort.jl:296
>  in f at none:1
>  in based_on at /home/paulo/.julia/v0.3/DataFrames/src/groupeddataframe/grouping.jl:144
>  in by at /home/paulo/.julia/v0.3/DataFrames/src/groupeddataframe/grouping.jl:202
>
> I suppose the nrow function is not implemented for SubDataFrames for a
> good reason, so I gave up on this strategy.
> Then I used some nastier code:
>
> best_heart_rate(df) = (df[sortperm(df[:,:Mortality] , rev=true), :])[1,:]
>
> best_hospitals = by(hospitals, :State, best_heart_rate)
>
> This seems to work. But now there is an NA problem: how can I remove the
> rows from the SubDataFrames that have NA in the Mortality column? Is there
> a better strategy to accomplish my objective?
>