*I asked this question on StackOverflow, but I think I will get better 
answers by posting it here. We should use that platform more, so that Julia 
gets more exposure to R/Python/Matlab users who need something like it.*

I have some data (from an R course assignment, but that doesn't matter) to 
which I want to apply the split-apply-combine strategy, but I'm having some 
problems. The data is in a DataFrame called hospitals, and each row 
represents a hospital. Each column holds one piece of information about that 
hospital, such as name, location, rates, etc.

*My objective is to obtain the hospital with the lowest "Mortality by Heart 
Attack Rate" in each State.*

I was playing around with some strategies, and ran into a problem using the 
by function:

best_heart_rate(df) = sort(df, cols = :Mortality)[end,:] 
best_hospitals = by(hospitals, :State, best_heart_rate)

The idea was to split the hospitals DataFrame by State, sort each of the 
SubDataFrames by Mortality, take the lowest one, and combine the resulting 
rows into a new DataFrame.

But when I used this strategy, I got:

ERROR: no method nrow(SubDataFrame{Array{Int64,1}})
 in sort at /home/paulo/.julia/v0.3/DataFrames/src/dataframe/sort.jl:311
 in sort at /home/paulo/.julia/v0.3/DataFrames/src/dataframe/sort.jl:296
 in f at none:1
 in based_on at 
/home/paulo/.julia/v0.3/DataFrames/src/groupeddataframe/grouping.jl:144
 in by at 
/home/paulo/.julia/v0.3/DataFrames/src/groupeddataframe/grouping.jl:202

I suppose the nrow function is not implemented for SubDataFrames for a good 
reason, so I gave up on this strategy. Then I used some nastier code:

best_heart_rate(df) = (df[sortperm(df[:,:Mortality] , rev=true), :])[1,:]
best_hospitals = by(hospitals, :State, best_heart_rate)

This seems to work. But now there is an NA problem: how can I remove the rows 
of the SubDataFrames that have NA in the Mortality column? Is there a 
better strategy to accomplish my objective?
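
For comparison, here is a minimal sketch of the same split-apply-combine idea 
as it might look on a recent DataFrames.jl release (1.x). That version is an 
assumption; it is not the v0.3-era API shown in the traceback above. In 
current releases NA is spelled missing, and by has been replaced by groupby 
plus combine. The toy table and its values are made up purely for 
illustration. The sketch first drops the rows whose Mortality is missing and 
then keeps, for each State, the row with the smallest Mortality.

```julia
using DataFrames

# Hypothetical toy data standing in for the real hospitals table.
hospitals = DataFrame(
    Name      = ["A", "B", "C", "D"],
    State     = ["TX", "TX", "NY", "NY"],
    Mortality = [14.2, missing, 12.9, 15.1],
)

# Drop the rows whose Mortality is missing before grouping.
valid = dropmissing(hospitals, :Mortality)

# Split by State, pick the row with the lowest Mortality in each group,
# and combine the selected rows into a new DataFrame.
best = combine(groupby(valid, :State)) do sdf
    sdf[argmin(sdf.Mortality), :]
end

println(best)
```

Using argmin avoids sorting each group at all. Note also that sorting in 
ascending order and taking the last row, or sorting with rev=true and taking 
the first row, selects the highest value rather than the lowest.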
