Re: [julia-users] DataFrames: Problems with Split-Apply-Combine strategy

Mike Innes Thu, 22 May 2014 06:04:10 -0700

Link:
http://stackoverflow.com/questions/23806758/julia-dataframes-problems-with-split-apply-combine-strategy


I definitely agree that having a greater presence on SO would be useful, so
it might be best to answer there (sorry I can't be more directly helpful,
OP)


On 22 May 2014 13:56, Paulo Castro <p.oliveira.cas...@gmail.com> wrote:

>  *I asked this question on StackOverflow, but I think I will get better
> results posting it here. We should use that platform more, so that Julia
> gets more exposure to R/Python/Matlab users who need something like it.*
>
> I have some data (from an R course assignment, but that doesn't matter)
> on which I want to use the split-apply-combine strategy, but I'm having
> some problems. The data is in a DataFrame called outcome, and each row
> represents a Hospital. Each column holds information about that hospital,
> such as name, location, rates, etc.
>
> *My objective is to obtain the Hospital with the lowest "Mortality by
> Heart Attack Rate" of each State.*
>
> I was playing around with some strategies, and ran into a problem using
> the by function:
>
> best_heart_rate(df) = sort(df, cols = :Mortality)[end,:]
>
> best_hospitals = by(hospitals, :State, best_heart_rate)
>
>  The idea was to split the hospitals DataFrame by State, sort each of the
> SubDataFrames by Mortality Rate, take the lowest one, and combine the rows
> into a new DataFrame.
>
> But when I used this strategy, I got:
>
> ERROR: no method nrow(SubDataFrame{Array{Int64,1}})
>  in sort at /home/paulo/.julia/v0.3/DataFrames/src/dataframe/sort.jl:311
>  in sort at /home/paulo/.julia/v0.3/DataFrames/src/dataframe/sort.jl:296
>  in f at none:1
>  in based_on at /home/paulo/.julia/v0.3/DataFrames/src/groupeddataframe/grouping.jl:144
>  in by at /home/paulo/.julia/v0.3/DataFrames/src/groupeddataframe/grouping.jl:202
>
> I suppose the nrow function is not implemented for SubDataFrames for a
> good reason, so I gave up on this strategy. Then I used some nastier code:
>
> best_heart_rate(df) = (df[sortperm(df[:,:Mortality] , rev=true), :])[1,:]
>
> best_hospitals = by(hospitals, :State, best_heart_rate)
>
> This seems to work. But now there is an NA problem: how can I remove the
> rows from the SubDataFrames that have NA in the Mortality column? Is there
> a better strategy to accomplish my objective?
>
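
For what it's worth, here is a minimal sketch of the same split-apply-combine
idea. It assumes a recent DataFrames.jl (1.x), not the v0.3-era version from
the trace above, so NA becomes missing and by becomes groupby plus combine.
The sample data and the Name column are made up for illustration; only State
and Mortality come from the question. Note that both snippets in the quoted
message actually pick the largest Mortality (ascending sort with [end,:], or
rev=true with [1,:]); the sketch uses argmin to get the lowest one.

```julia
using DataFrames

# Hypothetical sample data; `missing` plays the role of NA in current Julia.
hospitals = DataFrame(
    Name      = ["A", "B", "C", "D", "E"],
    State     = ["TX", "TX", "CA", "CA", "CA"],
    Mortality = [14.2, 12.1, missing, 13.5, 15.0],
)

# Drop the rows whose Mortality is missing before grouping.
complete = dropmissing(hospitals, :Mortality)

# Split by State, then keep the row with the smallest Mortality in each group.
best_hospitals = combine(groupby(complete, :State)) do sdf
    sdf[argmin(sdf.Mortality), :]
end
```

Dropping the missing rows up front keeps the per-group function simple, at the
cost of losing any State whose hospitals all lack a Mortality value.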
