[julia-users] Re: Some DataFrames questions

David Gold Wed, 20 May 2015 14:26:01 -0700

Whoops, the loop body should index each row:

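for i in firstyear:lastyear
    for row in 1:n
        # | binds tighter than the comparison operators in Julia,
        # so each comparison gets its own parentheses.
        ((df[row, :col1_*i] <= 2*df[row, :colx_*i]) |
         (df[row, :col1_*i] > 400*df[row, :colx_*i]) |
         (df[row, :col2_*i] > 5000)) && (df[row, :col_*i] = NA)
    end
end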
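
For readers on a current setup, here is a rough sketch of the same row filter written column-wise. It rests on several assumptions that are not part of the original thread: a recent DataFrames.jl release, where `NA` has been replaced by `missing` and `symbol` by `Symbol`; a hypothetical helper `colname` used instead of overloading `*`; and toy data with a single made-up year, 2001.

```julia
using DataFrames

# Hypothetical helper (not from the thread): build a column name such as
# :col1_2001 from a prefix and a year, without adding a method to Base.*.
colname(prefix, year) = Symbol(prefix, year)

# Toy data, made up for illustration. The target column must allow missing.
df = DataFrame(col_2001  = Union{Float64,Missing}[1.0, 2.0, 3.0],
               col1_2001 = [10.0, 1.0, 900.0],
               colx_2001 = [2.0, 2.0, 2.0],
               col2_2001 = [1000.0, 1000.0, 1000.0])

firstyear, lastyear = 2001, 2001

for i in firstyear:lastyear
    col1 = df[!, colname(:col1_, i)]
    colx = df[!, colname(:colx_, i)]
    col2 = df[!, colname(:col2_, i)]

    # Elementwise mask; each comparison is parenthesized before combining.
    bad = (col1 .<= 2 .* colx) .| (col1 .> 400 .* colx) .| (col2 .> 5000)

    # Blank out the flagged rows in place.
    df[bad, colname(:col_, i)] .= missing
end
```

The mask replaces the inner loop over rows, so each year costs one pass over the columns. This sketch assumes the comparison columns themselves contain no missing values; if they do, the mask would need `coalesce.(bad, false)` or similar before indexing.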




On Wednesday, May 20, 2015 at 4:21:34 PM UTC-4, David Gold wrote:
>
> I don't think the @where macro will help you in this case, since it 
> creates a new dataframe out of the selected subsets. If there is a way to 
> use the macro as-is to modify the input dataframe in place, I don't see it. 
>
> However, I don't know if you really need macros here. First, you can use 
> string interpolation to avoid writing out convert(Symbol, 
> "col1_"*string(i)):
>
> julia> i=1
> 1
>
> julia> symbol("col1_$i")
> :col1_1
>
> We can do even better by defining a shorthand. * is used for string 
> concatenation, so why not symbol concatenation? 
>
> julia> *(a::Symbol, i::Int)=symbol("$a"*"$i")
> * (generic function with 133 methods)
>
> julia> :col1_*i
> :col1_1
>
> Now you can write a loop like the following, where n is the number of rows 
> in df:
>
> for i in firstyear:lastyear
>     for row in 1:n
>         (df[:col1_*i] .<= 2*(df[:colx_*i]) | df[:col1_*i] .> 
> 400*(df[:colx_*i]) |
>         df[:col2_*i] .> 5000) && (df[row, :col_*i] = NA)    
>     end
> end
>
> Does that work for you? Let me know. I agree it's not as clean as the 
> Stata version, but I don't think it's hopeless. 
>
> Also, why do you have your year numbers in your column names? Maybe if you 
> had a single "Year" column that would then determine values for col, col1, 
> col2 and colx then you would be better off.
>
> On Wednesday, May 20, 2015 at 12:57:27 PM UTC-4, Nils Gudat wrote:
>>
>> I think I have to give up and grudgingly revert to pandas/R - I just 
>> tried to do this within a loop, dropping observations based on comparisons 
>> of a number of columns numbered by years with some transformations of other 
>> columns in the corresponding year. This is my (failed) attempt:
>>
>> for i = firstyear:lastyear
>>     @where(df, array((convert(Symbol, "col1_"*string(i)) .<= 
>> 2*convert(Symbol, "colx_"*string(i))) | 
>>                      (convert(Symbol, "col1_"*string(i)) .> 
>> 400*convert(Symbol, "colx_"*string(i))) |
>>                      (convert(Symbol, "col2_"*string(i)) .> 5000) | 
>> (convert(Symbol, "col2_"*string(i)) .< 500), false))[:col] = NA
>> end
>>
>> I think this is beyond salvation and maybe not really feasible with 
>> DataFrames at the moment. 
>> For comparison, this would be the Stata command:
>>
>> replace col`i'=. if col1_`i'<= 2*colx_`i' | col1_`i' > 400*colx_`i' | 
>> col2_`i' > 5000 | col2_`i' < 500
>>
>> Of course a highly optimized software package like Stata is an unfair 
>> comparison, but still the difference is pretty striking...
>>
>
