I don't think the @where macro will help you in this case, since it creates
a new dataframe from the selected rows rather than modifying the input. If
there is a way to use the macro as-is to modify the input dataframe in
place, I don't see it.
However, I don't know if you really need macros here. First, you can use
string interpolation to avoid writing out convert(Symbol,
"col1_"*string(i)):
julia> i=1
1
julia> symbol("col1_$i")
:col1_1
We can do even better by defining a shorthand. * is used for string
concatenation, so why not symbol concatenation?
julia> *(a::Symbol, i::Int)=symbol("$a"*"$i")
* (generic function with 133 methods)
julia> :col1_*i
:col1_1
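Extending * for Symbol and Int changes a Base operator for everything in the session, so a small named helper may be a safer way to get the same effect. Here is a minimal sketch, assuming a recent Julia release where the constructor is spelled Symbol (the lowercase symbol above is the 2015 spelling). The name colsym and the year values are made up for illustration and are not part of any package:

```julia
# Hypothetical helper: build a column name such as :col1_2001 from a prefix and a year.
# Symbol(args...) joins the string forms of its arguments into one symbol.
colsym(prefix, year::Integer) = Symbol(prefix, year)

firstyear, lastyear = 2001, 2003   # example values, not from the original post
names_col1 = [colsym("col1_", y) for y in firstyear:lastyear]
println(names_col1)
```

With this helper, df[colsym("col1_", i)] would stand in for df[:col1_*i] in the loop.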
Now you can write a loop like the following, where n is the number of rows
in df:
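for i in firstyear:lastyear
    for row in 1:n
        # Set col to NA for this year when any of the range checks is hit
        if df[row, :col1_*i] <= 2*df[row, :colx_*i] ||
           df[row, :col1_*i] > 400*df[row, :colx_*i] ||
           df[row, :col2_*i] > 5000 ||
           df[row, :col2_*i] < 500
            df[row, :col_*i] = NA
        end
    end
end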
Does that work for you? Let me know. I agree it's not as clean as the Stata
version, but I don't think it's hopeless.
Also, why do you have your year numbers in your column names? Maybe you
would be better off with a single "Year" column, so that each row holds the
values of col, col1, col2 and colx for one year.
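To illustrate the single "Year" column idea: with one row per observation and year, the loop over years disappears and the rule becomes one elementwise mask. This is only a rough sketch, using plain vectors in place of dataframe columns and assuming a recent Julia release where missing plays the role of NA. All data values are made up:

```julia
# Long layout: one entry per (observation, year); values are invented for illustration.
year = [2001, 2001, 2002, 2002]   # just another column; the rule no longer needs it
col1 = [1.0, 50.0, 900.0, 30.0]
col2 = [600.0, 700.0, 800.0, 6000.0]
colx = [1.0, 1.0, 2.0, 1.0]
col  = Union{Missing, Float64}[10.0, 20.0, 30.0, 40.0]

# Each comparison is parenthesized because | binds tighter than <= and > in Julia.
mask = (col1 .<= 2 .* colx) .| (col1 .> 400 .* colx) .|
       (col2 .> 5000) .| (col2 .< 500)

col[mask] .= missing   # same intent as Stata's `replace col = . if ...`
println(col)
```

The same mask expression should carry over to dataframe columns, since they behave like vectors under broadcasting.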
On Wednesday, May 20, 2015 at 12:57:27 PM UTC-4, Nils Gudat wrote:
>
> I think I have to give up and grudgingly revert to pandas/R - I just tried
> to do this within a loop, dropping observations based on comparisons of a
> number of columns numbered by years with some transformations of other
> columns in the corresponding year. This is my (failed) attempt:
>
> for i = firstyear:lastyear
> @where(df, array((convert(Symbol, "col1_"*string(i)) .<=
> 2*convert(Symbol, "colx_"*string(i))) |
> (convert(Symbol, "col1_"*string(i)) .>
> 400*convert(Symbol, "colx_"*string(i))) |
> (convert(Symbol, "col2_"*string(i)) .> 5000) |
> (convert(Symbol, "col2_"*string(i)) .< 500), false))[:col] = NA
> end
>
> I think this is beyond salvation and maybe not really feasible with
> DataFrames at the moment.
> For comparison, this would be the Stata command:
>
> replace col`i'=. if col1_`i'<= 2*colx_`i' | col1_`i' > 400*colx_`i' |
> col2_`i' > 5000 | col2_`i' < 500
>
> Of course a highly optimized software package like Stata is an unfair
> comparison, but still the difference is pretty striking...
>