"Do blocks" are one of my favourite things about Julia. They're explained in
the docs:
<http://docs.julialang.org/en/release-0.4/manual/functions/#do-block-syntax-for-function-arguments>.
Basically, a do-block is just a convenient way of defining a function (the
code that comes after `do`) and passing it as the first argument to another
function (in this case, `by`). `by` goes over the dataframe, splits it into
3 sub-dataframes (one for each Species in the iris dataset), and calls the
do-block on each of them. Then the return values (the value of the last
expression in the do-block) get concatenated together to form the final
result. The code I really wanted to write is:
using RDatasets
df = dataset("datasets", "iris")
# For each species
df2 = by(df, :Species) do sub_df
    sub_df = copy(sub_df) # don't modify the original dataframe
    # Add a :cumulative_PetalLength column
    sub_df[:cumulative_PetalLength] = cumsum(sub_df[:PetalLength])
    # Return the new sub-dataframe
    sub_df
end
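but unfortunately, this code doesn't work with DataFrames.jl: as I mentioned
in my earlier message, `sub_df` is a SubDataFrame, which can neither have a
new column added nor be converted to a DataFrame.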
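
To make the "a do-block is just a function argument" point concrete, here is
a minimal sketch that uses only Base Julia, no DataFrames. Note that
`apply_per_group` is a made-up stand-in I'm using to mimic the
split/apply/concatenate behaviour of `by`; it is not part of DataFrames.jl,
and the real `by` works on dataframes rather than plain vectors.

```julia
# A do-block is sugar for passing an anonymous function as the FIRST argument.
# These two calls to `map` are equivalent:
squares_do = map([1, 2, 3]) do x
    x^2   # the value of the last expression is the function's return value
end
squares_lambda = map(x -> x^2, [1, 2, 3])

# Hypothetical stand-in for `by` (NOT a DataFrames.jl function): call `f` on
# each group, then concatenate the results into one vector.
function apply_per_group(f, groups)
    vcat([f(g) for g in groups]...)
end

# The do-block becomes `f`; each group `g` plays the role of `sub_df`.
result = apply_per_group([[1, 2, 3], [10, 20]]) do g
    cumsum(g)   # cumulative sum restarts for every group
end
```

Because the function has to come first for the do-block syntax to work,
`by(df, :Species) do sub_df ... end` is the same as calling `by` with the
function as its first argument.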
On Wednesday, May 4, 2016 at 4:42:41 AM UTC-4, Ben Southwood wrote:
>
> Thanks Cedric, that worked very well. I'm having a little trouble
> following the documentation as to how the "by ... do ..." structure
> actually works. Would you mind explaining what the code is doing?
>
> On Tuesday, May 3, 2016 at 10:07:10 PM UTC-4, Cedric St-Jean wrote:
>>
>> Something like
>>
>> using RDatasets
>> df = dataset("datasets", "iris")
>> df[:cumulative_PetalLength] = 0.0
>> by(df, :Species) do sub_df
>>     sub_df[:cumulative_PetalLength] = cumsum(sub_df[:PetalLength])
>>     sub_df
>> end
>>
>> though I hope someone can provide a more elegant solution. `sub_df` is a
>> SubDataFrame, and those objects can neither have a new column added nor
>> be converted to a DataFrame.
>>
>> On Tuesday, May 3, 2016 at 4:22:29 PM UTC-4, Ben Southwood wrote:
>>>
>>> I have the following dataframe with values of the form
>>>
>>> date1,label1,qty1_1
>>> date2,label1,qty1_2
>>> date3,label1,qty1_3
>>> ....
>>> dateN,label1,qty1_N
>>> date1,label2,qty2_1
>>> date2,label2,qty2_2
>>> date3,label2,qty2_3
>>> ....
>>> dateN,label2,qty2_N
>>> ....
>>>
>>>
>>>
>>> I would like to take the cumulative sum of the qtys separately for each
>>> label, so that the running total restarts at each new label. Then I'd have
>>>
>>> date1,label1,cuml1_1
>>> date2,label1,cuml1_2
>>> date3,label1,cuml1_3
>>> ....
>>> dateN,label1,cuml1_N
>>> date1,label2,cuml2_1
>>>
>>>
>>>
>>> This way I can use gadfly and run the following plot
>>>
>>>
>>> plot(x=grouped[:date],y=grouped[:cuml_sum],color=grouped[:label],Geom.line)
>>>
>>>
>>> and have each cuml sum have its own colouring by date. I'm stuck on
>>> how to do this simply without creating lookups. Any help? Thanks!
>>>
>>>
>>>