Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages

Kevin Squire Wed, 22 Jan 2014 15:19:20 -0800

Got it.  I was thinking of the more verbose (but still useful)

df[(df["colA"] > 4) & !isna(df["colB"]), :]


Kevin


On Wed, Jan 22, 2014 at 3:10 PM, John Myles White
<[email protected]> wrote:

> The idealized expression interface offers things like (up to reordering):
>
> with(df, a + b * x)
>
> where a and b are variables in the caller's scope and x is a column of df.
>
> In practice, we've had to hack this sort of thing together to offer things
> like
>
> with(df, :($a + $b * x))
>
> That's because we need to pass quoted expressions and we also need to tell the
> system which variables are in the caller's scope.
>
> More generally, I'd refer to any operation that passes expressions around
> and asks other functions to evaluate them with an ad hoc scope as
> expression-based operations.
>
> R offers very deep support for this in the language.
>
>  -- John
>
> On Jan 22, 2014, at 2:48 PM, Kevin Squire <[email protected]> wrote:
>
> Maybe I misinterpreted the term "expression-based interface".
>
>
> On Wed, Jan 22, 2014 at 2:33 PM, John Myles White <[email protected]> wrote:
>
>> My impression is that Pandas didn't support anything like delayed
>> evaluation. Is that wrong?
>>
>> I'm aware that the resulting expressions are a lot more verbose. That
>> definitely sucks.
>>
>> I'd love to see strong proposals for how we're going to do a better job
>> of making code shorter going forward. But too much of our current codebase
>> is buggy, unable to handle edge cases, slow and undocumented. I think it's
>> much more important that we have one way of doing things that actually
>> works as advertised for every Julia user than two ways of doing things,
>> each of which is slightly broken and performs worse than R and Pandas.
>>
>> As I've been saying lately, I'm burning out on maintaining so much Julia
>> code. If someone else wants to take charge of my projects, I'm ok with
>> that. But if I'm going to be doing the work going forward, I need to devote
>> my energies to making a small number of things work really well. Once we
>> get our core functionality solid, I'll be comfortable getting fancier stuff
>> working again.
>>
>>  -- John
>>
>> On Jan 22, 2014, at 1:06 PM, Kevin Squire <[email protected]> wrote:
>>
>> I'm also a fan of the expression-based interface (mostly because I'm used
>> to similar things in Pandas).  I haven't looked at that code, though, so I
>> can't comment on the complexity.
>>
>> Kevin
>>
>>
>> On Wed, Jan 22, 2014 at 11:18 AM, Blake Johnson <[email protected]> wrote:
>>
>>> Sure, but the resulting expression is *much* more verbose. I just
>>> noticed that all expression-based indexing was on the chopping block. What
>>> is left after all this?
>>>
>>> I can see how axing these features would make DataFrames.jl easier to
>>> maintain, but I found the expression stuff to present a rather nice
>>> interface.
>>>
>>> --Blake
>>>
>>>
>>> On Tuesday, January 21, 2014 11:51:03 AM UTC-5, John Myles White wrote:
>>>
>>>> Can you do something like df["ColA"] = f(df)?
>>>>
>>>>  — John
>>>>
>>>>
>>>> On Jan 21, 2014, at 8:48 AM, Blake Johnson <[email protected]>
>>>> wrote:
>>>>
>>>> I use within! pretty frequently. What should I be using instead if that
>>>> is on the chopping block?
>>>>
>>>> --Blake
>>>>
>>>> On Tuesday, January 21, 2014 7:42:39 AM UTC-5, tshort wrote:
>>>>>
>>>>> I also agree with your approach, John. Based on your criteria, here
>>>>> are some other things to consider for the chopping block.
>>>>>
>>>>> - expression-based indexing
>>>>> - NamedArray (you already have an issue on this)
>>>>> - with, within, based_on and variants
>>>>> - @transform, @DataFrame
>>>>> - select, filter
>>>>> - DataStream
>>>>>
>>>>> Many of these were attempts to ease syntax via delayed evaluation. We
>>>>> can either do without or try to implement something like LINQ.
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Jan 20, 2014 at 7:02 PM, Kevin Squire <[email protected]>
>>>>> wrote:
>>>>> > Hi John,
>>>>> >
>>>>> > I agree with pretty much everything you have written here, and really
>>>>> > appreciate that you've taken the lead in cleaning things up and
>>>>> > getting us on track.
>>>>> >
>>>>> > Cheers!
>>>>> >    Kevin
>>>>> >
>>>>> >
>>>>> > On Mon, Jan 20, 2014 at 1:57 PM, John Myles White
>>>>> > <johnmyl...@gmail.com> wrote:
>>>>> >>
>>>>> >> As I said in another thread recently, I am currently the lead
>>>>> >> maintainer of more packages than I can keep up with. I think it’s been
>>>>> >> useful for me to start so many different projects, but I can’t keep
>>>>> >> maintaining most of my packages given my current work schedule.
>>>>> >>
>>>>> >> Without Simon Kornblith, Kevin Squire, Sean Garborg and several others
>>>>> >> doing amazing work to keep DataArrays and DataFrames going, much of our
>>>>> >> basic data infrastructure would have already become completely unusable.
>>>>> >> But even with the great work that’s been done on those packages
>>>>> >> recently, there’s still a lot of additional design work required. I’d
>>>>> >> like to free up some of my time to do that work.
>>>>> >>
>>>>> >> To keep things moving forward, I’d like to propose a couple of radical
>>>>> >> New Year’s resolutions for the packages I work on.
>>>>> >>
>>>>> >> (1) We need to stop adding functionality and focus entirely on
>>>>> >> improving the quality and documentation of our existing functionality.
>>>>> >> We have way too much prototype code in DataFrames that I can’t keep up
>>>>> >> with. I’m about to make a pull request for DataFrames that will remove
>>>>> >> everything related to column groupings, database-style indexing and
>>>>> >> Blocks.jl support. I absolutely want to see us push all of those ideas
>>>>> >> forward in the future, but they need to happen in unmerged forks or
>>>>> >> separate packages until we have the resources needed to support them.
>>>>> >> Right now, they make an overwhelming maintenance challenge even more
>>>>> >> onerous.
>>>>> >>
>>>>> >> (2) We can’t support anything other than the master branch of most
>>>>> >> JuliaStats packages except possibly for Distributions. I personally
>>>>> >> don’t have the time to simultaneously keep stuff working with Julia 0.2
>>>>> >> and Julia 0.3. Moreover, many of our basic packages aren’t mature enough
>>>>> >> to justify supporting older versions. We should do a better job of
>>>>> >> supporting our master releases and not invest precious time trying to
>>>>> >> support older releases.
>>>>> >>
>>>>> >> (3) We need to make more of DataArrays and DataFrames reflect the
>>>>> >> Julian worldview. Lots of our code uses an interface that is incongruous
>>>>> >> with the interfaces found in Base. Even worse, a large chunk of code has
>>>>> >> type-stability problems that make it very slow, when comparable code
>>>>> >> that uses normal Arrays is 100x faster. We need to develop new idioms
>>>>> >> and new strategies for making code that interacts with
>>>>> >> type-destabilizing NA’s faster. More generally, we need to make
>>>>> >> DataArrays and DataFrames fit in better with Julia when Julia and R
>>>>> >> disagree. Following R’s lead has often led us astray because R doesn’t
>>>>> >> share Julia’s strengths or weaknesses.
>>>>> >>
>>>>> >> (4) Going forward, there should be exactly one way to do most things.
>>>>> >> The worst part of our current codebase is that there are multiple ways
>>>>> >> to express the same computation, but (a) some of them are unusably slow
>>>>> >> and (b) some of them don’t ever get tested or maintained properly. This
>>>>> >> is closely linked to the excess proliferation of functionality described
>>>>> >> in Resolution 1 above. We need to start removing stuff from our packages
>>>>> >> and making the parts we keep both reliable and fast.
>>>>> >>
>>>>> >> I think we can push DataArrays and DataFrames to 1.0 status by the end
>>>>> >> of this year. But I think we need to adopt a new approach if we’re going
>>>>> >> to get there. Lots of stuff needs to get deprecated and what remains
>>>>> >> needs a lot more testing, benchmarking and documentation.
>>>>> >>
>>>>> >>  — John
>>>>> >>
>>>>> >
>>>>
>>>>
>>>>
>>
>>
>
>
