Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages

John Myles White Wed, 22 Jan 2014 15:54:47 -0800

That's exactly the kind of indexing I'd like to encourage using until we get 
our core functionality cleaned up. Nothing special required except Boolean 
indexing, which is easy to make fast and doesn't have weird scoping issues.
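
For illustration, here is a minimal sketch of that Boolean-indexing pattern in current Julia syntax. It is not the DataFrames API discussed in this thread: it assumes plain Base vectors as stand-in columns and `missing` in place of NA, and the names colA, colB and mask are hypothetical.

```julia
# Stand-in columns (assumption: plain Base vectors, `missing` instead of NA).
colA = [1, 5, 7, 3, 9]
colB = [10, missing, 30, 40, missing]

# Build one Boolean per row: colA > 4 and colB not missing.
mask = (colA .> 4) .& (.!ismissing.(colB))

# Index every column with the same mask to select the matching rows.
selectedA = colA[mask]
selectedB = colB[mask]
```

Only the third row satisfies both conditions, so each selected column has a single element. Nothing is quoted or evaluated in another scope; the mask is an ordinary array built in the caller's scope.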


 -- John

On Jan 22, 2014, at 3:18 PM, Kevin Squire <[email protected]> wrote:

> Got it.  I was thinking of the more verbose (but still useful)
> 
> df[(df["colA"] > 4) & !isna(df["colB"]), :]
> 
> Kevin
> 
> 
> On Wed, Jan 22, 2014 at 3:10 PM, John Myles White <[email protected]> 
> wrote:
> The idealized expression interface offers things like (up to reordering):
> 
> with(df, a + b * x)
> 
> where a and b are variables in the caller's scope and x is a column of df.
> 
> In practice, we've had to hack this sort of thing together to offer things 
> like
> 
> with(df, :($a + $b * x))
> 
> That's because we need to pass quoted expressions and we also need to tell the 
> system which variables are in the caller's scope.
> 
> More generally, I'd refer to any operation that passes expressions around and 
> asks other functions to evaluate them with an ad hoc scope as 
> expression-based operations.
> 
> R offers very deep support for this in the language.
> 
>  -- John
> 
> On Jan 22, 2014, at 2:48 PM, Kevin Squire <[email protected]> wrote:
> 
>> Maybe I misinterpreted the term "expression-based interface".
>> 
>> 
>> On Wed, Jan 22, 2014 at 2:33 PM, John Myles White <[email protected]> 
>> wrote:
>> My impression is that Pandas didn't support anything like delayed 
>> evaluation. Is that wrong?
>> 
>> I'm aware that the resulting expressions are a lot more verbose. That 
>> definitely sucks.
>> 
>> I'd love to see strong proposals for how we're going to do a better job of 
>> making code shorter going forward. But too much of our current codebase is 
>> buggy, unable to handle edge cases, slow and undocumented. I think it's much 
>> more important that we have one way of doing things that actually works as 
>> advertised for every Julia user than two ways of doing things, each of which 
>> is slightly broken and performs worse than R and Pandas.
>> 
>> As I've been saying lately, I'm burning out on maintaining so much Julia code. 
>> If someone else wants to take charge of my projects, I'm ok with that. But 
>> if I'm going to be doing the work going forward, I need to devote my 
>> energies to making a small number of things work really well. Once we get 
>> our core functionality solid, I'll be comfortable getting fancier stuff 
>> working again.
>> 
>>  -- John
>> 
>> On Jan 22, 2014, at 1:06 PM, Kevin Squire <[email protected]> wrote:
>> 
>>> I'm also a fan of the expression-based interface (mostly because I'm used 
>>> to similar things in Pandas).  I haven't looked at that code, though, so I 
>>> can't comment on the complexity.
>>> 
>>> Kevin
>>> 
>>> 
>>> On Wed, Jan 22, 2014 at 11:18 AM, Blake Johnson <[email protected]> 
>>> wrote:
>>> Sure, but the resulting expression is much more verbose. I just noticed 
>>> that all expression-based indexing was on the chopping block. What is left 
>>> after all this?
>>> 
>>> I can see how axing these features would make DataFrames.jl easier to 
>>> maintain, but I found the expression stuff to present a rather nice 
>>> interface.
>>> 
>>> --Blake
>>> 
>>> 
>>> On Tuesday, January 21, 2014 11:51:03 AM UTC-5, John Myles White wrote:
>>> Can you do something like df["ColA"] = f(df)?
>>> 
>>>  — John
>>> 
>>> 
>>> On Jan 21, 2014, at 8:48 AM, Blake Johnson <[email protected]> wrote:
>>> 
>>>> I use within! pretty frequently. What should I be using instead if that is 
>>>> on the chopping block?
>>>> 
>>>> --Blake
>>>> 
>>>> On Tuesday, January 21, 2014 7:42:39 AM UTC-5, tshort wrote:
>>>> I also agree with your approach, John. Based on your criteria, here 
>>>> are some other things to consider for the chopping block. 
>>>> 
>>>> - expression-based indexing 
>>>> - NamedArray (you already have an issue on this) 
>>>> - with, within, based_on and variants 
>>>> - @transform, @DataFrame 
>>>> - select, filter 
>>>> - DataStream 
>>>> 
>>>> Many of these were attempts to ease syntax via delayed evaluation. We 
>>>> can either do without or try to implement something like LINQ. 
>>>> 
>>>> 
>>>> 
>>>> On Mon, Jan 20, 2014 at 7:02 PM, Kevin Squire <[email protected]> wrote: 
>>>> > Hi John, 
>>>> > 
>>>> > I agree with pretty much everything you have written here, and really 
>>>> > appreciate that you've taken the lead in cleaning things up and getting 
>>>> > us 
>>>> > on track. 
>>>> > 
>>>> > Cheers! 
>>>> >    Kevin 
>>>> > 
>>>> > 
>>>> > On Mon, Jan 20, 2014 at 1:57 PM, John Myles White <[email protected]> 
>>>> > wrote: 
>>>> >> 
>>>> >> As I said in another thread recently, I am currently the lead 
>>>> >> maintainer 
>>>> >> of more packages than I can keep up with. I think it’s been useful for 
>>>> >> me to 
>>>> >> start so many different projects, but I can’t keep maintaining most of 
>>>> >> my 
>>>> >> packages given my current work schedule. 
>>>> >> 
>>>> >> Without Simon Kornblith, Kevin Squire, Sean Garborg and several others 
>>>> >> doing amazing work to keep DataArrays and DataFrames going, much of our 
>>>> >> basic data infrastructure would have already become completely 
>>>> >> unusable. But 
>>>> >> even with the great work that’s been done on those packages recently, 
>>>> >> there’s 
>>>> >> still a lot of additional design work required. I’d like to free up some 
>>>> >> of my 
>>>> >> time to do that work. 
>>>> >> 
>>>> >> To keep things moving forward, I’d like to propose a couple of radical 
>>>> >> New 
>>>> >> Year’s resolutions for the packages I work on. 
>>>> >> 
>>>> >> (1) We need to stop adding functionality and focus entirely on 
>>>> >> improving 
>>>> >> the quality and documentation of our existing functionality. We have 
>>>> >> way too 
>>>> >> much prototype code in DataFrames that I can’t keep up with. I’m about 
>>>> >> to 
>>>> >> make a pull request for DataFrames that will remove everything related 
>>>> >> to 
>>>> >> column groupings, database-style indexing and Blocks.jl support. I 
>>>> >> absolutely want to see us push all of those ideas forward in the 
>>>> >> future, but 
>>>> >> they need to happen in unmerged forks or separate packages until we 
>>>> >> have the 
>>>> >> resources needed to support them. Right now, they make an overwhelming 
>>>> >> maintenance challenge even more onerous. 
>>>> >> 
>>>> >> (2) We can’t support anything other than the master branch of most 
>>>> >> JuliaStats packages except possibly for Distributions. I personally 
>>>> >> don’t 
>>>> >> have the time to simultaneously keep stuff working with Julia 0.2 and 
>>>> >> Julia 
>>>> >> 0.3. Moreover, many of our basic packages aren’t mature enough to 
>>>> >> justify 
>>>> >> supporting older versions. We should do a better job of supporting our 
>>>> >> master releases and not invest precious time trying to support older 
>>>> >> releases. 
>>>> >> 
>>>> >> (3) We need to make more of DataArrays and DataFrames reflect the 
>>>> >> Julian 
>>>> >> worldview. Lots of our code uses an interface that is incongruous with 
>>>> >> the 
>>>> >> interfaces found in Base. Even worse, a large chunk of code has 
>>>> >> type-stability problems that make it very slow, when comparable code 
>>>> >> that 
>>>> >> uses normal Arrays is 100x faster. We need to develop new idioms and 
>>>> >> new 
>>>> >> strategies for making code that interacts with type-destabilizing NA’s 
>>>> >> faster. More generally, we need to make DataArrays and DataFrames fit 
>>>> >> in 
>>>> >> better with Julia when Julia and R disagree. Following R’s lead has 
>>>> >> often 
>>>> >> led us astray because R doesn’t share Julia’s strengths or weaknesses. 
>>>> >> 
>>>> >> (4) Going forward, there should be exactly one way to do most things. 
>>>> >> The 
>>>> >> worst part of our current codebase is that there are multiple ways to 
>>>> >> express the same computation, but (a) some of them are unusably slow 
>>>> >> and (b) 
>>>> >> some of them don’t ever get tested or maintained properly. This is 
>>>> >> closely 
>>>> >> linked to the excess proliferation of functionality described in 
>>>> >> Resolution 
>>>> >> 1 above. We need to start removing stuff from our packages and making 
>>>> >> the 
>>>> >> parts we keep both reliable and fast. 
>>>> >> 
>>>> >> I think we can push DataArrays and DataFrames to 1.0 status by the end 
>>>> >> of 
>>>> >> this year. But I think we need to adopt a new approach if we’re going 
>>>> >> to get 
>>>> >> there. Lots of stuff needs to get deprecated and what remains needs a 
>>>> >> lot 
>>>> >> more testing, benchmarking and documentation. 
>>>> >> 
>>>> >>  — John 
>>>> >> 
>>>> >
>>> 
>>> 
>> 
>> 
> 
>
