Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages

Blake Johnson Wed, 22 Jan 2014 11:22:50 -0800

Sure, but the resulting expression is *much* more verbose. I just noticed 
that all expression-based indexing was on the chopping block. What is left 
after all this?


I can see how axing these features would make DataFrames.jl easier to 
maintain, but I found the expression stuff to present a rather nice 
interface.

--Blake

On Tuesday, January 21, 2014 11:51:03 AM UTC-5, John Myles White wrote:
>
> Can you do something like df["ColA"] = f(df)?
>
>  — John
>
> On Jan 21, 2014, at 8:48 AM, Blake Johnson 
> <[email protected]>
> wrote:
>
> I use within! pretty frequently. What should I be using instead if that is 
> on the chopping block?
>
> --Blake
>
> On Tuesday, January 21, 2014 7:42:39 AM UTC-5, tshort wrote:
>>
>> I also agree with your approach, John. Based on your criteria, here 
>> are some other things to consider for the chopping block. 
>>
>> - expression-based indexing 
>> - NamedArray (you already have an issue on this) 
>> - with, within, based_on and variants 
>> - @transform, @DataFrame 
>> - select, filter 
>> - DataStream 
>>
>> Many of these were attempts to ease syntax via delayed evaluation. We 
>> can either do without or try to implement something like LINQ. 
>>
>>
>>
>> On Mon, Jan 20, 2014 at 7:02 PM, Kevin Squire <[email protected]> 
>> wrote: 
>> > Hi John, 
>> > 
>> > I agree with pretty much everything you have written here, and really 
>> > appreciate that you've taken the lead in cleaning things up and getting 
>> > us on track.
>> > 
>> > Cheers! 
>> >    Kevin 
>> > 
>> > 
>> > On Mon, Jan 20, 2014 at 1:57 PM, John Myles White <[email protected]>
>> > wrote:
>> >> 
>> >> As I said in another thread recently, I am currently the lead maintainer
>> >> of more packages than I can keep up with. I think it’s been useful for me
>> >> to start so many different projects, but I can’t keep maintaining most of
>> >> my packages given my current work schedule.
>> >> 
>> >> Without Simon Kornblith, Kevin Squire, Sean Garborg and several others
>> >> doing amazing work to keep DataArrays and DataFrames going, much of our
>> >> basic data infrastructure would have already become completely unusable.
>> >> But even with the great work that’s been done on those packages recently,
>> >> there’s still a lot of additional design work required. I’d like to free
>> >> up some of my time to do that work.
>> >> 
>> >> To keep things moving forward, I’d like to propose a couple of radical
>> >> New Year’s resolutions for the packages I work on.
>> >> 
>> >> (1) We need to stop adding functionality and focus entirely on improving
>> >> the quality and documentation of our existing functionality. We have way
>> >> too much prototype code in DataFrames that I can’t keep up with. I’m
>> >> about to make a pull request for DataFrames that will remove everything
>> >> related to column groupings, database-style indexing and Blocks.jl
>> >> support. I absolutely want to see us push all of those ideas forward in
>> >> the future, but they need to happen in unmerged forks or separate
>> >> packages until we have the resources needed to support them. Right now,
>> >> they make an overwhelming maintenance challenge even more onerous.
>> >> 
>> >> (2) We can’t support anything other than the master branch of most
>> >> JuliaStats packages except possibly for Distributions. I personally
>> >> don’t have the time to simultaneously keep stuff working with Julia 0.2
>> >> and Julia 0.3. Moreover, many of our basic packages aren’t mature enough
>> >> to justify supporting older versions. We should do a better job of
>> >> supporting our master releases and not invest precious time trying to
>> >> support older releases.
>> >> 
>> >> (3) We need to make more of DataArrays and DataFrames reflect the Julian
>> >> worldview. Lots of our code uses an interface that is incongruous with
>> >> the interfaces found in Base. Even worse, a large chunk of code has
>> >> type-stability problems that make it very slow, when comparable code
>> >> that uses normal Arrays is 100x faster. We need to develop new idioms
>> >> and new strategies for making code that interacts with
>> >> type-destabilizing NA’s faster. More generally, we need to make
>> >> DataArrays and DataFrames fit in better with Julia when Julia and R
>> >> disagree. Following R’s lead has often led us astray because R doesn’t
>> >> share Julia’s strengths or weaknesses.
>> >> 
>> >> (4) Going forward, there should be exactly one way to do most things.
>> >> The worst part of our current codebase is that there are multiple ways
>> >> to express the same computation, but (a) some of them are unusably slow
>> >> and (b) some of them don’t ever get tested or maintained properly. This
>> >> is closely linked to the excess proliferation of functionality described
>> >> in Resolution 1 above. We need to start removing stuff from our packages
>> >> and making the parts we keep both reliable and fast.
>> >> 
>> >> I think we can push DataArrays and DataFrames to 1.0 status by the end
>> >> of this year. But I think we need to adopt a new approach if we’re going
>> >> to get there. Lots of stuff needs to get deprecated and what remains
>> >> needs a lot more testing, benchmarking and documentation.
>> >> 
>> >>  — John 
>> >> 
>> >
>
>
>
