[julia-users] New Year's resolutions for DataArrays, DataFrames and other packages

John Myles White Mon, 20 Jan 2014 13:58:10 -0800

As I said in another thread recently, I am currently the lead maintainer of 
more packages than I can keep up with. I think it’s been useful for me to start 
so many different projects, but I can’t keep maintaining most of my packages 
given my current work schedule.


Without Simon Kornblith, Kevin Squire, Sean Garborg and several others doing 
amazing work to keep DataArrays and DataFrames going, much of our basic data 
infrastructure would have already become completely unusable. But even with the 
great work that’s been done on those packages recently, there’s still a lot of 
additional design work required. I’d like to free up some of my time to do that 
work.

To keep things moving forward, I’d like to propose a couple of radical New 
Year’s resolutions for the packages I work on.

(1) We need to stop adding functionality and focus entirely on improving the 
quality and documentation of our existing functionality. We have way too much 
prototype code in DataFrames that I can’t keep up with. I’m about to make a 
pull request for DataFrames that will remove everything related to column 
groupings, database-style indexing and Blocks.jl support. I absolutely want to 
see us push all of those ideas forward in the future, but they need to happen 
in unmerged forks or separate packages until we have the resources needed to 
support them. Right now, they make an overwhelming maintenance challenge even 
more onerous.

(2) We can’t support anything other than the master branch of most JuliaStats 
packages except possibly for Distributions. I personally don’t have the time to 
simultaneously keep stuff working with Julia 0.2 and Julia 0.3. Moreover, many 
of our basic packages aren’t mature enough to justify supporting older 
versions. We should do a better job of supporting our master releases and not 
invest precious time trying to support older releases.

(3) We need to make more of DataArrays and DataFrames reflect the Julian 
worldview. Lots of our code uses an interface that is incongruous with the 
interfaces found in Base. Even worse, a large chunk of code has type-stability 
problems that make it very slow, while comparable code that uses normal Arrays 
is 100x faster. We need to develop new idioms and new strategies for making 
code that interacts with type-destabilizing NAs faster. More generally, we 
need to make DataArrays and DataFrames fit in better with Julia when Julia and 
R disagree. Following R’s lead has often led us astray because R doesn’t share 
Julia’s strengths or weaknesses.

(4) Going forward, there should be exactly one way to do most things. The worst 
part of our current codebase is that there are multiple ways to express the 
same computation, but (a) some of them are unusably slow and (b) some of them 
don’t ever get tested or maintained properly. This is closely linked to the 
excess proliferation of functionality described in Resolution 1 above. We need 
to start removing stuff from our packages and making the parts we keep both 
reliable and fast.

I think we can push DataArrays and DataFrames to 1.0 status by the end of this 
year. But I think we need to adopt a new approach if we’re going to get there. 
Lots of stuff needs to get deprecated and what remains needs a lot more 
testing, benchmarking and documentation.

 — John
