[julia-stats] ANN: DataFrames 0.9.0 Planned for February

Milan Bouchet-Valat Sat, 12 Nov 2016 06:15:26 -0800

    This announcement has been made on the Domains/Data category of the
new Discourse forum. Please go there for questions and discussions:
https://discourse.julialang.org/t/announcement-dataframes-0-9-0-planned-for-february/266



Towards DataFrames 0.9.0

The DataFrames package and the surrounding ecosystem 
are currently 
undergoing a deep refactoring in development branches, based on a 
framework developed over the last two years. This work aims to 
dramatically improve performance by replacing the DataArray type (and its NA 
value representing missingness) with the new Nullable, NullableArray (see this 
blog post) and CategoricalArray types. Please refer to this blog post for an 
explanation of the limitations of the current design based on DataArray. The 
new framework is planned to be released as version 0.9.0 in early February 2017.

New APIs and Compatibility Breaks

Despite our efforts to preserve backward 
compatibility, this change 
will likely break some existing workflows. The standard indexing 
approach (inherited from R) will no longer be the recommended interface.
 Instead, convenient, flexible and efficient high-level APIs inspired by
 the dplyr R package, by 
SQL or by LINQ will be preferred. Users are encouraged to experiment 
with these approaches even with the current stable DataFrames release 
(0.8.x series), via the DataFramesMeta and Query packages. Eventually, an API 
based on the StructuredQueries package (see this blog post),
 which is still in development, will be provided. Among other 
advantages, these high-level APIs will eventually support different data
sources, from in-memory data frames to out-of-core databases, with very
few code changes.

The new DataFrames release will require adjustments from all packages
depending on DataFrames. Until then, development will continue to
happen on the master branch of the git repository. In many
cases, both the new and the old frameworks can be supported in parallel
(by supporting both DataArray and NullableArray); where this is feasible, package
authors are encouraged to start porting as soon as possible. The porting work
is tracked in a GitHub issue; take inspiration from existing pull requests, and
do not hesitate to ask for help there if needed.

Motivated users can also experiment with the development version, 
though be warned that the user experience can currently be frustrating 
due to incomplete support for Nullable in Julia and in high-level APIs. This 
issue, known as "lifting" (see this discussion and this one, as well as linked 
pages), still requires fundamental changes. We expect these to be complete by 
early January 2017 to allow for a progressive migration; users are advised not 
to upgrade to the development version for actual work until then.
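
For readers unfamiliar with the term, lifting means making a function written
for ordinary values also accept possibly-missing values, returning a missing
result whenever any argument is missing. The sketch below illustrates the idea
only. It is written in Python rather than Julia, uses None as a stand-in for a
null Nullable, and the `lift` helper is hypothetical, not part of any package
mentioned here.

```python
from functools import wraps


def lift(f):
    """Return a version of f that propagates missing values (None).

    Hypothetical helper for illustration; None stands in for a null value.
    """
    @wraps(f)
    def lifted(*args):
        # If any argument is missing, the result is missing too.
        if any(a is None for a in args):
            return None
        # Otherwise, apply the original function to the plain values.
        return f(*args)
    return lifted


# An ordinary two-argument function, lifted to accept missing values.
add = lift(lambda x, y: x + y)

print(add(1, 2))     # both values present
print(add(1, None))  # one value missing
```

Without such a wrapper, every call on possibly-missing data has to check for
null values by hand, which is the source of the frustration described above.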

More Changes

The above changes will be coordinated with a related refactoring of 
the DataFrames codebase to increase modularity:
 * CSV reading and writing support (readtable and writetable) will be 
deprecated in favor of the CSV package. More generally, data import and export 
should be done via the DataStreams package (see this blog post).
 * Functions translating model formulas into model matrices will be moved to a 
separate StatsModels package, with the goal of eventually supporting any kind 
of AbstractTable (including DataFrame),
 and will also include model-related functions currently in StatsBase. 
Though this will not happen in the first release, in the end modeling 
packages should only need to depend on that package, and no longer on 
DataFrames.
 * A new AbstractTable interface will be progressively developed in the 
eponymous package to allow writing generic code supporting any kind of tabular 
data, including DataFrame, without depending on the DataFrames package.
 * Packages strongly tied to DataFrames (including that package itself) will be 
moved to the JuliaData organization to keep JuliaStats focused on actual 
statistics.

We are aware that the transition will certainly be disruptive for 
users. But we are confident the advantages of the new framework will 
greatly offset its costs, following state-of-the-art designs like R's dplyr and 
Python's Pandas 2.0, and taking full advantage of Julia's flexibility and 
performance. Your help is welcome to push forward with this roadmap!

      


















    
