FWIW, I don’t think overhead is the right concept here: DataFrames and Arrays
are almost totally dissimilar data structures. (DataFrames are arguably much
more like Dicts than Arrays.)
If Arrays are appropriate, use those. DataFrames are designed for cases where
an Array is clearly not a meaningful data structure for your problem, because
Arrays don’t maintain the invariants that a DataFrame must maintain: column
homogeneity (every value in a column has the same type) coupled with row
heterogeneity (the values in a row may all differ in type).
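To make that invariant concrete, here is a minimal sketch in plain Python (chosen only to keep it dependency-free; it is not the DataFrames API). The table contents and the helper names `row` and `is_valid_table` are made up for illustration.

```python
# Hypothetical table stored as a mapping from column name to column values,
# which is why a DataFrame resembles a Dict more than an Array.
table = {
    "id": ["a", "b", "c"],             # every entry is a str
    "mass": [1.0, 2.5, 0.7],           # every entry is a float
    "flagged": [False, True, False],   # every entry is a bool
}

def row(table, i):
    """Return row i as a tuple, one value per column."""
    return tuple(col[i] for col in table.values())

def is_valid_table(table):
    """Check the invariants: equal column lengths, one type per column."""
    lengths = {len(col) for col in table.values()}
    one_type = all(len({type(v) for v in col}) == 1 for col in table.values())
    return len(lengths) == 1 and one_type

print(is_valid_table(table))
print(row(table, 1))
```

Each column is homogeneous, but a single row mixes a string, a float and a boolean, which a plain numeric array cannot represent.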
DataArrays are also totally dissimilar from DataFrames. In fact, they’re
exactly like Arrays with the option of storing a singleton value called NA.
Right now they have a non-trivial performance overhead relative to Arrays, but
that will decrease over time. What won’t decrease over time is the cognitive
overhead of using DataArrays: they impose a lot more complexity because you
can’t be sure whether indexing into a DataArray will give you a value or NA.
That complexity is only appropriate if you’re working with missing values. In
many applications there are no missing values, so DataArrays are needless
complexity.
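A rough illustration of that cognitive overhead, again in plain Python rather than Julia: `None` stands in for NA, and the function names are hypothetical. This is an analogy, not the DataArrays API.

```python
# None is a stand-in for NA here; the values are made up for illustration.
plain = [1.0, 2.0, 3.0]
with_na = [1.0, None, 3.0]

def mean(values):
    """Mean of a plain list: every element is assumed to be a float."""
    return sum(values) / len(values)

def mean_skipping_na(values):
    """Mean when any element might be missing: each one must be checked."""
    present = [v for v in values if v is not None]
    if not present:
        return None  # nothing to average, so the result is itself missing
    return sum(present) / len(present)

print(mean(plain))
print(mean_skipping_na(with_na))
```

Calling `mean(with_na)` raises a `TypeError`, so once a missing value is possible, every computation has to decide what to do about it. If your data never has missing values, the first version is all you need.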
Here’s how I think of doing data analysis:
(1) Gather data
(2) Store the data in a tabular data structure
(3) Apply transformations (like those found in GLM) to turn the tabular data
structure into a Matrix{Float64}
(4) Do numerical computations on Matrix{Float64}
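The steps above can be sketched roughly as follows, in plain Python for the sake of a dependency-free example. The data, the intercept-plus-indicator encoding and the helper names `to_matrix` and `column_means` are my assumptions about what such a transformation might look like, not what GLM actually does.

```python
# Steps (1)-(2): hypothetical data held as named, typed columns.
table = {
    "group": ["a", "b", "a"],
    "x": [1.0, 2.0, 3.0],
}

def to_matrix(table):
    """Step (3): encode each row as [intercept, x, group == "b"] floats."""
    rows = zip(table["group"], table["x"])
    return [[1.0, x, 1.0 if g == "b" else 0.0] for g, x in rows]

def column_means(matrix):
    """Step (4): a purely numerical computation on the float matrix."""
    n = len(matrix)
    return [sum(col) / n for col in zip(*matrix)]

X = to_matrix(table)
print(X)
print(column_means(X))
```

The point is that the tabular structure only matters up to step (3). After that, everything is ordinary numerics on a matrix of floats.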
— John
On Oct 25, 2014, at 3:19 PM, Kaj Wiik wrote:
> A followup from a fellow astronomer: what is the overhead of data frames
> compared to plain arrays, and are there any benchmarks available? When should
> I avoid using data arrays, or should I always use them :-)?
>
> Cheers,
> Kaj
>
> On Saturday, October 25, 2014 3:37:18 PM UTC+3, Daniel Carrera wrote:
> Hello,
>
> This is a fairly naive question. I have observed for the last two years that
> many people really like data frames. R users obviously like them, and the
> Python and Julia communities thought it was worth adding that feature to
> their languages too. However, as an astronomer, I have not yet had a problem
> that would be solved by data frames. I use Julia to analyze hydrodynamic
> simulations. I can imagine that data frames could have a role in photographic
> data where some pixels are missing.
>
> Are you a scientist or engineer currently using data frames to solve a
> problem? I would love to hear about what you do with data frames and why you
> find them useful.
>
> Cheers,
> Daniel.