A couple of points that expand on Tom’s comments:
(1) We need to add Tom’s definition of countna(a::Array) = 0 so that show()
works on wide DataFrames that contain any columns that are Vectors. I never
use DataFrames like that, so I forgot that others might. It’s also impossible
to produce such a DataFrame using our current I/O routines.
(2) The constructor you’re using does exist, Jacob, but you should typically
pass in a Vector{Any}, each element of which is either a DataVector or
PooledDataVector. See Point (3) for why, at the moment, using a Vector as a
column is subtly broken.
(3) If people are going to put Vectors in DataFrames for performance reasons,
all of our setindex!() functions for DataFrames need methods that
automatically convert a Vector column to a DataVector when an NA is inserted
into it. Right now that kind of insertion is just going to error out. This
check isn’t too hard, but it’s totally missing from our current codebase.
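
To make point (3) concrete, here is a rough sketch of that kind of check. It
is not DataFrames code and makes several assumptions: it runs on a current
Julia with no packages, uses `missing` as a stand-in for NA, uses
Vector{Union{T, Missing}} as a stand-in for DataVector{T}, and the name
set_or_widen! is hypothetical.

```julia
# Sketch only: `missing` stands in for NA, and a Vector{Union{T, Missing}}
# stands in for a DataVector{T}. `set_or_widen!` is a hypothetical helper,
# not an existing DataFrames method.
function set_or_widen!(cols::Vector{Any}, j::Integer, i::Integer, value)
    col = cols[j]
    if value === missing && !(Missing <: eltype(col))
        # A plain Vector{T} cannot store a missing value, so replace the
        # column with a widened copy before assigning into it.
        col = convert(Vector{Union{eltype(col), Missing}}, col)
        cols[j] = col
    end
    col[i] = value
    return cols
end

# Two plain Vector columns, as a Vector{Any} of columns.
cols = Any[[1, 2, 3], ["a", "b", "c"]]
set_or_widen!(cols, 1, 2, missing)   # widens column 1, then stores the value
set_or_widen!(cols, 2, 1, "z")       # ordinary assignment, no conversion
```

The conversion only happens on the first missing insertion into a column, so
columns that never receive one keep their plain Vector type.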
Personally, I would prefer that we not allow any of the columns of a DataFrame
to be Vectors. It’s a weird edge case that doesn’t actually offer reliable
high performance, because the potential performance improvements rely on the
unsafe assumption that a DataFrame won’t contain any columns with NAs in them.
— John
On Jan 23, 2014, at 1:33 PM, Tom Short wrote:
> That works, but columns will be Arrays instead of DataArrays. That's
> the way it's always worked. If you want them to be DataArrays, then
> convert to DataArrays right at the end.
>
> To fix show to support columns that are arrays, we probably need (at
> least) to define the following:
>
> countna(da::Array) = 0
>
>
>
> On Thu, Jan 23, 2014 at 4:07 PM, Jacob Quinn wrote:
>> Great investigative work. Is
>> DataFrame( array_of_arrays, Index(column_names_array) )
>> not the right way to hand-construct DataFrames any more? I think I can
>> allocate DataArrays instead, but at every step of the way, I was trying to
>> hand-optimize the result fetching process, which resulted in not creating a
>> DataArray or DataFrame until right before we return to the user.
>>
>> -Jacob
>>
>>
>> On Thu, Jan 23, 2014 at 3:27 PM, bp2012 wrote:
>>>
>>> To check Jacob's suggestion about versions mismatch I completely removed
>>> the DataFrames and ODBC packages using Pkg.rm and physically deleted the
>>> directories from disk. I then added them via Pkg.add and Pkg.update.
>>>
>>> I am running the Julia nightly build.
>>> julia> versioninfo()
>>> Julia Version 0.3.0-prerelease+1127
>>> Commit bc73674* (2014-01-22 20:09 UTC)
>>>
>>> julia> Pkg.status()
>>> - DataFrames 0.5.1
>>> - ODBC 0.3.5
>>>
>>> julia> Pkg.checkout("ODBC")
>>> INFO: Checking out ODBC master...
>>> INFO: Pulling ODBC latest master...
>>> INFO: No packages to install, update or remove
>>>
>>> julia> Pkg.checkout("DataFrames")
>>> INFO: Checking out DataFrames master...
>>> INFO: Pulling DataFrames latest master...
>>> INFO: No packages to install, update or remove
>>>
>>> I did some digging. It looks like there is a mismatch in that countna
>>> expects DataFrame columns to be DataArrays. However the ODBC package returns
>>> DataFrames that have array columns (using the first constructor in
>>> dataframe.jl). You guys would know better as to whether a change is needed
>>> in the constructor or if countna should also accept Array columns.
>>>
>>>
>>> I made some local changes to work around the issue.
>>>
>>> show.jl:
>>> line 42:
>>>     if isna(col, i)
>>> changed to
>>>     if isna(col[i])
>>> line 322:
>>>     missing[j] = countna(adf[j])
>>> changed to
>>>     missing[j] = countna(isa(adf[j], DataArray) ? adf[j] : DataArray(adf[j]))
>>>
>>> These work great for me.
>>>
>>