I would be a lot happier with that feature if we followed the lead of 
traditional databases and constantly reminded users which columns are “NOT 
NULL”. As it stands, the “types” of a DataFrame don’t tell you whether a column 
could contain NA’s or not. If we exposed functionality through something like a 
hypothetical nullable(df, colindex), my resistance to that feature would start 
to go away.
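
A rough sketch of the nullable(df, colindex) idea, under stated assumptions: ToyFrame is a hypothetical stand-in, not the real DataFrame type, and `missing` from a current Julia release stands in for NA. Here nullable only reports whether a column's element type can hold a missing value; it does not scan the data.

```julia
# Hypothetical stand-in for a DataFrame; not the DataFrames API.
struct ToyFrame
    columns::Vector{Any}     # one entry per column
    names::Vector{String}    # column names, same order as `columns`
end

# A column is "nullable" if its element type admits `missing`
# (used here in place of NA).
nullable(col::AbstractVector) = Missing <: eltype(col)

# Hypothetical accessor in the spirit of nullable(df, colindex).
nullable(df::ToyFrame, colindex::Integer) = nullable(df.columns[colindex])

df = ToyFrame(
    Any[[1, 2, 3], Union{Missing,Float64}[1.0, missing, 3.0]],
    ["a", "b"],
)

nullable(df, 1)   # false: a plain Vector{Int} is effectively "NOT NULL"
nullable(df, 2)   # true: this column may contain missing values
```

The check reads only the column's type, so it costs the same regardless of column length, which is the property a "NOT NULL" reminder would need.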

 — John

On Jan 23, 2014, at 6:48 PM, Tom Short <[email protected]> wrote:

> I think of item #3 as a feature, not a bug. I don't like the idea of
> auto-conversion. If I choose Vectors, I should not expect them to
> support missing values. R sometimes irritates me by adding NA's when I
> don't expect it. I'd rather have the error than have NA's sneak in
> there. Also, there may be other types of AbstractDataFrames where we
> don't have the ability to assign missing values. HDF5 tables are one
> example I can think of. We wouldn't want to try to autoconvert a huge
> HDF5 column to a DataVector.
> 
> 
> 
> On Thu, Jan 23, 2014 at 8:58 PM, John Myles White
> <[email protected]> wrote:
>> A couple of points that expand on Tom’s comments:
>> 
>> (1) We need to add Tom’s definition of countna(a::Array) = 0 so that show() 
>> works on wide DataFrame’s that contain any columns that are Vector’s. I never use 
>> DataFrame’s like that, so I forgot that others might. It’s also impossible 
>> to produce such a DataFrame using our current I/O routines.
>> 
>> (2) The constructor you’re using does exist, Jacob, but you should typically 
>> pass in a Vector{Any}, each element of which is either a DataVector or 
>> PooledDataVector. See Point (3) for why, at the moment, using a Vector as a 
>> column is subtly broken.
>> 
>> (3) If people are going to put Vector’s in DataFrames for performance 
>> reasons, all of our setindex!() functions for DataFrames need to add methods 
>> that automatically convert Vector’s to DataVector’s if an NA is inserted in 
>> a Vector. Right now that kind of insertion is just going to error out. This 
>> check isn’t too hard, but it’s totally missing from our current codebase.
>> 
>> Personally, I would prefer that we not allow any of the columns of a 
>> DataFrame to be Vector's. It’s a weird edge case that doesn’t actually offer 
>> reliable high performance, because the potential performance improvements 
>> rely on the unsafe assumption that a DataFrame won’t contain any columns 
>> with NA’s in them.
>> 
>> — John
>> 
>> On Jan 23, 2014, at 1:33 PM, Tom Short <[email protected]> wrote:
>> 
>>> That works, but columns will be Arrays instead of DataArrays. That's
>>> the way it's always worked. If you want them to be DataArrays, then
>>> convert to DataArrays right at the end.
>>> 
>>> To fix show to support columns that are arrays, we probably need (at
>>> least) to define the following:
>>> 
>>> countna(da::Array) = 0
>>> 
>>> 
>>> 
>>> On Thu, Jan 23, 2014 at 4:07 PM, Jacob Quinn <[email protected]> wrote:
>>>> Great investigative work. Is
>>>> DataFrames( array_of_arrays, Index(column_names_array) )
>>>> not the right way to hand construct DataFrames any more? I think I can
>>>> allocate DataArrays instead, but at every step of the way, I was trying to
>>>> hand-optimize the result fetching process, which resulted in not creating a
>>>> DataArray or DataFrame until right before we return to the user.
>>>> 
>>>> -Jacob
>>>> 
>>>> 
>>>> On Thu, Jan 23, 2014 at 3:27 PM, bp2012 <[email protected]> wrote:
>>>>> 
>>>>> To check Jacob's suggestion about versions mismatch I completely removed
>>>>> the DataFrames and ODBC packages using Pkg.rm and physically deleted the
>>>>> directories from disk. I then added them via Pkg.add and Pkg.update.
>>>>> 
>>>>> I am running the julia nightlies build.
>>>>> julia> versioninfo()
>>>>> Julia Version 0.3.0-prerelease+1127
>>>>> Commit bc73674* (2014-01-22 20:09 UTC)
>>>>> 
>>>>> Pkg.status()
>>>>> - DataFrames                  0.5.1
>>>>> - ODBC                          0.3.5
>>>>> 
>>>>> Pkg.checkout("ODBC")
>>>>> INFO: Checking out ODBC master...
>>>>> INFO: Pulling ODBC latest master...
>>>>> INFO: No packages to install, update or remove
>>>>> 
>>>>> julia> Pkg.checkout("DataFrames")
>>>>> INFO: Checking out DataFrames master...
>>>>> INFO: Pulling DataFrames latest master...
>>>>> INFO: No packages to install, update or remove
>>>>> 
>>>>> I did some digging. It looks like there is a mismatch in that countna
>>>>> expects DataFrame columns to be DataArrays. However the ODBC package returns
>>>>> DataFrames that have array columns (using the first constructor in
>>>>> dataframe.jl). You guys would know better as to whether a change is needed
>>>>> in the constructor or if countna should also accept Array columns.
>>>>> 
>>>>> 
>>>>> I made some local changes to work around the issue.
>>>>> 
>>>>> show.jl:
>>>>> line 42:  if isna(col, i)     changed to  if isna(col[i])
>>>>> line 322:  missing[j] = countna(adf[j])     changed to    missing[j] =
>>>>> countna(isa(adf[j], DataArray) ? adf[j] : DataArray(adf[j]))
>>>>> 
>>>>> These work great for me.
>>>>> 
>>>> 
>> 
