I would be a lot happier with that feature if we followed the lead of traditional databases and constantly reminded users which columns are “NOT NULL”. As it stands, the “types” of a DataFrame don’t tell you whether a column could contain NA’s or not. If we exposed functionality through something like a hypothetical nullable(df, colindex), my resistance to that feature would start to go away.
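
To make that concrete, here is a rough sketch of what such a query could look like. It is only an illustration under stated assumptions: ToyFrame is a made-up stand-in for a DataFrame, nullable is the hypothetical function mentioned above rather than anything in our codebase, and Julia's `missing` plays the role of NA.

```julia
# Hypothetical sketch, not the DataFrames API: ToyFrame and nullable are
# made-up names, and `missing` stands in for NA.
struct ToyFrame
    columns::Vector{Any}     # one vector per column
    names::Vector{Symbol}    # column names, in the same order as columns
end

# A column can hold a missing value only if its element type admits Missing.
nullable(col::AbstractVector{T}) where {T} = Missing <: T

# Column-level query by position, mirroring nullable(df, colindex).
nullable(df::ToyFrame, colindex::Integer) = nullable(df.columns[colindex])

df = ToyFrame(Any[[1, 2, 3], Union{Int, Missing}[1, missing, 3]], [:a, :b])

# Print a "NOT NULL"-style reminder for each column.
for (i, name) in enumerate(df.names)
    println(name, ": ", nullable(df, i) ? "may contain NA" : "NOT NULL")
end
```

With something along these lines, show() could print the reminder next to each column, so a plain Vector column would be visibly marked as unable to take an NA.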
— John

On Jan 23, 2014, at 6:48 PM, Tom Short <[email protected]> wrote:

> I think of item #3 as a feature, not a bug. I don't like the idea of
> auto-conversion. If I choose Vectors, I should not expect them to
> support missing values. R sometimes irritates me by adding NA's when I
> don't expect it. I'd rather have the error than have NA's sneak in
> there. Also, there may be other types of AbstractDataFrames where we
> don't have the ability to assign missing values. HDF5 tables are one
> example I can think of. We wouldn't want to try to autoconvert a huge
> HDF5 column to a DataVector.
>
>
>
> On Thu, Jan 23, 2014 at 8:58 PM, John Myles White
> <[email protected]> wrote:
>> A couple of points that expand on Tom’s comments:
>>
>> (1) We need to add Tom’s definition of countna(a::Array) = 0 to show() wide
>> DataFrame’s that contain any columns that are Vector’s. I never use
>> DataFrame’s like that, so I forgot that others might. It’s also impossible
>> to produce such a DataFrame using our current I/O routines.
>>
>> (2) The constructor you’re using does exist, Jacob, but you should typically
>> pass in a Vector{Any}, each element of which is either a DataVector or
>> PooledDataVector. See Point (3) for why, at the moment, using a Vector as a
>> column is subtly broken.
>>
>> (3) If people are going to put Vector’s in DataFrames for performance
>> reasons, all of our setindex!() functions for DataFrames need to add methods
>> that automatically convert Vector’s to DataVector’s if an NA is inserted in
>> a Vector. Right now that kind of insertion is just going to error out. This
>> check isn’t too hard, but it’s totally missing from our current codebase.
>>
>> Personally, I would prefer that we not allow any of the columns of a
>> DataFrame to be Vector's. It’s a weird edge case that doesn’t actually offer
>> reliable high performance, because the potential performance improvements
>> rely on the unsafe assumption that a DataFrame won’t contain any columns
>> with NA’s in it.
>>
>> — John
>>
>> On Jan 23, 2014, at 1:33 PM, Tom Short <[email protected]> wrote:
>>
>>> That works, but columns will be Arrays instead of DataArrays. That's
>>> the way it's always worked. If you want them to be DataArrays, then
>>> convert to DataArrays right at the end.
>>>
>>> To fix show to support columns that are arrays, we probably need (at
>>> least) to define the following:
>>>
>>> countna(da::Array) = 0
>>>
>>>
>>>
>>> On Thu, Jan 23, 2014 at 4:07 PM, Jacob Quinn <[email protected]> wrote:
>>>> Great investigative work. Is
>>>> DataFrames( array_of_arrays, Index(column_names_array) )
>>>> not the right way to hand construct DataFrames any more? I think I can
>>>> allocate DataArrays instead, but at every step of the way, I was trying to
>>>> hand-optimize the result fetching process, which resulted in not creating a
>>>> DataArray or DataFrame until right before we return to the user.
>>>>
>>>> -Jacob
>>>>
>>>>
>>>> On Thu, Jan 23, 2014 at 3:27 PM, bp2012 <[email protected]> wrote:
>>>>>
>>>>> To check Jacob's suggestion about a version mismatch I completely removed
>>>>> the DataFrames and ODBC packages using Pkg.rm and physically deleted the
>>>>> directories from disk. I then added them via Pkg.add and Pkg.update.
>>>>>
>>>>> I am running the julia nightlies build.
>>>>> julia> versioninfo()
>>>>> Julia Version 0.3.0-prerelease+1127
>>>>> Commit bc73674* (2014-01-22 20:09 UTC)
>>>>>
>>>>> Pkg.status()
>>>>> - DataFrames 0.5.1
>>>>> - ODBC 0.3.5
>>>>>
>>>>> Pkg.checkout("ODBC")
>>>>> INFO: Checking out ODBC master...
>>>>> INFO: Pulling ODBC latest master...
>>>>> INFO: No packages to install, update or remove
>>>>>
>>>>> julia> Pkg.checkout("DataFrames")
>>>>> INFO: Checking out DataFrames master...
>>>>> INFO: Pulling DataFrames latest master...
>>>>> INFO: No packages to install, update or remove
>>>>>
>>>>> I did some digging. It looks like there is a mismatch in that countna
>>>>> expects DataFrame columns to be DataArrays. However, the ODBC package
>>>>> returns DataFrames that have array columns (using the first constructor in
>>>>> dataframe.jl). You guys would know better as to whether a change is needed
>>>>> in the constructor or if countna should also accept Array columns.
>>>>>
>>>>>
>>>>> I made some local changes to work around the issue.
>>>>>
>>>>> show.jl:
>>>>> line 42: if isna(col, i) changed to if isna(col[i])
>>>>> line 322: missing[j] = countna(adf[j]) changed to missing[j] =
>>>>> countna(isa(adf[j], DataArray) ? adf[j] : DataArray(adf[j]))
>>>>>
>>>>> These work great for me.
>>>>>
>>>>
>>
