Yeah, that seems totally reasonable to me. If we do this in a more formal way, 
I’m now on board.

Let’s add the idea of explicit restrictions on columns that can and can’t 
contain NA’s to the spec: https://github.com/JuliaStats/DataFrames.jl/issues/502

 — John
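
For readers of the archive, here is a minimal sketch of what per-column restrictions on NA's might look like, using the Vector{Bool} idea from the quoted message below. This is an illustration under stated assumptions, not the DataFrames.jl API: `SketchFrame` and `setcell!` are hypothetical names, and Base Julia's `missing` stands in for NA.

```julia
# Hypothetical sketch, not the DataFrames.jl API.
# `missing` (Base Julia 1.x) stands in for the NA discussed in the thread.
struct SketchFrame
    columns::Vector{Any}     # one vector per column
    nullable::Vector{Bool}   # nullable[j] is true if column j may hold missing values
end

# Assign `value` to row i of column j, refusing to put a missing value
# into a column that was declared non-nullable.
function setcell!(sf::SketchFrame, value, i::Integer, j::Integer)
    if value === missing && !sf.nullable[j]
        throw(ArgumentError("column $j is declared non-nullable"))
    end
    sf.columns[j][i] = value
    return sf
end

sf = SketchFrame(
    Any[[1, 2, 3], Union{Missing, Float64}[1.0, 2.0, 3.0]],
    [false, true],
)

setcell!(sf, missing, 2, 2)   # allowed: column 2 is nullable
```

With this layout, inserting a missing value into column 1 raises an error instead of silently converting the column, which matches the preference voiced in the thread for an error over auto-conversion.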

On Jan 23, 2014, at 8:21 PM, Sean Garborg <[email protected]> wrote:

> My first thought was a Vector{Bool}.
> 
> On Thursday, January 23, 2014 10:05:25 PM UTC-6, John Myles White wrote:
> Ok. I’m coming around to this.
> 
> How would you do I/O? If we make DataFrames expose a nullable property, we 
> could plausibly produce vectors instead of data vectors when parsing CSV 
> files.
> 
>  — John
> 
> On Jan 23, 2014, at 7:38 PM, Sean Garborg <[email protected]> wrote:
> 
>> I'd think of #3 as a feature, too.
>> 
>> Just to throw another use case in the ring, if DataFrames with a mix of 
>> Vectors and DataVectors (with NAs) were performant, my co-workers and I 
>> would usually pull in data marking all columns as Vectors, these columns 
>> would remain Vectors, and derived columns would be mostly DataVectors.
>> 
>> 
>> On Thursday, January 23, 2014 8:48:42 PM UTC-6, tshort wrote:
>> I think of item #3 as a feature, not a bug. I don't like the idea of 
>> auto-conversion. If I choose Vectors, I should not expect them to 
>> support missing values. R sometimes irritates me by adding NA's when I 
>> don't expect it. I'd rather have the error than have NA's sneak in 
>> there. Also, there may be other types of AbstractDataFrames where we 
>> don't have the ability to assign missing values. HDF5 tables are one 
>> example I can think of. We wouldn't want to try to autoconvert a huge 
>> HDF5 column to a DataVector. 
>> 
>> 
>> 
>> On Thu, Jan 23, 2014 at 8:58 PM, John Myles White 
>> <[email protected]> wrote: 
>> > A couple of points that expand on Tom’s comments: 
>> > 
>> > (1) We need to add Tom’s definition of countna(a::Array) = 0 so that show() 
>> > works on wide DataFrame’s that contain any columns that are Vector’s. I never use 
>> > DataFrame’s like that, so I forgot that others might. It’s also impossible 
>> > to produce such a DataFrame using our current I/O routines. 
>> > 
>> > (2) The constructor you’re using does exist, Jacob, but you should 
>> > typically pass in a Vector{Any}, each element of which is either a 
>> > DataVector or PooledDataVector. See Point (3) for why, at the moment, 
>> > using a Vector as a column is subtly broken. 
>> > 
>> > (3) If people are going to put Vector’s in DataFrames for performance 
>> > reasons, all of our setindex!() functions for DataFrames need to add 
>> > methods that automatically convert Vector’s to DataVector’s if an NA is 
>> > inserted in a Vector. Right now that kind of insertion is just going to 
>> > error out. This check isn’t too hard, but it’s totally missing from our 
>> > current codebase. 
>> > 
>> > Personally, I would prefer that we not allow any of the columns of a 
>> > DataFrame to be Vector's. It’s a weird edge case that doesn’t actually 
>> > offer reliable high performance, because the potential performance 
>> > improvement relies on the unsafe assumption that a DataFrame won’t contain 
>> > any columns with NA’s in it. 
>> > 
>> >  — John 
>> > 
>> > On Jan 23, 2014, at 1:33 PM, Tom Short <[email protected]> wrote: 
>> > 
>> >> That works, but columns will be Arrays instead of DataArrays. That's 
>> >> the way it's always worked. If you want them to be DataArrays, then 
>> >> convert to DataArrays right at the end. 
>> >> 
>> >> To fix show to support columns that are arrays, we probably need (at 
>> >> least) to define the following: 
>> >> 
>> >> countna(da::Array) = 0 
>> >> 
>> >> 
>> >> 
>> >> On Thu, Jan 23, 2014 at 4:07 PM, Jacob Quinn <[email protected]> wrote: 
>> >>> Great investigative work. Is 
>> >>> DataFrames( array_of_arrays, Index(column_names_array) ) 
>> >>> not the right way to hand construct DataFrames any more? I think I can 
>> >>> allocate DataArrays instead, but at every step of the way, I was trying to 
>> >>> hand-optimize the result fetching process, which resulted in not creating a 
>> >>> DataArray or DataFrame until right before we return to the user. 
>> >>> 
>> >>> -Jacob 
>> >>> 
>> >>> 
>> >>> On Thu, Jan 23, 2014 at 3:27 PM, bp2012 <[email protected]> wrote: 
>> >>>> 
>> >>>> To check Jacob's suggestion about a version mismatch I completely removed 
>> >>>> the DataFrames and ODBC packages using Pkg.rm and physically deleted the 
>> >>>> directories from disk. I then added them via Pkg.add and Pkg.update. 
>> >>>> 
>> >>>> I am running the julia nightlies build. 
>> >>>> julia> versioninfo() 
>> >>>> Julia Version 0.3.0-prerelease+1127 
>> >>>> Commit bc73674* (2014-01-22 20:09 UTC) 
>> >>>> 
>> >>>> Pkg.status() 
>> >>>> - DataFrames                  0.5.1 
>> >>>> - ODBC                          0.3.5 
>> >>>> 
>> >>>> Pkg.checkout("ODBC") 
>> >>>> INFO: Checking out ODBC master... 
>> >>>> INFO: Pulling ODBC latest master... 
>> >>>> INFO: No packages to install, update or remove 
>> >>>> 
>> >>>> julia> Pkg.checkout("DataFrames") 
>> >>>> INFO: Checking out DataFrames master... 
>> >>>> INFO: Pulling DataFrames latest master... 
>> >>>> INFO: No packages to install, update or remove 
>> >>>> 
>> >>>> I did some digging. It looks like there is a mismatch in that countna 
>> >>>> expects DataFrame columns to be DataArrays. However the ODBC package returns 
>> >>>> DataFrames that have array columns (using the first constructor in 
>> >>>> dataframe.jl). You guys would know better as to whether a change is needed 
>> >>>> in the constructor or if countna should also accept Array columns. 
>> >>>> 
>> >>>> 
>> >>>> I made some local changes to work around the issue. 
>> >>>> 
>> >>>> show.jl: 
>> >>>> line 42:  if isna(col, i)     changed to  if isna(col[i]) 
>> >>>> line 322:  missing[j] = countna(adf[j])     changed to 
>> >>>>     missing[j] = countna(isa(adf[j], DataArray) ? adf[j] : DataArray(adf[j])) 
>> >>>> 
>> >>>> These work great for me. 
>> >>>> 
>> >>> 
>> >
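
The countna fallback proposed in the thread can be sketched in current Julia roughly as follows. This is an illustration only: `countna` is defined locally here rather than taken from any package, and Base Julia's `missing` stands in for NA. The idea is the same as countna(da::Array) = 0, namely that a column whose type cannot hold missing values never needs to be scanned.

```julia
# Sketch only: `missing` (Base Julia 1.x) stands in for NA, and
# `countna` is defined locally rather than taken from any package.

# General case: scan the column and count the missing entries.
countna(col::AbstractVector) = count(ismissing, col)

# A column whose element type is a plain number can never hold `missing`,
# so the answer is known from the type alone, without scanning.
countna(col::AbstractVector{T}) where {T<:Number} = 0

plain    = [1.0, 2.0, 3.0]                              # Vector{Float64}
nullable = Union{Missing, Float64}[1.0, missing, 3.0]   # may hold missing

# Per-column counts, similar in spirit to what show() needs for its summary.
missing_counts = [countna(c) for c in Any[plain, nullable]]
```

Dispatch picks the more specific numeric method for `plain` and the scanning method for `nullable`, so a display routine can call countna on every column without first converting plain arrays.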
