I was using the DataFrames convert method that allows replacement of NA
with an arbitrary value. I thought I had it working, but maybe I forgot to
save and was running an old version.
Anyway, as far as I can tell I am calling the method the way the DataFrames
documentation shows, but it results in a MethodError:
using DataFrames
city = readtable(fname)
points = convert(Array, city[:,2:end], NaN) # converts NA values to NaN (not a number)
Results in:
ERROR: MethodError: `convert` has no method matching convert(::Type{Array{T,N}}, ::DataFrames.DataFrame, ::Float64)
This may have arisen from a call to the constructor Array{T,N}(...),
since type constructors fall back to convert methods.
Closest candidates are:
  convert{T,N}(::Type{Array{T,N}}, ::DataArrays.DataArray{T,N}, ::Any)
  convert{T,R,N}(::Type{Array{T,N}}, ::DataArrays.PooledDataArray{T,R,N}, ::Any)
  convert(::Type{Array{T,N}}, ::DataFrames.AbstractDataFrame)
  ...
 in hcluster at /Users/lewislevinmbr/Dropbox/Online Coursework/MIT Intro 6002x/Assignments/Probset_6/df_hcluster.jl:85
DataFrames documentation shows:
dv = @data([NA, 3, 2, 5, 4])
mean(convert(Array, dv, 11))
Seems like I am doing the same thing, just using the float value NaN. The
columns of city that are being sliced are indeed Float64.
This certainly works, but will fail if any value is NA (not a problem with
sample dataset, but I would like to generalize...):
points = Array{Float64}(city[:, 2:end]) # fails if any value is NA
I kept breaking this down and solved it. The convert method with replacement
of NA values is only defined for the DataArray type, not for DataFrame. So,
first convert to a DataArray, then do the conversion with replacement of NA,
thus:
city = readtable(fname)
points = convert(Array{Float64,2}, DataArray(city[:, 2:end]), NaN) # converts NA to NaN
That's what I wanted, and it works like a champ. I think replacing NA
with NaN is pretty useful. NaNs propagate through arithmetic much as NAs do
in a DataArray, but Array{Float64} is noticeably faster.
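
A small sketch of that propagation, for anyone reading this later. It is written against current Julia (1.x) using only Base functions, not the DataArrays API used above, so `missing` stands in for NA here. That substitution and the sample values are my assumptions, purely for illustration:

```julia
# Assumption: modern Julia (1.x); `missing` stands in for DataArrays' NA.
col = [1.0, missing, 3.0]          # Vector{Union{Missing, Float64}}

# Replace each missing entry with NaN; the result narrows to Vector{Float64}.
filled = coalesce.(col, NaN)

# Both markers propagate through a reduction.
total_missing = sum(col)           # poisoned by the missing entry
total_nan = sum(filled)            # poisoned by the NaN entry

# To ignore the NaN entries, filter them out explicitly before reducing.
total_valid = sum(filter(!isnan, filled))
```

The point is only that a NaN in a plain Float64 array still flags the gap in downstream arithmetic, so the information is not silently lost by leaving the NA-aware container.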
You could ask, "Why are you doing this? Why even use a DataFrame at all,
with its ability to handle NAs, if you are just going to convert back
out of it?"
Well, good question. The simple answer is that a DataFrame makes reading the
data, handling row/col names, and getting summary stats easy. And then, since
the core data array has no NAs and is float, I get the improved performance
of handling that data subset as Array{Float64,2}.
And, yes, I did the experiment with readcsv and this also works, but
provides no handling of NA.
So, I think the most general approach is loading the data as a DataFrame,
deciding what to do with NA, and then converting.
There is enough here to handle lots of different approaches. I am just
experimenting.
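
As a rough illustration of the "decide what to do with NA, then convert" step, here are two possible policies side by side. This is again a sketch in current Julia (1.x) with `missing` standing in for NA. The matrix `raw` is a made-up stand-in for the numeric columns of a loaded table, not the city data from this post:

```julia
# Assumption: modern Julia (1.x); `missing` stands in for NA, and `raw`
# is a hypothetical stand-in for the numeric columns of a loaded table.
raw = [1.0 2.0; missing 4.0; 5.0 6.0]    # Matrix{Union{Missing, Float64}}

# Policy 1: keep every row and mark the gaps with NaN.
points_nan = coalesce.(raw, NaN)          # Matrix{Float64}

# Policy 2: drop any row that contains a missing value, then convert.
complete = [!any(ismissing, row) for row in eachrow(raw)]
points_complete = Matrix{Float64}(raw[complete, :])
```

Which policy is right depends on what the later code does with the points. The first keeps the row count intact, the second guarantees a matrix with no gaps at all.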