Le jeudi 31 juillet 2014 à 21:19 -0700, John Myles White a écrit :
> To address Simon’s general points, which are really good reasons to
> avoid jumping on the Option{T} bandwagon too soon:
>
>
>
> * I agree that most languages use tagged union types for Option{T}
> rather than a wrapper type that contains a Boolean value. It’s also
> totally true that many compilers are able to make those constructs
> more efficient than Julia currently does. But what we should expect
> from Julia in the coming years isn’t so clear to me. (And I personally
> think we need to settle on a solution for representing missing data
> that’s viable in a year rather than viable in five years.) This is an
> issue that I’d really like to have input on from Jeff, Keno, Jameson
> or someone else involved with the internals of the compiler. Getting
> input from the broader community is the main reason I wanted to put a
> demo of OptionTypes.jl out in front of other folks.
>
>
> * I’m not clear how we could come to know that a datum is not missing
> without a resolution step that’s effectively equivalent to the get()
> function for Option{T}. I agree that the enforced use of get() means
> that you can’t hope to use generic functions like sum on collections
> of Option{T}. But I’m also not sure that’s such a bad thing: I think
> the easiest way to express to the compiler that you know that all of
> the entries of a DataArray are not NA is to convert the DataArray to a
> straight Array. But maybe you have other mechanisms for expressing
> this knowledge. Certainly my proposal to do conversions to Arrays
> isn’t the most elegant strategy. It’s just all that I’ve got so far.
>
>
> * I kind of like the idea of Option{T} standing outside of the main
> type system in a kind of mirror type system. I’m less happy about
> Union(NA, T) being a super type of T, even though there are some good
> reasons that you’d like to view T as a specialization of Union(NA, T).
> But I agree that I don’t have a good feel about where missing data
> belongs in the type hierarchy. This is another question for which I’d
> love to get input from others.
>
>
> In regard to Simon’s performance points:
>
>
> * Yes, memory usage alone argues strongly for working with
> DataArray{T} rather than Array{Option{T}}.
>
>
> * Exploting tricks that make operations like anyna() faster is another
> good argument for keeping DataArray{T} around.
>
>
> * I’m not sure how to deal with inlining concerns or the undefined
> reference checks. Do you have ideas for improving this within
> DataArrays or do we need supporting changes in the compiler?
Actually it seems it would be possible to make Array{Union(NAtype, T)}
more similar to and as efficient as DataArray{T}, by handling a few
things in the compiler. This would create a generalization of DataArray
to any kind of union type, which could be useful in other contexts. But
more importantly, it would make missing values integrate seamlessly into
Julia, getting rid of any hacks.
More specifically, the following features would need to be supported:
- a way of telling the compiler to store the data as two arrays of
concrete types (here T and NAtype), instead of as an array of boxed
values, so that:
* efficient operations can be performed on the T values (by skipping
the missing ones manually)
* T values are stored as a dense array and can be converted to
Array{T} without any copy or passed to BLAS when no missing values are
present
* NA values can be packed in a BitArray to save memory and make NA
detection faster (see below)
- a fonction to check whether a given element of the array is of type T
rather than of NAtype (generalization of isna())
- a fonction to check whether all elements of the array are of type T
rather than of NAtype (generalization of anyna(), more efficient than
calling the previous function on all elements thanks to the packing of
NAs in a BitArray)
In this scheme, what is missing is how to allow the compiler to pack NAs
in a BitArray. Somehow, NAtype would have to be defined as a 1-bit
object. Maybe by making it an enum-like immutable with a 1-bit field
inside it?
How does it sound?