It is possible to do generic compiler improvements for Union types
(Jameson had a branch at some point that did callsite splitting if we
inferred a Union type). However, I think the best way to go here is to
maintain the current separation of two arrays (one of the values one
for the NAs), but give an option type on access. The option type would
then most likely be in memory and wouldn't have much overhead. Please
let me know if there's anything specific I should explain how the
compiler will handle it, I admit I have only skimmed this thread.

On Fri, Aug 1, 2014 at 9:18 AM, Simon Kornblith <[email protected]> wrote:
> On Friday, August 1, 2014 6:23:59 AM UTC-4, Milan Bouchet-Valat wrote:
>>
>> Le jeudi 31 juillet 2014 à 21:19 -0700, John Myles White a écrit :
>>
>> To address Simon’s general points, which are really good reasons to avoid
>> jumping on the Option{T} bandwagon too soon:
>>
>>
>>
>> * I agree that most languages use tagged union types for Option{T} rather
>> than a wrapper type that contains a Boolean value. It’s also totally true
>> that many compilers are able to make those constructs more efficient than
>> Julia currently does. But what we should expect from Julia in the coming
>> years isn’t so clear to me. (And I personally think we need to settle on a
>> solution for representing missing data that’s viable in a year rather than
>> viable in five years.) This is an issue that I’d really like to have input
>> on from Jeff, Keno, Jameson or someone else involved with the internals of
>> the compiler. Getting input from the broader community is the main reason I
>> wanted to put a demo of OptionTypes.jl out in front of other folks.
>>
>>
>>
>> * I’m not clear how we could come to know that a datum is not missing
>> without a resolution step that’s effectively equivalent to the get()
>> function for Option{T}. I agree that the enforced use of get() means that
>> you can’t hope to use generic functions like sum on collections of
>> Option{T}. But I’m also not sure that’s such a bad thing: I think the
>> easiest way to express to the compiler that you know that all of the entries
>> of a DataArray are not NA is to convert the DataArray to a straight Array.
>> But maybe you have other mechanisms for expressing this knowledge. Certainly
>> my proposal to do conversions to Arrays isn’t the most elegant strategy.
>> It’s just all that I’ve got so far.
>>
>>
>>
>> * I kind of like the idea of Option{T} standing outside of the main type
>> system in a kind of mirror type system. I’m less happy about Union(NA, T)
>> being a super type of T, even though there are some good reasons that you’d
>> like to view T as a specialization of Union(NA, T). But I agree that I don’t
>> have a good feel about where missing data belongs in the type hierarchy.
>> This is another question for which I’d love to get input from others.
>>
>>
>>
>> In regard to Simon’s performance points:
>>
>>
>>
>> * Yes, memory usage alone argues strongly for working with DataArray{T}
>> rather than Array{Option{T}}.
>>
>>
>>
>> * Exploting tricks that make operations like anyna() faster is another
>> good argument for keeping DataArray{T} around.
>>
>>
>>
>> * I’m not sure how to deal with inlining concerns or the undefined
>> reference checks. Do you have ideas for improving this within DataArrays or
>> do we need supporting changes in the compiler?
>>
>> Actually it seems it would be possible to make Array{Union(NAtype, T)}
>> more similar to and as efficient as DataArray{T}, by handling a few things
>> in the compiler. This would create a generalization of DataArray to any kind
>> of union type, which could be useful in other contexts. But more
>> importantly, it would make missing values integrate seamlessly into Julia,
>> getting rid of any hacks.
>>
>> More specifically, the following features would need to be supported:
>> - a way of telling the compiler to store the data as two arrays of
>> concrete types (here T and NAtype), instead of as an array of boxed values,
>> so that:
>>     * efficient operations can be performed on the T values (by skipping
>> the missing ones manually)
>>     * T values are stored as a dense array and can be converted to
>> Array{T} without any copy or passed to BLAS when no missing values are
>> present
>>    * NA values can be packed in a BitArray to save memory and make NA
>> detection faster (see below)
>> - a fonction to check whether a given element of the array is of type T
>> rather than of NAtype (generalization of isna())
>> - a fonction to check whether all elements of the array are of type T
>> rather than of NAtype (generalization of anyna(), more efficient than
>> calling the previous function on all elements thanks to the packing of NAs
>> in a BitArray)
>> In this scheme, what is missing is how to allow the compiler to pack NAs
>> in a BitArray. Somehow, NAtype would have to be defined as a 1-bit object.
>> Maybe by making it an enum-like immutable with a 1-bit field inside it?
>>
>> How does it sound?
>
>
> I've thought a bit about this, but it seems like it would be too much
> complexity in the compiler. Storing arrays as something besides contiguous
> elements and interaction between the codegen in C/C++ and the BitArray code
> in Julia both seem likely to be painful, although Jeff, Keno, and Jameson
> would know better than I. Additionally, this optimization (of storage of
> arrays of unions of a singleton type and a bits type) seems pretty specific
> to DataArrays, but the actual advantages in terms of performance and
> expressibility would be small or non-existent. (This is in contrast to
> optimizing storage/dispatch with union types, which could benefit a lot of
> code and is something a lot of languages do.) Finally, there are cases where
> it is useful to have direct access to the na BitArray chunks beyond anyna,
> e.g. pairwise summation and reductions across the first dimension.
>
> Simon

Reply via email to