Is there a reason we can't change the way unions of small bits types are 
represented, so that if we know something is a Union(Float64,NA) it can 
live in registers or on the stack instead of having to be heap allocated?

On Friday, August 1, 2014 2:54:49 PM UTC-4, Jameson wrote:
>
> I could (and probably will, someday) revive that commit. At the time, 
> though, I seemed to find that it provided little performance benefit -- the 
> gc cost of allocating boxes was far greater (for type uncertainty involving 
> bitstypes) and the type dispatch wasn't as much of a performance impact as 
> I had previously assumed. 
>
>
> On Friday, August 1, 2014, Keno Fischer <[email protected] 
> <javascript:>> wrote:
>
>> It is possible to do generic compiler improvements for Union types
>> (Jameson had a branch at some point that did callsite splitting if we
>> inferred a Union type). However, I think the best way to go here is to
>> maintain the current separation of two arrays (one of the values one
>> for the NAs), but give an option type on access. The option type would
>> then most likely be in memory and wouldn't have much overhead. Please
>> let me know if there's anything specific I should explain how the
>> compiler will handle it, I admit I have only skimmed this thread.
>>
>> On Fri, Aug 1, 2014 at 9:18 AM, Simon Kornblith <[email protected]> 
>> wrote:
>> > On Friday, August 1, 2014 6:23:59 AM UTC-4, Milan Bouchet-Valat wrote:
>> >>
>> >> Le jeudi 31 juillet 2014 à 21:19 -0700, John Myles White a écrit :
>> >>
>> >> To address Simon’s general points, which are really good reasons to 
>> avoid
>> >> jumping on the Option{T} bandwagon too soon:
>> >>
>> >>
>> >>
>> >> * I agree that most languages use tagged union types for Option{T} 
>> rather
>> >> than a wrapper type that contains a Boolean value. It’s also totally 
>> true
>> >> that many compilers are able to make those constructs more efficient 
>> than
>> >> Julia currently does. But what we should expect from Julia in the 
>> coming
>> >> years isn’t so clear to me. (And I personally think we need to settle 
>> on a
>> >> solution for representing missing data that’s viable in a year rather 
>> than
>> >> viable in five years.) This is an issue that I’d really like to have 
>> input
>> >> on from Jeff, Keno, Jameson or someone else involved with the 
>> internals of
>> >> the compiler. Getting input from the broader community is the main 
>> reason I
>> >> wanted to put a demo of OptionTypes.jl out in front of other folks.
>> >>
>> >>
>> >>
>> >> * I’m not clear how we could come to know that a datum is not missing
>> >> without a resolution step that’s effectively equivalent to the get()
>> >> function for Option{T}. I agree that the enforced use of get() means 
>> that
>> >> you can’t hope to use generic functions like sum on collections of
>> >> Option{T}. But I’m also not sure that’s such a bad thing: I think the
>> >> easiest way to express to the compiler that you know that all of the 
>> entries
>> >> of a DataArray are not NA is to convert the DataArray to a straight 
>> Array.
>> >> But maybe you have other mechanisms for expressing this knowledge. 
>> Certainly
>> >> my proposal to do conversions to Arrays isn’t the most elegant 
>> strategy.
>> >> It’s just all that I’ve got so far.
>> >>
>> >>
>> >>
>> >> * I kind of like the idea of Option{T} standing outside of the main 
>> type
>> >> system in a kind of mirror type system. I’m less happy about Union(NA, 
>> T)
>> >> being a super type of T, even though there are some good reasons that 
>> you’d
>> >> like to view T as a specialization of Union(NA, T). But I agree that I 
>> don’t
>> >> have a good feel about where missing data belongs in the type 
>> hierarchy.
>> >> This is another question for which I’d love to get input from others.
>> >>
>> >>
>> >>
>> >> In regard to Simon’s performance points:
>> >>
>> >>
>> >>
>> >> * Yes, memory usage alone argues strongly for working with DataArray{T}
>> >> rather than Array{Option{T}}.
>> >>
>> >>
>> >>
>> >> * Exploting tricks that make operations like anyna() faster is another
>> >> good argument for keeping DataArray{T} around.
>> >>
>> >>
>> >>
>> >> * I’m not sure how to deal with inlining concerns or the undefined
>> >> reference checks. Do you have ideas for improving this within 
>> DataArrays or
>> >> do we need supporting changes in the compiler?
>> >>
>> >> Actually it seems it would be possible to make Array{Union(NAtype, T)}
>> >> more similar to and as efficient as DataArray{T}, by handling a few 
>> things
>> >> in the compiler. This would create a generalization of DataArray to 
>> any kind
>> >> of union type, which could be useful in other contexts. But more
>> >> importantly, it would make missing values integrate seamlessly into 
>> Julia,
>> >> getting rid of any hacks.
>> >>
>> >> More specifically, the following features would need to be supported:
>> >> - a way of telling the compiler to store the data as two arrays of
>> >> concrete types (here T and NAtype), instead of as an array of boxed 
>> values,
>> >> so that:
>> >>     * efficient operations can be performed on the T values (by 
>> skipping
>> >> the missing ones manually)
>> >>     * T values are stored as a dense array and can be converted to
>> >> Array{T} without any copy or passed to BLAS when no missing values are
>> >> present
>> >>    * NA values can be packed in a BitArray to save memory and make NA
>> >> detection faster (see below)
>> >> - a fonction to check whether a given element of the array is of type T
>> >> rather than of NAtype (generalization of isna())
>> >> - a fonction to check whether all elements of the array are of type T
>> >> rather than of NAtype (generalization of anyna(), more efficient than
>> >> calling the previous function on all elements thanks to the packing of 
>> NAs
>> >> in a BitArray)
>> >> In this scheme, what is missing is how to allow the compiler to pack 
>> NAs
>> >> in a BitArray. Somehow, NAtype would have to be defined as a 1-bit 
>> object.
>> >> Maybe by making it an enum-like immutable with a 1-bit field inside it?
>> >>
>> >> How does it sound?
>> >
>> >
>> > I've thought a bit about this, but it seems like it would be too much
>> > complexity in the compiler. Storing arrays as something besides 
>> contiguous
>> > elements and interaction between the codegen in C/C++ and the BitArray 
>> code
>> > in Julia both seem likely to be painful, although Jeff, Keno, and 
>> Jameson
>> > would know better than I. Additionally, this optimization (of storage of
>> > arrays of unions of a singleton type and a bits type) seems pretty 
>> specific
>> > to DataArrays, but the actual advantages in terms of performance and
>> > expressibility would be small or non-existent. (This is in contrast to
>> > optimizing storage/dispatch with union types, which could benefit a lot 
>> of
>> > code and is something a lot of languages do.) Finally, there are cases 
>> where
>> > it is useful to have direct access to the na BitArray chunks beyond 
>> anyna,
>> > e.g. pairwise summation and reductions across the first dimension.
>> >
>> > Simon
>>
>

Reply via email to