Nope -- that the right idea. Just pointing out that you get more value from
improving the storage representation than from specializing union call sites

On Friday, August 1, 2014, Simon Kornblith <[email protected]> wrote:

> Is there a reason we can't change the way unions of small bits types are
> represented, so that if we know something is a Union(Float64,NA) it can
> live in registers or on the stack instead of having to be heap allocated?
>
> On Friday, August 1, 2014 2:54:49 PM UTC-4, Jameson wrote:
>>
>> I could (and probably will, someday) revive that commit. At the time,
>> though, I seemed to find that it provided little performance benefit -- the
>> gc cost of allocating boxes was far greater (for type uncertainty involving
>> bitstypes) and the type dispatch wasn't as much of a performance impact as
>> I had previously assumed.
>>
>>
>> On Friday, August 1, 2014, Keno Fischer <[email protected]>
>> wrote:
>>
>>> It is possible to do generic compiler improvements for Union types
>>> (Jameson had a branch at some point that did callsite splitting if we
>>> inferred a Union type). However, I think the best way to go here is to
>>> maintain the current separation of two arrays (one of the values one
>>> for the NAs), but give an option type on access. The option type would
>>> then most likely be in memory and wouldn't have much overhead. Please
>>> let me know if there's anything specific I should explain how the
>>> compiler will handle it, I admit I have only skimmed this thread.
>>>
>>> On Fri, Aug 1, 2014 at 9:18 AM, Simon Kornblith <[email protected]>
>>> wrote:
>>> > On Friday, August 1, 2014 6:23:59 AM UTC-4, Milan Bouchet-Valat wrote:
>>> >>
>>> >> Le jeudi 31 juillet 2014 à 21:19 -0700, John Myles White a écrit :
>>> >>
>>> >> To address Simon’s general points, which are really good reasons to
>>> avoid
>>> >> jumping on the Option{T} bandwagon too soon:
>>> >>
>>> >>
>>> >>
>>> >> * I agree that most languages use tagged union types for Option{T}
>>> rather
>>> >> than a wrapper type that contains a Boolean value. It’s also totally
>>> true
>>> >> that many compilers are able to make those constructs more efficient
>>> than
>>> >> Julia currently does. But what we should expect from Julia in the
>>> coming
>>> >> years isn’t so clear to me. (And I personally think we need to settle
>>> on a
>>> >> solution for representing missing data that’s viable in a year rather
>>> than
>>> >> viable in five years.) This is an issue that I’d really like to have
>>> input
>>> >> on from Jeff, Keno, Jameson or someone else involved with the
>>> internals of
>>> >> the compiler. Getting input from the broader community is the main
>>> reason I
>>> >> wanted to put a demo of OptionTypes.jl out in front of other folks.
>>> >>
>>> >>
>>> >>
>>> >> * I’m not clear how we could come to know that a datum is not missing
>>> >> without a resolution step that’s effectively equivalent to the get()
>>> >> function for Option{T}. I agree that the enforced use of get() means
>>> that
>>> >> you can’t hope to use generic functions like sum on collections of
>>> >> Option{T}. But I’m also not sure that’s such a bad thing: I think the
>>> >> easiest way to express to the compiler that you know that all of the
>>> entries
>>> >> of a DataArray are not NA is to convert the DataArray to a straight
>>> Array.
>>> >> But maybe you have other mechanisms for expressing this knowledge.
>>> Certainly
>>> >> my proposal to do conversions to Arrays isn’t the most elegant
>>> strategy.
>>> >> It’s just all that I’ve got so far.
>>> >>
>>> >>
>>> >>
>>> >> * I kind of like the idea of Option{T} standing outside of the main
>>> type
>>> >> system in a kind of mirror type system. I’m less happy about
>>> Union(NA, T)
>>> >> being a super type of T, even though there are some good reasons that
>>> you’d
>>> >> like to view T as a specialization of Union(NA, T). But I agree that
>>> I don’t
>>> >> have a good feel about where missing data belongs in the type
>>> hierarchy.
>>> >> This is another question for which I’d love to get input from others.
>>> >>
>>> >>
>>> >>
>>> >> In regard to Simon’s performance points:
>>> >>
>>> >>
>>> >>
>>> >> * Yes, memory usage alone argues strongly for working with
>>> DataArray{T}
>>> >> rather than Array{Option{T}}.
>>> >>
>>> >>
>>> >>
>>> >> * Exploting tricks that make operations like anyna() faster is another
>>> >> good argument for keeping DataArray{T} around.
>>> >>
>>> >>
>>> >>
>>> >> * I’m not sure how to deal with inlining concerns or the undefined
>>> >> reference checks. Do you have ideas for improving this within
>>> DataArrays or
>>> >> do we need supporting changes in the compiler?
>>> >>
>>> >> Actually it seems it would be possible to make Array{Union(NAtype, T)}
>>> >> more similar to and as efficient as DataArray{T}, by handling a few
>>> things
>>> >> in the compiler. This would create a generalization of DataArray to
>>> any kind
>>> >> of union type, which could be useful in other contexts. But more
>>> >> importantly, it would make missing values integrate seamlessly into
>>> Julia,
>>> >> getting rid of any hacks.
>>> >>
>>> >> More specifically, the following features would need to be supported:
>>> >> - a way of telling the compiler to store the data as two arrays of
>>> >> concrete types (here T and NAtype), instead of as an array of boxed
>>> values,
>>> >> so that:
>>> >>     * efficient operations can be performed on the T values (by
>>> skipping
>>> >> the missing ones manually)
>>> >>     * T values are stored as a dense array and can be converted to
>>> >> Array{T} without any copy or passed to BLAS when no missing values are
>>> >> present
>>> >>    * NA values can be packed in a BitArray to save memory and make NA
>>> >> detection faster (see below)
>>> >> - a fonction to check whether a given element of the array is of type
>>> T
>>> >> rather than of NAtype (generalization of isna())
>>> >> - a fonction to check whether all elements of the array are of type T
>>> >> rather than of NAtype (generalization of anyna(), more efficient than
>>> >> calling the previous function on all elements thanks to the packing
>>> of NAs
>>> >> in a BitArray)
>>> >> In this scheme, what is missing is how to allow the compiler to pack
>>> NAs
>>> >> in a BitArray. Somehow, NAtype would have to be defined as a 1-bit
>>> object.
>>> >> Maybe by making it an enum-like immutable with a 1-bit field inside
>>> it?
>>> >>
>>> >> How does it sound?
>>> >
>>> >
>>> > I've thought a bit about this, but it seems like it would be too much
>>> > complexity in the compiler. Storing arrays as something besides
>>> contiguous
>>> > elements and interaction between the codegen in C/C++ and the BitArray
>>> code
>>> > in Julia both seem likely to be painful, although Jeff, Keno, and
>>> Jameson
>>> > would know better than I. Additionally, this optimization (of storage
>>> of
>>> > arrays of unions of a singleton type and a bits type) seems pretty
>>> specific
>>> > to DataArrays, but the actual advantages in terms of performance and
>>> > expressibility would be small or non-existent. (This is in contrast to
>>> > optimizing storage/dispatch with union types, which could benefit a
>>> lot of
>>> > code and is something a lot of languages do.) Finally, there are cases
>>> where
>>> > it is useful to have direct access to the na BitArray chunks beyond
>>> anyna,
>>> > e.g. pairwise summation and reductions across the first dimension.
>>> >
>>> > Simon
>>>
>>

Reply via email to