As usual, I agree with Keno :)

We could also implement optimizations for Union(Bits,OtherBits). In
theory this can be stack allocated along with a boolean flag that says
which one it is. However to take full advantage of this it seems you
need to generate lots of branches with code for each case. Possible
but tricky.

There is also a strong connection to the general
array-of-structs-to-struct-of-arrays optimization. It would not be
totally crazy to build this in to our object representation somehow,
or add hooks allowing customization of data representations, like
staged functions but for data instead of code.


On Fri, Aug 1, 2014 at 2:54 PM, Jameson Nash <[email protected]> wrote:
> I could (and probably will, someday) revive that commit. At the time,
> though, I seemed to find that it provided little performance benefit -- the
> gc cost of allocating boxes was far greater (for type uncertainty involving
> bitstypes) and the type dispatch wasn't as much of a performance impact as I
> had previously assumed.
>
>
> On Friday, August 1, 2014, Keno Fischer <[email protected]>
> wrote:
>>
>> It is possible to do generic compiler improvements for Union types
>> (Jameson had a branch at some point that did callsite splitting if we
>> inferred a Union type). However, I think the best way to go here is to
>> maintain the current separation of two arrays (one of the values one
>> for the NAs), but give an option type on access. The option type would
>> then most likely be in memory and wouldn't have much overhead. Please
>> let me know if there's anything specific I should explain how the
>> compiler will handle it, I admit I have only skimmed this thread.
>>
>> On Fri, Aug 1, 2014 at 9:18 AM, Simon Kornblith <[email protected]>
>> wrote:
>> > On Friday, August 1, 2014 6:23:59 AM UTC-4, Milan Bouchet-Valat wrote:
>> >>
>> >> Le jeudi 31 juillet 2014 à 21:19 -0700, John Myles White a écrit :
>> >>
>> >> To address Simon’s general points, which are really good reasons to
>> >> avoid
>> >> jumping on the Option{T} bandwagon too soon:
>> >>
>> >>
>> >>
>> >> * I agree that most languages use tagged union types for Option{T}
>> >> rather
>> >> than a wrapper type that contains a Boolean value. It’s also totally
>> >> true
>> >> that many compilers are able to make those constructs more efficient
>> >> than
>> >> Julia currently does. But what we should expect from Julia in the
>> >> coming
>> >> years isn’t so clear to me. (And I personally think we need to settle
>> >> on a
>> >> solution for representing missing data that’s viable in a year rather
>> >> than
>> >> viable in five years.) This is an issue that I’d really like to have
>> >> input
>> >> on from Jeff, Keno, Jameson or someone else involved with the internals
>> >> of
>> >> the compiler. Getting input from the broader community is the main
>> >> reason I
>> >> wanted to put a demo of OptionTypes.jl out in front of other folks.
>> >>
>> >>
>> >>
>> >> * I’m not clear how we could come to know that a datum is not missing
>> >> without a resolution step that’s effectively equivalent to the get()
>> >> function for Option{T}. I agree that the enforced use of get() means
>> >> that
>> >> you can’t hope to use generic functions like sum on collections of
>> >> Option{T}. But I’m also not sure that’s such a bad thing: I think the
>> >> easiest way to express to the compiler that you know that all of the
>> >> entries
>> >> of a DataArray are not NA is to convert the DataArray to a straight
>> >> Array.
>> >> But maybe you have other mechanisms for expressing this knowledge.
>> >> Certainly
>> >> my proposal to do conversions to Arrays isn’t the most elegant
>> >> strategy.
>> >> It’s just all that I’ve got so far.
>> >>
>> >>
>> >>
>> >> * I kind of like the idea of Option{T} standing outside of the main
>> >> type
>> >> system in a kind of mirror type system. I’m less happy about Union(NA,
>> >> T)
>> >> being a super type of T, even though there are some good reasons that
>> >> you’d
>> >> like to view T as a specialization of Union(NA, T). But I agree that I
>> >> don’t
>> >> have a good feel about where missing data belongs in the type
>> >> hierarchy.
>> >> This is another question for which I’d love to get input from others.
>> >>
>> >>
>> >>
>> >> In regard to Simon’s performance points:
>> >>
>> >>
>> >>
>> >> * Yes, memory usage alone argues strongly for working with DataArray{T}
>> >> rather than Array{Option{T}}.
>> >>
>> >>
>> >>
>> >> * Exploting tricks that make operations like anyna() faster is another
>> >> good argument for keeping DataArray{T} around.
>> >>
>> >>
>> >>
>> >> * I’m not sure how to deal with inlining concerns or the undefined
>> >> reference checks. Do you have ideas for improving this within
>> >> DataArrays or
>> >> do we need supporting changes in the compiler?
>> >>
>> >> Actually it seems it would be possible to make Array{Union(NAtype, T)}
>> >> more similar to and as efficient as DataArray{T}, by handling a few
>> >> things
>> >> in the compiler. This would create a generalization of DataArray to any
>> >> kind
>> >> of union type, which could be useful in other contexts. But more
>> >> importantly, it would make missing values integrate seamlessly into
>> >> Julia,
>> >> getting rid of any hacks.
>> >>
>> >> More specifically, the following features would need to be supported:
>> >> - a way of telling the compiler to store the data as two arrays of
>> >> concrete types (here T and NAtype), instead of as an array of boxed
>> >> values,
>> >> so that:
>> >>     * efficient operations can be performed on the T values (by
>> >> skipping
>> >> the missing ones manually)
>> >>     * T values are stored as a dense array and can be converted to
>> >> Array{T} without any copy or passed to BLAS when no missing values are
>> >> present
>> >>    * NA values can be packed in a BitArray to save memory and make NA
>> >> detection faster (see below)
>> >> - a fonction to check whether a given element of the array is of type T
>> >> rather than of NAtype (generalization of isna())
>> >> - a fonction to check whether all elements of the array are of type T
>> >> rather than of NAtype (generalization of anyna(), more efficient than
>> >> calling the previous function on all elements thanks to the packing of
>> >> NAs
>> >> in a BitArray)
>> >> In this scheme, what is missing is how to allow the compiler to pack
>> >> NAs
>> >> in a BitArray. Somehow, NAtype would have to be defined as a 1-bit
>> >> object.
>> >> Maybe by making it an enum-like immutable with a 1-bit field inside it?
>> >>
>> >> How does it sound?
>> >
>> >
>> > I've thought a bit about this, but it seems like it would be too much
>> > complexity in the compiler. Storing arrays as something besides
>> > contiguous
>> > elements and interaction between the codegen in C/C++ and the BitArray
>> > code
>> > in Julia both seem likely to be painful, although Jeff, Keno, and
>> > Jameson
>> > would know better than I. Additionally, this optimization (of storage of
>> > arrays of unions of a singleton type and a bits type) seems pretty
>> > specific
>> > to DataArrays, but the actual advantages in terms of performance and
>> > expressibility would be small or non-existent. (This is in contrast to
>> > optimizing storage/dispatch with union types, which could benefit a lot
>> > of
>> > code and is something a lot of languages do.) Finally, there are cases
>> > where
>> > it is useful to have direct access to the na BitArray chunks beyond
>> > anyna,
>> > e.g. pairwise summation and reductions across the first dimension.
>> >
>> > Simon

Reply via email to