That also strikes me as the best approach for what it's worth – just use
option/maybe/nullable for what you return when indexing into a DataArray
but keep the DataArray storage as two separate arrays.


On Fri, Aug 1, 2014 at 3:29 PM, Jeff Bezanson <[email protected]>
wrote:

> As usual, I agree with Keno :)
>
> We could also implement optimizations for Union(Bits,OtherBits). In
> theory this can be stack allocated along with a boolean flag that says
> which one it is. However to take full advantage of this it seems you
> need to generate lots of branches with code for each case. Possible
> but tricky.
>
> There is also a strong connection to the general
> array-of-structs-to-struct-of-arrays optimization. It would not be
> totally crazy to build this in to our object representation somehow,
> or add hooks allowing customization of data representations, like
> staged functions but for data instead of code.
>
>
> On Fri, Aug 1, 2014 at 2:54 PM, Jameson Nash <[email protected]> wrote:
> > I could (and probably will, someday) revive that commit. At the time,
> > though, I seemed to find that it provided little performance benefit --
> the
> > gc cost of allocating boxes was far greater (for type uncertainty
> involving
> > bitstypes) and the type dispatch wasn't as much of a performance impact
> as I
> > had previously assumed.
> >
> >
> > On Friday, August 1, 2014, Keno Fischer <[email protected]>
> > wrote:
> >>
> >> It is possible to do generic compiler improvements for Union types
> >> (Jameson had a branch at some point that did callsite splitting if we
> >> inferred a Union type). However, I think the best way to go here is to
> >> maintain the current separation of two arrays (one of the values one
> >> for the NAs), but give an option type on access. The option type would
> >> then most likely be in memory and wouldn't have much overhead. Please
> >> let me know if there's anything specific I should explain how the
> >> compiler will handle it, I admit I have only skimmed this thread.
> >>
> >> On Fri, Aug 1, 2014 at 9:18 AM, Simon Kornblith <[email protected]>
> >> wrote:
> >> > On Friday, August 1, 2014 6:23:59 AM UTC-4, Milan Bouchet-Valat wrote:
> >> >>
> >> >> Le jeudi 31 juillet 2014 à 21:19 -0700, John Myles White a écrit :
> >> >>
> >> >> To address Simon’s general points, which are really good reasons to
> >> >> avoid
> >> >> jumping on the Option{T} bandwagon too soon:
> >> >>
> >> >>
> >> >>
> >> >> * I agree that most languages use tagged union types for Option{T}
> >> >> rather
> >> >> than a wrapper type that contains a Boolean value. It’s also totally
> >> >> true
> >> >> that many compilers are able to make those constructs more efficient
> >> >> than
> >> >> Julia currently does. But what we should expect from Julia in the
> >> >> coming
> >> >> years isn’t so clear to me. (And I personally think we need to settle
> >> >> on a
> >> >> solution for representing missing data that’s viable in a year rather
> >> >> than
> >> >> viable in five years.) This is an issue that I’d really like to have
> >> >> input
> >> >> on from Jeff, Keno, Jameson or someone else involved with the
> internals
> >> >> of
> >> >> the compiler. Getting input from the broader community is the main
> >> >> reason I
> >> >> wanted to put a demo of OptionTypes.jl out in front of other folks.
> >> >>
> >> >>
> >> >>
> >> >> * I’m not clear how we could come to know that a datum is not missing
> >> >> without a resolution step that’s effectively equivalent to the get()
> >> >> function for Option{T}. I agree that the enforced use of get() means
> >> >> that
> >> >> you can’t hope to use generic functions like sum on collections of
> >> >> Option{T}. But I’m also not sure that’s such a bad thing: I think the
> >> >> easiest way to express to the compiler that you know that all of the
> >> >> entries
> >> >> of a DataArray are not NA is to convert the DataArray to a straight
> >> >> Array.
> >> >> But maybe you have other mechanisms for expressing this knowledge.
> >> >> Certainly
> >> >> my proposal to do conversions to Arrays isn’t the most elegant
> >> >> strategy.
> >> >> It’s just all that I’ve got so far.
> >> >>
> >> >>
> >> >>
> >> >> * I kind of like the idea of Option{T} standing outside of the main
> >> >> type
> >> >> system in a kind of mirror type system. I’m less happy about
> Union(NA,
> >> >> T)
> >> >> being a super type of T, even though there are some good reasons that
> >> >> you’d
> >> >> like to view T as a specialization of Union(NA, T). But I agree that
> I
> >> >> don’t
> >> >> have a good feel about where missing data belongs in the type
> >> >> hierarchy.
> >> >> This is another question for which I’d love to get input from others.
> >> >>
> >> >>
> >> >>
> >> >> In regard to Simon’s performance points:
> >> >>
> >> >>
> >> >>
> >> >> * Yes, memory usage alone argues strongly for working with
> DataArray{T}
> >> >> rather than Array{Option{T}}.
> >> >>
> >> >>
> >> >>
> >> >> * Exploting tricks that make operations like anyna() faster is
> another
> >> >> good argument for keeping DataArray{T} around.
> >> >>
> >> >>
> >> >>
> >> >> * I’m not sure how to deal with inlining concerns or the undefined
> >> >> reference checks. Do you have ideas for improving this within
> >> >> DataArrays or
> >> >> do we need supporting changes in the compiler?
> >> >>
> >> >> Actually it seems it would be possible to make Array{Union(NAtype,
> T)}
> >> >> more similar to and as efficient as DataArray{T}, by handling a few
> >> >> things
> >> >> in the compiler. This would create a generalization of DataArray to
> any
> >> >> kind
> >> >> of union type, which could be useful in other contexts. But more
> >> >> importantly, it would make missing values integrate seamlessly into
> >> >> Julia,
> >> >> getting rid of any hacks.
> >> >>
> >> >> More specifically, the following features would need to be supported:
> >> >> - a way of telling the compiler to store the data as two arrays of
> >> >> concrete types (here T and NAtype), instead of as an array of boxed
> >> >> values,
> >> >> so that:
> >> >>     * efficient operations can be performed on the T values (by
> >> >> skipping
> >> >> the missing ones manually)
> >> >>     * T values are stored as a dense array and can be converted to
> >> >> Array{T} without any copy or passed to BLAS when no missing values
> are
> >> >> present
> >> >>    * NA values can be packed in a BitArray to save memory and make NA
> >> >> detection faster (see below)
> >> >> - a fonction to check whether a given element of the array is of
> type T
> >> >> rather than of NAtype (generalization of isna())
> >> >> - a fonction to check whether all elements of the array are of type T
> >> >> rather than of NAtype (generalization of anyna(), more efficient than
> >> >> calling the previous function on all elements thanks to the packing
> of
> >> >> NAs
> >> >> in a BitArray)
> >> >> In this scheme, what is missing is how to allow the compiler to pack
> >> >> NAs
> >> >> in a BitArray. Somehow, NAtype would have to be defined as a 1-bit
> >> >> object.
> >> >> Maybe by making it an enum-like immutable with a 1-bit field inside
> it?
> >> >>
> >> >> How does it sound?
> >> >
> >> >
> >> > I've thought a bit about this, but it seems like it would be too much
> >> > complexity in the compiler. Storing arrays as something besides
> >> > contiguous
> >> > elements and interaction between the codegen in C/C++ and the BitArray
> >> > code
> >> > in Julia both seem likely to be painful, although Jeff, Keno, and
> >> > Jameson
> >> > would know better than I. Additionally, this optimization (of storage
> of
> >> > arrays of unions of a singleton type and a bits type) seems pretty
> >> > specific
> >> > to DataArrays, but the actual advantages in terms of performance and
> >> > expressibility would be small or non-existent. (This is in contrast to
> >> > optimizing storage/dispatch with union types, which could benefit a
> lot
> >> > of
> >> > code and is something a lot of languages do.) Finally, there are cases
> >> > where
> >> > it is useful to have direct access to the na BitArray chunks beyond
> >> > anyna,
> >> > e.g. pairwise summation and reductions across the first dimension.
> >> >
> >> > Simon
>

Reply via email to