Just wanted to come back to this thread now that I’m back from vacation. It 
sounds like the consensus of Jeff, Stefan, Keno and Jameson is that we’re 
better off working with an explicit Option{T} type than trying to get the 
compiler to handle Union(S, T) more efficiently.

If people are willing to accept that idea, I’d like to make the use of Option’s 
a priority for a release of DataArrays that would accompany Julia 0.4. There’s 
going to be a lot of changes to Julia’s core with that release, so it seems 
like the perfect time to make some breaking changes to JuliaData packages.

After thinking about the type hierarchy more, I’ve come to really like the 
interpretion of Option{T} as a 0-or-1 element container type. In that 
interpretation, Option{T} is to T exactly as Array{T} is to T, which is a 
relationship that has only horizontal links and no vertical links. I think that 
perspective simplifies things a lot, whie articulating the core issues involved 
with working with Option{T}.

For me, the main question is whether we want Option types to be a feature 
that’s shared by people who want to express NULL pointers (i.e. ontological 
missingness) or whether we want to customize things for statistical missingness 
(i.e. epistemological missingness). As it stands, OptionTypes.jl implements 
such a bare-bones version of missingness that it could be safely used for 
either purpose.

As for the naming debate, I think NullableTypes with a new type called 
Nullable{T} is the way to go.

 — John

On Aug 1, 2014, at 1:43 PM, Stefan Karpinski <[email protected]> wrote:

> That also strikes me as the best approach for what it's worth – just use 
> option/maybe/nullable for what you return when indexing into a DataArray but 
> keep the DataArray storage as two separate arrays.
> 
> 
> On Fri, Aug 1, 2014 at 3:29 PM, Jeff Bezanson <[email protected]> wrote:
> As usual, I agree with Keno :)
> 
> We could also implement optimizations for Union(Bits,OtherBits). In
> theory this can be stack allocated along with a boolean flag that says
> which one it is. However to take full advantage of this it seems you
> need to generate lots of branches with code for each case. Possible
> but tricky.
> 
> There is also a strong connection to the general
> array-of-structs-to-struct-of-arrays optimization. It would not be
> totally crazy to build this in to our object representation somehow,
> or add hooks allowing customization of data representations, like
> staged functions but for data instead of code.
> 
> 
> On Fri, Aug 1, 2014 at 2:54 PM, Jameson Nash <[email protected]> wrote:
> > I could (and probably will, someday) revive that commit. At the time,
> > though, I seemed to find that it provided little performance benefit -- the
> > gc cost of allocating boxes was far greater (for type uncertainty involving
> > bitstypes) and the type dispatch wasn't as much of a performance impact as I
> > had previously assumed.
> >
> >
> > On Friday, August 1, 2014, Keno Fischer <[email protected]>
> > wrote:
> >>
> >> It is possible to do generic compiler improvements for Union types
> >> (Jameson had a branch at some point that did callsite splitting if we
> >> inferred a Union type). However, I think the best way to go here is to
> >> maintain the current separation of two arrays (one of the values one
> >> for the NAs), but give an option type on access. The option type would
> >> then most likely be in memory and wouldn't have much overhead. Please
> >> let me know if there's anything specific I should explain how the
> >> compiler will handle it, I admit I have only skimmed this thread.
> >>
> >> On Fri, Aug 1, 2014 at 9:18 AM, Simon Kornblith <[email protected]>
> >> wrote:
> >> > On Friday, August 1, 2014 6:23:59 AM UTC-4, Milan Bouchet-Valat wrote:
> >> >>
> >> >> Le jeudi 31 juillet 2014 à 21:19 -0700, John Myles White a écrit :
> >> >>
> >> >> To address Simon’s general points, which are really good reasons to
> >> >> avoid
> >> >> jumping on the Option{T} bandwagon too soon:
> >> >>
> >> >>
> >> >>
> >> >> * I agree that most languages use tagged union types for Option{T}
> >> >> rather
> >> >> than a wrapper type that contains a Boolean value. It’s also totally
> >> >> true
> >> >> that many compilers are able to make those constructs more efficient
> >> >> than
> >> >> Julia currently does. But what we should expect from Julia in the
> >> >> coming
> >> >> years isn’t so clear to me. (And I personally think we need to settle
> >> >> on a
> >> >> solution for representing missing data that’s viable in a year rather
> >> >> than
> >> >> viable in five years.) This is an issue that I’d really like to have
> >> >> input
> >> >> on from Jeff, Keno, Jameson or someone else involved with the internals
> >> >> of
> >> >> the compiler. Getting input from the broader community is the main
> >> >> reason I
> >> >> wanted to put a demo of OptionTypes.jl out in front of other folks.
> >> >>
> >> >>
> >> >>
> >> >> * I’m not clear how we could come to know that a datum is not missing
> >> >> without a resolution step that’s effectively equivalent to the get()
> >> >> function for Option{T}. I agree that the enforced use of get() means
> >> >> that
> >> >> you can’t hope to use generic functions like sum on collections of
> >> >> Option{T}. But I’m also not sure that’s such a bad thing: I think the
> >> >> easiest way to express to the compiler that you know that all of the
> >> >> entries
> >> >> of a DataArray are not NA is to convert the DataArray to a straight
> >> >> Array.
> >> >> But maybe you have other mechanisms for expressing this knowledge.
> >> >> Certainly
> >> >> my proposal to do conversions to Arrays isn’t the most elegant
> >> >> strategy.
> >> >> It’s just all that I’ve got so far.
> >> >>
> >> >>
> >> >>
> >> >> * I kind of like the idea of Option{T} standing outside of the main
> >> >> type
> >> >> system in a kind of mirror type system. I’m less happy about Union(NA,
> >> >> T)
> >> >> being a super type of T, even though there are some good reasons that
> >> >> you’d
> >> >> like to view T as a specialization of Union(NA, T). But I agree that I
> >> >> don’t
> >> >> have a good feel about where missing data belongs in the type
> >> >> hierarchy.
> >> >> This is another question for which I’d love to get input from others.
> >> >>
> >> >>
> >> >>
> >> >> In regard to Simon’s performance points:
> >> >>
> >> >>
> >> >>
> >> >> * Yes, memory usage alone argues strongly for working with DataArray{T}
> >> >> rather than Array{Option{T}}.
> >> >>
> >> >>
> >> >>
> >> >> * Exploting tricks that make operations like anyna() faster is another
> >> >> good argument for keeping DataArray{T} around.
> >> >>
> >> >>
> >> >>
> >> >> * I’m not sure how to deal with inlining concerns or the undefined
> >> >> reference checks. Do you have ideas for improving this within
> >> >> DataArrays or
> >> >> do we need supporting changes in the compiler?
> >> >>
> >> >> Actually it seems it would be possible to make Array{Union(NAtype, T)}
> >> >> more similar to and as efficient as DataArray{T}, by handling a few
> >> >> things
> >> >> in the compiler. This would create a generalization of DataArray to any
> >> >> kind
> >> >> of union type, which could be useful in other contexts. But more
> >> >> importantly, it would make missing values integrate seamlessly into
> >> >> Julia,
> >> >> getting rid of any hacks.
> >> >>
> >> >> More specifically, the following features would need to be supported:
> >> >> - a way of telling the compiler to store the data as two arrays of
> >> >> concrete types (here T and NAtype), instead of as an array of boxed
> >> >> values,
> >> >> so that:
> >> >>     * efficient operations can be performed on the T values (by
> >> >> skipping
> >> >> the missing ones manually)
> >> >>     * T values are stored as a dense array and can be converted to
> >> >> Array{T} without any copy or passed to BLAS when no missing values are
> >> >> present
> >> >>    * NA values can be packed in a BitArray to save memory and make NA
> >> >> detection faster (see below)
> >> >> - a fonction to check whether a given element of the array is of type T
> >> >> rather than of NAtype (generalization of isna())
> >> >> - a fonction to check whether all elements of the array are of type T
> >> >> rather than of NAtype (generalization of anyna(), more efficient than
> >> >> calling the previous function on all elements thanks to the packing of
> >> >> NAs
> >> >> in a BitArray)
> >> >> In this scheme, what is missing is how to allow the compiler to pack
> >> >> NAs
> >> >> in a BitArray. Somehow, NAtype would have to be defined as a 1-bit
> >> >> object.
> >> >> Maybe by making it an enum-like immutable with a 1-bit field inside it?
> >> >>
> >> >> How does it sound?
> >> >
> >> >
> >> > I've thought a bit about this, but it seems like it would be too much
> >> > complexity in the compiler. Storing arrays as something besides
> >> > contiguous
> >> > elements and interaction between the codegen in C/C++ and the BitArray
> >> > code
> >> > in Julia both seem likely to be painful, although Jeff, Keno, and
> >> > Jameson
> >> > would know better than I. Additionally, this optimization (of storage of
> >> > arrays of unions of a singleton type and a bits type) seems pretty
> >> > specific
> >> > to DataArrays, but the actual advantages in terms of performance and
> >> > expressibility would be small or non-existent. (This is in contrast to
> >> > optimizing storage/dispatch with union types, which could benefit a lot
> >> > of
> >> > code and is something a lot of languages do.) Finally, there are cases
> >> > where
> >> > it is useful to have direct access to the na BitArray chunks beyond
> >> > anyna,
> >> > e.g. pairwise summation and reductions across the first dimension.
> >> >
> >> > Simon
> 

Reply via email to