Re: [julia-users] OptionTypes.jl

John Myles White Mon, 11 Aug 2014 22:10:35 -0700

Ok, I’ve cleaned this up and renamed it to NullableTypes.jl: 
https://github.com/johnmyleswhite/NullableTypes.jl


 — John

On Aug 11, 2014, at 9:17 PM, Stefan Karpinski <[email protected]> wrote:

> Sounds good. We could maybe include the Nullable type in Base and thus avoid 
> the issue of what to call the module. If we need a module, how about 
> Nullables? Although, I have to say that name makes me think of The 
> Expendables.
> 
> <expendables.png>
> 
> 
> On Tue, Aug 12, 2014 at 12:10 AM, John Myles White <[email protected]> 
> wrote:
> Just wanted to come back to this thread now that I’m back from vacation. It 
> sounds like the consensus of Jeff, Stefan, Keno and Jameson is that we’re 
> better off working with an explicit Option{T} type than trying to get the 
> compiler to handle Union(S, T) more efficiently.
> 
> If people are willing to accept that idea, I’d like to make the use of 
> Option’s a priority for a release of DataArrays that would accompany Julia 
> 0.4. There’s going to be a lot of changes to Julia’s core with that release, 
> so it seems like the perfect time to make some breaking changes to JuliaData 
> packages.
> 
> After thinking about the type hierarchy more, I’ve come to really like the 
> interpretion of Option{T} as a 0-or-1 element container type. In that 
> interpretation, Option{T} is to T exactly as Array{T} is to T, which is a 
> relationship that has only horizontal links and no vertical links. I think 
> that perspective simplifies things a lot, whie articulating the core issues 
> involved with working with Option{T}.
> 
> For me, the main question is whether we want Option types to be a feature 
> that’s shared by people who want to express NULL pointers (i.e. ontological 
> missingness) or whether we want to customize things for statistical 
> missingness (i.e. epistemological missingness). As it stands, OptionTypes.jl 
> implements such a bare-bones version of missingness that it could be safely 
> used for either purpose.
> 
> As for the naming debate, I think NullableTypes with a new type called 
> Nullable{T} is the way to go.
> 
>  — John
> 
> On Aug 1, 2014, at 1:43 PM, Stefan Karpinski <[email protected]> wrote:
> 
>> That also strikes me as the best approach for what it's worth – just use 
>> option/maybe/nullable for what you return when indexing into a DataArray but 
>> keep the DataArray storage as two separate arrays.
>> 
>> 
>> On Fri, Aug 1, 2014 at 3:29 PM, Jeff Bezanson <[email protected]> 
>> wrote:
>> As usual, I agree with Keno :)
>> 
>> We could also implement optimizations for Union(Bits,OtherBits). In
>> theory this can be stack allocated along with a boolean flag that says
>> which one it is. However to take full advantage of this it seems you
>> need to generate lots of branches with code for each case. Possible
>> but tricky.
>> 
>> There is also a strong connection to the general
>> array-of-structs-to-struct-of-arrays optimization. It would not be
>> totally crazy to build this in to our object representation somehow,
>> or add hooks allowing customization of data representations, like
>> staged functions but for data instead of code.
>> 
>> 
>> On Fri, Aug 1, 2014 at 2:54 PM, Jameson Nash <[email protected]> wrote:
>> > I could (and probably will, someday) revive that commit. At the time,
>> > though, I seemed to find that it provided little performance benefit -- the
>> > gc cost of allocating boxes was far greater (for type uncertainty involving
>> > bitstypes) and the type dispatch wasn't as much of a performance impact as 
>> > I
>> > had previously assumed.
>> >
>> >
>> > On Friday, August 1, 2014, Keno Fischer <[email protected]>
>> > wrote:
>> >>
>> >> It is possible to do generic compiler improvements for Union types
>> >> (Jameson had a branch at some point that did callsite splitting if we
>> >> inferred a Union type). However, I think the best way to go here is to
>> >> maintain the current separation of two arrays (one of the values one
>> >> for the NAs), but give an option type on access. The option type would
>> >> then most likely be in memory and wouldn't have much overhead. Please
>> >> let me know if there's anything specific I should explain how the
>> >> compiler will handle it, I admit I have only skimmed this thread.
>> >>
>> >> On Fri, Aug 1, 2014 at 9:18 AM, Simon Kornblith <[email protected]>
>> >> wrote:
>> >> > On Friday, August 1, 2014 6:23:59 AM UTC-4, Milan Bouchet-Valat wrote:
>> >> >>
>> >> >> Le jeudi 31 juillet 2014 à 21:19 -0700, John Myles White a écrit :
>> >> >>
>> >> >> To address Simon’s general points, which are really good reasons to
>> >> >> avoid
>> >> >> jumping on the Option{T} bandwagon too soon:
>> >> >>
>> >> >>
>> >> >>
>> >> >> * I agree that most languages use tagged union types for Option{T}
>> >> >> rather
>> >> >> than a wrapper type that contains a Boolean value. It’s also totally
>> >> >> true
>> >> >> that many compilers are able to make those constructs more efficient
>> >> >> than
>> >> >> Julia currently does. But what we should expect from Julia in the
>> >> >> coming
>> >> >> years isn’t so clear to me. (And I personally think we need to settle
>> >> >> on a
>> >> >> solution for representing missing data that’s viable in a year rather
>> >> >> than
>> >> >> viable in five years.) This is an issue that I’d really like to have
>> >> >> input
>> >> >> on from Jeff, Keno, Jameson or someone else involved with the internals
>> >> >> of
>> >> >> the compiler. Getting input from the broader community is the main
>> >> >> reason I
>> >> >> wanted to put a demo of OptionTypes.jl out in front of other folks.
>> >> >>
>> >> >>
>> >> >>
>> >> >> * I’m not clear how we could come to know that a datum is not missing
>> >> >> without a resolution step that’s effectively equivalent to the get()
>> >> >> function for Option{T}. I agree that the enforced use of get() means
>> >> >> that
>> >> >> you can’t hope to use generic functions like sum on collections of
>> >> >> Option{T}. But I’m also not sure that’s such a bad thing: I think the
>> >> >> easiest way to express to the compiler that you know that all of the
>> >> >> entries
>> >> >> of a DataArray are not NA is to convert the DataArray to a straight
>> >> >> Array.
>> >> >> But maybe you have other mechanisms for expressing this knowledge.
>> >> >> Certainly
>> >> >> my proposal to do conversions to Arrays isn’t the most elegant
>> >> >> strategy.
>> >> >> It’s just all that I’ve got so far.
>> >> >>
>> >> >>
>> >> >>
>> >> >> * I kind of like the idea of Option{T} standing outside of the main
>> >> >> type
>> >> >> system in a kind of mirror type system. I’m less happy about Union(NA,
>> >> >> T)
>> >> >> being a super type of T, even though there are some good reasons that
>> >> >> you’d
>> >> >> like to view T as a specialization of Union(NA, T). But I agree that I
>> >> >> don’t
>> >> >> have a good feel about where missing data belongs in the type
>> >> >> hierarchy.
>> >> >> This is another question for which I’d love to get input from others.
>> >> >>
>> >> >>
>> >> >>
>> >> >> In regard to Simon’s performance points:
>> >> >>
>> >> >>
>> >> >>
>> >> >> * Yes, memory usage alone argues strongly for working with DataArray{T}
>> >> >> rather than Array{Option{T}}.
>> >> >>
>> >> >>
>> >> >>
>> >> >> * Exploting tricks that make operations like anyna() faster is another
>> >> >> good argument for keeping DataArray{T} around.
>> >> >>
>> >> >>
>> >> >>
>> >> >> * I’m not sure how to deal with inlining concerns or the undefined
>> >> >> reference checks. Do you have ideas for improving this within
>> >> >> DataArrays or
>> >> >> do we need supporting changes in the compiler?
>> >> >>
>> >> >> Actually it seems it would be possible to make Array{Union(NAtype, T)}
>> >> >> more similar to and as efficient as DataArray{T}, by handling a few
>> >> >> things
>> >> >> in the compiler. This would create a generalization of DataArray to any
>> >> >> kind
>> >> >> of union type, which could be useful in other contexts. But more
>> >> >> importantly, it would make missing values integrate seamlessly into
>> >> >> Julia,
>> >> >> getting rid of any hacks.
>> >> >>
>> >> >> More specifically, the following features would need to be supported:
>> >> >> - a way of telling the compiler to store the data as two arrays of
>> >> >> concrete types (here T and NAtype), instead of as an array of boxed
>> >> >> values,
>> >> >> so that:
>> >> >>     * efficient operations can be performed on the T values (by
>> >> >> skipping
>> >> >> the missing ones manually)
>> >> >>     * T values are stored as a dense array and can be converted to
>> >> >> Array{T} without any copy or passed to BLAS when no missing values are
>> >> >> present
>> >> >>    * NA values can be packed in a BitArray to save memory and make NA
>> >> >> detection faster (see below)
>> >> >> - a fonction to check whether a given element of the array is of type T
>> >> >> rather than of NAtype (generalization of isna())
>> >> >> - a fonction to check whether all elements of the array are of type T
>> >> >> rather than of NAtype (generalization of anyna(), more efficient than
>> >> >> calling the previous function on all elements thanks to the packing of
>> >> >> NAs
>> >> >> in a BitArray)
>> >> >> In this scheme, what is missing is how to allow the compiler to pack
>> >> >> NAs
>> >> >> in a BitArray. Somehow, NAtype would have to be defined as a 1-bit
>> >> >> object.
>> >> >> Maybe by making it an enum-like immutable with a 1-bit field inside it?
>> >> >>
>> >> >> How does it sound?
>> >> >
>> >> >
>> >> > I've thought a bit about this, but it seems like it would be too much
>> >> > complexity in the compiler. Storing arrays as something besides
>> >> > contiguous
>> >> > elements and interaction between the codegen in C/C++ and the BitArray
>> >> > code
>> >> > in Julia both seem likely to be painful, although Jeff, Keno, and
>> >> > Jameson
>> >> > would know better than I. Additionally, this optimization (of storage of
>> >> > arrays of unions of a singleton type and a bits type) seems pretty
>> >> > specific
>> >> > to DataArrays, but the actual advantages in terms of performance and
>> >> > expressibility would be small or non-existent. (This is in contrast to
>> >> > optimizing storage/dispatch with union types, which could benefit a lot
>> >> > of
>> >> > code and is something a lot of languages do.) Finally, there are cases
>> >> > where
>> >> > it is useful to have direct access to the na BitArray chunks beyond
>> >> > anyna,
>> >> > e.g. pairwise summation and reductions across the first dimension.
>> >> >
>> >> > Simon
>> 
> 
>

Re: [julia-users] OptionTypes.jl

Reply via email to