Agreed. I included it to allow users to make clear what the compiler ought to 
do.

 — John

On Aug 11, 2014, at 10:55 PM, Stefan Karpinski <[email protected]> wrote:

> That seems like a really good candidate for improving the compiler to make 
> sure that it can eliminate the second check.
> 
> 
> On Tue, Aug 12, 2014 at 1:54 AM, John Myles White <[email protected]> 
> wrote:
> Yup.
> 
> It’s unclear to me whether it should be included or not. It’s useful if 
> you’ve already done an explicit isnull(x) check so that you can save a second 
> check.
> 
>  — John
> 
> On Aug 11, 2014, at 10:52 PM, Stefan Karpinski <[email protected]> wrote:
> 
>> What's the deal with unsafe_get? Does that segfault if you try to access a 
>> null value reference value and give you random junk if you try to access an 
>> bits value?
>> 
>> 
>> On Tue, Aug 12, 2014 at 1:09 AM, John Myles White <[email protected]> 
>> wrote:
>> Ok, I’ve cleaned this up and renamed it to NullableTypes.jl: 
>> https://github.com/johnmyleswhite/NullableTypes.jl
>> 
>>  — John
>> 
>> On Aug 11, 2014, at 9:17 PM, Stefan Karpinski <[email protected]> wrote:
>> 
>>> Sounds good. We could maybe include the Nullable type in Base and thus 
>>> avoid the issue of what to call the module. If we need a module, how about 
>>> Nullables? Although, I have to say that name makes me think of The 
>>> Expendables.
>>> 
>>> <expendables.png>
>>> 
>>> 
>>> On Tue, Aug 12, 2014 at 12:10 AM, John Myles White 
>>> <[email protected]> wrote:
>>> Just wanted to come back to this thread now that I’m back from vacation. It 
>>> sounds like the consensus of Jeff, Stefan, Keno and Jameson is that we’re 
>>> better off working with an explicit Option{T} type than trying to get the 
>>> compiler to handle Union(S, T) more efficiently.
>>> 
>>> If people are willing to accept that idea, I’d like to make the use of 
>>> Option’s a priority for a release of DataArrays that would accompany Julia 
>>> 0.4. There’s going to be a lot of changes to Julia’s core with that 
>>> release, so it seems like the perfect time to make some breaking changes to 
>>> JuliaData packages.
>>> 
>>> After thinking about the type hierarchy more, I’ve come to really like the 
>>> interpretion of Option{T} as a 0-or-1 element container type. In that 
>>> interpretation, Option{T} is to T exactly as Array{T} is to T, which is a 
>>> relationship that has only horizontal links and no vertical links. I think 
>>> that perspective simplifies things a lot, whie articulating the core issues 
>>> involved with working with Option{T}.
>>> 
>>> For me, the main question is whether we want Option types to be a feature 
>>> that’s shared by people who want to express NULL pointers (i.e. ontological 
>>> missingness) or whether we want to customize things for statistical 
>>> missingness (i.e. epistemological missingness). As it stands, 
>>> OptionTypes.jl implements such a bare-bones version of missingness that it 
>>> could be safely used for either purpose.
>>> 
>>> As for the naming debate, I think NullableTypes with a new type called 
>>> Nullable{T} is the way to go.
>>> 
>>>  — John
>>> 
>>> On Aug 1, 2014, at 1:43 PM, Stefan Karpinski <[email protected]> wrote:
>>> 
>>>> That also strikes me as the best approach for what it's worth – just use 
>>>> option/maybe/nullable for what you return when indexing into a DataArray 
>>>> but keep the DataArray storage as two separate arrays.
>>>> 
>>>> 
>>>> On Fri, Aug 1, 2014 at 3:29 PM, Jeff Bezanson <[email protected]> 
>>>> wrote:
>>>> As usual, I agree with Keno :)
>>>> 
>>>> We could also implement optimizations for Union(Bits,OtherBits). In
>>>> theory this can be stack allocated along with a boolean flag that says
>>>> which one it is. However to take full advantage of this it seems you
>>>> need to generate lots of branches with code for each case. Possible
>>>> but tricky.
>>>> 
>>>> There is also a strong connection to the general
>>>> array-of-structs-to-struct-of-arrays optimization. It would not be
>>>> totally crazy to build this in to our object representation somehow,
>>>> or add hooks allowing customization of data representations, like
>>>> staged functions but for data instead of code.
>>>> 
>>>> 
>>>> On Fri, Aug 1, 2014 at 2:54 PM, Jameson Nash <[email protected]> wrote:
>>>> > I could (and probably will, someday) revive that commit. At the time,
>>>> > though, I seemed to find that it provided little performance benefit -- 
>>>> > the
>>>> > gc cost of allocating boxes was far greater (for type uncertainty 
>>>> > involving
>>>> > bitstypes) and the type dispatch wasn't as much of a performance impact 
>>>> > as I
>>>> > had previously assumed.
>>>> >
>>>> >
>>>> > On Friday, August 1, 2014, Keno Fischer <[email protected]>
>>>> > wrote:
>>>> >>
>>>> >> It is possible to do generic compiler improvements for Union types
>>>> >> (Jameson had a branch at some point that did callsite splitting if we
>>>> >> inferred a Union type). However, I think the best way to go here is to
>>>> >> maintain the current separation of two arrays (one of the values one
>>>> >> for the NAs), but give an option type on access. The option type would
>>>> >> then most likely be in memory and wouldn't have much overhead. Please
>>>> >> let me know if there's anything specific I should explain how the
>>>> >> compiler will handle it, I admit I have only skimmed this thread.
>>>> >>
>>>> >> On Fri, Aug 1, 2014 at 9:18 AM, Simon Kornblith <[email protected]>
>>>> >> wrote:
>>>> >> > On Friday, August 1, 2014 6:23:59 AM UTC-4, Milan Bouchet-Valat wrote:
>>>> >> >>
>>>> >> >> Le jeudi 31 juillet 2014 à 21:19 -0700, John Myles White a écrit :
>>>> >> >>
>>>> >> >> To address Simon’s general points, which are really good reasons to
>>>> >> >> avoid
>>>> >> >> jumping on the Option{T} bandwagon too soon:
>>>> >> >>
>>>> >> >>
>>>> >> >>
>>>> >> >> * I agree that most languages use tagged union types for Option{T}
>>>> >> >> rather
>>>> >> >> than a wrapper type that contains a Boolean value. It’s also totally
>>>> >> >> true
>>>> >> >> that many compilers are able to make those constructs more efficient
>>>> >> >> than
>>>> >> >> Julia currently does. But what we should expect from Julia in the
>>>> >> >> coming
>>>> >> >> years isn’t so clear to me. (And I personally think we need to settle
>>>> >> >> on a
>>>> >> >> solution for representing missing data that’s viable in a year rather
>>>> >> >> than
>>>> >> >> viable in five years.) This is an issue that I’d really like to have
>>>> >> >> input
>>>> >> >> on from Jeff, Keno, Jameson or someone else involved with the 
>>>> >> >> internals
>>>> >> >> of
>>>> >> >> the compiler. Getting input from the broader community is the main
>>>> >> >> reason I
>>>> >> >> wanted to put a demo of OptionTypes.jl out in front of other folks.
>>>> >> >>
>>>> >> >>
>>>> >> >>
>>>> >> >> * I’m not clear how we could come to know that a datum is not missing
>>>> >> >> without a resolution step that’s effectively equivalent to the get()
>>>> >> >> function for Option{T}. I agree that the enforced use of get() means
>>>> >> >> that
>>>> >> >> you can’t hope to use generic functions like sum on collections of
>>>> >> >> Option{T}. But I’m also not sure that’s such a bad thing: I think the
>>>> >> >> easiest way to express to the compiler that you know that all of the
>>>> >> >> entries
>>>> >> >> of a DataArray are not NA is to convert the DataArray to a straight
>>>> >> >> Array.
>>>> >> >> But maybe you have other mechanisms for expressing this knowledge.
>>>> >> >> Certainly
>>>> >> >> my proposal to do conversions to Arrays isn’t the most elegant
>>>> >> >> strategy.
>>>> >> >> It’s just all that I’ve got so far.
>>>> >> >>
>>>> >> >>
>>>> >> >>
>>>> >> >> * I kind of like the idea of Option{T} standing outside of the main
>>>> >> >> type
>>>> >> >> system in a kind of mirror type system. I’m less happy about 
>>>> >> >> Union(NA,
>>>> >> >> T)
>>>> >> >> being a super type of T, even though there are some good reasons that
>>>> >> >> you’d
>>>> >> >> like to view T as a specialization of Union(NA, T). But I agree that 
>>>> >> >> I
>>>> >> >> don’t
>>>> >> >> have a good feel about where missing data belongs in the type
>>>> >> >> hierarchy.
>>>> >> >> This is another question for which I’d love to get input from others.
>>>> >> >>
>>>> >> >>
>>>> >> >>
>>>> >> >> In regard to Simon’s performance points:
>>>> >> >>
>>>> >> >>
>>>> >> >>
>>>> >> >> * Yes, memory usage alone argues strongly for working with 
>>>> >> >> DataArray{T}
>>>> >> >> rather than Array{Option{T}}.
>>>> >> >>
>>>> >> >>
>>>> >> >>
>>>> >> >> * Exploting tricks that make operations like anyna() faster is 
>>>> >> >> another
>>>> >> >> good argument for keeping DataArray{T} around.
>>>> >> >>
>>>> >> >>
>>>> >> >>
>>>> >> >> * I’m not sure how to deal with inlining concerns or the undefined
>>>> >> >> reference checks. Do you have ideas for improving this within
>>>> >> >> DataArrays or
>>>> >> >> do we need supporting changes in the compiler?
>>>> >> >>
>>>> >> >> Actually it seems it would be possible to make Array{Union(NAtype, 
>>>> >> >> T)}
>>>> >> >> more similar to and as efficient as DataArray{T}, by handling a few
>>>> >> >> things
>>>> >> >> in the compiler. This would create a generalization of DataArray to 
>>>> >> >> any
>>>> >> >> kind
>>>> >> >> of union type, which could be useful in other contexts. But more
>>>> >> >> importantly, it would make missing values integrate seamlessly into
>>>> >> >> Julia,
>>>> >> >> getting rid of any hacks.
>>>> >> >>
>>>> >> >> More specifically, the following features would need to be supported:
>>>> >> >> - a way of telling the compiler to store the data as two arrays of
>>>> >> >> concrete types (here T and NAtype), instead of as an array of boxed
>>>> >> >> values,
>>>> >> >> so that:
>>>> >> >>     * efficient operations can be performed on the T values (by
>>>> >> >> skipping
>>>> >> >> the missing ones manually)
>>>> >> >>     * T values are stored as a dense array and can be converted to
>>>> >> >> Array{T} without any copy or passed to BLAS when no missing values 
>>>> >> >> are
>>>> >> >> present
>>>> >> >>    * NA values can be packed in a BitArray to save memory and make NA
>>>> >> >> detection faster (see below)
>>>> >> >> - a fonction to check whether a given element of the array is of 
>>>> >> >> type T
>>>> >> >> rather than of NAtype (generalization of isna())
>>>> >> >> - a fonction to check whether all elements of the array are of type T
>>>> >> >> rather than of NAtype (generalization of anyna(), more efficient than
>>>> >> >> calling the previous function on all elements thanks to the packing 
>>>> >> >> of
>>>> >> >> NAs
>>>> >> >> in a BitArray)
>>>> >> >> In this scheme, what is missing is how to allow the compiler to pack
>>>> >> >> NAs
>>>> >> >> in a BitArray. Somehow, NAtype would have to be defined as a 1-bit
>>>> >> >> object.
>>>> >> >> Maybe by making it an enum-like immutable with a 1-bit field inside 
>>>> >> >> it?
>>>> >> >>
>>>> >> >> How does it sound?
>>>> >> >
>>>> >> >
>>>> >> > I've thought a bit about this, but it seems like it would be too much
>>>> >> > complexity in the compiler. Storing arrays as something besides
>>>> >> > contiguous
>>>> >> > elements and interaction between the codegen in C/C++ and the BitArray
>>>> >> > code
>>>> >> > in Julia both seem likely to be painful, although Jeff, Keno, and
>>>> >> > Jameson
>>>> >> > would know better than I. Additionally, this optimization (of storage 
>>>> >> > of
>>>> >> > arrays of unions of a singleton type and a bits type) seems pretty
>>>> >> > specific
>>>> >> > to DataArrays, but the actual advantages in terms of performance and
>>>> >> > expressibility would be small or non-existent. (This is in contrast to
>>>> >> > optimizing storage/dispatch with union types, which could benefit a 
>>>> >> > lot
>>>> >> > of
>>>> >> > code and is something a lot of languages do.) Finally, there are cases
>>>> >> > where
>>>> >> > it is useful to have direct access to the na BitArray chunks beyond
>>>> >> > anyna,
>>>> >> > e.g. pairwise summation and reductions across the first dimension.
>>>> >> >
>>>> >> > Simon
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 

Reply via email to