That seems like a really good candidate for improving the compiler to make
sure that it can eliminate the second check.


On Tue, Aug 12, 2014 at 1:54 AM, John Myles White <[email protected]>
wrote:

> Yup.
>
> It’s unclear to me whether it should be included or not. It’s useful if
> you’ve already done an explicit isnull(x) check so that you can save a
> second check.
>
>  — John
>
> On Aug 11, 2014, at 10:52 PM, Stefan Karpinski <[email protected]>
> wrote:
>
> What's the deal with unsafe_get? Does that segfault if you try to access a
> null value reference value and give you random junk if you try to access an
> bits value?
>
>
> On Tue, Aug 12, 2014 at 1:09 AM, John Myles White <
> [email protected]> wrote:
>
>> Ok, I’ve cleaned this up and renamed it to NullableTypes.jl:
>> https://github.com/johnmyleswhite/NullableTypes.jl
>>
>>  — John
>>
>> On Aug 11, 2014, at 9:17 PM, Stefan Karpinski <[email protected]>
>> wrote:
>>
>> Sounds good. We could maybe include the Nullable type in Base and thus
>> avoid the issue of what to call the module. If we need a module, how about
>> Nullables? Although, I have to say that name makes me think of The
>> Expendables.
>>
>> <expendables.png>
>>
>>
>> On Tue, Aug 12, 2014 at 12:10 AM, John Myles White <
>> [email protected]> wrote:
>>
>>> Just wanted to come back to this thread now that I’m back from vacation.
>>> It sounds like the consensus of Jeff, Stefan, Keno and Jameson is that
>>> we’re better off working with an explicit Option{T} type than trying to get
>>> the compiler to handle Union(S, T) more efficiently.
>>>
>>> If people are willing to accept that idea, I’d like to make the use of
>>> Option’s a priority for a release of DataArrays that would accompany Julia
>>> 0.4. There’s going to be a lot of changes to Julia’s core with that
>>> release, so it seems like the perfect time to make some breaking changes to
>>> JuliaData packages.
>>>
>>> After thinking about the type hierarchy more, I’ve come to really like
>>> the interpretion of Option{T} as a 0-or-1 element container type. In that
>>> interpretation, Option{T} is to T exactly as Array{T} is to T, which is a
>>> relationship that has only horizontal links and no vertical links. I think
>>> that perspective simplifies things a lot, whie articulating the core issues
>>> involved with working with Option{T}.
>>>
>>> For me, the main question is whether we want Option types to be a
>>> feature that’s shared by people who want to express NULL pointers (i.e.
>>> ontological missingness) or whether we want to customize things for
>>> statistical missingness (i.e. epistemological missingness). As it stands,
>>> OptionTypes.jl implements such a bare-bones version of missingness that it
>>> could be safely used for either purpose.
>>>
>>> As for the naming debate, I think NullableTypes with a new type called
>>> Nullable{T} is the way to go.
>>>
>>>  — John
>>>
>>> On Aug 1, 2014, at 1:43 PM, Stefan Karpinski <[email protected]>
>>> wrote:
>>>
>>> That also strikes me as the best approach for what it's worth – just use
>>> option/maybe/nullable for what you return when indexing into a DataArray
>>> but keep the DataArray storage as two separate arrays.
>>>
>>>
>>> On Fri, Aug 1, 2014 at 3:29 PM, Jeff Bezanson <[email protected]>
>>> wrote:
>>>
>>>> As usual, I agree with Keno :)
>>>>
>>>> We could also implement optimizations for Union(Bits,OtherBits). In
>>>> theory this can be stack allocated along with a boolean flag that says
>>>> which one it is. However to take full advantage of this it seems you
>>>> need to generate lots of branches with code for each case. Possible
>>>> but tricky.
>>>>
>>>> There is also a strong connection to the general
>>>> array-of-structs-to-struct-of-arrays optimization. It would not be
>>>> totally crazy to build this in to our object representation somehow,
>>>> or add hooks allowing customization of data representations, like
>>>> staged functions but for data instead of code.
>>>>
>>>>
>>>> On Fri, Aug 1, 2014 at 2:54 PM, Jameson Nash <[email protected]> wrote:
>>>> > I could (and probably will, someday) revive that commit. At the time,
>>>> > though, I seemed to find that it provided little performance benefit
>>>> -- the
>>>> > gc cost of allocating boxes was far greater (for type uncertainty
>>>> involving
>>>> > bitstypes) and the type dispatch wasn't as much of a performance
>>>> impact as I
>>>> > had previously assumed.
>>>> >
>>>> >
>>>> > On Friday, August 1, 2014, Keno Fischer <[email protected]
>>>> >
>>>> > wrote:
>>>> >>
>>>> >> It is possible to do generic compiler improvements for Union types
>>>> >> (Jameson had a branch at some point that did callsite splitting if we
>>>> >> inferred a Union type). However, I think the best way to go here is
>>>> to
>>>> >> maintain the current separation of two arrays (one of the values one
>>>> >> for the NAs), but give an option type on access. The option type
>>>> would
>>>> >> then most likely be in memory and wouldn't have much overhead. Please
>>>> >> let me know if there's anything specific I should explain how the
>>>> >> compiler will handle it, I admit I have only skimmed this thread.
>>>> >>
>>>> >> On Fri, Aug 1, 2014 at 9:18 AM, Simon Kornblith <[email protected]
>>>> >
>>>> >> wrote:
>>>> >> > On Friday, August 1, 2014 6:23:59 AM UTC-4, Milan Bouchet-Valat
>>>> wrote:
>>>> >> >>
>>>> >> >> Le jeudi 31 juillet 2014 à 21:19 -0700, John Myles White a écrit :
>>>> >> >>
>>>> >> >> To address Simon’s general points, which are really good reasons
>>>> to
>>>> >> >> avoid
>>>> >> >> jumping on the Option{T} bandwagon too soon:
>>>> >> >>
>>>> >> >>
>>>> >> >>
>>>> >> >> * I agree that most languages use tagged union types for Option{T}
>>>> >> >> rather
>>>> >> >> than a wrapper type that contains a Boolean value. It’s also
>>>> totally
>>>> >> >> true
>>>> >> >> that many compilers are able to make those constructs more
>>>> efficient
>>>> >> >> than
>>>> >> >> Julia currently does. But what we should expect from Julia in the
>>>> >> >> coming
>>>> >> >> years isn’t so clear to me. (And I personally think we need to
>>>> settle
>>>> >> >> on a
>>>> >> >> solution for representing missing data that’s viable in a year
>>>> rather
>>>> >> >> than
>>>> >> >> viable in five years.) This is an issue that I’d really like to
>>>> have
>>>> >> >> input
>>>> >> >> on from Jeff, Keno, Jameson or someone else involved with the
>>>> internals
>>>> >> >> of
>>>> >> >> the compiler. Getting input from the broader community is the main
>>>> >> >> reason I
>>>> >> >> wanted to put a demo of OptionTypes.jl out in front of other
>>>> folks.
>>>> >> >>
>>>> >> >>
>>>> >> >>
>>>> >> >> * I’m not clear how we could come to know that a datum is not
>>>> missing
>>>> >> >> without a resolution step that’s effectively equivalent to the
>>>> get()
>>>> >> >> function for Option{T}. I agree that the enforced use of get()
>>>> means
>>>> >> >> that
>>>> >> >> you can’t hope to use generic functions like sum on collections of
>>>> >> >> Option{T}. But I’m also not sure that’s such a bad thing: I think
>>>> the
>>>> >> >> easiest way to express to the compiler that you know that all of
>>>> the
>>>> >> >> entries
>>>> >> >> of a DataArray are not NA is to convert the DataArray to a
>>>> straight
>>>> >> >> Array.
>>>> >> >> But maybe you have other mechanisms for expressing this knowledge.
>>>> >> >> Certainly
>>>> >> >> my proposal to do conversions to Arrays isn’t the most elegant
>>>> >> >> strategy.
>>>> >> >> It’s just all that I’ve got so far.
>>>> >> >>
>>>> >> >>
>>>> >> >>
>>>> >> >> * I kind of like the idea of Option{T} standing outside of the
>>>> main
>>>> >> >> type
>>>> >> >> system in a kind of mirror type system. I’m less happy about
>>>> Union(NA,
>>>> >> >> T)
>>>> >> >> being a super type of T, even though there are some good reasons
>>>> that
>>>> >> >> you’d
>>>> >> >> like to view T as a specialization of Union(NA, T). But I agree
>>>> that I
>>>> >> >> don’t
>>>> >> >> have a good feel about where missing data belongs in the type
>>>> >> >> hierarchy.
>>>> >> >> This is another question for which I’d love to get input from
>>>> others.
>>>> >> >>
>>>> >> >>
>>>> >> >>
>>>> >> >> In regard to Simon’s performance points:
>>>> >> >>
>>>> >> >>
>>>> >> >>
>>>> >> >> * Yes, memory usage alone argues strongly for working with
>>>> DataArray{T}
>>>> >> >> rather than Array{Option{T}}.
>>>> >> >>
>>>> >> >>
>>>> >> >>
>>>> >> >> * Exploting tricks that make operations like anyna() faster is
>>>> another
>>>> >> >> good argument for keeping DataArray{T} around.
>>>> >> >>
>>>> >> >>
>>>> >> >>
>>>> >> >> * I’m not sure how to deal with inlining concerns or the undefined
>>>> >> >> reference checks. Do you have ideas for improving this within
>>>> >> >> DataArrays or
>>>> >> >> do we need supporting changes in the compiler?
>>>> >> >>
>>>> >> >> Actually it seems it would be possible to make
>>>> Array{Union(NAtype, T)}
>>>> >> >> more similar to and as efficient as DataArray{T}, by handling a
>>>> few
>>>> >> >> things
>>>> >> >> in the compiler. This would create a generalization of DataArray
>>>> to any
>>>> >> >> kind
>>>> >> >> of union type, which could be useful in other contexts. But more
>>>> >> >> importantly, it would make missing values integrate seamlessly
>>>> into
>>>> >> >> Julia,
>>>> >> >> getting rid of any hacks.
>>>> >> >>
>>>> >> >> More specifically, the following features would need to be
>>>> supported:
>>>> >> >> - a way of telling the compiler to store the data as two arrays of
>>>> >> >> concrete types (here T and NAtype), instead of as an array of
>>>> boxed
>>>> >> >> values,
>>>> >> >> so that:
>>>> >> >>     * efficient operations can be performed on the T values (by
>>>> >> >> skipping
>>>> >> >> the missing ones manually)
>>>> >> >>     * T values are stored as a dense array and can be converted to
>>>> >> >> Array{T} without any copy or passed to BLAS when no missing
>>>> values are
>>>> >> >> present
>>>> >> >>    * NA values can be packed in a BitArray to save memory and
>>>> make NA
>>>> >> >> detection faster (see below)
>>>> >> >> - a fonction to check whether a given element of the array is of
>>>> type T
>>>> >> >> rather than of NAtype (generalization of isna())
>>>> >> >> - a fonction to check whether all elements of the array are of
>>>> type T
>>>> >> >> rather than of NAtype (generalization of anyna(), more efficient
>>>> than
>>>> >> >> calling the previous function on all elements thanks to the
>>>> packing of
>>>> >> >> NAs
>>>> >> >> in a BitArray)
>>>> >> >> In this scheme, what is missing is how to allow the compiler to
>>>> pack
>>>> >> >> NAs
>>>> >> >> in a BitArray. Somehow, NAtype would have to be defined as a 1-bit
>>>> >> >> object.
>>>> >> >> Maybe by making it an enum-like immutable with a 1-bit field
>>>> inside it?
>>>> >> >>
>>>> >> >> How does it sound?
>>>> >> >
>>>> >> >
>>>> >> > I've thought a bit about this, but it seems like it would be too
>>>> much
>>>> >> > complexity in the compiler. Storing arrays as something besides
>>>> >> > contiguous
>>>> >> > elements and interaction between the codegen in C/C++ and the
>>>> BitArray
>>>> >> > code
>>>> >> > in Julia both seem likely to be painful, although Jeff, Keno, and
>>>> >> > Jameson
>>>> >> > would know better than I. Additionally, this optimization (of
>>>> storage of
>>>> >> > arrays of unions of a singleton type and a bits type) seems pretty
>>>> >> > specific
>>>> >> > to DataArrays, but the actual advantages in terms of performance
>>>> and
>>>> >> > expressibility would be small or non-existent. (This is in
>>>> contrast to
>>>> >> > optimizing storage/dispatch with union types, which could benefit
>>>> a lot
>>>> >> > of
>>>> >> > code and is something a lot of languages do.) Finally, there are
>>>> cases
>>>> >> > where
>>>> >> > it is useful to have direct access to the na BitArray chunks beyond
>>>> >> > anyna,
>>>> >> > e.g. pairwise summation and reductions across the first dimension.
>>>> >> >
>>>> >> > Simon
>>>>
>>>
>>>
>>>
>>
>>
>
>

Reply via email to