That seems like a really good candidate for improving the compiler to make sure that it can eliminate the second check.
On Tue, Aug 12, 2014 at 1:54 AM, John Myles White <[email protected]> wrote: > Yup. > > It’s unclear to me whether it should be included or not. It’s useful if > you’ve already done an explicit isnull(x) check so that you can save a > second check. > > — John > > On Aug 11, 2014, at 10:52 PM, Stefan Karpinski <[email protected]> > wrote: > > What's the deal with unsafe_get? Does that segfault if you try to access a > null value reference value and give you random junk if you try to access an > bits value? > > > On Tue, Aug 12, 2014 at 1:09 AM, John Myles White < > [email protected]> wrote: > >> Ok, I’ve cleaned this up and renamed it to NullableTypes.jl: >> https://github.com/johnmyleswhite/NullableTypes.jl >> >> — John >> >> On Aug 11, 2014, at 9:17 PM, Stefan Karpinski <[email protected]> >> wrote: >> >> Sounds good. We could maybe include the Nullable type in Base and thus >> avoid the issue of what to call the module. If we need a module, how about >> Nullables? Although, I have to say that name makes me think of The >> Expendables. >> >> <expendables.png> >> >> >> On Tue, Aug 12, 2014 at 12:10 AM, John Myles White < >> [email protected]> wrote: >> >>> Just wanted to come back to this thread now that I’m back from vacation. >>> It sounds like the consensus of Jeff, Stefan, Keno and Jameson is that >>> we’re better off working with an explicit Option{T} type than trying to get >>> the compiler to handle Union(S, T) more efficiently. >>> >>> If people are willing to accept that idea, I’d like to make the use of >>> Option’s a priority for a release of DataArrays that would accompany Julia >>> 0.4. There’s going to be a lot of changes to Julia’s core with that >>> release, so it seems like the perfect time to make some breaking changes to >>> JuliaData packages. >>> >>> After thinking about the type hierarchy more, I’ve come to really like >>> the interpretion of Option{T} as a 0-or-1 element container type. In that >>> interpretation, Option{T} is to T exactly as Array{T} is to T, which is a >>> relationship that has only horizontal links and no vertical links. I think >>> that perspective simplifies things a lot, whie articulating the core issues >>> involved with working with Option{T}. >>> >>> For me, the main question is whether we want Option types to be a >>> feature that’s shared by people who want to express NULL pointers (i.e. >>> ontological missingness) or whether we want to customize things for >>> statistical missingness (i.e. epistemological missingness). As it stands, >>> OptionTypes.jl implements such a bare-bones version of missingness that it >>> could be safely used for either purpose. >>> >>> As for the naming debate, I think NullableTypes with a new type called >>> Nullable{T} is the way to go. >>> >>> — John >>> >>> On Aug 1, 2014, at 1:43 PM, Stefan Karpinski <[email protected]> >>> wrote: >>> >>> That also strikes me as the best approach for what it's worth – just use >>> option/maybe/nullable for what you return when indexing into a DataArray >>> but keep the DataArray storage as two separate arrays. >>> >>> >>> On Fri, Aug 1, 2014 at 3:29 PM, Jeff Bezanson <[email protected]> >>> wrote: >>> >>>> As usual, I agree with Keno :) >>>> >>>> We could also implement optimizations for Union(Bits,OtherBits). In >>>> theory this can be stack allocated along with a boolean flag that says >>>> which one it is. However to take full advantage of this it seems you >>>> need to generate lots of branches with code for each case. Possible >>>> but tricky. >>>> >>>> There is also a strong connection to the general >>>> array-of-structs-to-struct-of-arrays optimization. It would not be >>>> totally crazy to build this in to our object representation somehow, >>>> or add hooks allowing customization of data representations, like >>>> staged functions but for data instead of code. >>>> >>>> >>>> On Fri, Aug 1, 2014 at 2:54 PM, Jameson Nash <[email protected]> wrote: >>>> > I could (and probably will, someday) revive that commit. At the time, >>>> > though, I seemed to find that it provided little performance benefit >>>> -- the >>>> > gc cost of allocating boxes was far greater (for type uncertainty >>>> involving >>>> > bitstypes) and the type dispatch wasn't as much of a performance >>>> impact as I >>>> > had previously assumed. >>>> > >>>> > >>>> > On Friday, August 1, 2014, Keno Fischer <[email protected] >>>> > >>>> > wrote: >>>> >> >>>> >> It is possible to do generic compiler improvements for Union types >>>> >> (Jameson had a branch at some point that did callsite splitting if we >>>> >> inferred a Union type). However, I think the best way to go here is >>>> to >>>> >> maintain the current separation of two arrays (one of the values one >>>> >> for the NAs), but give an option type on access. The option type >>>> would >>>> >> then most likely be in memory and wouldn't have much overhead. Please >>>> >> let me know if there's anything specific I should explain how the >>>> >> compiler will handle it, I admit I have only skimmed this thread. >>>> >> >>>> >> On Fri, Aug 1, 2014 at 9:18 AM, Simon Kornblith <[email protected] >>>> > >>>> >> wrote: >>>> >> > On Friday, August 1, 2014 6:23:59 AM UTC-4, Milan Bouchet-Valat >>>> wrote: >>>> >> >> >>>> >> >> Le jeudi 31 juillet 2014 à 21:19 -0700, John Myles White a écrit : >>>> >> >> >>>> >> >> To address Simon’s general points, which are really good reasons >>>> to >>>> >> >> avoid >>>> >> >> jumping on the Option{T} bandwagon too soon: >>>> >> >> >>>> >> >> >>>> >> >> >>>> >> >> * I agree that most languages use tagged union types for Option{T} >>>> >> >> rather >>>> >> >> than a wrapper type that contains a Boolean value. It’s also >>>> totally >>>> >> >> true >>>> >> >> that many compilers are able to make those constructs more >>>> efficient >>>> >> >> than >>>> >> >> Julia currently does. But what we should expect from Julia in the >>>> >> >> coming >>>> >> >> years isn’t so clear to me. (And I personally think we need to >>>> settle >>>> >> >> on a >>>> >> >> solution for representing missing data that’s viable in a year >>>> rather >>>> >> >> than >>>> >> >> viable in five years.) This is an issue that I’d really like to >>>> have >>>> >> >> input >>>> >> >> on from Jeff, Keno, Jameson or someone else involved with the >>>> internals >>>> >> >> of >>>> >> >> the compiler. Getting input from the broader community is the main >>>> >> >> reason I >>>> >> >> wanted to put a demo of OptionTypes.jl out in front of other >>>> folks. >>>> >> >> >>>> >> >> >>>> >> >> >>>> >> >> * I’m not clear how we could come to know that a datum is not >>>> missing >>>> >> >> without a resolution step that’s effectively equivalent to the >>>> get() >>>> >> >> function for Option{T}. I agree that the enforced use of get() >>>> means >>>> >> >> that >>>> >> >> you can’t hope to use generic functions like sum on collections of >>>> >> >> Option{T}. But I’m also not sure that’s such a bad thing: I think >>>> the >>>> >> >> easiest way to express to the compiler that you know that all of >>>> the >>>> >> >> entries >>>> >> >> of a DataArray are not NA is to convert the DataArray to a >>>> straight >>>> >> >> Array. >>>> >> >> But maybe you have other mechanisms for expressing this knowledge. >>>> >> >> Certainly >>>> >> >> my proposal to do conversions to Arrays isn’t the most elegant >>>> >> >> strategy. >>>> >> >> It’s just all that I’ve got so far. >>>> >> >> >>>> >> >> >>>> >> >> >>>> >> >> * I kind of like the idea of Option{T} standing outside of the >>>> main >>>> >> >> type >>>> >> >> system in a kind of mirror type system. I’m less happy about >>>> Union(NA, >>>> >> >> T) >>>> >> >> being a super type of T, even though there are some good reasons >>>> that >>>> >> >> you’d >>>> >> >> like to view T as a specialization of Union(NA, T). But I agree >>>> that I >>>> >> >> don’t >>>> >> >> have a good feel about where missing data belongs in the type >>>> >> >> hierarchy. >>>> >> >> This is another question for which I’d love to get input from >>>> others. >>>> >> >> >>>> >> >> >>>> >> >> >>>> >> >> In regard to Simon’s performance points: >>>> >> >> >>>> >> >> >>>> >> >> >>>> >> >> * Yes, memory usage alone argues strongly for working with >>>> DataArray{T} >>>> >> >> rather than Array{Option{T}}. >>>> >> >> >>>> >> >> >>>> >> >> >>>> >> >> * Exploting tricks that make operations like anyna() faster is >>>> another >>>> >> >> good argument for keeping DataArray{T} around. >>>> >> >> >>>> >> >> >>>> >> >> >>>> >> >> * I’m not sure how to deal with inlining concerns or the undefined >>>> >> >> reference checks. Do you have ideas for improving this within >>>> >> >> DataArrays or >>>> >> >> do we need supporting changes in the compiler? >>>> >> >> >>>> >> >> Actually it seems it would be possible to make >>>> Array{Union(NAtype, T)} >>>> >> >> more similar to and as efficient as DataArray{T}, by handling a >>>> few >>>> >> >> things >>>> >> >> in the compiler. This would create a generalization of DataArray >>>> to any >>>> >> >> kind >>>> >> >> of union type, which could be useful in other contexts. But more >>>> >> >> importantly, it would make missing values integrate seamlessly >>>> into >>>> >> >> Julia, >>>> >> >> getting rid of any hacks. >>>> >> >> >>>> >> >> More specifically, the following features would need to be >>>> supported: >>>> >> >> - a way of telling the compiler to store the data as two arrays of >>>> >> >> concrete types (here T and NAtype), instead of as an array of >>>> boxed >>>> >> >> values, >>>> >> >> so that: >>>> >> >> * efficient operations can be performed on the T values (by >>>> >> >> skipping >>>> >> >> the missing ones manually) >>>> >> >> * T values are stored as a dense array and can be converted to >>>> >> >> Array{T} without any copy or passed to BLAS when no missing >>>> values are >>>> >> >> present >>>> >> >> * NA values can be packed in a BitArray to save memory and >>>> make NA >>>> >> >> detection faster (see below) >>>> >> >> - a fonction to check whether a given element of the array is of >>>> type T >>>> >> >> rather than of NAtype (generalization of isna()) >>>> >> >> - a fonction to check whether all elements of the array are of >>>> type T >>>> >> >> rather than of NAtype (generalization of anyna(), more efficient >>>> than >>>> >> >> calling the previous function on all elements thanks to the >>>> packing of >>>> >> >> NAs >>>> >> >> in a BitArray) >>>> >> >> In this scheme, what is missing is how to allow the compiler to >>>> pack >>>> >> >> NAs >>>> >> >> in a BitArray. Somehow, NAtype would have to be defined as a 1-bit >>>> >> >> object. >>>> >> >> Maybe by making it an enum-like immutable with a 1-bit field >>>> inside it? >>>> >> >> >>>> >> >> How does it sound? >>>> >> > >>>> >> > >>>> >> > I've thought a bit about this, but it seems like it would be too >>>> much >>>> >> > complexity in the compiler. Storing arrays as something besides >>>> >> > contiguous >>>> >> > elements and interaction between the codegen in C/C++ and the >>>> BitArray >>>> >> > code >>>> >> > in Julia both seem likely to be painful, although Jeff, Keno, and >>>> >> > Jameson >>>> >> > would know better than I. Additionally, this optimization (of >>>> storage of >>>> >> > arrays of unions of a singleton type and a bits type) seems pretty >>>> >> > specific >>>> >> > to DataArrays, but the actual advantages in terms of performance >>>> and >>>> >> > expressibility would be small or non-existent. (This is in >>>> contrast to >>>> >> > optimizing storage/dispatch with union types, which could benefit >>>> a lot >>>> >> > of >>>> >> > code and is something a lot of languages do.) Finally, there are >>>> cases >>>> >> > where >>>> >> > it is useful to have direct access to the na BitArray chunks beyond >>>> >> > anyna, >>>> >> > e.g. pairwise summation and reductions across the first dimension. >>>> >> > >>>> >> > Simon >>>> >>> >>> >>> >> >> > >
