Thanks, everybody, for the input. I’m going offline for the next week while camping in Oregon, but I’ll return to this project once I’m back.
— John On Aug 1, 2014, at 1:43 PM, Stefan Karpinski <[email protected]> wrote: > That also strikes me as the best approach for what it's worth – just use > option/maybe/nullable for what you return when indexing into a DataArray but > keep the DataArray storage as two separate arrays. > > > On Fri, Aug 1, 2014 at 3:29 PM, Jeff Bezanson <[email protected]> wrote: > As usual, I agree with Keno :) > > We could also implement optimizations for Union(Bits,OtherBits). In > theory this can be stack allocated along with a boolean flag that says > which one it is. However to take full advantage of this it seems you > need to generate lots of branches with code for each case. Possible > but tricky. > > There is also a strong connection to the general > array-of-structs-to-struct-of-arrays optimization. It would not be > totally crazy to build this in to our object representation somehow, > or add hooks allowing customization of data representations, like > staged functions but for data instead of code. > > > On Fri, Aug 1, 2014 at 2:54 PM, Jameson Nash <[email protected]> wrote: > > I could (and probably will, someday) revive that commit. At the time, > > though, I seemed to find that it provided little performance benefit -- the > > gc cost of allocating boxes was far greater (for type uncertainty involving > > bitstypes) and the type dispatch wasn't as much of a performance impact as I > > had previously assumed. > > > > > > On Friday, August 1, 2014, Keno Fischer <[email protected]> > > wrote: > >> > >> It is possible to do generic compiler improvements for Union types > >> (Jameson had a branch at some point that did callsite splitting if we > >> inferred a Union type). However, I think the best way to go here is to > >> maintain the current separation of two arrays (one of the values one > >> for the NAs), but give an option type on access. The option type would > >> then most likely be in memory and wouldn't have much overhead. Please > >> let me know if there's anything specific I should explain how the > >> compiler will handle it, I admit I have only skimmed this thread. > >> > >> On Fri, Aug 1, 2014 at 9:18 AM, Simon Kornblith <[email protected]> > >> wrote: > >> > On Friday, August 1, 2014 6:23:59 AM UTC-4, Milan Bouchet-Valat wrote: > >> >> > >> >> Le jeudi 31 juillet 2014 à 21:19 -0700, John Myles White a écrit : > >> >> > >> >> To address Simon’s general points, which are really good reasons to > >> >> avoid > >> >> jumping on the Option{T} bandwagon too soon: > >> >> > >> >> > >> >> > >> >> * I agree that most languages use tagged union types for Option{T} > >> >> rather > >> >> than a wrapper type that contains a Boolean value. It’s also totally > >> >> true > >> >> that many compilers are able to make those constructs more efficient > >> >> than > >> >> Julia currently does. But what we should expect from Julia in the > >> >> coming > >> >> years isn’t so clear to me. (And I personally think we need to settle > >> >> on a > >> >> solution for representing missing data that’s viable in a year rather > >> >> than > >> >> viable in five years.) This is an issue that I’d really like to have > >> >> input > >> >> on from Jeff, Keno, Jameson or someone else involved with the internals > >> >> of > >> >> the compiler. Getting input from the broader community is the main > >> >> reason I > >> >> wanted to put a demo of OptionTypes.jl out in front of other folks. > >> >> > >> >> > >> >> > >> >> * I’m not clear how we could come to know that a datum is not missing > >> >> without a resolution step that’s effectively equivalent to the get() > >> >> function for Option{T}. I agree that the enforced use of get() means > >> >> that > >> >> you can’t hope to use generic functions like sum on collections of > >> >> Option{T}. But I’m also not sure that’s such a bad thing: I think the > >> >> easiest way to express to the compiler that you know that all of the > >> >> entries > >> >> of a DataArray are not NA is to convert the DataArray to a straight > >> >> Array. > >> >> But maybe you have other mechanisms for expressing this knowledge. > >> >> Certainly > >> >> my proposal to do conversions to Arrays isn’t the most elegant > >> >> strategy. > >> >> It’s just all that I’ve got so far. > >> >> > >> >> > >> >> > >> >> * I kind of like the idea of Option{T} standing outside of the main > >> >> type > >> >> system in a kind of mirror type system. I’m less happy about Union(NA, > >> >> T) > >> >> being a super type of T, even though there are some good reasons that > >> >> you’d > >> >> like to view T as a specialization of Union(NA, T). But I agree that I > >> >> don’t > >> >> have a good feel about where missing data belongs in the type > >> >> hierarchy. > >> >> This is another question for which I’d love to get input from others. > >> >> > >> >> > >> >> > >> >> In regard to Simon’s performance points: > >> >> > >> >> > >> >> > >> >> * Yes, memory usage alone argues strongly for working with DataArray{T} > >> >> rather than Array{Option{T}}. > >> >> > >> >> > >> >> > >> >> * Exploting tricks that make operations like anyna() faster is another > >> >> good argument for keeping DataArray{T} around. > >> >> > >> >> > >> >> > >> >> * I’m not sure how to deal with inlining concerns or the undefined > >> >> reference checks. Do you have ideas for improving this within > >> >> DataArrays or > >> >> do we need supporting changes in the compiler? > >> >> > >> >> Actually it seems it would be possible to make Array{Union(NAtype, T)} > >> >> more similar to and as efficient as DataArray{T}, by handling a few > >> >> things > >> >> in the compiler. This would create a generalization of DataArray to any > >> >> kind > >> >> of union type, which could be useful in other contexts. But more > >> >> importantly, it would make missing values integrate seamlessly into > >> >> Julia, > >> >> getting rid of any hacks. > >> >> > >> >> More specifically, the following features would need to be supported: > >> >> - a way of telling the compiler to store the data as two arrays of > >> >> concrete types (here T and NAtype), instead of as an array of boxed > >> >> values, > >> >> so that: > >> >> * efficient operations can be performed on the T values (by > >> >> skipping > >> >> the missing ones manually) > >> >> * T values are stored as a dense array and can be converted to > >> >> Array{T} without any copy or passed to BLAS when no missing values are > >> >> present > >> >> * NA values can be packed in a BitArray to save memory and make NA > >> >> detection faster (see below) > >> >> - a fonction to check whether a given element of the array is of type T > >> >> rather than of NAtype (generalization of isna()) > >> >> - a fonction to check whether all elements of the array are of type T > >> >> rather than of NAtype (generalization of anyna(), more efficient than > >> >> calling the previous function on all elements thanks to the packing of > >> >> NAs > >> >> in a BitArray) > >> >> In this scheme, what is missing is how to allow the compiler to pack > >> >> NAs > >> >> in a BitArray. Somehow, NAtype would have to be defined as a 1-bit > >> >> object. > >> >> Maybe by making it an enum-like immutable with a 1-bit field inside it? > >> >> > >> >> How does it sound? > >> > > >> > > >> > I've thought a bit about this, but it seems like it would be too much > >> > complexity in the compiler. Storing arrays as something besides > >> > contiguous > >> > elements and interaction between the codegen in C/C++ and the BitArray > >> > code > >> > in Julia both seem likely to be painful, although Jeff, Keno, and > >> > Jameson > >> > would know better than I. Additionally, this optimization (of storage of > >> > arrays of unions of a singleton type and a bits type) seems pretty > >> > specific > >> > to DataArrays, but the actual advantages in terms of performance and > >> > expressibility would be small or non-existent. (This is in contrast to > >> > optimizing storage/dispatch with union types, which could benefit a lot > >> > of > >> > code and is something a lot of languages do.) Finally, there are cases > >> > where > >> > it is useful to have direct access to the na BitArray chunks beyond > >> > anyna, > >> > e.g. pairwise summation and reductions across the first dimension. > >> > > >> > Simon >
