Nope -- that the right idea. Just pointing out that you get more value from improving the storage representation than from specializing union call sites
On Friday, August 1, 2014, Simon Kornblith <[email protected]> wrote: > Is there a reason we can't change the way unions of small bits types are > represented, so that if we know something is a Union(Float64,NA) it can > live in registers or on the stack instead of having to be heap allocated? > > On Friday, August 1, 2014 2:54:49 PM UTC-4, Jameson wrote: >> >> I could (and probably will, someday) revive that commit. At the time, >> though, I seemed to find that it provided little performance benefit -- the >> gc cost of allocating boxes was far greater (for type uncertainty involving >> bitstypes) and the type dispatch wasn't as much of a performance impact as >> I had previously assumed. >> >> >> On Friday, August 1, 2014, Keno Fischer <[email protected]> >> wrote: >> >>> It is possible to do generic compiler improvements for Union types >>> (Jameson had a branch at some point that did callsite splitting if we >>> inferred a Union type). However, I think the best way to go here is to >>> maintain the current separation of two arrays (one of the values one >>> for the NAs), but give an option type on access. The option type would >>> then most likely be in memory and wouldn't have much overhead. Please >>> let me know if there's anything specific I should explain how the >>> compiler will handle it, I admit I have only skimmed this thread. >>> >>> On Fri, Aug 1, 2014 at 9:18 AM, Simon Kornblith <[email protected]> >>> wrote: >>> > On Friday, August 1, 2014 6:23:59 AM UTC-4, Milan Bouchet-Valat wrote: >>> >> >>> >> Le jeudi 31 juillet 2014 à 21:19 -0700, John Myles White a écrit : >>> >> >>> >> To address Simon’s general points, which are really good reasons to >>> avoid >>> >> jumping on the Option{T} bandwagon too soon: >>> >> >>> >> >>> >> >>> >> * I agree that most languages use tagged union types for Option{T} >>> rather >>> >> than a wrapper type that contains a Boolean value. It’s also totally >>> true >>> >> that many compilers are able to make those constructs more efficient >>> than >>> >> Julia currently does. But what we should expect from Julia in the >>> coming >>> >> years isn’t so clear to me. (And I personally think we need to settle >>> on a >>> >> solution for representing missing data that’s viable in a year rather >>> than >>> >> viable in five years.) This is an issue that I’d really like to have >>> input >>> >> on from Jeff, Keno, Jameson or someone else involved with the >>> internals of >>> >> the compiler. Getting input from the broader community is the main >>> reason I >>> >> wanted to put a demo of OptionTypes.jl out in front of other folks. >>> >> >>> >> >>> >> >>> >> * I’m not clear how we could come to know that a datum is not missing >>> >> without a resolution step that’s effectively equivalent to the get() >>> >> function for Option{T}. I agree that the enforced use of get() means >>> that >>> >> you can’t hope to use generic functions like sum on collections of >>> >> Option{T}. But I’m also not sure that’s such a bad thing: I think the >>> >> easiest way to express to the compiler that you know that all of the >>> entries >>> >> of a DataArray are not NA is to convert the DataArray to a straight >>> Array. >>> >> But maybe you have other mechanisms for expressing this knowledge. >>> Certainly >>> >> my proposal to do conversions to Arrays isn’t the most elegant >>> strategy. >>> >> It’s just all that I’ve got so far. >>> >> >>> >> >>> >> >>> >> * I kind of like the idea of Option{T} standing outside of the main >>> type >>> >> system in a kind of mirror type system. I’m less happy about >>> Union(NA, T) >>> >> being a super type of T, even though there are some good reasons that >>> you’d >>> >> like to view T as a specialization of Union(NA, T). But I agree that >>> I don’t >>> >> have a good feel about where missing data belongs in the type >>> hierarchy. >>> >> This is another question for which I’d love to get input from others. >>> >> >>> >> >>> >> >>> >> In regard to Simon’s performance points: >>> >> >>> >> >>> >> >>> >> * Yes, memory usage alone argues strongly for working with >>> DataArray{T} >>> >> rather than Array{Option{T}}. >>> >> >>> >> >>> >> >>> >> * Exploting tricks that make operations like anyna() faster is another >>> >> good argument for keeping DataArray{T} around. >>> >> >>> >> >>> >> >>> >> * I’m not sure how to deal with inlining concerns or the undefined >>> >> reference checks. Do you have ideas for improving this within >>> DataArrays or >>> >> do we need supporting changes in the compiler? >>> >> >>> >> Actually it seems it would be possible to make Array{Union(NAtype, T)} >>> >> more similar to and as efficient as DataArray{T}, by handling a few >>> things >>> >> in the compiler. This would create a generalization of DataArray to >>> any kind >>> >> of union type, which could be useful in other contexts. But more >>> >> importantly, it would make missing values integrate seamlessly into >>> Julia, >>> >> getting rid of any hacks. >>> >> >>> >> More specifically, the following features would need to be supported: >>> >> - a way of telling the compiler to store the data as two arrays of >>> >> concrete types (here T and NAtype), instead of as an array of boxed >>> values, >>> >> so that: >>> >> * efficient operations can be performed on the T values (by >>> skipping >>> >> the missing ones manually) >>> >> * T values are stored as a dense array and can be converted to >>> >> Array{T} without any copy or passed to BLAS when no missing values are >>> >> present >>> >> * NA values can be packed in a BitArray to save memory and make NA >>> >> detection faster (see below) >>> >> - a fonction to check whether a given element of the array is of type >>> T >>> >> rather than of NAtype (generalization of isna()) >>> >> - a fonction to check whether all elements of the array are of type T >>> >> rather than of NAtype (generalization of anyna(), more efficient than >>> >> calling the previous function on all elements thanks to the packing >>> of NAs >>> >> in a BitArray) >>> >> In this scheme, what is missing is how to allow the compiler to pack >>> NAs >>> >> in a BitArray. Somehow, NAtype would have to be defined as a 1-bit >>> object. >>> >> Maybe by making it an enum-like immutable with a 1-bit field inside >>> it? >>> >> >>> >> How does it sound? >>> > >>> > >>> > I've thought a bit about this, but it seems like it would be too much >>> > complexity in the compiler. Storing arrays as something besides >>> contiguous >>> > elements and interaction between the codegen in C/C++ and the BitArray >>> code >>> > in Julia both seem likely to be painful, although Jeff, Keno, and >>> Jameson >>> > would know better than I. Additionally, this optimization (of storage >>> of >>> > arrays of unions of a singleton type and a bits type) seems pretty >>> specific >>> > to DataArrays, but the actual advantages in terms of performance and >>> > expressibility would be small or non-existent. (This is in contrast to >>> > optimizing storage/dispatch with union types, which could benefit a >>> lot of >>> > code and is something a lot of languages do.) Finally, there are cases >>> where >>> > it is useful to have direct access to the na BitArray chunks beyond >>> anyna, >>> > e.g. pairwise summation and reductions across the first dimension. >>> > >>> > Simon >>> >>
