To address Simon’s general points, which are really good reasons to avoid 
jumping on the Option{T} bandwagon too soon:

* I agree that most languages use tagged union types for Option{T} rather than 
a wrapper type that contains a Boolean value. It’s also totally true that many 
compilers are able to make those constructs more efficient than Julia currently 
does. But what we should expect from Julia in the coming years isn’t so clear 
to me. (And I personally think we need to settle on a solution for representing 
missing data that’s viable in a year rather than viable in five years.) This is 
an issue that I’d really like to have input on from Jeff, Keno, Jameson or 
someone else involved with the internals of the compiler. Getting input from 
the broader community is the main reason I wanted to put a demo of 
OptionTypes.jl out in front of other folks.

* I’m not clear how we could come to know that a datum is not missing without a 
resolution step that’s effectively equivalent to the get() function for 
Option{T}. I agree that the enforced use of get() means that you can’t hope to 
use generic functions like sum on collections of Option{T}. But I’m also not 
sure that’s such a bad thing: I think the easiest way to express to the 
compiler that you know that all of the entries of a DataArray are not NA is to 
convert the DataArray to a straight Array. But maybe you have other mechanisms 
for expressing this knowledge. Certainly my proposal to do conversions to 
Arrays isn’t the most elegant strategy. It’s just all that I’ve got so far.

* I kind of like the idea of Option{T} standing outside of the main type system 
in a kind of mirror type system. I’m less happy about Union(NA, T) being a 
super type of T, even though there are some good reasons that you’d like to 
view T as a specialization of Union(NA, T). But I agree that I don’t have a 
good feel about where missing data belongs in the type hierarchy. This is 
another question for which I’d love to get input from others.

In regard to Simon’s performance points:

* Yes, memory usage alone argues strongly for working with DataArray{T} rather 
than Array{Option{T}}.

* Exploting tricks that make operations like anyna() faster is another good 
argument for keeping DataArray{T} around.

* I’m not sure how to deal with inlining concerns or the undefined reference 
checks. Do you have ideas for improving this within DataArrays or do we need 
supporting changes in the compiler?

 — John

On Jul 31, 2014, at 4:49 PM, Simon Kornblith <[email protected]> wrote:

> My previous comments on option types versus union types for DataArray 
> indexing are here. To elaborate on John's summary above, my present views are:
> In many other language, including languages that are lower-level than Julia 
> such as Rust, option types are union types and it's the compiler's job to 
> make them fast. Making union types fast would simply mean that the presence 
> of Union in the output of code_typed is no longer to be dreaded.
> The option type approach forces special handling of data that could 
> potentially be missing but isn't, and this breaks generic code. I feel it is 
> better if, once we know a datum is not missing, it has the same concrete type 
> as if it were known to exist in advance.
> I'm not sure how missing data ought to fit into the type hierarchy, but 
> neither the Union(NA, T) nor Option{T} approaches seem completely ideal.
> Some more specific notes WRT performance:
> Storing option types in an Array{Option{T}} would require nearly twice as 
> much memory as storing them in a DataArray in most cases. While a Bool is 
> only 1 byte, there is additional padding needed for alignment. On x86_64 for 
> types <= 64 bits the alignment is usually the size of the type.
> As long as option types live in registers and not in memory, they should 
> consume no additional space as compared with scalar indexing into both the 
> data and na arrays of a DataArray, as indexing into the na array would 
> already require a register. There are, however, some cases where the BitArray 
> representation of NAs can be exploited for performance. In John's example 
> code forsum above with dropna = false above, the Option type approach would 
> read every bit of the naarray individually. It is faster to first check if 
> any values are NA, which can be done 64 bits at a time, and throw an error if 
> there are NAs or sum the values if not.
> There is other overhead related to DataArray indexing that needs to be 
> addressed, notably that scalar indexing needs to be inlined (as well as isna 
> and get, if we're using option types), and that accessing the data and na 
> arrays currently incurs an undefined reference check.
> Simon
> 
> On Thursday, July 31, 2014 12:51:15 PM UTC-4, John Myles White wrote:
> Harlan, I don't think your assumption about performance is necessarily 
> correct: it's sometimes the case that you can get *better* performance by 
> working with bytes rather than bits, since bytes are the primitive objects of 
> computation. For that reason, I think the performance implications of using a 
> full Bool are very sensitive to the exact computations being performed.
> 
> In general, I don't think the expansion of a BitArray bit into a Boolean is a 
> big issue for most data analysis tasks. As evidence, I'd note that expanding 
> a bit to a Bool is certainly not worse than the cost of translating a single 
> byte from an ASCIIString or UTF8String object into a Char object when doing 
> iteration over strings. If you think that decision for Base Julia is 
> reasonable, I think the decision to use Option is also defensible for similar 
> reasons.
> 
> As for integration into the current system, my thinking is that DataArrays 
> would be changed so that OptionTypes would be generated whenever you attempt 
> to access a scalar element of an AbstractDataArray. The following example 
> shows how that might work:
> 
> function mean(a::AbstractDataArray, dropna::Boolean = false)
>    sum, n = 0.0, 0
>    if dropna
>        for i in 1:length(a)
>            o = a[i]
>            if !isna(o)
>                sum += get(o)
>                n += 1
>            end
>        end
>    else
>        for i in 1:length(a)
>            sum += get(o)
>            n += 1
>        end
>    end
>    return sum / n
> end
> 
> One could also define this function so that it always returns an Option type, 
> rather than a direct Float64 value.
> 
> For special types like Float64, you could even computes means while returning 
> NaN's using the default values interface to `get`:
> 
> function mean{T <: FloatingPoint}(a::AbstractDataArray{T})
>    sum, n = 0.0, 0
>    for i in 1:length(a)
>        sum += get(o, nan(T))
>        n += 1
>    end
>    return sum / n
> end
> 
> Unlike our current system, the use of OptionTypes would provide acceptable 
> performance without requiring that users break abstraction barriers. This is 
> the big gain: Option{T} exploits Julia's current type system / compiler to 
> express the same ideas as NAtype does in an efficient way. Options wouldn't 
> need to be boxed the way that our Union types currently are being boxed 
> because their type could be easily inferred by the compiler.
> 
> If you're not familiar with the requirement for breaking our current 
> abstraction barriers, note that indexing into a DataArray at the moment 
> poisons the performance of every program you write, because the result has an 
> uncertain type that the compiler doesn't know how to optimize. To work around 
> this, you have to write code that effectively accesses the raw .na and .data 
> fields of a DataArray. Simon's done some great work to make this easier to 
> do, but I'm not sure that's the right direction for us to head in the 
> long-run.
> 
> In general, I think an approach to missing data built on the combination of 
> Option{T} and DataArray{T} provides an interface that's simple and consistent 
> (everything happens in terms of isna/get) even if it's an interface that's 
> somewhat unfamilar to R/Python folks. Most important to me is that using 
> Option types efficiently doesn't require a deep understanding of Julia's type 
> system, whereas our current abstractions require you to understand how to 
> work around problems raised by Union types when programming for the current 
> compiler. An Option type is basically a forcing function that says: "Julia is 
> aggressively typed. If you want to work with a missing value of type T, you 
> need to explicitly say how you're going to handle any missingness so that the 
> system only interacts with values of type T."
> 
> I should note that I'm not very sure the use of Options is the right 
> approach: Simon Kornblith has argued very persuasively for waiting for the 
> compiler to improve its ability to handle tagged unions like those generated 
> by indexing into DataArrays. My personal feeling is that it's easier to 
> expose a simple abstraction that doesn't assume the compiler will change 
> dramatically. This is based on my aesthetic sense that Julia's power comes 
> from making optimizations transparent and explicit, rather than making the 
> compiler smarter.
> 
> It's also worth noting that there are problems for which the use of Option 
> types isn't very helpful: the computation of medians, for example, isn't 
> defined in terms of scalars, so having a better abstraction for expressing 
> missing scalars won't get us anywhere.
> 
> One other caveat is that I'm not providing an `unsafe_get` method, which 
> means that the `isna` followed by `get` idiom I showed above does two checks 
> when you could get away with only one. I still haven't figured out how I'd 
> like to handle that issue.
> 
> So there's still a lot of design work needed, but I wanted to let people see 
> how this interface could work were we to choose it.
> 
>  -- John
> 
> On Jul 31, 2014, at 8:15 AM, Harlan Harris <[email protected]> wrote:
> 
>> John, how might this interact with DataArrays? This design, unlike 
>> DataArrays, requires that you use an entire byte to store the missingness, 
>> so it's not likely to be as fast if you're manipulating a lot of them. Is 
>> the intended use case here something sum(DataArray) yields Option{Float64}? 
>> 
>> 
>> On Thu, Jul 31, 2014 at 11:01 AM, Stefan Karpinski <[email protected]> 
>> wrote:
>> This looks quite promising. The `get` interface looks like a very nice way 
>> to generically deal with missing values – provide a default or get an error. 
>> Automatic conversion of the default to the type of the Option value is 
>> particularly nice. This seems like it will be a pleasant and efficient API.
>> 
>> Minor question: how come the non-NA constructor for Option takes both the 
>> `na` and `value` arguments? Doesn't supplying a value imply that it's not NA 
>> while not supplying a value implies that it is NA?
>> 
>> 
>> On Thursday, July 31, 2014, John Myles White <[email protected]> wrote:
>> At JuliaCon, I described the idea of using Option types instead of NAtype to 
>> make it easier to write type-stable code that interacts with NA’s. To help 
>> facilitate a debate about the utility of this approach, I just wrote up a 
>> minimal package that implements Option types: 
>> https://github.com/johnmyleswhite/OptionTypes.jl
>> 
>> Essentially, an Option type is a scalar version of a DataArray: it contains 
>> a value and a Boolean mask indicating whether that value is NA or not. 
>> Because Option types are parametric, they allow us to express the variants 
>> of NA that R has, such as a Float64 specific NA and an Int64 specific NA.
>> 
>>  — John

Reply via email to