Harlan, I don't think your assumption about performance is necessarily correct: 
it's sometimes the case that you can get *better* performance by working with 
bytes rather than bits, since bytes are the primitive objects of computation. 
For that reason, I think the performance implications of using a full Bool are 
very sensitive to the exact computations being performed.

In general, I don't think the expansion of a BitArray bit into a Boolean is a 
big issue for most data analysis tasks. As evidence, I'd note that expanding a 
bit to a Bool is certainly not worse than the cost of translating a single byte 
from an ASCIIString or UTF8String object into a Char object when doing 
iteration over strings. If you think that decision for Base Julia is 
reasonable, I think the decision to use Option is also defensible for similar 
reasons.

As for integration into the current system, my thinking is that DataArrays 
would be changed so that OptionTypes would be generated whenever you attempt to 
access a scalar element of an AbstractDataArray. The following example shows 
how that might work:

function mean(a::AbstractDataArray, dropna::Boolean = false)
   sum, n = 0.0, 0
   if dropna
       for i in 1:length(a)
           o = a[i]
           if !isna(o)
               sum += get(o)
               n += 1
           end
       end
   else
       for i in 1:length(a)
           sum += get(o)
           n += 1
       end
   end
   return sum / n
end

One could also define this function so that it always returns an Option type, 
rather than a direct Float64 value.

For special types like Float64, you could even computes means while returning 
NaN's using the default values interface to `get`:

function mean{T <: FloatingPoint}(a::AbstractDataArray{T})
   sum, n = 0.0, 0
   for i in 1:length(a)
       sum += get(o, nan(T))
       n += 1
   end
   return sum / n
end

Unlike our current system, the use of OptionTypes would provide acceptable 
performance without requiring that users break abstraction barriers. This is 
the big gain: Option{T} exploits Julia's current type system / compiler to 
express the same ideas as NAtype does in an efficient way. Options wouldn't 
need to be boxed the way that our Union types currently are being boxed because 
their type could be easily inferred by the compiler.

If you're not familiar with the requirement for breaking our current 
abstraction barriers, note that indexing into a DataArray at the moment poisons 
the performance of every program you write, because the result has an uncertain 
type that the compiler doesn't know how to optimize. To work around this, you 
have to write code that effectively accesses the raw .na and .data fields of a 
DataArray. Simon's done some great work to make this easier to do, but I'm not 
sure that's the right direction for us to head in the long-run.

In general, I think an approach to missing data built on the combination of 
Option{T} and DataArray{T} provides an interface that's simple and consistent 
(everything happens in terms of isna/get) even if it's an interface that's 
somewhat unfamilar to R/Python folks. Most important to me is that using Option 
types efficiently doesn't require a deep understanding of Julia's type system, 
whereas our current abstractions require you to understand how to work around 
problems raised by Union types when programming for the current compiler. An 
Option type is basically a forcing function that says: "Julia is aggressively 
typed. If you want to work with a missing value of type T, you need to 
explicitly say how you're going to handle any missingness so that the system 
only interacts with values of type T."

I should note that I'm not very sure the use of Options is the right approach: 
Simon Kornblith has argued very persuasively for waiting for the compiler to 
improve its ability to handle tagged unions like those generated by indexing 
into DataArrays. My personal feeling is that it's easier to expose a simple 
abstraction that doesn't assume the compiler will change dramatically. This is 
based on my aesthetic sense that Julia's power comes from making optimizations 
transparent and explicit, rather than making the compiler smarter.

It's also worth noting that there are problems for which the use of Option 
types isn't very helpful: the computation of medians, for example, isn't 
defined in terms of scalars, so having a better abstraction for expressing 
missing scalars won't get us anywhere.

One other caveat is that I'm not providing an `unsafe_get` method, which means 
that the `isna` followed by `get` idiom I showed above does two checks when you 
could get away with only one. I still haven't figured out how I'd like to 
handle that issue.

So there's still a lot of design work needed, but I wanted to let people see 
how this interface could work were we to choose it.

 -- John

On Jul 31, 2014, at 8:15 AM, Harlan Harris <[email protected]> wrote:

> John, how might this interact with DataArrays? This design, unlike 
> DataArrays, requires that you use an entire byte to store the missingness, so 
> it's not likely to be as fast if you're manipulating a lot of them. Is the 
> intended use case here something sum(DataArray) yields Option{Float64}? 
> 
> 
> On Thu, Jul 31, 2014 at 11:01 AM, Stefan Karpinski <[email protected]> 
> wrote:
> This looks quite promising. The `get` interface looks like a very nice way to 
> generically deal with missing values – provide a default or get an error. 
> Automatic conversion of the default to the type of the Option value is 
> particularly nice. This seems like it will be a pleasant and efficient API.
> 
> Minor question: how come the non-NA constructor for Option takes both the 
> `na` and `value` arguments? Doesn't supplying a value imply that it's not NA 
> while not supplying a value implies that it is NA?
> 
> 
> On Thursday, July 31, 2014, John Myles White <[email protected]> wrote:
> At JuliaCon, I described the idea of using Option types instead of NAtype to 
> make it easier to write type-stable code that interacts with NA’s. To help 
> facilitate a debate about the utility of this approach, I just wrote up a 
> minimal package that implements Option types: 
> https://github.com/johnmyleswhite/OptionTypes.jl
> 
> Essentially, an Option type is a scalar version of a DataArray: it contains a 
> value and a Boolean mask indicating whether that value is NA or not. Because 
> Option types are parametric, they allow us to express the variants of NA that 
> R has, such as a Float64 specific NA and an Int64 specific NA.
> 
>  — John
> 
> 

Reply via email to