Harlan, I don't think your assumption about performance is necessarily correct:
it's sometimes the case that you can get *better* performance by working with
bytes rather than bits, since bytes are the smallest units the hardware addresses directly.
For that reason, I think the performance implications of using a full Bool are
very sensitive to the exact computations being performed.

In general, I don't think the expansion of a BitArray bit into a Boolean is a
big issue for most data analysis tasks. As evidence, I'd note that expanding a
bit to a Bool is certainly not worse than the cost of translating a single byte
from an ASCIIString or UTF8String object into a Char object when doing
iteration over strings. If you think that decision for Base Julia is
reasonable, I think the decision to use Option is also defensible for similar
reasons.

As for integration into the current system, my thinking is that DataArrays
would be changed so that OptionTypes would be generated whenever you attempt to
access a scalar element of an AbstractDataArray. The following example shows
how that might work:
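
    function mean(a::AbstractDataArray, dropna::Bool = false)
        sum, n = 0.0, 0
        if dropna
            # Skip missing entries; only present values contribute.
            for i in 1:length(a)
                o = a[i]
                if !isna(o)
                    sum += get(o)
                    n += 1
                end
            end
        else
            # No skipping: get(o) raises an error if an entry is NA.
            for i in 1:length(a)
                o = a[i]
                sum += get(o)
                n += 1
            end
        end
        return sum / n
    end
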
One could also define this function so that it always returns an Option type,
rather than a direct Float64 value.
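
To make that concrete, here is a minimal sketch of the Option-returning variant. Everything in it is an assumption for illustration: `Opt`, its two convenience constructors, and `mean_opt` are hypothetical names, not the OptionTypes.jl API, and a plain Vector of `Opt{Float64}` stands in for a DataArray. It is written in Julia 1.x syntax (`struct`, `where`) so that it runs on a current release, which differs from the syntax used in the other examples in this message.

```julia
# Toy stand-in for an Option type; NOT the OptionTypes.jl implementation.
struct Opt{T}
    na::Bool    # true when the value is missing
    value::T    # payload; meaningless when na is true
end

# Hypothetical convenience constructors.
Opt(value::T) where {T} = Opt{T}(false, value)   # a present value
Opt{T}() where {T} = Opt{T}(true, zero(T))       # a missing value (assumes zero(T) exists)

isna(o::Opt) = o.na

# get without a default throws on NA; get with a default substitutes it.
Base.get(o::Opt) = o.na ? throw(ArgumentError("value is NA")) : o.value
Base.get(o::Opt{T}, default) where {T} = o.na ? convert(T, default) : o.value

# A mean that returns Opt{Float64} on every path, so its result type is stable.
function mean_opt(xs::Vector{Opt{Float64}})
    total, n = 0.0, 0
    for o in xs
        if isna(o)
            return Opt{Float64}()    # any missing input makes the mean missing
        end
        total += get(o)
        n += 1
    end
    return n == 0 ? Opt{Float64}() : Opt(total / n)
end

xs = [Opt(1.0), Opt(2.0), Opt(3.0)]
println(get(mean_opt(xs)))           # 2.0

ys = [Opt(1.0), Opt{Float64}()]
println(isna(mean_opt(ys)))          # true
```

The point of the sketch is that the caller, not the function, decides what missingness means: `get(mean_opt(ys), NaN)` recovers the NaN-style behavior, while a bare `get` raises an error.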

For special types like Float64, you could even compute means while returning
NaNs using the default-value interface to `get`:

    function mean{T <: FloatingPoint}(a::AbstractDataArray{T})
        sum, n = 0.0, 0
        for i in 1:length(a)
            # Substitute NaN for any missing entry, so the result is NaN.
            sum += get(a[i], nan(T))
            n += 1
        end
        return sum / n
    end

Unlike our current system, the use of OptionTypes would provide acceptable
performance without requiring that users break abstraction barriers. This is
the big gain: Option{T} exploits Julia's current type system / compiler to
express the same ideas as NAtype does in an efficient way. Options wouldn't
need to be boxed the way that our Union types currently are being boxed because
their type could be easily inferred by the compiler.

If you're not familiar with the requirement for breaking our current
abstraction barriers, note that indexing into a DataArray at the moment poisons
the performance of every program you write, because the result has an uncertain
type that the compiler doesn't know how to optimize. To work around this, you
have to write code that effectively accesses the raw .na and .data fields of a
DataArray. Simon's done some great work to make this easier to do, but I'm not
sure that's the right direction for us to head in the long run.
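
To see the inference issue in miniature, the sketch below compares the inferred return type of two toy indexing functions. All names here (`ToyNA`, `Opt`, `index_union`, `index_opt`) are hypothetical stand-ins, not the DataArrays or OptionTypes definitions. `Base.return_types` is an internal reflection helper rather than a documented stable API, and the code uses Julia 1.x syntax. The sketch only inspects inferred types; it makes no claim about how a particular compiler version optimizes them.

```julia
# Hypothetical stand-ins for illustration only.
struct ToyNA end            # plays the role of NAtype

struct Opt{T}
    na::Bool
    value::T
end

# Returns either a ToyNA or a Float64, so the result type is a Union.
index_union(data::Vector{Float64}, na::BitVector, i::Int) =
    na[i] ? ToyNA() : data[i]

# Always returns an Opt{Float64}, so the result type is concrete.
index_opt(data::Vector{Float64}, na::BitVector, i::Int) =
    Opt{Float64}(na[i], data[i])

argtypes = (Vector{Float64}, BitVector, Int)
rt_union = Base.return_types(index_union, argtypes)[1]
rt_opt = Base.return_types(index_opt, argtypes)[1]

println(rt_union, " is concrete: ", isconcretetype(rt_union))
println(rt_opt, " is concrete: ", isconcretetype(rt_opt))
```

The first function is the shape of scalar indexing into a DataArray today; the second is the shape it would have if indexing produced Option values.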

In general, I think an approach to missing data built on the combination of
Option{T} and DataArray{T} provides an interface that's simple and consistent
(everything happens in terms of isna/get) even if it's an interface that's
somewhat unfamiliar to R/Python folks. Most important to me is that using Option
types efficiently doesn't require a deep understanding of Julia's type system,
whereas our current abstractions require you to understand how to work around
problems raised by Union types when programming for the current compiler. An
Option type is basically a forcing function that says: "Julia is aggressively
typed. If you want to work with a missing value of type T, you need to
explicitly say how you're going to handle any missingness so that the system
only interacts with values of type T."

I should note that I'm not very sure the use of Options is the right approach:
Simon Kornblith has argued very persuasively for waiting for the compiler to
improve its ability to handle tagged unions like those generated by indexing
into DataArrays. My personal feeling is that it's easier to expose a simple
abstraction that doesn't assume the compiler will change dramatically. This is
based on my aesthetic sense that Julia's power comes from making optimizations
transparent and explicit, rather than making the compiler smarter.

It's also worth noting that there are problems for which the use of Option
types isn't very helpful: the computation of medians, for example, isn't
defined in terms of scalars, so having a better abstraction for expressing
missing scalars won't get us anywhere.

One other caveat is that I'm not providing an `unsafe_get` method, which means
that the `isna` followed by `get` idiom I showed above does two checks when you
could get away with only one. I still haven't figured out how I'd like to
handle that issue.
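
For concreteness, here is one shape an unchecked accessor could take. This is only a sketch under my own assumptions: `unsafe_get` does not exist in the package, `Opt` and `sum_present` are hypothetical names used for illustration, and the code is in Julia 1.x syntax so that it runs on a current release.

```julia
# Toy stand-in for an Option type; NOT the OptionTypes.jl implementation.
struct Opt{T}
    na::Bool
    value::T
end

isna(o::Opt) = o.na

# Checked access: tests the mask and throws if the value is missing.
Base.get(o::Opt) = o.na ? throw(ArgumentError("value is NA")) : o.value

# Hypothetical unchecked access: the caller promises that isna(o) is false.
unsafe_get(o::Opt) = o.value

# Sums the present values with a single missingness check per element.
function sum_present(xs::Vector{Opt{Float64}})
    total = 0.0
    for o in xs
        if !isna(o)                  # the only check
            total += unsafe_get(o)   # no second check here
        end
    end
    return total
end

xs = [Opt(false, 1.0), Opt(true, 0.0), Opt(false, 2.5)]
println(sum_present(xs))             # 3.5
```

The cost of this design is that `unsafe_get` on a missing value silently returns whatever payload happens to be stored, which is exactly the kind of hole the checked `get` is meant to close.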

So there's still a lot of design work needed, but I wanted to let people see
how this interface could work were we to choose it.

-- John

On Jul 31, 2014, at 8:15 AM, Harlan Harris wrote:
> John, how might this interact with DataArrays? This design, unlike
> DataArrays, requires that you use an entire byte to store the missingness, so
> it's not likely to be as fast if you're manipulating a lot of them. Is the
> intended use case here something like sum(DataArray) yielding Option{Float64}?
>
>
> On Thu, Jul 31, 2014 at 11:01 AM, Stefan Karpinski wrote:
> This looks quite promising. The `get` interface looks like a very nice way to
> generically deal with missing values – provide a default or get an error.
> Automatic conversion of the default to the type of the Option value is
> particularly nice. This seems like it will be a pleasant and efficient API.
>
> Minor question: how come the non-NA constructor for Option takes both the
> `na` and `value` arguments? Doesn't supplying a value imply that it's not NA
> while not supplying a value implies that it is NA?
>
>
> On Thursday, July 31, 2014, John Myles White wrote:
> At JuliaCon, I described the idea of using Option types instead of NAtype to
> make it easier to write type-stable code that interacts with NA’s. To help
> facilitate a debate about the utility of this approach, I just wrote up a
> minimal package that implements Option types:
> https://github.com/johnmyleswhite/OptionTypes.jl
>
> Essentially, an Option type is a scalar version of a DataArray: it contains a
> value and a Boolean mask indicating whether that value is NA or not. Because
> Option types are parametric, they allow us to express the variants of NA that
> R has, such as a Float64 specific NA and an Int64 specific NA.
>
> — John
>
>