Le jeudi 31 juillet 2014 à 21:19 -0700, John Myles White a écrit :
> To address Simon’s general points, which are really good reasons to
> avoid jumping on the Option{T} bandwagon too soon:
>
>
>
> * I agree that most languages use tagged union types for Option{T}
> rather than a wrapper type that contains a Boolean value. It’s also
> totally true that many compilers are able to make those constructs
> more efficient than Julia currently does. But what we should expect
> from Julia in the coming years isn’t so clear to me. (And I personally
> think we need to settle on a solution for representing missing data
> that’s viable in a year rather than viable in five years.) This is an
> issue that I’d really like to have input on from Jeff, Keno, Jameson
> or someone else involved with the internals of the compiler. Getting
> input from the broader community is the main reason I wanted to put a
> demo of OptionTypes.jl out in front of other folks.
>
>
> * I’m not clear how we could come to know that a datum is not missing
> without a resolution step that’s effectively equivalent to the get()
> function for Option{T}. I agree that the enforced use of get() means
> that you can’t hope to use generic functions like sum on collections
> of Option{T}. But I’m also not sure that’s such a bad thing: I think
> the easiest way to express to the compiler that you know that all of
> the entries of a DataArray are not NA is to convert the DataArray to a
> straight Array. But maybe you have other mechanisms for expressing
> this knowledge. Certainly my proposal to do conversions to Arrays
> isn’t the most elegant strategy. It’s just all that I’ve got so far.
>
>
> * I kind of like the idea of Option{T} standing outside of the main
> type system in a kind of mirror type system. I’m less happy about
> Union(NA, T) being a super type of T, even though there are some good
> reasons that you’d like to view T as a specialization of Union(NA, T).
> But I agree that I don’t have a good feel about where missing data
> belongs in the type hierarchy. This is another question for which I’d
> love to get input from others.
Either with Option{T} or with Union(NAtype, T), nullable types create a
mirror type hierarchy. But in the latter case, there are bridges between
the two hierarchies:
- horizontally, T <: Union(NAtype, T)
- vertically (in diagonal), T2 <: Union(NAtype, T1) when T2 <: T1
For example, for Real and Float64:
Real <- Union(NAtype, Real)
| / |
| / |
| / |
Float64 <- Union(NAtype, Float64)
I see this as a relatively nice definition of the type hierarchy for
missing data.
Thus I agree with Simon that using a type union if possible would be
better than a specific option type which would obscure the type
hierarchy -- which is really essential in Julia. Let us see what people
working on the compiler say.
But thanks for tackling this John, it's a very interesting investigation
and it would be great to be able to handle missing data in a more Julian
fashion!
Regards
>
> In regard to Simon’s performance points:
>
>
> * Yes, memory usage alone argues strongly for working with
> DataArray{T} rather than Array{Option{T}}.
>
>
> * Exploting tricks that make operations like anyna() faster is another
> good argument for keeping DataArray{T} around.
>
>
> * I’m not sure how to deal with inlining concerns or the undefined
> reference checks. Do you have ideas for improving this within
> DataArrays or do we need supporting changes in the compiler?
>
>
> — John
>
>
> On Jul 31, 2014, at 4:49 PM, Simon Kornblith <[email protected]>
> wrote:
>
>
>
> > My previous comments on option types versus union types for
> > DataArray indexing are here. To elaborate on John's summary above,
> > my present views are:
> > * In many other language, including languages that are
> > lower-level than Julia such as Rust, option types are union
> > types and it's the compiler's job to make them fast. Making
> > union types fast would simply mean that the presence
> > of Union in the output of code_typed is no longer to be
> > dreaded.
> > * The option type approach forces special handling of data
> > that could potentially be missing but isn't, and this breaks
> > generic code. I feel it is better if, once we know a datum
> > is not missing, it has the same concrete type as if it were
> > known to exist in advance.
> > * I'm not sure how missing data ought to fit into the type
> > hierarchy, but neither the Union(NA, T) nor Option{T}
> > approaches seem completely ideal.
> > Some more specific notes WRT performance:
> > * Storing option types in an Array{Option{T}} would require
> > nearly twice as much memory as storing them in a DataArray
> > in most cases. While a Bool is only 1 byte, there is
> > additional padding needed for alignment. On x86_64 for types
> > <= 64 bits the alignment is usually the size of the type.
> > * As long as option types live in registers and not in memory,
> > they should consume no additional space as compared with
> > scalar indexing into both the data and na arrays of a
> > DataArray, as indexing into the na array would already
> > require a register. There are, however, some cases where the
> > BitArray representation of NAs can be exploited for
> > performance. In John's example code forsum above with dropna
> > = false above, the Option type approach would read every bit
> > of the naarray individually. It is faster to first check if
> > any values are NA, which can be done 64 bits at a time, and
> > throw an error if there are NAs or sum the values if not.
> > * There is other overhead related to DataArray indexing that
> > needs to be addressed, notably that scalar indexing needs to
> > be inlined (as well as isna and get, if we're using option
> > types), and that accessing the data and na arrays currently
> > incurs an undefined reference check.
> > Simon
> >
> > On Thursday, July 31, 2014 12:51:15 PM UTC-4, John Myles White
> > wrote:
> >
> > Harlan, I don't think your assumption about performance is
> > necessarily correct: it's sometimes the case that you can
> > get *better* performance by working with bytes rather than
> > bits, since bytes are the primitive objects of computation.
> > For that reason, I think the performance implications of
> > using a full Bool are very sensitive to the exact
> > computations being performed.
> >
> > In general, I don't think the expansion of a BitArray bit
> > into a Boolean is a big issue for most data analysis tasks.
> > As evidence, I'd note that expanding a bit to a Bool is
> > certainly not worse than the cost of translating a single
> > byte from an ASCIIString or UTF8String object into a
> > Char object when doing iteration over strings. If you think
> > that decision for Base Julia is reasonable, I think the
> > decision to use Option is also defensible for similar
> > reasons.
> >
> > As for integration into the current system, my thinking is
> > that DataArrays would be changed so that OptionTypes would
> > be generated whenever you attempt to access a scalar element
> > of an AbstractDataArray. The following example shows how
> > that might work:
> >
> > function mean(a::AbstractDataArray, dropna::Boolean = false)
> > sum, n = 0.0, 0
> > if dropna
> > for i in 1:length(a)
> > o = a[i]
> > if !isna(o)
> > sum += get(o)
> > n += 1
> > end
> > end
> > else
> > for i in 1:length(a)
> > sum += get(o)
> > n += 1
> > end
> > end
> > return sum / n
> > end
> >
> > One could also define this function so that it always
> > returns an Option type, rather than a direct Float64 value.
> >
> >
> > For special types like Float64, you could even computes
> > means while returning NaN's using the default values
> > interface to `get`:
> >
> > function mean{T <: FloatingPoint}(a::AbstractDataArray{T})
> > sum, n = 0.0, 0
> > for i in 1:length(a)
> > sum += get(o, nan(T))
> > n += 1
> > end
> > return sum / n
> > end
> >
> > Unlike our current system, the use of OptionTypes would
> > provide acceptable performance without requiring that users
> > break abstraction barriers. This is the big gain: Option{T}
> > exploits Julia's current type system / compiler to express
> > the same ideas as NAtype does in an efficient way. Options
> > wouldn't need to be boxed the way that our Union types
> > currently are being boxed because their type could be easily
> > inferred by the compiler.
> >
> > If you're not familiar with the requirement for breaking our
> > current abstraction barriers, note that indexing into a
> > DataArray at the moment poisons the performance of every
> > program you write, because the result has an uncertain type
> > that the compiler doesn't know how to optimize. To work
> > around this, you have to write code that effectively
> > accesses the raw .na and .data fields of a DataArray.
> > Simon's done some great work to make this easier to do, but
> > I'm not sure that's the right direction for us to head in
> > the long-run.
> >
> > In general, I think an approach to missing data built on the
> > combination of Option{T} and DataArray{T} provides an
> > interface that's simple and consistent (everything happens
> > in terms of isna/get) even if it's an interface that's
> > somewhat unfamilar to R/Python folks. Most important to me
> > is that using Option types efficiently doesn't require a
> > deep understanding of Julia's type system, whereas our
> > current abstractions require you to understand how to work
> > around problems raised by Union types when programming for
> > the current compiler. An Option type is basically a forcing
> > function that says: "Julia is aggressively typed. If you
> > want to work with a missing value of type T, you need to
> > explicitly say how you're going to handle any missingness so
> > that the system only interacts with values of type T."
> >
> > I should note that I'm not very sure the use of Options is
> > the right approach: Simon Kornblith has argued very
> > persuasively for waiting for the compiler to improve its
> > ability to handle tagged unions like those generated by
> > indexing into DataArrays. My personal feeling is that it's
> > easier to expose a simple abstraction that doesn't assume
> > the compiler will change dramatically. This is based on my
> > aesthetic sense that Julia's power comes from making
> > optimizations transparent and explicit, rather than making
> > the compiler smarter.
> >
> > It's also worth noting that there are problems for which the
> > use of Option types isn't very helpful: the computation of
> > medians, for example, isn't defined in terms of scalars, so
> > having a better abstraction for expressing missing scalars
> > won't get us anywhere.
> >
> > One other caveat is that I'm not providing an `unsafe_get`
> > method, which means that the `isna` followed by `get` idiom
> > I showed above does two checks when you could get away with
> > only one. I still haven't figured out how I'd like to handle
> > that issue.
> >
> >
> > So there's still a lot of design work needed, but I wanted
> > to let people see how this interface could work were we to
> > choose it.
> >
> >
> > -- John
> >
> >
> > On Jul 31, 2014, at 8:15 AM, Harlan Harris
> > <[email protected]> wrote:
> >
> >
> >
> > > John, how might this interact with DataArrays? This
> > > design, unlike DataArrays, requires that you use an entire
> > > byte to store the missingness, so it's not likely to be as
> > > fast if you're manipulating a lot of them. Is the intended
> > > use case here something sum(DataArray) yields
> > > Option{Float64}?
> > >
> > >
> > >
> > > On Thu, Jul 31, 2014 at 11:01 AM, Stefan
> > > Karpinski <[email protected]> wrote:
> > >
> > > This looks quite promising. The `get` interface
> > > looks like a very nice way to generically deal
> > > with missing values – provide a default or get an
> > > error. Automatic conversion of the default to the
> > > type of the Option value is particularly nice.
> > > This seems like it will be a pleasant and
> > > efficient API.
> > >
> > >
> > >
> > > Minor question: how come the non-NA constructor
> > > for Option takes both the `na` and `value`
> > > arguments? Doesn't supplying a value imply that
> > > it's not NA while not supplying a value implies
> > > that it is NA?
> > >
> > >
> > >
> > > On Thursday, July 31, 2014, John Myles White
> > > <[email protected]> wrote:
> > >
> > > At JuliaCon, I described the idea of using
> > > Option types instead of NAtype to make it
> > > easier to write type-stable code that
> > > interacts with NA’s. To help facilitate a
> > > debate about the utility of this approach,
> > > I just wrote up a minimal package that
> > > implements Option
> > > types:
> > https://github.com/johnmyleswhite/OptionTypes.jl
> > >
> > > Essentially, an Option type is a scalar
> > > version of a DataArray: it contains a
> > > value and a Boolean mask indicating
> > > whether that value is NA or not. Because
> > > Option types are parametric, they allow us
> > > to express the variants of NA that R has,
> > > such as a Float64 specific NA and an Int64
> > > specific NA.
> > >
> > > — John
>
>
>