My previous comments on option types versus union types for DataArray 
indexing are here 
<https://github.com/JuliaStats/DataArrays.jl/issues/71#issuecomment-48243633>. 
To elaborate on John's summary above, my present views are:

   - In many other language, including languages that are lower-level than 
   Julia such as Rust, option types are union types and it's the compiler's 
   job to make them fast. Making union types fast would simply mean that the 
   presence of Union in the output of code_typed is no longer to be dreaded.
   - The option type approach forces special handling of data that could 
   potentially be missing but isn't, and this breaks generic code. I feel it 
   is better if, once we know a datum is not missing, it has the same concrete 
   type as if it were known to exist in advance.
   - I'm not sure how missing data ought to fit into the type hierarchy, 
   but neither the Union(NA, T) nor Option{T} approaches seem completely ideal.

Some more specific notes WRT performance:

   - Storing option types in an Array{Option{T}} would require nearly twice 
   as much memory as storing them in a DataArray in most cases. While a Bool 
   is only 1 byte, there is additional padding needed for alignment. On x86_64 
   for types <= 64 bits the alignment is usually the size of the type.
   - As long as option types live in registers and not in memory, they 
   should consume no additional space as compared with scalar indexing into 
   both the data and na arrays of a DataArray, as indexing into the na 
   array would already require a register. There are, however, some cases 
   where the BitArray representation of NAs can be exploited for performance. 
   In John's example code for sum above with dropna = false above, the 
   Option type approach would read every bit of the na array individually. 
   It is faster to first check if any values are NA, which can be done 64 bits 
   at a time, and throw an error if there are NAs or sum the values if not.
   - There is other overhead related to DataArray indexing that needs to be 
   addressed, notably that scalar indexing needs to be inlined (as well as 
   isna and get, if we're using option types), and that accessing the data 
   and na arrays currently incurs an undefined reference check.
   
Simon

On Thursday, July 31, 2014 12:51:15 PM UTC-4, John Myles White wrote:
>
> Harlan, I don't think your assumption about performance is necessarily 
> correct: it's sometimes the case that you can get *better* performance by 
> working with bytes rather than bits, since bytes are the primitive objects 
> of computation. For that reason, I think the performance implications of 
> using a full Bool are very sensitive to the exact computations being 
> performed.
>
> In general, I don't think the expansion of a BitArray bit into a Boolean 
> is a big issue for most data analysis tasks. As evidence, I'd note that 
> expanding a bit to a Bool is certainly not worse than the cost 
> of translating a single byte from an ASCIIString or UTF8String object into 
> a Char object when doing iteration over strings. If you think that decision 
> for Base Julia is reasonable, I think the decision to use Option is also 
> defensible for similar reasons.
>
> As for integration into the current system, my thinking is that 
> DataArrays would be changed so that OptionTypes would be generated whenever 
> you attempt to access a scalar element of an AbstractDataArray. The 
> following example shows how that might work:
>
> function mean(a::AbstractDataArray, dropna::Boolean = false)
>    sum, n = 0.0, 0
>    if dropna
>        for i in 1:length(a)
>            o = a[i]
>            if !isna(o)
>                sum += get(o)
>                n += 1
>            end
>        end
>    else
>        for i in 1:length(a)
>            sum += get(o)
>            n += 1
>        end
>    end
>    return sum / n
> end
>
> One could also define this function so that it always returns an Option 
> type, rather than a direct Float64 value.
>
> For special types like Float64, you could even computes means while 
> returning NaN's using the default values interface to `get`:
>
> function mean{T <: FloatingPoint}(a::AbstractDataArray{T})
>    sum, n = 0.0, 0
>    for i in 1:length(a)
>        sum += get(o, nan(T))
>        n += 1
>    end
>    return sum / n
> end
>
> Unlike our current system, the use of OptionTypes would provide 
> acceptable performance without requiring that users break abstraction 
> barriers. This is the big gain: Option{T} exploits Julia's current type 
> system / compiler to express the same ideas as NAtype does in an efficient 
> way. Options wouldn't need to be boxed the way that our Union types 
> currently are being boxed because their type could be easily inferred by 
> the compiler.
>
> If you're not familiar with the requirement for breaking our 
> current abstraction barriers, note that indexing into a DataArray at the 
> moment poisons the performance of every program you write, because the 
> result has an uncertain type that the compiler doesn't know how to 
> optimize. To work around this, you have to write code that effectively 
> accesses the raw .na and .data fields of a DataArray. Simon's done some 
> great work to make this easier to do, but I'm not sure that's the right 
> direction for us to head in the long-run.
>
> In general, I think an approach to missing data built on the combination 
> of Option{T} and DataArray{T} provides an interface that's simple and 
> consistent (everything happens in terms of isna/get) even if it's an 
> interface that's somewhat unfamilar to R/Python folks. Most important to me 
> is that using Option types efficiently doesn't require a deep understanding 
> of Julia's type system, whereas our current abstractions require you to 
> understand how to work around problems raised by Union types when 
> programming for the current compiler. An Option type is basically a forcing 
> function that says: "Julia is aggressively typed. If you want to work with 
> a missing value of type T, you need to explicitly say how you're going to 
> handle any missingness so that the system only interacts with values of 
> type T."
>
> I should note that I'm not very sure the use of Options is the right 
> approach: Simon Kornblith has argued very persuasively for waiting for the 
> compiler to improve its ability to handle tagged unions like those 
> generated by indexing into DataArrays. My personal feeling is that it's 
> easier to expose a simple abstraction that doesn't assume the compiler will 
> change dramatically. This is based on my aesthetic sense that Julia's power 
> comes from making optimizations transparent and explicit, rather than 
> making the compiler smarter.
>
> It's also worth noting that there are problems for which the use of 
> Option types isn't very helpful: the computation of medians, for example, 
> isn't defined in terms of scalars, so having a better abstraction for 
> expressing missing scalars won't get us anywhere.
>
> One other caveat is that I'm not providing an `unsafe_get` method, which 
> means that the `isna` followed by `get` idiom I showed above does two 
> checks when you could get away with only one. I still haven't figured out 
> how I'd like to handle that issue.
>
> So there's still a lot of design work needed, but I wanted to let people 
> see how this interface could work were we to choose it.
>
>  -- John
>
> On Jul 31, 2014, at 8:15 AM, Harlan Harris <[email protected] 
> <javascript:>> wrote:
>
> John, how might this interact with DataArrays? This design, unlike 
> DataArrays, requires that you use an entire byte to store the missingness, 
> so it's not likely to be as fast if you're manipulating a lot of them. Is 
> the intended use case here something sum(DataArray) yields Option{Float64}? 
>
>
> On Thu, Jul 31, 2014 at 11:01 AM, Stefan Karpinski <[email protected] 
> <javascript:>> wrote:
>
>> This looks quite promising. The `get` interface looks like a very nice 
>> way to generically deal with missing values – provide a default or get an 
>> error. Automatic conversion of the default to the type of the Option value 
>> is particularly nice. This seems like it will be a pleasant and efficient 
>> API.
>>
>> Minor question: how come the non-NA constructor for Option takes both the 
>> `na` and `value` arguments? Doesn't supplying a value imply that it's not 
>> NA while not supplying a value implies that it is NA?
>>
>>
>> On Thursday, July 31, 2014, John Myles White <[email protected] 
>> <javascript:>> wrote:
>>
>>> At JuliaCon, I described the idea of using Option types instead of 
>>> NAtype to make it easier to write type-stable code that interacts with 
>>> NA’s. To help facilitate a debate about the utility of this approach, I 
>>> just wrote up a minimal package that implements Option types: 
>>> https://github.com/johnmyleswhite/OptionTypes.jl
>>>
>>> Essentially, an Option type is a scalar version of a DataArray: it 
>>> contains a value and a Boolean mask indicating whether that value is NA or 
>>> not. Because Option types are parametric, they allow us to express the 
>>> variants of NA that R has, such as a Float64 specific NA and an Int64 
>>> specific NA.
>>>
>>>  — John
>>>
>>>  
>
>

Reply via email to