On Fri, Jul 1, 2011 at 3:36 PM, Charles R Harris <[email protected]> wrote:
> On Fri, Jul 1, 2011 at 2:33 PM, Mark Wiebe <[email protected]> wrote:
>
>> On Fri, Jul 1, 2011 at 3:29 PM, Charles R Harris
>> <[email protected]> wrote:
>>
>>> On Fri, Jul 1, 2011 at 2:26 PM, Mark Wiebe <[email protected]> wrote:
>>>
>>>> On Fri, Jul 1, 2011 at 3:20 PM, Mark Wiebe <[email protected]> wrote:
>>>>
>>>>> On Fri, Jul 1, 2011 at 3:01 PM, Skipper Seabold
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> On Fri, Jul 1, 2011 at 3:46 PM, Dag Sverre Seljebotn
>>>>>> <[email protected]> wrote:
>>>>>> > I propose a simple idea *for the long term* for generalizing Mark's
>>>>>> > proposal, which I hope may put some people behind Mark's concrete
>>>>>> > proposal in the short term.
>>>>>> >
>>>>>> > The key feature missing in Mark's proposal is the ability to
>>>>>> > distinguish between different reasons for NA-ness: IGNORE vs. NA.
>>>>>> > However, one could conceive of wanting to track a whole host of
>>>>>> > reasons:
>>>>>> >
>>>>>> > homework_grades = np.asarray([2, 3, 1, EATEN_BY_DOG, 5, SICK, 2,
>>>>>> > TOO_LAZY])
>>>>>> >
>>>>>> > Wouldn't it be a shame to put a lot of work into NA, but then have
>>>>>> > users still keep a separate "shadow array" for stuff like this?
>>>>>> >
>>>>>> > a) In this case the generality of Mark's proposal seems justified
>>>>>> > and less confusing to teach newcomers (?)
>>>>>> >
>>>>>> > b) Since Mark's proposal seems to generalize well to many NAs (there
>>>>>> > are 8 bits in the mask, and millions of available NaNs in floating
>>>>>> > point), if people agreed to this, one could leave it for later and
>>>>>> > just go on with the proposed idea.
>>>>>>
>>>>>> I have not been following the discussion in much detail, so forgive me
>>>>>> if this has come up, but I think this approach is also similar to the
>>>>>> thinking behind missing values in SAS and "extended" missing values in
>>>>>> Stata. They are missing but preserve an order.
>>>>>> This way you can pull out values that are missing because they were
>>>>>> eaten by a dog and see whether those are systematically different from
>>>>>> the ones that are missing because the student was too lazy. A use case
>>>>>> that comes to mind: checking whether the various forms of attrition in
>>>>>> surveys or experiments vary in a non-random way.
>>>>>>
>>>>>> http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a000989180.htm
>>>>>> http://www.stata.com/help.cgi?missing
>>>>>
>>>>> That's interesting, and I see that they use a numerical ordering for
>>>>> the different NA values. I think if, instead of using the AND operator
>>>>> to combine masks, we use MINIMUM, this behavior would happen naturally
>>>>> with almost no additional work. Then, in addition to np.NA and
>>>>> np.NA(dtype), it could allow np.NA(dtype, ID) to assign an ID between 1
>>>>> and 255, where 1 is the default.
>>>>
>>>> Sorry, my brain is a bit addled by all these comments. This idea would
>>>> also require flipping the mask so that 0 is unmasked and 1 to 255 are
>>>> masked, as Christopher pointed out in a different thread.
>>>
>>> Or you could subtract instead of add and use maximum instead of minimum.
>>> I thought those details would be hidden.
>>
>> Definitely, but the most natural distinction, thinking numerically, is
>> between zero and non-zero, and there's only one zero, so giving it the
>> 'unmasked' value is natural for this way of extending it. If you follow
>> Joe's idea, where you're basically introducing it as an image alpha mask,
>> you would have 0 be fully masked, 128 be 50% masked, and 255 be fully
>> unmasked.
>
> I'm not complaining ;) I thought these ideas were out there from the
> beginning, but maybe that was just me...

You're right, but it feels like it's been 10 years in internet time by now.
:) The design has evolved a lot from all the feedback too, so revisiting
some of these things that initially felt like they didn't fit doesn't hurt.
I'm not so keen on rereading 250+ email messages, though...

-Mark

> Chuck
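For anyone following along, the combination rule discussed above can be sketched in a few lines. This is a hypothetical illustration, not the actual implementation: the reason-ID constants, the separate uint8 mask array, and the `combine` helper are all made up here, and it adopts the flipped convention from the thread (0 = unmasked, 1 to 255 = masked), so `maximum` plays the role Mark described for MINIMUM.

```python
import numpy as np

# Hypothetical sketch of the multi-NA idea (assumed semantics, not the
# proposed NumPy API): a uint8 "reason mask" where 0 means the element
# is valid and 1..255 identify why it is NA, with 1 as the generic
# default, matching the np.NA(dtype, ID) suggestion.
NA, EATEN_BY_DOG, SICK, TOO_LAZY = 1, 2, 3, 4

# Payload values under NA slots are arbitrary placeholders.
homework_grades = np.array([2, 3, 1, 0, 5, 0, 2, 0])
reasons = np.array([0, 0, 0, EATEN_BY_DOG, 0, SICK, 0, TOO_LAZY],
                   dtype=np.uint8)

def combine(mask_a, mask_b):
    """Combine two reason masks for a binary operation.

    With 0 = unmasked, an element-wise maximum makes NA-ness propagate
    (any non-zero reason wins over zero) and, when both elements are
    NA, keeps the higher-numbered reason -- so an ordering like
    Stata's extended missing values falls out with no extra work.
    """
    return np.maximum(mask_a, mask_b)

other = np.array([0, SICK, 0, NA, 0, 0, 0, 0], dtype=np.uint8)
combined = combine(reasons, other)

# Pulling out the elements missing for one specific reason, per
# Skipper's attrition-analysis use case:
eaten = reasons == EATEN_BY_DOG
```

A usage note: under this convention a plain boolean view of validity is just `reasons == 0`, which is part of why zero vs. non-zero is the natural split Mark mentions.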
_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
