On Mon, Aug 19, 2019 at 5:44 PM Ahmet Altay <al...@google.com> wrote:
>
>
>
> On Mon, Aug 19, 2019 at 9:56 AM Brian Hulette <bhule...@google.com> wrote:
>>
>>
>>
>> On Fri, Aug 16, 2019 at 5:17 PM Chad Dombrova <chad...@gmail.com> wrote:
>>>>
>>>> >> Agreed on float since it seems to trivially map to a double, but I’m 
>>>> >> torn on int still. While I do want int type hints to work, it doesn’t 
>>>> >> seem appropriate to map it to AtomicType.INT64, since it has a 
>>>> >> completely different range of values.
>>>> >>
>>>> >> Let’s say we used native int for the runtime field type, not just as a 
>>>> >> schema declaration for numpy.int64. What is the real world fallout from 
>>>> >> this? Would there be data loss?
>>>> >
>>>> > I'm not sure I follow the question exactly. What is the interplay
>>>> > between int and numpy.int64 in this scenario? Are you saying that 
>>>> > np.int64 is used in the schema declaration, but we just use native int 
>>>> > at runtime, and check the bit width when encoding?
>>>> >
>>>> > In any case, I don't think the real world fallout of using int is nearly 
>>>> > that dire. I suppose data loss is possible if a poorly designed pipeline 
>>>> > overflows an int64 and crashes,
>>>>
>>>> The primary risk is that it *won't* crash when overflowing an int64,
>>>> it'll just silently give the wrong answer. That's much less safe than
>>>> using a native int and then actually crashing in the case it's too
>>>> large at the point one tries to encode it.
>>>
>>>
>>> If the behavior of numpy.int64 is less safe than int, and both support 
>>> 64-bit integers, and int is the more intuitive type to use, then that seems 
>>> to make a strong case for using int rather than numpy.int64.
>>>
>>
>> I'm not sure we established numpy.int64 is less safe, just that a silent 
>> overflow is a risk.

Silent overflows are inherently less safe, especially in a language
whose users generally never have to deal with overflow at all.

>> By default numpy will just log a warning when an overflow occurs, so it's 
>> not totally silent, but definitely risky. numpy can however be made to throw 
>> an exception when an overflow occurs with `np.seterr(over='raise')`.

Warning logs on remote machines are unlikely to ever be seen. Even if
one knew about the numpy setting (keep in mind the user may not ever
directly use or import numpy), it doesn't seem to work (and one would
have to set it on the remote workers, or propagate this setting if set
in the main program).

In [1]: import numpy as np
In [2]: np.seterr(over='raise')  # returns previous value
Out[2]: {'divide': 'warn', 'invalid': 'warn', 'over': 'warn', 'under': 'ignore'}
In [3]: np.int64(2**36) * np.int64(2**36)
Out[3]: 0
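
For contrast, plain Python ints are arbitrary precision, so the same
computation just gives the right answer instead of silently wrapping:

In [4]: 2**36 * 2**36
Out[4]: 4722366482869645213696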

>> Regardless of what type is used in the typing representation of a schema, 
>> we've established that RowCoder.encode should accept anything convertible to 
>> an int for integer fields. So it will need to check its width and raise an
>> error if it's too large.
>> I added some tests last week to ensure that RowCoder does this [1]. However,
>> they're currently skipped because I'm unsure of the proper place to raise 
>> the error. I wrote up the details in a comment [2] (sorry I did a force push 
>> so the comment doesn't show up in the appropriate place).
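
For what it's worth, a minimal sketch of the kind of width check being
described (the helper name is hypothetical, and where exactly it would
hook into RowCoder is precisely the open question you raised):

def _check_int64_range(value):
  # Accept anything convertible to int, but reject values that would
  # silently wrap when encoded as a signed 64-bit integer.
  value = int(value)
  if not -(1 << 63) <= value < (1 << 63):
    raise ValueError('value %d out of range for an INT64 field' % value)
  return value
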
>>
>> Note that when decoding an INT32/64 field RowCoder still produces plain old 
>> ints (since it relies on VarIntCoder), so int really is the runtime type, 
>> and the numpy types are just for the typing representation of a schema.
>>
>> I also updated my PR to accept int, float, and str in the typing 
>> representation of a schema, and added the following summary of type mappings 
>> to typehints.schema [1], since it's not readily apparent from the code 
>> itself:
>
>
> Cool!
>
>>
>>
>> Python              Schema
>> np.int8     <-----> BYTE
>> np.int16    <-----> INT16
>> np.int32    <-----> INT32
>> np.int64    <-----> INT64
>> int         ---/
>> np.float32  <-----> FLOAT
>> np.float64  <-----> DOUBLE
>> float       ---/
>> bool        <-----> BOOLEAN
>> The mappings for STRING and BYTES are different between python 2 and python 
>> 3,
>> because of the changes to str:
>> py3:
>> str/unicode <-----> STRING
>> bytes       <-----> BYTES
>> ByteString  ---/
>> py2:
>> unicode     <-----> STRING
>> str/bytes   ---/
>> ByteString  <-----> BYTES
>>
>> As you can see, int and float typings can now be used to create a schema 
>> with an INT64 or DOUBLE attribute, but when creating an anonymous NamedTuple 
>> sub-class from a schema, the numpy types are preferred. I prefer that 
>> approach, if only for symmetry with the other integer and floating point 
>> types, but I can change it to prefer int/float if I'm the only one that 
>> feels that way.

Just to be clear, this is just talking about the schema itself (as at
that level, due to the many-to-one mapping above, no distinction is
made between int vs. int64). The runtime types are still int/float,
right?
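
To make sure we're talking about the same thing, here's a rough sketch of
how I read the mapping (py3 annotation syntax for brevity; the field names
are just for illustration and this is not necessarily the exact API in the
PR):

from typing import NamedTuple
import numpy as np

class Movie(NamedTuple):
  name: str           # STRING
  year: int           # declared as int, INT64 in the schema
  rating: np.float32  # FLOAT

# The schema declaration may use int or np.int64 interchangeably, but the
# field values flowing through the pipeline at runtime are plain Python ints.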

> Just an opinion: As a user I would expect anonymous types created for me to 
> have native python types. I do not have data on what would be the 
> expectations of users in general.

I think if the schema declares a field to be int64, but the runtime
type of the values is actually int, that's probably OK.
