Re: [PROPOSAL] An initial Schema API in Python

Ahmet Altay Mon, 19 Aug 2019 17:44:44 -0700

On Mon, Aug 19, 2019 at 9:56 AM Brian Hulette <[email protected]> wrote:


>
>
> On Fri, Aug 16, 2019 at 5:17 PM Chad Dombrova <[email protected]> wrote:
>
>> >> Agreed on float since it seems to trivially map to a double, but I’m
>>> torn on int still. While I do want int type hints to work, it doesn’t seem
>>> appropriate to map it to AtomicType.INT64, since it has a completely
>>> different range of values.
>>> >>
>>> >> Let’s say we used native int for the runtime field type, not just as
>>> a schema declaration for numpy.int64. What is the real world fallout from
>>> this? Would there be data loss?
>>> >
>>> > I'm not sure I follow the question exactly, what is the interplay
>>> between int and numpy.int64 in this scenario? Are you saying that np.int64
>>> is used in the schema declaration, but we just use native int at runtime,
>>> and check the bit width when encoding?
>>> >
>>> > In any case, I don't think the real world fallout of using int is
>>> nearly that dire. I suppose data loss is possible if a poorly designed
>>> pipeline overflows an int64 and crashes,
>>>
>>> The primary risk is that it *won't* crash when overflowing an int64,
>>> it'll just silently give the wrong answer. That's much less safe than
>>> using a native int and then actually crashing in the case it's too
>>> large at the point one tries to encode it.
>>>
>>
>> If the behavior of numpy.int64 is less safe than int, and both support
>> 64-bit integers, and int is the more intuitive type to use, then that seems
>> to make a strong case for using int rather than numpy.int64.
>>
>>
> I'm not sure we established numpy.int64 is less safe, just that a silent
> overflow is a risk. By default numpy will just log a warning when an
> overflow occurs, so it's not totally silent, but definitely risky. numpy
> can however be made to throw an exception when an overflow occurs with
> `np.seterr(over='raise')`.
>
> Regardless of what type is used in the typing representation of a schema,
> we've established that RowCoder.encode should accept anything convertible
> to an int for integer fields. So it will need to check it's width and raise
> an error if it's too large.
> I added some tests last week to ensure that RowCoder does this [1].
> However they're currently skipped because I'm unsure of the proper place to
> raise the error. I wrote up the details in a comment [2] (sorry I did a
> force push so the comment doesn't show up in the appropriate place).
>
> Note that when decoding an INT32/64 field RowCoder still produces plain
> old ints (since it relies on VarIntCoder), so int really is the runtime
> type, and the numpy types are just for the typing representation of a
> schema.
>
> I also updated my PR to accept int, float, and str in the typing
> representation of a schema, and added the following summary of type
> mappings to typehints.schema [1], since it's not readily apparent from the
> code itself:
>

Cool!


>
> Python              Schema
> np.int8     <-----> BYTE
> np.int16    <-----> INT16
> np.int32    <-----> INT32
> np.int64    <-----> INT64
> int         ---/
> np.float32  <-----> FLOAT
> np.float64  <-----> DOUBLE
> float       ---/
> bool        <-----> BOOLEAN
> The mappings for STRING and BYTES are different between python 2 and
> python 3,
> because of the changes to str:
> py3:
> str/unicode <-----> STRING
> bytes       <-----> BYTES
> ByteString  ---/
> py2:
> unicode     <-----> STRING
> str/bytes   ---/
> ByteString  <-----> BYTES
>
> As you can see, int and float typings can now be used to create a schema
> with an INT64 or DOUBLE attribute, but when creating an anonymous
> NamedTuple sub-class from a schema, the numpy types are preferred. I prefer
> that approach, if only for symmetry with the other integer and floating
> point types, but I can change it to prefer int/float if I'm the only one
> that feels that way.
>

Just an opinion: As a user I would expect anonymous types created for me to
have native python types. I do not have data on what would be the
expectations of users in general.


>
> Brian
>
> [1]
> https://github.com/apache/beam/pull/9188/files#diff-94d5ea3d2d121722c91b220a353490e2R88
> [2] https://github.com/apache/beam/pull/9188#discussion_r312682478
> [3]
> https://github.com/apache/beam/blob/25dcc50a8de9c607069a8efc80a6da67a6e8b0ca/sdks/python/apache_beam/typehints/schemas.py#L20
>
>
>> -chad
>>
>>

Re: [PROPOSAL] An initial Schema API in Python

Reply via email to