On Tue, Aug 20, 2019 at 3:48 PM Brian Hulette <bhule...@google.com> wrote:
>
>
> On Tue, Aug 20, 2019 at 1:41 PM Robert Bradshaw <rober...@google.com>
> wrote:
>
>> On Mon, Aug 19, 2019 at 5:44 PM Ahmet Altay <al...@google.com> wrote:
>> >
>> >
>> >
>> > On Mon, Aug 19, 2019 at 9:56 AM Brian Hulette <bhule...@google.com>
>> wrote:
>> >>
>> >>
>> >> On Fri, Aug 16, 2019 at 5:17 PM Chad Dombrova <chad...@gmail.com>
>> wrote:
>> >>>>
>> >>>> >> Agreed on float since it seems to trivially map to a double, but
>> I'm torn on int still. While I do want int type hints to work, it doesn't
>> seem appropriate to map it to AtomicType.INT64, since it has a completely
>> different range of values.
>> >>>> >>
>> >>>> >> Let's say we used native int for the runtime field type, not just
>> as a schema declaration for numpy.int64. What is the real world fallout
>> from this? Would there be data loss?
>> >>>> >
>> >>>> > I'm not sure I follow the question exactly, what is the interplay
>> between int and numpy.int64 in this scenario? Are you saying that np.int64
>> is used in the schema declaration, but we just use native int at runtime,
>> and check the bit width when encoding?
>> >>>> >
>> >>>> > In any case, I don't think the real world fallout of using int is
>> nearly that dire. I suppose data loss is possible if a poorly designed
>> pipeline overflows an int64 and crashes,
>> >>>>
>> >>>> The primary risk is that it *won't* crash when overflowing an int64,
>> >>>> it'll just silently give the wrong answer. That's much less safe than
>> >>>> using a native int and then actually crashing in the case it's too
>> >>>> large at the point one tries to encode it.
>> >>>
>> >>>
>> >>> If the behavior of numpy.int64 is less safe than int, and both
>> support 64-bit integers, and int is the more intuitive type to use, then
>> that seems to make a strong case for using int rather than numpy.int64.
>> >>>
>> >>
>> >> I'm not sure we established numpy.int64 is less safe, just that a
>> silent overflow is a risk.
>>
>> Silent overflows are inherently less safe, especially for a language
>> where users in general never have to deal with this.
>>
>
> Absolutely agree that silent overflows are unsafe! I was just trying to
> point out that numpy isn't strictly silent. But as you point out below it's
> irrelevant because the runtime type is still int.
>
>
>> >> By default numpy will just log a warning when an overflow occurs, so
>> it's not totally silent, but definitely risky. numpy can however be made to
>> throw an exception when an overflow occurs with `np.seterr(over='raise')`.
>>
>> Warning logs on remote machines are unlikely to ever be seen. Even if
>> one knew about the numpy setting (keep in mind the user may not ever
>> directly use or import numpy), it doesn't seem to work (and one would
>> have to set it on the remote workers, or propagate this setting if set
>> in the main program).
>>
>> In [1]: import numpy as np
>> In [2]: np.seterr(over='raise')  # returns previous value
>> Out[2]: {'divide': 'warn', 'invalid': 'warn', 'over': 'warn', 'under':
>> 'ignore'}
>> In [3]: np.int64(2**36) * np.int64(2**36)
>> Out[3]: 0
>>
>>
> That's odd...
> I ran the same test (Python 2.7, numpy 1.16) and it worked
> for me:
>
> In [4]: import numpy as np
>
> In [5]: np.int64(2**36) * np.int64(2**36)
> /usr/local/google/home/bhulette/working_dir/beam/sdks/python/venv/bin/ipython:1:
> RuntimeWarning: overflow encountered in long_scalars
>   #!/usr/local/google/home/bhulette/working_dir/beam/sdks/python/venv/bin/python
> Out[5]: 0
>
> In [6]: np.seterr(over='raise')
> Out[6]: {'divide': 'warn', 'invalid': 'warn', 'over': 'warn', 'under':
> 'ignore'}
>
> In [7]: np.int64(2**36) * np.int64(2**36)
> ---------------------------------------------------------------------------
> FloatingPointError                        Traceback (most recent call last)
> <ipython-input-7-962da6705127> in <module>()
> ----> 1 np.int64(2**36) * np.int64(2**36)
>
> FloatingPointError: overflow encountered in long_scalars
>
>
>> >> Regardless of what type is used in the typing representation of a
>> schema, we've established that RowCoder.encode should accept anything
>> convertible to an int for integer fields. So it will need to check its
>> width and raise an error if it's too large.
>> >> I added some tests last week to ensure that RowCoder does this [1].
>> However they're currently skipped because I'm unsure of the proper place to
>> raise the error. I wrote up the details in a comment [2] (sorry I did a
>> force push so the comment doesn't show up in the appropriate place).
>> >>
>> >> Note that when decoding an INT32/64 field RowCoder still produces
>> plain old ints (since it relies on VarIntCoder), so int really is the
>> runtime type, and the numpy types are just for the typing representation of
>> a schema.
>> >>
>> >> I also updated my PR to accept int, float, and str in the typing
>> representation of a schema, and added the following summary of type
>> mappings to typehints.schema [1], since it's not readily apparent from the
>> code itself:
>> >
>> >
>> > Cool!
>> >
>> >>
>> >>
>> >> Python               Schema
>> >> np.int8     <----->  BYTE
>> >> np.int16    <----->  INT16
>> >> np.int32    <----->  INT32
>> >> np.int64    <----->  INT64
>> >> int             ---/
>> >> np.float32  <----->  FLOAT
>> >> np.float64  <----->  DOUBLE
>> >> float           ---/
>> >> bool        <----->  BOOLEAN
>> >>
>> >> The mappings for STRING and BYTES are different between python 2 and
>> python 3, because of the changes to str:
>> >>
>> >> py3:
>> >> str/unicode <----->  STRING
>> >> bytes       <----->  BYTES
>> >> ByteString      ---/
>> >>
>> >> py2:
>> >> unicode     <----->  STRING
>> >> str/bytes       ---/
>> >> ByteString  <----->  BYTES
>> >>
>> >> As you can see, int and float typings can now be used to create a
>> schema with an INT64 or DOUBLE attribute, but when creating an anonymous
>> NamedTuple sub-class from a schema, the numpy types are preferred. I prefer
>> that approach, if only for symmetry with the other integer and floating
>> point types, but I can change it to prefer int/float if I'm the only one
>> that feels that way.
>>
>> Just to be clear, this is just talking about the schema itself (as at
>> that level, due to the many-to-one mapping above, no distinction is
>> made between int vs. int64). The runtime types are still int/float,
>> right?
>>
>
> Correct, the runtime type is still int. I'm only using the numpy types in
> the typing representation of a schema, so that we have some way to
> distinguish between the different integer/float bit widths. I chose numpy
> types because they are actual types (i.e. type(np.int64) == type) and thus
> are compatible with typing, unlike pyarrow types.

This makes sense. Thank you for explaining.
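To make the mapping above concrete, here is a minimal sketch using only
typing and numpy (not the Beam API from the PR; the Transaction class and
its field names are invented for illustration) of a py3 NamedTuple schema
declaration that uses the numpy types for bit widths while the runtime
values stay plain ints:

    import typing
    import numpy as np

    class Transaction(typing.NamedTuple):
        user_id: np.int64    # would map to an INT64 schema field
        flags: np.int8       # would map to a BYTE schema field
        amount: np.float64   # would map to a DOUBLE schema field
        note: str            # would map to a STRING schema field (py3)

    # np.int64 works as a typing annotation because it is an ordinary
    # class, unlike pyarrow's type objects:
    assert isinstance(np.int64, type)

    # Nothing forces the runtime values to be numpy scalars; a row can be
    # built from plain ints/floats, which is also what decoding produces.
    row = Transaction(user_id=42, flags=1, amount=9.99, note="coffee")
    assert type(row.user_id) is int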
>
>
>> >> > Just an opinion: As a user I would expect anonymous types created for
>> me to have native python types. I do not have data on what would be the
>> expectations of users in general.
>>
>> I think if the schema declares a field to be int64, but the runtime
>> type of the values is actually int, that's probably OK.
>>
>
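For the encode-time check discussed earlier in the thread, a hypothetical
sketch (check_int64 is a made-up helper, not the actual RowCoder code) of
how an INT64 field could accept anything convertible to int while failing
loudly instead of silently wrapping:

    INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1

    def check_int64(value):
        # Accept anything convertible to int (plain int, np.int64, ...),
        # but refuse values that cannot be represented in 64 bits.
        as_int = int(value)
        if not INT64_MIN <= as_int <= INT64_MAX:
            raise OverflowError(
                "value %d out of range for an INT64 schema field" % as_int)
        return as_int

    check_int64(2**62)     # fine, fits in 64 bits
    # check_int64(2**72)   # would raise OverflowError instead of wrapping to 0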