On Mon, Aug 19, 2019 at 9:56 AM Brian Hulette <bhule...@google.com> wrote:
> > > On Fri, Aug 16, 2019 at 5:17 PM Chad Dombrova <chad...@gmail.com> wrote: > >> >> Agreed on float since it seems to trivially map to a double, but I’m >>> torn on int still. While I do want int type hints to work, it doesn’t seem >>> appropriate to map it to AtomicType.INT64, since it has a completely >>> different range of values. >>> >> >>> >> Let’s say we used native int for the runtime field type, not just as >>> a schema declaration for numpy.int64. What is the real world fallout from >>> this? Would there be data loss? >>> > >>> > I'm not sure I follow the question exactly, what is the interplay >>> between int and numpy.int64 in this scenario? Are you saying that np.int64 >>> is used in the schema declaration, but we just use native int at runtime, >>> and check the bit width when encoding? >>> > >>> > In any case, I don't think the real world fallout of using int is >>> nearly that dire. I suppose data loss is possible if a poorly designed >>> pipeline overflows an int64 and crashes, >>> >>> The primary risk is that it *won't* crash when overflowing an int64, >>> it'll just silently give the wrong answer. That's much less safe than >>> using a native int and then actually crashing in the case it's too >>> large at the point one tries to encode it. >>> >> >> If the behavior of numpy.int64 is less safe than int, and both support >> 64-bit integers, and int is the more intuitive type to use, then that seems >> to make a strong case for using int rather than numpy.int64. >> >> > I'm not sure we established numpy.int64 is less safe, just that a silent > overflow is a risk. By default numpy will just log a warning when an > overflow occurs, so it's not totally silent, but definitely risky. numpy > can however be made to throw an exception when an overflow occurs with > `np.seterr(over='raise')`. > > Regardless of what type is used in the typing representation of a schema, > we've established that RowCoder.encode should accept anything convertible > to an int for integer fields. So it will need to check it's width and raise > an error if it's too large. > I added some tests last week to ensure that RowCoder does this [1]. > However they're currently skipped because I'm unsure of the proper place to > raise the error. I wrote up the details in a comment [2] (sorry I did a > force push so the comment doesn't show up in the appropriate place). > > Note that when decoding an INT32/64 field RowCoder still produces plain > old ints (since it relies on VarIntCoder), so int really is the runtime > type, and the numpy types are just for the typing representation of a > schema. > > I also updated my PR to accept int, float, and str in the typing > representation of a schema, and added the following summary of type > mappings to typehints.schema [1], since it's not readily apparent from the > code itself: > Cool! > > Python Schema > np.int8 <-----> BYTE > np.int16 <-----> INT16 > np.int32 <-----> INT32 > np.int64 <-----> INT64 > int ---/ > np.float32 <-----> FLOAT > np.float64 <-----> DOUBLE > float ---/ > bool <-----> BOOLEAN > The mappings for STRING and BYTES are different between python 2 and > python 3, > because of the changes to str: > py3: > str/unicode <-----> STRING > bytes <-----> BYTES > ByteString ---/ > py2: > unicode <-----> STRING > str/bytes ---/ > ByteString <-----> BYTES > > As you can see, int and float typings can now be used to create a schema > with an INT64 or DOUBLE attribute, but when creating an anonymous > NamedTuple sub-class from a schema, the numpy types are preferred. I prefer > that approach, if only for symmetry with the other integer and floating > point types, but I can change it to prefer int/float if I'm the only one > that feels that way. > Just an opinion: As a user I would expect anonymous types created for me to have native python types. I do not have data on what would be the expectations of users in general. > > Brian > > [1] > https://github.com/apache/beam/pull/9188/files#diff-94d5ea3d2d121722c91b220a353490e2R88 > [2] https://github.com/apache/beam/pull/9188#discussion_r312682478 > [3] > https://github.com/apache/beam/blob/25dcc50a8de9c607069a8efc80a6da67a6e8b0ca/sdks/python/apache_beam/typehints/schemas.py#L20 > > >> -chad >> >>