On Tue, Aug 20, 2019 at 3:48 PM Brian Hulette <bhule...@google.com> wrote:
>
>
> On Tue, Aug 20, 2019 at 1:41 PM Robert Bradshaw <rober...@google.com>
> wrote:
>
>> On Mon, Aug 19, 2019 at 5:44 PM Ahmet Altay <al...@google.com> wrote:
>> >
>> >
>> >
>> > On Mon, Aug 19, 2019 at 9:56 AM Brian Hulette <bhule...@google.com>
>> wrote:
>> >>
>> >>
>> >> On Fri, Aug 16, 2019 at 5:17 PM Chad Dombrova <chad...@gmail.com>
>> wrote:
>> >>>>
>> >>>> >> Agreed on float since it seems to trivially map to a double, but
>> I'm torn on int still. While I do want int type hints to work, it doesn't
>> seem appropriate to map it to AtomicType.INT64, since it has a completely
>> different range of values.
>> >>>> >>
>> >>>> >> Let's say we used native int for the runtime field type, not just
>> as a schema declaration for numpy.int64. What is the real world fallout
>> from this? Would there be data loss?
>> >>>> >
>> >>>> > I'm not sure I follow the question exactly, what is the interplay
>> between int and numpy.int64 in this scenario? Are you saying that np.int64
>> is used in the schema declaration, but we just use native int at runtime,
>> and check the bit width when encoding?
>> >>>> >
>> >>>> > In any case, I don't think the real world fallout of using int is
>> nearly that dire. I suppose data loss is possible if a poorly designed
>> pipeline overflows an int64 and crashes,
>> >>>>
>> >>>> The primary risk is that it *won't* crash when overflowing an int64,
>> >>>> it'll just silently give the wrong answer. That's much less safe than
>> >>>> using a native int and then actually crashing in the case it's too
>> >>>> large at the point one tries to encode it.
>> >>>
>> >>>
>> >>> If the behavior of numpy.int64 is less safe than int, and both
>> support 64-bit integers, and int is the more intuitive type to use, then
>> that seems to make a strong case for using int rather than numpy.int64.
>> >>>
>> >>
>> >> I'm not sure we established numpy.int64 is less safe, just that a
>> silent overflow is a risk.
>>
>> Silent overflows are inherently less safe, especially for a language
>> where users in general never have to deal with this.
>>
>
> Absolutely agree that silent overflows are unsafe! I was just trying to
> point out that numpy isn't strictly silent. But as you point out below it's
> irrelevant because the runtime type is still int.
>
>
>> >> By default numpy will just log a warning when an overflow occurs, so
>> it's not totally silent, but definitely risky. numpy can however be made to
>> throw an exception when an overflow occurs with `np.seterr(over='raise')`.
>>
>> Warning logs on remote machines are unlikely to ever be seen. Even if
>> one knew about the numpy setting (keep in mind the user may not ever
>> directly use or import numpy), it doesn't seem to work (and one would
>> have to set it on the remote workers, or propagate this setting if set
>> in the main program).
>>
>> In [1]: import numpy as np
>> In [2]: np.seterr(over='raise')  # returns previous value
>> Out[2]: {'divide': 'warn', 'invalid': 'warn', 'over': 'warn', 'under':
>> 'ignore'}
>> In [3]: np.int64(2**36) * np.int64(2**36)
>> Out[3]: 0
>>
>>
> That's odd...
> I ran the same test (Python 2.7, numpy 1.16) and it worked
> for me:
>
> In [4]: import numpy as np
>
> In [5]: np.int64(2**36) * np.int64(2**36)
> /usr/local/google/home/bhulette/working_dir/beam/sdks/python/venv/bin/ipython:1:
> RuntimeWarning: overflow encountered in long_scalars
>   #!/usr/local/google/home/bhulette/working_dir/beam/sdks/python/venv/bin/python
> Out[5]: 0
>
> In [6]: np.seterr(over='raise')
> Out[6]: {'divide': 'warn', 'invalid': 'warn', 'over': 'warn', 'under':
> 'ignore'}
>
> In [7]: np.int64(2**36) * np.int64(2**36)
> ---------------------------------------------------------------------------
> FloatingPointError                        Traceback (most recent call last)
> <ipython-input-7-962da6705127> in <module>()
> ----> 1 np.int64(2**36) * np.int64(2**36)
>
> FloatingPointError: overflow encountered in long_scalars
>
>
>> >> Regardless of what type is used in the typing representation of a
>> schema, we've established that RowCoder.encode should accept anything
>> convertible to an int for integer fields. So it will need to check its
>> width and raise an error if it's too large.
>> >> I added some tests last week to ensure that RowCoder does this [1].
>> However they're currently skipped because I'm unsure of the proper place to
>> raise the error. I wrote up the details in a comment [2] (sorry I did a
>> force push so the comment doesn't show up in the appropriate place).
>> >>
>> >> Note that when decoding an INT32/64 field RowCoder still produces
>> plain old ints (since it relies on VarIntCoder), so int really is the
>> runtime type, and the numpy types are just for the typing representation of
>> a schema.
>> >>
>> >> I also updated my PR to accept int, float, and str in the typing
>> representation of a schema, and added the following summary of type
>> mappings to typehints.schema [1], since it's not readily apparent from the
>> code itself:
>> >
>> >
>> > Cool!
>> >
>> >>
>> >>
>> >> Python               Schema
>> >> np.int8     <----->  BYTE
>> >> np.int16    <----->  INT16
>> >> np.int32    <----->  INT32
>> >> np.int64    <----->  INT64
>> >> int             ---/
>> >> np.float32  <----->  FLOAT
>> >> np.float64  <----->  DOUBLE
>> >> float           ---/
>> >> bool        <----->  BOOLEAN
>> >>
>> >> The mappings for STRING and BYTES are different between python 2 and
>> python 3, because of the changes to str:
>> >>
>> >> py3:
>> >> str/unicode <----->  STRING
>> >> bytes       <----->  BYTES
>> >> ByteString      ---/
>> >>
>> >> py2:
>> >> unicode     <----->  STRING
>> >> str/bytes       ---/
>> >> ByteString  <----->  BYTES
>> >>
>> >> As you can see, int and float typings can now be used to create a
>> schema with an INT64 or DOUBLE attribute, but when creating an anonymous
>> NamedTuple sub-class from a schema, the numpy types are preferred. I prefer
>> that approach, if only for symmetry with the other integer and floating
>> point types, but I can change it to prefer int/float if I'm the only one
>> that feels that way.
>>
>> Just to be clear, this is just talking about the schema itself (as at
>> that level, due to the many-to-one mapping above, no distinction is
>> made between int vs. int64). The runtime types are still int/float,
>> right?
>>
>
> Correct, the runtime type is still int. I'm only using the numpy types in
> the typing representation of a schema, so that we have some way to
> distinguish between the different integer/float bit widths. I chose numpy
> types because they are actual types (i.e. type(np.int64) == type) and thus
> are compatible with typing, unlike pyarrow types.

This makes sense. Thank you for explaining.
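To make the mapping above concrete, here is a minimal sketch using only
typing and numpy (not the Beam API from the PR; the Transaction class and
its field names are invented for illustration) of a py3 NamedTuple schema
declaration that uses the numpy types for bit widths while the runtime
values stay plain ints:

    import typing
    import numpy as np

    class Transaction(typing.NamedTuple):
        user_id: np.int64    # would map to an INT64 schema field
        flags: np.int8       # would map to a BYTE schema field
        amount: np.float64   # would map to a DOUBLE schema field
        note: str            # would map to a STRING schema field (py3)

    # np.int64 works as a typing annotation because it is an ordinary
    # class, unlike pyarrow's type objects:
    assert isinstance(np.int64, type)

    # Nothing forces the runtime values to be numpy scalars; a row can be
    # built from plain ints/floats, which is also what decoding produces.
    row = Transaction(user_id=42, flags=1, amount=9.99, note="coffee")
    assert type(row.user_id) is int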
>
>
>> >> > Just an opinion: As a user I would expect anonymous types created for
>> me to have native python types. I do not have data on what would be the
>> expectations of users in general.
>>
>> I think if the schema declares a field to be int64, but the runtime
>> type of the values is actually int, that's probably OK.
>>
>
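For the encode-time check discussed earlier in the thread, a hypothetical
sketch (check_int64 is a made-up helper, not the actual RowCoder code) of
how an INT64 field could accept anything convertible to int while failing
loudly instead of silently wrapping:

    INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1

    def check_int64(value):
        # Accept anything convertible to int (plain int, np.int64, ...),
        # but refuse values that cannot be represented in 64 bits.
        as_int = int(value)
        if not INT64_MIN <= as_int <= INT64_MAX:
            raise OverflowError(
                "value %d out of range for an INT64 schema field" % as_int)
        return as_int

    check_int64(2**62)     # fine, fits in 64 bits
    # check_int64(2**72)   # would raise OverflowError instead of wrapping to 0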