On Wed, Aug 7, 2019 at 10:51 AM Brian Hulette <bhule...@google.com> wrote:
>> If it is not a big deal supporting both sounds good. I was actually referring to your comment about typing.Collection not being available on python 2.

> Oh, of course, sorry somehow that completely slipped my mind. Now that I'm actually thinking it through, you're right there are several things that could be cleaner if we make it a python 3 only feature:
> - We could use `typing.Collection`
> - No need to worry about 2/3 compatibility for strings, we could just use `str`
> - We could just use `bytes` for byte arrays (as a shorthand for `typing.ByteString` [1])
>
> If we do make it a python 3 only feature how would it be enforced? Just by documentation, or could I add a version check? I can't find any precedents for entire features that reject py2 currently.

+Valentyn Tymofieiev <valen...@google.com> might have a suggestion. We do not have a precedent for this. Documentation might be enough. Our intention was to drop py2 support very soon anyway (as discussed in some other thread on the mailing list).

>> What happens when one does np.int16(107) + 1, is the numpy type preserved?

> Yes, it looks like once you're in numpy land you stay there, unless you explicitly leave with a call to int(). ints in an arithmetic expression like np.int16(107) + 1 are implicitly converted to a reasonable type (np.int64 if bit_length() < 64, else np.float64).

>> Another concern I have is, if there is a user function or a library that the user does not control, that uses typing to indicate that a function accepts a type of int, would it be compatible with numpy types?

> That's a good point. I tested this out with mypy and it would not be compatible:

Maybe we can find a way to do the conversions automatically for the users.

> def square(x: int):
>     return x*x
>
> square(np.int16(32))  # mypy error
>
> Users would have to cast to int to make this work, as Chad pointed out. I was curious about the actual performance cost of these conversions, so I wrote up a script to benchmark them [2]. The results from running on python 2.7.16 on my desktop:
>
> pass: 6.117 ns/op
> int to int: 71.524 ns/op
> np.int8 to int: 89.784 ns/op
> int to np.int8: 89.784 ns/op
> np.int8 to np.int8: 89.784 ns/op
> np.int16 to int: 86.715 ns/op
> int to np.int16: 86.715 ns/op
> np.int16 to np.int16: 86.715 ns/op
> np.int32 to int: 89.172 ns/op
> int to np.int32: 89.172 ns/op
> np.int32 to np.int32: 89.172 ns/op
> np.int64 to int: 88.072 ns/op
> int to np.int64: 88.072 ns/op
> np.int64 to np.int64: 88.072 ns/op
>
> It's interesting to note you don't pay much of a penalty for converting to/from numpy types over just casting something that's already an int.
>
> [1] https://docs.python.org/3/library/typing.html#typing.ByteString
> [2] https://gist.github.com/TheNeuralBit/158dff3fa90dc46a369bb014e913650d
>
> On Fri, Aug 2, 2019 at 6:43 PM Ahmet Altay <al...@google.com> wrote:
>
>> To clarify, I am happy to start with the implementation and iterate on it. I do not want to block this late into the discussion.
>>
>> On Fri, Aug 2, 2019 at 6:03 PM Brian Hulette <bhule...@google.com> wrote:
>>
>>> I meant "or sub-class it and define fields with type annotations" not "with attributes". I believe that version doesn't work in python 2 since it doesn't support the `name: type` syntax.

>> If it is not a big deal supporting both sounds good. I was actually referring to your comment about typing.Collection not being available on python 2.
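To make the python-3-only points above concrete, here is a small illustrative sketch, not taken from the thread or the PR: a hard version check of the kind Brian asks about, plus a hypothetical `Event` schema that uses `typing.Collection`, `str`, and `bytes` directly. Whether any hard check is added at all is left open above; documentation alone may be enough, and this shows only the typing-level idea, not what the PR implements.

```
import sys
import typing

# Hypothetical hard check; as discussed above, documentation alone may be the
# preferred way to mark this as a python 3 only feature.
if sys.version_info[0] < 3:
    raise RuntimeError('NamedTuple-based schemas are a python 3 only feature.')


class Event(typing.NamedTuple):
    # With python 3 only, the native str and bytes types can be used directly,
    # and typing.Collection is available for array-like fields.
    name: str
    payload: bytes
    tags: typing.Collection[str]
```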
>>> On Fri, Aug 2, 2019 at 5:55 PM Brian Hulette <bhule...@google.com> wrote:
>>>
>>>> > Do we need to support python 2? If supporting python 2 will complicate things, we could make this a python3 only feature.
>>>>
>>>> I don't think supporting python 2 complicates things. It's just that there are two different ways to use typing.NamedTuple in python 3 - you can either instantiate it and provide a list of (name, type) pairs, or sub-class it and define fields with attributes. But in python 2 only the former works.
>>>>
>>>> > Why are we mapping to numpy types? Design document suggests mapping to python native types as the plan.
>>>>
>>>> We did discuss using numpy types in a comment [1], but you're right we never resolved it and the doc still lists native types. My biggest concern with just using native int/float types is I think we definitely need *some* way to distinguish between the schema proto's various int/float sizes in the python representation. If we don't we would need to either a) reject schemas that contain any size other than the one that we support, or b) no longer have a bijective mapping between proto and python (i.e. any integer type that passes through the Python SDK would get converted to an int64). And if we do need some way to distinguish between the integer types, I thought a de facto standard was better than creating our own - as Robert noted in that comment thread, "The only strong opinion I have is that we shouldn't invent our own."
>>>>
>>>> As I was experimenting with different approaches I also discovered the numpy numeric types are very nice because you can instantiate them and they look just like ints, for example `np.int16(107) == 107` evaluates to true even though `type(np.int16(107)) == type(107)` does not.
>>>>
>>>> Another concern with python's int type is that it supports unlimited precision [2], so it's really not a good type to use for any of the schema ints. My PR as it is right now actually doesn't even support int. I probably should at least make a change to accept int as a type specification for int64 but throw an error when encoding if an int is too big.

>> I agree with Robert's position of not inventing our own. I assume we could make a decision between python native types, arrow types, and numpy types.
>>
>> What happens when one does np.int16(107) + 1, is the numpy type preserved?
>>
>> Another concern I have is, if there is a user function or a library that the user does not control, that uses typing to indicate that a function accepts a type of int, would it be compatible with numpy types?
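Brian answers both of these questions earlier in the thread. A small sketch, assuming only numpy and a type checker such as mypy (nothing Beam-specific), that demonstrates the behavior he describes; the exact promoted integer width depends on the numpy version's casting rules:

```
import numpy as np

# Arithmetic with a python int keeps you "in numpy land": the result is still
# a numpy scalar (its exact width depends on numpy's promotion rules).
x = np.int16(107) + 1
assert isinstance(x, np.integer)

# numpy ints compare equal to plain ints, even though the types differ...
assert np.int16(107) == 107
assert type(np.int16(107)) is not type(107)

# ...so a function annotated as taking `int` works at runtime with a numpy
# int, but mypy flags the call unless you cast explicitly.
def square(n: int) -> int:
    return n * n

print(square(int(np.int16(32))))  # explicit int() cast: fine for mypy
print(square(np.int16(32)))       # runs, but mypy reports an argument type error
```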
>>>> > On Wed, Jul 31, 2019 at 2:51 PM Brian Hulette <bhule...@google.com> wrote:
>>>> >>
>>>> >> tl;dr: I have a PR at [1] that defines an initial Schema API in python based on the typing module, and uses typing.NamedTuple to represent a Schema. There are some risks with that approach but I propose we move forward with it as a first draft and iterate.
>>>> >>
>>>> >> I've opened up a PR [1] that implements RowCoder in the Python SDK and verifies its compatibility with the Java implementation via tests in standard_coders.yaml. A lot of miscellaneous changes are required to get to that point, including a pretty significant one: providing some native python representation for schemas.
>>>> >>
>>>> >> As discussed in the PR description I opted to fully embrace the typing module for the native representation of schema types:
>>>> >> - Primitive types all map to numpy types (e.g. np.int16, np.unicode).
>>>> >> - Arrays map to typing.List. In https://s.apache.org/beam-schemas we settled on typing.Collection, but unfortunately this doesn't seem to be supported in python 2; I'm open to other suggestions here.
>>>> >> - Map maps to typing.Mapping.
>>>> >> - Rows map to typing.NamedTuple.
>>>> >> - Nullability is indicated with typing.Optional. Note there's no distinction between Optional[Optional[T]] and Optional[T] in typing, both map to Union[T, None] - so this is actually a good analog for the nullable flag on FieldType in schema.proto.
>>>> >>
>>>> >> With this approach a schema in Python might look like:
>>>> >> ```
>>>> >> class Movie(NamedTuple):
>>>> >>   name: np.unicode
>>>> >>   year: Optional[np.int16]
>>>> >>
>>>> >> # The class/type annotation syntax doesn't work in Python 2. Instead you can use:
>>>> >> # Movie = NamedTuple('Movie', [('name', np.unicode), ('year', Optional[np.int16])])
>>>> >>
>>>> >> # DoFns annotated with_output_types(Movie) will use RowCoder
>>>> >> coders.registry.register_coder(Movie, coders.RowCoder)
>>>> >> ```
>>>> >>
>>>> >> I think the choice to use typing.NamedTuple as a row type is potentially controversial - Udi, Robert Bradshaw and I were already discussing it a bit in a comment on the portable schemas doc [2], but I wanted to bring that discussion to the ML.
>>>> >>
>>>> >> On the pro side:
>>>> >> + NamedTuple is a pretty great analog for Java's Row type [3]. Both store attributes internally as an ordered collection (List<Object> in Row, a tuple in NamedTuple) and provide shortcuts for accessing those attributes by field name based on the schema.
>>>> >> + NamedTuple is a native type, and we're trying to get out of the business of defining our own type hints (I think).
>>>> >>
>>>> >> On the con side:
>>>> >> - When using the class-based version of NamedTuple in python 3 a user might be tempted to add more functionality to their class (for example, define a method) rather than just defining a schema - but I'm not sure we're prepared to guarantee that we will always produce an instance of their class, just something that has the defined attributes. This concern can potentially be alleviated once we have support for logical types.
>>>> >>
>>>> >> Unless there are any objections I think it would make sense to start with this implementation (documenting the limitations), and then iterate on it. Please take a look at the PR [1] and let me know what you think about this proposal.
>>>> >>
>>>> >> Thanks,
>>>> >> Brian
>>>> >>
>>>> >> [1] https://github.com/apache/beam/pull/9188
>>>> >> [2] https://docs.google.com/a/google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?disco=AAAADSP8gx8
>>>> >> [3] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java
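To tie the quoted proposal together, here is a short, hypothetical end-to-end sketch that reuses the `Movie` example and the `register_coder` call from the email above. It assumes the RowCoder/typing support proposed in PR [1]; the `str` field type (fine on python 3, per the discussion earlier, where the quoted example uses np.unicode), the sample data, and the `_replace` transform are illustrative choices, not part of the PR.

```
import typing

import numpy as np

import apache_beam as beam
from apache_beam import coders


class Movie(typing.NamedTuple):
    # Field types follow the proposed mapping: a plain string field and a
    # nullable int16 field expressed as Optional[np.int16].
    name: str
    year: typing.Optional[np.int16]


# Registering the type against RowCoder is what opts it into schema encoding.
coders.registry.register_coder(Movie, coders.RowCoder)

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([Movie(name='The Matrix', year=np.int16(1999)),
                       Movie(name='Untitled', year=None)])
        # Transforms annotated with Movie as their output type use RowCoder.
        | beam.Map(lambda m: m._replace(name=m.name.upper()))
              .with_output_types(Movie))
```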