> Do we need to support python 2? If supporting python 2 will complicate things, we could make this a python3 only feature. I don't think supporting python 2 complicates things. It's just that there are two different ways to use typing.NamedTuple in python 3 - you can either instantiate it and provide a list of (name, type) pairs, or sub-class it and define fields with attributes. But in python 2 only the former works.
> Why are we mapping to numpy types? Design document suggests mapping to python native types as the plan. We did discuss using numpy types in a comment [1], but you're right we never resolved it and the doc still lists native types. My biggest concern with just using native int/float types is I think we definitely need *some* way to distinguish between the schema proto's various int/float sizes in the python representation. If we don't we would need to either a) reject schemas that contain any size other than the one that we support, or b) no longer have a bijective mapping between proto and python (i.e. any integer type that passes through the Python SDK would get converted to an int64). And if we do need some way to distinguish between the integer types, I thought a de facto standard was better than creating our own - as Robert noted in that comment thread "The only strong opinion I have is that we shouldn't invent our own." As I was experimenting with different approaches I also discovered the numpy numeric types are very nice because you can instantiate them and they look just like ints, for example `np.int16(107) == 107` evaluates to true even though `type(np.int16(107)) == type(107)` does not. Another concern with python's int type is that it supports unlimited precision [2], so it's really not a good type to use for any of the schema ints. My PR as it is right now actually doesn't even support int. I probably should at least make a change to accept int as a type specification for iint64 but throw an error when encoding if an int is too big. [1] https://docs.google.com/a/google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?disco=AAAACtLItNA [2] https://docs.python.org/3/library/stdtypes.html#typesnumeric On Fri, Aug 2, 2019 at 4:12 PM Ahmet Altay <[email protected]> wrote: > > Thank you Brian. > > I did not spend enough time yet to review. Some early questions, I apologize if I missed an earlier discussion. > - Do we need to support python 2? If supporting python 2 will complicate things, we could make this a python3 only feature. > - Why are we mapping to numpy types? Design document suggests mapping to python native types as the plan. > > On Wed, Jul 31, 2019 at 2:51 PM Brian Hulette <[email protected]> wrote: >> >> tl;dr: I have a PR at [1] that defines an initial Schema API in python based on the typing module, and uses typing.NamedTuple to represent a Schema. There are some risks with that approach but I propose we move forward with it as a first draft and iterate. >> >> >> I've opened up a PR [1] that implements RowCoder in the Python SDK and verifies it's compatibility with the Java implementation via tests in standard_coders.yaml. A lot of miscellaneous changes are required to get that point, including a pretty significant one: providing some native python representation for schemas. >> >> As discussed in the PR description I opted to fully embrace the typing module for the native representation of schema types: >> - Primitive types all map to numpy types (e.g. np.int16, np.unicode). >> - Arrays map to typing.List. In https://s.apache.org/beam-schemas we settled on typing.Collection, but unfortunately this doesn't seem to be supported in python 2, I'm open to other suggestions here. >> - Map maps to typing.Mapping. >> - Rows map to typing.NamedTuple. >> - nullability is indicated with typing.Optional. Note there's no distinction between Optional[Optional[T]] and Optional[T] in typing, both map to Union[T, None] - so this is actually a good analog for the nullable flag on FieldType in schema.proto. >> >> With this approach a schema in Python might look like: >> ``` >> class Movie(NamedTuple): >> name: np.unicode >> year: Optional[np.int16] >> >> # The class/type annotation syntax doesn't work in Python 2. Instead you can use: >> # Movie = NamedTuple('Movie', [('name', np.unicode), ('year', Optional[np.int16])] >> >> # DoFns annotated with_output_types(Movie) will use RowCoder >> coders.registry.register_coder(Movie, coders.RowCoder) >> ``` >> >> I think the choice to use typing.NamedTuple as a row type is potentially controversial - Udi, Robert Bradshaw and I were already discussing it a bit in a comment on the portable schemas doc [2], but I wanted to bring that discussion to the ML. >> >> On the pro side: >> + NamedTuple is a pretty great analog for Java's Row type [3]. Both store attributes internally as an ordered collection (List<Object> in Row, a tuple in NamedTuple) and provide shortcuts for accessing those attributes by field name based on the schema. >> + NamedTuple is a native type, and we're trying to get out of the business of defining our own type hints (I think). >> >> On the con side: >> - When using the class-based version of NamedTuple in python 3 a user might be tempted to add more functionality to their class (for example, define a method) rather than just defining a schema - but I'm not sure we're prepared to guarantee that we will always produce an instance of their class, just something that has the defined attributes. This concern can potentially be alleviated once we have support for logical types. >> >> Unless there are any objections I think it would make sense to start with this implementation (documenting the limitations), and then iterate on it. Please take a look at the PR [1] and let me know what you think about this proposal. >> >> Thanks, >> Brian >> >> [1] https://github.com/apache/beam/pull/9188 >> [2] https://docs.google.com/a/google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?disco=AAAADSP8gx8 >> [3] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java
