Thank you Brian. I have not yet spent enough time reviewing this. Some early questions; I apologize if I missed an earlier discussion.

- Do we need to support Python 2? If supporting Python 2 will complicate things, we could make this a Python 3-only feature.
- Why are we mapping to numpy types? The design document suggests mapping to Python native types as the plan.

On Wed, Jul 31, 2019 at 2:51 PM Brian Hulette <bhule...@google.com> wrote:

> tl;dr: I have a PR at [1] that defines an initial Schema API in Python
> based on the typing module, and uses typing.NamedTuple to represent a
> Schema. There are some risks with that approach, but I propose we move
> forward with it as a first draft and iterate.
>
>
> I've opened up a PR [1] that implements RowCoder in the Python SDK and
> verifies its compatibility with the Java implementation via tests in
> standard_coders.yaml. A lot of miscellaneous changes are required to get
> to that point, including a pretty significant one: providing some native
> Python representation for schemas.
>
> As discussed in the PR description, I opted to fully embrace the typing
> module for the native representation of schema types:
> - Primitive types all map to numpy types (e.g. np.int16, np.unicode).
> - Arrays map to typing.List. In https://s.apache.org/beam-schemas we
>   settled on typing.Collection, but unfortunately this doesn't seem to be
>   supported in Python 2. I'm open to other suggestions here.
> - Maps map to typing.Mapping.
> - Rows map to typing.NamedTuple.
> - Nullability is indicated with typing.Optional. Note there's no
>   distinction between Optional[Optional[T]] and Optional[T] in typing; both
>   map to Union[T, None], so this is actually a good analog for the nullable
>   flag on FieldType in schema.proto.
>
> With this approach a schema in Python might look like:
> ```
> from typing import NamedTuple, Optional
> import numpy as np
> from apache_beam import coders
>
> class Movie(NamedTuple):
>   name: np.unicode
>   year: Optional[np.int16]
>
> # The class/type annotation syntax doesn't work in Python 2. Instead you
> # can use:
> # Movie = NamedTuple('Movie', [('name', np.unicode),
> #                              ('year', Optional[np.int16])])
>
> # DoFns annotated with_output_types(Movie) will use RowCoder
> coders.registry.register_coder(Movie, coders.RowCoder)
> ```
>
> I think the choice to use typing.NamedTuple as a row type is potentially
> controversial. Udi, Robert Bradshaw and I were already discussing it a bit
> in a comment on the portable schemas doc [2], but I wanted to bring that
> discussion to the ML.
>
> On the pro side:
> + NamedTuple is a pretty great analog for Java's Row type [3]. Both store
>   attributes internally as an ordered collection (List<Object> in Row, a
>   tuple in NamedTuple) and provide shortcuts for accessing those attributes
>   by field name based on the schema.
> + NamedTuple is a native type, and we're trying to get out of the
>   business of defining our own type hints (I think).
>
> On the con side:
> - When using the class-based version of NamedTuple in Python 3, a user
>   might be tempted to add more functionality to their class (for example,
>   define a method) rather than just defining a schema, but I'm not sure
>   we're prepared to guarantee that we will always produce an instance of
>   their class, just something that has the defined attributes. This concern
>   can potentially be alleviated once we have support for logical types.
>
> Unless there are any objections, I think it would make sense to start with
> this implementation (documenting the limitations), and then iterate on it.
> Please take a look at the PR [1] and let me know what you think about this
> proposal.
>
> Thanks,
> Brian
>
> [1] https://github.com/apache/beam/pull/9188
> [2] https://docs.google.com/a/google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?disco=AAAADSP8gx8
> [3] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java
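
A minimal sketch of the two typing behaviours the quoted proposal relies on: nested Optionals collapsing to a single nullable flag, and NamedTuple fields being reachable both by position and by name. It uses only the standard library. The native str/int field types stand in for the numpy types in the proposal, and is_nullable is a hypothetical helper, not something taken from the PR.

```python
from typing import NamedTuple, Optional, Union, get_type_hints


class Movie(NamedTuple):
    # Assumption: native str/int stand in for np.unicode / np.int16.
    name: str
    year: Optional[int]


# Nested Optionals collapse, so nullability is one flag per field.
assert Optional[Optional[int]] == Optional[int] == Union[int, None]

# Fields are stored as an ordered tuple and are also reachable by name.
movie = Movie(name="Alien", year=None)
assert movie[0] == movie.name == "Alien"
assert Movie._fields == ("name", "year")


def is_nullable(field_type) -> bool:
    """Hypothetical helper: True if the annotation is Optional[...]."""
    return type(None) in getattr(field_type, "__args__", ())


# Derive a per-field nullable flag from the class annotations.
hints = get_type_hints(Movie)
nullable = {name: is_nullable(t) for name, t in hints.items()}
```

Here nullable marks only the year field as nullable, which is the information a schema's per-field nullable flag would need to carry.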