To clarify, I am happy to start with implementation and iterating on it. I do not want to block this late into the discussion.
On Fri, Aug 2, 2019 at 6:03 PM Brian Hulette <bhule...@google.com> wrote: > I meant "or sub-class it and define fields with type annotations" not > "with attributes". I believe that version doesn't work in python 2 since it > doesn't support the `name: type` syntax. > If it is not a big deal supporting both sounds good. I was actually referring to your comment about typing.Collection not being available on python 2. > > On Fri, Aug 2, 2019 at 5:55 PM Brian Hulette <bhule...@google.com> wrote: > >> > Do we need to support python 2? If supporting python 2 will complicate >> things, we could make this a python3 only feature. >> I don't think supporting python 2 complicates things. It's just that >> there are two different ways to use typing.NamedTuple in python 3 - you can >> either instantiate it and provide a list of (name, type) pairs, or >> sub-class it and define fields with attributes. But in python 2 only the >> former works. >> >> > Why are we mapping to numpy types? Design document suggests mapping to >> python native types as the plan. >> We did discuss using numpy types in a comment [1], but you're right we >> never resolved it and the doc still lists native types. My biggest concern >> with just using native int/float types is I think we definitely need *some* >> way to distinguish between the schema proto's various int/float sizes in >> the python representation. If we don't we would need to either a) reject >> schemas that contain any size other than the one that we support, or b) no >> longer have a bijective mapping between proto and python (i.e. any integer >> type that passes through the Python SDK would get converted to an int64). >> And if we do need some way to distinguish between the integer types, I >> thought a de facto standard was better than creating our own - as Robert >> noted in that comment thread "The only strong opinion I have is that we >> shouldn't invent our own." >> >> As I was experimenting with different approaches I also discovered the >> numpy numeric types are very nice because you can instantiate them and they >> look just like ints, for example `np.int16(107) == 107` evaluates to true >> even though `type(np.int16(107)) == type(107)` does not. >> >> Another concern with python's int type is that it supports unlimited >> precision [2], so it's really not a good type to use for any of the schema >> ints. My PR as it is right now actually doesn't even support int. I >> probably should at least make a change to accept int as a type >> specification for iint64 but throw an error when encoding if an int is too >> big. >> > I agree with Robert's position of not inventing our own. I assume we could make a decision between python native types, arrow types, and numpy types. What happens when one does np.int16(107) + 1, is the numpy type preserved? Another concern I have is, if there is a user function or a library that user does not control, that uses typing to indicate that a function accepts a type of int, would it be compatible with numpy types? > >> [1] >> https://docs.google.com/a/google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?disco=AAAACtLItNA >> [2] https://docs.python.org/3/library/stdtypes.html#typesnumeric >> >> On Fri, Aug 2, 2019 at 4:12 PM Ahmet Altay <al...@google.com> wrote: >> > >> > Thank you Brian. >> > >> > I did not spend enough time yet to review. Some early questions, I >> apologize if I missed an earlier discussion. >> > - Do we need to support python 2? If supporting python 2 will >> complicate things, we could make this a python3 only feature. >> > - Why are we mapping to numpy types? Design document suggests mapping >> to python native types as the plan. >> > >> > On Wed, Jul 31, 2019 at 2:51 PM Brian Hulette <bhule...@google.com> >> wrote: >> >> >> >> tl;dr: I have a PR at [1] that defines an initial Schema API in python >> based on the typing module, and uses typing.NamedTuple to represent a >> Schema. There are some risks with that approach but I propose we move >> forward with it as a first draft and iterate. >> >> >> >> >> >> I've opened up a PR [1] that implements RowCoder in the Python SDK and >> verifies it's compatibility with the Java implementation via tests in >> standard_coders.yaml. A lot of miscellaneous changes are required to get >> that point, including a pretty significant one: providing some native >> python representation for schemas. >> >> >> >> As discussed in the PR description I opted to fully embrace the typing >> module for the native representation of schema types: >> >> - Primitive types all map to numpy types (e.g. np.int16, np.unicode). >> >> - Arrays map to typing.List. In https://s.apache.org/beam-schemas we >> settled on typing.Collection, but unfortunately this doesn't seem to be >> supported in python 2, I'm open to other suggestions here. >> >> - Map maps to typing.Mapping. >> >> - Rows map to typing.NamedTuple. >> >> - nullability is indicated with typing.Optional. Note there's no >> distinction between Optional[Optional[T]] and Optional[T] in typing, both >> map to Union[T, None] - so this is actually a good analog for the nullable >> flag on FieldType in schema.proto. >> >> >> >> With this approach a schema in Python might look like: >> >> ``` >> >> class Movie(NamedTuple): >> >> name: np.unicode >> >> year: Optional[np.int16] >> >> >> >> # The class/type annotation syntax doesn't work in Python 2. Instead >> you can use: >> >> # Movie = NamedTuple('Movie', [('name', np.unicode), ('year', >> Optional[np.int16])] >> >> >> >> # DoFns annotated with_output_types(Movie) will use RowCoder >> >> coders.registry.register_coder(Movie, coders.RowCoder) >> >> ``` >> >> >> >> I think the choice to use typing.NamedTuple as a row type is >> potentially controversial - Udi, Robert Bradshaw and I were already >> discussing it a bit in a comment on the portable schemas doc [2], but I >> wanted to bring that discussion to the ML. >> >> >> >> On the pro side: >> >> + NamedTuple is a pretty great analog for Java's Row type [3]. Both >> store attributes internally as an ordered collection (List<Object> in Row, >> a tuple in NamedTuple) and provide shortcuts for accessing those attributes >> by field name based on the schema. >> >> + NamedTuple is a native type, and we're trying to get out of the >> business of defining our own type hints (I think). >> >> >> >> On the con side: >> >> - When using the class-based version of NamedTuple in python 3 a user >> might be tempted to add more functionality to their class (for example, >> define a method) rather than just defining a schema - but I'm not sure >> we're prepared to guarantee that we will always produce an instance of >> their class, just something that has the defined attributes. This concern >> can potentially be alleviated once we have support for logical types. >> >> >> >> Unless there are any objections I think it would make sense to start >> with this implementation (documenting the limitations), and then iterate on >> it. Please take a look at the PR [1] and let me know what you think about >> this proposal. >> >> >> >> Thanks, >> >> Brian >> >> >> >> [1] https://github.com/apache/beam/pull/9188 >> >> [2] >> https://docs.google.com/a/google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?disco=AAAADSP8gx8 >> >> [3] >> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java >> >