Re: [PROPOSAL] An initial Schema API in Python

Brian Hulette Fri, 02 Aug 2019 18:03:03 -0700

I meant "or sub-class it and define fields with type annotations" not "with
attributes". I believe that version doesn't work in python 2 since it
doesn't support the `name: type` syntax.


On Fri, Aug 2, 2019 at 5:55 PM Brian Hulette <[email protected]> wrote:

> > Do we need to support python 2? If supporting python 2 will complicate
> things, we could make this a python3 only feature.
> I don't think supporting python 2 complicates things. It's just that there
> are two different ways to use typing.NamedTuple in python 3 - you can
> either instantiate it and provide a list of (name, type) pairs, or
> sub-class it and define fields with attributes. But in python 2 only the
> former works.
>
> > Why are we mapping to numpy types? Design document suggests mapping to
> python native types as the plan.
> We did discuss using numpy types in a comment [1], but you're right we
> never resolved it and the doc still lists native types. My biggest concern
> with just using native int/float types is I think we definitely need *some*
> way to distinguish between the schema proto's various int/float sizes in
> the python representation. If we don't we would need to either a) reject
> schemas that contain any size other than the one that we support, or b) no
> longer have a bijective mapping between proto and python (i.e. any integer
> type that passes through the Python SDK would get converted to an int64).
> And if we do need some way to distinguish between the integer types, I
> thought a de facto standard was better than creating our own - as Robert
> noted in that comment thread "The only strong opinion I have is that we
> shouldn't invent our own."
>
> As I was experimenting with different approaches I also discovered the
> numpy numeric types are very nice because you can instantiate them and they
> look just like ints, for example `np.int16(107) == 107` evaluates to true
> even though `type(np.int16(107)) == type(107)` does not.
>
> Another concern with python's int type is that it supports unlimited
> precision [2], so it's really not a good type to use for any of the schema
> ints. My PR as it is right now actually doesn't even support int. I
> probably should at least make a change to accept int as a type
> specification for iint64 but throw an error when encoding if an int is too
> big.
>
> [1]
> https://docs.google.com/a/google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?disco=AAAACtLItNA
> [2] https://docs.python.org/3/library/stdtypes.html#typesnumeric
>
> On Fri, Aug 2, 2019 at 4:12 PM Ahmet Altay <[email protected]> wrote:
> >
> > Thank you Brian.
> >
> > I did not spend enough time yet to review. Some early questions, I
> apologize if I missed an earlier discussion.
> > - Do we need to support python 2? If supporting python 2 will complicate
> things, we could make this a python3 only feature.
> > - Why are we mapping to numpy types? Design document suggests mapping to
> python native types as the plan.
> >
> > On Wed, Jul 31, 2019 at 2:51 PM Brian Hulette <[email protected]>
> wrote:
> >>
> >> tl;dr: I have a PR at [1] that defines an initial Schema API in python
> based on the typing module, and uses typing.NamedTuple to represent a
> Schema. There are some risks with that approach but I propose we move
> forward with it as a first draft and iterate.
> >>
> >>
> >> I've opened up a PR [1] that implements RowCoder in the Python SDK and
> verifies it's compatibility with the Java implementation via tests in
> standard_coders.yaml. A lot of miscellaneous changes are required to get
> that point, including a pretty significant one: providing some native
> python representation for schemas.
> >>
> >> As discussed in the PR description I opted to fully embrace the typing
> module for the native representation of schema types:
> >> - Primitive types all map to numpy types (e.g. np.int16, np.unicode).
> >> - Arrays map to typing.List. In https://s.apache.org/beam-schemas we
> settled on typing.Collection, but unfortunately this doesn't seem to be
> supported in python 2, I'm open to other suggestions here.
> >> - Map maps to typing.Mapping.
> >> - Rows map to typing.NamedTuple.
> >> - nullability is indicated with typing.Optional. Note there's no
> distinction between Optional[Optional[T]] and Optional[T] in typing, both
> map to Union[T, None] - so this is actually a good analog for the nullable
> flag on FieldType in schema.proto.
> >>
> >> With this approach a schema in Python might look like:
> >> ```
> >> class Movie(NamedTuple):
> >>   name: np.unicode
> >>   year: Optional[np.int16]
> >>
> >> # The class/type annotation syntax doesn't work in Python 2. Instead
> you can use:
> >> # Movie = NamedTuple('Movie', [('name', np.unicode), ('year',
> Optional[np.int16])]
> >>
> >> # DoFns annotated with_output_types(Movie) will use RowCoder
> >> coders.registry.register_coder(Movie, coders.RowCoder)
> >> ```
> >>
> >> I think the choice to use typing.NamedTuple as a row type is
> potentially controversial - Udi, Robert Bradshaw and I were already
> discussing it a bit in a comment on the portable schemas doc [2], but I
> wanted to bring that discussion to the ML.
> >>
> >> On the pro side:
> >> + NamedTuple is a pretty great analog for Java's Row type [3]. Both
> store attributes internally as an ordered collection (List<Object> in Row,
> a tuple in NamedTuple) and provide shortcuts for accessing those attributes
> by field name based on the schema.
> >> +  NamedTuple is a native type, and we're trying to get out of the
> business of defining our own type hints (I think).
> >>
> >> On the con side:
> >> - When using the class-based version of NamedTuple in python 3 a user
> might be tempted to add more functionality to their class (for example,
> define a method) rather than just defining a schema - but I'm not sure
> we're prepared to guarantee that we will always produce an instance of
> their class, just something that has the defined attributes. This concern
> can potentially be alleviated once we have support for logical types.
> >>
> >> Unless there are any objections I think it would make sense to start
> with this implementation (documenting the limitations), and then iterate on
> it. Please take a look at the PR [1] and let me know what you think about
> this proposal.
> >>
> >> Thanks,
> >> Brian
> >>
> >> [1] https://github.com/apache/beam/pull/9188
> >> [2]
> https://docs.google.com/a/google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?disco=AAAADSP8gx8
> >> [3]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java
>

Re: [PROPOSAL] An initial Schema API in Python

Reply via email to