Re: [PROPOSAL] An initial Schema API in Python

Brian Hulette Fri, 02 Aug 2019 17:56:22 -0700

> Do we need to support python 2? If supporting python 2 will complicate
things, we could make this a python3 only feature.
I don't think supporting python 2 complicates things. It's just that there
are two different ways to use typing.NamedTuple in python 3 - you can
either instantiate it and provide a list of (name, type) pairs, or
sub-class it and define fields with attributes. But in python 2 only the
former works.
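
A minimal sketch of the two forms, using native str/int as stand-ins for
the schema field types (the Movie fields are borrowed from the example in
the original proposal; MovieFunctional is just an illustrative name):

```python
from typing import NamedTuple, Optional

# Python 3 only: sub-class NamedTuple and declare fields as annotations.
class Movie(NamedTuple):
    name: str
    year: Optional[int]

# The form that, per the above, also works in Python 2:
# instantiate NamedTuple with a list of (name, type) pairs.
MovieFunctional = NamedTuple(
    'MovieFunctional', [('name', str), ('year', Optional[int])])

print(Movie._fields == MovieFunctional._fields)  # True
```

Both spellings produce a tuple subclass with the same fields, so code that
only reads attributes by name should not care which one was used.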


> Why are we mapping to numpy types? Design document suggests mapping to
python native types as the plan.
We did discuss using numpy types in a comment [1], but you're right we
never resolved it and the doc still lists native types. My biggest concern
with just using native int/float types is I think we definitely need *some*
way to distinguish between the schema proto's various int/float sizes in
the python representation. If we don't, we would need to either a) reject
schemas that contain any size other than the one that we support, or b)
give up the bijective mapping between proto and python (i.e. any integer
type that passes through the Python SDK would get converted to an int64).
And if we do need some way to distinguish between the integer types, I
thought a de facto standard was better than creating our own - as Robert
noted in that comment thread "The only strong opinion I have is that we
shouldn't invent our own."
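
To illustrate option b), here is a rough sketch of why collapsing every
integer size onto native int breaks the round trip. The string labels
INT16/INT32/INT64 and both dictionaries are assumptions made up for this
example, not code from the PR:

```python
# Illustrative labels for the schema proto's integer sizes.
PROTO_INT_TYPES = ["INT16", "INT32", "INT64"]

# Every proto integer size maps to the single native int type...
proto_to_native = {name: int for name in PROTO_INT_TYPES}
# ...so the reverse direction can only pick one size for int.
native_to_proto = {int: "INT64"}

def round_trip(proto_type):
    """Map a proto type to Python and back again."""
    return native_to_proto[proto_to_native[proto_type]]

# Sizes that do not survive the round trip unchanged.
lossy = [t for t in PROTO_INT_TYPES if round_trip(t) != t]
print(lossy)  # ['INT16', 'INT32']
```

With one distinct Python type per size (as with the numpy types), the
reverse dictionary would have one entry per size and nothing would be
lost.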

As I was experimenting with different approaches I also discovered the
numpy numeric types are very nice because you can instantiate them and they
look just like ints, for example `np.int16(107) == 107` evaluates to true
even though `type(np.int16(107)) == type(107)` does not.

Another concern with python's int type is that it supports unlimited
precision [2], so it's really not a good type to use for any of the schema
ints. My PR as it is right now actually doesn't even support int. I
probably should at least make a change to accept int as a type
specification for int64 but throw an error when encoding if an int is too
big.
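
Something along these lines is what I have in mind for the range check.
This is only a sketch: check_int64 is a hypothetical helper name, and
raising OverflowError is an assumption about which error would be
appropriate, not what the PR does today.

```python
INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1

def check_int64(value):
    """Return value unchanged if it fits in a signed 64-bit integer."""
    if not INT64_MIN <= value <= INT64_MAX:
        raise OverflowError(
            "%d does not fit in an int64 schema field" % value)
    return value

check_int64(107)          # fine, fits in 64 bits
try:
    check_int64(2 ** 63)  # one past the largest int64
except OverflowError as e:
    print("rejected:", e)
```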

[1]
https://docs.google.com/a/google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?disco=AAAACtLItNA
[2] https://docs.python.org/3/library/stdtypes.html#typesnumeric

On Fri, Aug 2, 2019 at 4:12 PM Ahmet Altay <[email protected]> wrote:
>
> Thank you Brian.
>
> I did not spend enough time yet to review. Some early questions, I
apologize if I missed an earlier discussion.
> - Do we need to support python 2? If supporting python 2 will complicate
things, we could make this a python3 only feature.
> - Why are we mapping to numpy types? Design document suggests mapping to
python native types as the plan.
>
> On Wed, Jul 31, 2019 at 2:51 PM Brian Hulette <[email protected]> wrote:
>>
>> tl;dr: I have a PR at [1] that defines an initial Schema API in python
based on the typing module, and uses typing.NamedTuple to represent a
Schema. There are some risks with that approach but I propose we move
forward with it as a first draft and iterate.
>>
>>
>> I've opened up a PR [1] that implements RowCoder in the Python SDK and
verifies its compatibility with the Java implementation via tests in
standard_coders.yaml. A lot of miscellaneous changes are required to get
to that point, including a pretty significant one: providing some native
python representation for schemas.
>>
>> As discussed in the PR description I opted to fully embrace the typing
module for the native representation of schema types:
>> - Primitive types all map to numpy types (e.g. np.int16, np.unicode).
>> - Arrays map to typing.List. In https://s.apache.org/beam-schemas we
settled on typing.Collection, but unfortunately this doesn't seem to be
supported in python 2. I'm open to other suggestions here.
>> - Map maps to typing.Mapping.
>> - Rows map to typing.NamedTuple.
>> - nullability is indicated with typing.Optional. Note there's no
distinction between Optional[Optional[T]] and Optional[T] in typing, both
map to Union[T, None] - so this is actually a good analog for the nullable
flag on FieldType in schema.proto.
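
The flattening of nested Optionals can be checked directly with the
standard typing module; int is used here only as a stand-in for T:

```python
from typing import Optional, Union

nested = Optional[Optional[int]]
flat = Optional[int]

# Nested Optionals collapse into a single "T or None" union.
print(nested == flat)            # True
print(flat == Union[int, None])  # True
```

So a field is either nullable or not, with no way to express "nullable
twice", which matches a single boolean flag.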
>>
>> With this approach a schema in Python might look like:
>> ```
>> class Movie(NamedTuple):
>>   name: np.unicode
>>   year: Optional[np.int16]
>>
>> # The class/type annotation syntax doesn't work in Python 2. Instead you
>> # can use:
>> # Movie = NamedTuple('Movie', [('name', np.unicode),
>> #                              ('year', Optional[np.int16])])
>>
>> # DoFns annotated with_output_types(Movie) will use RowCoder
>> coders.registry.register_coder(Movie, coders.RowCoder)
>> ```
>>
>> I think the choice to use typing.NamedTuple as a row type is potentially
controversial - Udi, Robert Bradshaw and I were already discussing it a bit
in a comment on the portable schemas doc [2], but I wanted to bring that
discussion to the ML.
>>
>> On the pro side:
>> + NamedTuple is a pretty great analog for Java's Row type [3]. Both
store attributes internally as an ordered collection (List<Object> in Row,
a tuple in NamedTuple) and provide shortcuts for accessing those attributes
by field name based on the schema.
>> +  NamedTuple is a native type, and we're trying to get out of the
business of defining our own type hints (I think).
>>
>> On the con side:
>> - When using the class-based version of NamedTuple in python 3 a user
might be tempted to add more functionality to their class (for example,
define a method) rather than just defining a schema - but I'm not sure
we're prepared to guarantee that we will always produce an instance of
their class, just something that has the defined attributes. This concern
can potentially be alleviated once we have support for logical types.
>>
>> Unless there are any objections I think it would make sense to start
with this implementation (documenting the limitations), and then iterate on
it. Please take a look at the PR [1] and let me know what you think about
this proposal.
>>
>> Thanks,
>> Brian
>>
>> [1] https://github.com/apache/beam/pull/9188
>> [2]
https://docs.google.com/a/google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?disco=AAAADSP8gx8
>> [3]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java
