On Tue, Aug 6, 2019 at 3:55 AM Robert Bradshaw <rober...@google.com> wrote:

> On Sun, Aug 4, 2019 at 12:03 AM Chad Dombrova <chad...@gmail.com> wrote:
> >
> > Hi,
> >
> > This looks like a great feature.
> >
> > Is there a plan to eventually support custom field types?
> >
> > I assume adding support for dataclasses in Python 3.7+ should be trivial
> > to do in a follow-up PR. Do you see any complications with that? The main
> > advantage that dataclasses have over NamedTuple in this context is
> > argument defaults, which is a nice convenience.
>
> Java has a notion of logical types which has yet to be figured out in
> a cross-language way but tackles this exact issue. I think there's a
> lot of value in "anonymous" named tuples as intermediates as well, e.g.
> one might do a projection onto a subset of fields, and then do a
> grouping/aggregating operation, in which case the new schema can be
> inferred (even if it doesn't have a name).
>

Yes, I think in the future we can add support for dataclasses. In Java,
SchemaCoder handles things like this by inferring schemas from user types
like POJOs and automatically creating functions for converting instances of
those types to/from Row; we could do something similar for converting
other structured data representations to/from NamedTuple.

I don't think it's trivial to add in a follow-up PR though, because we're
not actually making any guarantees yet that your type and its methods will
be re-created when the pipeline is running, just that we will produce
objects with the expected attributes.
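
For illustration, a rough sketch of what that conversion could look like
(this is not a real Beam API, just the general shape; the helper name is
made up):

    import dataclasses
    from typing import NamedTuple

    @dataclasses.dataclass
    class Word:
        name: str
        rank: int = 0

    def schema_namedtuple_from_dataclass(cls):
        # Hypothetical helper: derive a NamedTuple with the same fields,
        # which the proposed RowCoder already knows how to handle. Note
        # that methods and defaults on the dataclass are lost here.
        return NamedTuple(
            cls.__name__,
            [(f.name, f.type) for f in dataclasses.fields(cls)])

    WordRow = schema_namedtuple_from_dataclass(Word)
    row = WordRow(**dataclasses.asdict(Word(name="beam")))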


>
> > My PR as it is right now actually doesn't even support int. I probably
> > should at least make a change to accept int as a type specification for
> > int64 but throw an error when encoding if an int is too big.
> >
> > Should probably do the same for float.
> >
> > Another concern I have is: if there is a user function or a library that
> > the user does not control that uses typing to indicate that a function
> > accepts a type of int, would it be compatible with numpy types?
> >
> > I have similar concerns. I guess we'll just have to cast to int before
> > passing into third-party code, which is not ideal. Why not use int for
> > int64 in Python?
>
> A properly written library should accept any type implementing the
> __int__ (or __index__) methods in place of an int, rather than doing
> explicit type checks, though performance may suffer. Likewise when
> encoding, we should accept all sorts of ints when an int32 (say) is
> expected, rather than force the user to know and cast to the right
> type.
>

But I don't think there's any way for library functions to type hint that
they accept any type implementing __int__, so while a properly written
library may be able to accept np.int* values, I don't think it could
write type hints in such a way that static type checkers would consider
them valid. Never mind: I just came across typing.SupportsInt, and mypy
will let you pass any numpy integer type to a function with that type hint.
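
A quick illustration (the function name is just an example):

    from typing import SupportsInt
    import numpy as np

    def encode_rank(value: SupportsInt) -> int:
        # SupportsInt is a protocol matching anything with __int__,
        # so numpy integer scalars type-check here.
        return int(value)

    encode_rank(np.int32(7))  # accepted by mypy
    encode_rank(42)           # plain int works too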

It seems reasonable for RowCoder to accept anything that supports __int__
when encoding INT*. We could just convert to an int and verify that its
bit_length() is short enough for the specified type.
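
Something along these lines (a sketch only; a range comparison is used
here since bit_length() ignores the sign bit, which makes the lower bound
awkward):

    INT64_MIN, INT64_MAX = -(1 << 63), (1 << 63) - 1

    def coerce_to_int64(value):
        # Accept anything implementing __int__, then verify the result
        # actually fits in a signed 64-bit field before encoding.
        as_int = int(value)
        if not INT64_MIN <= as_int <= INT64_MAX:
            raise OverflowError("%d does not fit in INT64" % as_int)
        return as_int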


>
> As for the mappings between Python types and schemas, there are
> several mappings that are somewhat being conflated.
>
> (1) There is the mapping used in definitions. At the moment, all
> subclasses of NamedTuple map to the same generic Row schema type,
> probably not something we want in the long run (but could be OK for
> now if we think doing better in the future counts as backwards
> compatible). For integral types, it makes sense to accept
> np.int{8,16,32,64}, but should we accept the equivalent arrow types
> here as well? I think we also need to accept the plain Python "int"
> and "float" type, otherwise a standard Python class like
>
>     NamedTuple('Word', [('name', str), ('rank', int), ('frequency', float)])
>
> will be surprisingly rejected.
>

Agreed on float, since it seems to trivially map to a double, but I'm still
torn on int. While I do want int type hints to work, it doesn't seem
appropriate to map int to AtomicType.INT64, since it has a completely
different range of values. I wonder if instead there should be a
variable-width integer primitive (or logical type) that int can map to.
Since other SDKs may not have a type to map this to, it may need to be cast
(explicitly or implicitly) to some other integer type to interoperate.
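
The mismatch is easy to demonstrate:

    import numpy as np

    big = 2 ** 64          # a perfectly valid Python int
    big.bit_length()       # 65 -- too wide for any fixed 64-bit field
    np.int64(big)          # raises OverflowError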


>
> (2) The mapping of concrete values to Python types. Rows mapping to
> NamedTuples may give expectations beyond the attributes they offer
> (and I'd imagine we'll want to be flexible with the possible
> representations here, e.g. offering a slice of an arrow record batch).
> Or do we need to pay the cost of re-creating the user's NamedTuple
> subclass? Ints are another interesting case--it may be quite
> surprising to users for the returned values to have silent truncating
> overflow semantics (very unlike Python) rather than the arbitrary
> precision that Python's ints give (especially if the conversion from a
> Python int to an int64 happens due to an invisible fusion barrier).
> Better to compute the larger value and then throw an error if/when it
> is encoded into a fixed-width type later.
>

NamedTuples: Yeah, I'm starting to wonder if we should hide the use of
NamedTuple as an implementation detail somehow. It causes a couple of
issues:
1. As you point out, it can give users expectations beyond just the
attributes.
2. Currently you have to specifically call register_coder(MyRowType,
RowCoder) to opt in to using RowCoder for your type (see the sketch after
this list), since I thought someone might want to use NamedTuple without
RowCoder and would be surprised if we tried to shoe-horn it into RowCoder
by default. But it would be nice if users didn't have to do that.
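
For reference, the opt-in in (2) looks roughly like this (assuming the
registration API stays as in the PR):

    from typing import NamedTuple
    from apache_beam import coders

    class MyRowType(NamedTuple):
        name: str
        rank: int

    # Explicit opt-in; without it, NamedTuple instances just get the
    # default fallback coder.
    coders.registry.register_coder(MyRowType, coders.RowCoder)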

Ints: But is it really any better for the user to declare a schema with a
native int field, only to get a runtime error when we try to encode a
particular value because it requires more than 64 bits? It seems preferable
to make them specify int64 for these fields so they're forced to
acknowledge that this can happen.
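
To make that concrete (using the numpy types the PR already accepts):

    from typing import NamedTuple
    import numpy as np

    class Scores(NamedTuple):
        # Declaring the fixed width up front means a value needing more
        # than 64 bits failing at encode time is no surprise...
        total: np.int64

    # ...whereas `total: int` reads as arbitrary precision, and the same
    # failure would look like a bug in the SDK.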


>
> (3) The mapping of Python values into a row (e.g. for serialization).
> If there are errors (e.g. a DoFn produces tuples of the wrong type),
> how eagerly can we detect them? At what cost? How strict should we be
> (e.g. if a named tuple with certain fields is expected, can we map a
> concrete subclass to it? A NamedTuple that has a superset of the
> fields?) Implicitly mapping Python's float (aka a 64-bit C double) to a
> float32 is a particularly sticky question.
>
> I think we can make forward progress on implementation in parallel to
> answering these questions, but we should be explicit and document what
> the best options are here and then get the code to align.
