Thanks for all the suggestions, I've added responses inline.

On Wed, Aug 7, 2019 at 12:52 PM Chad Dombrova <chad...@gmail.com> wrote:
> There’s a lot of ground to cover here, so I’m going to pull from a few
> different responses.
> ------------------------------
>
> numpy ints
>
> A properly written library should accept any type implementing the
> __int__ (or __index__) methods in place of an int, rather than doing
> explicit type checks
>
> Yes, but the reality is that very very few actually do this, including
> Beam itself (check the code for Timestamp and Duration, to name a few).
>
> Which brings me to my next topic:
>
> I tested this out with mypy and it would not be compatible:
>
>     def square(x: int):
>         return x*x
>
>     square(np.int16(32))  # mypy error
>
> The proper way to check this scenario is using typing.SupportsInt. Note
> that this *only* guarantees that __int__ exists, so you still need to
> cast to int if you want to do anything with the object:
>
>     def square(x: typing.SupportsInt) -> int:
>         if not isinstance(x, int):
>             x = int(x)
>         return x*x
>
>     square('foo')  # error!
>     square(1.2)    # ok

Yep, I came across typing.SupportsInt while writing my last reply. I agree, though, that it seems unlikely that many libraries actually do this.

> ------------------------------
>
> Native python ints
>
> Agreed on float since it seems to trivially map to a double, but I’m torn
> on int still. While I do want int type hints to work, it doesn’t seem
> appropriate to map it to AtomicType.INT64, since it has a completely
> different range of values.
>
> Let’s say we used native int for the runtime field type, not just as a
> schema declaration for numpy.int64. What is the real-world fallout from
> this? Would there be data loss?

I'm not sure I follow the question exactly: what is the interplay between int and numpy.int64 in this scenario? Are you saying that np.int64 is used in the schema declaration, but we just use native int at runtime and check the bit width when encoding?

In any case, I don't think the real-world fallout of using int is nearly that dire. I suppose data loss is possible if a poorly designed pipeline overflows an int64 and crashes, but that's possible whether we use int or np.int64 at runtime. I'm just saying that a user could be forgiven for thinking they're safe from overflows if they declare a schema as NamedTuple('Foo', [('to_infinity_and_beyond', int)]), but they shouldn't make the same mistake when they explicitly call it an int64. (Quick illustration of the difference below.)
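To be clear, the difference here is just standard Python/NumPy behavior, nothing Beam-specific: a native int silently grows past the int64 range, while np.int64 wraps around with only a RuntimeWarning:

    import numpy as np

    big = 2 ** 63 - 1              # INT64_MAX
    print(big + 1)                 # 9223372036854775808 - Python ints never overflow

    np_big = np.int64(2 ** 63 - 1)
    print(np_big + np.int64(1))    # -9223372036854775808, plus a RuntimeWarning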
> ------------------------------
>
> NamedTuples
>
> (2) The mapping of concrete values to Python types. Rows mapping to
> NamedTuples may give expectations beyond the attributes they offer (and I’d
> imagine we’ll want to be flexible with the possible representations here,
> e.g. offering a slice of an arrow record batch). Or do we need to pay the
> cost of re-creating the user’s NamedTuple subclass.
>
> I was a little surprised to see that the current implementation is
> creating a new class on the fly each time. I would suspect that this would
> be slower than recording the class on registration and looking it up. Why
> not do this? Using the same type will be less confusing and more powerful,
> as custom methods can be added.

"Each time" just means once per schema when a worker is being initialized. The classes are registered by UUID when they're created, so we can look them up if the same schema is used again (there's a toy sketch of the lookup at the end of this mail).

The issue is that the registry that's built up at pipeline construction time isn't carried over to the workers, so we need to re-build it when a worker starts, and the only information we have at that point is the schema stored as the payload of RowCoder, which currently has nothing SDK-specific. In https://s.apache.org/beam-schemas and in the portable schemas thread <https://lists.apache.org/thread.html/b9901224b13054c93743cf473262c1ac03411404a69adc4336a59c16@%3Cdev.beam.apache.org%3E> we discussed the possibility of including SDK-specific logical types, where some sort of serialized class to represent the type can be included per-SDK. That could allow us to actually send the user's type over to the workers to solve this problem, but that's not in the spec I'm working off of now, and it would take some more design work.

> ------------------------------
>
> Python3-only
>
> Now that I’m actually thinking it through, you’re right there are several
> things that could be cleaner if we make it a python 3 only feature:
>
> Please don’t do this.
>
>    - We could use typing.Collection
>
> If you want something a little more relaxed than typing.List, why not
> use typing.Sequence? Sequences are expected to be ordered (they
> therefore support __getitem__ and __reversed__) and Collections are not,
> so I think Sequence is a better fit for an array.
>
>    - No need to worry about 2/3 compatibility for strings, we could just
>    use str
>
> This is already widely handled throughout the Beam python SDK using the
> future/past library, so it seems silly to give up on this solution for
> schemas.
>
> On this topic, I added some comments to the PR about using
> past.builtins.unicode instead of numpy.unicode. They’re the same type,
> but there’s no reason to get this via numpy, when everywhere else in the
> code gets it from past.
>
>    - We could just use bytes for byte arrays (as a shorthand for
>    typing.ByteString [1])
>
> Neat, but in my obviously very biased opinion it is not worth cutting off
> python2 users over this.

Ok, I won't do this :) I wasn't aware of typing.Sequence; that does seem like a good fit. The other two items are just nice-to-haves, so I'm happy to work around those and use Sequence for arrays instead (rough example at the end of this mail).

> That's all for now.
> -chad
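To make the registry idea above concrete, here's a toy version of the lookup. This is only meant to illustrate the shape of it; the names are invented, it's not the actual Beam code:

    import typing
    import uuid

    _schema_registry = {}  # schema uuid -> generated NamedTuple class

    def named_tuple_from_schema(schema_id, fields):
        # Re-use the class generated for this schema if we've seen it
        # before, whether at pipeline construction or on worker init.
        if schema_id not in _schema_registry:
            _schema_registry[schema_id] = typing.NamedTuple(
                'BeamSchema_' + schema_id.replace('-', '_'), fields)
        return _schema_registry[schema_id]

    sid = str(uuid.uuid4())
    fields = [('name', str), ('year', int)]
    # The same schema id always yields the same class within a process:
    assert named_tuple_from_schema(sid, fields) is named_tuple_from_schema(sid, fields)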
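And for the Sequence suggestion, roughly what a schema declaration could look like (class and field names are made up, and this assumes Sequence ends up accepted anywhere List is today):

    import typing
    import numpy as np
    from past.builtins import unicode

    # unicode comes from past for 2/3 compatibility, and the array field
    # uses Sequence rather than List:
    Movie = typing.NamedTuple('Movie', [
        ('name', unicode),
        ('ratings', typing.Sequence[np.int32]),
    ])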