Re: [PROPOSAL] An initial Schema API in Python

Chad Dombrova Wed, 07 Aug 2019 12:52:12 -0700

There’s a lot of ground to cover here, so I’m going to pull from a few
different responses.
------------------------------


numpy ints

A properly written library should accept any type implementing the *int*
(or *index*) methods in place of an int, rather than doing explicit type
checks

Yes, but the reality is that very very few actually do this, including Beam
itself (check the code for Timestamp and Duration, to name a few).

Which brings me to my next topic:

I tested this out with mypy and it would not be compatible:

def square(x: int):
return x*x

square(np.int16(32)) # mypy error

The proper way to check this scenario is using typing.SupportsInt. Note
that this *only* guarantees that __int__ exists, so you still need to cast
to int if you want to do anything with the object:

def square(x: typing.SupportsInt) -> int:
    if not isinstance(x, int):
        x = int(x)
    return x*x

square('foo')  # error!
square(1.2)  # ok

------------------------------

Native python ints

Agreed on float since it seems to trivially map to a double, but I’m torn
on int still. While I do want int type hints to work, it doesn’t seem
appropriate to map it to AtomicType.INT64, since it has a completely
different range of values.

Let’s say we used native int for the runtime field type, not just as a
schema declaration for numpy.int64. What is the real world fallout from
this? Would there be data loss?
------------------------------

NamedTuples

(2) The mapping of concrete values to Python types. Rows mapping to
NamedTuples may give expectations beyond the attributes they offer (and I’d
imagine we’ll want to be flexible with the possible representations here,
e.g. offering a slice of an arrow record batch). Or do we need to pay the
cost of re-creating the users NamedTuple subclass.

I was a little surprised to see that the current implementation is creating
a new class on the fly each time. I would suspect that this would be slower
than recording the class on registration and looking it up. Why not do
this? Using the same type will be less confusing and more powerful, as
custom methods can be added.
------------------------------

Python3-only

Now that I’m actually thinking it through, you’re right there are several
things that could be cleaner if we make it a python 3 only feature:

Please don’t do this.


   - We could use typing.Collection

If you want something a little more relaxed than typing.List why not using
typing.Sequence? Sequences are expected to be ordered (they therefore
support __getitem__ and __reverse__) and Collections are not, so I think
Sequence is a better fit for an array.


   - No need to worry about 2/3 compatibility for strings, we could just
   use str

This is already widely handled throughout the Beam python SDK using the
future/past library, so it seems silly to give up on this solution for
schemas.

On this topic, I added some comments to the PR about using
past.builtins.unicode instead of numpy.unicode. They’re the same type, but
there’s no reason to get this via numpy, when everywhere else in the code
gets it from past.


   - We could just use bytes for byte arrays (as a shorthand for
   typing.ByteString [1])

Neat, but in my obviously very biased opinion it is not worth cutting off
python2 users over this.

That's all for now.
-chad

Re: [PROPOSAL] An initial Schema API in Python

Reply via email to