There’s a lot of ground to cover here, so I’m going to pull from a few different responses. ------------------------------
numpy ints A properly written library should accept any type implementing the *int* (or *index*) methods in place of an int, rather than doing explicit type checks Yes, but the reality is that very very few actually do this, including Beam itself (check the code for Timestamp and Duration, to name a few). Which brings me to my next topic: I tested this out with mypy and it would not be compatible: def square(x: int): return x*x square(np.int16(32)) # mypy error The proper way to check this scenario is using typing.SupportsInt. Note that this *only* guarantees that __int__ exists, so you still need to cast to int if you want to do anything with the object: def square(x: typing.SupportsInt) -> int: if not isinstance(x, int): x = int(x) return x*x square('foo') # error! square(1.2) # ok ------------------------------ Native python ints Agreed on float since it seems to trivially map to a double, but I’m torn on int still. While I do want int type hints to work, it doesn’t seem appropriate to map it to AtomicType.INT64, since it has a completely different range of values. Let’s say we used native int for the runtime field type, not just as a schema declaration for numpy.int64. What is the real world fallout from this? Would there be data loss? ------------------------------ NamedTuples (2) The mapping of concrete values to Python types. Rows mapping to NamedTuples may give expectations beyond the attributes they offer (and I’d imagine we’ll want to be flexible with the possible representations here, e.g. offering a slice of an arrow record batch). Or do we need to pay the cost of re-creating the users NamedTuple subclass. I was a little surprised to see that the current implementation is creating a new class on the fly each time. I would suspect that this would be slower than recording the class on registration and looking it up. Why not do this? Using the same type will be less confusing and more powerful, as custom methods can be added. ------------------------------ Python3-only Now that I’m actually thinking it through, you’re right there are several things that could be cleaner if we make it a python 3 only feature: Please don’t do this. - We could use typing.Collection If you want something a little more relaxed than typing.List why not using typing.Sequence? Sequences are expected to be ordered (they therefore support __getitem__ and __reverse__) and Collections are not, so I think Sequence is a better fit for an array. - No need to worry about 2/3 compatibility for strings, we could just use str This is already widely handled throughout the Beam python SDK using the future/past library, so it seems silly to give up on this solution for schemas. On this topic, I added some comments to the PR about using past.builtins.unicode instead of numpy.unicode. They’re the same type, but there’s no reason to get this via numpy, when everywhere else in the code gets it from past. - We could just use bytes for byte arrays (as a shorthand for typing.ByteString [1]) Neat, but in my obviously very biased opinion it is not worth cutting off python2 users over this. That's all for now. -chad