On Sun, Aug 4, 2019 at 12:03 AM Chad Dombrova <chad...@gmail.com> wrote:
>
> Hi,
>
> This looks like a great feature.
>
> Is there a plan to eventually support custom field types?
>
> I assume adding support for dataclasses in python 3.7+ should be trivial to 
> do in a follow up PR. Do you see any complications with that? The main 
> advantage that dataclasses have over NamedTuple in this context is argument 
> defaults, which is a nice convenience.

Java has a notion of logical types which has yet to be figured out in
a cross-language way but tackles this exact issue. I think there's a
lot of value in "anonymous" named tuples as intermediates as well,
e.g. one might do a projection onto a subset of fields, and then do a
grouping/aggregating operation, in which case the new schema can be
inferred (even if it doesn't have a name).
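
To make this concrete, here is a minimal plain-Python sketch
(illustrative only, not a proposed API):

    from typing import NamedTuple

    class Word(NamedTuple):
        name: str
        rank: int
        frequency: float

    # Projecting onto a subset of fields: the intermediate's schema,
    # (name: str, frequency: float), is fully inferable from Word even
    # though no one has declared or named a class for it.
    def project(w: Word):
        return (w.name, w.frequency)

    # A downstream grouping/aggregation over these intermediates could
    # likewise infer its output schema without a user-provided name.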

> My PR as it is right now actually doesn’t even support int. I probably should 
> at least make a change to accept int as a type specification for int64 but 
> throw an error when encoding if an int is too big.
>
> Should probably do the same for float.
>
> Another concern I have is, if there is a user function or a library that the
> user does not control, that uses typing to indicate that a function accepts a type
> of int, would it be compatible with numpy types?
>
> I have similar concerns. I guess we’ll just have to cast to int before 
> passing into 3rd party code, which is not ideal. Why not use int for int64 in 
> python?

A properly written library should accept any type implementing the
__int__ (or __index__) methods in place of an int, rather than doing
explicit type checks, though performance may suffer. Likewise when
encoding, we should accept all sorts of ints when an int32 (say) is
expected, rather than force the user to know and cast to the right
type.
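
For instance (a contrived sketch, with made-up function names):

    import operator
    import numpy as np

    def strict(n):
        # Explicit type check: rejects np.int64 and friends on Python 3,
        # even though they behave like ints.
        if not isinstance(n, int):
            raise TypeError("expected int")
        return n

    def duck_typed(n):
        # operator.index() accepts anything implementing __index__
        # (numpy integer scalars included), at some performance cost.
        return operator.index(n)

    duck_typed(np.int64(7))  # OK: returns the Python int 7
    # strict(np.int64(7))    # raises TypeError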

As for the mappings between Python types and schemas, there are
several mappings that are somewhat being conflated.

(1) There is the mapping used in definitions. At the moment, all
subclasses of NamedTuple map to the same generic Row schema type,
probably not something we want in the long run (but could be OK for
now if we think doing better in the future counts as backwards
compatible). For integral types, it makes sense to accept
np.int{8,16,32,64}, but should we accept the equivalent arrow types
here as well? I think we also need to accept the plain Python "int"
and "float" types, otherwise a standard Python class like

    NamedTuple('Word', [('name', str), ('rank', int), ('frequency', float)])

will be surprisingly rejected.
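
One way to handle this (a hypothetical sketch; the names and the
choice of defaults are assumptions, not settled design):

    import numpy as np

    # Hypothetical normalization applied during schema inference: plain
    # Python types fall back to a default fixed-width type, while explicit
    # numpy types pass through unchanged.
    _DEFAULTS = {
        int: np.int64,      # assumption: plain int defaults to int64
        float: np.float64,  # assumption: plain float defaults to float64
    }

    def normalize_field_type(t):
        return _DEFAULTS.get(t, t)

    normalize_field_type(int)       # np.int64
    normalize_field_type(np.int32)  # np.int32, unchanged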

(2) The mapping of concrete values to Python types. Rows mapping to
NamedTuples may give expectations beyond the attributes they offer
(and I'd imagine we'll want to be flexible with the possible
representations here, e.g. offering a slice of an arrow record batch).
Or do we need to pay the cost of re-creating the user's NamedTuple
subclass? Ints are another interesting case--it may be quite
surprising to users for the returned values to have silent truncating
overflow semantics (very unlike Python) rather than the arbitrary
precision that Python's ints give (especially if the conversion from a
python int to an int64 happens due to an invisible fusion barrier).
Better to compute the larger value and then throw an error if/when it
is encoded into a fixed-width type later.
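
A quick illustration of why silent truncation would be surprising,
plus the encode-time check suggested above (encode_int64 is a
hypothetical name):

    import numpy as np

    x = 2**62
    x * 4                      # Python int: 18446744073709551616
    np.int64(x) * np.int64(4)  # wraps to 0 (numpy may warn, but won't raise)

    def encode_int64(value):
        # Hypothetical encode-time check: carry arbitrary-precision ints
        # through the pipeline, but fail loudly at the fixed-width boundary.
        if not -2**63 <= value < 2**63:
            raise OverflowError("value does not fit in int64: %r" % value)
        return np.int64(value)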

(3) The mapping of Python values into a row (e.g. for serialization).
If there are errors (e.g. a DoFn produces tuples of the wrong type),
how eagerly can we detect them? At what cost? How strict should we be
(e.g. if a named tuple with certain fields is expected, can we map a
concrete subclass to it? A NamedTuple that has a superset of the
fields?) Implicitly mapping Python's float (aka a 64-bit C double) to
a float32 is a particularly sticky question.
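
For the float32 case, one strict option would be to narrow only when
it is lossless (again just a sketch; NaN handling elided):

    import numpy as np

    def encode_float32(value):
        # A Python float is a 64-bit C double, so only accept it as a
        # float32 if the narrowing round-trips exactly.
        narrowed = np.float32(value)
        if float(narrowed) != value:
            raise ValueError("lossy conversion to float32: %r" % value)
        return narrowed

    encode_float32(0.5)    # fine: 0.5 is exactly representable in float32
    # encode_float32(0.1)  # raises: 0.1 narrows lossily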

I think we can make forward progress on implementation in parallel to
answering these questions, but we should be explicit and document what
the best options are here and then get the code to align.
