On Sun, Aug 4, 2019 at 12:03 AM Chad Dombrova <chad...@gmail.com> wrote:
>
> Hi,
>
> This looks like a great feature.
>
> Is there a plan to eventually support custom field types?
>
> I assume adding support for dataclasses in python 3.7+ should be trivial to
> do in a follow up PR. Do you see any complications with that? The main
> advantage that dataclasses have over NamedTuple in this context is argument
> defaults, which is a nice convenience.
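(For reference, the argument-defaults convenience being referred to looks
roughly like the sketch below -- illustrative only, borrowing the Word
example from later in this thread:

    from dataclasses import dataclass

    @dataclass
    class Word:
        # Fields with defaults can be omitted at construction time.
        name: str
        rank: int = 0
        frequency: float = 0.0

    w = Word("beam")  # rank=0, frequency=0.0 filled in automatically

)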
Java has a notion of logical types which has yet to be figured out in a
cross-language way but tackles this exact issue.

I think there's a lot of value in "anonymous" named tuples as intermediates as
well, e.g. one might do a projection onto a subset of fields, and then do a
grouping/aggregating operation, in which case the new schema can be inferred
(even if it doesn't have a name).

> My PR as it is right now actually doesn't even support int. I probably should
> at least make a change to accept int as a type specification for int64 but
> throw an error when encoding if an int is too big.
>
> Should probably do the same for float.
>
> Another concern I have is, if there is a user function or a library that the
> user does not control, that uses typing to indicate that a function accepts a
> type of int, would it be compatible with numpy types?
>
> I have similar concerns. I guess we'll just have to cast to int before
> passing into 3rd party code, which is not ideal. Why not use int for int64 in
> Python?

A properly written library should accept any type implementing the __int__ (or
__index__) methods in place of an int, rather than doing explicit type checks,
though performance may suffer. Likewise, when encoding, we should accept all
sorts of ints when an int32 (say) is expected, rather than forcing the user to
know and cast to the right type.

As for the mappings between Python types and schemas, there are several
mappings that are somewhat being conflated.

(1) There is the mapping used in definitions. At the moment, all subclasses of
NamedTuple map to the same generic Row schema type, probably not something we
want in the long run (but could be OK for now if we think doing better in the
future counts as backwards compatible). For integral types, it makes sense to
accept np.int{8,16,32,64}, but should we accept the equivalent arrow types
here as well? I think we also need to accept the plain Python "int" and
"float" types, otherwise a standard Python class like
NamedTuple('Word', [('name', str), ('rank', int), ('frequency', float)])
will be surprisingly rejected.

(2) The mapping of concrete values to Python types. Rows mapping to
NamedTuples may give expectations beyond the attributes they offer (and I'd
imagine we'll want to be flexible with the possible representations here, e.g.
offering a slice of an arrow record batch). Or do we need to pay the cost of
re-creating the user's NamedTuple subclass? Ints are another interesting
case--it may be quite surprising to users for the returned values to have
silent truncating overflow semantics (very unlike Python) rather than the
arbitrary precision that Python's ints give (especially if the conversion from
a Python int to an int64 happens due to an invisible fusion barrier). Better
to compute the larger value and then throw an error if/when it is encoded into
a fixed-width type later.

(3) The mapping of Python values into a row (e.g. for serialization). If there
are errors (e.g. a DoFn produces tuples of the wrong type), how eagerly can we
detect them? At what cost? How strict should we be (e.g. if a named tuple with
certain fields is expected, can we map a concrete subclass to it? A NamedTuple
that has a superset of the fields?) Implicitly mapping Python's float (aka a
64-bit C double) to a float32 is a particularly sticky question.

I think we can make forward progress on implementation in parallel with
answering these questions, but we should be explicit and document what the
best options are here and then get the code to align.
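To make the duck-typing and overflow points above concrete, an encoder along
these lines (encode_int32 is a made-up name, not the PR's API) would accept
np.int32, np.int64, and plain int alike, and defer range errors to encoding
time:

    import operator

    def encode_int32(value):
        # Sketch only: operator.index() defers to __index__, so any
        # integer-like type (numpy ints, plain int, ...) is accepted
        # without an explicit isinstance check.
        v = operator.index(value)
        if not -2**31 <= v < 2**31:
            raise OverflowError("%r does not fit in an int32" % v)
        return v.to_bytes(4, 'big', signed=True)

Under these semantics 2**40 computes fine as an arbitrary-precision Python int
inside the pipeline and only raises if/when it hits an int32 encoding
boundary, rather than silently truncating.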
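Similarly, for (1), accepting the plain Python "int" and "float" at definition
time could be as simple as an alias table -- a hypothetical sketch, not what
the PR does:

    import numpy as np
    from typing import NamedTuple

    # Hypothetical: treat Python's int/float as aliases for the widest
    # fixed-size types so ordinary classes aren't rejected.
    _ALIASES = {int: np.int64, float: np.float64}

    def field_types(nt_class):
        return [(name, _ALIASES.get(t, t))
                for name, t in nt_class.__annotations__.items()]

    Word = NamedTuple(
        'Word', [('name', str), ('rank', int), ('frequency', float)])
    print(field_types(Word))
    # [('name', <class 'str'>), ('rank', <class 'numpy.int64'>),
    #  ('frequency', <class 'numpy.float64'>)]

Whether the arrow equivalents belong in that table too is exactly the open
question above.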