On Mon, Jun 21, 2021 at 4:58 PM Antoine Pitrou <anto...@python.org> wrote:
> > I certainly don't think we should have extension types with a different > type id. IMHO, it's a recipe for confusion. > Thanks, I think I got confused by the different perspectives in the thread. I'll do some more exploratory coding with the pure ExtensionType which should certainly make life easier! > > Le 21/06/2021 à 15:54, Simon Perkins a écrit : > > To put it another way, an Extension Type technically has Type::EXTENSION, > > but now there's Type::COMPLEX_FLOAT and Type::COMPLEX_DOUBLE. > > > > When checking enums, the code see's a Type::COMPLEX_FLOAT and seems to > > mismatch on ComplexFloatType::Type::type_id, as the latter is > > Type::EXTENSION? > > > > On Mon, Jun 21, 2021 at 3:11 PM Simon Perkins <simon.perk...@gmail.com> > > wrote: > > > >> I did some exploratory coding adding Complex Numbers as ExtensionTypes > in > >> this PR: https://github.com/apache/arrow/pull/10565 > >> > >>> My understanding is that it means having COMPLEX as an entry in the > >> arrow/type_fwd.h Type enum. I agree this would make implementation > >> work in the C++ library much more straightforward. > >> > >> I implemented this approach, adding COMPLEX_FLOAT and COMPLEX_DOUBLE > >> entries to the Type enum. > >> One thing I noted is that at least some portion of the code base > >> (visitor_inline.h) expects the ExtensionTypes > >> to have first-class type-like interfaces (i.e. needs a TypeTraits entry, > >> Visitor definitions). > >> > >> My impression at this point, is that the work in implementing a hybrid > >> approach > >> (i.e. ExtensionType with a Type enum entry) replicates that of adding a > >> first-class type. > >> As I am not extensively familiar with the code base, I thought I'd just > >> check whether this impression > >> is correct. > >> > >> regards > >> Simon > >> > >> > >> > >> > >> > >> On Fri, Jun 11, 2021 at 6:52 AM Micah Kornfield <emkornfi...@gmail.com> > >> wrote: > >> > >>>> > >>>> It might help this discussion and future discussions like it if we > >>> could > >>>> define how it is determined whether a type should be part of the Arrow > >>>> format, an extension type (and what does it mean to say there is a > >>>> "canonical" extension type), or just something that a language > >>>> implementation or downstream library builds for itself with metadata. > I > >>>> feel like this has come up before but I don't recall a resolution. > >>> > >>> > >>> There seemed to be consensus, but I guess we never formally voted on > the > >>> decision points here: > >>> > >>> > https://lists.apache.org/thread.html/r7ba08aed2809fa64537e6f44bce38b2cf740acbef0e91cfaa7c19767%40%3Cdev.arrow.apache.org%3E > >>> > >>> Applying the criteria to complex types: > >>> 1. Is the type a new parameterization of an existing type? No > >>> > >>> 2. Does the type itself have its own specification for processing > (e.g. > >>> JSON, BSON, Thrift, Avro, Protobuf)? No > >>> > >>> 3. Is the underlying encoding of the type already semantically > supported > >>> by a type? Yes. Two have been mentioned in this thread and I would > also > >>> support adding a new packed struct type, but it appears isn't necessary > >>> for > >>> this. Note that FixedSizeLists have some limitations in regards to > parquet > >>> compatibility around nullability, there might be a few other sharp > edges. > >>> > >>> So if we use this criteria we would lean towards an extension type. > >>> > >>> We never converged on a standard for "canonical" extension types. I > would > >>> propose it roughly be the same criteria as a first class type: > >>> 1. Specification/document update PR that describes the representation > >>> 2. Implementation showing working integration tests across two > languages > >>> (for canonical types I think this can be any 2 languages instead of C++ > >>> and > >>> Java) > >>> 3. Formal vote accepting the canonical type. > >>> > >>> Thanks, > >>> Micah > >>> > >>> > >>> > >>> On Thu, Jun 10, 2021 at 9:34 PM Jorge Cardoso Leitão < > >>> jorgecarlei...@gmail.com> wrote: > >>> > >>>> Isn't an array of complexes represented by what arrow already > supports? > >>> In > >>>> particular, I see at least two valid in-memory representations to use, > >>> that > >>>> depend on what we are going to do with it: > >>>> > >>>> * Struct[re, im] > >>>> * FixedList[2] > >>>> > >>>> In the first case, we have two buffers, [x0, x1, ...] and [y0, y1, > >>> ...], in > >>>> the second case we have 1 buffer, [x0, y0, x1, y1, ...]. > >>>> > >>>> The first representation is useful for column-based operations (e.g. > >>> taking > >>>> the real part in case 1 is trivial; requires a copy in the second > case), > >>>> the second representation is useful for row-base operations (e.g. > "take" > >>>> and "filter" require a single pass over buffer 1). Case 2 does not > >>> support > >>>> Re and Im of different physical types (arguably an issue). Both cases > >>>> support nullability of individual items or combined. > >>>> > >>>> What I conclude is that this does not seem to be a problem about a > base > >>>> in-memory representation, but rather on whether we agree on a > >>>> representation that justifies adding associated metadata to the spec. > >>>> > >>>> The case for the complex interval type recently proposed [1] is more > >>>> compelling to me because a complex ops over intervals usually required > >>> all > >>>> parts of the interval (and thus the "FixedList" representation is more > >>>> compelling), but each part has a different type. I.e. it is like a > >>>> "FixedTypedList[int32, int32, int64]", which we do not natively > support. > >>>> > >>>> [1] https://github.com/apache/arrow/pull/10177 > >>>> > >>>> Best, > >>>> Jorge > >>>> > >>>> > >>>> > >>>> On Fri, Jun 11, 2021 at 1:48 AM Neal Richardson < > >>>> neal.p.richard...@gmail.com> > >>>> wrote: > >>>> > >>>>> It might help this discussion and future discussions like it if we > >>> could > >>>>> define how it is determined whether a type should be part of the > Arrow > >>>>> format, an extension type (and what does it mean to say there is a > >>>>> "canonical" extension type), or just something that a language > >>>>> implementation or downstream library builds for itself with metadata. > >>> I > >>>>> feel like this has come up before but I don't recall a resolution. > >>>>> > >>>>> Examples might also help: are there examples of "canonical extension > >>>>> types"? > >>>>> > >>>>> Neal > >>>>> > >>>>> On Thu, Jun 10, 2021 at 4:20 PM Micah Kornfield < > >>> emkornfi...@gmail.com> > >>>>> wrote: > >>>>> > >>>>>>> > >>>>>>> My understanding is that it means having COMPLEX as an entry in > >>> the > >>>>>>> arrow/type_fwd.h Type enum. I agree this would make implementation > >>>>>>> work in the C++ library much more straightforward. > >>>>>> > >>>>>> One idea I proposed would be to do that, and implement the > >>>>>>> serialization of the complex metadata using Extension types. > >>>>>> > >>>>>> > >>>>>> If this is a maintainable strategy for Canonical types it sounds > >>> good > >>>> to > >>>>>> me. > >>>>>> > >>>>>> On Thu, Jun 10, 2021 at 4:02 PM Wes McKinney <wesmck...@gmail.com> > >>>>> wrote: > >>>>>> > >>>>>>> My understanding is that it means having COMPLEX as an entry in > >>> the > >>>>>>> arrow/type_fwd.h Type enum. I agree this would make implementation > >>>>>>> work in the C++ library much more straightforward. > >>>>>>> > >>>>>>> One idea I proposed would be to do that, and implement the > >>>>>>> serialization of the complex metadata using Extension types. > >>>>>>> > >>>>>>> On Thu, Jun 10, 2021 at 5:47 PM Weston Pace < > >>> weston.p...@gmail.com> > >>>>>> wrote: > >>>>>>>> > >>>>>>>>> While dedicated types are not strictly required, compute > >>>> functions > >>>>>>> would > >>>>>>>>> be much easier to add for a first-class dedicated complex > >>>> datatype > >>>>>>>>> rather than for an extension type. > >>>>>>>> @pitrou > >>>>>>>> > >>>>>>>> This is perhaps a naive question (and admittedly, I'm not up to > >>>> speed > >>>>>>>> on my compute kernels) but why is this the case? For example, > >>> if > >>>>>>>> adding a complex addition kernel it seems we would be talking > >>>>> about... > >>>>>>>> > >>>>>>>> dest_scalar.real = scalar1.real + scalar2.real; > >>>>>>>> dest_scalar.im = scalar1.im + scalar2.im; > >>>>>>>> > >>>>>>>> vs... > >>>>>>>> > >>>>>>>> dest_scalar[0] = scalar1[0] + scalar2[0]; > >>>>>>>> dest_scalar[1] = scalar1[1] + scalar2[1]; > >>>>>>>> > >>>>>>>> On Thu, Jun 10, 2021 at 11:27 AM Wes McKinney < > >>> wesmck...@gmail.com > >>>>> > >>>>>>> wrote: > >>>>>>>>> > >>>>>>>>> I'd be supportive of starting with this as a "canonical" > >>>> extension > >>>>>>>>> type so that all implementations are not expected to support > >>>>> complex > >>>>>>>>> types — this would encourage us to build sufficient > >>> integration > >>>>> e.g. > >>>>>>>>> with NumPy to get things working end-to-end with the on-wire > >>>>>>>>> representation being an extension type. We could certainly > >>> choose > >>>>> to > >>>>>>>>> treat the type as "first class" in the C++ library without it > >>>> being > >>>>>>>>> "top level" in the Type union in Flatbuffers. > >>>>>>>>> > >>>>>>>>> I agree that the use cases are more specialized, and the fact > >>>> that > >>>>> we > >>>>>>>>> haven't needed it until now (or at least, its absence suggests > >>>>> this) > >>>>>>>>> shows that this is the case. > >>>>>>>>> > >>>>>>>>> On Thu, Jun 10, 2021 at 4:17 PM Micah Kornfield < > >>>>>> emkornfi...@gmail.com> > >>>>>>> wrote: > >>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> I'm convinced now that first-class types seem to be the > >>> way > >>>> to > >>>>>> go > >>>>>>> and I'm > >>>>>>>>>>> happy to take this approach. > >>>>>>>>>> > >>>>>>>>>> I agree from an implementation effort it is simpler, but I'm > >>>>> still > >>>>>>> not > >>>>>>>>>> convinced that we should be adding this as a first class > >>> type. > >>>>> As > >>>>>>> noted in > >>>>>>>>>> the survey below it appears Complex numbers are not a core > >>>>> concept > >>>>>>> in many > >>>>>>>>>> general purpose coding languages and it doesn't appear to > >>> be a > >>>>>>> common type > >>>>>>>>>> in SQL systems either. > >>>>>>>>>> > >>>>>>>>>> The reason why I am being nit-picky here is I think that > >>>> having a > >>>>>>> first > >>>>>>>>>> class type indicates that it should eventually be supported > >>> by > >>>>> all > >>>>>>>>>> reference implementations. An "well known" extension type I > >>>>> think > >>>>>>> offers > >>>>>>>>>> less guarantees which makes it seem more suitable for niche > >>>>> types. > >>>>>>>>>> > >>>>>>>>>>> I don't immediately see a Packed Struct type. Would this > >>> need > >>>>> to > >>>>>> be > >>>>>>>>>>>> implemented? > >>>>>>>>>>> Not necessarily (*). But before thinking about > >>>> implementation, > >>>>>>> this > >>>>>>>>>>> proposal must be accepted into the format. > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> Yes, this is a type that has been proposed in the past and I > >>>>> think > >>>>>>> handles > >>>>>>>>>> a lot of types not yet in Arrow but have been requested > >>> (e.g. > >>>> IP > >>>>>>>>>> Addresses, Geo coordinates), etc. > >>>>>>>>>> > >>>>>>>>>> On Thu, Jun 10, 2021 at 1:06 AM Simon Perkins < > >>>>>>> simon.perk...@gmail.com> > >>>>>>>>>> wrote: > >>>>>>>>>> > >>>>>>>>>>> On Wed, Jun 9, 2021 at 7:56 PM Antoine Pitrou < > >>>>>> anto...@python.org> > >>>>>>> wrote: > >>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> Le 09/06/2021 à 17:52, Micah Kornfield a écrit : > >>>>>>>>>>>>> > >>>>>>>>>>>>> Adding a new first-class type in Arrow requires > >>> working > >>>>>>> integration > >>>>>>>>>>> tests > >>>>>>>>>>>>> between C++ and Java libraries (once the idea is > >>>> informally > >>>>>>> agreed > >>>>>>>>>>> upon) > >>>>>>>>>>>>> and then a final vote for approval. We haven't > >>>> formalized > >>>>>>> extension > >>>>>>>>>>>> types > >>>>>>>>>>>>> but I imagine a similar cross language requirement > >>> would > >>>> be > >>>>>>> agreed > >>>>>>>>>>> upon. > >>>>>>>>>>>>> Implementation of computation wouldn't be required for > >>>>> adding > >>>>>>> a new > >>>>>>>>>>> type. > >>>>>>>>>>>>> Different language bindings have taken different > >>>> approaches > >>>>>> on > >>>>>>> how much > >>>>>>>>>>>>> additional computational elements are packaged in > >>> them. > >>>>>>>>>>>> > >>>>>>>>>>>> While dedicated types are not strictly required, compute > >>>>>>> functions would > >>>>>>>>>>>> be much easier to add for a first-class dedicated > >>> complex > >>>>>>> datatype > >>>>>>>>>>>> rather than for an extension type. > >>>>>>>>>>>> > >>>>>>>>>>>> Since complex numbers are quite common in some domains, > >>> and > >>>>>>> since they > >>>>>>>>>>>> are conceptually simply, IMHO it would make sense to add > >>>> them > >>>>>> to > >>>>>>> the > >>>>>>>>>>>> native Arrow datatypes (at least COMPLEX64 and > >>> COMPLEX128). > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> I'm convinced now that first-class types seem to be the > >>> way > >>>> to > >>>>>> go > >>>>>>> and I'm > >>>>>>>>>>> happy to take this approach. > >>>>>>>>>>> Regarding compute functions, it looks like the standard > >>> set > >>>> of > >>>>>>> scalar > >>>>>>>>>>> arithmetic and reduction functionality > >>>>>>>>>>> is desirable for complex numbers: > >>>>>>>>>>> https://arrow.apache.org/docs/cpp/compute.html# > >>>>>>>>>>> Perhaps it would be better to split the addition of the > >>> Types > >>>>> and > >>>>>>> addition > >>>>>>>>>>> Compute functionality into separate PRs? > >>>>>>>>>>> > >>>>>>>>>>> Regarding the process for managing this PR, it sounds > >>> like a > >>>>>>> proposal must > >>>>>>>>>>> be voted on? > >>>>>>>>>>> i.e. is this proposal still in this phase > >>>>>>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> > http://arrow.apache.org/docs/developers/contributing.html#before-starting > >>>>>>>>>>> Regards > >>>>>>>>>>> > >>>>>>>>>>> Simon > >>>>>>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > > >