To put it another way, an Extension Type technically has Type::EXTENSION, but now there's Type::COMPLEX_FLOAT and Type::COMPLEX_DOUBLE.
When checking enums, the code see's a Type::COMPLEX_FLOAT and seems to mismatch on ComplexFloatType::Type::type_id, as the latter is Type::EXTENSION? On Mon, Jun 21, 2021 at 3:11 PM Simon Perkins <simon.perk...@gmail.com> wrote: > I did some exploratory coding adding Complex Numbers as ExtensionTypes in > this PR: https://github.com/apache/arrow/pull/10565 > > > My understanding is that it means having COMPLEX as an entry in the > arrow/type_fwd.h Type enum. I agree this would make implementation > work in the C++ library much more straightforward. > > I implemented this approach, adding COMPLEX_FLOAT and COMPLEX_DOUBLE > entries to the Type enum. > One thing I noted is that at least some portion of the code base > (visitor_inline.h) expects the ExtensionTypes > to have first-class type-like interfaces (i.e. needs a TypeTraits entry, > Visitor definitions). > > My impression at this point, is that the work in implementing a hybrid > approach > (i.e. ExtensionType with a Type enum entry) replicates that of adding a > first-class type. > As I am not extensively familiar with the code base, I thought I'd just > check whether this impression > is correct. > > regards > Simon > > > > > > On Fri, Jun 11, 2021 at 6:52 AM Micah Kornfield <emkornfi...@gmail.com> > wrote: > >> > >> > It might help this discussion and future discussions like it if we >> could >> > define how it is determined whether a type should be part of the Arrow >> > format, an extension type (and what does it mean to say there is a >> > "canonical" extension type), or just something that a language >> > implementation or downstream library builds for itself with metadata. I >> > feel like this has come up before but I don't recall a resolution. >> >> >> There seemed to be consensus, but I guess we never formally voted on the >> decision points here: >> >> https://lists.apache.org/thread.html/r7ba08aed2809fa64537e6f44bce38b2cf740acbef0e91cfaa7c19767%40%3Cdev.arrow.apache.org%3E >> >> Applying the criteria to complex types: >> 1. Is the type a new parameterization of an existing type? No >> >> 2. Does the type itself have its own specification for processing (e.g. >> JSON, BSON, Thrift, Avro, Protobuf)? No >> >> 3. Is the underlying encoding of the type already semantically supported >> by a type? Yes. Two have been mentioned in this thread and I would also >> support adding a new packed struct type, but it appears isn't necessary >> for >> this. Note that FixedSizeLists have some limitations in regards to parquet >> compatibility around nullability, there might be a few other sharp edges. >> >> So if we use this criteria we would lean towards an extension type. >> >> We never converged on a standard for "canonical" extension types. I would >> propose it roughly be the same criteria as a first class type: >> 1. Specification/document update PR that describes the representation >> 2. Implementation showing working integration tests across two languages >> (for canonical types I think this can be any 2 languages instead of C++ >> and >> Java) >> 3. Formal vote accepting the canonical type. >> >> Thanks, >> Micah >> >> >> >> On Thu, Jun 10, 2021 at 9:34 PM Jorge Cardoso Leitão < >> jorgecarlei...@gmail.com> wrote: >> >> > Isn't an array of complexes represented by what arrow already supports? >> In >> > particular, I see at least two valid in-memory representations to use, >> that >> > depend on what we are going to do with it: >> > >> > * Struct[re, im] >> > * FixedList[2] >> > >> > In the first case, we have two buffers, [x0, x1, ...] and [y0, y1, >> ...], in >> > the second case we have 1 buffer, [x0, y0, x1, y1, ...]. >> > >> > The first representation is useful for column-based operations (e.g. >> taking >> > the real part in case 1 is trivial; requires a copy in the second case), >> > the second representation is useful for row-base operations (e.g. "take" >> > and "filter" require a single pass over buffer 1). Case 2 does not >> support >> > Re and Im of different physical types (arguably an issue). Both cases >> > support nullability of individual items or combined. >> > >> > What I conclude is that this does not seem to be a problem about a base >> > in-memory representation, but rather on whether we agree on a >> > representation that justifies adding associated metadata to the spec. >> > >> > The case for the complex interval type recently proposed [1] is more >> > compelling to me because a complex ops over intervals usually required >> all >> > parts of the interval (and thus the "FixedList" representation is more >> > compelling), but each part has a different type. I.e. it is like a >> > "FixedTypedList[int32, int32, int64]", which we do not natively support. >> > >> > [1] https://github.com/apache/arrow/pull/10177 >> > >> > Best, >> > Jorge >> > >> > >> > >> > On Fri, Jun 11, 2021 at 1:48 AM Neal Richardson < >> > neal.p.richard...@gmail.com> >> > wrote: >> > >> > > It might help this discussion and future discussions like it if we >> could >> > > define how it is determined whether a type should be part of the Arrow >> > > format, an extension type (and what does it mean to say there is a >> > > "canonical" extension type), or just something that a language >> > > implementation or downstream library builds for itself with metadata. >> I >> > > feel like this has come up before but I don't recall a resolution. >> > > >> > > Examples might also help: are there examples of "canonical extension >> > > types"? >> > > >> > > Neal >> > > >> > > On Thu, Jun 10, 2021 at 4:20 PM Micah Kornfield < >> emkornfi...@gmail.com> >> > > wrote: >> > > >> > > > > >> > > > > My understanding is that it means having COMPLEX as an entry in >> the >> > > > > arrow/type_fwd.h Type enum. I agree this would make implementation >> > > > > work in the C++ library much more straightforward. >> > > > >> > > > One idea I proposed would be to do that, and implement the >> > > > > serialization of the complex metadata using Extension types. >> > > > >> > > > >> > > > If this is a maintainable strategy for Canonical types it sounds >> good >> > to >> > > > me. >> > > > >> > > > On Thu, Jun 10, 2021 at 4:02 PM Wes McKinney <wesmck...@gmail.com> >> > > wrote: >> > > > >> > > > > My understanding is that it means having COMPLEX as an entry in >> the >> > > > > arrow/type_fwd.h Type enum. I agree this would make implementation >> > > > > work in the C++ library much more straightforward. >> > > > > >> > > > > One idea I proposed would be to do that, and implement the >> > > > > serialization of the complex metadata using Extension types. >> > > > > >> > > > > On Thu, Jun 10, 2021 at 5:47 PM Weston Pace < >> weston.p...@gmail.com> >> > > > wrote: >> > > > > > >> > > > > > > While dedicated types are not strictly required, compute >> > functions >> > > > > would >> > > > > > > be much easier to add for a first-class dedicated complex >> > datatype >> > > > > > > rather than for an extension type. >> > > > > > @pitrou >> > > > > > >> > > > > > This is perhaps a naive question (and admittedly, I'm not up to >> > speed >> > > > > > on my compute kernels) but why is this the case? For example, >> if >> > > > > > adding a complex addition kernel it seems we would be talking >> > > about... >> > > > > > >> > > > > > dest_scalar.real = scalar1.real + scalar2.real; >> > > > > > dest_scalar.im = scalar1.im + scalar2.im; >> > > > > > >> > > > > > vs... >> > > > > > >> > > > > > dest_scalar[0] = scalar1[0] + scalar2[0]; >> > > > > > dest_scalar[1] = scalar1[1] + scalar2[1]; >> > > > > > >> > > > > > On Thu, Jun 10, 2021 at 11:27 AM Wes McKinney < >> wesmck...@gmail.com >> > > >> > > > > wrote: >> > > > > > > >> > > > > > > I'd be supportive of starting with this as a "canonical" >> > extension >> > > > > > > type so that all implementations are not expected to support >> > > complex >> > > > > > > types — this would encourage us to build sufficient >> integration >> > > e.g. >> > > > > > > with NumPy to get things working end-to-end with the on-wire >> > > > > > > representation being an extension type. We could certainly >> choose >> > > to >> > > > > > > treat the type as "first class" in the C++ library without it >> > being >> > > > > > > "top level" in the Type union in Flatbuffers. >> > > > > > > >> > > > > > > I agree that the use cases are more specialized, and the fact >> > that >> > > we >> > > > > > > haven't needed it until now (or at least, its absence suggests >> > > this) >> > > > > > > shows that this is the case. >> > > > > > > >> > > > > > > On Thu, Jun 10, 2021 at 4:17 PM Micah Kornfield < >> > > > emkornfi...@gmail.com> >> > > > > wrote: >> > > > > > > > >> > > > > > > > > >> > > > > > > > > I'm convinced now that first-class types seem to be the >> way >> > to >> > > > go >> > > > > and I'm >> > > > > > > > > happy to take this approach. >> > > > > > > > >> > > > > > > > I agree from an implementation effort it is simpler, but I'm >> > > still >> > > > > not >> > > > > > > > convinced that we should be adding this as a first class >> type. >> > > As >> > > > > noted in >> > > > > > > > the survey below it appears Complex numbers are not a core >> > > concept >> > > > > in many >> > > > > > > > general purpose coding languages and it doesn't appear to >> be a >> > > > > common type >> > > > > > > > in SQL systems either. >> > > > > > > > >> > > > > > > > The reason why I am being nit-picky here is I think that >> > having a >> > > > > first >> > > > > > > > class type indicates that it should eventually be supported >> by >> > > all >> > > > > > > > reference implementations. An "well known" extension type I >> > > think >> > > > > offers >> > > > > > > > less guarantees which makes it seem more suitable for niche >> > > types. >> > > > > > > > >> > > > > > > > > I don't immediately see a Packed Struct type. Would this >> need >> > > to >> > > > be >> > > > > > > > > > implemented? >> > > > > > > > > Not necessarily (*). But before thinking about >> > implementation, >> > > > > this >> > > > > > > > > proposal must be accepted into the format. >> > > > > > > > >> > > > > > > > >> > > > > > > > Yes, this is a type that has been proposed in the past and I >> > > think >> > > > > handles >> > > > > > > > a lot of types not yet in Arrow but have been requested >> (e.g. >> > IP >> > > > > > > > Addresses, Geo coordinates), etc. >> > > > > > > > >> > > > > > > > On Thu, Jun 10, 2021 at 1:06 AM Simon Perkins < >> > > > > simon.perk...@gmail.com> >> > > > > > > > wrote: >> > > > > > > > >> > > > > > > > > On Wed, Jun 9, 2021 at 7:56 PM Antoine Pitrou < >> > > > anto...@python.org> >> > > > > wrote: >> > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > Le 09/06/2021 à 17:52, Micah Kornfield a écrit : >> > > > > > > > > > > >> > > > > > > > > > > Adding a new first-class type in Arrow requires >> working >> > > > > integration >> > > > > > > > > tests >> > > > > > > > > > > between C++ and Java libraries (once the idea is >> > informally >> > > > > agreed >> > > > > > > > > upon) >> > > > > > > > > > > and then a final vote for approval. We haven't >> > formalized >> > > > > extension >> > > > > > > > > > types >> > > > > > > > > > > but I imagine a similar cross language requirement >> would >> > be >> > > > > agreed >> > > > > > > > > upon. >> > > > > > > > > > > Implementation of computation wouldn't be required for >> > > adding >> > > > > a new >> > > > > > > > > type. >> > > > > > > > > > > Different language bindings have taken different >> > approaches >> > > > on >> > > > > how much >> > > > > > > > > > > additional computational elements are packaged in >> them. >> > > > > > > > > > >> > > > > > > > > > While dedicated types are not strictly required, compute >> > > > > functions would >> > > > > > > > > > be much easier to add for a first-class dedicated >> complex >> > > > > datatype >> > > > > > > > > > rather than for an extension type. >> > > > > > > > > > >> > > > > > > > > > Since complex numbers are quite common in some domains, >> and >> > > > > since they >> > > > > > > > > > are conceptually simply, IMHO it would make sense to add >> > them >> > > > to >> > > > > the >> > > > > > > > > > native Arrow datatypes (at least COMPLEX64 and >> COMPLEX128). >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > > I'm convinced now that first-class types seem to be the >> way >> > to >> > > > go >> > > > > and I'm >> > > > > > > > > happy to take this approach. >> > > > > > > > > Regarding compute functions, it looks like the standard >> set >> > of >> > > > > scalar >> > > > > > > > > arithmetic and reduction functionality >> > > > > > > > > is desirable for complex numbers: >> > > > > > > > > https://arrow.apache.org/docs/cpp/compute.html# >> > > > > > > > > Perhaps it would be better to split the addition of the >> Types >> > > and >> > > > > addition >> > > > > > > > > Compute functionality into separate PRs? >> > > > > > > > > >> > > > > > > > > Regarding the process for managing this PR, it sounds >> like a >> > > > > proposal must >> > > > > > > > > be voted on? >> > > > > > > > > i.e. is this proposal still in this phase >> > > > > > > > > >> > > > > >> > > > >> > > >> > >> http://arrow.apache.org/docs/developers/contributing.html#before-starting >> > > > > > > > > Regards >> > > > > > > > > >> > > > > > > > > Simon >> > > > > > > > > >> > > > > >> > > > >> > > >> > >> >