To put it another way, an Extension Type technically has Type::EXTENSION,
but now there's Type::COMPLEX_FLOAT and Type::COMPLEX_DOUBLE.

When checking enums, the code see's a Type::COMPLEX_FLOAT and seems to
mismatch on ComplexFloatType::Type::type_id, as the latter is
Type::EXTENSION?

On Mon, Jun 21, 2021 at 3:11 PM Simon Perkins <simon.perk...@gmail.com>
wrote:

> I did some exploratory coding adding Complex Numbers as ExtensionTypes in
> this PR: https://github.com/apache/arrow/pull/10565
>
> > My understanding is that it means having COMPLEX as an entry in the
> arrow/type_fwd.h Type enum. I agree this would make implementation
> work in the C++ library much more straightforward.
>
> I implemented this approach, adding COMPLEX_FLOAT and COMPLEX_DOUBLE
> entries to the Type enum.
> One thing I noted is that at least some portion of the code base
> (visitor_inline.h) expects the ExtensionTypes
> to have first-class type-like interfaces (i.e. needs a TypeTraits entry,
> Visitor definitions).
>
> My impression at this point, is that the work in implementing a hybrid
> approach
> (i.e. ExtensionType with a Type enum entry) replicates that of adding a
> first-class type.
> As I am not extensively familiar with the code base, I thought I'd just
> check whether this impression
> is correct.
>
> regards
>   Simon
>
>
>
>
>
> On Fri, Jun 11, 2021 at 6:52 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> >
>> >  It might help this discussion and future discussions like it if we
>> could
>> > define how it is determined whether a type should be part of the Arrow
>> > format, an extension type (and what does it mean to say there is a
>> > "canonical" extension type), or just something that a language
>> > implementation or downstream library builds for itself with metadata. I
>> > feel like this has come up before but I don't recall a resolution.
>>
>>
>> There seemed to be  consensus, but I guess we never formally voted on the
>> decision points here:
>>
>> https://lists.apache.org/thread.html/r7ba08aed2809fa64537e6f44bce38b2cf740acbef0e91cfaa7c19767%40%3Cdev.arrow.apache.org%3E
>>
>> Applying the criteria to complex types:
>> 1.  Is the type a new parameterization of an existing type?  No
>>
>> 2.  Does the type itself have its own specification for processing (e.g.
>> JSON, BSON, Thrift, Avro, Protobuf)? No
>>
>> 3.  Is the underlying encoding of the type already semantically supported
>> by a type?  Yes.  Two have been mentioned in this thread and I would also
>> support adding a new packed struct type, but it appears isn't necessary
>> for
>> this. Note that FixedSizeLists have some limitations in regards to parquet
>> compatibility around nullability, there might be a few other sharp edges.
>>
>> So if we use this criteria we would lean towards an extension type.
>>
>> We never converged on a standard for "canonical" extension types.  I would
>> propose it roughly be the same criteria as a first class type:
>> 1.  Specification/document update PR that describes the representation
>> 2.  Implementation showing working integration tests across two languages
>> (for canonical types I think this can be any 2 languages instead of C++
>> and
>> Java)
>> 3.  Formal vote accepting the canonical type.
>>
>> Thanks,
>> Micah
>>
>>
>>
>> On Thu, Jun 10, 2021 at 9:34 PM Jorge Cardoso Leitão <
>> jorgecarlei...@gmail.com> wrote:
>>
>> > Isn't an array of complexes represented by what arrow already supports?
>> In
>> > particular, I see at least two valid in-memory representations to use,
>> that
>> > depend on what we are going to do with it:
>> >
>> > * Struct[re, im]
>> > * FixedList[2]
>> >
>> > In the first case, we have two buffers, [x0, x1, ...] and [y0, y1,
>> ...], in
>> > the second case we have 1 buffer, [x0, y0, x1, y1, ...].
>> >
>> > The first representation is useful for column-based operations (e.g.
>> taking
>> > the real part in case 1 is trivial; requires a copy in the second case),
>> > the second representation is useful for row-base operations (e.g. "take"
>> > and "filter" require a single pass over buffer 1). Case 2 does not
>> support
>> > Re and Im of different physical types (arguably an issue). Both cases
>> > support nullability of individual items or combined.
>> >
>> > What I conclude is that this does not seem to be a problem about a base
>> > in-memory representation, but rather on whether we agree on a
>> > representation that justifies adding associated metadata to the spec.
>> >
>> > The case for the complex interval type recently proposed [1] is more
>> > compelling to me because a complex ops over intervals usually required
>> all
>> > parts of the interval (and thus the "FixedList" representation is more
>> > compelling), but each part has a different type. I.e. it is like a
>> > "FixedTypedList[int32, int32, int64]", which we do not natively support.
>> >
>> > [1] https://github.com/apache/arrow/pull/10177
>> >
>> > Best,
>> > Jorge
>> >
>> >
>> >
>> > On Fri, Jun 11, 2021 at 1:48 AM Neal Richardson <
>> > neal.p.richard...@gmail.com>
>> > wrote:
>> >
>> > >  It might help this discussion and future discussions like it if we
>> could
>> > > define how it is determined whether a type should be part of the Arrow
>> > > format, an extension type (and what does it mean to say there is a
>> > > "canonical" extension type), or just something that a language
>> > > implementation or downstream library builds for itself with metadata.
>> I
>> > > feel like this has come up before but I don't recall a resolution.
>> > >
>> > > Examples might also help: are there examples of "canonical extension
>> > > types"?
>> > >
>> > > Neal
>> > >
>> > > On Thu, Jun 10, 2021 at 4:20 PM Micah Kornfield <
>> emkornfi...@gmail.com>
>> > > wrote:
>> > >
>> > > > >
>> > > > > My understanding is that it means having COMPLEX as an entry in
>> the
>> > > > > arrow/type_fwd.h Type enum. I agree this would make implementation
>> > > > > work in the C++ library much more straightforward.
>> > > >
>> > > > One idea I proposed would be to do that, and implement the
>> > > > > serialization of the complex metadata using Extension types.
>> > > >
>> > > >
>> > > > If this is a maintainable strategy for Canonical types it sounds
>> good
>> > to
>> > > > me.
>> > > >
>> > > > On Thu, Jun 10, 2021 at 4:02 PM Wes McKinney <wesmck...@gmail.com>
>> > > wrote:
>> > > >
>> > > > > My understanding is that it means having COMPLEX as an entry in
>> the
>> > > > > arrow/type_fwd.h Type enum. I agree this would make implementation
>> > > > > work in the C++ library much more straightforward.
>> > > > >
>> > > > > One idea I proposed would be to do that, and implement the
>> > > > > serialization of the complex metadata using Extension types.
>> > > > >
>> > > > > On Thu, Jun 10, 2021 at 5:47 PM Weston Pace <
>> weston.p...@gmail.com>
>> > > > wrote:
>> > > > > >
>> > > > > > > While dedicated types are not strictly required, compute
>> > functions
>> > > > > would
>> > > > > > > be much easier to add for a first-class dedicated complex
>> > datatype
>> > > > > > > rather than for an extension type.
>> > > > > > @pitrou
>> > > > > >
>> > > > > > This is perhaps a naive question (and admittedly, I'm not up to
>> > speed
>> > > > > > on my compute kernels) but why is this the case?  For example,
>> if
>> > > > > > adding a complex addition kernel it seems we would be talking
>> > > about...
>> > > > > >
>> > > > > > dest_scalar.real = scalar1.real + scalar2.real;
>> > > > > > dest_scalar.im = scalar1.im + scalar2.im;
>> > > > > >
>> > > > > > vs...
>> > > > > >
>> > > > > > dest_scalar[0] = scalar1[0] + scalar2[0];
>> > > > > > dest_scalar[1] = scalar1[1] + scalar2[1];
>> > > > > >
>> > > > > > On Thu, Jun 10, 2021 at 11:27 AM Wes McKinney <
>> wesmck...@gmail.com
>> > >
>> > > > > wrote:
>> > > > > > >
>> > > > > > > I'd be supportive of starting with this as a "canonical"
>> > extension
>> > > > > > > type so that all implementations are not expected to support
>> > > complex
>> > > > > > > types — this would encourage us to build sufficient
>> integration
>> > > e.g.
>> > > > > > > with NumPy to get things working end-to-end with the on-wire
>> > > > > > > representation being an extension type. We could certainly
>> choose
>> > > to
>> > > > > > > treat the type as "first class" in the C++ library without it
>> > being
>> > > > > > > "top level" in the Type union in Flatbuffers.
>> > > > > > >
>> > > > > > > I agree that the use cases are more specialized, and the fact
>> > that
>> > > we
>> > > > > > > haven't needed it until now (or at least, its absence suggests
>> > > this)
>> > > > > > > shows that this is the case.
>> > > > > > >
>> > > > > > > On Thu, Jun 10, 2021 at 4:17 PM Micah Kornfield <
>> > > > emkornfi...@gmail.com>
>> > > > > wrote:
>> > > > > > > >
>> > > > > > > > >
>> > > > > > > > > I'm convinced now that  first-class types seem to be the
>> way
>> > to
>> > > > go
>> > > > > and I'm
>> > > > > > > > > happy to take this approach.
>> > > > > > > >
>> > > > > > > > I agree from an implementation effort it is simpler, but I'm
>> > > still
>> > > > > not
>> > > > > > > > convinced that we should be adding this as a first class
>> type.
>> > > As
>> > > > > noted in
>> > > > > > > > the survey below it appears Complex numbers are not a core
>> > > concept
>> > > > > in many
>> > > > > > > > general purpose coding languages and it doesn't appear to
>> be a
>> > > > > common type
>> > > > > > > > in SQL systems either.
>> > > > > > > >
>> > > > > > > > The reason why I am being nit-picky here is I think that
>> > having a
>> > > > > first
>> > > > > > > > class type indicates that it should eventually be supported
>> by
>> > > all
>> > > > > > > > reference implementations.  An "well known" extension type I
>> > > think
>> > > > > offers
>> > > > > > > > less guarantees which makes it seem more suitable for niche
>> > > types.
>> > > > > > > >
>> > > > > > > > > I don't immediately see a Packed Struct type. Would this
>> need
>> > > to
>> > > > be
>> > > > > > > > > > implemented?
>> > > > > > > > > Not necessarily (*).  But before thinking about
>> > implementation,
>> > > > > this
>> > > > > > > > > proposal must be accepted into the format.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > Yes, this is a type that has been proposed in the past and I
>> > > think
>> > > > > handles
>> > > > > > > > a lot of  types not yet in Arrow but have been requested
>> (e.g.
>> > IP
>> > > > > > > > Addresses, Geo coordinates), etc.
>> > > > > > > >
>> > > > > > > > On Thu, Jun 10, 2021 at 1:06 AM Simon Perkins <
>> > > > > simon.perk...@gmail.com>
>> > > > > > > > wrote:
>> > > > > > > >
>> > > > > > > > > On Wed, Jun 9, 2021 at 7:56 PM Antoine Pitrou <
>> > > > anto...@python.org>
>> > > > > wrote:
>> > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > > Le 09/06/2021 à 17:52, Micah Kornfield a écrit :
>> > > > > > > > > > >
>> > > > > > > > > > > Adding a new first-class type in Arrow requires
>> working
>> > > > > integration
>> > > > > > > > > tests
>> > > > > > > > > > > between C++ and Java libraries (once the idea is
>> > informally
>> > > > > agreed
>> > > > > > > > > upon)
>> > > > > > > > > > > and then a final vote for approval.  We haven't
>> > formalized
>> > > > > extension
>> > > > > > > > > > types
>> > > > > > > > > > > but I imagine a similar cross language requirement
>> would
>> > be
>> > > > > agreed
>> > > > > > > > > upon.
>> > > > > > > > > > > Implementation of computation wouldn't be required for
>> > > adding
>> > > > > a new
>> > > > > > > > > type.
>> > > > > > > > > > > Different language bindings have taken different
>> > approaches
>> > > > on
>> > > > > how much
>> > > > > > > > > > > additional computational elements are packaged in
>> them.
>> > > > > > > > > >
>> > > > > > > > > > While dedicated types are not strictly required, compute
>> > > > > functions would
>> > > > > > > > > > be much easier to add for a first-class dedicated
>> complex
>> > > > > datatype
>> > > > > > > > > > rather than for an extension type.
>> > > > > > > > > >
>> > > > > > > > > > Since complex numbers are quite common in some domains,
>> and
>> > > > > since they
>> > > > > > > > > > are conceptually simply, IMHO it would make sense to add
>> > them
>> > > > to
>> > > > > the
>> > > > > > > > > > native Arrow datatypes (at least COMPLEX64 and
>> COMPLEX128).
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > I'm convinced now that  first-class types seem to be the
>> way
>> > to
>> > > > go
>> > > > > and I'm
>> > > > > > > > > happy to take this approach.
>> > > > > > > > > Regarding compute functions, it looks like the standard
>> set
>> > of
>> > > > > scalar
>> > > > > > > > > arithmetic and reduction functionality
>> > > > > > > > > is desirable for complex numbers:
>> > > > > > > > > https://arrow.apache.org/docs/cpp/compute.html#
>> > > > > > > > > Perhaps it would be better to split the addition of the
>> Types
>> > > and
>> > > > > addition
>> > > > > > > > > Compute functionality into separate PRs?
>> > > > > > > > >
>> > > > > > > > > Regarding the process for managing this PR, it sounds
>> like a
>> > > > > proposal must
>> > > > > > > > > be voted on?
>> > > > > > > > > i.e. is this proposal still in this phase
>> > > > > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> http://arrow.apache.org/docs/developers/contributing.html#before-starting
>> > > > > > > > > Regards
>> > > > > > > > >
>> > > > > > > > > Simon
>> > > > > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>

Reply via email to