> What I conclude is that this does not seem to be a problem about a base
> in-memory representation, but rather on whether we agree on a
> representation that justifies adding associated metadata to the spec.
>

It would still be desirable to maintain the memory layout of C/C++/NumPy to
maintain zero-copy.
FixedList[2] maintains this layout, while a Struct[re, im] does not.


On Fri, Jun 11, 2021 at 6:34 AM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Isn't an array of complexes represented by what arrow already supports? In
> particular, I see at least two valid in-memory representations to use, that
> depend on what we are going to do with it:
>
> * Struct[re, im]
> * FixedList[2]
>
> In the first case, we have two buffers, [x0, x1, ...] and [y0, y1, ...], in
> the second case we have 1 buffer, [x0, y0, x1, y1, ...].
>
> The first representation is useful for column-based operations (e.g. taking
> the real part in case 1 is trivial; requires a copy in the second case),
> the second representation is useful for row-base operations (e.g. "take"
> and "filter" require a single pass over buffer 1). Case 2 does not support
> Re and Im of different physical types (arguably an issue). Both cases
> support nullability of individual items or combined.
>
> What I conclude is that this does not seem to be a problem about a base
> in-memory representation, but rather on whether we agree on a
> representation that justifies adding associated metadata to the spec.
>
> The case for the complex interval type recently proposed [1] is more
> compelling to me because a complex ops over intervals usually required all
> parts of the interval (and thus the "FixedList" representation is more
> compelling), but each part has a different type. I.e. it is like a
> "FixedTypedList[int32, int32, int64]", which we do not natively support.
>
> [1] https://github.com/apache/arrow/pull/10177
>
> Best,
> Jorge
>
>
>
> On Fri, Jun 11, 2021 at 1:48 AM Neal Richardson <
> neal.p.richard...@gmail.com>
> wrote:
>
> >  It might help this discussion and future discussions like it if we could
> > define how it is determined whether a type should be part of the Arrow
> > format, an extension type (and what does it mean to say there is a
> > "canonical" extension type), or just something that a language
> > implementation or downstream library builds for itself with metadata. I
> > feel like this has come up before but I don't recall a resolution.
> >
> > Examples might also help: are there examples of "canonical extension
> > types"?
> >
> > Neal
> >
> > On Thu, Jun 10, 2021 at 4:20 PM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> > > >
> > > > My understanding is that it means having COMPLEX as an entry in the
> > > > arrow/type_fwd.h Type enum. I agree this would make implementation
> > > > work in the C++ library much more straightforward.
> > >
> > > One idea I proposed would be to do that, and implement the
> > > > serialization of the complex metadata using Extension types.
> > >
> > >
> > > If this is a maintainable strategy for Canonical types it sounds good
> to
> > > me.
> > >
> > > On Thu, Jun 10, 2021 at 4:02 PM Wes McKinney <wesmck...@gmail.com>
> > wrote:
> > >
> > > > My understanding is that it means having COMPLEX as an entry in the
> > > > arrow/type_fwd.h Type enum. I agree this would make implementation
> > > > work in the C++ library much more straightforward.
> > > >
> > > > One idea I proposed would be to do that, and implement the
> > > > serialization of the complex metadata using Extension types.
> > > >
> > > > On Thu, Jun 10, 2021 at 5:47 PM Weston Pace <weston.p...@gmail.com>
> > > wrote:
> > > > >
> > > > > > While dedicated types are not strictly required, compute
> functions
> > > > would
> > > > > > be much easier to add for a first-class dedicated complex
> datatype
> > > > > > rather than for an extension type.
> > > > > @pitrou
> > > > >
> > > > > This is perhaps a naive question (and admittedly, I'm not up to
> speed
> > > > > on my compute kernels) but why is this the case?  For example, if
> > > > > adding a complex addition kernel it seems we would be talking
> > about...
> > > > >
> > > > > dest_scalar.real = scalar1.real + scalar2.real;
> > > > > dest_scalar.im = scalar1.im + scalar2.im;
> > > > >
> > > > > vs...
> > > > >
> > > > > dest_scalar[0] = scalar1[0] + scalar2[0];
> > > > > dest_scalar[1] = scalar1[1] + scalar2[1];
> > > > >
> > > > > On Thu, Jun 10, 2021 at 11:27 AM Wes McKinney <wesmck...@gmail.com
> >
> > > > wrote:
> > > > > >
> > > > > > I'd be supportive of starting with this as a "canonical"
> extension
> > > > > > type so that all implementations are not expected to support
> > complex
> > > > > > types — this would encourage us to build sufficient integration
> > e.g.
> > > > > > with NumPy to get things working end-to-end with the on-wire
> > > > > > representation being an extension type. We could certainly choose
> > to
> > > > > > treat the type as "first class" in the C++ library without it
> being
> > > > > > "top level" in the Type union in Flatbuffers.
> > > > > >
> > > > > > I agree that the use cases are more specialized, and the fact
> that
> > we
> > > > > > haven't needed it until now (or at least, its absence suggests
> > this)
> > > > > > shows that this is the case.
> > > > > >
> > > > > > On Thu, Jun 10, 2021 at 4:17 PM Micah Kornfield <
> > > emkornfi...@gmail.com>
> > > > wrote:
> > > > > > >
> > > > > > > >
> > > > > > > > I'm convinced now that  first-class types seem to be the way
> to
> > > go
> > > > and I'm
> > > > > > > > happy to take this approach.
> > > > > > >
> > > > > > > I agree from an implementation effort it is simpler, but I'm
> > still
> > > > not
> > > > > > > convinced that we should be adding this as a first class type.
> > As
> > > > noted in
> > > > > > > the survey below it appears Complex numbers are not a core
> > concept
> > > > in many
> > > > > > > general purpose coding languages and it doesn't appear to be a
> > > > common type
> > > > > > > in SQL systems either.
> > > > > > >
> > > > > > > The reason why I am being nit-picky here is I think that
> having a
> > > > first
> > > > > > > class type indicates that it should eventually be supported by
> > all
> > > > > > > reference implementations.  An "well known" extension type I
> > think
> > > > offers
> > > > > > > less guarantees which makes it seem more suitable for niche
> > types.
> > > > > > >
> > > > > > > > I don't immediately see a Packed Struct type. Would this need
> > to
> > > be
> > > > > > > > > implemented?
> > > > > > > > Not necessarily (*).  But before thinking about
> implementation,
> > > > this
> > > > > > > > proposal must be accepted into the format.
> > > > > > >
> > > > > > >
> > > > > > > Yes, this is a type that has been proposed in the past and I
> > think
> > > > handles
> > > > > > > a lot of  types not yet in Arrow but have been requested (e.g.
> IP
> > > > > > > Addresses, Geo coordinates), etc.
> > > > > > >
> > > > > > > On Thu, Jun 10, 2021 at 1:06 AM Simon Perkins <
> > > > simon.perk...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > On Wed, Jun 9, 2021 at 7:56 PM Antoine Pitrou <
> > > anto...@python.org>
> > > > wrote:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Le 09/06/2021 à 17:52, Micah Kornfield a écrit :
> > > > > > > > > >
> > > > > > > > > > Adding a new first-class type in Arrow requires working
> > > > integration
> > > > > > > > tests
> > > > > > > > > > between C++ and Java libraries (once the idea is
> informally
> > > > agreed
> > > > > > > > upon)
> > > > > > > > > > and then a final vote for approval.  We haven't
> formalized
> > > > extension
> > > > > > > > > types
> > > > > > > > > > but I imagine a similar cross language requirement would
> be
> > > > agreed
> > > > > > > > upon.
> > > > > > > > > > Implementation of computation wouldn't be required for
> > adding
> > > > a new
> > > > > > > > type.
> > > > > > > > > > Different language bindings have taken different
> approaches
> > > on
> > > > how much
> > > > > > > > > > additional computational elements are packaged in them.
> > > > > > > > >
> > > > > > > > > While dedicated types are not strictly required, compute
> > > > functions would
> > > > > > > > > be much easier to add for a first-class dedicated complex
> > > > datatype
> > > > > > > > > rather than for an extension type.
> > > > > > > > >
> > > > > > > > > Since complex numbers are quite common in some domains, and
> > > > since they
> > > > > > > > > are conceptually simply, IMHO it would make sense to add
> them
> > > to
> > > > the
> > > > > > > > > native Arrow datatypes (at least COMPLEX64 and COMPLEX128).
> > > > > > > > >
> > > > > > > >
> > > > > > > > I'm convinced now that  first-class types seem to be the way
> to
> > > go
> > > > and I'm
> > > > > > > > happy to take this approach.
> > > > > > > > Regarding compute functions, it looks like the standard set
> of
> > > > scalar
> > > > > > > > arithmetic and reduction functionality
> > > > > > > > is desirable for complex numbers:
> > > > > > > > https://arrow.apache.org/docs/cpp/compute.html#
> > > > > > > > Perhaps it would be better to split the addition of the Types
> > and
> > > > addition
> > > > > > > > Compute functionality into separate PRs?
> > > > > > > >
> > > > > > > > Regarding the process for managing this PR, it sounds like a
> > > > proposal must
> > > > > > > > be voted on?
> > > > > > > > i.e. is this proposal still in this phase
> > > > > > > >
> > > >
> > >
> >
> http://arrow.apache.org/docs/developers/contributing.html#before-starting
> > > > > > > > Regards
> > > > > > > >
> > > > > > > > Simon
> > > > > > > >
> > > >
> > >
> >
>

Reply via email to