On Mon, Jun 21, 2021 at 4:58 PM Antoine Pitrou <anto...@python.org> wrote:

>
> I certainly don't think we should have extension types with a different
> type id.  IMHO, it's a recipe for confusion.
>

Thanks, I think I got confused by the different perspectives in the thread.
I'll do some more exploratory coding with the pure ExtensionType which
should certainly
make life easier!



>
> Le 21/06/2021 à 15:54, Simon Perkins a écrit :
> > To put it another way, an Extension Type technically has Type::EXTENSION,
> > but now there's Type::COMPLEX_FLOAT and Type::COMPLEX_DOUBLE.
> >
> > When checking enums, the code see's a Type::COMPLEX_FLOAT and seems to
> > mismatch on ComplexFloatType::Type::type_id, as the latter is
> > Type::EXTENSION?
> >
> > On Mon, Jun 21, 2021 at 3:11 PM Simon Perkins <simon.perk...@gmail.com>
> > wrote:
> >
> >> I did some exploratory coding adding Complex Numbers as ExtensionTypes
> in
> >> this PR: https://github.com/apache/arrow/pull/10565
> >>
> >>> My understanding is that it means having COMPLEX as an entry in the
> >> arrow/type_fwd.h Type enum. I agree this would make implementation
> >> work in the C++ library much more straightforward.
> >>
> >> I implemented this approach, adding COMPLEX_FLOAT and COMPLEX_DOUBLE
> >> entries to the Type enum.
> >> One thing I noted is that at least some portion of the code base
> >> (visitor_inline.h) expects the ExtensionTypes
> >> to have first-class type-like interfaces (i.e. needs a TypeTraits entry,
> >> Visitor definitions).
> >>
> >> My impression at this point, is that the work in implementing a hybrid
> >> approach
> >> (i.e. ExtensionType with a Type enum entry) replicates that of adding a
> >> first-class type.
> >> As I am not extensively familiar with the code base, I thought I'd just
> >> check whether this impression
> >> is correct.
> >>
> >> regards
> >>    Simon
> >>
> >>
> >>
> >>
> >>
> >> On Fri, Jun 11, 2021 at 6:52 AM Micah Kornfield <emkornfi...@gmail.com>
> >> wrote:
> >>
> >>>>
> >>>>   It might help this discussion and future discussions like it if we
> >>> could
> >>>> define how it is determined whether a type should be part of the Arrow
> >>>> format, an extension type (and what does it mean to say there is a
> >>>> "canonical" extension type), or just something that a language
> >>>> implementation or downstream library builds for itself with metadata.
> I
> >>>> feel like this has come up before but I don't recall a resolution.
> >>>
> >>>
> >>> There seemed to be  consensus, but I guess we never formally voted on
> the
> >>> decision points here:
> >>>
> >>>
> https://lists.apache.org/thread.html/r7ba08aed2809fa64537e6f44bce38b2cf740acbef0e91cfaa7c19767%40%3Cdev.arrow.apache.org%3E
> >>>
> >>> Applying the criteria to complex types:
> >>> 1.  Is the type a new parameterization of an existing type?  No
> >>>
> >>> 2.  Does the type itself have its own specification for processing
> (e.g.
> >>> JSON, BSON, Thrift, Avro, Protobuf)? No
> >>>
> >>> 3.  Is the underlying encoding of the type already semantically
> supported
> >>> by a type?  Yes.  Two have been mentioned in this thread and I would
> also
> >>> support adding a new packed struct type, but it appears isn't necessary
> >>> for
> >>> this. Note that FixedSizeLists have some limitations in regards to
> parquet
> >>> compatibility around nullability, there might be a few other sharp
> edges.
> >>>
> >>> So if we use this criteria we would lean towards an extension type.
> >>>
> >>> We never converged on a standard for "canonical" extension types.  I
> would
> >>> propose it roughly be the same criteria as a first class type:
> >>> 1.  Specification/document update PR that describes the representation
> >>> 2.  Implementation showing working integration tests across two
> languages
> >>> (for canonical types I think this can be any 2 languages instead of C++
> >>> and
> >>> Java)
> >>> 3.  Formal vote accepting the canonical type.
> >>>
> >>> Thanks,
> >>> Micah
> >>>
> >>>
> >>>
> >>> On Thu, Jun 10, 2021 at 9:34 PM Jorge Cardoso Leitão <
> >>> jorgecarlei...@gmail.com> wrote:
> >>>
> >>>> Isn't an array of complexes represented by what arrow already
> supports?
> >>> In
> >>>> particular, I see at least two valid in-memory representations to use,
> >>> that
> >>>> depend on what we are going to do with it:
> >>>>
> >>>> * Struct[re, im]
> >>>> * FixedList[2]
> >>>>
> >>>> In the first case, we have two buffers, [x0, x1, ...] and [y0, y1,
> >>> ...], in
> >>>> the second case we have 1 buffer, [x0, y0, x1, y1, ...].
> >>>>
> >>>> The first representation is useful for column-based operations (e.g.
> >>> taking
> >>>> the real part in case 1 is trivial; requires a copy in the second
> case),
> >>>> the second representation is useful for row-base operations (e.g.
> "take"
> >>>> and "filter" require a single pass over buffer 1). Case 2 does not
> >>> support
> >>>> Re and Im of different physical types (arguably an issue). Both cases
> >>>> support nullability of individual items or combined.
> >>>>
> >>>> What I conclude is that this does not seem to be a problem about a
> base
> >>>> in-memory representation, but rather on whether we agree on a
> >>>> representation that justifies adding associated metadata to the spec.
> >>>>
> >>>> The case for the complex interval type recently proposed [1] is more
> >>>> compelling to me because a complex ops over intervals usually required
> >>> all
> >>>> parts of the interval (and thus the "FixedList" representation is more
> >>>> compelling), but each part has a different type. I.e. it is like a
> >>>> "FixedTypedList[int32, int32, int64]", which we do not natively
> support.
> >>>>
> >>>> [1] https://github.com/apache/arrow/pull/10177
> >>>>
> >>>> Best,
> >>>> Jorge
> >>>>
> >>>>
> >>>>
> >>>> On Fri, Jun 11, 2021 at 1:48 AM Neal Richardson <
> >>>> neal.p.richard...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>>   It might help this discussion and future discussions like it if we
> >>> could
> >>>>> define how it is determined whether a type should be part of the
> Arrow
> >>>>> format, an extension type (and what does it mean to say there is a
> >>>>> "canonical" extension type), or just something that a language
> >>>>> implementation or downstream library builds for itself with metadata.
> >>> I
> >>>>> feel like this has come up before but I don't recall a resolution.
> >>>>>
> >>>>> Examples might also help: are there examples of "canonical extension
> >>>>> types"?
> >>>>>
> >>>>> Neal
> >>>>>
> >>>>> On Thu, Jun 10, 2021 at 4:20 PM Micah Kornfield <
> >>> emkornfi...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>>>
> >>>>>>> My understanding is that it means having COMPLEX as an entry in
> >>> the
> >>>>>>> arrow/type_fwd.h Type enum. I agree this would make implementation
> >>>>>>> work in the C++ library much more straightforward.
> >>>>>>
> >>>>>> One idea I proposed would be to do that, and implement the
> >>>>>>> serialization of the complex metadata using Extension types.
> >>>>>>
> >>>>>>
> >>>>>> If this is a maintainable strategy for Canonical types it sounds
> >>> good
> >>>> to
> >>>>>> me.
> >>>>>>
> >>>>>> On Thu, Jun 10, 2021 at 4:02 PM Wes McKinney <wesmck...@gmail.com>
> >>>>> wrote:
> >>>>>>
> >>>>>>> My understanding is that it means having COMPLEX as an entry in
> >>> the
> >>>>>>> arrow/type_fwd.h Type enum. I agree this would make implementation
> >>>>>>> work in the C++ library much more straightforward.
> >>>>>>>
> >>>>>>> One idea I proposed would be to do that, and implement the
> >>>>>>> serialization of the complex metadata using Extension types.
> >>>>>>>
> >>>>>>> On Thu, Jun 10, 2021 at 5:47 PM Weston Pace <
> >>> weston.p...@gmail.com>
> >>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> While dedicated types are not strictly required, compute
> >>>> functions
> >>>>>>> would
> >>>>>>>>> be much easier to add for a first-class dedicated complex
> >>>> datatype
> >>>>>>>>> rather than for an extension type.
> >>>>>>>> @pitrou
> >>>>>>>>
> >>>>>>>> This is perhaps a naive question (and admittedly, I'm not up to
> >>>> speed
> >>>>>>>> on my compute kernels) but why is this the case?  For example,
> >>> if
> >>>>>>>> adding a complex addition kernel it seems we would be talking
> >>>>> about...
> >>>>>>>>
> >>>>>>>> dest_scalar.real = scalar1.real + scalar2.real;
> >>>>>>>> dest_scalar.im = scalar1.im + scalar2.im;
> >>>>>>>>
> >>>>>>>> vs...
> >>>>>>>>
> >>>>>>>> dest_scalar[0] = scalar1[0] + scalar2[0];
> >>>>>>>> dest_scalar[1] = scalar1[1] + scalar2[1];
> >>>>>>>>
> >>>>>>>> On Thu, Jun 10, 2021 at 11:27 AM Wes McKinney <
> >>> wesmck...@gmail.com
> >>>>>
> >>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>> I'd be supportive of starting with this as a "canonical"
> >>>> extension
> >>>>>>>>> type so that all implementations are not expected to support
> >>>>> complex
> >>>>>>>>> types — this would encourage us to build sufficient
> >>> integration
> >>>>> e.g.
> >>>>>>>>> with NumPy to get things working end-to-end with the on-wire
> >>>>>>>>> representation being an extension type. We could certainly
> >>> choose
> >>>>> to
> >>>>>>>>> treat the type as "first class" in the C++ library without it
> >>>> being
> >>>>>>>>> "top level" in the Type union in Flatbuffers.
> >>>>>>>>>
> >>>>>>>>> I agree that the use cases are more specialized, and the fact
> >>>> that
> >>>>> we
> >>>>>>>>> haven't needed it until now (or at least, its absence suggests
> >>>>> this)
> >>>>>>>>> shows that this is the case.
> >>>>>>>>>
> >>>>>>>>> On Thu, Jun 10, 2021 at 4:17 PM Micah Kornfield <
> >>>>>> emkornfi...@gmail.com>
> >>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> I'm convinced now that  first-class types seem to be the
> >>> way
> >>>> to
> >>>>>> go
> >>>>>>> and I'm
> >>>>>>>>>>> happy to take this approach.
> >>>>>>>>>>
> >>>>>>>>>> I agree from an implementation effort it is simpler, but I'm
> >>>>> still
> >>>>>>> not
> >>>>>>>>>> convinced that we should be adding this as a first class
> >>> type.
> >>>>> As
> >>>>>>> noted in
> >>>>>>>>>> the survey below it appears Complex numbers are not a core
> >>>>> concept
> >>>>>>> in many
> >>>>>>>>>> general purpose coding languages and it doesn't appear to
> >>> be a
> >>>>>>> common type
> >>>>>>>>>> in SQL systems either.
> >>>>>>>>>>
> >>>>>>>>>> The reason why I am being nit-picky here is I think that
> >>>> having a
> >>>>>>> first
> >>>>>>>>>> class type indicates that it should eventually be supported
> >>> by
> >>>>> all
> >>>>>>>>>> reference implementations.  An "well known" extension type I
> >>>>> think
> >>>>>>> offers
> >>>>>>>>>> less guarantees which makes it seem more suitable for niche
> >>>>> types.
> >>>>>>>>>>
> >>>>>>>>>>> I don't immediately see a Packed Struct type. Would this
> >>> need
> >>>>> to
> >>>>>> be
> >>>>>>>>>>>> implemented?
> >>>>>>>>>>> Not necessarily (*).  But before thinking about
> >>>> implementation,
> >>>>>>> this
> >>>>>>>>>>> proposal must be accepted into the format.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Yes, this is a type that has been proposed in the past and I
> >>>>> think
> >>>>>>> handles
> >>>>>>>>>> a lot of  types not yet in Arrow but have been requested
> >>> (e.g.
> >>>> IP
> >>>>>>>>>> Addresses, Geo coordinates), etc.
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Jun 10, 2021 at 1:06 AM Simon Perkins <
> >>>>>>> simon.perk...@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> On Wed, Jun 9, 2021 at 7:56 PM Antoine Pitrou <
> >>>>>> anto...@python.org>
> >>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Le 09/06/2021 à 17:52, Micah Kornfield a écrit :
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Adding a new first-class type in Arrow requires
> >>> working
> >>>>>>> integration
> >>>>>>>>>>> tests
> >>>>>>>>>>>>> between C++ and Java libraries (once the idea is
> >>>> informally
> >>>>>>> agreed
> >>>>>>>>>>> upon)
> >>>>>>>>>>>>> and then a final vote for approval.  We haven't
> >>>> formalized
> >>>>>>> extension
> >>>>>>>>>>>> types
> >>>>>>>>>>>>> but I imagine a similar cross language requirement
> >>> would
> >>>> be
> >>>>>>> agreed
> >>>>>>>>>>> upon.
> >>>>>>>>>>>>> Implementation of computation wouldn't be required for
> >>>>> adding
> >>>>>>> a new
> >>>>>>>>>>> type.
> >>>>>>>>>>>>> Different language bindings have taken different
> >>>> approaches
> >>>>>> on
> >>>>>>> how much
> >>>>>>>>>>>>> additional computational elements are packaged in
> >>> them.
> >>>>>>>>>>>>
> >>>>>>>>>>>> While dedicated types are not strictly required, compute
> >>>>>>> functions would
> >>>>>>>>>>>> be much easier to add for a first-class dedicated
> >>> complex
> >>>>>>> datatype
> >>>>>>>>>>>> rather than for an extension type.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Since complex numbers are quite common in some domains,
> >>> and
> >>>>>>> since they
> >>>>>>>>>>>> are conceptually simply, IMHO it would make sense to add
> >>>> them
> >>>>>> to
> >>>>>>> the
> >>>>>>>>>>>> native Arrow datatypes (at least COMPLEX64 and
> >>> COMPLEX128).
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> I'm convinced now that  first-class types seem to be the
> >>> way
> >>>> to
> >>>>>> go
> >>>>>>> and I'm
> >>>>>>>>>>> happy to take this approach.
> >>>>>>>>>>> Regarding compute functions, it looks like the standard
> >>> set
> >>>> of
> >>>>>>> scalar
> >>>>>>>>>>> arithmetic and reduction functionality
> >>>>>>>>>>> is desirable for complex numbers:
> >>>>>>>>>>> https://arrow.apache.org/docs/cpp/compute.html#
> >>>>>>>>>>> Perhaps it would be better to split the addition of the
> >>> Types
> >>>>> and
> >>>>>>> addition
> >>>>>>>>>>> Compute functionality into separate PRs?
> >>>>>>>>>>>
> >>>>>>>>>>> Regarding the process for managing this PR, it sounds
> >>> like a
> >>>>>>> proposal must
> >>>>>>>>>>> be voted on?
> >>>>>>>>>>> i.e. is this proposal still in this phase
> >>>>>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> http://arrow.apache.org/docs/developers/contributing.html#before-starting
> >>>>>>>>>>> Regards
> >>>>>>>>>>>
> >>>>>>>>>>> Simon
> >>>>>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>

Reply via email to