Sorry, I meant to add that I think the C++ implementation should go
wherever is most convenient to make it work well in the system (unless
the type requires heavy third-party dependencies).

On Sat, Jan 22, 2022 at 8:53 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

>  Do we need a vote on this?
>
> I was imagining well known types would follow roughly the same process
> that new types follow (requiring two different language implementations and
> an integration test).  I don't think we need to stick to Java as the second
> language though.
>
> On Sat, Jan 22, 2022 at 11:27 AM Rok Mihevc <rok.mih...@gmail.com> wrote:
>
>> Thanks for the input Weston!
>>
>> How about arrow/experimental/format/ExtensionTypes.fbs or
>> arrow/format/ExtensionTypes.fbs for language independent schema and
>> loosely arrow/<IMPLEMENTATION>/extensions for implementations?
>>
>> Having machine readable definitions could perhaps be useful for
>> generating implementations in some cases.
>>
>> > * The name of the extension type (to go in ARROW:extension:name)
>> > * A description of the extension type and how it should be used
>> > * The storage type of the extension type
>> > * The format and meaning of the content that will go into
>> > ARROW:extension:metadata
>>
>> These sound pretty complete!
>>
>> I'll wait for a couple of days to see if there's more input and then
>> draft a PR. Do we need a vote on this?
>>
>>
>> Best,
>> Rok
>>
>> On Fri, Jan 21, 2022 at 3:07 AM Weston Pace <weston.p...@gmail.com>
>> wrote:
>> >
>> > Those all seem to be C++ locations.  If we want to define
>> > cross-implementation "Well Known Extension Types" then it seems we
>> > would want to come up with some kind of language independent agreement
>> > (could just be a markdown file but maybe there is some advantage to
>> > having something programmatically consumable) describing:
>> >
>> > * The name of the extension type (to go in ARROW:extension:name)
>> > * A description of the extension type and how it should be used
>> > * The storage type of the extension type
>> > * The format and meaning of the content that will go into
>> > ARROW:extension:metadata
>> >
>> > I think (but am not sure) that, since these are metadata keys, we are
>> > supposed to stick to printable ASCII for values (for backwards
>> > compatibility).
>> >
>> > For example, in the docs, we currently have this little blurb about a
>> > theoretical tensor extension type:
>> >
>> > > tensor (multidimensional array) stored as Binary values and
>> > > having serialized metadata indicating the data type and shape
>> > > of each value. This could be JSON like {'type': 'int8', 'shape':
>> > > [4, 5]} for a 4x5 cell tensor.
>> >
>> > In my mind this file would be somewhat analogous to the way that
>> > schema.fbs is the cross implementation "ground truth" for our logical
>> > types.
>> >
>> > Then the C++ implementation would be free to put the implementation
>> > wherever it likes (I'd vote for arrow/cpp/extensions, but a separate repo
>> > is probably ok. I'm -1 on arrow/extensions/...)
>> >
>> > On Thu, Jan 20, 2022 at 3:20 PM Rok Mihevc <rok.mih...@gmail.com> wrote:
>> > >
>> > > To continue the ExtensionType part of this thread - I would like to
>> > > add TensorArray [1] as an ExtensionType to Arrow but we have not yet
>> > > agreed on an "official" location for "Well Known Extension Types".
>> > >
>> > > Where could we put these? Some suggestions:
>> > >
>> > > * implementation folders (e.g. arrow/cpp/extensions/tensor_array.h)
>> > > * extensions folder (e.g. arrow/extensions/cpp/tensor_array.h)
>> > > * separate repo (e.g.
>> > >   github.com/apache/arrow_extensions/cpp/tensor_array.h)
>> > >
>> > > I'd be happy to also gather other Well Known Extension Types into one
>> > > location if this moves forward.
>> > >
>> > > Rok
>> > >
>> > > [1] https://github.com/apache/arrow/pull/8510#issuecomment-991150389
>> > >
>> > > On Sat, May 1, 2021 at 12:12 PM Andrew Lamb <al...@influxdata.com> wrote:
>> > > >
>> > > > I agree with others on this thread. Thanks for writing this down Micah
>> > > >
>> > > > On Fri, Apr 30, 2021 at 11:16 AM Antoine Pitrou <anto...@python.org> wrote:
>> > > >
>> > > > >
>> > > > > I concur with both what Wes and Micah said.
>> > > > >
>> > > > > As for temporal types, they have wide-spread use and their semantics
>> > > > > require dedicated treatment for arithmetic and conversion, so it's
>> > > > > helpful to define dedicated types for them, as opposed to mere
>> > > > > annotations.
>> > > > >
>> > > > > Regards
>> > > > >
>> > > > > Antoine.
>> > > > >
>> > > > >
>> > > > > On 30/04/2021 at 16:40, Wes McKinney wrote:
>> > > > > > I agree that the bar for adding new types to the Type union in
>> > > > > > Schema.fbs should be quite high going forward. Using extension types
>> > > > > > increasingly for adding specializations of built-in types will mean
>> > > > > > less burden for implementations to simply "propagate forward" this
>> > > > > > data (by preserving the extra metadata) even if they don't understand
>> > > > > > what it does. It would be nice, therefore, to put us on a path to
>> > > > > > expanding our set of "official" extension types (which would include
>> > > > > > things like JSON or UUID) since some libraries may choose to implement
>> > > > > > convenience containers for these for usability.
>> > > > > >
>> > > > > > On Fri, Apr 30, 2021 at 9:22 AM Brian Hulette <bhule...@apache.org>
>> > > > > > wrote:
>> > > > > >
>> > > > > >> +1 this looks good to me.
>> > > > > >>
>> > > > > >> My only concern is with criterion #3, "Is the underlying encoding of
>> > > > > >> the type already semantically supported by a type?". I think this is a
>> > > > > >> good criterion, but it's inconsistent with the current spec. By that
>> > > > > >> criterion some existing types (Timestamp, Time, Duration, Date) should
>> > > > > >> be well known extension types, right?
>> > > > > >>
>> > > > > >> Perhaps we should explicitly indicate these types are grandfathered in
>> > > > > >> [1] because they existed before extension types, to avoid tension with
>> > > > > >> this criterion.
>> > > > > >>
>> > > > > >> Brian
>> > > > > >>
>> > > > > >> [1] https://en.wikipedia.org/wiki/Grandfather_clause
>> > > > > >>
>> > > > > >> On Thu, Apr 29, 2021 at 9:13 PM Jorge Cardoso Leitão <
>> > > > > >> jorgecarlei...@gmail.com> wrote:
>> > > > > >>
>> > > > > >>> Thanks for writing this.
>> > > > > >>>
>> > > > > >>> I agree. That is a good decision tree. +1
>> > > > > >>>
>> > > > > >>> Best,
>> > > > > >>> Jorge
>> > > > > >>>
>> > > > > >>>
>> > > > > >>> On Thu, Apr 29, 2021 at 6:08 PM Micah Kornfield <emkornfi...@gmail.com>
>> > > > > >>> wrote:
>> > > > > >>>
>> > > > > >>>> The discussion around adding another interval type to the Schema.fbs
>> > > > > >>>> raises the issue of when do we decide to add a new type to the
>> > > > > >>>> Schema.fbs vs using other means (primarily extension types [1]).
>> > > > > >>>>
>> > > > > >>>> A few criteria come to mind that could help decide (feedback welcome):
>> > > > > >>>>
>> > > > > >>>> 1.  Is the type a new parameterization of an existing type?
>> > > > > >>>>     - If Yes, and we believe the parameterization is useful and can be
>> > > > > >>>>       done in a forward/backward compatible manner then we would update
>> > > > > >>>>       Schema.fbs.
>> > > > > >>>>
>> > > > > >>>> 2.  Does the type itself have its own specification for processing
>> > > > > >>>>     (e.g. JSON, BSON, Thrift, Avro, Protobuf)?
>> > > > > >>>>     - If yes, we would NOT add them to Schema.fbs.  I think this would
>> > > > > >>>>       potentially yield too many new types.
>> > > > > >>>>
>> > > > > >>>> 3.  Is the underlying encoding of the type already semantically
>> > > > > >>>>     supported by a type? (e.g. if we want to encode physical lengths
>> > > > > >>>>     like meters these can be represented by an integer).
>> > > > > >>>>     - If yes, we would NOT update the specification.  This seems like
>> > > > > >>>>       the exact use-case that extension types are meant for.
>> > > > > >>>>
>> > > > > >>>> * How does this apply to Interval? *
>> > > > > >>>> Interval extends an existing type in the specification and multiple
>> > > > > >>>> "packed fields" cannot be easily communicated with the current version
>> > > > > >>>> of the specification.  Hence, I feel comfortable making the addition to
>> > > > > >>>> Schema.fbs
>> > > > > >>>>
>> > > > > >>>> * What does this mean for other common types? *
>> > > > > >>>>
>> > > > > >>>> I think as types come up that are very common but we don't want to add
>> > > > > >>>> to the Schema.fbs we should invest in formalizing them as "Well Known"
>> > > > > >>>> Extension types.  In this scenario, we would update the specification
>> > > > > >>>> to include how to specify the extension type metadata (and still
>> > > > > >>>> require at least two libraries support the Extension type before
>> > > > > >>>> inclusion as "Well Known").
>> > > > > >>>>
>> > > > > >>>> * Practical implications *
>> > > > > >>>>
>> > > > > >>>> I think this means the type system in Schema.fbs is mostly closed (i.e.
>> > > > > >>>> there is a high bar for adding new types). One potentially useful type
>> > > > > >>>> to have would be a "packed struct" that supports something similar to
>> > > > > >>>> python struct library [2].  I think this would likely cover many
>> > > > > >>>> extension type use-cases.
>> > > > > >>>>
>> > > > > >>>> Thoughts?
>> > > > > >>>>
>> > > > > >>>> -Micah
>> > > > > >>>>
>> > > > > >>>> [1] https://arrow.apache.org/docs/format/Columnar.html#extension-types
>> > > > > >>>> [2] https://docs.python.org/3/library/struct.html
>> > > > > >>>>
>> > > > > >>>
>> > > > > >>
>> > > > > >
>> > > > >
>>
>

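For anyone skimming the thread, here is a rough Python sketch of the four-item description Weston listed (name, description, storage type, metadata format), applied to the theoretical tensor example from the docs. It uses only the standard library and does not touch any Arrow API. The class `ExtensionTypeSpec`, the helper `field_metadata`, and the type name "example.tensor" are made up for illustration; only the two `ARROW:extension:*` keys and the JSON payload come from the thread. The printable-ASCII check reflects Weston's "I think (but am not sure)" remark, so treat it as an assumption rather than a rule.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtensionTypeSpec:
    """Hypothetical language-independent description of an extension type."""
    name: str          # value stored under ARROW:extension:name
    description: str   # what the type is and how it should be used
    storage_type: str  # the Arrow type that physically stores the values
    metadata_doc: str  # format and meaning of ARROW:extension:metadata


def is_printable_ascii(text: str) -> bool:
    """True if every character is in the printable ASCII range (space to ~)."""
    return all(32 <= ord(ch) < 127 for ch in text)


def field_metadata(spec: ExtensionTypeSpec, params: dict) -> dict:
    """Build the two custom-metadata entries a field would carry."""
    # json.dumps escapes non-ASCII characters by default, so the result is ASCII.
    serialized = json.dumps(params, sort_keys=True)
    if not (is_printable_ascii(spec.name) and is_printable_ascii(serialized)):
        # Assumption taken from the thread: keep values to printable ASCII.
        raise ValueError("extension name and metadata must be printable ASCII")
    return {
        "ARROW:extension:name": spec.name,
        "ARROW:extension:metadata": serialized,
    }


# The theoretical tensor type quoted from the docs; the name is invented here.
TENSOR = ExtensionTypeSpec(
    name="example.tensor",
    description="Multidimensional array stored as Binary values.",
    storage_type="binary",
    metadata_doc="JSON object with 'type' (value data type) and 'shape' (list of ints).",
)

tensor_field_metadata = field_metadata(TENSOR, {"type": "int8", "shape": [4, 5]})
print(tensor_field_metadata)
```

An implementation that does not understand the type could still carry these two entries forward unchanged, which is the "propagate forward" behaviour Wes describes.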