Sorry, I meant to add that I think the C++ implementation should go wherever is most convenient to make it work well in the system (unless the type requires heavy third-party dependencies).
On Sat, Jan 22, 2022 at 8:53 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Do we need a vote on this?
>
> I was imagining well known types would follow roughly the same process
> that new types follow (requiring two different language implementations and
> an integration test). I don't think we need to stick to Java as the second
> language though.
>
> On Sat, Jan 22, 2022 at 11:27 AM Rok Mihevc <rok.mih...@gmail.com> wrote:
>
>> Thanks for the input Weston!
>>
>> How about arrow/experimental/format/ExtensionTypes.fbs or
>> arrow/format/ExtensionTypes.fbs for language independent schema and
>> loosely arrow/<IMPLEMENTATION>/extensions for implementations?
>>
>> Having machine readable definitions could perhaps be useful for
>> generating implementations in some cases.
>>
>> > * The name of the extension type (to go in ARROW:extension:name)
>> > * A description of the extension type and how it should be used
>> > * The storage type of the extension type
>> > * The format and meaning of the content that will go into
>> >   ARROW:extension:metadata
>>
>> These sound pretty complete!
>>
>> I'll wait for a couple of days to see if there's more input and then
>> draft a PR. Do we need a vote on this?
>>
>>
>> Best,
>> Rok
>>
>> On Fri, Jan 21, 2022 at 3:07 AM Weston Pace <weston.p...@gmail.com> wrote:
>> >
>> > Those all seem to be C++ locations. If we want to define
>> > cross-implementation "Well Known Extension Types" then it seems we
>> > would want to come up with some kind of language independent agreement
>> > (could just be a markdown file but maybe there is some advantage to
>> > having something programmatically consumable) describing:
>> >
>> > * The name of the extension type (to go in ARROW:extension:name)
>> > * A description of the extension type and how it should be used
>> > * The storage type of the extension type
>> > * The format and meaning of the content that will go into
>> >   ARROW:extension:metadata
>> >
>> > I think (but am not sure) that, since these are metadata keys, we are
>> > supposed to stick to printable ASCII for values (for backwards
>> > compatibility).
>> >
>> > For example, in the docs, we currently have this little blurb about a
>> > theoretical tensor extension type:
>> >
>> > > tensor (multidimensional array) stored as Binary values and
>> > > having serialized metadata indicating the data type and shape
>> > > of each value. This could be JSON like {'type': 'int8', 'shape':
>> > > [4, 5]} for a 4x5 cell tensor.
>> >
>> > In my mind this file would be somewhat analogous to the way that
>> > schema.fbs is the cross implementation "ground truth" for our logical
>> > types.
>> >
>> > Then the C++ implementation would be free to put the implementation
>> > (I'd vote for arrow/cpp/extensions but a separate repo is probably ok.
>> > I'm -1 on arrow/extensions/...)
>> >
>> > On Thu, Jan 20, 2022 at 3:20 PM Rok Mihevc <rok.mih...@gmail.com> wrote:
>> > >
>> > > To continue the ExtensionType part of this thread - I would like to
>> > > add TensorArray [1] as an ExtensionType to Arrow but we have not yet
>> > > agreed on an "official" location for "Well Known Extension Types".
>> > >
>> > > Where could we put these? Some suggestions:
>> > >
>> > > * implementation folders (e.g. arrow/cpp/extensions/tensor_array.h)
>> > > * extensions folder (e.g. arrow/extensions/cpp/tensor_array.h)
>> > > * separate repo (e.g. github.com/apache/arrow_extensions/cpp/tensor_array.h)
>> > >
>> > > I'd be happy to also gather other Well Known Extension Types into one
>> > > location if this moves forward.
>> > >
>> > > Rok
>> > >
>> > > [1] https://github.com/apache/arrow/pull/8510#issuecomment-991150389
>> > >
>> > > On Sat, May 1, 2021 at 12:12 PM Andrew Lamb <al...@influxdata.com> wrote:
>> > > >
>> > > > I agree with others on this thread. Thanks for writing this down Micah
>> > > >
>> > > > On Fri, Apr 30, 2021 at 11:16 AM Antoine Pitrou <anto...@python.org> wrote:
>> > > >
>> > > > >
>> > > > > I concur with both what Wes and Micah said.
>> > > > >
>> > > > > As for temporal types, they have wide-spread use and their semantics
>> > > > > require dedicated treatment for arithmetic and conversion, so it's
>> > > > > helpful to define dedicated types for them, as opposed to mere annotations.
>> > > > >
>> > > > > Regards
>> > > > >
>> > > > > Antoine.
>> > > > >
>> > > > >
>> > > > > On 30/04/2021 at 16:40, Wes McKinney wrote:
>> > > > > > I agree that the bar for adding new types to the Type union in Schema.fbs
>> > > > > > should be quite high going forward. Using extension types increasingly for
>> > > > > > adding specializations of built-in types will mean less burden for
>> > > > > > implementations to simply "propagate forward" this data (by preserving the
>> > > > > > extra metadata) even if they don't understand what it does. It would be
>> > > > > > nice, therefore, to put us on a path to expanding our set of "official"
>> > > > > > extension types (which would include things like JSON or UUID) since some
>> > > > > > libraries may choose to implement convenience containers for these for
>> > > > > > usability.
>> > > > > >
>> > > > > > On Fri, Apr 30, 2021 at 9:22 AM Brian Hulette <bhule...@apache.org> wrote:
>> > > > > >
>> > > > > >> +1 this looks good to me.
>> > > > > >>
>> > > > > >> My only concern is with criterion #3 "Is the underlying encoding of the
>> > > > > >> type already semantically supported by a type?". I think this is a good
>> > > > > >> criterion, but it's inconsistent with the current spec. By that criterion
>> > > > > >> some existing types (Timestamp, Time, Duration, Date) should be well known
>> > > > > >> extension types, right?
>> > > > > >>
>> > > > > >> Perhaps we should explicitly indicate these types are grandfathered in [1]
>> > > > > >> because they existed before extension types, to avoid tension with this
>> > > > > >> criterion.
>> > > > > >>
>> > > > > >> Brian
>> > > > > >>
>> > > > > >> [1] https://en.wikipedia.org/wiki/Grandfather_clause
>> > > > > >>
>> > > > > >> On Thu, Apr 29, 2021 at 9:13 PM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
>> > > > > >>
>> > > > > >>> Thanks for writing this.
>> > > > > >>>
>> > > > > >>> I agree. That is a good decision tree. +1
>> > > > > >>>
>> > > > > >>> Best,
>> > > > > >>> Jorge
>> > > > > >>>
>> > > > > >>>
>> > > > > >>> On Thu, Apr 29, 2021 at 6:08 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> > > > > >>>
>> > > > > >>>> The discussion around adding another interval type to the Schema.fbs raises
>> > > > > >>>> the issue of when do we decide to add a new type to the Schema.fbs vs using
>> > > > > >>>> other means (primarily extension types [1]).
>> > > > > >>>>
>> > > > > >>>> A few criteria come to mind that could help decide (feedback welcome):
>> > > > > >>>>
>> > > > > >>>> 1. Is the type a new parameterization of an existing type?
>> > > > > >>>>     - If Yes, and we believe the parameterization is useful and can be done
>> > > > > >>>>       in a forward/backward compatible manner then we would update Schema.fbs.
>> > > > > >>>>
>> > > > > >>>> 2. Does the type itself have its own specification for processing (e.g.
>> > > > > >>>>    JSON, BSON, Thrift, Avro, Protobuf)?
>> > > > > >>>>     - If yes, we would NOT add them to Schema.fbs. I think this would
>> > > > > >>>>       potentially yield too many new types.
>> > > > > >>>>
>> > > > > >>>> 3. Is the underlying encoding of the type already semantically supported
>> > > > > >>>>    by a type? (e.g. if we want to encode physical lengths like meters these
>> > > > > >>>>    can be represented by an integer).
>> > > > > >>>>     - If yes, we would NOT update the specification. This seems like the
>> > > > > >>>>       exact use-case that extension types are meant for.
>> > > > > >>>>
>> > > > > >>>> *How does this apply to Interval?*
>> > > > > >>>> Interval extends an existing type in the specification and multiple "packed
>> > > > > >>>> fields" cannot be easily communicated with the current version of the
>> > > > > >>>> specification. Hence, I feel comfortable making the addition to Schema.fbs
>> > > > > >>>>
>> > > > > >>>> *What does this mean for other common types?*
>> > > > > >>>>
>> > > > > >>>> I think as types come up that are very common but we don't want to add to
>> > > > > >>>> the Schema.fbs we should invest in formalizing them as "Well Known"
>> > > > > >>>> Extension types. In this scenario, we would update the specification to
>> > > > > >>>> include how to specify the extension type metadata (and still require at
>> > > > > >>>> least two libraries support the Extension type before inclusion as "Well
>> > > > > >>>> Known").
>> > > > > >>>>
>> > > > > >>>> *Practical implications*
>> > > > > >>>>
>> > > > > >>>> I think this means the type system in Schema.fbs is mostly closed (i.e.
>> > > > > >>>> there is a high bar for adding new types). One potentially useful type to
>> > > > > >>>> have would be a "packed struct" that supports something similar to python
>> > > > > >>>> struct library [2]. I think this would likely cover many extension type
>> > > > > >>>> use-cases.
>> > > > > >>>>
>> > > > > >>>> Thoughts?
>> > > > > >>>>
>> > > > > >>>> -Micah
>> > > > > >>>>
>> > > > > >>>> [1] https://arrow.apache.org/docs/format/Columnar.html#extension-types
>> > > > > >>>> [2] https://docs.python.org/3/library/struct.html
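
To make the "packed struct" idea from the original post a bit more concrete, here is a rough sketch of how such a value and its extension metadata might look, using only Python's struct and json modules. The two metadata keys (ARROW:extension:name and ARROW:extension:metadata) are the ones named in this thread. Everything else is an assumption for illustration only: the extension name "example.packed_struct", the {"format": ...} metadata layout, the three-field interval, and the helper function names are hypothetical and not part of any Arrow specification or library.

```python
import json
import struct

# Hypothetical extension name and layout, chosen only for illustration.
EXTENSION_NAME = "example.packed_struct"
# Little-endian, no padding: int32 months, int32 days, int64 nanoseconds.
FORMAT = "<iiq"


def field_metadata(fmt):
    """Build the two key/value entries that would mark a field as this extension type."""
    serialized = json.dumps({"format": fmt})
    # Keep metadata values to printable ASCII, as suggested in the thread.
    if not (serialized.isascii() and serialized.isprintable()):
        raise ValueError("extension metadata should be printable ASCII")
    return {
        "ARROW:extension:name": EXTENSION_NAME,
        "ARROW:extension:metadata": serialized,
    }


def pack_interval(months, days, nanos):
    """Pack one value into the bytes a fixed-size binary storage slot would hold."""
    return struct.pack(FORMAT, months, days, nanos)


def unpack_interval(value):
    """Recover the (months, days, nanos) tuple from the packed bytes."""
    return struct.unpack(FORMAT, value)
```

Under these assumptions the storage type would be a fixed-size binary of struct.calcsize(FORMAT) bytes, and an implementation that does not understand the extension could still pass the bytes and metadata through unchanged.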