If I may, I would be really interested in being kept in the loop as well. I have been working on a small library that makes it easy to declare Python types and have them automatically supported in PyArrow as extension types (and so benefit from vectorized ops): https://github.com/balancap/arrowbic
The main feature at the moment is support for dataclasses, NumPy arrays and enums, but I plan to extend it to as many standard Python patterns as possible. In short, I am currently storing the metadata serialized as JSON, but I would be happy to move to any standard defined in PyArrow, and also to use the standard representation for tensors / NumPy arrays.

Thank you!
Paul

On Tue, 8 Feb 2022, 17:57 Micah Kornfield, <emkornfi...@gmail.com> wrote:

> > I do not know if we voted on a naming convention, but we may want to
> > reserve a namespace for us (e.g. "arrow").
>
> +1 to calling out in docs that the arrow namespace should be reserved.
> maybe "apache.arrow" to lower the possibility of collisions with people who
> already have extension types? (I don't feel too strongly about this).
>
> > Note that we do not have tests on tensor arrays, so testing the extension
> > type on these may be hindered by divergences between implementations. I do
> > not think we even have json integration files for them.
>
> Agree, we'll likely need a little more thought on what it means to validate
> extension types (is being able to parse extension metadata sufficient?)
>
> > Also, note that Rust's arrow2 supports extension types (tested part of the
> > IPC and c data interface*), and Polars relies on it to allow Python generic
> > "object" in its machinery.
>
> I think this is great for having external verification of specifications,
> but I think for official arrow types, we should be focusing on
> implementations that are under ASF governance.
>
> On Tue, Feb 8, 2022 at 8:32 AM Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com> wrote:
>
> > Note that we do not have tests on tensor arrays, so testing the extension
> > type on these may be hindered by divergences between implementations. I do
> > not think we even have json integration files for them.
> >
> > If the focus is extension types, maybe it would be best to cover types
> > whose physical representations are covered in e.g.
> > IPC or c data interface tests.
> >
> > I do not know if we voted on a naming convention, but we may want to
> > reserve a namespace for us (e.g. "arrow").
> >
> > Also, note that Rust's arrow2 supports extension types (tested part of the
> > IPC and c data interface*), and Polars relies on it to allow Python generic
> > "object" in its machinery.
> >
> > Best,
> > Jorge
> >
> > * pending https://issues.apache.org/jira/browse/ARROW-15613
> >
> > On Tue, Feb 8, 2022, 13:52 Joris Van den Bossche <
> > jorisvandenboss...@gmail.com> wrote:
> >
> > > On Mon, 7 Feb 2022 at 21:02, Rok Mihevc <rok.mih...@gmail.com> wrote:
> > >
> > > > To follow up the discussion from the bi-weekly Arrow sync:
> > > >
> > > > - JSON seems the most suitable candidate for the extension metadata.
> > > > E.g.: TensorArray
> > > > {"key": "ARROW:extension:name", "value": "tensor<type=int64, shape=(3,
> > > > 3, 4), strides=(12, 4, 1)>"},
> > > > {"key": "ARROW:extension:metadata", "value": "{'type': 'int64',
> > > > 'shape': [3, 3, 4], 'strides': [12, 4, 1]}"}
> > > >
> > >
> > > I will start a separate thread for the exact encoding of the metadata value
> > > (i.e. JSON or something else) if that's OK. I already started writing one
> > > last week anyway, and that keeps things a bit separated.
> > >
> > > For the name of the extension type:
> > > - We might want to use something like "arrow.tensor" to follow the
> > > recommendation at
> > > https://arrow.apache.org/docs/format/Columnar.html#extension-types to use a
> > > namespace. And so for "well known" extension types that are defined in the
> > > Arrow project itself, I think we can use the "arrow" namespace? (as
> > > example, for the extension types defined in pandas, I used the "pandas."
> > > namespace)
> > > - In general, I think it's best to keep the name itself simple, and leave
> > > any parametrization out of it (since this is included in the metadata).
> > > So in this case that would be just "tensor" instead of "tensor<type=...,
> > > shape=..., ..>".
> > > - Specifically for this extension type, we might want to use something like
> > > "fixed_size_tensor" instead of "tensor", to be able to differentiate in the
> > > future between the tensor type with constant shape vs variable shape (
> > > ARROW-1614 <https://issues.apache.org/jira/browse/ARROW-1614> vs
> > > ARROW-8714 <https://issues.apache.org/jira/browse/ARROW-8714>). But that's
> > > something to discuss in the relevant JIRA issue / PR.
> > >
> > > > - We want to start with at least one integration test pair. Potential
> > > > candidates are cpp, julia, go, rust.
> > > >
> > >
> > > Rust does not yet seem to support extension types? (
> > > https://github.com/apache/arrow-rs/issues/218)
> > >
> > > > - First well known extension type candidate is TensorArray but other
> > > > suggestions are welcome.
> > > >
> > >
> > > Others that I am aware of that have been brought up in the past are UUID (
> > > ARROW-2152 <https://issues.apache.org/jira/browse/ARROW-2152>), complex
> > > numbers (ARROW-638 <https://issues.apache.org/jira/browse/ARROW-638>, this
> > > has a PR) and 8-bit boolean values (ARROW-1674
> > > <https://issues.apache.org/jira/browse/ARROW-1674>). But I think we should
> > > mainly look at demand / someone wanting to implement this, and (for you)
> > > this seems to be Tensors, so it's fine to focus on that.
> > >
> > > Joris
> > >
> > > > On Tue, Jan 25, 2022 at 10:34 AM Antoine Pitrou <anto...@python.org>
> > > > wrote:
> > > > >
> > > > > Le 25/01/2022 à 10:12, Joris Van den Bossche a écrit :
> > > > > > On Sat, 22 Jan 2022 at 20:27, Rok Mihevc <rok.mih...@gmail.com> wrote:
> > > > > >>
> > > > > >> Thanks for the input Weston!
> > > > > >>
> > > > > >> How about arrow/experimental/format/ExtensionTypes.fbs or
> > > > > >> arrow/format/ExtensionTypes.fbs for language independent schema and
> > > > > >> loosely arrow/<IMPLEMENTATION>/extensions for implementations?
> > > > > >>
> > > > > >> Having machine readable definitions could perhaps be useful for
> > > > > >> generating implementations in some cases.
> > > > > >
> > > > > > Is it useful to put this in a flatbuffer file? Based on the list from
> > > > > > Weston just below, I think this will mostly contain a *description* of
> > > > > > those different aspect (a specification of the extension type), and
> > > > > > there is no data that actually fits in a flatbuffer table? In that
> > > > > > case a plain text (eg markdown) file seems more fitting?
> > > > >
> > > > > I agree this is mostly a plain text (or, rather, reST :-)) specification
> > > > > task.
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > > >
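
For reference, the two field-metadata entries discussed in the quoted thread can be written out as below. This is only a minimal sketch under assumptions: it takes the "arrow.fixed_size_tensor" name floated by Joris as a default (nothing was settled in this thread), assumes JSON as the metadata encoding, and assumes the strides in Rok's example are row-major strides counted in elements. The helper names `row_major_strides` and `tensor_extension_fields` are hypothetical and do not come from PyArrow or arrowbic.

```python
import json


def row_major_strides(shape):
    """Return element (not byte) strides for a C-contiguous tensor of `shape`."""
    strides = []
    step = 1
    # Walk the dimensions from innermost to outermost, accumulating the step.
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return list(reversed(strides))


def tensor_extension_fields(value_type, shape, name="arrow.fixed_size_tensor"):
    """Build the two key/value entries for a tensor extension type.

    The extension name stays free of parameters; the parameters live in
    the JSON-encoded metadata value (an assumed encoding, not a standard).
    """
    metadata = {
        "type": value_type,
        "shape": list(shape),
        "strides": row_major_strides(shape),
    }
    return [
        {"key": "ARROW:extension:name", "value": name},
        {"key": "ARROW:extension:metadata", "value": json.dumps(metadata)},
    ]


if __name__ == "__main__":
    for entry in tensor_extension_fields("int64", (3, 3, 4)):
        print(entry)
```

For shape (3, 3, 4) this reproduces the strides (12, 4, 1) from the example in the thread. Note that `json.dumps` emits double-quoted strings, whereas the single-quoted form in the quoted example would not parse as strict JSON.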