If I may, I would be very interested in being kept in the loop as well. I
have been working on a small library that makes it easy to declare Python
types and have them automatically supported in PyArrow as extension types
(and thus benefit from vectorized ops): https://github.com/balancap/arrowbic

The main feature at the moment is support for dataclasses, NumPy arrays
and enums, but I plan to extend it to as many standard Python patterns as
possible.

In short, for now I am storing the metadata serialized as JSON, but I would
be happy to move to any standard defined in PyArrow, and also to use the
standard representation for tensors / NumPy arrays.
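
To make the JSON-serialized metadata concrete, here is a minimal sketch
using only the Python standard library. It is not how arrowbic or PyArrow
does it: the helper names are hypothetical, and "arrow.fixed_size_tensor"
is only the name suggested further down in this thread, not an agreed
standard. The two keys are the ones from Rok's example.

```python
import json


def tensor_extension_metadata(dtype, shape, strides):
    """Build the two custom-metadata entries for a tensor extension type.

    Hypothetical helper: the extension name below is an assumption taken
    from the naming suggestion in this thread.
    """
    params = {"type": dtype, "shape": list(shape), "strides": list(strides)}
    return {
        # Keep the name simple; parameters go in the metadata value.
        "ARROW:extension:name": "arrow.fixed_size_tensor",
        # json.dumps emits valid JSON (double-quoted keys and strings).
        "ARROW:extension:metadata": json.dumps(params),
    }


def parse_tensor_extension_metadata(metadata):
    """Recover the type parameters from the serialized metadata value."""
    return json.loads(metadata["ARROW:extension:metadata"])


meta = tensor_extension_metadata("int64", (3, 3, 4), (12, 4, 1))
params = parse_tensor_extension_metadata(meta)
```

Keeping the parameters out of the name means a reader only has to parse
the metadata value to reconstruct the type, whatever encoding ends up
being chosen.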

Thank you!
Paul




On Tue, 8 Feb 2022, 17:57 Micah Kornfield, <emkornfi...@gmail.com> wrote:

> >
> > I do not know if we voted on a naming convention, but we may want to
> > reserve a namespace for us (e.g. "arrow").
>
> +1 to calling out in docs that the arrow namespace should be reserved.
> maybe "apache.arrow" to lower the possibility of collisions with people who
> already have extension types? (I don't feel too strongly about this).
>
> Note that we do not have tests on tensor arrays, so testing the extension
> > type on these may be hindered by divergences between implementations. I
> do
> > not think we even have json integration files for them.
>
> Agree, we'll likely need a little more thought on what it means to validate
> extension types (is being able to parse extension metadata sufficient?)
>
> Also, note that Rust's arrow2 supports extension types (tested part of the
> > IPC and c data interface*), and Polars relies on it to allow Python
> generic
> > "object" in its machinery.
>
> I think this is great for having external verification of specifications,
> but I think for official arrow types, we should be focusing on
> implementations that are under ASF governance.
>
> On Tue, Feb 8, 2022 at 8:32 AM Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com> wrote:
>
> > Note that we do not have tests on tensor arrays, so testing the extension
> > type on these may be hindered by divergences between implementations. I
> do
> > not think we even have json integration files for them.
> >
> > If the focus is extension types, maybe it would be best to cover types
> > whose physical representations are covered in e.g. IPC or c data
> interface
> > tests.
> >
> > I do not know if we voted on a naming convention, but we may want to
> > reserve a namespace for us (e.g. "arrow").
> >
> > Also, note that Rust's arrow2 supports extension types (tested part of
> the
> > IPC and c data interface*), and Polars relies on it to allow Python
> generic
> > "object" in its machinery.
> >
> > Best,
> > Jorge
> >
> > * pending https://issues.apache.org/jira/browse/ARROW-15613
> >
> >
> >
> > On Tue, Feb 8, 2022, 13:52 Joris Van den Bossche <
> > jorisvandenboss...@gmail.com> wrote:
> >
> > > On Mon, 7 Feb 2022 at 21:02, Rok Mihevc <rok.mih...@gmail.com> wrote:
> > >
> > > > To follow up the discussion from the bi-weekly Arrow sync:
> > > >
> > > > - JSON seems the most suitable candidate for the extension metadata.
> > > > E.g.: TensorArray
> > > > {"key": "ARROW:extension:name", "value": "tensor<type=int64,
> shape=(3,
> > > > 3, 4), strides=(12, 4, 1)>"},
> > > > {"key": "ARROW:extension:metadata", "value": "{'type': 'int64',
> > > > 'shape': [3, 3, 4], 'strides': [12, 4, 1]}"}
> > > >
> > >
> > > I will start a separate thread for the exact encoding of the metadata
> > value
> > > (i.e. JSON or something else) if that's OK. I already started writing
> one
> > > last week anyway, and that keeps things a bit separated.
> > >
> > > For the name of the extension type:
> > > - We might want to use something like "arrow.tensor" to follow the
> > > recommendation at
> > > https://arrow.apache.org/docs/format/Columnar.html#extension-types to
> > use
> > > a
> > > namespace. And so for "well known" extension types that are defined in
> > the
> > > Arrow project itself, I think we can use the "arrow" namespace? (as
> > > example, for the extension types defined in pandas, I used the
> "pandas."
> > > namespace)
> > > - In general, I think it's best to keep the name itself simple, and
> leave
> > > any parametrization out of it (since this is included in the metadata).
> > So
> > > in this case that would be just "tensor" instead of "tensor<type=...,
> > > shape=..., ..>".
> > > - Specifically for this extension type, we might want to use something
> > like
> > > "fixed_size_tensor" instead of "tensor", to be able to differentiate in
> > the
> > > future between the tensor type with constant shape vs variable shape (
> > > ARROW-1614 <https://issues.apache.org/jira/browse/ARROW-1614> vs
> > > ARROW-8714
> > > <https://issues.apache.org/jira/browse/ARROW-8714>). But that's
> > something
> > > to discuss in the relevant JIRA issue / PR.
> > >
> > > - We want to start with at least one integration test pair. Potential
> > > > candidates are cpp, julia, go, rust.
> > > >
> > >
> > > Rust does not yet seem to support extension types? (
> > > https://github.com/apache/arrow-rs/issues/218)
> > >
> > >
> > > > - First well known extension type candidate is TensorArray but other
> > > > suggestions are welcome.
> > > >
> > >
> > > Others that I am aware of that have been brought up in the past are
> UUID
> > (
> > > ARROW-2152 <https://issues.apache.org/jira/browse/ARROW-2152>),
> complex
> > > numbers (ARROW-638 <https://issues.apache.org/jira/browse/ARROW-638>,
> > this
> > > has a PR) and 8-bit boolean values (ARROW-1674
> > > <https://issues.apache.org/jira/browse/ARROW-1674>). But I think we
> > should
> > > mainly look at demand / someone wanting to implement this, and (for
> you)
> > > this seems to be Tensors, so it's fine to focus on that.
> > >
> > > Joris
> > >
> > >
> > > >
> > > > On Tue, Jan 25, 2022 at 10:34 AM Antoine Pitrou <anto...@python.org>
> > > > wrote:
> > > > >
> > > > >
> > > > > Le 25/01/2022 à 10:12, Joris Van den Bossche a écrit :
> > > > > > On Sat, 22 Jan 2022 at 20:27, Rok Mihevc <rok.mih...@gmail.com>
> > > wrote:
> > > > > >>
> > > > > >> Thanks for the input Weston!
> > > > > >>
> > > > > >> How about arrow/experimental/format/ExtensionTypes.fbs or
> > > > > >> arrow/format/ExtensionTypes.fbs for language independent schema
> > and
> > > > > >> loosely arrow/<IMPLEMENTATION>/extensions for implementations?
> > > > > >>
> > > > > >> Having machine readable definitions could perhaps be useful for
> > > > > >> generating implementations in some cases.
> > > > > >
> > > > > > Is it useful to put this in a flatbuffer file? Based on the list
> > from
> > > > > > Weston just below, I think this will mostly contain a
> *description*
> > > of
> > > > > > those different aspect (a specification of the extension type),
> and
> > > > > > there is no data that actually fits in a flatbuffer table? In
> that
> > > > > > case a plain text (eg markdown) file seems more fitting?
> > > > >
> > > > > I agree this is mostly a plain text (or, rather, reST :-))
> > > specification
> > > > > task.
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > >
> > >
> >
>
