+1 for an empty stream/file as the schema serialization format.  I have used
this approach myself on more than one occasion and it works well.  It can
even be useful for transmitting schemas between different arrow-native
libraries in the same language (e.g. rust->rust), since it lets the
libraries use different arrow versions.

There is one other approach if you only need intra-process serialization
(e.g. between threads / libraries in the same process): the C data
interface (https://arrow.apache.org/docs/format/CDataInterface.html).  Its
API is slightly more complex (because of the release callback), and I doubt
it is significantly faster unless you have an abnormally large schema.
However, it has the same advantages and might be useful if you are already
using the C data interface elsewhere.


On Mon, Jul 8, 2024 at 8:27 AM Matt Topol <zotthewiz...@gmail.com> wrote:

> Hey Jeremy,
>
> Currently the first message of an IPC stream is a Schema message which
> consists solely of a flatbuffer message and defined in the Schema.fbs file
> of the Arrow repo. All of the libraries that can read Arrow IPC should be
> able to also handle converting a single IPC schema message back into an
> Arrow schema without issue. Would that be sufficient for you?
>
> On Mon, Jul 8, 2024 at 11:12 AM Jeremy Leibs <jer...@rerun.io> wrote:
>
> > I'm looking for any advice folks may have on a generic way to document
> > and represent expected arrow schemas as part of an interface definition.
> >
> > For context, our library provides a cross-language (python, c++, rust) SDK
> > for logging semantic multi-modal data (point clouds, images, geometric
> > transforms, bounding boxes, etc.). Each of these primitive types has an
> > associated arrow schema, but to date we have largely abstracted that from
> > our users through language-native object types, and a bunch of generated
> > code to "serialize" stuff into the arrow buffers before transmitting via
> > our IPC.
> >
> > We're trying to take steps in the direction of making it easier for
> > advanced users to write and read data from the store directly using arrow,
> > without needing to go in-and-out of an intermediate object-oriented
> > representation. However, doing this means documenting to users, for
> > example: "This is the arrow schema to use when sending a point cloud
> > with a color channel".
> >
> > I would love it if, eventually, the arrow project had a way of defining a
> > spec file similar to a .proto or a .fbs, with all libraries supporting
> > loading of a schema object by directly parsing the spec. Has anyone taken
> > steps in this direction?
> >
> > The best alternative I have at the moment is to redundantly define the
> > schema for each of the 3 languages implicitly by directly providing the
> > code to construct a datatype instance with the correct schema. But this
> > feels unfortunately messy and hard to maintain.
> >
> > Thanks,
> > Jeremy
> >
>
