Hi,

So, something like a human- and computer-readable standard for arrow
schemas, e.g. via yaml or a json schema.

We kind of do this in our integration tests / golden tests, where we have
a non-official json representation of an arrow schema.
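
For illustration only, a hypothetical (non-official) yaml encoding of the
x/y/z/w struct from the thread below might look something like this -- the
exact keys here are made up, not anything we actually ship:

```yaml
# Hypothetical sketch only -- not an official Arrow format.
# Mirrors the non-nullable x/y/z/w float32 struct discussed below.
fields:
  - name: points
    type: struct
    nullable: false
    children:
      - {name: x, type: float32, nullable: false}
      - {name: y, type: float32, nullable: false}
      - {name: z, type: float32, nullable: false}
      - {name: w, type: float32, nullable: false}
```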

The ask here is to standardize such a format in some way.

Imo that makes sense.

Best,
Jorge

On Mon, Jul 8, 2024, 20:06 Jeremy Leibs <jer...@rerun.io> wrote:

> That handles questions of machine-to-machine coordination, and lets me do
> things like validation, but it doesn't address questions of the kind of
> user-facing API documentation someone would need to practically form and/or
> process data when integrating a library into their code.
>
> I want to be able to document for a user the equivalent of: "The API
> contract of this interface is that you must submit an arrow payload that is
> a StructArray with fields x, y, z, w, each of which must be non-nullable
> floats." But I want to be able to do it in a way that is concise and formal.
> Right
> now I basically have to say something like:
>
> If you're a rust user, make sure your payload adheres to the following
> datatype:
>
>     use arrow2::datatypes::{DataType, Field};
>
>     DataType::Struct(vec![
>         Field::new("x", DataType::Float32, false),
>         Field::new("y", DataType::Float32, false),
>         Field::new("z", DataType::Float32, false),
>         Field::new("w", DataType::Float32, false),
>     ])
>
> If you're a python user, make sure your payload adheres to the following
> datatype:
>
>     pa.struct([
>         pa.field("x", pa.float32(), nullable=False, metadata={}),
>         pa.field("y", pa.float32(), nullable=False, metadata={}),
>         pa.field("z", pa.float32(), nullable=False, metadata={}),
>         pa.field("w", pa.float32(), nullable=False, metadata={}),
>     ])
>
> I'd like to just write that once in a way that any user can easily map into
> their own code and arrow library.
>
> On Mon, Jul 8, 2024 at 12:42 PM Weston Pace <weston.p...@gmail.com> wrote:
>
> > +1 for empty stream/file as schema serialization. I have used this
> > approach myself on more than one occasion and it works well. It can even
> > be useful for transmitting schemas between different arrow-native
> > libraries in the same language (e.g. rust->rust) since it allows the
> > different libraries to use different arrow versions.
> >
> > There is one other approach if you only need intra-process serialization
> > (e.g. between threads / libraries in the same process). You can use the C
> > data interface (https://arrow.apache.org/docs/format/CDataInterface.html).
> > It is maybe a slightly more complex API (because of the release callback)
> > and I think it is unlikely to be significantly faster (unless you have an
> > abnormally large schema). However, it has the same advantages and might
> > be useful if you are already using the C data interface elsewhere.
> >
> >
> > On Mon, Jul 8, 2024 at 8:27 AM Matt Topol <zotthewiz...@gmail.com> wrote:
> >
> > > Hey Jeremy,
> > >
> > > Currently the first message of an IPC stream is a Schema message, which
> > > consists solely of a flatbuffer message and is defined in the Schema.fbs
> > > file of the Arrow repo. All of the libraries that can read Arrow IPC
> > > should be able to also handle converting a single IPC schema message
> > > back into an Arrow schema without issue. Would that be sufficient for
> > > you?
> > >
> > > On Mon, Jul 8, 2024 at 11:12 AM Jeremy Leibs <jer...@rerun.io> wrote:
> > >
> > > > I'm looking for any advice folks may have on a generic way to document
> > > > and represent expected arrow schemas as part of an interface
> > > > definition.
> > > >
> > > > For context, our library provides a cross-language (python, c++, rust)
> > > > SDK for logging semantic multi-modal data (point clouds, images,
> > > > geometric transforms, bounding boxes, etc.). Each of these primitive
> > > > types has an associated arrow schema, but to date we have largely
> > > > abstracted that from our users through language-native object types,
> > > > and a bunch of generated code to "serialize" stuff into the arrow
> > > > buffers before transmitting via our IPC.
> > > >
> > > > We're trying to take steps in the direction of making it easier for
> > > > advanced users to write and read data from the store directly using
> > > > arrow, without needing to go in-and-out of an intermediate
> > > > object-oriented representation. However, doing this means documenting
> > > > to users, for example: "This is the arrow schema to use when sending a
> > > > point cloud with a color channel".
> > > >
> > > > I would love it if, eventually, the arrow project had a way of
> > > > defining a spec file similar to a .proto or a .fbs, with all libraries
> > > > supporting loading of a schema object by directly parsing the spec.
> > > > Has anyone taken steps in this direction?
> > > >
> > > > The best alternative I have at the moment is to redundantly define the
> > > > schema for each of the 3 languages implicitly by directly providing
> > > > the code to construct a datatype instance with the correct schema. But
> > > > this feels unfortunately messy and hard to maintain.
> > > >
> > > > Thanks,
> > > > Jeremy
> > > >
> > >
> >
>