Hi,

So, something like a human- and computer-readable standard for Arrow schemas, e.g. via YAML or a JSON schema; roughly the idea in the sketch below.
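To make that concrete, here is a rough sketch in Python. The JSON-ish layout (the "type", "fields", "name", "nullable" keys) is invented purely for illustration; it is not the format arrow2 uses in its golden tests, nor any official one:

    import pyarrow as pa

    # A declarative, language-neutral description of the quaternion struct
    # from Jeremy's example. The key names here are hypothetical.
    SPEC = {
        "type": "struct",
        "fields": [
            {"name": n, "type": "float32", "nullable": False}
            for n in ("x", "y", "z", "w")
        ],
    }

    PRIMITIVES = {"float32": pa.float32()}

    def to_arrow(spec: dict) -> pa.DataType:
        # Map the toy description onto a concrete pyarrow DataType.
        if spec["type"] == "struct":
            return pa.struct([
                pa.field(f["name"], to_arrow(f), nullable=f["nullable"])
                for f in spec["fields"]
            ])
        return PRIMITIVES[spec["type"]]

    assert to_arrow(SPEC) == pa.struct(
        [pa.field(n, pa.float32(), nullable=False) for n in ("x", "y", "z", "w")]
    )

Each language binding would need one small shim like to_arrow, instead of one hand-written datatype definition per schema.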
We kind of do this in our integration tests / golden tests, where we have a non-official JSON representation of an Arrow schema. The ask here is to standardize such a format in some way. IMO that makes sense.

Best,
Jorge

On Mon, Jul 8, 2024, 20:06 Jeremy Leibs <jer...@rerun.io> wrote:

> That handles questions of machine-to-machine coordination, and lets me do
> things like validation, but it doesn't address the kind of user-facing
> API documentation someone would need to practically form and/or process
> data when integrating a library into their code.
>
> I want to be able to document for a user the equivalent of: "The API
> contract of this interface is that you must submit an Arrow payload that
> is a StructArray with fields x, y, z, w, each of which must be a
> non-nullable float." But I want to be able to do it in a way that's
> concise and formal. Right now I basically have to say something like:
>
> If you're a Rust user, make sure your payload adheres to the following
> datatype:
>
>     arrow2::datatypes::DataType::Struct(Arc::new(vec![
>         arrow2::datatypes::Field::new("x", Float32, false),
>         arrow2::datatypes::Field::new("y", Float32, false),
>         arrow2::datatypes::Field::new("z", Float32, false),
>         arrow2::datatypes::Field::new("w", Float32, false),
>     ]))
>
> If you're a Python user, make sure your payload adheres to the following
> datatype:
>
>     pa.struct([
>         pa.field("x", pa.float32(), nullable=False, metadata={}),
>         pa.field("y", pa.float32(), nullable=False, metadata={}),
>         pa.field("z", pa.float32(), nullable=False, metadata={}),
>         pa.field("w", pa.float32(), nullable=False, metadata={}),
>     ])
>
> I'd like to write that just once, in a way that any user can easily map
> into their own code and Arrow library.
>
> On Mon, Jul 8, 2024 at 12:42 PM Weston Pace <weston.p...@gmail.com> wrote:
>
> > +1 for empty stream/file as schema serialization. I have used this
> > approach myself on more than one occasion and it works well. It can
> > even be useful for transmitting schemas between different Arrow-native
> > libraries in the same language (e.g. rust->rust), since it allows the
> > different libraries to use different Arrow versions.
> >
> > There is one other approach if you only need intra-process
> > serialization (e.g. between threads / libraries in the same process):
> > the C data interface
> > (https://arrow.apache.org/docs/format/CDataInterface.html). It is
> > maybe a slightly more complex API (because of the release callback),
> > and I think it is unlikely to be significantly faster (unless you have
> > an abnormally large schema). However, it has the same advantages and
> > might be useful if you are already using the C data interface
> > elsewhere.
> >
> > On Mon, Jul 8, 2024 at 8:27 AM Matt Topol <zotthewiz...@gmail.com> wrote:
> >
> > > Hey Jeremy,
> > >
> > > Currently the first message of an IPC stream is a Schema message,
> > > which consists solely of a flatbuffer message and is defined in the
> > > Schema.fbs file of the Arrow repo. All of the libraries that can
> > > read Arrow IPC should also be able to convert a single IPC schema
> > > message back into an Arrow schema without issue. Would that be
> > > sufficient for you?
> > >
> > > On Mon, Jul 8, 2024 at 11:12 AM Jeremy Leibs <jer...@rerun.io> wrote:
> > >
> > > > I'm looking for any advice folks may have on a generic way to
> > > > document and represent expected Arrow schemas as part of an
> > > > interface definition.
> > > >
> > > > For context, our library provides a cross-language (Python, C++,
> > > > Rust) SDK for logging semantic multi-modal data (point clouds,
> > > > images, geometric transforms, bounding boxes, etc.). Each of these
> > > > primitive types has an associated Arrow schema, but to date we have
> > > > largely abstracted that away from our users through language-native
> > > > object types and a bunch of generated code to "serialize" things
> > > > into the Arrow buffers before transmitting via our IPC.
> > > >
> > > > We're trying to take steps toward making it easier for advanced
> > > > users to write and read data from the store directly using Arrow,
> > > > without needing to go in and out of an intermediate object-oriented
> > > > representation. However, doing this means documenting for users,
> > > > for example: "This is the Arrow schema to use when sending a point
> > > > cloud with a color channel."
> > > >
> > > > I would love it if, eventually, the Arrow project had a way of
> > > > defining a spec file similar to a .proto or a .fbs, with all
> > > > libraries supporting loading of a schema object by directly
> > > > parsing the spec. Has anyone taken steps in this direction?
> > > >
> > > > The best alternative I have at the moment is to redundantly define
> > > > the schema for each of the three languages implicitly, by directly
> > > > providing the code to construct a datatype instance with the
> > > > correct schema. But this feels unfortunately messy and hard to
> > > > maintain.
> > > >
> > > > Thanks,
> > > > Jeremy
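For reference, the empty-stream approach Matt and Weston describe looks roughly like this with pyarrow (a sketch using Jeremy's quaternion schema, not tested against any particular version):

    import pyarrow as pa
    import pyarrow.ipc

    # The quaternion schema from Jeremy's example.
    schema = pa.schema(
        [pa.field(n, pa.float32(), nullable=False) for n in ("x", "y", "z", "w")]
    )

    # Writing a stream with zero record batches serializes only the
    # Schema message (plus the end-of-stream marker).
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, schema):
        pass
    schema_bytes = sink.getvalue()

    # Any Arrow implementation that reads IPC can recover the schema
    # from those bytes; here we read it back with pyarrow itself.
    reader = pa.ipc.open_stream(schema_bytes)
    assert reader.schema.equals(schema)

The same bytes should be readable by arrow-rs, arrow2, or the C++ IPC reader, which is what makes this usable as a cross-language schema spec.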