Thanks for the links. That's very helpful context. It's a shame the flatbuffer <-> json conversion isn't more widely available, though I do see the complexity now.
It sounds like our best path forward for now will be to generate a pair of
assets for each of our types:
- A binary fbs-encoded IPC schema definition, as suggested initially.
- An ad hoc human-readable textual format focused purely on documentation
  clarity.

We can then provide some example code for loading the binary schema
definitions generically, and probably even a mini utility that generates the
human-readable representation from the binary one. We just won't worry about
going from text back to the binary format.

On Mon, Jul 8, 2024 at 3:19 PM Ian Cook <ianmc...@apache.org> wrote:

> This has come up a few times in the past [1][2]. The main concern has been
> about cross-version compatibility guarantees.
>
> [1] https://github.com/apache/arrow/issues/25078
> [2] https://lists.apache.org/thread/02p37yxksxccsqfn9l6j4ryno404ttnl
>
> On Mon, Jul 8, 2024 at 3:10 PM Lee, David (PAG)
> <david....@blackrock.com.invalid> wrote:
>
> > Gah found a bug with my code.. Here's a corrected python version..
> >
> > # iterate through possible nested columns
> > def _convert_to_arrow_type(field, obj):
> >     """
> >     :param field:
> >     :param obj:
> >     :returns: pyarrow datatype
> >
> >     """
> >     if isinstance(obj, list):
> >         for child_obj in obj:
> >             pa_type = _convert_to_arrow_type(field, child_obj)
> >         return pa.list_(pa_type)
> >     elif isinstance(obj, dict):
> >         items = []
> >         for k, child_obj in obj.items():
> >             pa_type = _convert_to_arrow_type(k, child_obj)
> >             items.append((k, pa_type))
> >         return pa.struct(items)
> >     else:
> >         if isinstance(obj, str):
> >             if obj == "timestamp":
> >                 # default timestamp to microsecond precision
> >                 obj = "timestamp[us]"
> >             elif obj == "date":
> >                 # default date to date32 which is an alias for date32[day]
> >                 obj = "date32"
> >             elif obj == "int":
> >                 # default int to int32
> >                 obj = "int32"
> >             obj = pa.type_for_alias(obj)
> >         return obj
> >
> > # iterate through columns to create a schema
> > def _convert_to_arrow_schema(fields_dict):
> >     """
> >
> >     :param fields_dict:
> >     :returns: pyarrow schema
> >
> >     """
> >     columns = []
> >     for field, typ in fields_dict.items():
> >         pa_type = _convert_to_arrow_type(field, typ)
> >         columns.append(pa.field(field, pa_type))
> >     schema = pa.schema(columns)
> >     return schema
> >
> > -----Original Message-----
> > From: Lee, David (PAG) <david....@blackrock.com.INVALID>
> > Sent: Monday, July 8, 2024 11:58 AM
> > To: dev@arrow.apache.org
> > Subject: RE: [DISCUSS] Approach to generic schema representation
> >
> > External Email: Use caution with links and attachments
> >
> >
> > I came up with my own json representation that I could put into json /
> > yaml config files with some python code to convert this into a pyarrow
> > schema object..
> >
> > ------------- yaml flat example-------------
> > fields:
> >   cusip: string
> >   start_date: date32
> >   end_date: date32
> >   purpose: string
> >   source: string
> >   flow: float32
> >   flow_usd: float32
> >   currency: string
> >
> > -------------yaml nested example-------------
> > fields:
> >   cusip: string
> >   start_date: date32
> >   regions:
> >     [string]   << list of strings
> >   primary_benchmark:   << struct
> >     id: string
> >     name: string
> >   all_benchmarks:   << list of structs
> >     -
> >       id: string
> >       name: string
> >
> > Code:
> >
> > def _convert_to_arrow_type(field, obj):
> >     """
> >     :param field:
> >     :param obj:
> >     :returns: pyarrow datatype
> >
> >     """
> >     if isinstance(obj, list):
> >         for child_obj in obj:
> >             pa_type = _convert_to_arrow_type(field, child_obj)
> >         return pa.list_(pa_type)
> >     elif isinstance(obj, dict):
> >         items = []
> >         for k, child_obj in obj.items():
> >             pa_type = _convert_to_arrow_type(k, child_obj)
> >             items.append((k, pa_type))
> >         return pa.struct(items)
> >     else:
> >         if isinstance(obj, str):
> >             obj = pa.type_for_alias(obj)
> >         return obj
> >
> >
> > def _convert_to_arrow_schema(fields_dict):
> >     """
> >
> >     :param fields_dict:
> >     :returns: pyarrow schema
> >
> >     """
> >     columns = []
> >     for field, typ in fields_dict.items():
> >         if typ == "timestamp":
> >             # default timestamp to microsecond precision
> >             typ = "timestamp[us]"
> >         elif typ == "date":
> >             # default date to date32 which is an alias for date32[day]
> >             typ = "date32"
> >         elif typ == "int":
> >             # default int to int32
> >             typ = "int32"
> >         pa_type = _convert_to_arrow_type(field, typ)
> >         columns.append(pa.field(field, pa_type))
> >     schema = pa.schema(columns)
> >     return schema
> >
> > -----Original Message-----
> > From: Weston Pace <weston.p...@gmail.com>
> > Sent: Monday, July 8, 2024 9:43 AM
> > To: dev@arrow.apache.org
> > Subject: Re: [DISCUSS] Approach to generic schema representation
> >
> > +1
> > for empty stream/file as schema serialization. I have used this
> > approach myself on more than one occasion and it works well. It can even
> > be useful for transmitting schemas between different arrow-native
> > libraries in the same language (e.g. rust->rust) since it allows the
> > different libraries to use different arrow versions.
> >
> > There is one other approach if you only need intra-process serialization
> > (e.g. between threads / libraries in the same process). You can use the
> > C data interface
> > (https://arrow.apache.org/docs/format/CDataInterface.html).
> > It is maybe a slightly more complex API (because of the release callback)
> > and I think it is unlikely to be significantly faster (unless you have an
> > abnormally large schema). However, it has the same advantages and might
> > be useful if you are already using the C data interface elsewhere.
> >
> >
> > On Mon, Jul 8, 2024 at 8:27 AM Matt Topol <zotthewiz...@gmail.com> wrote:
> >
> > > Hey Jeremy,
> > >
> > > Currently the first message of an IPC stream is a Schema message which
> > > consists solely of a flatbuffer message and defined in the Schema.fbs
> > > file of the Arrow repo. All of the libraries that can read Arrow IPC
> > > should be able to also handle converting a single IPC schema message
> > > back into an Arrow schema without issue. Would that be sufficient for
> > > you?
> > >
> > > On Mon, Jul 8, 2024 at 11:12 AM Jeremy Leibs <jer...@rerun.io> wrote:
> > >
> > > > I'm looking for any advice folks may have on a generic way to
> > > > document and represent expected arrow schemas as part of an
> > > > interface definition.
> > > >
> > > > For context, our library provides a cross-language (python, c++, rust)
> > > > SDK for logging semantic multi-modal data (point clouds, images,
> > > > geometric transforms, bounding boxes, etc.). Each of these primitive
> > > > types has an associated arrow schema, but to date we have largely
> > > > abstracted that from our users through language-native object types,
> > > > and a bunch of generated code to "serialize" stuff into the arrow
> > > > buffers before transmitting via our IPC.
> > > >
> > > > We're trying to take steps in the direction of making it easier for
> > > > advanced users to write and read data from the store directly using
> > > > arrow, without needing to go in-and-out of an intermediate
> > > > object-oriented representation. However, doing this means documenting
> > > > to users, for example: "This is the arrow schema to use when sending
> > > > a point cloud with a color channel".
> > > >
> > > > I would love it if, eventually, the arrow project had a way of
> > > > defining a spec file similar to a .proto or a .fbs, with all
> > > > libraries supporting loading of a schema object by directly parsing
> > > > the spec. Has anyone taken steps in this direction?
> > > >
> > > > The best alternative I have at the moment is to redundantly define
> > > > the schema for each of the 3 languages implicitly by directly
> > > > providing the code to construct a datatype instance with the correct
> > > > schema. But this feels unfortunately messy and hard to maintain.
> > > >
> > > > Thanks,
> > > > Jeremy