This has come up a few times in the past [1][2]. The main concern has been about cross-version compatibility guarantees.
[1] https://github.com/apache/arrow/issues/25078
[2] https://lists.apache.org/thread/02p37yxksxccsqfn9l6j4ryno404ttnl

On Mon, Jul 8, 2024 at 3:10 PM Lee, David (PAG)
<david....@blackrock.com.invalid> wrote:

> Gah found a bug with my code.. Here's a corrected python version..
>
> # iterate through possible nested columns
> def _convert_to_arrow_type(field, obj):
>     """
>     :param field:
>     :param obj:
>     :returns: pyarrow datatype
>     """
>     if isinstance(obj, list):
>         for child_obj in obj:
>             pa_type = _convert_to_arrow_type(field, child_obj)
>         return pa.list_(pa_type)
>     elif isinstance(obj, dict):
>         items = []
>         for k, child_obj in obj.items():
>             pa_type = _convert_to_arrow_type(k, child_obj)
>             items.append((k, pa_type))
>         return pa.struct(items)
>     else:
>         if isinstance(obj, str):
>             if obj == "timestamp":
>                 # default timestamp to microsecond precision
>                 obj = "timestamp[us]"
>             elif obj == "date":
>                 # default date to date32, which is an alias for date32[day]
>                 obj = "date32"
>             elif obj == "int":
>                 # default int to int32
>                 obj = "int32"
>             obj = pa.type_for_alias(obj)
>         return obj
>
> # iterate through columns to create a schema
> def _convert_to_arrow_schema(fields_dict):
>     """
>     :param fields_dict:
>     :returns: pyarrow schema
>     """
>     columns = []
>     for field, typ in fields_dict.items():
>         pa_type = _convert_to_arrow_type(field, typ)
>         columns.append(pa.field(field, pa_type))
>     schema = pa.schema(columns)
>     return schema
>
> -----Original Message-----
> From: Lee, David (PAG) <david....@blackrock.com.INVALID>
> Sent: Monday, July 8, 2024 11:58 AM
> To: dev@arrow.apache.org
> Subject: RE: [DISCUSS] Approach to generic schema representation
>
> I came up with my own json representation that I could put into json /
> yaml config files, with some python code to convert this into a pyarrow
> schema object..
> ------------- yaml flat example -------------
> fields:
>   cusip: string
>   start_date: date32
>   end_date: date32
>   purpose: string
>   source: string
>   flow: float32
>   flow_usd: float32
>   currency: string
>
> ------------- yaml nested example -------------
> fields:
>   cusip: string
>   start_date: date32
>   regions:
>     [string]            << list of strings
>   primary_benchmark:    << struct
>     id: string
>     name: string
>   all_benchmarks:       << list of structs
>     -
>       id: string
>       name: string
>
> Code:
>
> def _convert_to_arrow_type(field, obj):
>     """
>     :param field:
>     :param obj:
>     :returns: pyarrow datatype
>     """
>     if isinstance(obj, list):
>         for child_obj in obj:
>             pa_type = _convert_to_arrow_type(field, child_obj)
>         return pa.list_(pa_type)
>     elif isinstance(obj, dict):
>         items = []
>         for k, child_obj in obj.items():
>             pa_type = _convert_to_arrow_type(k, child_obj)
>             items.append((k, pa_type))
>         return pa.struct(items)
>     else:
>         if isinstance(obj, str):
>             obj = pa.type_for_alias(obj)
>         return obj
>
> def _convert_to_arrow_schema(fields_dict):
>     """
>     :param fields_dict:
>     :returns: pyarrow schema
>     """
>     columns = []
>     for field, typ in fields_dict.items():
>         if typ == "timestamp":
>             # default timestamp to microsecond precision
>             typ = "timestamp[us]"
>         elif typ == "date":
>             # default date to date32, which is an alias for date32[day]
>             typ = "date32"
>         elif typ == "int":
>             # default int to int32
>             typ = "int32"
>         pa_type = _convert_to_arrow_type(field, typ)
>         columns.append(pa.field(field, pa_type))
>     schema = pa.schema(columns)
>     return schema
>
> -----Original Message-----
> From: Weston Pace <weston.p...@gmail.com>
> Sent: Monday, July 8, 2024 9:43 AM
> To: dev@arrow.apache.org
> Subject: Re: [DISCUSS] Approach to generic schema representation
>
> +1 for empty stream/file as schema serialization. I have used this
> approach myself on more than one occasion and it works well.
> It can even be useful for transmitting schemas between different
> arrow-native libraries in the same language (e.g. rust -> rust), since it
> allows the different libraries to use different Arrow versions.
>
> There is one other approach if you only need intra-process serialization
> (e.g. between threads / libraries in the same process): the C data
> interface (https://arrow.apache.org/docs/format/CDataInterface.html).
> It is maybe a slightly more complex API (because of the release callback),
> and I think it is unlikely to be significantly faster (unless you have an
> abnormally large schema). However, it has the same advantages and might be
> useful if you are already using the C data interface elsewhere.
>
> On Mon, Jul 8, 2024 at 8:27 AM Matt Topol <zotthewiz...@gmail.com> wrote:
>
> > Hey Jeremy,
> >
> > Currently the first message of an IPC stream is a Schema message, which
> > consists solely of a flatbuffer message defined in the Schema.fbs file
> > of the Arrow repo. All of the libraries that can read Arrow IPC should
> > also be able to convert a single IPC schema message back into an Arrow
> > schema without issue. Would that be sufficient for you?
> >
> > On Mon, Jul 8, 2024 at 11:12 AM Jeremy Leibs <jer...@rerun.io> wrote:
> >
> > > I'm looking for any advice folks may have on a generic way to
> > > document and represent expected arrow schemas as part of an
> > > interface definition.
> > >
> > > For context, our library provides a cross-language (Python, C++,
> > > Rust) SDK for logging semantic multi-modal data (point clouds,
> > > images, geometric transforms, bounding boxes, etc.). Each of these
> > > primitive types has an associated arrow schema, but to date we have
> > > largely abstracted that away from our users through language-native
> > > object types and a bunch of generated code to "serialize" stuff into
> > > the arrow buffers before transmitting via our IPC.
> > >
> > > We're trying to take steps toward making it easier for advanced
> > > users to write and read data from the store directly using arrow,
> > > without needing to go in and out of an intermediate object-oriented
> > > representation. However, doing this means documenting to users, for
> > > example: "This is the arrow schema to use when sending a point cloud
> > > with a color channel."
> > >
> > > I would love it if, eventually, the arrow project had a way of
> > > defining a spec file similar to a .proto or a .fbs, with all
> > > libraries supporting loading of a schema object by directly parsing
> > > the spec. Has anyone taken steps in this direction?
> > >
> > > The best alternative I have at the moment is to redundantly define
> > > the schema for each of the 3 languages implicitly by directly
> > > providing the code to construct a datatype instance with the correct
> > > schema. But this feels unfortunately messy and hard to maintain.
> > >
> > > Thanks,
> > > Jeremy