This has come up a few times in the past [1][2]. The main concern has been about cross-version compatibility guarantees.
[1] https://github.com/apache/arrow/issues/25078
[2] https://lists.apache.org/thread/02p37yxksxccsqfn9l6j4ryno404ttnl

On Mon, Jul 8, 2024 at 3:10 PM Lee, David (PAG)
<david....@blackrock.com.invalid> wrote:

> Gah found a bug with my code.. Here's a corrected python version..
>
> # iterate through possible nested columns
> def _convert_to_arrow_type(field, obj):
>     """
>     :param field:
>     :param obj:
>     :returns: pyarrow datatype
>     """
>     if isinstance(obj, list):
>         for child_obj in obj:
>             pa_type = _convert_to_arrow_type(field, child_obj)
>         return pa.list_(pa_type)
>     elif isinstance(obj, dict):
>         items = []
>         for k, child_obj in obj.items():
>             pa_type = _convert_to_arrow_type(k, child_obj)
>             items.append((k, pa_type))
>         return pa.struct(items)
>     else:
>         if isinstance(obj, str):
>             if obj == "timestamp":
>                 # default timestamp to microsecond precision
>                 obj = "timestamp[us]"
>             elif obj == "date":
>                 # default date to date32, which is an alias for date32[day]
>                 obj = "date32"
>             elif obj == "int":
>                 # default int to int32
>                 obj = "int32"
>             obj = pa.type_for_alias(obj)
>         return obj
>
> # iterate through columns to create a schema
> def _convert_to_arrow_schema(fields_dict):
>     """
>     :param fields_dict:
>     :returns: pyarrow schema
>     """
>     columns = []
>     for field, typ in fields_dict.items():
>         pa_type = _convert_to_arrow_type(field, typ)
>         columns.append(pa.field(field, pa_type))
>     schema = pa.schema(columns)
>     return schema
>
> -----Original Message-----
> From: Lee, David (PAG) <david....@blackrock.com.INVALID>
> Sent: Monday, July 8, 2024 11:58 AM
> To: dev@arrow.apache.org
> Subject: RE: [DISCUSS] Approach to generic schema representation
>
> I came up with my own json representation that I could put into json /
> yaml config files, with some python code to convert this into a pyarrow
> schema object..
> ------------- yaml flat example -------------
> fields:
>   cusip: string
>   start_date: date32
>   end_date: date32
>   purpose: string
>   source: string
>   flow: float32
>   flow_usd: float32
>   currency: string
>
> ------------- yaml nested example -------------
> fields:
>   cusip: string
>   start_date: date32
>   regions:
>     [string]            << list of strings
>   primary_benchmark:    << struct
>     id: string
>     name: string
>   all_benchmarks:       << list of structs
>     -
>       id: string
>       name: string
>
> Code:
>
> def _convert_to_arrow_type(field, obj):
>     """
>     :param field:
>     :param obj:
>     :returns: pyarrow datatype
>     """
>     if isinstance(obj, list):
>         for child_obj in obj:
>             pa_type = _convert_to_arrow_type(field, child_obj)
>         return pa.list_(pa_type)
>     elif isinstance(obj, dict):
>         items = []
>         for k, child_obj in obj.items():
>             pa_type = _convert_to_arrow_type(k, child_obj)
>             items.append((k, pa_type))
>         return pa.struct(items)
>     else:
>         if isinstance(obj, str):
>             obj = pa.type_for_alias(obj)
>         return obj
>
> def _convert_to_arrow_schema(fields_dict):
>     """
>     :param fields_dict:
>     :returns: pyarrow schema
>     """
>     columns = []
>     for field, typ in fields_dict.items():
>         if typ == "timestamp":
>             # default timestamp to microsecond precision
>             typ = "timestamp[us]"
>         elif typ == "date":
>             # default date to date32, which is an alias for date32[day]
>             typ = "date32"
>         elif typ == "int":
>             # default int to int32
>             typ = "int32"
>         pa_type = _convert_to_arrow_type(field, typ)
>         columns.append(pa.field(field, pa_type))
>     schema = pa.schema(columns)
>     return schema
>
> -----Original Message-----
> From: Weston Pace <weston.p...@gmail.com>
> Sent: Monday, July 8, 2024 9:43 AM
> To: dev@arrow.apache.org
> Subject: Re: [DISCUSS] Approach to generic schema representation
>
> +1 for empty stream/file as schema serialization. I have used this
> approach myself on more than one occasion and it works well.
> It can even be useful for transmitting schemas between different
> arrow-native libraries in the same language (e.g. rust -> rust), since it
> allows the different libraries to use different Arrow versions.
>
> There is one other approach if you only need intra-process serialization
> (e.g. between threads / libraries in the same process): the C data
> interface (https://arrow.apache.org/docs/format/CDataInterface.html).
> It is maybe a slightly more complex API (because of the release callback),
> and I think it is unlikely to be significantly faster (unless you have an
> abnormally large schema). However, it has the same advantages and might be
> useful if you are already using the C data interface elsewhere.
>
> On Mon, Jul 8, 2024 at 8:27 AM Matt Topol <zotthewiz...@gmail.com> wrote:
>
> > Hey Jeremy,
> >
> > Currently the first message of an IPC stream is a Schema message, which
> > consists solely of a flatbuffer message defined in the Schema.fbs file
> > of the Arrow repo. All of the libraries that can read Arrow IPC should
> > also be able to convert a single IPC schema message back into an Arrow
> > schema without issue. Would that be sufficient for you?
> >
> > On Mon, Jul 8, 2024 at 11:12 AM Jeremy Leibs <jer...@rerun.io> wrote:
> >
> > > I'm looking for any advice folks may have on a generic way to
> > > document and represent expected arrow schemas as part of an
> > > interface definition.
> > >
> > > For context, our library provides a cross-language (Python, C++,
> > > Rust) SDK for logging semantic multi-modal data (point clouds,
> > > images, geometric transforms, bounding boxes, etc.). Each of these
> > > primitive types has an associated arrow schema, but to date we have
> > > largely abstracted that away from our users through language-native
> > > object types and a bunch of generated code to "serialize" stuff into
> > > the arrow buffers before transmitting via our IPC.
> > >
> > > We're trying to take steps toward making it easier for advanced
> > > users to write and read data from the store directly using arrow,
> > > without needing to go in and out of an intermediate object-oriented
> > > representation. However, doing this means documenting to users, for
> > > example: "This is the arrow schema to use when sending a point cloud
> > > with a color channel."
> > >
> > > I would love it if, eventually, the arrow project had a way of
> > > defining a spec file similar to a .proto or a .fbs, with all
> > > libraries supporting loading of a schema object by directly parsing
> > > the spec. Has anyone taken steps in this direction?
> > >
> > > The best alternative I have at the moment is to redundantly define
> > > the schema for each of the 3 languages implicitly by directly
> > > providing the code to construct a datatype instance with the correct
> > > schema. But this feels unfortunately messy and hard to maintain.
> > >
> > > Thanks,
> > > Jeremy