Re: [DISCUSS] Approach to generic schema representation

Jeremy Leibs Mon, 08 Jul 2024 13:05:36 -0700

Thanks for the links. That's very helpful context.

It's a shame the flatbuffer <-> json conversion isn't more widely
available, though I do see the complexity now.


It sounds like our best path forward for now will be to generate a pair of
assets for each of our types:
 - A binary fbs-encoded IPC schema definition as suggested initially.
 - An ad hoc human-readable textual format focused purely on documentation
clarity.

We can then provide some example code for loading the binary schema
definitions generically and probably even a mini utility that generates the
human-readable representation from the binary one.

We just won't worry about going from text back to the binary format.

On Mon, Jul 8, 2024 at 3:19 PM Ian Cook <ianmc...@apache.org> wrote:

> This has come up a few times in the past [1][2]. The main concern has been
> about cross-version compatibility guarantees.
>
> [1] https://github.com/apache/arrow/issues/25078
> [2] https://lists.apache.org/thread/02p37yxksxccsqfn9l6j4ryno404ttnl
>
> On Mon, Jul 8, 2024 at 3:10 PM Lee, David (PAG)
> <david....@blackrock.com.invalid> wrote:
>
> > Gah, found a bug in my code. Here's a corrected Python version:
> >
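> > import pyarrow as pa
> >
> > # iterate through possible nested columns
> > def _convert_to_arrow_type(field, obj):
> >     """
> >     :param field: name of the field being converted
> >     :param obj: alias string, dict (struct) or one-element list (list)
> >     :returns: pyarrow datatype
> >
> >     """
> >     if isinstance(obj, list):
> >         # a list spec holds exactly one element: the item type
> >         if len(obj) != 1:
> >             raise ValueError(f"{field}: expected a one-element list")
> >         return pa.list_(_convert_to_arrow_type(field, obj[0]))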
> >     elif isinstance(obj, dict):
> >         items = []
> >         for k, child_obj in obj.items():
> >             pa_type = _convert_to_arrow_type(k, child_obj)
> >             items.append((k, pa_type))
> >         return pa.struct(items)
> >     else:
> >         if isinstance(obj, str):
> >             if obj == "timestamp":
> >                 # default timestamp to microsecond precision
> >                 obj = "timestamp[us]"
> >             elif obj == "date":
> >                 # default date to date32 (an alias for date32[day])
> >                 obj = "date32"
> >             elif obj == "int":
> >                 # default int to int32
> >                 obj = "int32"
> >             obj = pa.type_for_alias(obj)
> >         return obj
> >
> > # iterate through columns to create a schema
> > def _convert_to_arrow_schema(fields_dict):
> >     """
> >
> >     :param fields_dict:
> >     :returns: pyarrow schema
> >
> >     """
> >     columns = []
> >     for field, typ in fields_dict.items():
> >         pa_type = _convert_to_arrow_type(field, typ)
> >         columns.append(pa.field(field, pa_type))
> >     schema = pa.schema(columns)
> >     return schema
> >
> > -----Original Message-----
> > From: Lee, David (PAG) <david....@blackrock.com.INVALID>
> > Sent: Monday, July 8, 2024 11:58 AM
> > To: dev@arrow.apache.org
> > Subject: RE: [DISCUSS] Approach to generic schema representation
> >
> > I came up with my own JSON representation that I can put into JSON /
> > YAML config files, plus some Python code to convert it into a pyarrow
> > schema object.
> >
> > ------------- yaml flat example-------------
> > fields:
> >   cusip: string
> >   start_date: date32
> >   end_date: date32
> >   purpose: string
> >   source: string
> >   flow: float32
> >   flow_usd: float32
> >   currency: string
> >
> > -------------yaml nested example-------------
> > fields:
> >   cusip: string
> >   start_date: date32
> >   regions:
> >     [string]         # list of strings
> >   primary_benchmark: # struct
> >     id: string
> >     name: string
> >   all_benchmarks:    # list of structs
> >   -
> >     id: string
> >     name: string
> >
> > Code:
> >
> > def _convert_to_arrow_type(field, obj):
> >     """
> >     :param field:
> >     :param obj:
> >     :returns: pyarrow datatype
> >
> >     """
> >     if isinstance(obj, list):
> >         for child_obj in obj:
> >             pa_type = _convert_to_arrow_type(field, child_obj)
> >         return pa.list_(pa_type)
> >     elif isinstance(obj, dict):
> >         items = []
> >         for k, child_obj in obj.items():
> >             pa_type = _convert_to_arrow_type(k, child_obj)
> >             items.append((k, pa_type))
> >         return pa.struct(items)
> >     else:
> >         if isinstance(obj, str):
> >             obj = pa.type_for_alias(obj)
> >         return obj
> >
> >
> > def _convert_to_arrow_schema(fields_dict):
> >     """
> >
> >     :param fields_dict:
> >     :returns: pyarrow schema
> >
> >     """
> >     columns = []
> >     for field, typ in fields_dict.items():
> >         if typ == "timestamp":
> >             # default timestamp to microsecond precision
> >             typ = "timestamp[us]"
> >         elif typ == "date":
> >             # default date to date32 which is an alias for date32[day]
> >             typ = "date32"
> >         elif typ == "int":
> >             # default int to int32
> >             typ = "int32"
> >         pa_type = _convert_to_arrow_type(field, typ)
> >         columns.append(pa.field(field, pa_type))
> >     schema = pa.schema(columns)
> >     return schema
> >
> > -----Original Message-----
> > From: Weston Pace <weston.p...@gmail.com>
> > Sent: Monday, July 8, 2024 9:43 AM
> > To: dev@arrow.apache.org
> > Subject: Re: [DISCUSS] Approach to generic schema representation
> >
> > +1 for empty stream/file as schema serialization.  I have used this
> > approach myself on more than one occasion and it works well.  It can even
> > be useful for transmitting schemas between different arrow-native
> > libraries in the same language (e.g. rust->rust) since it allows the
> > different libraries to use different arrow versions.
> >
> > There is one other approach if you only need intra-process serialization
> > (e.g. between threads / libraries in the same process).  You can use the
> > C data interface
> > (https://arrow.apache.org/docs/format/CDataInterface.html).
> > It is maybe a slightly more complex API (because of the release callback)
> > and I think it is unlikely to be significantly faster (unless you have an
> > abnormally large schema).  However, it has the same advantages and might
> > be useful if you are already using the C data interface elsewhere.
> >
> >
> > On Mon, Jul 8, 2024 at 8:27 AM Matt Topol <zotthewiz...@gmail.com> wrote:
> >
> > > Hey Jeremy,
> > >
> > > Currently the first message of an IPC stream is a Schema message which
> > > consists solely of a flatbuffer message and is defined in the Schema.fbs
> > > file of the Arrow repo. All of the libraries that can read Arrow IPC
> > > should be able to also handle converting a single IPC schema message
> > > back into an Arrow schema without issue. Would that be sufficient for
> > > you?
> > >
> > > On Mon, Jul 8, 2024 at 11:12 AM Jeremy Leibs <jer...@rerun.io> wrote:
> > >
> > > > I'm looking for any advice folks may have on a generic way to
> > > > document and represent expected arrow schemas as part of an
> > > > interface definition.
> > > >
> > > > For context, our library provides a cross-language (python, c++,
> > > > rust) SDK for logging semantic multi-modal data (point clouds,
> > > > images, geometric transforms, bounding boxes, etc.). Each of these
> > > > primitive types has an associated arrow schema, but to date we have
> > > > largely abstracted that from our users through language-native object
> > > > types, and a bunch of generated code to "serialize" stuff into the
> > > > arrow buffers before transmitting via our IPC.
> > > >
> > > > We're trying to take steps in the direction of making it easier for
> > > > advanced users to write and read data from the store directly using
> > > > arrow, without needing to go in-and-out of an intermediate
> > > > object-oriented representation. However, doing this means documenting
> > > > to users, for example: "This is the arrow schema to use when sending a
> > > > point cloud with a color channel".
> > > >
> > > > I would love it if, eventually, the arrow project had a way of
> > > > defining a spec file similar to a .proto or a .fbs, with all
> > > > libraries supporting loading of a schema object by directly parsing
> > > > the spec. Has anyone taken steps in this direction?
> > > >
> > > > The best alternative I have at the moment is to redundantly define
> > > > the schema for each of the 3 languages implicitly by directly
> > > > providing the code to construct a datatype instance with the correct
> > > > schema. But this feels unfortunately messy and hard to maintain.
> > > >
> > > > Thanks,
> > > > Jeremy
> > > >
> > >
> >
> >
>
