> but it doesn't address questions of the kind of user-facing API
> documentation someone would need to practically form and/or process data
> when integrating a library into their code.
Agreed that IPC / flatbuffers / proto are not useful here. JSON might help,
and YAML would be more pleasantly concise.

> This has come up a few times in the past [1][2]. The main concern has been
> about cross-version compatibility guarantees.

I think the biggest obstacle has actually been that people proposing the
idea haven't taken it to the PR / proposal stage. I suspect that once people
get something working for themselves, they often don't need
interoperability, and the motivation to upstream and maintain it drops.

> user-facing API documentation someone would need to practically form and/or
> process data when integrating a library into their code.

If we are thinking API contract / programmatic access, then I'd offer yet
another alternative. At Lance we have found that many of our
non-data-engineer users don't want to think "columnar" or "arrow" at all
(both of these things are "database internals"). They come from traditional
database backgrounds and are used to either working with native types or
with more traditional ORM-style approaches.

In Python we use Pydantic [1] (LanceModel extends pydantic.BaseModel):

```
class Movie(LanceModel):
    movie_id: int
    vector: Vector(128)          # extension types
    genres: Optional[List[str]]  # nullability
    title: str
    imdb_id: int
```

Then, in your API contracts, you can just use `Movie` or `List[Movie]` as
the argument type or return type, and you can optionally accept / return
RecordBatch for users that DO want to use Arrow directly. By combining
schema introspection and pydantic, you should be able to do things like
"verify a record batch satisfies the schema defined by the pydantic model".

In Rust we don't yet have an equivalent, but I am interested in serde_arrow
[2], which would allow the well-known `serde` library to fulfill the same
role as `pydantic` (it's not as exact a fit, and we may still end up needing
to do our own thing).

[1] https://docs.pydantic.dev/latest/
[2] https://docs.rs/serde_arrow/latest/serde_arrow/
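For illustration, here is a rough sketch of that last idea in plain pydantic
plus pyarrow. This is not LanceDB's actual implementation; the helper names
(`model_to_schema`, `batch_matches`) and the scalar-type mapping are made up
for this example:

```
import typing
from typing import List, Optional, get_args, get_origin

import pyarrow as pa
from pydantic import BaseModel

# Illustrative scalar mapping only; a real version would cover more types.
_SCALARS = {int: pa.int64(), float: pa.float64(), str: pa.utf8(), bool: pa.bool_()}

def _annotation_to_arrow(annotation) -> pa.DataType:
    origin = get_origin(annotation)
    if origin is list:          # List[T] -> list<T>
        return pa.list_(_annotation_to_arrow(get_args(annotation)[0]))
    if origin is typing.Union:  # Optional[T] is Union[T, None]
        inner = [a for a in get_args(annotation) if a is not type(None)]
        return _annotation_to_arrow(inner[0])
    return _SCALARS[annotation]

def model_to_schema(model) -> pa.Schema:
    # model_fields is pydantic v2; v1 used __fields__
    return pa.schema(
        [pa.field(name, _annotation_to_arrow(info.annotation))
         for name, info in model.model_fields.items()]
    )

def batch_matches(batch: pa.RecordBatch, model) -> bool:
    return batch.schema.equals(model_to_schema(model))

class Movie(BaseModel):
    movie_id: int
    genres: Optional[List[str]]
    title: str

batch = pa.RecordBatch.from_pydict(
    {"movie_id": [1], "genres": [["sci-fi"]], "title": ["Dune"]}
)
print(batch_matches(batch, Movie))  # True
```

A real implementation would also need to handle nested models, extension
types like `Vector(128)`, and per-field nullability.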
On Mon, Jul 8, 2024 at 12:19 PM Ian Cook <ianmc...@apache.org> wrote:

> This has come up a few times in the past [1][2]. The main concern has been
> about cross-version compatibility guarantees.
>
> [1] https://github.com/apache/arrow/issues/25078
> [2] https://lists.apache.org/thread/02p37yxksxccsqfn9l6j4ryno404ttnl
>
> On Mon, Jul 8, 2024 at 3:10 PM Lee, David (PAG)
> <david....@blackrock.com.invalid> wrote:
>
> > Gah, found a bug with my code... Here's a corrected Python version:
> >
> > import pyarrow as pa
> >
> > # iterate through possible nested columns
> > def _convert_to_arrow_type(field, obj):
> >     """
> >     :param field:
> >     :param obj:
> >     :returns: pyarrow datatype
> >     """
> >     if isinstance(obj, list):
> >         for child_obj in obj:
> >             pa_type = _convert_to_arrow_type(field, child_obj)
> >         return pa.list_(pa_type)
> >     elif isinstance(obj, dict):
> >         items = []
> >         for k, child_obj in obj.items():
> >             pa_type = _convert_to_arrow_type(k, child_obj)
> >             items.append((k, pa_type))
> >         return pa.struct(items)
> >     else:
> >         if isinstance(obj, str):
> >             if obj == "timestamp":
> >                 # default timestamp to microsecond precision
> >                 obj = "timestamp[us]"
> >             elif obj == "date":
> >                 # default date to date32 which is an alias for date32[day]
> >                 obj = "date32"
> >             elif obj == "int":
> >                 # default int to int32
> >                 obj = "int32"
> >             obj = pa.type_for_alias(obj)
> >         return obj
> >
> > # iterate through columns to create a schema
> > def _convert_to_arrow_schema(fields_dict):
> >     """
> >     :param fields_dict:
> >     :returns: pyarrow schema
> >     """
> >     columns = []
> >     for field, typ in fields_dict.items():
> >         pa_type = _convert_to_arrow_type(field, typ)
> >         columns.append(pa.field(field, pa_type))
> >     schema = pa.schema(columns)
> >     return schema
> >
> > -----Original Message-----
> > From: Lee, David (PAG) <david....@blackrock.com.INVALID>
> > Sent: Monday, July 8, 2024 11:58 AM
> > To: dev@arrow.apache.org
> > Subject: RE: [DISCUSS] Approach to generic schema representation
> >
> > I came up with my own JSON representation that I could put into json /
> > yaml config files, with some Python code to convert this into a pyarrow
> > schema object:
> >
> > ------------- yaml flat example -------------
> > fields:
> >   cusip: string
> >   start_date: date32
> >   end_date: date32
> >   purpose: string
> >   source: string
> >   flow: float32
> >   flow_usd: float32
> >   currency: string
> >
> > ------------- yaml nested example -------------
> > fields:
> >   cusip: string
> >   start_date: date32
> >   regions:              << list of strings
> >     [string]
> >   primary_benchmark:    << struct
> >     id: string
> >     name: string
> >   all_benchmarks:       << list of structs
> >     -
> >       id: string
> >       name: string
> >
> > Code:
> >
> > def _convert_to_arrow_type(field, obj):
> >     """
> >     :param field:
> >     :param obj:
> >     :returns: pyarrow datatype
> >     """
> >     if isinstance(obj, list):
> >         for child_obj in obj:
> >             pa_type = _convert_to_arrow_type(field, child_obj)
> >         return pa.list_(pa_type)
> >     elif isinstance(obj, dict):
> >         items = []
> >         for k, child_obj in obj.items():
> >             pa_type = _convert_to_arrow_type(k, child_obj)
> >             items.append((k, pa_type))
> >         return pa.struct(items)
> >     else:
> >         if isinstance(obj, str):
> >             obj = pa.type_for_alias(obj)
> >         return obj
> >
> > def _convert_to_arrow_schema(fields_dict):
> >     """
> >     :param fields_dict:
> >     :returns: pyarrow schema
> >     """
> >     columns = []
> >     for field, typ in fields_dict.items():
> >         if typ == "timestamp":
> >             # default timestamp to microsecond precision
> >             typ = "timestamp[us]"
> >         elif typ == "date":
> >             # default date to date32 which is an alias for date32[day]
> >             typ = "date32"
> >         elif typ == "int":
> >             # default int to int32
> >             typ = "int32"
> >         pa_type = _convert_to_arrow_type(field, typ)
> >         columns.append(pa.field(field, pa_type))
> >     schema = pa.schema(columns)
> >     return schema
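As a usage note: with the corrected functions from the top of this thread in
scope, the YAML specs above convert directly. A quick sketch, assuming
PyYAML is installed:

```
import yaml  # PyYAML

spec = """
fields:
  cusip: string
  start_date: date32
  regions: [string]       # list of strings
  primary_benchmark:      # struct
    id: string
    name: string
"""

schema = _convert_to_arrow_schema(yaml.safe_load(spec)["fields"])
print(schema)
# cusip: string
# start_date: date32[day]
# regions: list<item: string>
# primary_benchmark: struct<id: string, name: string>
```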
> > -----Original Message-----
> > From: Weston Pace <weston.p...@gmail.com>
> > Sent: Monday, July 8, 2024 9:43 AM
> > To: dev@arrow.apache.org
> > Subject: Re: [DISCUSS] Approach to generic schema representation
> >
> > +1 for empty stream/file as schema serialization. I have used this
> > approach myself on more than one occasion and it works well. It can
> > even be useful for transmitting schemas between different arrow-native
> > libraries in the same language (e.g. rust->rust), since it allows the
> > different libraries to use different arrow versions.
> >
> > There is one other approach if you only need intra-process
> > serialization (e.g. between threads / libraries in the same process).
> > You can use the C data interface
> > (https://arrow.apache.org/docs/format/CDataInterface.html).
> > It is maybe a slightly more complex API (because of the release
> > callback), and I think it is unlikely to be significantly faster
> > (unless you have an abnormally large schema). However, it has the same
> > advantages and might be useful if you are already using the C data
> > interface elsewhere.
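To make the empty-stream approach concrete, here is a minimal pyarrow
sketch (the schema and variable names are illustrative):

```
import pyarrow as pa
import pyarrow.ipc as ipc

schema = pa.schema([("cusip", pa.utf8()), ("flow", pa.float32())])

# Producer: an IPC stream whose only message is the Schema message.
sink = pa.BufferOutputStream()
with ipc.new_stream(sink, schema):
    pass  # intentionally write zero record batches
payload = sink.getvalue()  # pa.Buffer, safe to write to a file or socket

# Consumer: any Arrow IPC stream reader can recover the schema.
reader = ipc.open_stream(payload)
assert reader.schema.equals(schema)
```

pyarrow also exposes `Schema.serialize()` and `pa.ipc.read_schema()` for
the schema message alone, but an empty stream is readable by any Arrow IPC
implementation regardless of language or version.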
> > On Mon, Jul 8, 2024 at 8:27 AM Matt Topol <zotthewiz...@gmail.com> wrote:
> >
> > > Hey Jeremy,
> > >
> > > Currently the first message of an IPC stream is a Schema message,
> > > which consists solely of a flatbuffer message and is defined in the
> > > Schema.fbs file of the Arrow repo. All of the libraries that can read
> > > Arrow IPC should also be able to handle converting a single IPC
> > > schema message back into an Arrow schema without issue. Would that be
> > > sufficient for you?
> > >
> > > On Mon, Jul 8, 2024 at 11:12 AM Jeremy Leibs <jer...@rerun.io> wrote:
> > >
> > > > I'm looking for any advice folks may have on a generic way to
> > > > document and represent expected arrow schemas as part of an
> > > > interface definition.
> > > >
> > > > For context, our library provides a cross-language (python, c++,
> > > > rust) SDK for logging semantic multi-modal data (point clouds,
> > > > images, geometric transforms, bounding boxes, etc.). Each of these
> > > > primitive types has an associated arrow schema, but to date we have
> > > > largely abstracted that from our users through language-native
> > > > object types and a bunch of generated code to "serialize" stuff
> > > > into the arrow buffers before transmitting via our IPC.
> > > >
> > > > We're trying to take steps in the direction of making it easier for
> > > > advanced users to write and read data from the store directly using
> > > > arrow, without needing to go in and out of an intermediate
> > > > object-oriented representation. However, doing this means
> > > > documenting to users, for example: "This is the arrow schema to use
> > > > when sending a point cloud with a color channel."
> > > >
> > > > I would love it if, eventually, the arrow project had a way of
> > > > defining a spec file similar to a .proto or a .fbs, with all
> > > > libraries supporting loading of a schema object by directly parsing
> > > > the spec. Has anyone taken steps in this direction?
> > > >
> > > > The best alternative I have at the moment is to redundantly define
> > > > the schema for each of the 3 languages implicitly, by directly
> > > > providing the code to construct a datatype instance with the
> > > > correct schema. But this feels unfortunately messy and hard to
> > > > maintain.
> > > >
> > > > Thanks,
> > > > Jeremy