On Mon, Jul 8, 2024 at 3:57 PM Weston Pace <weston.p...@gmail.com> wrote:
> > user-facing API documentation someone would need to practically form
> > and/or process data when integrating a library into their code.
>
> If we are thinking API contract / programmatic access then I'd offer yet
> another alternative. At Lance we have found that many of our
> non-data-engineer users don't want to think "columnar" or "arrow" at all
> (both of these things are "database internals"). They come from
> traditional database backgrounds and are used to either working with
> native types or with more traditional ORM style approaches.
>
> In python we use Pydantic[1] (LanceModel extends pydantic.BaseModel):

Yes, providing pydantic definitions for our python type representations has
been in the back of my mind for a bit now. Great to hear that Lance is
having success there mapping to arrow. Our current object APIs generally
satisfy that class of users, but I agree that pydantic definitely has some
benefits in terms of being more standard and familiar.

> ```
> class Movie(LanceModel):
>     movie_id: int
>     vector: Vector(128)          # extension types
>     genres: Optional[List[str]]  # nullability
>     title: str
>     imdb_id: int
> ```
>
> Then, in your API contracts, you can just use `Movie` or `List[Movie]` as
> the argument type or return type, and you can optionally accept / return
> RecordBatch for users that DO want to use Arrow directly. By combining
> schema introspection & pydantic you should be able to do things like
> "verify a record batch satisfies the schema defined by the pydantic
> model".
>
> In Rust we don't yet have an equivalent but I am interested in
> serde_arrow [2], which would allow the well known `serde` library to
> fulfill the same role as `pydantic` (it's not as exact of a fit and we
> may still end up needing to do our own thing).

We haven't used serde_arrow, but we were previously using arrow2_convert
[1] to similar effect. The main issue we run into with conversion into and
out of a non-columnar representation is the overhead of the conversion
itself. At some point data throughput ends up bottlenecked on the
structural conversions, which leads us back to wanting to give
performance-minded users a way of submitting the arrow data directly, and
in turn needing to define how they should be structuring that data when
working directly with the arrow libraries.

[1] https://github.com/DataEngineeringLabs/arrow2-convert
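As a rough illustration of the combination Weston describes (schema
introspection + pydantic), here is a minimal sketch assuming only plain
pydantic v2 and pyarrow; the Movie model and the type mapping are
simplified stand-ins, not Lance's actual implementation:

```
import typing

import pyarrow as pa
from pydantic import BaseModel


# Hypothetical model mirroring the example above; the Vector extension
# type is omitted so the sketch needs nothing beyond pydantic + pyarrow.
class Movie(BaseModel):
    movie_id: int
    genres: typing.Optional[typing.List[str]] = None
    title: str


_PRIMITIVES = {int: pa.int64(), float: pa.float64(), str: pa.utf8(), bool: pa.bool_()}


def _to_arrow_type(ann):
    # Recursively map List[...] to pa.list_; everything else via the table.
    if typing.get_origin(ann) is list:
        (item,) = typing.get_args(ann)
        return pa.list_(_to_arrow_type(item))
    return _PRIMITIVES[ann]


def model_to_schema(model) -> pa.Schema:
    fields = []
    for name, info in model.model_fields.items():
        ann, nullable = info.annotation, False
        # Optional[X] surfaces as Union[X, None]: unwrap it and mark the
        # field nullable (non-Optional fields become non-nullable here,
        # which a real implementation would likely make configurable).
        if typing.get_origin(ann) is typing.Union and type(None) in typing.get_args(ann):
            nullable = True
            (ann,) = [a for a in typing.get_args(ann) if a is not type(None)]
        fields.append(pa.field(name, _to_arrow_type(ann), nullable=nullable))
    return pa.schema(fields)


def batch_satisfies_model(batch: pa.RecordBatch, model) -> bool:
    # "verify a record batch satisfies the schema defined by the pydantic model"
    return batch.schema.equals(model_to_schema(model))
```

An API could then accept `List[Movie]` for convenience and `RecordBatch`
for the fast path, with `batch_satisfies_model` as the validation step on
the fast path.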
> [1] https://docs.pydantic.dev/latest/
> [2] https://docs.rs/serde_arrow/latest/serde_arrow/
>
> On Mon, Jul 8, 2024 at 12:19 PM Ian Cook <ianmc...@apache.org> wrote:
>
> > This has come up a few times in the past [1][2]. The main concern has
> > been about cross-version compatibility guarantees.
> >
> > [1] https://github.com/apache/arrow/issues/25078
> > [2] https://lists.apache.org/thread/02p37yxksxccsqfn9l6j4ryno404ttnl
> >
> > On Mon, Jul 8, 2024 at 3:10 PM Lee, David (PAG)
> > <david....@blackrock.com.invalid> wrote:
> >
> > > Gah found a bug with my code.. Here's a corrected python version..
> > >
> > > import pyarrow as pa
> > >
> > > # iterate through possible nested columns
> > > def _convert_to_arrow_type(field, obj):
> > >     """
> > >     :param field:
> > >     :param obj:
> > >     :returns: pyarrow datatype
> > >     """
> > >     if isinstance(obj, list):
> > >         for child_obj in obj:
> > >             pa_type = _convert_to_arrow_type(field, child_obj)
> > >         return pa.list_(pa_type)
> > >     elif isinstance(obj, dict):
> > >         items = []
> > >         for k, child_obj in obj.items():
> > >             pa_type = _convert_to_arrow_type(k, child_obj)
> > >             items.append((k, pa_type))
> > >         return pa.struct(items)
> > >     else:
> > >         if isinstance(obj, str):
> > >             if obj == "timestamp":
> > >                 # default timestamp to microsecond precision
> > >                 obj = "timestamp[us]"
> > >             elif obj == "date":
> > >                 # default date to date32, an alias for date32[day]
> > >                 obj = "date32"
> > >             elif obj == "int":
> > >                 # default int to int32
> > >                 obj = "int32"
> > >             obj = pa.type_for_alias(obj)
> > >         return obj
> > >
> > > # iterate through columns to create a schema
> > > def _convert_to_arrow_schema(fields_dict):
> > >     """
> > >     :param fields_dict:
> > >     :returns: pyarrow schema
> > >     """
> > >     columns = []
> > >     for field, typ in fields_dict.items():
> > >         pa_type = _convert_to_arrow_type(field, typ)
> > >         columns.append(pa.field(field, pa_type))
> > >     schema = pa.schema(columns)
> > >     return schema
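For a quick sanity check of the corrected version, here is a hypothetical
round trip (assuming `import pyarrow as pa` and the two functions above are
in scope). The dict is the `fields` mapping that yaml.safe_load would
produce from the nested yaml example quoted further below:

```
fields = {
    "cusip": "string",
    "start_date": "date32",
    "regions": ["string"],                                    # list of strings
    "primary_benchmark": {"id": "string", "name": "string"},  # struct
    "all_benchmarks": [{"id": "string", "name": "string"}],   # list of structs
}
print(_convert_to_arrow_schema(fields))
# cusip: string
# start_date: date32[day]
# regions: list<item: string>
# primary_benchmark: struct<id: string, name: string>
# all_benchmarks: list<item: struct<id: string, name: string>>
```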
> > > -----Original Message-----
> > > From: Lee, David (PAG) <david....@blackrock.com.INVALID>
> > > Sent: Monday, July 8, 2024 11:58 AM
> > > To: dev@arrow.apache.org
> > > Subject: RE: [DISCUSS] Approach to generic schema representation
> > >
> > > I came up with my own json representation that I could put into
> > > json / yaml config files, with some python code to convert this into
> > > a pyarrow schema object..
> > >
> > > ------------- yaml flat example -------------
> > > fields:
> > >   cusip: string
> > >   start_date: date32
> > >   end_date: date32
> > >   purpose: string
> > >   source: string
> > >   flow: float32
> > >   flow_usd: float32
> > >   currency: string
> > >
> > > ------------- yaml nested example -------------
> > > fields:
> > >   cusip: string
> > >   start_date: date32
> > >   regions:
> > >     [string]            << list of strings
> > >   primary_benchmark:    << struct
> > >     id: string
> > >     name: string
> > >   all_benchmarks:       << list of structs
> > >     -
> > >       id: string
> > >       name: string
> > >
> > > Code:
> > >
> > > def _convert_to_arrow_type(field, obj):
> > >     """
> > >     :param field:
> > >     :param obj:
> > >     :returns: pyarrow datatype
> > >     """
> > >     if isinstance(obj, list):
> > >         for child_obj in obj:
> > >             pa_type = _convert_to_arrow_type(field, child_obj)
> > >         return pa.list_(pa_type)
> > >     elif isinstance(obj, dict):
> > >         items = []
> > >         for k, child_obj in obj.items():
> > >             pa_type = _convert_to_arrow_type(k, child_obj)
> > >             items.append((k, pa_type))
> > >         return pa.struct(items)
> > >     else:
> > >         if isinstance(obj, str):
> > >             obj = pa.type_for_alias(obj)
> > >         return obj
> > >
> > > def _convert_to_arrow_schema(fields_dict):
> > >     """
> > >     :param fields_dict:
> > >     :returns: pyarrow schema
> > >     """
> > >     columns = []
> > >     for field, typ in fields_dict.items():
> > >         if typ == "timestamp":
> > >             # default timestamp to microsecond precision
> > >             typ = "timestamp[us]"
> > >         elif typ == "date":
> > >             # default date to date32, an alias for date32[day]
> > >             typ = "date32"
> > >         elif typ == "int":
> > >             # default int to int32
> > >             typ = "int32"
> > >         pa_type = _convert_to_arrow_type(field, typ)
> > >         columns.append(pa.field(field, pa_type))
> > >     schema = pa.schema(columns)
> > >     return schema
> > >
> > > -----Original Message-----
> > > From: Weston Pace <weston.p...@gmail.com>
> > > Sent: Monday, July 8, 2024 9:43 AM
> > > To: dev@arrow.apache.org
> > > Subject: Re: [DISCUSS] Approach to generic schema representation
> > >
> > > +1 for empty stream/file as schema serialization. I have used this
> > > approach myself on more than one occasion and it works well. It can
> > > even be useful for transmitting schemas between different
> > > arrow-native libraries in the same language (e.g. rust -> rust),
> > > since it allows the different libraries to use different arrow
> > > versions.
> > >
> > > There is one other approach if you only need intra-process
> > > serialization (e.g. between threads / libraries in the same process).
> > > You can use the C data interface
> > > (https://arrow.apache.org/docs/format/CDataInterface.html).
> > > It is maybe a slightly more complex API (because of the release
> > > callback) and I think it is unlikely to be significantly faster
> > > (unless you have an abnormally large schema). However, it has the
> > > same advantages and might be useful if you are already using the
> > > C data interface elsewhere.
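To make the "empty stream/file as schema serialization" idea concrete,
here is a minimal pyarrow sketch. pyarrow exposes the schema-only IPC
message directly via Schema.serialize() and pa.ipc.read_schema(), so you
don't even need to construct an empty stream by hand:

```
import pyarrow as pa

schema = pa.schema([
    pa.field("cusip", pa.string()),
    pa.field("flow", pa.float32()),
])

buf = schema.serialize()            # encapsulated IPC Schema message
restored = pa.ipc.read_schema(buf)  # any Arrow IPC implementation can read it
assert restored.equals(schema)
```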
> > > On Mon, Jul 8, 2024 at 8:27 AM Matt Topol <zotthewiz...@gmail.com>
> > > wrote:
> > >
> > > > Hey Jeremy,
> > > >
> > > > Currently the first message of an IPC stream is a Schema message,
> > > > which consists solely of a flatbuffer message and is defined in the
> > > > Schema.fbs file of the Arrow repo. All of the libraries that can
> > > > read Arrow IPC should also be able to handle converting a single
> > > > IPC schema message back into an Arrow schema without issue. Would
> > > > that be sufficient for you?
> > > >
> > > > On Mon, Jul 8, 2024 at 11:12 AM Jeremy Leibs <jer...@rerun.io>
> > > > wrote:
> > > >
> > > > > I'm looking for any advice folks may have on a generic way to
> > > > > document and represent expected arrow schemas as part of an
> > > > > interface definition.
> > > > >
> > > > > For context, our library provides a cross-language (python, c++,
> > > > > rust) SDK for logging semantic multi-modal data (point clouds,
> > > > > images, geometric transforms, bounding boxes, etc.). Each of
> > > > > these primitive types has an associated arrow schema, but to date
> > > > > we have largely abstracted that from our users through
> > > > > language-native object types and a bunch of generated code to
> > > > > "serialize" stuff into the arrow buffers before transmitting via
> > > > > our IPC.
> > > > >
> > > > > We're trying to take steps in the direction of making it easier
> > > > > for advanced users to write and read data from the store directly
> > > > > using arrow, without needing to go in and out of an intermediate
> > > > > object-oriented representation. However, doing this means
> > > > > documenting to users, for example: "This is the arrow schema to
> > > > > use when sending a point cloud with a color channel".
> > > > >
> > > > > I would love it if, eventually, the arrow project had a way of
> > > > > defining a spec file similar to a .proto or a .fbs, with all
> > > > > libraries supporting loading of a schema object by directly
> > > > > parsing the spec. Has anyone taken steps in this direction?
> > > > >
> > > > > The best alternative I have at the moment is to redundantly
> > > > > define the schema for each of the 3 languages implicitly, by
> > > > > directly providing the code to construct a datatype instance with
> > > > > the correct schema. But this feels unfortunately messy and hard
> > > > > to maintain.
> > > > >
> > > > > Thanks,
> > > > > Jeremy
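For concreteness, the "redundant per-language definition" Jeremy describes
might look like the following in python. This is a purely hypothetical
sketch with invented field names, not Rerun's actual schema, and it is
exactly the kind of thing a shared, parseable spec file would replace:

```
import pyarrow as pa

position = pa.struct([("x", pa.float32()), ("y", pa.float32()), ("z", pa.float32())])
color = pa.struct([("r", pa.uint8()), ("g", pa.uint8()), ("b", pa.uint8())])

# "The arrow schema to use when sending a point cloud with a color channel"
point_cloud_schema = pa.schema([
    pa.field("positions", pa.list_(position), nullable=False),
    pa.field("colors", pa.list_(color)),  # optional color channel
])
```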