On Mon, Jul 8, 2024 at 3:57 PM Weston Pace <weston.p...@gmail.com> wrote:
> > user-facing API documentation someone would need to practically form
> > and/or process data when integrating a library into their code.
>
> If we are thinking API contract / programmatic access then I'd offer yet
> another alternative. At Lance we have found that many of our
> non-data-engineer users don't want to think "columnar" or "arrow" at all
> (both of these things are "database internals"). They come from
> traditional database backgrounds and are used to either working with
> native types or with more traditional ORM style approaches.
>
> In python we use Pydantic[1] (LanceModel extends pydantic.BaseModel):

Yes, providing pydantic definitions for our python type representations has
been in the back of my mind for a bit now. Great to hear that Lance is
having success there mapping to arrow. Our current object APIs generally
satisfy that class of users, but I agree that pydantic definitely has some
benefits in terms of being more standard and familiar.

> ```
> class Movie(LanceModel):
>     movie_id: int
>     vector: Vector(128)          # extension types
>     genres: Optional[List[str]]  # nullability
>     title: str
>     imdb_id: int
> ```
>
> Then, in your API contracts, you can just use `Movie` or `List[Movie]` as
> the argument type or return type, and you can optionally accept / return
> RecordBatch for users that DO want to use Arrow directly. By combining
> schema introspection & pydantic you should be able to do things like
> "verify a record batch satisfies the schema defined by the pydantic
> model".
>
> In Rust we don't yet have an equivalent but I am interested in
> serde_arrow [2], which would allow the well known `serde` library to
> fulfill the same role as `pydantic` (it's not as exact of a fit and we
> may still end up needing to do our own thing).

We haven't used serde_arrow, but we were previously using arrow2_convert
[1] to similar effect. The main issue we run into with conversion into and
out of a non-columnar representation is the overhead of the conversion
itself. At some point data throughput ends up bottlenecked on the
structural conversions, which leads us back to wanting to give
performance-minded users a way of submitting the arrow data directly, and
in turn needing to define how they should be structuring that data when
working directly with the arrow libraries.

[1] https://github.com/DataEngineeringLabs/arrow2-convert
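As a rough illustration of the combination Weston describes (schema
introspection + pydantic), here is a minimal sketch assuming only plain
pydantic v2 and pyarrow; the Movie model and the type mapping are
simplified stand-ins, not Lance's actual implementation:

```
import typing

import pyarrow as pa
from pydantic import BaseModel


# Hypothetical model mirroring the example above; the Vector extension
# type is omitted so the sketch needs nothing beyond pydantic + pyarrow.
class Movie(BaseModel):
    movie_id: int
    genres: typing.Optional[typing.List[str]] = None
    title: str


_PRIMITIVES = {int: pa.int64(), float: pa.float64(), str: pa.utf8(), bool: pa.bool_()}


def _to_arrow_type(ann):
    # Recursively map List[...] to pa.list_; everything else via the table.
    if typing.get_origin(ann) is list:
        (item,) = typing.get_args(ann)
        return pa.list_(_to_arrow_type(item))
    return _PRIMITIVES[ann]


def model_to_schema(model) -> pa.Schema:
    fields = []
    for name, info in model.model_fields.items():
        ann, nullable = info.annotation, False
        # Optional[X] surfaces as Union[X, None]: unwrap it and mark the
        # field nullable (non-Optional fields become non-nullable here,
        # which a real implementation would likely make configurable).
        if typing.get_origin(ann) is typing.Union and type(None) in typing.get_args(ann):
            nullable = True
            (ann,) = [a for a in typing.get_args(ann) if a is not type(None)]
        fields.append(pa.field(name, _to_arrow_type(ann), nullable=nullable))
    return pa.schema(fields)


def batch_satisfies_model(batch: pa.RecordBatch, model) -> bool:
    # "verify a record batch satisfies the schema defined by the pydantic model"
    return batch.schema.equals(model_to_schema(model))
```

An API could then accept `List[Movie]` for convenience and `RecordBatch`
for the fast path, with `batch_satisfies_model` as the validation step on
the fast path.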
> [1] https://docs.pydantic.dev/latest/
> [2] https://docs.rs/serde_arrow/latest/serde_arrow/
>
> On Mon, Jul 8, 2024 at 12:19 PM Ian Cook <ianmc...@apache.org> wrote:
>
> > This has come up a few times in the past [1][2]. The main concern has
> > been about cross-version compatibility guarantees.
> >
> > [1] https://github.com/apache/arrow/issues/25078
> > [2] https://lists.apache.org/thread/02p37yxksxccsqfn9l6j4ryno404ttnl
> >
> > On Mon, Jul 8, 2024 at 3:10 PM Lee, David (PAG)
> > <david....@blackrock.com.invalid> wrote:
> >
> > > Gah found a bug with my code.. Here's a corrected python version..
> > >
> > > import pyarrow as pa
> > >
> > > # iterate through possible nested columns
> > > def _convert_to_arrow_type(field, obj):
> > >     """
> > >     :param field:
> > >     :param obj:
> > >     :returns: pyarrow datatype
> > >     """
> > >     if isinstance(obj, list):
> > >         for child_obj in obj:
> > >             pa_type = _convert_to_arrow_type(field, child_obj)
> > >         return pa.list_(pa_type)
> > >     elif isinstance(obj, dict):
> > >         items = []
> > >         for k, child_obj in obj.items():
> > >             pa_type = _convert_to_arrow_type(k, child_obj)
> > >             items.append((k, pa_type))
> > >         return pa.struct(items)
> > >     else:
> > >         if isinstance(obj, str):
> > >             if obj == "timestamp":
> > >                 # default timestamp to microsecond precision
> > >                 obj = "timestamp[us]"
> > >             elif obj == "date":
> > >                 # default date to date32, an alias for date32[day]
> > >                 obj = "date32"
> > >             elif obj == "int":
> > >                 # default int to int32
> > >                 obj = "int32"
> > >             obj = pa.type_for_alias(obj)
> > >         return obj
> > >
> > > # iterate through columns to create a schema
> > > def _convert_to_arrow_schema(fields_dict):
> > >     """
> > >     :param fields_dict:
> > >     :returns: pyarrow schema
> > >     """
> > >     columns = []
> > >     for field, typ in fields_dict.items():
> > >         pa_type = _convert_to_arrow_type(field, typ)
> > >         columns.append(pa.field(field, pa_type))
> > >     schema = pa.schema(columns)
> > >     return schema
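For a quick sanity check of the corrected version, here is a hypothetical
round trip (assuming `import pyarrow as pa` and the two functions above are
in scope). The dict is the `fields` mapping that yaml.safe_load would
produce from the nested yaml example quoted further below:

```
fields = {
    "cusip": "string",
    "start_date": "date32",
    "regions": ["string"],                                    # list of strings
    "primary_benchmark": {"id": "string", "name": "string"},  # struct
    "all_benchmarks": [{"id": "string", "name": "string"}],   # list of structs
}
print(_convert_to_arrow_schema(fields))
# cusip: string
# start_date: date32[day]
# regions: list<item: string>
# primary_benchmark: struct<id: string, name: string>
# all_benchmarks: list<item: struct<id: string, name: string>>
```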
> > > -----Original Message-----
> > > From: Lee, David (PAG) <david....@blackrock.com.INVALID>
> > > Sent: Monday, July 8, 2024 11:58 AM
> > > To: dev@arrow.apache.org
> > > Subject: RE: [DISCUSS] Approach to generic schema representation
> > >
> > > I came up with my own json representation that I could put into
> > > json / yaml config files, with some python code to convert this into
> > > a pyarrow schema object..
> > >
> > > ------------- yaml flat example -------------
> > > fields:
> > >   cusip: string
> > >   start_date: date32
> > >   end_date: date32
> > >   purpose: string
> > >   source: string
> > >   flow: float32
> > >   flow_usd: float32
> > >   currency: string
> > >
> > > ------------- yaml nested example -------------
> > > fields:
> > >   cusip: string
> > >   start_date: date32
> > >   regions:
> > >     [string]            << list of strings
> > >   primary_benchmark:    << struct
> > >     id: string
> > >     name: string
> > >   all_benchmarks:       << list of structs
> > >     -
> > >       id: string
> > >       name: string
> > >
> > > Code:
> > >
> > > def _convert_to_arrow_type(field, obj):
> > >     """
> > >     :param field:
> > >     :param obj:
> > >     :returns: pyarrow datatype
> > >     """
> > >     if isinstance(obj, list):
> > >         for child_obj in obj:
> > >             pa_type = _convert_to_arrow_type(field, child_obj)
> > >         return pa.list_(pa_type)
> > >     elif isinstance(obj, dict):
> > >         items = []
> > >         for k, child_obj in obj.items():
> > >             pa_type = _convert_to_arrow_type(k, child_obj)
> > >             items.append((k, pa_type))
> > >         return pa.struct(items)
> > >     else:
> > >         if isinstance(obj, str):
> > >             obj = pa.type_for_alias(obj)
> > >         return obj
> > >
> > > def _convert_to_arrow_schema(fields_dict):
> > >     """
> > >     :param fields_dict:
> > >     :returns: pyarrow schema
> > >     """
> > >     columns = []
> > >     for field, typ in fields_dict.items():
> > >         if typ == "timestamp":
> > >             # default timestamp to microsecond precision
> > >             typ = "timestamp[us]"
> > >         elif typ == "date":
> > >             # default date to date32, an alias for date32[day]
> > >             typ = "date32"
> > >         elif typ == "int":
> > >             # default int to int32
> > >             typ = "int32"
> > >         pa_type = _convert_to_arrow_type(field, typ)
> > >         columns.append(pa.field(field, pa_type))
> > >     schema = pa.schema(columns)
> > >     return schema
> > >
> > > -----Original Message-----
> > > From: Weston Pace <weston.p...@gmail.com>
> > > Sent: Monday, July 8, 2024 9:43 AM
> > > To: dev@arrow.apache.org
> > > Subject: Re: [DISCUSS] Approach to generic schema representation
> > >
> > > +1 for empty stream/file as schema serialization. I have used this
> > > approach myself on more than one occasion and it works well. It can
> > > even be useful for transmitting schemas between different
> > > arrow-native libraries in the same language (e.g. rust -> rust),
> > > since it allows the different libraries to use different arrow
> > > versions.
> > >
> > > There is one other approach if you only need intra-process
> > > serialization (e.g. between threads / libraries in the same process).
> > > You can use the C data interface
> > > (https://arrow.apache.org/docs/format/CDataInterface.html).
> > > It is maybe a slightly more complex API (because of the release
> > > callback) and I think it is unlikely to be significantly faster
> > > (unless you have an abnormally large schema). However, it has the
> > > same advantages and might be useful if you are already using the
> > > C data interface elsewhere.
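To make the "empty stream/file as schema serialization" idea concrete,
here is a minimal pyarrow sketch. pyarrow exposes the schema-only IPC
message directly via Schema.serialize() and pa.ipc.read_schema(), so you
don't even need to construct an empty stream by hand:

```
import pyarrow as pa

schema = pa.schema([
    pa.field("cusip", pa.string()),
    pa.field("flow", pa.float32()),
])

buf = schema.serialize()            # encapsulated IPC Schema message
restored = pa.ipc.read_schema(buf)  # any Arrow IPC implementation can read it
assert restored.equals(schema)
```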
> > > On Mon, Jul 8, 2024 at 8:27 AM Matt Topol <zotthewiz...@gmail.com>
> > > wrote:
> > >
> > > > Hey Jeremy,
> > > >
> > > > Currently the first message of an IPC stream is a Schema message,
> > > > which consists solely of a flatbuffer message and is defined in the
> > > > Schema.fbs file of the Arrow repo. All of the libraries that can
> > > > read Arrow IPC should also be able to handle converting a single
> > > > IPC schema message back into an Arrow schema without issue. Would
> > > > that be sufficient for you?
> > > >
> > > > On Mon, Jul 8, 2024 at 11:12 AM Jeremy Leibs <jer...@rerun.io>
> > > > wrote:
> > > >
> > > > > I'm looking for any advice folks may have on a generic way to
> > > > > document and represent expected arrow schemas as part of an
> > > > > interface definition.
> > > > >
> > > > > For context, our library provides a cross-language (python, c++,
> > > > > rust) SDK for logging semantic multi-modal data (point clouds,
> > > > > images, geometric transforms, bounding boxes, etc.). Each of
> > > > > these primitive types has an associated arrow schema, but to date
> > > > > we have largely abstracted that from our users through
> > > > > language-native object types and a bunch of generated code to
> > > > > "serialize" stuff into the arrow buffers before transmitting via
> > > > > our IPC.
> > > > >
> > > > > We're trying to take steps in the direction of making it easier
> > > > > for advanced users to write and read data from the store directly
> > > > > using arrow, without needing to go in and out of an intermediate
> > > > > object-oriented representation. However, doing this means
> > > > > documenting to users, for example: "This is the arrow schema to
> > > > > use when sending a point cloud with a color channel".
> > > > >
> > > > > I would love it if, eventually, the arrow project had a way of
> > > > > defining a spec file similar to a .proto or a .fbs, with all
> > > > > libraries supporting loading of a schema object by directly
> > > > > parsing the spec. Has anyone taken steps in this direction?
> > > > >
> > > > > The best alternative I have at the moment is to redundantly
> > > > > define the schema for each of the 3 languages implicitly, by
> > > > > directly providing the code to construct a datatype instance with
> > > > > the correct schema. But this feels unfortunately messy and hard
> > > > > to maintain.
> > > > >
> > > > > Thanks,
> > > > > Jeremy
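For concreteness, the "redundant per-language definition" Jeremy describes
might look like the following in python. This is a purely hypothetical
sketch with invented field names, not Rerun's actual schema, and it is
exactly the kind of thing a shared, parseable spec file would replace:

```
import pyarrow as pa

position = pa.struct([("x", pa.float32()), ("y", pa.float32()), ("z", pa.float32())])
color = pa.struct([("r", pa.uint8()), ("g", pa.uint8()), ("b", pa.uint8())])

# "The arrow schema to use when sending a point cloud with a color channel"
point_cloud_schema = pa.schema([
    pa.field("positions", pa.list_(position), nullable=False),
    pa.field("colors", pa.list_(color)),  # optional color channel
])
```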