Re: [FlightSQL] Higher-level facade API to increase adoption/audience? Or does this belong as a personal project

Micah Kornfield Mon, 14 Mar 2022 21:06:58 -0700

> Could you expand on what exactly you mean by this?
>
> Still a bit blurry on the best-practices behind sending
> the Arrow response in Flight and seems like an important point.

My understanding is that VectorSchemaRoot was designed to act as a buffer
in various parts of a data processing pipeline.  Roughly one per processing
node and data flows from one VSR to another.  This is why
Flight/VectorLoader/VectorUnloader all take a single Root and load
it on each "next()" method call.  I believe the intent is for these types
of pipelines to minimize garbage collection/memory churn (something hard to
do with bags of Java objects but sometimes convenience trumps performance).
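
A minimal sketch of that reuse pattern, in case it helps: one root is created up front and refilled for every batch rather than building a new root per batch. The class name ReusedRootSketch and the method streamBatches are made up for illustration, and the two-column schema just mirrors the id/name example from earlier in the thread. The code sticks to Java 8 syntax.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;

// Hypothetical example class; only the org.apache.arrow types are real API.
final class ReusedRootSketch {

    /** Refills one VectorSchemaRoot per batch and returns the total rows written. */
    static int streamBatches(List<List<String>> batches) {
        Schema schema = new Schema(Arrays.asList(
                Field.nullable("id", new ArrowType.Int(32, true)),
                Field.nullable("name", new ArrowType.Utf8())));

        int totalRows = 0;
        int nextId = 1;
        try (BufferAllocator allocator = new RootAllocator();
             VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {
            IntVector ids = (IntVector) root.getVector("id");
            VarCharVector names = (VarCharVector) root.getVector("name");
            root.allocateNew(); // allocate buffers once, up front

            for (List<String> batch : batches) {
                // reset() keeps the allocated buffers but drops the old values.
                for (FieldVector vector : root.getFieldVectors()) {
                    vector.reset();
                }
                for (int row = 0; row < batch.size(); row++) {
                    ids.setSafe(row, nextId++);
                    names.setSafe(row, batch.get(row).getBytes(StandardCharsets.UTF_8));
                }
                root.setRowCount(batch.size());

                // A Flight producer would call listener.putNext() here, having
                // called listener.start(root) once before the loop.
                totalRows += root.getRowCount();
            }
        }
        return totalRows;
    }
}
```

The same root object is visible to the consumer on every iteration, so only the buffer contents change between batches.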

> Where would be the best place to post this?
>
> I was thinking about GitHub issues but I am GitHub-centric,
> not sure if JIRA or mailing list would be better.
>
> FWIW, I filed an RFC issue here, along with a prototype implementation and
> sample usage + console output code:
> https://github.com/apache/arrow/issues/12618

Different languages have different preferences.  For Java, something like a
Google Doc that allows for initial comments is generally useful for early
design.  Since this is already a GitHub issue I'll try to take a closer
look.  A couple of things I noticed just browsing:
1.   It looks like you use String.getBytes() to extract String values into
Arrow.  This is brittle; we should explicitly pass UTF-8 encoding.
2.   We haven't established an official policy, but we have been supporting
LTS versions in Java that aren't EOL, so I don't think we should be using
Java 14 features to implement this.
3.   Having a complete mapping of expected Java Type -> Arrow type might
reduce churn on code reviews (and is something people are likely to be
fairly opinionated about).  Along these lines would you expect this to be
extensible to allow "plugin" conversions?
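
To illustrate point 1, here is a small sketch of the explicit-charset form. The no-argument String.getBytes() uses the JVM's default charset, so its output can differ between machines. The helper class Utf8Bytes is hypothetical and exists only for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Arrow or of the prototype in the issue.
final class Utf8Bytes {

    /** Encodes with an explicit charset so the result does not depend on the JVM default. */
    static byte[] encode(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes bytes that were written as UTF-8, e.g. the contents of a VarChar value. */
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

StandardCharsets.UTF_8 has been available since Java 7, so this fits the LTS constraint in point 2.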


> we are currently working to hide some of the lower level gRPC details,
> which may not be so cumbersome in other language implementations.

I think a decent amount of effort has been put into abstracting gRPC (and
maybe even protobuf) to support other transports.  I think both Yibo and
David have been prototyping this to support things like UCX (some
discussion has happened on the mailing list).

Cheers,
Micah


On Sun, Mar 13, 2022 at 11:46 PM Andrew Lamb <al...@influxdata.com> wrote:

> It may be only tangentially related but as the Rust implementation works
> on arrow Flight (e.g. [1]) we are also working to make the API easier to
> work with. In the Rust case, however, we are currently working to hide some
> of the lower level gRPC details, which may not be so cumbersome in other
> language implementations.
>
> Andrew
>
> [1] https://github.com/apache/arrow-rs/pull/1386
>
> On Sun, Mar 13, 2022 at 7:15 PM Gavin Ray <ray.gavi...@gmail.com> wrote:
>
>> FWIW, I filed an RFC issue here, along with a prototype implementation and
>> sample usage + console output code:
>>
>> https://github.com/apache/arrow/issues/12618
>>
>> On Sun, Mar 13, 2022 at 10:43 AM Gavin Ray <ray.gavi...@gmail.com> wrote:
>>
>> >> Generally, the preferred pattern is one VectorSchemaRoot that
>> >> gets reloaded each time.  So an API like "df.loadVectorSchemaRoot(root)"
>> >> probably makes more sense but we can iterate on this.
>> >>
>> >
>> > Could you expand on what exactly you mean by this?
>> >
>> > Still a bit blurry on the best-practices behind sending
>> > the Arrow response in Flight and seems like an important point.
>> >
>> >
>> >> ... creating a new contrib module that maps
>> >> from java objects (just like there are JDBC and Avro ones) seems
>> >> worthwhile.  If you are interested in contributing something like this I
>> >> think a short design doc would be worth-while.
>> >>
>> >
>> > Where would be the best place to post this?
>> >
>> > I was thinking about GitHub issues but I am GitHub-centric,
>> > not sure if JIRA or mailing list would be better.
>> >
>> > Thanks, Micah!
>> >
>> >
>> > On Sun, Mar 13, 2022 at 12:46 AM Micah Kornfield <emkornfi...@gmail.com>
>> > wrote:
>> >
>> >> Hi Gavin,
>> >>
>> >> > Just curious whether there is any interest/intention of possibly making
>> >> > a higher level API around the basic FlightSQL one?
>> >>
>> >>
>> >> IIUC, I don't think this is an issue with Flight but one with generic
>> >> conversion of data into Arrow.  I don't think anyone is actively
>> >> working on something like this, but creating a new contrib module that
>> >> maps from java objects (just like there are JDBC and Avro ones) seems
>> >> worthwhile.  If you are interested in contributing something like this I
>> >> think a short design doc would be worth-while.
>> >>
>> >> > VectorSchemaRoot root = df.toVectorSchemaRoot();
>> >> > listener.setVectorSchemaRoot(root);
>> >> > listener.sendVectorSchemaRootContents();
>> >>
>> >>
>> >> A small nit.  Generally, the preferred pattern is one VectorSchemaRoot
>> >> that gets reloaded each time.  So an API like
>> >> "df.loadVectorSchemaRoot(root)" probably makes more sense but we can
>> >> iterate on this.  This wasn't commonly understood when some of the other
>> >> contrib modules were developed.
>> >>
>> >> Cheers,
>> >> Micah
>> >>
>> >>
>> >> On Sat, Mar 12, 2022 at 12:15 PM Gavin Ray <ray.gavi...@gmail.com> wrote:
>> >>
>> >> > While trying to implement and introduce the idea of adopting FlightSQL,
>> >> > the largest challenge was the API itself.
>> >> >
>> >> > I know it's meant to be low-level. But I found that most of the
>> >> > development time was in code to convert to/from row-based data
>> >> > (i.e. Map<String, Object>) and Java types, and columnar data +
>> >> > Arrow types.
>> >> >
>> >> > I'm likely in the minority position here -- I know that Arrow and
>> >> > FlightSQL users are largely looking at transferring large volumes of
>> >> > data and servicing OLAP-type workloads.
>> >> > But the thing that excites me most about FlightSQL isn't its
>> >> > performance (always nice to have), but that it's a language-agnostic
>> >> > standard for data access.
>> >> >
>> >> > That has broad implications -- for all kinds of data-access workloads
>> >> > and business use cases.
>> >> >
>> >> > The challenge is that in trying to advocate for it, when presenting a
>> >> > proof-of-concept,
>> >> > rather than what a developer might expect to see, something like:
>> >> >
>> >> > // FlightSQL handler code
>> >> > List<Map<String, Object>> results = ....;
>> >> > results.add(Map.of("id", 1, "name", "Person 1"));
>> >> > return results;
>> >> >
>> >> > A significant portion of the code is in Arrow-specific implementation
>> >> > details:
>> >> > creating a VectorSchemaRoot, FieldVector, de-serializing the results on
>> >> > the client, etc.
>> >> >
>> >> > Just curious whether there is any interest/intention of possibly making
>> >> > a higher level API around the basic FlightSQL one?
>> >> > Maybe something closer to the traditional notion of a row-based
>> >> > "DataFrame" or "Table", like:
>> >> >
>> >> > DataFrame df = new DataFrame();
>> >> > df.addColumn("id", ArrowTypes.Int);
>> >> > df.addColumn("name", ArrowTypes.VarChar);
>> >> > df.addRow(Map.of("id", 1, "name", "Person 1"));
>> >> > VectorSchemaRoot root = df.toVectorSchemaRoot();
>> >> > listener.setVectorSchemaRoot(root);
>> >> > listener.sendVectorSchemaRootContents();
>> >> >
>> >>
>> >
>>
>
