Re: [ANNOUNCE] New Arrow PMC chair: Andy Grove

2023-11-27 Thread Gavin Ray
Yay, congrats Andy! Well-deserved!

On Mon, Nov 27, 2023 at 9:13 AM Kevin Gurney 
wrote:

> Congratulations, Andy!
> 
> From: Raúl Cumplido 
> Sent: Monday, November 27, 2023 8:58 AM
> To: dev@arrow.apache.org 
> Subject: Re: [ANNOUNCE] New Arrow PMC chair: Andy Grove
>
> Congratulations, Andy, and thanks for your efforts over the last year, Andrew!
>
> On Mon, 27 Nov 2023 at 14:54, David Li ()
> wrote:
> >
> > Congrats Andy!
> >
> > On Mon, Nov 27, 2023, at 08:02, Mehmet Ozan Kabak wrote:
> > > Congratulations Andy. I am sure we will keep building great tech this
> > > year, just like last year, under your watch.
> > >
> > > Mehmet Ozan Kabak
> > >
> > >
> > >> On Nov 27, 2023, at 3:47 PM, Daniël Heres 
> wrote:
> > >>
> > >> Congrats Andy!
> > >>
> > >> On Mon, 27 Nov 2023 at 13:47, Andrew Lamb  wrote:
> > >>
> > >>> I am pleased to announce that the Arrow Project has a new PMC chair
> and VP
> > >>> as per our tradition of rotating the chair once a year. I have
> resigned and
> > >>> Andy Grove was duly elected by the PMC and approved unanimously by
> the
> > >>> board.
> > >>>
> > >>> Please join me in congratulating Andy Grove!
> > >>>
> > >>> Thanks,
> > >>> Andrew
> > >>>
> > >>
> > >>
> > >> --
> > >> Daniël Heres
>
>


Re: [DISCUSS] Protocol for exchanging Arrow data over REST APIs

2023-11-18 Thread Gavin Ray
I know that I, and a number of folks I work with, would be interested in
this.

gRPC is a bit of a barrier for a lot of services.
Having a spec for doing Arrow over HTTP APIs would be solid.

In my opinion, it doesn't necessarily need to be REST-ful.
Something like JSON-RPC might fit well with the existing model for Arrow
over the wire that's been implemented in things like Flight/FlightSQL.

Something else I've been interested in (I think Matt Topol has done work in
this area) is Arrow over GraphQL:
GraphQL and Apache Arrow: A Match Made in Data (youtube.com)


On Sat, Nov 18, 2023 at 1:52 PM Ian Cook  wrote:

> Hi Kou,
>
> I think it is too early to make a specific proposal. I hope to use this
> discussion to collect more information about existing approaches. If
> several viable approaches emerge from this discussion, then I think we
> should make a document listing them, like you suggest.
>
> Thank you for the information about Groonga. This type of straightforward
> HTTP-based approach would work in the context of a REST API, as I
> understand it.
>
> But how is the performance? Have you measured the throughput of this
> approach to see if it is comparable to using Flight SQL? Is this approach
> able to saturate a fast network connection?
>
> And what about the case in which the server wants to begin sending batches
> to the client before the total number of result batches / records is known?
> Would this approach work in that case? I think so but I am not sure.
>
> If this HTTP-based type of approach is sufficiently performant and it works
> in a sufficient proportion of the envisioned use cases, then perhaps the
> proposed spec / protocol could be based on this approach. If so, then we
> could refocus this discussion on which best practices to incorporate /
> recommend, such as:
> - server should not return the result data in the body of a response to a
> query request; instead server should return a response body that gives
> URI(s) at which clients can GET the result data
> - transmit result data in chunks (Transfer-Encoding: chunked), with
> recommendations about chunk size
> - support range requests (Accept-Ranges: bytes) to allow clients to request
> result ranges (or not?)
> - recommendations about compression
> - recommendations about TCP receive window size
> - recommendation to open multiple TCP connections on very fast networks
> (e.g. >25 Gbps) where a CPU thread could be the throughput bottleneck
>
> On the other hand, if the performance and functionality of this HTTP-based
> type of approach is not sufficient, then we might consider fundamentally
> different approaches.
>
> Ian
>
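The chunked-transfer recommendation above can be sketched with only the Python standard library. The payload bytes here are placeholders standing in for Arrow IPC stream record batches, and the handler class, port, and chunk contents are illustrative assumptions:

```python
import http.server
import threading
import urllib.request

# Placeholder payloads standing in for Arrow IPC stream record batches.
CHUNKS = [b"batch-0", b"batch-1", b"batch-2"]

class ResultHandler(http.server.BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # chunked framing requires HTTP/1.1

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/vnd.apache.arrow.stream")
        self.send_header("Transfer-Encoding", "chunked")
        self.end_headers()
        for chunk in CHUNKS:
            # Chunk framing: <hex length>\r\n<data>\r\n
            self.wfile.write(b"%x\r\n" % len(chunk) + chunk + b"\r\n")
        self.wfile.write(b"0\r\n\r\n")  # zero-length chunk terminates the stream
        self.close_connection = True

    def log_message(self, *args):  # keep the demo quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), ResultHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_address[1]
with urllib.request.urlopen(url) as resp:
    body = resp.read()  # the client reassembles the chunks transparently
server.shutdown()
```

Because no total size is ever declared, the server can start sending batches before the number of result batches is known, which speaks to the open question above.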


Re: Apache Arrow | Graph Algorithms & Data Structures

2023-06-30 Thread Gavin Ray
This isn't particularly efficient, but could you do something like this?

https://replit.com/@GavinRay97/EnlightenedRichAdministration#main.py

On Fri, Jun 30, 2023 at 1:10 PM Aldrin  wrote:

> > But I found out very quickly that I won't be able to... using only
> Apache Arrow without resorting to other libraries.
>
> > I am aiming to assess the viability of Apache Arrow for graph algorithms
> and data structures...
>
> > I also gave a shot at doing it similar to a certain SQL method...
>
> I'm curious about these portions of what you've said.
>
> Could you share what you have tried and what roadblocks you're hitting?
> Are you struggling with mutability? How are you representing your data? You
> mention heapq, but it's not clear if you're using an adjacency matrix or
> adjacency lists or if you're using a more normalized relational format.
>
> Thanks!
>
>
> # --
>
> # Aldrin
>
>
> https://github.com/drin/
>
> https://gitlab.com/octalene
>
> https://keybase.io/octalene


Re: DISCUSS: [FlightSQL] Catalog support

2022-11-30 Thread Gavin Ray
Just to chime in on this, one thing I'm curious about is whether there
will be support for user-defined catalog/schema hierarchy depth?

This comment that James made does seem reasonable to me
> scheme://<host>:<port>/path-1/path-2/.../path-n

Trino/Presto does a similar thing (jdbc:trino://localhost:8080/tpch/sf1)

At Hasura, what we do is have an alias "FullyQualifiedName", which is
just "Array<String>",
and the identifier for some element in a data source is always fully qualified:

https://github.com/hasura/graphql-engine/tree/master/dc-agents#schema

["postgres_1", "db1", "schema2", "my_table", "col_a"]
["mongo", "db1",  "collection_a", "field_a"]
["csv_adapter", "myfile.csv", "col_x"]

On Wed, Nov 30, 2022 at 6:31 PM James Duong
 wrote:
>
> Our current convention of sending connection properties as headers with
> every request has the benefit of making statefulness optional, but has the
> drawback of sending redundant, unused properties on requests after the
> first, which increases the payload size unnecessarily.
>
> I'd suggest we define session management features explicitly in Flight
> (while being optional). The suggestion is to make this part of Flight as an
> optional feature, rather than Flight SQL due to its applicability outside
> of just database access.
>
> Creating a session:
> - The Flight client supplies a New-Session header which has key-value pairs
> for initial session options. This header can be applied to any RPC call,
> but logically should be the first one the client makes.
> - The server should send a Set-Cookie header back containing some
> server-side representation of the session that the client can use in
> subsequent requests.
> - The path specified in the URI is sent as a "Catalog" session option.
>
> Modifying session options:
> - A separate RPC call that takes in a Stream representing
> each session option that is being modified and returns a stream of statuses
> to indicate if the setting change was accepted.
> - This RPC call is only valid when the Cookie header is used.
> - It is up to the server to define if a failed session property change is
> fatal or if other properties can continue to be set.
>
> Closing a session:
> - A separate RPC call that tells the server to drop the session specified
> by the Cookie header.
>
> Notes:
> A Flight SQL client would check if session management RPCs are supported
> through a new GetSqlInfo property. A Flight client doesn't have a way to do
> this generically, but there could be an application-specific RPC or header
> that reports this metadata.
>
> The ODBC/JDBC and ADBC drivers would need to be updated to programmatically
> check for session management RPCs. If unsupported, then use the old
> behavior of sending all properties as headers with each request. If
> supported, make use of the New-Session header and drop the session when
> closing the client-side connection.
>
> It's a bit asymmetric that creating a new session is done by applying a
> header, but closing a session is an RPC call. This was so that session
> creation doesn't introduce another round trip before the first real data
> request. If there's a way to batch RPC calls it might be better to make
> session creation an RPC call.
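A rough sketch of the server-side bookkeeping James describes (create via a New-Session header, per-option statuses on modification, explicit close). The New-Session and Set-Cookie header names follow the proposal; the class, key=value parsing, and status strings are illustrative assumptions, not part of it:

```python
import secrets

class SessionStore:
    def __init__(self):
        self._sessions = {}

    def handle_request(self, headers):
        # Create a session when a New-Session header is present.
        if "New-Session" in headers:
            token = secrets.token_hex(8)
            options = dict(kv.split("=", 1)
                           for kv in headers["New-Session"].split(";"))
            self._sessions[token] = options
            return {"Set-Cookie": "arrow-session=" + token}
        return {}

    def set_option(self, token, key, value):
        # Per-option status, as in the proposed modify-session RPC stream.
        if token not in self._sessions:
            return "NOT_FOUND"
        self._sessions[token][key] = value
        return "OK"

    def close_session(self, token):
        # Drop the session identified by the cookie value.
        self._sessions.pop(token, None)

store = SessionStore()
response = store.handle_request({"New-Session": "catalog=main;timezone=UTC"})
token = response["Set-Cookie"].split("=", 1)[1]
```

The cookie value is opaque to the client, so the server is free to change its session representation without breaking clients.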
>
> On Tue, Nov 22, 2022 at 3:16 PM David Li  wrote:
>
> > It sounds reasonable - then there are three points:
> >
> > - A standard URI scheme for Flight SQL that can be used by multiple client
> > APIs (JDBC, ADBC, etc.)
> > - A standard scheme for session data (likely header/cookie-based)
> > - A mapping from URI parameters and fields to session data
> >
> >
> >
> > On Tue, Nov 22, 2022, at 17:45, James Duong wrote:
> > > Just following up on this and if there are any thoughts.
> > >
> > > The purpose would be to standardize how we specify access to some named
> > > logical grouping of data. This would make it easy to model catalog/schema
> > > semantics in Flight SQL.
> > >
> > > Having this be part of the connection URI makes it similar to specifying
> > a
> > > resource in an HTTP URL (ie an endpoint) which should make it easy for
> > end
> > > users to work with and modify.
> > >
> > > On Fri, Nov 18, 2022 at 3:17 PM James Duong 
> > wrote:
> > >
> > >> As for surfacing catalogs itself, perhaps we allow the URI take in a
> > path
> > >> and treat that as a way of specifying a multi-level resource that which
> > the
> > >> FlightClient is connecting to:
> > >>
> > >> eg a connection URI of the form:
> > >> scheme://<host>:<port>/path-1/path-2/.../path-n
> > >>
> > >> The FlightClient could send this path as either a header or a session
> > >> property (with a neutral name like 'resource-path'). Flight SQL
> > Producers
> > >> could interpret this as a catalog or schema.
> > >> eg
> > >> grpc://<host>:<port>/catalog/schema
> > >>
> > >> On Fri, Nov 11, 2022 at 2:07 AM James Henderson  wrote:
> > >>
> > >>> Sounds good to me.
> > >>>
> > >>> > Are you interested in writing up a (sketch of a) proposal?
> > >>>
> > >>> Yep, can do - I'm OoO over the next couple of weeks so might be a 

Re: [Rust][Blog] Fast and Memory Efficient Multi-Column Sorts

2022-11-07 Thread Gavin Ray
This is awesome, thanks for sharing!

I was at the All Things Open conference recently, and Influx had a booth
there.
I went over to try to ask about the IOx/Datafusion stuff but unfortunately
nobody at the booth knew anything about the technical details.

Maybe next time =)


On Mon, Nov 7, 2022 at 8:46 AM Andrew Lamb  wrote:

> The blog has been published:
>
>
> https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/
>
> https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-2/
>
> Thank you to all who contributed content and suggestions
>
> Andrew
>
> On Tue, Nov 1, 2022 at 4:56 PM Andrew Lamb  wrote:
>
> > In case anyone is interested, Raphael  and I are working on (yet another)
> > blog post about technology added to arrow-rs (and used in datafusion)[1]
> .
> >
> > We would love any feedback
> >
> > Thank you,
> > Andrew
> >
> > [1]: https://github.com/apache/arrow-site/pull/264
> >
>


Re: [VOTE] Adopt ADBC database client connectivity specification

2022-09-22 Thread Gavin Ray
Ah yeah that's true, good point
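One way to keep a dotted form unambiguous when dots can appear in identifiers (the concern raised below) is SQL-style quoting of components; a sketch, with function names of my own choosing:

```python
def join_fqtn(parts, sep="."):
    # Quote any component containing the separator or a quote character,
    # SQL-style, so the joined form round-trips unambiguously.
    quoted = []
    for p in parts:
        if sep in p or '"' in p:
            quoted.append('"' + p.replace('"', '""') + '"')
        else:
            quoted.append(p)
    return sep.join(quoted)

def split_fqtn(name, sep="."):
    parts, buf, in_quotes, i = [], [], False, 0
    while i < len(name):
        c = name[i]
        if in_quotes:
            if c == '"' and i + 1 < len(name) and name[i + 1] == '"':
                buf.append('"')  # doubled quote is an escaped quote
                i += 1
            elif c == '"':
                in_quotes = False
            else:
                buf.append(c)
        elif c == '"':
            in_quotes = True
        elif c == sep:
            parts.append("".join(buf))
            buf = []
        else:
            buf.append(c)
        i += 1
    parts.append("".join(buf))
    return parts
```

For example, join_fqtn(["postgres_1", "db.with.dots", "my_table"]) produces 'postgres_1."db.with.dots".my_table', and split_fqtn recovers the original components.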



On Thu, Sep 22, 2022 at 2:38 PM David Li  wrote:

> I suppose the separator would have to be known to the client somehow
> (perhaps as metadata) - you'd have the same problem in the opposite
> direction if the result were a list right? You wouldn't be able to
> concatenate the parts together without knowing a safe separator to use.
>
> On Thu, Sep 22, 2022, at 14:23, Gavin Ray wrote:
> > Wait, what happens if a datasource's spec allows dots as valid
> identifiers?
> >
> > On Thu, Sep 22, 2022 at 2:22 PM Gavin Ray  wrote:
> >
> >> Ah okay, yeah that's a reasonable angle too haha
> >>
> >>
> >> On Thu, Sep 22, 2022 at 1:59 PM David Li  wrote:
> >>
> >>> Frankly it was from a "not drastically refactoring things" perspective
> :)
> >>>
> >>> At least for Arrow: list[utf8] is effectively a utf8 array with an
> extra
> >>> array of offsets, so there's relatively little overhead. (In
> particular,
> >>> there's not an extra allocation per array; there's just an overall
> >>> allocation of a bitmap/offsets buffer.)
> >>>
> >>> On Thu, Sep 22, 2022, at 13:46, Gavin Ray wrote:
> >>> > I suppose you're thinking from a memory/performance perspective
> right?
> >>> > Allocating a dot character is a lot better than allocating multiple
> >>> arrays
> >>> >
> >>> > Yeah I don't see why not -- this could even be a library internal
> where
> >>> the
> >>> > fact that it's dotted is an implementation detail
> >>> > Then in the Java implementation or whatnot, you can call
> >>> > ".getFullyQualifiedTableName()" which will do the allocating parse
> to a
> >>> > List for you, or whatnot
> >>> >
> >>> > The array was mostly for convenience's sake (our API is JSON and not
> >>> > particularly performance-oriented)
> >>> >
> >>> > On Thu, Sep 22, 2022 at 1:40 PM David Li 
> wrote:
> >>> >
> >>> >> Ah, interesting…
> >>> >>
> >>> >> A self-recursive schema wouldn't work in Arrow's schema system, so
> it'd
> >>> >> have to be the latter solution. Or, would it work to have a dotted
> >>> name in
> >>> >> the schema name column? Would parsing that back out (for
> applications
> >>> that
> >>> >> want to work with the full hierarchy) be too much trouble?
> >>> >>
> >>> >> On Thu, Sep 22, 2022, at 13:14, Gavin Ray wrote:
> >>> >> > Antoine, I can't comment on the Go code (not qualified) but to me,
> >>> the
> >>> >> > "verification" test
> >>> >> > examples look like a mixture between JDBC and Java FlightSQL
> driver
> >>> >> usage,
> >>> >> > and seem solid.
> >>> >> >
> >>> >> > There was one reservation I had about the ability to handle
> >>> datasource
> >>> >> > namespacing that I brought up early on in the proposal discussions
> >>> >> > (David responded to it but I got busy and forgot to reply again)
> >>> >> >
> >>> >> > If you have a datasource which provides possibly arbitrary levels
> of
> >>> >> schema
> >>> >> > namespace (something like Apache Calcite, for example)
> >>> >> > How do you represent the table/schema names?
> >>> >> >
> >>> >> > Suppose I have a service with a DB layout like this:
> >>> >> >
> >>> >> > / foo
> >>> >> > / bar
> >>> >> > / baz
> >>> >> > /qux
> >>> >> >   / table1
> >>> >> > - column1
> >>> >> >
> >>> >> > At my dayjob, we have a technology which is very similar to
> >>> >> > ADBC/FlightSQL
> >>> >> > (would be great to adopt Substrait + ADBC once they're mature
> enough)
> >>> >> > -
> >>> >> >
> >>> >>
> >>>
> https://github.com/hasura/graphql-engine/blob/master/dc-agents/README.md#data-connectors
> >>> >> > -
> >>> >> >
> >>> >>
> >>>
> https:/

Re: [VOTE] Adopt ADBC database client connectivity specification

2022-09-22 Thread Gavin Ray
Wait, what happens if a datasource's spec allows dots as valid identifiers?

On Thu, Sep 22, 2022 at 2:22 PM Gavin Ray  wrote:

> Ah okay, yeah that's a reasonable angle too haha
>
>
> On Thu, Sep 22, 2022 at 1:59 PM David Li  wrote:
>
>> Frankly it was from a "not drastically refactoring things" perspective :)
>>
>> At least for Arrow: list[utf8] is effectively a utf8 array with an extra
>> array of offsets, so there's relatively little overhead. (In particular,
>> there's not an extra allocation per array; there's just an overall
>> allocation of a bitmap/offsets buffer.)
>>
>> On Thu, Sep 22, 2022, at 13:46, Gavin Ray wrote:
>> > I suppose you're thinking from a memory/performance perspective right?
>> > Allocating a dot character is a lot better than allocating multiple
>> arrays
>> >
>> > Yeah I don't see why not -- this could even be a library internal where
>> the
>> > fact that it's dotted is an implementation detail
>> > Then in the Java implementation or whatnot, you can call
>> > ".getFullyQualifiedTableName()" which will do the allocating parse to a
>> > List for you, or whatnot
>> >
>> > The array was mostly for convenience's sake (our API is JSON and not
>> > particularly performance-oriented)
>> >
>> > On Thu, Sep 22, 2022 at 1:40 PM David Li  wrote:
>> >
>> >> Ah, interesting…
>> >>
>> >> A self-recursive schema wouldn't work in Arrow's schema system, so it'd
>> >> have to be the latter solution. Or, would it work to have a dotted
>> name in
>> >> the schema name column? Would parsing that back out (for applications
>> that
>> >> want to work with the full hierarchy) be too much trouble?
>> >>
>> >> On Thu, Sep 22, 2022, at 13:14, Gavin Ray wrote:
>> >> > Antoine, I can't comment on the Go code (not qualified) but to me,
>> the
>> >> > "verification" test
>> >> > examples look like a mixture between JDBC and Java FlightSQL driver
>> >> usage,
>> >> > and seem solid.
>> >> >
>> >> > There was one reservation I had about the ability to handle
>> datasource
>> >> > namespacing that I brought up early on in the proposal discussions
>> >> > (David responded to it but I got busy and forgot to reply again)
>> >> >
>> >> > If you have a datasource which provides possibly arbitrary levels of
>> >> schema
>> >> > namespace (something like Apache Calcite, for example)
>> >> > How do you represent the table/schema names?
>> >> >
>> >> > Suppose I have a service with a DB layout like this:
>> >> >
>> >> > / foo
>> >> > / bar
>> >> > / baz
>> >> > /qux
>> >> >   / table1
>> >> > - column1
>> >> >
>> >> > At my dayjob, we have a technology which is very similar to
>> >> > ADBC/FlightSQL
>> >> > (would be great to adopt Substrait + ADBC once they're mature enough)
>> >> > -
>> >> >
>> >>
>> https://github.com/hasura/graphql-engine/blob/master/dc-agents/README.md#data-connectors
>> >> > -
>> >> >
>> >>
>> https://techcrunch.com/2022/06/28/hasura-now-lets-developers-turn-any-data-source-into-a-graphql-api/
>> >> >
>> >> > We wound up having to redesign the specification to handle
>> datasources
>> >> that
>> >> > don't fit the "database-schema-table" or "database-table" mould
>> >> >
>> >> > In the ADBC schema for schema metadata, it looks like it expects a
>> >> > single
>> >> > "schema" struct:
>> >> >
>> >>
>> https://github.com/apache/arrow-adbc/blob/7866a566f5b7b635267bfb7a87ea49b01dfe89fa/java/core/src/main/java/org/apache/arrow/adbc/core/StandardSchemas.java#L132-L152
>> >> >
>> >> > If you want to be flexible, IMO it would be good to either:
>> >> >
>> >> > 1. Have DB_SCHEMA_SCHEMA be self-recursive, so that schemas (with or
>> >> > without tables) can be nested arbitrarily deep underneath each other
>> >> >   - Fully-Qualified-Table-Name (FQTN) can then be computed by
>> walking
>> >> > up from a table and concating 

Re: [VOTE] Adopt ADBC database client connectivity specification

2022-09-22 Thread Gavin Ray
Ah okay, yeah that's a reasonable angle too haha


On Thu, Sep 22, 2022 at 1:59 PM David Li  wrote:

> Frankly it was from a "not drastically refactoring things" perspective :)
>
> At least for Arrow: list[utf8] is effectively a utf8 array with an extra
> array of offsets, so there's relatively little overhead. (In particular,
> there's not an extra allocation per array; there's just an overall
> allocation of a bitmap/offsets buffer.)
>
> On Thu, Sep 22, 2022, at 13:46, Gavin Ray wrote:
> > I suppose you're thinking from a memory/performance perspective right?
> > Allocating a dot character is a lot better than allocating multiple
> arrays
> >
> > Yeah I don't see why not -- this could even be a library internal where
> the
> > fact that it's dotted is an implementation detail
> > Then in the Java implementation or whatnot, you can call
> > ".getFullyQualifiedTableName()" which will do the allocating parse to a
> > List for you, or whatnot
> >
> > The array was mostly for convenience's sake (our API is JSON and not
> > particularly performance-oriented)
> >
> > On Thu, Sep 22, 2022 at 1:40 PM David Li  wrote:
> >
> >> Ah, interesting…
> >>
> >> A self-recursive schema wouldn't work in Arrow's schema system, so it'd
> >> have to be the latter solution. Or, would it work to have a dotted name
> in
> >> the schema name column? Would parsing that back out (for applications
> that
> >> want to work with the full hierarchy) be too much trouble?
> >>
> >> On Thu, Sep 22, 2022, at 13:14, Gavin Ray wrote:
> >> > Antoine, I can't comment on the Go code (not qualified) but to me, the
> >> > "verification" test
> >> > examples look like a mixture between JDBC and Java FlightSQL driver
> >> usage,
> >> > and seem solid.
> >> >
> >> > There was one reservation I had about the ability to handle datasource
> >> > namespacing that I brought up early on in the proposal discussions
> >> > (David responded to it but I got busy and forgot to reply again)
> >> >
> >> > If you have a datasource which provides possibly arbitrary levels of
> >> schema
> >> > namespace (something like Apache Calcite, for example)
> >> > How do you represent the table/schema names?
> >> >
> >> > Suppose I have a service with a DB layout like this:
> >> >
> >> > / foo
> >> > / bar
> >> > / baz
> >> > /qux
> >> >   / table1
> >> > - column1
> >> >
> >> > At my dayjob, we have a technology which is very similar to
> >> > ADBC/FlightSQL
> >> > (would be great to adopt Substrait + ADBC once they're mature enough)
> >> > -
> >> >
> >>
> https://github.com/hasura/graphql-engine/blob/master/dc-agents/README.md#data-connectors
> >> > -
> >> >
> >>
> https://techcrunch.com/2022/06/28/hasura-now-lets-developers-turn-any-data-source-into-a-graphql-api/
> >> >
> >> > We wound up having to redesign the specification to handle datasources
> >> that
> >> > don't fit the "database-schema-table" or "database-table" mould
> >> >
> >> > In the ADBC schema for schema metadata, it looks like it expects a
> >> > single
> >> > "schema" struct:
> >> >
> >>
> https://github.com/apache/arrow-adbc/blob/7866a566f5b7b635267bfb7a87ea49b01dfe89fa/java/core/src/main/java/org/apache/arrow/adbc/core/StandardSchemas.java#L132-L152
> >> >
> >> > If you want to be flexible, IMO it would be good to either:
> >> >
> >> > 1. Have DB_SCHEMA_SCHEMA be self-recursive, so that schemas (with or
> >> > without tables) can be nested arbitrarily deep underneath each other
> >> >   - Fully-Qualified-Table-Name (FQTN) can then be computed by
> walking
> >> > up from a table and concating the schema name until the root schema is
> >> > reached
> >> >
> >> > 2. Make "catalog" and "schema" go away entirely, and tables just have
> a
> >> > FQTN that is an array, a database is a collection of tables
> >> >  - You can compute what would have been the catalog + schema
> >> hierarchy
> >> > by doing a .reduce() over the list of tables and
> >> >
> >> > Or maybe there is another, bette

Re: [VOTE] Adopt ADBC database client connectivity specification

2022-09-22 Thread Gavin Ray
I suppose you're thinking from a memory/performance perspective right?
Allocating a dot character is a lot better than allocating multiple arrays

Yeah I don't see why not -- this could even be a library internal where the
fact that it's dotted is an implementation detail
Then in the Java implementation or whatnot, you can call
".getFullyQualifiedTableName()" which will do the allocating parse to a
List for you, or whatnot

The array was mostly for convenience's sake (our API is JSON and not
particularly performance-oriented)
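For reference, the values-plus-offsets layout David describes for list[utf8] can be modeled in a few lines. This is a simplified illustration only; real Arrow arrays also carry validity bitmaps and fixed-width offset buffers:

```python
def encode_list_utf8(rows):
    # rows: list of rows, each a list of strings. Returns the flat utf8
    # values plus the two offset arrays that Arrow's list[utf8] layout
    # implies: string offsets and list offsets.
    values, string_offsets, list_offsets = "", [0], [0]
    for row in rows:
        for s in row:
            values += s
            string_offsets.append(len(values))
        list_offsets.append(len(string_offsets) - 1)
    return values, string_offsets, list_offsets

def element(encoded, row, col):
    # Element (row, col) is values[so[k]:so[k + 1]] where k = lo[row] + col.
    values, so, lo = encoded
    k = lo[row] + col
    return values[so[k]:so[k + 1]]

encoded = encode_list_utf8([["postgres_1", "db1", "my_table"],
                            ["csv", "f.csv"]])
```

The per-element cost over a plain utf8 array is one extra offset entry per row, which is the "relatively little overhead" point.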

On Thu, Sep 22, 2022 at 1:40 PM David Li  wrote:

> Ah, interesting…
>
> A self-recursive schema wouldn't work in Arrow's schema system, so it'd
> have to be the latter solution. Or, would it work to have a dotted name in
> the schema name column? Would parsing that back out (for applications that
> want to work with the full hierarchy) be too much trouble?
>
> On Thu, Sep 22, 2022, at 13:14, Gavin Ray wrote:
> > Antoine, I can't comment on the Go code (not qualified) but to me, the
> > "verification" test
> > examples look like a mixture between JDBC and Java FlightSQL driver
> usage,
> > and seem solid.
> >
> > There was one reservation I had about the ability to handle datasource
> > namespacing that I brought up early on in the proposal discussions
> > (David responded to it but I got busy and forgot to reply again)
> >
> > If you have a datasource which provides possibly arbitrary levels of
> schema
> > namespace (something like Apache Calcite, for example)
> > How do you represent the table/schema names?
> >
> > Suppose I have a service with a DB layout like this:
> >
> > / foo
> > / bar
> > / baz
> > /qux
> >   / table1
> > - column1
> >
> > At my dayjob, we have a technology which is very similar to
> > ADBC/FlightSQL
> > (would be great to adopt Substrait + ADBC once they're mature enough)
> > -
> >
> https://github.com/hasura/graphql-engine/blob/master/dc-agents/README.md#data-connectors
> > -
> >
> https://techcrunch.com/2022/06/28/hasura-now-lets-developers-turn-any-data-source-into-a-graphql-api/
> >
> > We wound up having to redesign the specification to handle datasources
> that
> > don't fit the "database-schema-table" or "database-table" mould
> >
> > In the ADBC schema for schema metadata, it looks like it expects a
> > single
> > "schema" struct:
> >
> https://github.com/apache/arrow-adbc/blob/7866a566f5b7b635267bfb7a87ea49b01dfe89fa/java/core/src/main/java/org/apache/arrow/adbc/core/StandardSchemas.java#L132-L152
> >
> > If you want to be flexible, IMO it would be good to either:
> >
> > 1. Have DB_SCHEMA_SCHEMA be self-recursive, so that schemas (with or
> > without tables) can be nested arbitrarily deep underneath each other
> >   - Fully-Qualified-Table-Name (FQTN) can then be computed by walking
> > up from a table and concating the schema name until the root schema is
> > reached
> >
> > 2. Make "catalog" and "schema" go away entirely, and tables just have a
> > FQTN that is an array, a database is a collection of tables
> >  - You can compute what would have been the catalog + schema
> hierarchy
> > by doing a .reduce() over the list of tables and
> >
> > Or maybe there is another, better way. But that's my $0.02 and the only
> > real concern about the API I have, without actually trying to build
> > something with it.
> >
> >
> >
> >
> >
> > On Thu, Sep 22, 2022 at 5:40 AM Antoine Pitrou 
> wrote:
> >
> >>
> >> Hello,
> >>
> >> I would urge people to review the proposed ADBC APIs, especially the Go
> >> and Java APIs which probably benefitted from less feedback than the C
> one.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >> > On 21/09/2022 at 17:40, David Li wrote:
> >> > Hello,
> >> >
> >> > We have been discussing [1] standard interfaces for Arrow-based
> database
> >> access and have been working on implementations of the proposed
> interfaces
> >> [2], all under the name "ADBC". This proposal aims to provide a unified
> >> client abstraction across Arrow-native database protocols (like Flight
> SQL)
> >> and non-Arrow database protocols, which can then be used by Arrow
> projects
> >> like Dataset/Acero and ecosystem projects like Ibis.
> >> >
> >> > For details, see t

Re: [VOTE] Adopt ADBC database client connectivity specification

2022-09-22 Thread Gavin Ray
Antoine, I can't comment on the Go code (not qualified) but to me, the
"verification" test
examples look like a mixture between JDBC and Java FlightSQL driver usage,
and seem solid.

There was one reservation I had about the ability to handle datasource
namespacing that I brought up early on in the proposal discussions
(David responded to it but I got busy and forgot to reply again)

If you have a datasource which provides possibly arbitrary levels of schema
namespace (something like Apache Calcite, for example)
How do you represent the table/schema names?

Suppose I have a service with a DB layout like this:

/ foo
/ bar
/ baz
/qux
  / table1
- column1

At my dayjob, we have a technology which is very similar to ADBC/FlightSQL
(would be great to adopt Substrait + ADBC once they're mature enough)
-
https://github.com/hasura/graphql-engine/blob/master/dc-agents/README.md#data-connectors
-
https://techcrunch.com/2022/06/28/hasura-now-lets-developers-turn-any-data-source-into-a-graphql-api/

We wound up having to redesign the specification to handle datasources that
don't fit the "database-schema-table" or "database-table" mould

In the ADBC schema for schema metadata, it looks like it expects a single
"schema" struct:
https://github.com/apache/arrow-adbc/blob/7866a566f5b7b635267bfb7a87ea49b01dfe89fa/java/core/src/main/java/org/apache/arrow/adbc/core/StandardSchemas.java#L132-L152

If you want to be flexible, IMO it would be good to either:

1. Have DB_SCHEMA_SCHEMA be self-recursive, so that schemas (with or
without tables) can be nested arbitrarily deep underneath each other
  - Fully-Qualified-Table-Name (FQTN) can then be computed by walking
up from a table and concating the schema name until the root schema is
reached

2. Make "catalog" and "schema" go away entirely, and tables just have a
FQTN that is an array, a database is a collection of tables
 - You can compute what would have been the catalog + schema hierarchy
by doing a .reduce() over the list of tables and

Or maybe there is another, better way. But that's my $0.02 and the only
real concern about the API I have, without actually trying to build
something with it.
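Option 1's FQTN computation (walking up from a table and collecting schema names until the root is reached) might look like the following; the Schema class and the particular nesting are hypothetical stand-ins for the layout above:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Schema:
    # Hypothetical node in an arbitrarily deep schema tree.
    name: str
    parent: Optional["Schema"] = None

def fqtn(schema, table):
    # Walk up from the table's schema to the root, collecting names,
    # then reverse to get root-first order.
    parts = [table]
    node = schema
    while node is not None:
        parts.append(node.name)
        node = node.parent
    return list(reversed(parts))

foo = Schema("foo")
bar = Schema("bar", parent=foo)
baz = Schema("baz", parent=bar)
qux = Schema("qux", parent=baz)
```

Here fqtn(qux, "table1") yields the array form used in the Hasura examples, without any catalog/schema depth baked into the metadata schema itself.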





On Thu, Sep 22, 2022 at 5:40 AM Antoine Pitrou  wrote:

>
> Hello,
>
> I would urge people to review the proposed ADBC APIs, especially the Go
> and Java APIs which probably benefitted from less feedback than the C one.
>
> Regards
>
> Antoine.
>
>
> On 21/09/2022 at 17:40, David Li wrote:
> > Hello,
> >
> > We have been discussing [1] standard interfaces for Arrow-based database
> access and have been working on implementations of the proposed interfaces
> [2], all under the name "ADBC". This proposal aims to provide a unified
> client abstraction across Arrow-native database protocols (like Flight SQL)
> and non-Arrow database protocols, which can then be used by Arrow projects
> like Dataset/Acero and ecosystem projects like Ibis.
> >
> > For details, see the RFC here:
> https://github.com/apache/arrow/pull/14079
> >
> > I would like to propose that the Arrow project adopt this RFC, along
> with apache/arrow-adbc commit 7866a56 [3], as version 1.0.0 of the ADBC API
> standard.
> >
> > Please vote to adopt the specification as described above. (This is not
> a vote to release any components.)
> >
> > This vote will be open for at least 72 hours.
> >
> > [ ] +1 Adopt the ADBC specification
> > [ ]  0
> > [ ] -1 Do not adopt the specification because...
> >
> > Thanks to the DuckDB and R DBI projects for providing feedback on and
> implementations of the proposal.
> >
> > [1]: https://lists.apache.org/thread/cq7t9s5p7dw4vschylhwsfgqwkr5fmf2
> > [2]: https://github.com/apache/arrow-adbc
> > [3]:
> https://github.com/apache/arrow-adbc/commit/7866a566f5b7b635267bfb7a87ea49b01dfe89fa
> >
> > Thank you,
> > David
>


Re: [VOTE] Adopt ADBC database client connectivity specification

2022-09-21 Thread Gavin Ray
+1 (non-binding/I'm not important)

On Wed, Sep 21, 2022 at 11:40 AM David Li  wrote:

> Hello,
>
> We have been discussing [1] standard interfaces for Arrow-based database
> access and have been working on implementations of the proposed interfaces
> [2], all under the name "ADBC". This proposal aims to provide a unified
> client abstraction across Arrow-native database protocols (like Flight SQL)
> and non-Arrow database protocols, which can then be used by Arrow projects
> like Dataset/Acero and ecosystem projects like Ibis.
>
> For details, see the RFC here: https://github.com/apache/arrow/pull/14079
>
> I would like to propose that the Arrow project adopt this RFC, along with
> apache/arrow-adbc commit 7866a56 [3], as version 1.0.0 of the ADBC API
> standard.
>
> Please vote to adopt the specification as described above. (This is not a
> vote to release any components.)
>
> This vote will be open for at least 72 hours.
>
> [ ] +1 Adopt the ADBC specification
> [ ]  0
> [ ] -1 Do not adopt the specification because...
>
> Thanks to the DuckDB and R DBI projects for providing feedback on and
> implementations of the proposal.
>
> [1]: https://lists.apache.org/thread/cq7t9s5p7dw4vschylhwsfgqwkr5fmf2
> [2]: https://github.com/apache/arrow-adbc
> [3]:
> https://github.com/apache/arrow-adbc/commit/7866a566f5b7b635267bfb7a87ea49b01dfe89fa
>
> Thank you,
> David
>


Re: Request for help with node/yarn in Docker image

2022-09-17 Thread Gavin Ray
(I omitted the part where you'd need to run the "apt-get install nginx"
above in the last, single-file Docker build, whoops)

That would of course go after the "COPY --from=ui-build" and before the
CMD/ENTRYPOINT 

On Sat, Sep 17, 2022 at 9:46 PM Gavin Ray  wrote:

> Hey Andy,
>
> Happy to be useful in some way, I have a fair amount of experience here.
>
> Since you already have a Dockerfile next to this one that is building the
> React app and serving it on NGINX:
> "/workspaces/arrow-ballista/dev/docker/ballista-scheduler-ui.dockerfile"
>
> You can just copy the built assets out of it:
>
> ARG VERSION
> FROM apache/arrow-ballista:$VERSION
> COPY --from=ballista-scheduler-ui:0.8.0 /usr/share/nginx/html
> /usr/share/nginx/html
>
> # TODO start nginx in background to serve the UI
>
> ENV RUST_LOG=info
> ENV RUST_BACKTRACE=full
>
> CMD ["/scheduler", "&&", "nginx", "-g", "daemon off;"]
>
> That run command is probably wrong, you'd want to use a shell script that
> does both things with ENTRYPOINT, not CMD but you get the point
> It seems to build anyways:
>
> [image: image.png]
>
> If you want to have it in a single step, you can rewrite the Dockerfile
> like this:
> ARG VERSION
>
> FROM node:18-alpine as ui-build
> WORKDIR /app
> ENV PATH /app/node_modules/.bin:$PATH
>
> COPY package.json ./
> COPY yarn.lock ./
> RUN yarn
>
> COPY . ./
> RUN yarn build
>
> FROM apache/arrow-ballista:$VERSION
> COPY --from=ui-build /app/build /usr/share/nginx/html
>
> # TODO start nginx in background to serve the UI
> ENV RUST_LOG=info
> ENV RUST_BACKTRACE=full
>
> CMD ["/scheduler", "&&", "nginx", "-g", "daemon off;"]
>
> On Sat, Sep 17, 2022 at 4:07 PM Andy Grove  wrote:
>
>> The Ballista project had a scheduler UI contributed a while back [1], and
>> I
>> can get this working locally but am running into errors when trying to
>> build this in a Docker image along with the Rust scheduler process. I have
>> zero experience with node/yarn, so am wondering if anyone could spare some
>> time to help point me in the right direction.
>>
>> I have a PR up with a comment where I am stuck. [2]
>>
>> Thanks,
>>
>> Andy.
>>
>> [1]
>>
>> https://github.com/apache/arrow-ballista/commit/372ba5fadf3c6c645b98185589996c849e42aac5
>> [2] https://github.com/apache/arrow-ballista/pull/238
>>
>
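The shell-script-with-ENTRYPOINT approach suggested above can be sketched as follows. This is an illustrative config sketch, not part of the Ballista images: the script name `entrypoint.sh` is an assumption, and the exec-form `CMD ["/scheduler", "&&", "nginx", ...]` cannot work because exec-form commands bypass the shell, so `"&&"` would be passed to `/scheduler` as a literal argument.

```sh
#!/bin/sh
# entrypoint.sh (illustrative): start nginx in the background to serve the
# scheduler UI, then exec the scheduler so it runs as PID 1 and receives
# container stop signals.
nginx -g 'daemon off;' &
exec /scheduler
```

The Dockerfile would then add `COPY entrypoint.sh /entrypoint.sh`, make it executable, and use `ENTRYPOINT ["/entrypoint.sh"]` in place of the broken CMD line.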


Re: Request for help with node/yarn in Docker image

2022-09-17 Thread Gavin Ray
Hey Andy,

Happy to be useful in some way, I have a fair amount of experience here.

Since you already have a Dockerfile next to this one that is building the
React app and serving it on NGINX:
"/workspaces/arrow-ballista/dev/docker/ballista-scheduler-ui.dockerfile"

You can just copy the built assets out of it:

ARG VERSION
FROM apache/arrow-ballista:$VERSION
COPY --from=ballista-scheduler-ui:0.8.0 /usr/share/nginx/html
/usr/share/nginx/html

# TODO start nginx in background to serve the UI

ENV RUST_LOG=info
ENV RUST_BACKTRACE=full

CMD ["/scheduler", "&&", "nginx", "-g", "daemon off;"]

That run command is probably wrong, you'd want to use a shell script that
does both things with ENTRYPOINT, not CMD but you get the point
It seems to build anyways:

[image: image.png]

If you want to have it in a single step, you can rewrite the Dockerfile
like this:
ARG VERSION

FROM node:18-alpine as ui-build
WORKDIR /app
ENV PATH /app/node_modules/.bin:$PATH

COPY package.json ./
COPY yarn.lock ./
RUN yarn

COPY . ./
RUN yarn build

FROM apache/arrow-ballista:$VERSION
COPY --from=ui-build /app/build /usr/share/nginx/html

# TODO start nginx in background to serve the UI
ENV RUST_LOG=info
ENV RUST_BACKTRACE=full

CMD ["/scheduler", "&&", "nginx", "-g", "daemon off;"]

On Sat, Sep 17, 2022 at 4:07 PM Andy Grove  wrote:

> The Ballista project had a scheduler UI contributed a while back [1], and I
> can get this working locally but am running into errors when trying to
> build this in a Docker image along with the Rust scheduler process. I have
> zero experience with node/yarn, so am wondering if anyone could spare some
> time to help point me in the right direction.
>
> I have a PR up with a comment where I am stuck. [2]
>
> Thanks,
>
> Andy.
>
> [1]
>
> https://github.com/apache/arrow-ballista/commit/372ba5fadf3c6c645b98185589996c849e42aac5
> [2] https://github.com/apache/arrow-ballista/pull/238
>


Re: [VOTE] Substrait for Flight SQL

2022-09-16 Thread Gavin Ray
Hooray!

On Fri, Sep 16, 2022 at 11:08 AM David Li  wrote:

> The PR is now merged:
> https://github.com/apache/arrow/commit/3ce40143f8a836df058ec5fe1b29d9da5ede169d
>
> Thanks all!
>
> On Sat, Sep 10, 2022, at 18:15, David Li wrote:
> > The vote passes with 5 binding votes and 7 non-binding votes. Thanks all!
> >
> > I will rebase the PR and ensure CI passes before merging.
> >
> > On Fri, Sep 9, 2022, at 16:14, Wes McKinney wrote:
> >> +1 (binding)
> >>
> >> On Thu, Sep 8, 2022 at 9:12 PM Jacques Nadeau 
> wrote:
> >>>
> >>> My vote continues to be +1
> >>>
> >>> On Thu, Sep 8, 2022 at 11:44 AM Neal Richardson <
> neal.p.richard...@gmail.com>
> >>> wrote:
> >>>
> >>> > +1
> >>> >
> >>> > Neal
> >>> >
> >>> > On Thu, Sep 8, 2022 at 2:15 PM Ashish 
> wrote:
> >>> >
> >>> > > +1 (non-binding)
> >>> > >
> >>> > > On Thu, Sep 8, 2022 at 9:41 AM Gavin Ray 
> wrote:
> >>> > >
> >>> > > > Oh, so that's what "non-binding" means in vote threads
> >>> > > > Those threads make a lot more sense now, thanks for the heads-up
> =)
> >>> > > >
> >>> > > > On Thu, Sep 8, 2022 at 12:31 PM David Li 
> wrote:
> >>> > > >
> >>> > > > > Non-binding votes are always welcome and encouraged! Was just
> trying
> >>> > to
> >>> > > > > make sure we have the minimum 3 binding votes here but it
> turns out I
> >>> > > > can't
> >>> > > > > count and I make three.
> >>> > > > >
> >>> > > > > On Thu, Sep 8, 2022, at 12:14, Gavin Ray wrote:
> >>> > > > > > If non-PMC can vote, I'll also give a huge +1
> >>> > > > > >
> >>> > > > > > On Thu, Sep 8, 2022 at 11:34 AM Matthew Topol
> >>> > > > > 
> >>> > > > > > wrote:
> >>> > > > > >
> >>> > > > > >> I'm not PMC but i'll give a +1 (non-binding) vote. I like
> the idea
> >>> > > of
> >>> > > > > >> integrating Substrait plans into Flight SQL if possible and
> it
> >>> > > aligns
> >>> > > > > >> with the arrow-adbc work.
> >>> > > > > >>
> >>> > > > > >> On Thu, Sep 8 2022 at 11:31:59 AM -0400, David Li <
> >>> > > > lidav...@apache.org>
> >>> > > > > >> wrote:
> >>> > > > > >> > My vote: +1 (binding)
> >>> > > > > >> >
> >>> > > > > >> > Are any other PMC members available to take a look?
> >>> > > > > >> >
> >>> > > > > >> > On Wed, Sep 7, 2022, at 09:18, Antoine Pitrou wrote:
> >>> > > > > >> >>  Fair enough. For the record, my main concern with ad-hoc
> >>> > > > conventions
> >>> > > > > >> >>  such as "number of milliseconds expressed as an
> integer" is
> >>> > the
> >>> > > > poor
> >>> > > > > >> >>  usability and the potential for confusion (not to
> mention that
> >>> > > > > >> >> sometimes
> >>> > > > > >> >>  the need for a higher precision can lead to add another
> set of
> >>> > > > > >> >> APIs, but
> >>> > > > > >> >>  that's unlikely to be the case here :-)).
> >>> > > > > >> >>
> >>> > > > > >> >>  Regards
> >>> > > > > >> >>
> >>> > > > > >> >>  Antoine.
> >>> > > > > >> >>
> >>> > > > > >> >>
> >>> > > > > >> >>  Le 07/09/2022 à 14:21, David Li a écrit :
> >>> > > > > >> >>>  Absent further comments on this I would rather avoid
> adding a
> >>> > > > > >> >>> potentially breaking (even if likely compatible) change
> to the
> >>> > > > > >> >>> schema of this endpoint, if that's acceptable. I don't
> think a
> >>> > > > > >> >>> millisecond timeout is all too different from
> floating-point
> >>> > > > > >> >>> seconds (especially at the scale of network RPCs).
> >>> > > > > >> >>>
> >>> > > > > >> >>>  On Tue, Sep 6, 2022, at 12:44, David Li wrote:
> >>> > > > > >> >>>>  We could add a new type code to the union. Presumably
> >>> > > consumers
> >>> > > > > >> >>>> would
> >>> > > > > >> >>>>  just error on or ignore such values (the libraries
> just hand
> >>> > > the
> >>> > > > > >> >>>> Arrow
> >>> > > > > >> >>>>  array to the application, so it's up to the
> application what
> >>> > > to
> >>> > > > > >> >>>> do with
> >>> > > > > >> >>>>  an unknown type code). (And for a new consumer
> talking to an
> >>> > > old
> >>> > > > > >> >>>>  server, the new type code would just never come up,
> so the
> >>> > > only
> >>> > > > > >> >>>> issue
> >>> > > > > >> >>>>  would be if it strictly validates the returned
> schema.)
> >>> > > > > >> >>>>
> >>> > > > > >> >>>>  If there's support, I can make this revision as well.
> >>> > > > > >> >>>>
> >>> > > > > >> >>>>  On Tue, Sep 6, 2022, at 12:37, Antoine Pitrou wrote:
> >>> > > > > >> >>>>>  Le 06/09/2022 à 17:21, David Li a écrit :
> >>> > > > > >> >>>>>>  Thanks Antoine!
> >>> > > > > >> >>>>>>
> >>> > > > > >> >>>>>>  I've updated the PR (except for the comment about
> timeout
> >>> > > > > >> >>>>>> units, since SqlInfo values can't be doubles/floats
> unless
> >>> > we
> >>> > > > > >> >>>>>> change the schema there)
> >>> > > > > >> >>>>>
> >>> > > > > >> >>>>>  Can we change the schema in a backwards-compatible
> way?
> >>> > > > > >>
> >>> > > > > >>
> >>> > > > >
> >>> > > >
> >>> > >
> >>> > >
> >>> > > --
> >>> > > thanks
> >>> > > ashish
> >>> > >
> >>> >
>


Re: [VOTE] Substrait for Flight SQL

2022-09-08 Thread Gavin Ray
Oh, so that's what "non-binding" means in vote threads
Those threads make a lot more sense now, thanks for the heads-up =)

On Thu, Sep 8, 2022 at 12:31 PM David Li  wrote:

> Non-binding votes are always welcome and encouraged! Was just trying to
> make sure we have the minimum 3 binding votes here but it turns out I can't
> count and I make three.
>
> On Thu, Sep 8, 2022, at 12:14, Gavin Ray wrote:
> > If non-PMC can vote, I'll also give a huge +1
> >
> > On Thu, Sep 8, 2022 at 11:34 AM Matthew Topol
> 
> > wrote:
> >
> >> I'm not PMC but i'll give a +1 (non-binding) vote. I like the idea of
> >> integrating Substrait plans into Flight SQL if possible and it aligns
> >> with the arrow-adbc work.
> >>
> >> On Thu, Sep 8 2022 at 11:31:59 AM -0400, David Li 
> >> wrote:
> >> > My vote: +1 (binding)
> >> >
> >> > Are any other PMC members available to take a look?
> >> >
> >> > On Wed, Sep 7, 2022, at 09:18, Antoine Pitrou wrote:
> >> >>  Fair enough. For the record, my main concern with ad-hoc conventions
> >> >>  such as "number of milliseconds expressed as an integer" is the poor
> >> >>  usability and the potential for confusion (not to mention that
> >> >> sometimes
> >> >>  the need for a higher precision can lead to add another set of
> >> >> APIs, but
> >> >>  that's unlikely to be the case here :-)).
> >> >>
> >> >>  Regards
> >> >>
> >> >>  Antoine.
> >> >>
> >> >>
> >> >>  Le 07/09/2022 à 14:21, David Li a écrit :
> >> >>>  Absent further comments on this I would rather avoid adding a
> >> >>> potentially breaking (even if likely compatible) change to the
> >> >>> schema of this endpoint, if that's acceptable. I don't think a
> >> >>> millisecond timeout is all too different from floating-point
> >> >>> seconds (especially at the scale of network RPCs).
> >> >>>
> >> >>>  On Tue, Sep 6, 2022, at 12:44, David Li wrote:
> >> >>>>  We could add a new type code to the union. Presumably consumers
> >> >>>> would
> >> >>>>  just error on or ignore such values (the libraries just hand the
> >> >>>> Arrow
> >> >>>>  array to the application, so it's up to the application what to
> >> >>>> do with
> >> >>>>  an unknown type code). (And for a new consumer talking to an old
> >> >>>>  server, the new type code would just never come up, so the only
> >> >>>> issue
> >> >>>>  would be if it strictly validates the returned schema.)
> >> >>>>
> >> >>>>  If there's support, I can make this revision as well.
> >> >>>>
> >> >>>>  On Tue, Sep 6, 2022, at 12:37, Antoine Pitrou wrote:
> >> >>>>>  Le 06/09/2022 à 17:21, David Li a écrit :
> >> >>>>>>  Thanks Antoine!
> >> >>>>>>
> >> >>>>>>  I've updated the PR (except for the comment about timeout
> >> >>>>>> units, since SqlInfo values can't be doubles/floats unless we
> >> >>>>>> change the schema there)
> >> >>>>>
> >> >>>>>  Can we change the schema in a backwards-compatible way?
> >>
> >>
>


Re: [VOTE] Substrait for Flight SQL

2022-09-08 Thread Gavin Ray
If non-PMC can vote, I'll also give a huge +1

On Thu, Sep 8, 2022 at 11:34 AM Matthew Topol 
wrote:

> I'm not PMC but i'll give a +1 (non-binding) vote. I like the idea of
> integrating Substrait plans into Flight SQL if possible and it aligns
> with the arrow-adbc work.
>
> On Thu, Sep 8 2022 at 11:31:59 AM -0400, David Li 
> wrote:
> > My vote: +1 (binding)
> >
> > Are any other PMC members available to take a look?
> >
> > On Wed, Sep 7, 2022, at 09:18, Antoine Pitrou wrote:
> >>  Fair enough. For the record, my main concern with ad-hoc conventions
> >>  such as "number of milliseconds expressed as an integer" is the poor
> >>  usability and the potential for confusion (not to mention that
> >> sometimes
> >>  the need for a higher precision can lead to add another set of
> >> APIs, but
> >>  that's unlikely to be the case here :-)).
> >>
> >>  Regards
> >>
> >>  Antoine.
> >>
> >>
> >>  Le 07/09/2022 à 14:21, David Li a écrit :
> >>>  Absent further comments on this I would rather avoid adding a
> >>> potentially breaking (even if likely compatible) change to the
> >>> schema of this endpoint, if that's acceptable. I don't think a
> >>> millisecond timeout is all too different from floating-point
> >>> seconds (especially at the scale of network RPCs).
> >>>
> >>>  On Tue, Sep 6, 2022, at 12:44, David Li wrote:
> >>>>  We could add a new type code to the union. Presumably consumers
> >>>> would
> >>>>  just error on or ignore such values (the libraries just hand the
> >>>> Arrow
> >>>>  array to the application, so it's up to the application what to
> >>>> do with
> >>>>  an unknown type code). (And for a new consumer talking to an old
> >>>>  server, the new type code would just never come up, so the only
> >>>> issue
> >>>>  would be if it strictly validates the returned schema.)
> >>>>
> >>>>  If there's support, I can make this revision as well.
> >>>>
> >>>>  On Tue, Sep 6, 2022, at 12:37, Antoine Pitrou wrote:
> >>>>>  Le 06/09/2022 à 17:21, David Li a écrit :
> >>>>>>  Thanks Antoine!
> >>>>>>
> >>>>>>  I've updated the PR (except for the comment about timeout
> >>>>>> units, since SqlInfo values can't be doubles/floats unless we
> >>>>>> change the schema there)
> >>>>>
> >>>>>  Can we change the schema in a backwards-compatible way?
>
>


Re: [ANNOUNCE] New Arrow PMC member: Weston Pace

2022-09-05 Thread Gavin Ray
Well-earned mate!

On Mon, Sep 5, 2022 at 6:09 PM Sasha Krassovsky 
wrote:

> Congratulations Weston!! Very well deserved!
>
>
> > On Sep 5, 2022, at 11:04 AM, Ian Joiner  wrote:
> >
> > Congrats Weston!
> >
> > On Mon, Sep 5, 2022 at 1:56 AM Sutou Kouhei  wrote:
> >
> >> The Project Management Committee (PMC) for Apache Arrow has invited
> >> Weston Pace to become a PMC member and we are pleased to announce
> >> that Weston Pace has accepted.
> >>
> >> Congratulations and welcome!
> >>
>
>


Re: [RUST][Go][proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-29 Thread Gavin Ray
> there are scalar api functions that can be logically used to process rows
of data, but they are executed on columnar batches of data.
> As mentioned previously it is better to have an API that applies row
level transformations than to have an intermediary row level memory format.

Another way of thinking about this maybe is that the API would be something
of a "Row-based Facade" over underlying columnar memory, right?

As an end-user, for instance, I probably don't mind much about what happens
under the hood.
On the surface, I'd just like to be able to mentally work with rows and be
able to load data in the shape of "Map<>", and "Collection", etc

Given the disclaimer that it'd be more efficient not to start from the
row-based data (IE, in your JDBC ResultSet processing, construct Arrow
results directly instead of serializing to List)
But if you already have data in this shape, maybe from an API you don't
control, then having a row-based facade over columnar API's would be really
convenient and ergonomic.

That's my $0.02 anyways
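The "row-based facade" idea above can be sketched in a few lines. This is an illustrative Python sketch with invented names, not an Arrow API; the point is only that rows are materialized on demand while the underlying data stays column-oriented.

```python
# Illustrative sketch (not an Arrow API): a read-only "row facade" over
# columnar storage. Row views are built lazily; the data itself remains
# stored column-by-column.

class RowFacade:
    def __init__(self, columns):
        # columns: dict mapping column name -> list of values (columnar layout)
        self.columns = columns
        lengths = {len(values) for values in columns.values()}
        assert len(lengths) == 1, "all columns must have equal length"
        self.num_rows = lengths.pop()

    def row(self, i):
        # Materialize one row view on demand from the columnar data.
        return {name: col[i] for name, col in self.columns.items()}

    def __iter__(self):
        return (self.row(i) for i in range(self.num_rows))

cols = {"id": [1, 2, 3], "name": ["a", "b", "c"]}
rows = list(RowFacade(cols))
print(rows[0])
```

An end-user iterates over what looks like a collection of maps, while every access reads straight out of the column arrays.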


On Fri, Jul 29, 2022 at 9:56 AM Lee, David 
wrote:

> In pyarrow.compute which is an extension of the C++ implementation there
> are scalar api functions that can be logically used to process rows of
> data, but they are executed on columnar batches of data.
>
> As mentioned previously it is better to have an API that applies row level
> transformations than to have an intermediary row level memory format.
>
> Sent from my iPad
>
> > On Jul 29, 2022, at 3:43 AM, Andrew Lamb  wrote:
> >
> > External Email: Use caution with links and attachments
> >
> >
> > I am +0 on a standard API -- in the Rust arrow-rs implementation we tend
> to
> > borrow inspiration from the C++ / Java interfaces and then create
> > appropriate Rust APIs.
> >
> > There is also a row based format in DataFusion [1] (Rust) and it is used
> to
> > implement certain GroupBy and Sorts (similarly to what Sasha Krassovsky
> > describes for Acero).
> >
> > I think row based formats are common in vectorized query engines for
> > operations that can't be easily vectorized (sorts, groups and joins),
> > though I am not sure how reusable those formats would be
> >
> > There are at least three uses that require slightly different layouts
> > 1. Comparing row formatted data for equality (where space efficiency is
> > important)
> > 2. Comparing row formatted data for comparisons (where collation is
> > important)
> > 3. Using row formatted data to hold intermediate aggregates (where word
> > alignment is important)
> >
> > So in other words, I am not sure how easy it would be to define a common
> > in-memory layout for rows.
> >
> > Andrew
> >
> > [1]
> >
> https://urldefense.com/v3/__https://github.com/apache/arrow-datafusion/blob/3cd62e9/datafusion/row/src/layout.rs*L29-L75__;Iw!!KSjYCgUGsB4!eacNf7LBCm3exjzmw63baxsIs0UpuyAHVbpiOU59jYjalL_GyR3HdMRD1O6zYKLe_omitJ2GZSb1q1tHhSXS$
> >
> >
> >
> >> On Fri, Jul 29, 2022 at 2:06 AM Laurent Quérel <
> laurent.que...@gmail.com>
> >> wrote:
> >>
> >> Hi Sasha,
> >> Thank you very much for this informative comment. It's interesting to
> see
> >> another use of a row-based API in the context of a query engine. I think
> >> that there is some thought to be given to whether or not it is possible
> to
> >> converge these two use cases into a single public row-based API.
> >>
> >> As a first reaction I would say that it is not necessarily easy to
> >> reconcile because the constraints and the goals to be optimized are
> >> relatively disjoint. If you see a way to do it I'm extremely interested.
> >>
> >> If I understand correctly, in your case, you want to optimize the
> >> conversion from column to row representation and vice versa (a kind of
> >> bidirectional projection). Having a SIMD implementation of these
> >> conversions is just fantastic. However it seems that in your case there
> is
> >> no support for nested types yet and I feel like there is no public API
> to
> >> build rows in a simple and ergonomic way outside this bridge with the
> >> column-based representation.
> >>
> >> In the use case I'm trying to solve, the criteria to optimize are 1)
> expose
> >> a row-based API that offers the least amount of friction in the process
> of
> >> converting any row-based source to Arrow, which implies an easy-to-use
> API
> >> and support for nested types, 2) make it easy to create an efficient
> Arrow
> >> schema by automating dictionary creation and multi-column sorting in a
> way
> >> that makes Arrow easy to use for the casual user.
> >>
> >> The criteria to be optimized seem relatively disjointed to me but again
> I
> >> would be willing to dig with you a solution that offers a good
> compromise
> >> for these two use cases.
> >>
> >> Best,
> >> Laurent
> >>
> >>
> >>
> >> On Thu, Jul 28, 2022 at 1:46 PM Sasha Krassovsky <
> >> krassovskysa...@gmail.com>
> >> wrote:
> >>
> >>> Hi everyone,
> >>> I just wanted to chime in that we already do have a form of
> row-oriented
> 

Re: [RUST][Go][proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-28 Thread Gavin Ray
This is essentially the same idea as the proposal here I think --
row/map-based representation & conversion functions for ease of use:

[RFC] [Java] Higher-level "DataFrame"-like API. Lower barrier to entry,
increase adoption/audience and productivity. · Issue #12618 · apache/arrow
(github.com) 

Definitely a worthwhile pursuit IMO.

On Thu, Jul 28, 2022 at 4:46 PM Sasha Krassovsky 
wrote:

> Hi everyone,
> I just wanted to chime in that we already do have a form of row-oriented
> storage inside of `arrow/compute/row/row_internal.h`. It is used to store
> rows inside of GroupBy and Join within Acero. We also have utilities for
> converting to/from columnar storage (and AVX2 implementations of these
> conversions) inside of `arrow/compute/row/encode_internal.h`. Would it be
> useful to standardize this row-oriented format?
>
> As far as I understand fixed-width rows would be trivially convertible
> into this representation (just a pointer to your array of structs), while
> variable-width rows would need a little bit of massaging (though not too
> much) to be put into this representation.
>
> Sasha Krassovsky
>
> > On Jul 28, 2022, at 1:10 PM, Laurent Quérel 
> wrote:
> >
> > Thank you Micah for a very clear summary of the intent behind this
> > proposal. Indeed, I think that clarifying from the beginning that this
> > approach aims at facilitating experimentation more than efficiency in
> terms
> > of performance of the transformation phase would have helped to better
> > understand my objective.
> >
> > Regarding your question, I don't think there is a specific technical
> reason
> > for such an integration in the core library. I was just thinking that it
> > would make this infrastructure easier to find for the users and that this
> > topic was general enough to find its place in the standard library.
> >
> > Best,
> > Laurent
> >
> > On Thu, Jul 28, 2022 at 12:50 PM Micah Kornfield 
> > wrote:
> >
> >> Hi Laurent,
> >> I'm retitling this thread to include the specific languages you seem to
> be
> >> targeting in the subject line to hopefully get more eyes from
> maintainers
> >> in those languages.
> >>
> >> Thanks for clarifying the goals.  If I can restate my understanding, the
> >> intended use-case here is to provide easy (from the developer point of
> >> view) adaptation of row based formats to Arrow.  The means of achieving
> >> this is creating an API for a row-base structure, and having utility
> >> classes that can manipulate the interface to build up batches (there
> are no
> >> serialization or in memory spec associated with this API).  People
> wishing
> >> to integrate a specific row based format, can extend that API at
> whatever
> >> level makes sense for the format.
> >>
> >> I think this would be useful infrastructure as long as it was made clear
> >> that in many cases this wouldn't be the most efficient way to convert to
> >> Arrow from other formats.
> >>
> >> I don't work much with either the Rust or Go implementation, so I can't
> >> speak to if there is maintainer support for incorporating the changes
> >> directly in Arrow.  Is there any technical reasons for preferring to
> have
> >> this included directly in Arrow vs a separate library?
> >>
> >> Cheers,
> >> Micah
> >>
> >> On Thu, Jul 28, 2022 at 12:34 PM Laurent Quérel <
> laurent.que...@gmail.com>
> >> wrote:
> >>
> >>> Far be it from me to think that I know more than Jorge or Wes on this
> >>> subject. Sorry if my post gives that perception, that is clearly not my
> >>> intention. I'm just trying to defend the idea that when designing this
> >> kind
> >>> of transformation, it might be interesting to have a library to test
> >>> several mappings and evaluate them before doing a more direct
> >>> implementation if the performance is not there.
> >>>
> >>> On Thu, Jul 28, 2022 at 12:15 PM Benjamin Blodgett <
> >>> benjaminblodg...@gmail.com> wrote:
> >>>
>  He was trying to nicely say he knows way more than you, and your ideas
>  will result in a low performance scheme no one will use in production
>  ai/machine learning.
> 
>  Sent from my iPhone
> 
> > On Jul 28, 2022, at 12:14 PM, Benjamin Blodgett <
>  benjaminblodg...@gmail.com> wrote:
> >
> > I think Jorge’s opinion is that of an expert and him being
> >> humble
>  is just being tactful.  Probably listen to Jorge on performance and
>  architecture, even over Wes as he’s contributed more than anyone else
> >> and
>  know the bleeding edge of low level performance stuff more than
> anyone.
> >
> > Sent from my iPhone
> >
> >> On Jul 28, 2022, at 12:03 PM, Laurent Quérel <
> >>> laurent.que...@gmail.com>
>  wrote:
> >>
> >> Hi Jorge
> >>
> >> I don't think that the level of in-depth knowledge needed is the
> >> same
> >> between using a row-oriented internal representation and "Arrow"
> >> which
>  not
> >> only changes 

Re: [FlightSql] Spark Flight SQL

2022-07-23 Thread Gavin Ray
This sounds pretty darn nifty!
I don't have much of value to offer, but the idea sounds like a great one
to me =)

On Sat, Jul 23, 2022 at 5:18 PM Tornike Gurgenidze 
wrote:

> David, thank you for the reply.
>
> I recently managed to find the time to get back to the repo. I thought I
> would post the status update for anyone interested.
>
> The project started out as just FlightSql implementation, but I ended up
> splitting it into smaller components:
>
> 1. SparkFlightManager - a lower-level, more of a utility class, that
> enables easier development of Spark-backed FlightServers. It is supposed to
> take care of FlightServer cluster management, distribution of Spark query
> results to the FlightServer nodes, service discovery and so on, permitting
> a developer to focus on just expressing the intended business logic in
> Spark. There's a reference FlightServer implementation (
>
> https://github.com/tokoko/SparkFlightSql/blob/main/src/main/scala/com/tokoko/spark/flight/example/SparkParquetFlightProducer.scala
> )
> that illustrates how a simple parquet reader server can be implemented
> using SparkFlightManager.
>
> 2. SparkFlightSql - SparkFlightSqlProducer class that relies on
> SparkFlightManager for most of the technical stuff and focuses on simply
> mapping Spark Catalog API metadata to the FlightSql specification.
>
> 3. FlightSql DataSourceV2 - pretty self-explanatory, there's now also the
> beginnings of a DataSourceV2 implementation supporting BATCH_READ.
>
> Once again, if anyone's interested enough to contribute or maybe has a use
> case for SparkFlightManager, please feel free to reach out.
> --
> Tornike
>
> On Sun, May 29, 2022 at 5:26 AM David Li  wrote:
>
> > Hi Tornike,
> >
> > I'll have to take a closer look later when I can get back in front of a
> > real computer but I just want to say that this is super awesome, and
> thank
> > you for sharing!
> >
> > I think we've kicked around the idea of "contrib" projects in the past.
> > Maybe this can be the impetus to take up that idea? Regardless I want to
> > say that if you have any questions or feedback about Arrow and Flight SQL
> > please feel free to post it here.
> >
> > -David
> >
> > On Sat, May 28, 2022, at 18:48, Tornike Gurgenidze wrote:
> > > Hi,
> > >
> > > I'm not sure this is the right place to be posting this, so I apologize
> > in
> > > advance.
> > >
> > > Recently I started a PoC for Arrow Flight SQL Server with Spark
> backend (
> > > https://github.com/tokoko/SparkFlightSql). The main goal is to create
> a
> > > SparkThriftServer alternative that will benefit from FlightSql protocol
> > and
> > > will also be distributed in nature, i.e. query results won't have to
> pass
> > > through a single server.
> > >
> > > I thought it might be interesting for those of you who are also
> familiar
> > > with Spark. I don't have much of an experience with Arrow, so I would
> > > appreciate any sort of involvement from Arrow community.
> > >
> > > Regards,
> > > Tornike
>


Re: Arrow sync call July 20 at 12:00 US/Eastern, 16:00 UTC

2022-07-20 Thread Gavin Ray
Awesome, thanks for the clarification David!

On Wed, Jul 20, 2022 at 2:40 PM David Li  wrote:

> It was pulled out of the ADBC project so you can see an example at [1]
> (API changed slightly when ported though).
>
> Yes, it'll bind one row of values at a time, and your description is
> correct.
>
> [1]:
> https://github.com/apache/arrow-adbc/blob/cf43e0cc2ae15ad0ce669b531d475ee218698100/java/driver/jdbc/src/main/java/org/apache/arrow/adbc/driver/jdbc/JdbcStatement.java#L160
>
> -David
>
> On Wed, Jul 20, 2022, at 14:22, Gavin Ray wrote:
> > That JDBC PreparedStatement binding utility looks super useful!
> > I had one question about the behavior of it, if that's alright:
> >
> > The doc says:
> >
> > "Each call to next() will bind parameters
> >> from the next row of data, and then the application can execute the
> >> statement, call addBatch(), etc. as desired."
> >
> >
> > And shows the code:
> >
> > final JdbcParameterBinder binder =
> >> JdbcParameterBinder.builder(statement, root).bindAll().build();
> >
> >
> >
> >> while (binder.next()) {
> >> statement.executeUpdate();
> >> }
> >
> >
> > Could someone elaborate what happens here with some simple example
> > VectorSchemaRoot?
> > I'm having trouble following the meaning. Does this perform
> executeUpdate()
> > once for each row-wise set of column values?
> >
> > On Wed, Jul 20, 2022 at 1:41 PM Will Jones 
> wrote:
> >
> >> Attendees:
> >>
> >>- Jacob Wujciak-Jens
> >>- James Duong
> >>- Rok Mihevc
> >>- Raul Cumplido
> >>- Eduardo Ponce
> >>- Jeremy Parr-Pearson
> >>- Will Jones
> >>- Joris Van den Bossche
> >>
> >> Discussion
> >>
> >> Arrow 9.0.0 Release
> >>
> >> Increased capacity for crossbow, like 3x, including Macs. Devs are
> >> encouraged to use more crossbow runs to make sure their PRs keep master
> as
> >> release-able as possible.
> >>
> >>
> >> Also working on more caching improvements, but likely won’t make it in
> by
> >> release.
> >>
> >> There's been a large decline in passing nightly tests in recent days
> (see
> >> the Nightly Dashboard [1]). This is mostly caused by race conditions in
> the
> >> scanner (ARROW-17127 [2]). This issue is a blocker for release, but does
> >> not yet have any progress.
> >>
> >> We still have 8 blocker issues for release (see release dashboard [3]).
> >> Only 4 of them appear to be actively worked on, so more attention is
> needed
> >> on these issues.
> >>
> >>
> >> From monday onwards, there will be a feature freeze for the 9.0.0
> release.
> >> The release managers will only cherry pick commits that fix blocker
> issues.
> >> Contributors should continue to merge stuff to master as normal; it will
> >> just not be included in release.
> >>
> >> One nightly failure is due to a known issue with protobuf ABI on MacOS.
> >> There is an upstream release coming soon that will fix it [4].
> >>
> >> Any reviews that should be prioritized? David noted that the JDBC
> module in
> >> Java needs to be reviewed by additional maintainers [5]. Key question:
> Do
> >> we want to support this?
> >> Win32 PR
> >>
> >> James has been working on the Win32 PR fix [6]. Mostly build warnings
> being
> >> fixed. Focusing on 2017 32-bit MSVC. Most of the library code has been
> >> fixed; now fixing test cases.
> >>
> >> One failing test is the R ubuntu test. Rok noted it is likely unrelated
> as
> >> he is seeing elsewhere [7].
> >> Flight SQL JDBC Contribution Update
> >>
> >> Apache side was scrutinizing the vote, but now that discussion is
> settled.
> >> Still waiting on one ICLA, which may have been submitted but gotten
> lost.
> >> This contribution likely won’t hit 9.0.0.
> >>
> >>
> >> [1] https://crossbow.voltrondata.com/
> >> [2] https://issues.apache.org/jira/browse/ARROW-17127
> >> [3]
> https://cwiki.apache.org/confluence/display/ARROW/Arrow+9.0.0+Release
> >> [4] https://github.com/protocolbuffers/protobuf/pull/10271
> >> [5] https://github.com/apache/arrow/pull/13589
> >> [6] https://github.com/apache/arrow/pull/13532
> >> [7]
> https://github.com/apache/arrow/runs/7424773120?check_suite_focus=true
> >>
> >>
> >>
> >> On Tue, Jul 19, 2022 at 12:50 PM Ian Cook 
> wrote:
> >>
> >> > Hi all,
> >> >
> >> > Our biweekly sync call is tomorrow at 12:00 noon Eastern time.
> >> >
> >> > The Zoom meeting URL for this and other biweekly Arrow sync calls is:
> >> > https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
> >> >
> >> > Alternatively, enter this information into the Zoom website or app to
> >> > join the call:
> >> > Meeting ID: 876 4903 3008
> >> > Passcode: 958092
> >> >
> >> > Thanks,
> >> > Ian
> >> >
> >>
>


Re: Arrow sync call July 20 at 12:00 US/Eastern, 16:00 UTC

2022-07-20 Thread Gavin Ray
That JDBC PreparedStatement binding utility looks super useful!
I had one question about the behavior of it, if that's alright:

The doc says:

"Each call to next() will bind parameters
> from the next row of data, and then the application can execute the
> statement, call addBatch(), etc. as desired."


And shows the code:

final JdbcParameterBinder binder =
> JdbcParameterBinder.builder(statement, root).bindAll().build();



> while (binder.next()) {
> statement.executeUpdate();
> }


Could someone elaborate what happens here with some simple example
VectorSchemaRoot?
I'm having trouble following the meaning. Does this perform executeUpdate()
once for each row-wise set of column values?
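To make my question concrete, here is a plain-Java sketch (no Arrow or JDBC
dependency; the data and names are entirely hypothetical) of the row-wise
behavior I *think* the loop has:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a two-column, three-row VectorSchemaRoot.
class BinderSketch {
    static final String[] NAME = {"alice", "bob", "carol"};
    static final int[] AGE = {30, 25, 41};

    public static void main(String[] args) {
        List<String> updates = new ArrayList<>();
        // My reading: binder.next() advances a row cursor and binds the
        // statement parameters from that row; executeUpdate() then runs
        // once per row.
        for (int row = 0; row < NAME.length; row++) {
            // would be: stmt.setString(1, NAME[row]); stmt.setInt(2, AGE[row]);
            updates.add(NAME[row] + "/" + AGE[row]);
        }
        System.out.println(updates.size());  // one executeUpdate() per row
    }
}
```

That is, next() advancing a cursor and executeUpdate() firing once per row is
my guess at the semantics -- please correct me if that's wrong.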

On Wed, Jul 20, 2022 at 1:41 PM Will Jones  wrote:

> Attendees:
>
>- Jacob Wujciak-Jens
>- James Duong
>- Rok Mihevc
>- Raul Cumplido
>- Eduardo Ponce
>- Jeremy Parr-Pearson
>- Will Jones
>- Joris Van den Bossche
>
> Discussion
>
> Arrow 9.0.0 Release
>
> Increased capacity for crossbow, like 3x, including Macs. Devs are
> encouraged to use more crossbow runs to make sure their PRs keep master as
> release-able as possible.
>
>
> Also working on more caching improvements, but likely won’t make it in by
> release.
>
> There's been a large decline in passing nightly tests in recent days (see
> the Nightly Dashboard [1]). This is mostly caused by race conditions in the
> scanner (ARROW-17127 [2]). This issue is a blocker for release, but does
> not yet have any progress.
>
> We still have 8 blocker issues for release (see release dashboard [3]).
> Only 4 of them appear to be actively worked on, so more attention is needed
> on these issues.
>
>
> From Monday onwards, there will be a feature freeze for the 9.0.0 release.
> The release managers will only cherry pick commits that fix blocker issues.
> Contributors should continue to merge stuff to master as normal; it will
> just not be included in release.
>
> One nightly failure is due to a known issue with protobuf ABI on MacOS.
> There is an upstream release coming soon that will fix it [4].
>
> Any reviews that should be prioritized? David noted that the JDBC module in
> Java needs to be reviewed by additional maintainers [5]. Key question: Do
> we want to support this?
> Win32 PR
>
> James has been working on the Win32 PR fix [6]. Mostly build warnings being
> fixed. Focusing on 2017 32-bit MSVC. Most of the library code has been
> fixed; now fixing test cases.
>
> One failing test is the R ubuntu test. Rok noted it is likely unrelated as
> he is seeing elsewhere [7].
> Flight SQL JDBC Contribution Update
>
> Apache side was scrutinizing the vote, but now that discussion is settled.
> Still waiting on one ICLA, which may have been submitted but gotten lost.
> This contribution likely won’t hit 9.0.0.
>
>
> [1] https://crossbow.voltrondata.com/
> [2] https://issues.apache.org/jira/browse/ARROW-17127
> [3] https://cwiki.apache.org/confluence/display/ARROW/Arrow+9.0.0+Release
> [4] https://github.com/protocolbuffers/protobuf/pull/10271
> [5] https://github.com/apache/arrow/pull/13589
> [6] https://github.com/apache/arrow/pull/13532
> [7] https://github.com/apache/arrow/runs/7424773120?check_suite_focus=true
>
>
>
> On Tue, Jul 19, 2022 at 12:50 PM Ian Cook  wrote:
>
> > Hi all,
> >
> > Our biweekly sync call is tomorrow at 12:00 noon Eastern time.
> >
> > The Zoom meeting URL for this and other biweekly Arrow sync calls is:
> > https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
> >
> > Alternatively, enter this information into the Zoom website or app to
> > join the call:
> > Meeting ID: 876 4903 3008
> > Passcode: 958092
> >
> > Thanks,
> > Ian
> >
>


Re: Arrow Flight usage with graph databases

2022-07-20 Thread Gavin Ray
>
> We considered the option to analyze data to build a schema on the fly,
> however it will be quite an expensive operation which will not allow us to
> get performance benefits from using Arrow Flight.


I'm not sure you'll be able to avoid generating a schema on the fly if
it's anything like SQL or GraphQL queries, since each query has a unique
shape based on the user's selection.

Have you benchmarked this out of curiosity?
(It's not an uncommon usecase from what I've seen)

For example, Matt Topol does this to dynamically generate response schemas
in his implementation
of GraphQL-via-Flight and he says the overhead is negligible.
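For what it's worth, the core of building a schema on the fly is just taking
the union of observed property names and types, which I'd expect to be cheap
next to moving the data itself. A plain-Java sketch of the idea (hypothetical
data, no Arrow dependency):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SchemaSketch {
    public static void main(String[] args) {
        // Hypothetical graph vertices: each carries its own property set.
        List<Map<String, Object>> vertices = List.of(
                Map.<String, Object>of("name", "a", "age", 30),
                Map.<String, Object>of("name", "b", "city", "x"));

        // "Schema on the fly": union of observed property names -> types.
        Map<String, String> schema = new TreeMap<>();
        for (Map<String, Object> v : vertices) {
            v.forEach((k, val) ->
                    schema.putIfAbsent(k, val.getClass().getSimpleName()));
        }
        System.out.println(schema);
    }
}
```

A real implementation would then map the inferred types to Arrow types and
handle conflicting types for the same property (e.g. via unions), which is
where the actual design work lies.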

On Tue, Jul 19, 2022 at 11:52 PM Valentyn Kahamlyk
 wrote:

> Hi David,
>
> We are planning to use Flight for the prototype. We are also planning to
> use Flight SQL as a reference, however we wanted to explore ideas whether
> Arrow Flight Graph can be implemented on top of Arrow Flight (similar to
> Arrow Flight SQL).
>
> Graph databases generally do not expose or enforce schema, which indeed
> makes it challenging. While we do have ideas on building extensions for
> graph databases to add schema, and we do see some other ideas related to
> this, we will not be able to rely on this as part of the initial prototype.
> We considered the option to analyze data to build a schema on the fly,
> however it will be quite an expensive operation which will not allow us to
> get performance benefits from using Arrow Flight.
>
> >What type/size metadata are you referring to?
> Metadata usually includes information about data type, size and
> type-specific properties. Some complex types are made up of 10 or more
> parts. Each Vertex or Edge of graph can have its own distinct set of
> properties, but the total number of types is several dozen and this can
> serve as a basis for constructing a schema. The total size of metadata can
> be quite big, as we wanted to support cases where the graph database can be
> very large (e.g. hundreds of GBs, with vertices and edges possibly
> containing different properties).
> More information about the serialization format we are using right now can
> be found at https://tinkerpop.apache.org/docs/3.5.4/dev/io/#graphbinary.
>
> >So effectively, the internal format is being carried in a string/binary
> column?
> Yes, I am considering this option for the first stage of implementation.
>
> David, thank you again for your reply, and please let me know your thoughts
> or whether you might have any suggestions around adopting Arrow Flight for
> schema-less databases.
>
> Regards, Valentyn.
>
> On Mon, Jul 18, 2022 at 5:23 PM David Li  wrote:
>
> > Hi Valentyn,
> >
> > Just to make sure, is this Flight or Flight SQL? I ask since Flight
> itself
> > does not have a notion of transactions in the first place. I'm also
> curious
> > what the intended target client application is.
> >
> > Not being familiar with graph databases myself, I'll try to give some
> > comments…
> >
> > Lack of a schema does make things hard. There were some prior discussions
> > about schema evolution during a (Flight) data stream, which would let you
> > add/remove fields as the query progresses. And unions would let you
> > accommodate inconsistent types. But if the changes are frequent, you'd
> > negate many of the benefits of Arrow/Flight. And both of these could make
> > client-side usage inconvenient.
> >
> > What type/size metadata are you referring to? Presumably, this would
> > instead end up in the schema, once using Arrow?
> >
> > Is there any possibility to (say) unify (chunks of) the result to a
> > consistent schema at least? Or possibly, encoding (some) properties as a
> > Map> instead of as columns. (This negates the benefits
> > of columnar data, of course, if you are interested in a particular
> > property, but if you know those properties up front, the server could
> pull
> > those out into (consistently typed) columns.)
> >
> > > We are currently working on a prototype in which we are trying to use
> > Arrow Flight as a transport for transmitting requests and data to Gremlin
> > Server. Serialization is still based on an internal format due to schema
> > creation complexity.
> >
> > So effectively, the internal format is being carried in a string/binary
> > column?
> >
> > On Mon, Jul 18, 2022, at 19:55, Valentyn Kahamlyk wrote:
> > > Hi All,
> > >
> > > I'm investigating the possibility of using Arrow Flight with graph
> > databases, and exploring how to enable Arrow Flight endpoint in Apache
> > Tinkerpop Gremlin server.
> > >
> > > Now graph databases use several incompatible protocols that make it
> > difficult to use and spread the technology.
> > > A common features for graph databases are
> > > 1. Lack of a scheme. Each vertex of the graph can have its own set of
> > properties, including properties with the same name but different types.
> > Metadata such as type and size are also passed with each value, which
> > increases the amount of data 

Re: Arrow sync call June 8 at 12:00 US/Eastern, 16:00 UTC

2022-06-09 Thread Gavin Ray
This is awesome, thanks so much for the comprehensive reply

RE: point #9, also holding my breath for data update operations
(INSERT/UPDATE/DELETE) to be added to Substrait
There's an open issue about it, but it needs design work (which I don't
think I'm qualified to do):

Add Insert/Update/Delete basic functionality to specification · Issue #128
· substrait-io/substrait (github.com)
<https://github.com/substrait-io/substrait/issues/128>

On Wed, Jun 8, 2022 at 11:09 PM Ian Cook  wrote:

> Hi Gavin,
>
> There was no detailed discussion in the meeting about this, just some
> general comments, but I'll share a few areas of collaboration that I'm
> aware of:
> - There is work ongoing to enable the Arrow C++ compute engine (aka
> "Acero") to consume Substrait plans, change them into ExecPlans, and
> execute them. Work started on this late last year [1] and has
> continued since then [2].
> - There are plans to adopt Substrait in DataFusion [3] and Ballista [4]
>
> There are also several other Sustrait-related projects not directly in
> Arrow repos that engineers at Voltron Data are working on:
> - Creating a Substrait compiler for Ibis [5], to allow Python users to
> write code in a convenient analytics DSL and have it execute on
> engines that can consume Substrait
> - Creating a Substrait compiler for dplyr [6], to allow R users to
> write dplyr code that can execute on engines that can consume
> Substrait
> - Creating a Substrait plan validator [7]
> - Planning for "ADBC" to support Substrait [8]
> - Defining more functions in the Substrait specification [9] <-- This
> is an area where we could use more help
>
> Thanks,
> Ian
>
> [1] https://github.com/apache/arrow/pull/11707
> [2]
> https://github.com/apache/arrow/pulls?q=is%3Apr+substrait+label%3Alang-c%2B%2B
> [3] https://github.com/apache/arrow-datafusion/issues/2646
> [4] https://github.com/apache/arrow-ballista/issues/32
> [5] https://github.com/ibis-project/ibis-substrait/
> [6] https://github.com/voltrondata/substrait-r
> [7] http://github.com/substrait-io/substrait-validator
> [8]
> https://docs.google.com/document/d/1t7NrC76SyxL_OffATmjzZs2xcj1owdUsIF2WKL_Zw1U/
> [9] https://github.com/substrait-io/substrait/tree/main/extensions
>
>
>
> On Wed, Jun 8, 2022 at 5:41 PM Gavin Ray  wrote:
> >
> > Thanks Ian -- can I ask whether there was any discussion of note that
> > happened around Arrow + Substrait stuff?
> >
> >
> > On Wed, Jun 8, 2022 at 5:31 PM Ian Cook  wrote:
> >
> > > Attendees:
> > >
> > > Ian Cook
> > > Raúl Cumplido
> > > Alenka Frim
> > > Ian Joiner
> > > Will Jones
> > > Jorge Leitão
> > > David Li
> > > Rok Mihevc
> > > Ashish Paliwal
> > > Matthew Topol
> > > Jacob Wujciak
> > >
> > >
> > > Discussion:
> > >
> > > Recent changes to the merge script for apache/arrow PRs
> > > - Now uses a personal access token (PAT) to authenticate to the ASF
> Jira
> > > - Now requires the GitHub PAT to have workflow scope
> > > - See discussion about this on Zulip [1]
> > >
> > > Stabilizing the C Stream interface
> > > - It has been 20 months since its introduction, with no changes
> > > - See the ML discussion [2] about this
> > > - Will Jones has put up two PRs [3][4] and started a vote [5] about
> > > this on the mailing list
> > >
> > > Changes to release management guide
> > > - Most of the content from the release management guide has been moved
> > > [6] from Confluence [7] to the Arrow repo [8] where it is built as
> > > part of the Arrow docs site [9]
> > >
> > > Proposed changes to release process
> > > -  Raúl has proposed [10] a change to the release process to simplify
> > > creation of release candidates and has opened a PR [11] to update the
> > > release management guide to reflect this change
> > >
> > > Substrait project
> > > - There is more collaboration happening between the Arrow and Substrait
> > > projects
> > > - There is a Substrait Community page [12] with details about how to
> > > get involved in Substrait
> > >
> > > Proposal to Dockerize the integration tests:
> > > - Jorge opened a PR proposing this [13] that Raúl and Jacob are
> reviewing
> > >
> > > [1]
> > >
> https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/Merge.20script.20with.20API.20keys/near/285049925
> > > [2] https://lists.apache.org/thread/0y604o9s3wkyty328wv8d21ol7s40q55
> > > [3] https://

Re: Arrow sync call June 8 at 12:00 US/Eastern, 16:00 UTC

2022-06-08 Thread Gavin Ray
Thanks Ian -- can I ask whether there was any discussion of note that
happened around Arrow + Substrait stuff?


On Wed, Jun 8, 2022 at 5:31 PM Ian Cook  wrote:

> Attendees:
>
> Ian Cook
> Raúl Cumplido
> Alenka Frim
> Ian Joiner
> Will Jones
> Jorge Leitão
> David Li
> Rok Mihevc
> Ashish Paliwal
> Matthew Topol
> Jacob Wujciak
>
>
> Discussion:
>
> Recent changes to the merge script for apache/arrow PRs
> - Now uses a personal access token (PAT) to authenticate to the ASF Jira
> - Now requires the GitHub PAT to have workflow scope
> - See discussion about this on Zulip [1]
>
> Stabilizing the C Stream interface
> - It has been 20 months since its introduction, with no changes
> - See the ML discussion [2] about this
> - Will Jones has put up two PRs [3][4] and started a vote [5] about
> this on the mailing list
>
> Changes to release management guide
> - Most of the content from the release management guide has been moved
> [6] from Confluence [7] to the Arrow repo [8] where it is built as
> part of the Arrow docs site [9]
>
> Proposed changes to release process
> -  Raúl has proposed [10] a change to the release process to simplify
> creation of release candidates and has opened a PR [11] to update the
> release management guide to reflect this change
>
> Substrait project
> - There is more collaboration happening between the Arrow and Substrait
> projects
> - There is a Substrait Community page [12] with details about how to
> get involved in Substrait
>
> Proposal to Dockerize the integration tests:
> - Jorge opened a PR proposing this [13] that Raúl and Jacob are reviewing
>
> [1]
> https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/Merge.20script.20with.20API.20keys/near/285049925
> [2] https://lists.apache.org/thread/0y604o9s3wkyty328wv8d21ol7s40q55
> [3] https://github.com/apache/arrow/pull/13345
> [4] https://github.com/apache/arrow-rs/pull/1821
> [5] https://lists.apache.org/thread/5bvk6m3y3wl0m4jdsnyhdylt1w5j288k
> [6] https://github.com/apache/arrow/pull/13272
> [7]
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide
> [8]
> https://github.com/apache/arrow/blob/master/docs/source/developers/release.rst
> [9] https://arrow.apache.org/docs/dev/developers/release.html
> [10] https://lists.apache.org/thread/g6mqpyq2hc11xbgrq2pf653njzy53plt
> [11] https://github.com/apache/arrow/pull/13308
> [12] https://substrait.io/community/
> [13] https://github.com/apache/arrow/pull/12407
>
> On Wed, Jun 8, 2022 at 10:44 AM Ian Cook  wrote:
> >
> > Hi all,
> >
> > Our biweekly sync call is today at 12:00 noon Eastern time.
> >
> > The Zoom meeting URL for this and other biweekly Arrow sync calls is:
> > https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
> >
> > Alternatively, enter this information into the Zoom website or app to
> > join the call:
> > Meeting ID: 876 4903 3008
> > Passcode: 958092
> >
> > Thanks,
> > Ian
>


Re: [DISC] Improving Arrow's database support

2022-06-01 Thread Gavin Ray
This sounds great, but I had one question:

I read the initial ADBC proposal, and it mentioned that OLTP was not a
targeted use case.
If this work is intended to take on the role of a sort of standard ABI/SDK,
does that mean that building OLTP-oriented drivers/tooling with it is off
the table?

On Wed, Jun 1, 2022 at 11:11 AM Wes McKinney  wrote:

> I went ahead and created
>
> https://github.com/apache/arrow-adbc
>
> I directed issue comments / PRs to issues@
>
> On Tue, May 31, 2022 at 8:49 PM Wes McKinney  wrote:
> >
> > I think spinning up a new repository while this exploratory work
> > progresses is a fine idea — perhaps apache/arrow-dbc / arrow-adbc or
> > similar (the name can always be changed later). That would bubble up
> > discussions in a way that's easier for people to follow (watching your
> > fork isn't ideal!). If it makes sense to move code later, it can
> > always be moved.
> >
> >
> > On Tue, May 31, 2022 at 1:02 PM David Li  wrote:
> > >
> > > Some updates:
> > >
> > > The proposal is being updated based on feedback from contributors to
> DuckDB and DBI. We've been using GitHub issues on the fork to discuss the
> API design and how to implement data ingestion/bound parameters:
> https://github.com/lidavidm/arrow/issues
> > >
> > > If anyone has suggestions/ideas/questions, or would like to jump in as
> well, please feel free to chime in there too.
> > >
> > > I have also been wondering if we might want to plan to split off a new
> repo for this work? In particular, some components might be easiest to
> consume if they didn't also have a hard dependency on the Arrow C++
> libraries. And we could use the repo to manage contributed drivers (some of
> which may individually leverage the Arrow libraries). Of course,
> maintaining a parallel build system, setting up releases, etc. is also a
> lot of work.
> > >
> > > -David
> > >
> > > On Tue, Apr 26, 2022, at 15:01, Wes McKinney wrote:
> > > > I don't have major new things to add on this topic except that I've
> > > > long had the aspiration of creating something like Python's DBAPI 2.0
> > > > [1] at the C or C++ level to enable a measure of API standardization
> > > > for Arrow-native read/write interfaces with database drivers. It
> seems
> > > > like a natural complement to the wire-protocol standardization work
> > > > with FlightSQL. I had previously brought in some code that I had
> > > > worked on related to interfacing with the HiveServer2 wire protocol
> > > > (for Hive and Impala, or other HS2-compatible query engines) with the
> > > > intention of prototyping but never was able to find the time.
> > > >
> > > > From an external messaging standpoint, one thing that will be
> > > > important is to assert that this is not intended to displace or
> > > > deprecate ODBC or JDBC drivers. In fact, I would hope that the
> > > > Arrow-native APIs could be added somehow to existing driver libraries
> > > > where it made sense, so that if they are used in an application that
> > > > uses Arrow, they can opt in to using the Arrow-based APIs for getting
> > > > result sets, or doing bulk inserts, etc.
> > > >
> > > > [1]: https://peps.python.org/pep-0249/
> > > >
> > > > On Tue, Apr 26, 2022 at 12:36 PM Antoine Pitrou 
> wrote:
> > > >>
> > > >>
> > > >> Do we want something more flexible than dlopen() and runtime symbol
> > > >> lookup (a mechanism which constrains the way you can organize and
> > > >> distribute drivers)?
> > > >>
> > > >> For example, perhaps we could expose an API struct of function
> pointers
> > > >> that could be obtained through driver-specific means.
> > > >>
> > > >>
> > > >> Le 26/04/2022 à 18:29, David Li a écrit :
> > > >> > Hello,
> > > >> >
> > > >> > In light of recent efforts around Flight SQL, projects like pgeon
> [1], and long-standing tickets/discussions about database support in Arrow
> [2], it seems there's an opportunity to define standard database interfaces
> for Arrow that could unify these efforts. So we've put together a proposal
> for "ADBC", a common Arrow-based database client API:
> > > >> >
> > > >> >
> https://docs.google.com/document/d/1t7NrC76SyxL_OffATmjzZs2xcj1owdUsIF2WKL_Zw1U/edit#heading=h.r6o6j2navi4c
> > > >> >
> > > >> > A common API and implementations could help combine/simplify
> client-side projects like pgeon, or what DBI is considering [3], and help
> them take advantage of developments like Flight SQL and existing columnar
> APIs.
> > > >> >
> > > >> > We'd appreciate any feedback. (Comments should be open, please
> let me know if not.)
> > > >> >
> > > >> > [1]: https://github.com/0x0L/pgeon
> > > >> > [2]: https://issues.apache.org/jira/browse/ARROW-11670
> > > >> > [3]: https://github.com/r-dbi/dbi3/issues/48
> > > >> >
> > > >> > Thanks,
> > > >> > David
>


Re: Datafusion's Java binding is available in Maven Central

2022-05-16 Thread Gavin Ray
On that note, you should be able to use the "jextract" tool from Project
Panama to auto-generate the glue code and types if you have C headers:

panama-foreign/panama_jextract.md at foreign-jextract ·
openjdk/panama-foreign (github.com)


On Mon, May 16, 2022 at 12:24 PM Larry White  wrote:

> Hi,
> Since this is a recent improvement, I'm curious about what motivated the
> decision to not use the c-data-interface? Was it strictly a matter of
> timing or familiarity, or is there some advantage to the approach you took?
>
> I ask because I'm in the process of moving other JNI interfaces to c-data.
>
> Thanks very much.
>
> On Mon, May 16, 2022 at 6:30 AM Jiayu Liu  wrote:
>
> > Thanks for the question Antoine,
> >
> > So far the data is copied over (not IPC per se, since it's the same
> > process), because I haven't found time (and motivation) to migrate to
> > Arrow C interface just yet.
> >
> > A next step, is to allow the project to depend on arrow-c-data [1], and
> > also optimize how .so and .dylib files are shipped: currently it's
> > packaged separately, I had doubts about shipping both into a .jar,
> > because combined they exceed > 50MB.
> >
> > [1]: https://repo1.maven.org/maven2/org/apache/arrow/arrow-c-data/8.0.0/
> >
> > On May 11, 2022, Jiayu Liu  wrote:
> > > Hi dev@arrow,
> > >
> > > Recently I've created and published a Java binding[1] to
> > > datafusion[2], as part of datafusion-contrib projects[3]. I've updated
> > > the README.md[4] so people can pick it up via maven[5] or gradle.
> > >
> > > Any feedback or contributions are welcome!
> > >
> > > [1]: https://github.com/datafusion-contrib/datafusion-java
> > > [2]: https://github.com/apache/arrow-datafusion
> > > [3]: https://github.com/datafusion-contrib
> > > [4]: https://github.com/datafusion-contrib/datafusion-
> > > java/blob/main/README.md
> > > [5]: https://repo.maven.apache.org/maven2/io/github/datafusion-
> > > contrib/datafusion-java/
> >
>


Re: Datafusion's Java binding is available in Maven Central

2022-05-16 Thread Gavin Ray
This is awesome, thank you!

On Mon, May 16, 2022 at 6:30 AM Jiayu Liu  wrote:

> Thanks for the question Antoine,
>
> So far the data is copied over (not IPC per se, since it's the same
> process), because I haven't found time (and motivation) to migrate to
> Arrow C interface just yet.
>
> A next step, is to allow the project to depend on arrow-c-data [1], and
> also optimize how .so and .dylib files are shipped: currently it's
> packaged separately, I had doubts about shipping both into a .jar,
> because combined they exceed > 50MB.
>
> [1]: https://repo1.maven.org/maven2/org/apache/arrow/arrow-c-data/8.0.0/
>
> On May 11, 2022, Jiayu Liu  wrote:
> > Hi dev@arrow,
> >
> > Recently I've created and published a Java binding[1] to
> > datafusion[2], as part of datafusion-contrib projects[3]. I've updated
> > the README.md[4] so people can pick it up via maven[5] or gradle.
> >
> > Any feedback or contributions are welcome!
> >
> > [1]: https://github.com/datafusion-contrib/datafusion-java
> > [2]: https://github.com/apache/arrow-datafusion
> > [3]: https://github.com/datafusion-contrib
> > [4]: https://github.com/datafusion-contrib/datafusion-
> > java/blob/main/README.md
> > [5]: https://repo.maven.apache.org/maven2/io/github/datafusion-
> > contrib/datafusion-java/
>


Re: June 23 virtual conference to highlight work in the Arrow ecosystem

2022-05-13 Thread Gavin Ray
Super neat, saw the announcement post on Twitter and signed up the other
day!

If folks would find it interesting, I could do a short talk on a
use-case for FlightSQL (and Substrait)
The gist of it is having a central API that allows users/vendors to write
"plugins" to register new data sources:


You lose a lot of the benefits of Arrow in the serialization to JSON, but
FlightSQL as a specification is a great language-agnostic way to share
schema metadata and handle queries.
With Substrait you get a spec for expressing data compute operations as
well, so you can have things solved on both the "tell me what you have" and
"give me what you have" fronts.

(Have to wait for write operations in Substrait though, for full
functionality)

On Fri, May 13, 2022 at 9:51 AM Wes McKinney  wrote:

> hi all,
>
> My employer (Voltron Data) is organizing a free virtual conference on
> June 23 to highlight development work and usage of Apache Arrow — you
> can register for this or apply to give a talk here:
>
> https://thedatathread.com/
>
> We are especially interested in hearing from users (as opposed to only
> project developers/contributors!) about how they are using Arrow in
> their downstream applications. If you would be interested in speaking
> (talks will be pre-recorded, so you don't need to be available on June
> 23), please apply to give a short talk (~15 min) on the website!
>
> Thanks,
> Wes
>


Re: Arrow sync call May 11 at 12:00 US/Eastern, 16:00 UTC

2022-05-13 Thread Gavin Ray
I agree with this as well, and it's also along the lines of what I was
trying to propose here:

"[RFC] [Java] Higher-level "DataFrame"-like API. Lower barrier to entry,
increase adoption/audience and productivity."
https://github.com/apache/arrow/issues/12618

It would be really nice if there was a canonical, language-independent
specification (or something close to it) for what a DataFrame-like API on
top of Arrow should look like.
Then you get continuity between languages and (in theory) it should be
easier to make contributions since they wouldn't be locked to a particular
language implementation.
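To make the idea concrete, here is a deliberately tiny, hypothetical sketch
(plain Java; none of these names come from any actual proposal) of the kind
of minimal surface such a spec might pin down:

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical spec surface; a real one would also cover schemas,
// chunking, null handling, projection, joins, etc.
interface Frame<R> {
    Frame<R> filter(Predicate<R> p);
    long count();
}

// Trivial list-backed implementation, just to show the shape.
record ListFrame<R>(List<R> rows) implements Frame<R> {
    public Frame<R> filter(Predicate<R> p) {
        return new ListFrame<>(
                rows.stream().filter(p).collect(Collectors.toList()));
    }
    public long count() { return rows.size(); }
}

class FrameDemo {
    public static void main(String[] args) {
        Frame<Integer> f = new ListFrame<>(List.of(1, 2, 3, 4));
        System.out.println(f.filter(x -> x % 2 == 0).count());
    }
}
```

The point is only the shape: a small, language-neutral set of verbs that each
binding could implement over Arrow data in its own idiomatic way.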

On Fri, May 13, 2022 at 10:30 AM Alessandro Molina <
alessan...@ursacomputing.com> wrote:

> I think Arrow should definitely consider adding a DataFrame-like API.
>
> There are multiple reasons why exposing Arrow to end users instead of
> restricting it to developers of framework would be beneficial for the Arrow
> project itself.
>
> A rough approximation of DataFrame like API has been growing during the
> years anyway in many bindings and it's probably better to consolidate that
> effort in a structured process.
> The main thing I'm concerned about is adding one more interface for users.
> If we want to grow DataFrame like APIs we should grow them on top of
> Dataset (Table probably wouldn't give us enough memory management
> flexibility)  as for most users it's already confusing enough to understand
> why they should use Table or Dataset. Figure if we add one more tabular
> data structure.
>
> On Thu, May 12, 2022 at 7:14 PM Wes McKinney  wrote:
>
> > > Discussion about whether the community around Arrow would like to have
> > DataFrame-like APIs for Arrow in more languages, for example C++
> >
> > We've discussed this a bit on the mailing list in the past, see
> >
> >
> >
> https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit#heading=h.g70gstc7jq4h
> >
> > for example. It's a complicated subject because the problems that need
> > solving in a "data frame library" are much more than defining an API —
> > they involve establishing execution and mutation/copy-on-write
> > semantics (the latter which has been a huge topic of discussion in the
> > pandas community, for example). The API would be driving an internal
> > data management logic engine (similar to pandas's internal logic
> > engine — but hopefully we could make something without as many
> > problems) which would manipulate chunks of in-memory and out-of-core
> > Arrow data internally.
> >
> > I still would be interested in an Arrow-native "data frame library"
> > similar to the SFrame library that's part of Apple's (now defunct?)
> > Turi Create library [1]
> >
> > It's a can of worms but a problem not approached lightly (thinking of
> > that "one does not simply..." meme right now) and best done in heavy
> > consultation with communities that have experience supporting
> > production use of data frames for data science use cases for many
> > years.
> >
> > [1]: https://github.com/apple/turicreate
> >
> > On Wed, May 11, 2022 at 11:38 PM Ian Cook  wrote:
> > >
> > > Attendees:
> > >
> > > Joris Van den Bossche
> > > Ian Cook
> > > Nic Crane
> > > Raul Cumplido
> > > Ian Joiner
> > > David Li
> > > Rok Mihevc
> > > Dragoș Moldovan-Grünfeld
> > > Aldrin Montana
> > > Weston Pace
> > > Eduardo Ponce
> > > Matthew Topol
> > > Jacob Wujciak
> > >
> > >
> > > Discussion:
> > >
> > > Eduardo: Draft PR with a guide showing how to create a new Arrow C++
> > > compute kernel [1]
> > >  - Review requested
> > >
> > > Weston: Proposed changes to ExecPlan in Arrow C++ compute engine [2]
> > >  - Feedback requested on details described in the Jira
> > >
> > > Rok: Temporal rounding kernels option in Arrow C++ compute engine [3]
> > >  - Feedback requested about what we should name it
> > >  - Possibilities include ceil_on_boundary, ceil_is_strictly_greater,
> > > strict_ceil, ceil_is_strictly_greater, is_strict_ceil, ceil_is_strict
> > >  - Joris favors ceil_is_strictly_greater
> > >
> > > Ian C: Discussion about naming the Arrow C++ engine [4]
> > >  - Comments welcome on the mailing list
> > >
> > > David: ADBC (Arrow Database Connectivity) proposal [5][6]
> > >  - Feedback requested
> > >
> > > Ian C: Discussion about whether the community around Arrow would like
> > > to have DataFrame-like APIs for Arrow in more languages, for example
> > > C++
> > >  - For C++, maybe this would look similar to xframe [7]
> > >  - Probably better to approach projects like these outside of Arrow
> > > and have them produce plans in Substrait format [8] which the Arrow
> > > C++ engine (and other engines) could consume and execute
> > >
> > > Arrow 8.0.0 release
> > >  - Most post-release tasks complete
> > >  - Please contribute to the release blog post [9]
> > >
> > > Release process
> > >  - Please comment on the proposed RC process change [10]
> > > - There is a discussion about changing to bimonthly major releases
> > > (instead of 

Re: [Rust] Enable GitHub discussions for Rust projects?

2022-05-04 Thread Gavin Ray
How does voting on ASF mailing lists work? I assume random people don't get
votes.
If so, consider this email an informal voice of support -- otherwise +1
from me =)

On Wed, May 4, 2022 at 11:40 AM Matthew Turner 
wrote:

> +1 on enabling GitHub discussions for both arrow-rs and datafusion.  I
> think there is a lot of value in distinguishing actual "issues" with
> questions / conversations.  I believe this would also complement the
> datafusion site which doesn't have any type of forum for conversations.
>
> -Original Message-
> From: Andy Grove 
> Sent: Wednesday, May 4, 2022 11:31 AM
> To: dev 
> Subject: [Rust] Enable GitHub discussions for Rust projects?
>
> We have a request [1] to enable GitHub discussions for DataFusion.
> Personally, I am in favor of doing this for DataFusion as well as arrow-rs.
> We need to file an infra ticket to get this enabled and have to provide a
> link to "consensus discussion thread" [2] so I would like to gather
> opinions here prior to hopefully initiating a vote on this.
>
> I am in favor of keeping discussions separate from issues because:
>
> 1. I believe it makes it less intimidating for users to ask questions
> 2. It is easier for part-time contributors to keep an eye on
> discussions than to monitor the issues, which are likely always going to be
> higher volume
>
> Please let me know your thoughts.
>
> Thanks,
>
> Andy.
>
>
> [1]
> https://github.com/apache/arrow-datafusion/issues/2350
> [2]
>
> https://cwiki.apache.org/confluence/display/INFRA/Git+-+.asf.yaml+features#Git.asf.yamlfeatures-GitHubDiscussions
>


Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-26 Thread Gavin Ray
Antoine, sandboxing comes into play from two places:

1) The WASM specification itself, which puts a bounds on the types of
behaviors possible
2) The implementation of the WASM bytecode interpreter chosen, like Jorge
mentioned in the comment above

The wasmtime docs have a pretty solid section covering the sandboxing
guarantees of WASM, and then the interpreter-specific behavior/abilities of
wasmtime FWIW:
https://docs.wasmtime.dev/security-sandboxing.html#webassembly-core
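For readers unfamiliar with the guarantee being discussed: WASM code can only touch its own bounded linear memory, and any out-of-bounds access traps. The toy Python model below illustrates that contract only — it is not wasmtime's API, and real runtimes enforce the bound with guard pages and compiled bounds checks rather than interpreted code:

```python
class LinearMemory:
    """Toy model of a WASM linear memory: one bounded byte array.

    Illustrative only -- real runtimes (e.g. wasmtime) enforce this
    with guard pages and compiled bounds checks, not Python code.
    """

    def __init__(self, pages: int):
        # WASM memory grows in 64 KiB pages
        self.data = bytearray(pages * 65536)

    def load(self, addr: int, n: int) -> bytes:
        if addr < 0 or addr + n > len(self.data):
            raise MemoryError("trap: out-of-bounds memory access")
        return bytes(self.data[addr:addr + n])

    def store(self, addr: int, payload: bytes) -> None:
        if addr < 0 or addr + len(payload) > len(self.data):
            raise MemoryError("trap: out-of-bounds memory access")
        self.data[addr:addr + len(payload)] = payload


mem = LinearMemory(pages=1)
mem.store(0, b"arrow")
assert mem.load(0, 5) == b"arrow"
try:
    # one byte past the end -> traps instead of reading host memory
    mem.load(65536, 1)
except MemoryError:
    pass
```

The point of the model: a sandboxed UDF can corrupt its own linear memory but can never reach Arrow buffers in host memory unless the host explicitly copies them in.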

On Tue, Apr 26, 2022 at 10:22 AM Antoine Pitrou  wrote:

>
> On 26/04/2022 at 16:18, Jorge Cardoso Leitão wrote:
> >> Would WASM be able to interact in-process with non-WASM buffers safely?
> >
> > AFAIK yes. My understanding from playing with it in JS is that a
> > WASM-backed udf execution would be something like:
> >
> > 1. compile the C++/Rust/etc UDF to WASM (a binary format)
> > 2. provide a small WASM-compiled middleware of the c data interface that
> > consumes (binary, c data interface pointers)
> > 3. ship a WASM interpreter as part of the query engine
> > 4. pass binary and c data interface pointers from the query engine
> program
> > to the interpreter with WASM-compiled middleware
>
> Ok, but the key word in my question was "safely". What mechanisms are in
> place such that the WASM user function will not access Arrow buffers out
> of bounds? Nothing really stands out in
> https://webassembly.github.io/spec/core/index.html, but it's the first
> time I try to have a look at the WebAssembly spec.
>
> Regards
>
> Antoine.
>
>
> >
> > Step 2 is necessary to read the buffers from FFI and output the result
> back
> > from the interpreter once the UDF is done, similar to what we do in
> > datafusion to run Python from Rust. In the case of datafusion the
> "binary"
> > is a Python function, which has security implications since the Python
> > interpreter allows everything by default.
> >
> > Best,
> > Jorge
> >
> >
> >
> > On Tue, Apr 26, 2022 at 2:56 PM Antoine Pitrou 
> wrote:
> >
> >>
> >> On 25/04/2022 at 23:04, David Li wrote:
> >>> The WebAssembly documentation has a rundown of the techniques used:
> >> https://webassembly.org/docs/security/
> >>>
> >>> I think usually you would run WASM in-process, though we could indeed
> >> also put it in a subprocess to further isolate things.
> >>
> >> Would WASM be able to interact in-process with non-WASM buffers safely?
> >> It's not obvious from reading the page above.
> >>
> >>
> >>>
> >>> It would be interesting to define the Flight "harness" protocol.
> >> Handling heterogeneous arguments may require some evolution in Flight
> (e.g.
> >> if the function is non scalar and arguments are of different length -
> we'd
> >> need something like the ColumnBag proposal, so this might be a good
> reason
> >> to revive that).
> >>>
> >>> On Mon, Apr 25, 2022, at 16:35, Antoine Pitrou wrote:
>  On 25/04/2022 at 22:19, Wes McKinney wrote:
> > I was going to reply to this e-mail thread on user@ but thought I
> > would start a new thread on dev@.
> >
> > Executing user-defined functions in memory, especially untrusted
> > functions, in general is unsafe. For "trusted" functions, having an
> > in-memory API for writing them in user languages is very useful. I
> > remember tinkering with adding UDFs in Impala with LLVM IR, which
> > would allow UDFs to have performance consistent with built-ins
> > (because built-in functions are all inlined into code-generated
> > expressions), but segfaults would bring down the server, so only
> > admins could be trusted to add new UDFs.
> >
> > However, I wonder if we should eventually define an "external UDF"
> > protocol and an example UDF "harness", using Flight to do RPC across
> > the process boundaries. So the idea is that an external local UDF
> > Flight execution service is spun up, and then data is sent to the UDF
> > in a DoExchange call.
> >
> > As Jacques pointed out in an interview [1], a compelling solution to
> > the UDF sandboxing problem is WASM. This allows "untrusted" WASM
> > functions to be run safely in-process.
> 
>  How does the sandboxing work in this case? Is it simply executing in a
>  separate process with restricted capabilities, or are other mechanisms
>  put in place?
> >>
> >
>


Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-25 Thread Gavin Ray
Sounds like a fantastic idea, and WASM seems a natural choice

You get the ability to opt into IO if you want/need to, with WASI, but by
default
you can rest assured about worst-case consequences being contained.

On Mon, Apr 25, 2022 at 4:20 PM Wes McKinney  wrote:

> I was going to reply to this e-mail thread on user@ but thought I
> would start a new thread on dev@.
>
> Executing user-defined functions in memory, especially untrusted
> functions, in general is unsafe. For "trusted" functions, having an
> in-memory API for writing them in user languages is very useful. I
> remember tinkering with adding UDFs in Impala with LLVM IR, which
> would allow UDFs to have performance consistent with built-ins
> (because built-in functions are all inlined into code-generated
> expressions), but segfaults would bring down the server, so only
> admins could be trusted to add new UDFs.
>
> However, I wonder if we should eventually define an "external UDF"
> protocol and an example UDF "harness", using Flight to do RPC across
> the process boundaries. So the idea is that an external local UDF
> Flight execution service is spun up, and then data is sent to the UDF
> in a DoExchange call.
>
> As Jacques pointed out in an interview [1], a compelling solution to
> the UDF sandboxing problem is WASM. This allows "untrusted" WASM
> functions to be run safely in-process. However, we would need to
> harden and document the details of the interface between the host
> language and the user WASM code.
>
> Since there are many different potential kinds of user-defined
> functions aside from scalar functions, that increases the complexity /
> scope of specification work here also.
>
> - Wes
>
> [1]:
> https://reneeshah.medium.com/how-webassembly-gets-used-the-18-most-exciting-startups-building-with-wasm-939474e951db
>
> On Fri, Apr 22, 2022 at 2:09 PM David Li  wrote:
> >
> > This is currently being implemented for Python:
> https://github.com/apache/arrow/pull/12590 It may not land for 8.0.0 but
> should be there for 9.0.0, presumably.
> >
> > It is already possible in C++. The same APIs that built-in functions use
> to register themselves should be available to applications and there's a
> fairly trivial example of this in [1]. Such a function would also be
> available from Python/R/etc. if you could figure out how to
> package/distribute/load the application library appropriately.
> >
> > [1]:
> https://github.com/apache/arrow/blob/e1e782a4542817e8a6139d6d5e022b56abdbc81d/cpp/examples/arrow/compute_register_example.cc
> >
> > On Fri, Apr 22, 2022, at 15:04, Wenlei Xie wrote:
> >
> > Hi,
> >
> > I am wondering if I can define my own Arrow Compute function and use it,
> say in PyArrow? It looks like Compute Function has a FuntionRegistry, but I
> didn't find documentation about how to write your own Arrow Compute
> function (but maybe just didn't find the right place)
> >
> > Thank you so much!
> >
> > --
> > Best Regards,
> > Wenlei Xie
> >
> > Email: wenlei@gmail.com
> >
> >
>
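The FunctionRegistry concept referenced in this thread — register a named kernel once, then dispatch calls to it by name — can be modeled minimally in Python. This is a concept sketch, not the actual Arrow C++ or PyArrow registry API (which also tracks type signatures and kernel dispatch rules):

```python
class FunctionRegistry:
    """Minimal model of a compute-function registry: map names to
    callables and dispatch by name. Arrow's real registry additionally
    resolves kernels by input types; this sketch omits that."""

    def __init__(self):
        self._functions = {}

    def register(self, name, func):
        if name in self._functions:
            raise ValueError(f"function {name!r} already registered")
        self._functions[name] = func

    def call(self, name, *args):
        return self._functions[name](*args)


registry = FunctionRegistry()
# A "scalar kernel" over a column, represented here as a plain list
registry.register("add_one", lambda column: [x + 1 for x in column])
assert registry.call("add_one", [1, 2, 3]) == [2, 3, 4]
```

The registration-then-lookup shape is the same whether the kernel is built in, user-registered from Python, or loaded from an application library as in the C++ example David links.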


Re: [DISCUSS] A book about Apache Arrow

2022-04-20 Thread Gavin Ray
Nevermind, I'm a bit slow -- the ToC is at the bottom of the Amazon
description.

On Wed, Apr 20, 2022 at 2:17 PM Gavin Ray  wrote:

> Sorry to derail the thread a bit -- the book looks great and based on the
> author/reviewers & summary I've ordered a copy.
> Not a big deal but just curious whether there's a preview
> available/table-of-contents though? I didn't see one on either Amazon or
> Packt.
>
> Thanks =)
>
> On Wed, Apr 20, 2022 at 1:26 PM Matthew Topol  wrote:
>
>> I was wondering who they got as the technical reviewer! Haha. They never
>> told me who they got, they just pass along the comments.
>>
>> Allow me to formally thank you for that! 
>>
>> --Matt
>>
>> -Original Message-
>> From: James Duong 
>> Sent: Wednesday, April 20, 2022 1:03 PM
>> To: dev 
>> Subject: Re: [DISCUSS] A book about Apache Arrow
>>
>> Hi Matt,
>>
>> FYI I'm currently a reviewer on this book already.
>>
>> On Wed, Apr 20, 2022 at 9:51 AM Matthew Topol  wrote:
>>
>> > Hey All,
>> >
>> > I've been writing a book on Apache Arrow for Packt (packtpub.com) and
>> > it's in the final stages of revising and editing etc. The publisher,
>> > Packt, asked if I could reach out to Arrow PMC members for a couple
>> reasons:
>> >
>> >
>> >   1.  To potentially have an Arrow PMC member write a Foreword for the
>> > book, which would improve visibility of the book.
>> >   2.  To see if any PMC members would agree to do a review of the book
>> > prior to its release (currently it is planned to be eBook ready by May
>> > 13th with a print launch on June 17th)
>> >   3.  And to see if there could be any partnering / coordinating with
>> > Packt to create a potential "Community Edition" / provide discount
>> > codes/free copies to members of the Arrow Community along with social
>> > media announcements and so on.
>> >
>> > If anyone is interested please reply and let me know! The book is
>> > already listed on Amazon and Barnes & Noble for Pre-order (
>> > https://www.amazon.com/Memory-Analytics-Apache-Arrow-hierarchical-dp-1801071039/dp/1801071039/
>> > ).
>> >
>> > Thanks much for your time everyone! Hope to hear back from someone soon.
>> >
>> > Matthew Topol Vice President, Principal Software Architect
>> > mto...@factset.com T 203.810.1804 FactSet
>> > Research Systems
>> >
>> >
>>
>> --
>>
>> *James Duong*
>> Lead Software Developer
>> Bit Quill Technologies Inc.
>> Direct: +1.604.562.6082 | jam...@bitquilltech.com
>> https://www.bitquilltech.com
>>
>> This email message is for the sole use of the intended recipient(s) and
>> may contain confidential and privileged information.  Any unauthorized
>> review, use, disclosure, or distribution is prohibited.  If you are not the
>> intended recipient, please contact the sender by reply email and destroy
>> all copies of the original message.  Thank you.
>>
>


Re: [DISCUSS] A book about Apache Arrow

2022-04-20 Thread Gavin Ray
Sorry to derail the thread a bit -- the book looks great and based on the
author/reviewers & summary I've ordered a copy.
Not a big deal but just curious whether there's a preview
available/table-of-contents though? I didn't see one on either Amazon or
Packt.

Thanks =)

On Wed, Apr 20, 2022 at 1:26 PM Matthew Topol  wrote:

> I was wondering who they got as the technical reviewer! Haha. They never
> told me who they got, they just pass along the comments.
>
> Allow me to formally thank you for that! 
>
> --Matt
>
> -Original Message-
> From: James Duong 
> Sent: Wednesday, April 20, 2022 1:03 PM
> To: dev 
> Subject: Re: [DISCUSS] A book about Apache Arrow
>
> Hi Matt,
>
> FYI I'm currently a reviewer on this book already.
>
> On Wed, Apr 20, 2022 at 9:51 AM Matthew Topol  wrote:
>
> > Hey All,
> >
> > I've been writing a book on Apache Arrow for Packt (packtpub.com) and
> > it's in the final stages of revising and editing etc. The publisher,
> > Packt, asked if I could reach out to Arrow PMC members for a couple
> reasons:
> >
> >
> >   1.  To potentially have an Arrow PMC member write a Foreword for the
> > book, which would improve visibility of the book.
> >   2.  To see if any PMC members would agree to do a review of the book
> > prior to its release (currently it is planned to be eBook ready by May
> > 13th with a print launch on June 17th)
> >   3.  And to see if there could be any partnering / coordinating with
> > Packt to create a potential "Community Edition" / provide discount
> > codes/free copies to members of the Arrow Community along with social
> > media announcements and so on.
> >
> > If anyone is interested please reply and let me know! The book is
> > already listed on Amazon and Barnes & Noble for Pre-order (
> > https://www.amazon.com/Memory-Analytics-Apache-Arrow-hierarchical-dp-1801071039/dp/1801071039/
> > ).
> >
> > Thanks much for your time everyone! Hope to hear back from someone soon.
> >
> > Matthew Topol Vice President, Principal Software Architect
> > mto...@factset.com T 203.810.1804 FactSet
> > Research Systems
> >
> >
>
> --
>
> *James Duong*
> Lead Software Developer
> Bit Quill Technologies Inc.
> Direct: +1.604.562.6082 | jam...@bitquilltech.com
> https://www.bitquilltech.com
>
> This email message is for the sole use of the intended recipient(s) and
> may contain confidential and privileged information.  Any unauthorized
> review, use, disclosure, or distribution is prohibited.  If you are not the
> intended recipient, please contact the sender by reply email and destroy
> all copies of the original message.  Thank you.
>


Re: Arrow in HPC

2022-04-07 Thread Gavin Ray
Congrats!

On Thu, Apr 7, 2022 at 1:35 PM David Li  wrote:

> Just as an update: thanks to Yibo for the reviews; we've merged an initial
> implementation that will be available in Arrow 8.0.0 (if built from
> source). There's definitely more work to do:
>
> ARROW-10787 [C++][Flight] DoExchange doesn't support dictionary replacement
> ARROW-15756 [C++][FlightRPC] Benchmark in-process Flight performance
> ARROW-15835 [C++][FlightRPC] Refactor auth, middleware into the
> transport-agnostic layer
> ARROW-15836 [C++][FlightRPC] Refactor remaining methods into
> transport-agnostic handlers
> ARROW-16069 [C++][FlightRPC] Refactor error statuses/codes into the
> transport-agnostic layer
> ARROW-16124 [C++][FlightRPC] UCX server should be able to shed load
> ARROW-16125 [C++][FlightRPC] Implement shutdown with deadline for UCX
> ARROW-16126 [C++][FlightRPC] Pipeline memory allocation/registration
> ARROW-16127 [C++][FlightRPC] Improve concurrent call implementation in UCX
> client
> ARROW-16135 [C++][FlightRPC] Investigate TSAN with gRPC/UCX tests
>
> However it should be usable, and any feedback from intrepid users would be
> very welcome.
>
> On Fri, Mar 18, 2022, at 14:45, David Li wrote:
> > For anyone interested, the PR is finally up and ready:
> > https://github.com/apache/arrow/pull/12442
> >
> > As part of this, Flight in C++ was refactored to allow plugging in
> > alternative transports. There's more work to be done there (auth,
> > middleware, etc. need to be uplifted into the common layer), but this
> > should enable UCX and potentially other network transports.
> >
> > There's still some caveats as described in the PR itself, including
> > some edge cases I need to track down and missing support for a variety
> > of features, but the core data plane methods are supported and the
> > Flight benchmark can be run.
> >
> > Thanks to Yibo Cai, Pavel Shamis, Antoine Pitrou (among others) for
> > assistance and review, and the HPC Advisory Council for granting access
> > to an HPC cluster to help with development and testing.
> >
> > On Tue, Jan 18, 2022, at 18:33, David Li wrote:
> >> Ah, yes, thanks for the reminder. That's one of the things that needs
> >> to be addressed for sure.
> >>
> >> -David
> >>
> >> On Tue, Jan 18, 2022, at 17:48, Supun Kamburugamuve wrote:
> >>> One general observation. I think this implementation uses the polling
> to
> >>> check the progress. Because of the client-server semantics of Arrow
> Flight,
> >>> you may need to use an interrupt based polling like epoll to avoid the
> busy
> >>> looping.
> >>>
> >>> Best,
> >>> Supun..
> >>>
> >>> On Tue, Jan 18, 2022 at 8:13 AM David Li  wrote:
> >>>
> >>> > Thanks for those results, Yibo! Looks like there's still more room
> for
> >>> > improvement here. Yes, things are a little unstable, though I didn't
> >>> > get that much trouble trying to just start the benchmark - I will
> need
> >>> > to find suitable hardware and iron out these issues. Note that I've
> >>> > only implemented DoGet, and I haven't implemented concurrent streams,
> >>> > which would explain why most benchmark configurations hang or error.
> >>> >
> >>> > Since the last time, I've rewritten the prototype to use UCX's
> "active
> >>> > message" functionality instead of trying to implement messages over
> >>> > the "streams" API. This simplified the code. I also did some
> >>> > refactoring along the lines of Yibo's prototype to share more code
> >>> > between the gRPC and UCX implementations. Here are some benchmark
> >>> > numbers:
> >>> >
> >>> > For IPC (server/client on the same machine): UCX with shared memory
> >>> > handily beats gRPC here. UCX with TCP isn't quite up to par, though.
> >>> >
> >>> > gRPC:
> >>> > 128KiB batches: 4463 MiB/s
> >>> > 2MiB batches:   3537 MiB/s
> >>> > 32MiB batches:  1828 MiB/s
> >>> >
> >>> > UCX (shared memory):
> >>> > 128KiB batches: 6500 MiB/s
> >>> > 2MiB batches:  13879 MiB/s
> >>> > 32MiB batches:  9045 MiB/s
> >>> >
> >>> > UCX (TCP):
> >>> > 128KiB batches: 1069 MiB/s
> >>> > 2MiB batches:   1735 MiB/s
> >>> > 32MiB batches:  1602 MiB/s
> >>> >
> >>> > For RPC (server/client on different machines): Two t3.xlarge (4 core,
> >>> > 16 thread) machines were used in AWS EC2. These have "up to" 5Gbps
> >>> > bandwidth. This isn't really a scenario where UCX is expected to
> >>> > shine, however, UCX performs comparably to gRPC here.
> >>> >
> >>> > gRPC:
> >>> > 128 KiB batches: 554 MiB/s
> >>> > 2 MiB batches:   575 MiB/s
> >>> >
> >>> > UCX:
> >>> > 128 KiB batches: 546 MiB/s
> >>> > 2 MiB batches:   567 MiB/s
> >>> >
> >>> > Raw test logs can be found here:
> >>> > https://gist.github.com/lidavidm/57d8a3cba46229e4d277ae0730939acc
> >>> >
> >>> > For IPC, the shared memory results are promising in that it could be
> >>> > feasible to expose a library purely over Flight without worrying
> about
> >>> > FFI bindings. Also, it seems results are roughly comparable to what
> >>> > Yibo observed in ARROW-15282 [1] meaning UCX 

Re: [DISCUSS] "Naming" the Arrow C++ execution engine subproject?

2022-03-29 Thread Gavin Ray
"Arrow Compute Engine" sounds quite nice to me, tbh
Agreeing with the points made above about ACE being difficult to google,
and AQE being a loaded term in query engines already.


On Tue, Mar 29, 2022 at 10:07 AM Andy Grove  wrote:

> Just my 2 cents on this. If you were to call it ACE, I would make the C
> stand for "Compute" rather than C++ since it is intended to be used from
> other languages, such as Python.
>
> The problem with ACE is that it is a common word and it will make it hard to
> Google for documentation. Even the combination of Arrow and ACE already has
> plenty of results.
>
> Also, I saw in the linked doc a reference to AQE (for Arrow Query Engine).
> I would not recommend using this since many people know AQE as Adaptive
> Query Execution (especially Spark users).
>
> "Arrow Compute Engine" in full doesn't sound bad perhaps?
>
> With DataFusion, I made a list of words related to the project (data,
> query, compute, engine, etc) and then a list of completely unrelated words
> and then looked at the combinations to see what sounded good to me.
>
> Andy.
>
>
>
>
> On Mon, Mar 28, 2022 at 4:31 PM Antoine Pitrou  wrote:
>
> >
> > ACE is already the name of a well-known C++ library, though I'm not sure
> > how widely used it is nowadays :
> > http://www.dre.vanderbilt.edu/~schmidt/ACE.html
> >
> > I would name it "execution engine" or "Arrow C++ execution engine" in
> full.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > > On 29/03/2022 at 00:15, Wes McKinney wrote:
> > > hi all,
> > >
> > > There has been a steady stream of work over the last year and a half
> > > or so to create a set of query engine building blocks in C++ to
> > > evaluate queries against Arrow Datasets and input streams, which can
> > > be of use to applications that are already building on top of the
> > > Arrow C++ project. This effort has a smaller surface area than
> > > DataFusion since SQL parsing and query optimization are being left to
> > > other tools.
> > >
> > > I thought it would be useful to have a name for this subproject
> > > similar to how we have Gandiva, Plasma, DataFusion, and other named
> > > Apache Arrow subprojects. We had discussed creating a project like
> > > this a few years ago [1], but since there are now multiple
> > > Arrow-native or Arrow-compatible query engines in the wild, it would
> > > be helpful to disambiguate.
> > >
> > > One simple name is ACE — Arrow C++ Engine. I'm not very good at naming
> > > things, so if there are other suggestions from the community I would
> > > love to hear them!
> > >
> > > Thanks,
> > > Wes
> > >
> > > [1]:
> >
> https://docs.google.com/document/d/10RoUZmiMQRi_J1FcPeVAUAMJ6d_ZuiEbaM2Y33sNPu4/edit#heading=h.2k6k5a4y9b8y
> >
>


Re: [FlightSQL] Higher-level facade API to increase adoption/audience? Or does this belong as a personal project

2022-03-13 Thread Gavin Ray
FWIW, I filed an RFC issue here, along with a prototype implementation and
sample usage + console output code:

https://github.com/apache/arrow/issues/12618

On Sun, Mar 13, 2022 at 10:43 AM Gavin Ray  wrote:

> Generally, the preferred pattern is one VectorSchemaRoot that
>> gets reloaded each time.  So an API like "df.loadVectorSchemaRoot(root)"
>> probably makes more sense but we can iterate on this.
>>
>
> Could you expand on what exactly you mean by this?
>
> Still a bit blurry on the best-practices behind sending
> the Arrow response in Flight and seems like an important point.
>
>
> ... creating a new contrib module that maps
>> from java objects (just like there are JDBC and Avro ones) seems
>> worthwhile.  If you are interested in contributing something like this I
>> think a short design doc would be worth-while.
>>
>
> Where would be the best place to post this?
>
> I was thinking about GitHub issues but I am GitHub-centric,
> not sure if JIRA or mailing list would be better.
>
> Thanks, Micah!
>
>
> On Sun, Mar 13, 2022 at 12:46 AM Micah Kornfield 
> wrote:
>
>> Hi Gavin,
>>
>> > Just curious whether there is any interest/intention of possibly making
>> a
>> > higher level API around the basic FlightSQL one?
>>
>>
>> IIUC, I don't think this is an issue with Flight but one with generic
>> conversion between data into Arrow.  I don't think anyone is actively
>> working on something like this, but creating a new contrib module that
>> maps
>> from java objects (just like there are JDBC and Avro ones) seems
>> worthwhile.  If you are interested in contributing something like this I
>> think a short design doc would be worth-while.
>>
>> VectorSchemaRoot root = df.toVectorSchemaRoot();
>> > listener.setVectorSchemaRoot(root);
>> > listener.sendVectorSchemaRootContents();
>>
>>
>> A small nit.  Generally, the preferred pattern is one VectorSchemaRoot
>> that
>> gets reloaded each time.  So an API like "df.loadVectorSchemaRoot(root)"
>> probably makes more sense but we can iterate on this.  This wasn't
>> commonly
>> understood when some of the other contrib modules were developed.
>>
>> Cheers,
>> Micah
>>
>>
>> On Sat, Mar 12, 2022 at 12:15 PM Gavin Ray  wrote:
>>
>> > While trying to implement and introduce the idea of adopting FlightSQL,
>> the
>> > largest challenge was the API itself
>> >
>> > I know it's meant to be low-level. But I found that most of the
>> development
>> > time was in code to convert to/from
>> > row-based data (i.e. Map<String, Object>) and Java types, and columnar data +
>> > Arrow types.
>> >
>> > I'm likely in the minority position here -- I know that Arrow and
>> FlightSQL
>> > users are largely looking at transferring large volumes of data and
>> > servicing OLAP-type workloads
>> > But the thing that excites me most about FlightSQL, isn't its
>> performance
>> > (always nice to have), but that it's a language-agnostic standard for
>> data
>> > access.
>> >
>> > That has broad implications -- for all kinds of data-access workloads
>> and
>> > business usecases.
>> >
>> > The challenge is that in trying to advocate for it, when presenting a
>> > proof-of-concept,
>> > rather than what a developer might expect to see, something like:
>> >
>> > // FlightSQL handler code
>> > List<Map<String, Object>> results = new ArrayList<>();
>> > results.add(Map.of("id", 1, "name", "Person 1"));
>> > return results;
>> >
>> > A significant portion of the code is in Arrow-specific implementation
>> > details:
>> > creating a VectorSchemaRoot, FieldVector, de-serializing the results on
>> the
>> > client, etc.
>> >
>> > Just curious whether there is any interest/intention of possibly making
>> a
>> > higher level API around the basic FlightSQL one?
>> > Maybe something closer to the traditional notion of a row-based
>> "DataFrame"
>> > or "Table", like:
>> >
>> > DataFrame df = new DataFrame();
>> > df.addColumn("id", ArrowTypes.Int);
>> > df.addColumn("name", ArrowTypes.VarChar);
>> > df.addRow(Map.of("id", 1, "name", "Person 1"));
>> > VectorSchemaRoot root = df.toVectorSchemaRoot();
>> > listener.setVectorSchemaRoot(root);
>> > listener.sendVectorSchemaRootContents();
>> >
>>
>
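The conversion work this thread describes hiding behind a facade — pivoting row-based records into columnar form — is easy to sketch in plain Python; the dict-of-lists output below stands in for populating FieldVectors in a VectorSchemaRoot (the names and shape are illustrative, not a proposed API):

```python
def rows_to_columns(rows):
    """Pivot row-based records (list of dicts) into columnar form
    (dict of lists) -- the shape of work a FlightSQL facade would hide.

    Missing keys become None, mirroring a nullable Arrow column.
    """
    if not rows:
        return {}
    columns = {name: [] for name in rows[0]}
    for row in rows:
        for name in columns:
            columns[name].append(row.get(name))
    return columns


rows = [
    {"id": 1, "name": "Person 1"},
    {"id": 2, "name": "Person 2"},
]
assert rows_to_columns(rows) == {"id": [1, 2], "name": ["Person 1", "Person 2"]}
```

A real facade would additionally need a schema (column name to Arrow type) and would write into Arrow vectors rather than Python lists, but the row-to-column pivot is the part developers currently hand-write.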


Re: [FlightSQL] Higher-level facade API to increase adoption/audience? Or does this belong as a personal project

2022-03-13 Thread Gavin Ray
>
> Generally, the preferred pattern is one VectorSchemaRoot that
> gets reloaded each time.  So an API like "df.loadVectorSchemaRoot(root)"
> probably makes more sense but we can iterate on this.
>

Could you expand on what exactly you mean by this?

Still a bit blurry on the best-practices behind sending
the Arrow response in Flight and seems like an important point.


... creating a new contrib module that maps
> from java objects (just like there are JDBC and Avro ones) seems
> worthwhile.  If you are interested in contributing something like this I
> think a short design doc would be worth-while.
>

Where would be the best place to post this?

I was thinking about GitHub issues but I am GitHub-centric,
not sure if JIRA or mailing list would be better.

Thanks, Micah!


On Sun, Mar 13, 2022 at 12:46 AM Micah Kornfield 
wrote:

> Hi Gavin,
>
> > Just curious whether there is any interest/intention of possibly making a
> > higher level API around the basic FlightSQL one?
>
>
> IIUC, I don't think this is an issue with Flight but one with generic
> conversion between data into Arrow.  I don't think anyone is actively
> working on something like this, but creating a new contrib module that maps
> from java objects (just like there are JDBC and Avro ones) seems
> worthwhile.  If you are interested in contributing something like this I
> think a short design doc would be worth-while.
>
> VectorSchemaRoot root = df.toVectorSchemaRoot();
> > listener.setVectorSchemaRoot(root);
> > listener.sendVectorSchemaRootContents();
>
>
> A small nit.  Generally, the preferred pattern is one VectorSchemaRoot that
> gets reloaded each time.  So an API like "df.loadVectorSchemaRoot(root)"
> probably makes more sense but we can iterate on this.  This wasn't commonly
> understood when some of the other contrib modules were developed.
>
> Cheers,
> Micah
>
>
> On Sat, Mar 12, 2022 at 12:15 PM Gavin Ray  wrote:
>
> > While trying to implement and introduce the idea of adopting FlightSQL,
> the
> > largest challenge was the API itself
> >
> > I know it's meant to be low-level. But I found that most of the
> development
> > time was in code to convert to/from
> > row-based data (i.e. Map<String, Object>) and Java types, and columnar data +
> > Arrow types.
> >
> > I'm likely in the minority position here -- I know that Arrow and
> FlightSQL
> > users are largely looking at transferring large volumes of data and
> > servicing OLAP-type workloads
> > But the thing that excites me most about FlightSQL, isn't its performance
> > (always nice to have), but that it's a language-agnostic standard for
> data
> > access.
> >
> > That has broad implications -- for all kinds of data-access workloads and
> > business usecases.
> >
> > The challenge is that in trying to advocate for it, when presenting a
> > proof-of-concept,
> > rather than what a developer might expect to see, something like:
> >
> > // FlightSQL handler code
> > List<Map<String, Object>> results = new ArrayList<>();
> > results.add(Map.of("id", 1, "name", "Person 1"));
> > return results;
> >
> > A significant portion of the code is in Arrow-specific implementation
> > details:
> > creating a VectorSchemaRoot, FieldVector, de-serializing the results on
> the
> > client, etc.
> >
> > Just curious whether there is any interest/intention of possibly making a
> > higher level API around the basic FlightSQL one?
> > Maybe something closer to the traditional notion of a row-based
> "DataFrame"
> > or "Table", like:
> >
> > DataFrame df = new DataFrame();
> > df.addColumn("id", ArrowTypes.Int);
> > df.addColumn("name", ArrowTypes.VarChar);
> > df.addRow(Map.of("id", 1, "name", "Person 1"));
> > VectorSchemaRoot root = df.toVectorSchemaRoot();
> > listener.setVectorSchemaRoot(root);
> > listener.sendVectorSchemaRootContents();
> >
>


[FlightSQL] Higher-level facade API to increase adoption/audience? Or does this belong as a personal project

2022-03-12 Thread Gavin Ray
While trying to implement and introduce the idea of adopting FlightSQL, the
largest challenge was the API itself

I know it's meant to be low-level. But I found that most of the development
time was in code to convert to/from
row-based data (IE Map<String, Object>) and Java types, and columnar data +
Arrow types.
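That row-to-column step is largely independent of Arrow itself; stripped of the vector APIs it is just a transpose. A minimal stdlib-only sketch of the conversion being described (the class and method names here are illustrative, not part of any Arrow API):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RowsToColumns {
    // Transpose row-oriented records (a List of Maps) into column-oriented
    // lists keyed by column name -- the shape that Arrow vectors are filled from.
    public static Map<String, List<Object>> transpose(List<Map<String, Object>> rows) {
        Map<String, List<Object>> columns = new LinkedHashMap<>();
        for (Map<String, Object> row : rows) {
            for (Map.Entry<String, Object> cell : row.entrySet()) {
                columns.computeIfAbsent(cell.getKey(), k -> new ArrayList<>())
                       .add(cell.getValue());
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = List.of(
                Map.of("id", 1, "name", "Person 1"),
                Map.of("id", 2, "name", "Person 2"));
        Map<String, List<Object>> columns = transpose(rows);
        System.out.println(columns.get("id"));   // [1, 2]
        System.out.println(columns.get("name")); // [Person 1, Person 2]
    }
}
```

The remaining (Arrow-specific) work is only filling a FieldVector per column from these lists.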

I'm likely in the minority position here -- I know that Arrow and FlightSQL
users are largely looking at transferring large volumes of data and
servicing OLAP-type workloads
But the thing that excites me most about FlightSQL, isn't its performance
(always nice to have), but that it's a language-agnostic standard for data
access.

That has broad implications -- for all kinds of data-access workloads and
business usecases.

The challenge is that in trying to advocate for it, when presenting a
proof-of-concept,
rather than what a developer might expect to see, something like:

// FlightSQL handler code
List<Map<String, Object>> results = new ArrayList<>();
results.add(Map.of("id", 1, "name", "Person 1"));
return results;

A significant portion of the code is in Arrow-specific implementation
details:
creating a VectorSchemaRoot, FieldVector, de-serializing the results on the
client, etc.

Just curious whether there is any interest/intention of possibly making a
higher level API around the basic FlightSQL one?
Maybe something closer to the traditional notion of a row-based "DataFrame"
or "Table", like:

DataFrame df = new DataFrame();
df.addColumn("id", ArrowTypes.Int);
df.addColumn("name", ArrowTypes.VarChar);
df.addRow(Map.of("id", 1, "name", "Person 1"));
VectorSchemaRoot root = df.toVectorSchemaRoot();
listener.setVectorSchemaRoot(root);
listener.sendVectorSchemaRootContents();
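For what it's worth, the facade surface sketched above can be prototyped without touching Arrow at all; only a final toVectorSchemaRoot() step would be Arrow-specific. A rough stdlib-only sketch (this DataFrame is hypothetical, as in the message above, and not an existing Arrow class):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical row-oriented facade, as proposed above. A real version would
// add a toVectorSchemaRoot() that maps each column onto an Arrow FieldVector.
public class DataFrame {
    private final Map<String, String> schema = new LinkedHashMap<>(); // column name -> type tag
    private final List<Map<String, Object>> rows = new ArrayList<>();

    public DataFrame addColumn(String name, String type) {
        schema.put(name, type);
        return this;
    }

    public DataFrame addRow(Map<String, Object> row) {
        // Reject rows whose keys don't match the declared columns.
        if (!row.keySet().equals(schema.keySet())) {
            throw new IllegalArgumentException("row does not match schema: " + row.keySet());
        }
        rows.add(row);
        return this;
    }

    public int rowCount() { return rows.size(); }

    public List<Map<String, Object>> rows() { return List.copyOf(rows); }

    public static void main(String[] args) {
        DataFrame df = new DataFrame()
                .addColumn("id", "Int")
                .addColumn("name", "VarChar");
        df.addRow(Map.of("id", 1, "name", "Person 1"));
        System.out.println(df.rowCount()); // 1
    }
}
```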


Re: Flight/FlightSQL Optimization for Small Results?

2022-03-08 Thread Gavin Ray
Thank you for doing this, left a few questions on the GH issue

I would adopt this proposal as soon as it makes it into nightlies
(or possibly earlier if it's just a matter of regenerating the proto
definitions)

The operation flow would be like this, or what would it look like?

Client ---> GetFlightInfo (query/update operation in payload) ---> Server
---> Results (non-streamed)




On Tue, Mar 8, 2022 at 2:04 PM Micah Kornfield 
wrote:

> Some people have already left comments on
> https://github.com/apache/arrow/pull/12571  More eyes on it would be
> appreciated.  If there aren't more comments, I'll try to start implementing
> this feature in Flight next week, and hopefully have a vote after it is
> supported in Java and C++/Python.
>
>
> Thanks,
> Micah
>
> On Fri, Mar 4, 2022 at 10:54 PM Micah Kornfield 
> wrote:
>
> > I put together straw-man proposal in PR [1] for the Flight changes.
> > Ultimately, it seemed based on the use-cases discussed inlining the data
> on
> > the Ticket made the most sense.  This might be overly complex (I'm not
> sure
> > how I feel about a enum indicating partial vs full results) but welcome
> > feedback.  Once we get consensus on this proposal, I can add changes to
> > Flight SQL and try to provide reference implementations.
> >
> > [1] https://github.com/apache/arrow/pull/12571
> >
> > On Tue, Mar 1, 2022 at 10:51 PM Micah Kornfield 
> > wrote:
> >
> >> Would it make sense to make this part of DoGet since it
> >>> still would be returning a record batch
> >>
> >> I would lean against this. I think in many cases the client doesn't know
> >> the size of the data that it expects.  Leaving the flexibility on the
> >> server side to send back inlined data when it thinks it makes sense, or
> a
> >> bunch of tickets when there is in fact a lot of data seems like the best
> >> option here.
> >>
> >> For cases like previewing data, you usually just want to get a small
> >>> amount
> >>> of data quickly.
> >>
> >> This is interesting and might be an additional use case.  If we did
> >> decide to extend FlightInfo we might also want a way of annotating
> inlined
> >> data with its corresponding ticket.  That way even for large results,
> you
> >> could still send back a small preview if desired.
> >>
> >> After considering it a little bit I think I'm sold that inlined data
> >> should not replace a ticket.  So in my mind the open question is whether
> >> the client needs to actively opt-in to inlined data.  The scenarios I
> could
> >> come with where inlined data isn't useful are:
> >> 1.  The client is an old client and isn't aware inline data might be
> >> returned.  In this case the main cost is of extra data on the wire and
> >> storing it as unknown fields [1].
> >> 2.  The client is a new client but still doesn't want to get inline data
> >> (it might want to distribute all consumption to other processes).  Same
> >> cost is paid as option 1.
> >>
> >> Are there other scenarios?  If servers choose reasonable limits on what
> >> data to inline, the extra complexity of negotiating with the client in
> this
> >> case might not be worth the benefits.
> >>
> >> Cheers,
> >> Micah
> >>
> >>
> >> [1] https://developers.google.com/protocol-buffers/docs/proto3#unknowns
> >>
> >> On Tue, Mar 1, 2022 at 10:01 PM Bryan Cutler  wrote:
> >>
> >>> I think this would be a useful feature and be nice to have in Flight
> >>> core.
> >>> For cases like previewing data, you usually just want to get a small
> >>> amount
> >>> of data quickly. Would it make sense to make this part of DoGet since
> it
> >>> still would be returning a record batch? Perhaps a Ticket could be made
> >>> to
> >>> have an optional FlightDescriptor that would serve as an all-in-one
> shot?
> >>>
> >>> On Tue, Mar 1, 2022 at 8:44 AM David Li  wrote:
> >>>
> >>> > I agree with something along Antoine's proposal, though: maybe we
> >>> should
> >>> > be more structured with the flags (akin to what Micah mentioned with
> >>> the
> >>> > Feature enum).
> >>> >
> >>> > Also, the flag could be embedded into the Flight SQL messages
> instead.
> >>> (So
> >>> > in effect, Flight would only add the capability to return data with
> >>> > FlightInfo, and it's up to applications, like Flight SQL, to decide
> how
> >>> > they want to take advantage of that.)
> >>> >
> >>> > I think having a completely separate method and return type and
> having
> >>> to
> >>> > poll for it beforehand somewhat defeats the purpose of having
> it/would
> >>> be
> >>> > much harder of a transition.
> >>> >
> >>> > Also: it should be `repeated FlightInfo inline_data` right? In case
> we
> >>> > also need dictionary batches?
> >>> >
> >>> > On Tue, Mar 1, 2022, at 11:39, Antoine Pitrou wrote:
> >>> > > Can we just add the following field to the FlightDescriptor
> message:
> >>> > >
> >>> > >   bool accept_inline_data = 4;
> >>> > >
> >>> > > and this one to the FlightInfo message:
> >>> > >
> >>> > >   FlightData inline_data = 100;
> >>> > >
> >>> > > 

Re: [Rust] DataFusion + Substrait

2022-03-07 Thread Gavin Ray
Incredibly exciting! Following along eagerly =)

On Mon, Mar 7, 2022 at 11:31 AM Andy Grove  wrote:

> I created a new repo in the datafusion-contrib GitHub org over the weekend
> with a starting point for supporting DataFusion as both a producer and
> consumer of Substrait plans.
>
> https://github.com/datafusion-contrib/datafusion-substrait
>
> I am hopeful that we can eventually use Substrait in Ballista as a
> replacement for the current query plan protobuf format, meaning that the
> Ballista scheduler could potentially be used with engines other than
> DataFusion.
>
> I also think it could be helpful with in-memory language interoperability,
> such as passing query plans between Python and Rust.
>
> I plan on continuing to merge my own PRs here as I flesh out more of this,
> at least until there are other contributors.
>
> Thanks,
>
> Andy.
>


Re: [FlightSQL] Non-gRPC interop (IE REST) possible with SerializeToString() [C++] / serialize() [Java]?

2022-03-07 Thread Gavin Ray
Ahh got it, perfectly clear now -- thank you!

On Mon, Mar 7, 2022 at 11:20 AM David Li  wrote:

> So "Flight" and "Flight SQL" are distinct projects. Flight defines RPC
> methods, and "Flight SQL" defines higher-level methods on top of the Flight
> methods. The optimization proposed is for Flight. Once/if that gets
> accepted and implemented, Flight SQL servers could then use it to optimize
> GetCatalogs: they would return a FlightInfo that has the data embedded. So
> yes, all the methods should get support for this once things get worked out.
>
> On Mon, Mar 7, 2022, at 10:57, Gavin Ray wrote:
> > Sure, will use that JIRA issue for whatever thoughts/feedback =)
> >
> > On that note, filed the above bug here:
> > https://issues.apache.org/jira/browse/ARROW-15861
> >
> > About the "two-step" thing, I guess what I mean is code like this
> > where you make the initial op, then get the stream:
> >
> > val catalogs: FlightInfo = client.getCatalogs()
> > val stream: FlightStream =
> > client.getStream(catalogs.endpoints[0].ticket)
> > while (stream.next()) {
> > stream.root.use { root -> println(root.contentToTSVString()) }
> > }
> >
> > You override two methods, "getFlightInfoCatalogs" and then
> > "getStreamCatalogs"
> > Maybe I misunderstood -- can you just return the data directly from IE
> > "getFlightInfoCatalogs"
> >
> > Ideally I'd love to be able to do something like:
> >
> > val catalogs = client.getCatalogs()
> > for (catalog in catalogs.rows) {}
> >
> > But maybe this is not really feasible/practical with how Arrow works as a
> > format or the architecture of Flight
> >
> > And RE: the JS implementation, TypeScript is my primary language so I'd
> > love to be useful there if I could =)
> > It's also much faster to prototype stuff in JS/TS due to lack of
> > compilation.
> >
> > On Mon, Mar 7, 2022 at 10:46 AM David Li  wrote:
> >
> >> (responses inline)
> >>
> >> On Mon, Mar 7, 2022, at 10:37, Gavin Ray wrote:
> >> >>
> >> >> Another contributor is currently working on some Java
> >> >> tutorials/documentation so any feedback would be helpful.
> >> >
> >> >
> >> > Ah, yeah this would be incredibly useful. Will compile some thoughts,
> >> where
> >> > should I share them?
> >> > Didn't know about the Cookbook, definitely going to be tonight's
> reading!
> >> >
> >>
> >> Would you mind putting them on the overall Jira?
> >> https://issues.apache.org/jira/browse/ARROW-15156
> >>
> >> If there's questions about the cookbook, or tasks where it's not clear
> how
> >> to accomplish them, you can file issues directly on the cookbook repo
> too.
> >>
> >> >
> >> > Ah, I suppose having the small-value optimization would mostly cover
> your
> >> >> needs then? And then grpc-web or a similar bridge should suffice for
> >> you.
> >> >
> >> >
> >> > Yeah 100%
> >> > Wanted to ask a question on this -- is there a possibility to add the
> >> > "one-shot" single-message RPCs for all operations?
> >> >
> >> > In my case it's mostly extra-overhead to send the first ticket, get a
> >> > statement handle, and then make a second call which streams the
> results
> >> > Would be awesome to have the ability to opt-in to one-shot messages
> for
> >> > both Metadata and Query operations
> >>
> >> Hmm, which other operations are you looking at? For instance, GetSchema
> >> takes a FlightDescriptor directly. It's really just DoGet that has that
> >> two-step structure.
> >>
> >> >
> >> > If you have details about the dependency issue, do you mind filing a
> Jira
> >> >> issue?
> >> >> Seems something might have changed and we should be prepared to fix
> it.
> >> >> (Flight/Java does a lot of poking at internal APIs to try to avoid
> >> copies.)
> >> >
> >> >
> >> > Absolutely, no problem. I'll revert my dep override and file an issue
> >> with
> >> > the stacktrace.
> >>
> >> Thanks!
> >>
> >> > ---
> >> > On a side note, I've started work on a Node.js implementation of
> Flight +
> >> > FlightSQL in the Arrow repo.
> >> >

Re: [FlightSQL] Non-gRPC interop (IE REST) possible with SerializeToString() [C++] / serialize() [Java]?

2022-03-07 Thread Gavin Ray
Sure, will use that JIRA issue for whatever thoughts/feedback =)

On that note, filed the above bug here:
https://issues.apache.org/jira/browse/ARROW-15861

About the "two-step" thing, I guess what I mean is code like this
where you make the initial op, then get the stream:

val catalogs: FlightInfo = client.getCatalogs()
val stream: FlightStream = client.getStream(catalogs.endpoints[0].ticket)
while (stream.next()) {
    stream.root.use { root -> println(root.contentToTSVString()) }
}

You override two methods, "getFlightInfoCatalogs" and then
"getStreamCatalogs"
Maybe I misunderstood -- can you just return the data directly from IE
"getFlightInfoCatalogs"

Ideally I'd love to be able to do something like:

val catalogs = client.getCatalogs()
for (catalog in catalogs.rows) {}

But maybe this is not really feasible/practical with how Arrow works as a
format or the architecture of Flight

And RE: the JS implementation, TypeScript is my primary language so I'd
love to be useful there if I could =)
It's also much faster to prototype stuff in JS/TS due to lack of
compilation.

On Mon, Mar 7, 2022 at 10:46 AM David Li  wrote:

> (responses inline)
>
> On Mon, Mar 7, 2022, at 10:37, Gavin Ray wrote:
> >>
> >> Another contributor is currently working on some Java
> >> tutorials/documentation so any feedback would be helpful.
> >
> >
> > Ah, yeah this would be incredibly useful. Will compile some thoughts,
> where
> > should I share them?
> > Didn't know about the Cookbook, definitely going to be tonight's reading!
> >
>
> Would you mind putting them on the overall Jira?
> https://issues.apache.org/jira/browse/ARROW-15156
>
> If there's questions about the cookbook, or tasks where it's not clear how
> to accomplish them, you can file issues directly on the cookbook repo too.
>
> >
> > Ah, I suppose having the small-value optimization would mostly cover your
> >> needs then? And then grpc-web or a similar bridge should suffice for
> you.
> >
> >
> > Yeah 100%
> > Wanted to ask a question on this -- is there a possibility to add the
> > "one-shot" single-message RPCs for all operations?
> >
> > In my case it's mostly extra-overhead to send the first ticket, get a
> > statement handle, and then make a second call which streams the results
> > Would be awesome to have the ability to opt-in to one-shot messages for
> > both Metadata and Query operations
>
> Hmm, which other operations are you looking at? For instance, GetSchema
> takes a FlightDescriptor directly. It's really just DoGet that has that
> two-step structure.
>
> >
> > If you have details about the dependency issue, do you mind filing a Jira
> >> issue?
> >> Seems something might have changed and we should be prepared to fix it.
> >> (Flight/Java does a lot of poking at internal APIs to try to avoid
> copies.)
> >
> >
> > Absolutely, no problem. I'll revert my dep override and file an issue
> with
> > the stacktrace.
>
> Thanks!
>
> > ---
> > On a side note, I've started work on a Node.js implementation of Flight +
> > FlightSQL in the Arrow repo.
> > Never worked with gRPC but hopefully I can get the majority of the work
> > finished and file a draft PR =)
>
> That will be interesting to see. I believe the Arrow JS implementation
> could use some more attention in general.
>
> >
> > https://gist.github.com/GavinRay97/876c8e8476b18c8eb01cb6e8f807bf28
> >
> > On Mon, Mar 7, 2022 at 9:55 AM David Li  wrote:
> >
> >> Cool - if you have API questions, feel free to send them here or
> >> u...@arrow.apache.org. Another contributor is currently working on some
> >> Java tutorials/documentation so any feedback would be helpful. There's
> also
> >> some basic recipes here: https://github.com/apache/arrow-cookbook/
> >>
> >> Ah, I suppose having the small-value optimization would mostly cover
> your
> >> needs then? And then grpc-web or a similar bridge should suffice for
> you.
> >>
> >> If you have details about the dependency issue, do you mind filing a
> Jira
> >> issue? Seems something might have changed and we should be prepared to
> fix
> >> it. (Flight/Java does a lot of poking at internal APIs to try to avoid
> >> copies.)
> >>
> >> Thanks,
> >> David
> >>
> >> On Mon, Mar 7, 2022, at 09:48, Gavin Ray wrote:
> >> > Ah brilliant! Yeah, Websockets (or anything that's a basic transport
> and
> >> > doesn't require a language-specific SDK) would

Re: [FlightSQL] Non-gRPC interop (IE REST) possible with SerializeToString() [C++] / serialize() [Java]?

2022-03-07 Thread Gavin Ray
>
> Another contributor is currently working on some Java
> tutorials/documentation so any feedback would be helpful.


Ah, yeah this would be incredibly useful. Will compile some thoughts, where
should I share them?
Didn't know about the Cookbook, definitely going to be tonight's reading!


Ah, I suppose having the small-value optimization would mostly cover your
> needs then? And then grpc-web or a similar bridge should suffice for you.


Yeah 100%
Wanted to ask a question on this -- is there a possibility to add the
"one-shot" single-message RPCs for all operations?

In my case it's mostly extra-overhead to send the first ticket, get a
statement handle, and then make a second call which streams the results
Would be awesome to have the ability to opt-in to one-shot messages for
both Metadata and Query operations

If you have details about the dependency issue, do you mind filing a Jira
> issue?
> Seems something might have changed and we should be prepared to fix it.
> (Flight/Java does a lot of poking at internal APIs to try to avoid copies.)


Absolutely, no problem. I'll revert my dep override and file an issue with
the stacktrace.
---
On a side note, I've started work on a Node.js implementation of Flight +
FlightSQL in the Arrow repo.
Never worked with gRPC but hopefully I can get the majority of the work
finished and file a draft PR =)

https://gist.github.com/GavinRay97/876c8e8476b18c8eb01cb6e8f807bf28

On Mon, Mar 7, 2022 at 9:55 AM David Li  wrote:

> Cool - if you have API questions, feel free to send them here or
> u...@arrow.apache.org. Another contributor is currently working on some
> Java tutorials/documentation so any feedback would be helpful. There's also
> some basic recipes here: https://github.com/apache/arrow-cookbook/
>
> Ah, I suppose having the small-value optimization would mostly cover your
> needs then? And then grpc-web or a similar bridge should suffice for you.
>
> If you have details about the dependency issue, do you mind filing a Jira
> issue? Seems something might have changed and we should be prepared to fix
> it. (Flight/Java does a lot of poking at internal APIs to try to avoid
> copies.)
>
> Thanks,
> David
>
> On Mon, Mar 7, 2022, at 09:48, Gavin Ray wrote:
> > Ah brilliant! Yeah, Websockets (or anything that's a basic transport and
> > doesn't require a language-specific SDK) would be fantastic.
> >
> > In my case, streaming wouldn't be a requirement, at least not for some
> time
> > (more of a nice-to-have).
> > It'd be mostly OLTP-style workloads, with small response sizes
> (10-1,000kB).
> >
> > By the way -- wanted to thank yourself and the others from the mailing
> list
> > for all the help.
> > Last night I was able to get a basic FlightSQL server implementation
> > working based on the feedback I'd got here.
> >
> > Now the only challenge is not being familiar with the Arrow format +
> > APIs/working with vector-based data
> > Majority of the time was in trying to figure out how to translate JVM
> > arrays/objects into Arrow values.
> >
> > The one thing I did have to do is override dependencies due to a problem
> in
> > netty/grpc with an
> > incompatible constructor signature for "PooledByteBufAllocator"
> >
> > // workaround for bug with PooledByteBufAllocator
> > implementation("io.grpc", "grpc-netty").version {
> > strictly("1.44.1")
> > }
> > implementation("io.netty", "netty-all").version {
> > strictly("4.1.74.Final")
> > }
> > implementation("io.netty", "netty-codec").version {
> > strictly("4.1.74.Final")
> > }
> >
> > On Mon, Mar 7, 2022 at 9:39 AM David Li  wrote:
> >
> >> No worries about questions, it's always good to see how people are using
> >> Arrow.
> >>
> >> For tunneling Flight/gRPC over HTTP: this has been a long-standing
> >> question. I believe some people have had success with one of the various
> >> gRPC-HTTP proxies. In particular, I recall Deephaven has done this
> >> successfully (with some workaround for the lack of streaming methods).
> If
> >> Nate is around, maybe he can describe what they've done.
> >>
> >> There's also an ongoing effort to enable alternative transports in
> Flight
> >> [1], which would let us implement (say) a native WebSocket transport.
> >>
> >> For these methods specifically: they basically wrap Protobuf
> >> SerializeToString/ParseFromString so you could use them to try to
> implement
> >> your own protocol using HTTP, yes.
> >>
> >>

Re: [FlightSQL] "flightsql-kotlin" submodule for Kotlin protobuf/gRPC codegen?

2022-03-07 Thread Gavin Ray
> Is there a problem with generating those inside your own project?

No no, not at all -- just wasn't sure if it was something that would be
useful enough to be upstream.
Sounds like probably not, I will just add the gRPC/Protobuf plugin to my
gradle build
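For anyone following along, the build wiring is roughly the following — a sketch only; the plugin/artifact versions are illustrative and the exact Kotlin DSL syntax varies by protobuf-gradle-plugin version, so check its documentation:

```kotlin
// build.gradle.kts -- illustrative versions/coordinates
plugins {
    kotlin("jvm") version "1.6.10"
    id("com.google.protobuf") version "0.8.18"
}

dependencies {
    implementation("com.google.protobuf:protobuf-kotlin:3.19.4")
    implementation("io.grpc:grpc-kotlin-stub:1.2.1")
}

protobuf {
    protoc { artifact = "com.google.protobuf:protoc:3.19.4" }
    generateProtoTasks {
        all().forEach { task ->
            task.builtins { id("kotlin") } // enable the Kotlin codegen alongside Java
        }
    }
}
```

With the FlightSQL .proto files on the proto source path, this generates the Kotlin DSL builders shown below.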

On Mon, Mar 7, 2022 at 9:57 AM David Li  wrote:

> This would be just the generated Protobuf sources but with a Kotlin API?
> Is there a problem with generating those inside your own project? (At least
> in C++ we also try to hide the Protobuf messages, I suppose we can't quite
> do that in Java easily.)
>
> On Mon, Mar 7, 2022, at 09:13, Gavin Ray wrote:
> > I'm curious whether folks think it would be reasonable to upstream an
> > optional Kotlin submodule that uses the Kotlin code generator for
> FlightSQL?
> >
> > Or would this be better off as a personal repository?
> >
> > The Rust FlightSQL API is a fair bit nicer due to the syntax.
> > The Kotlin Protobuf plugin produces an API very similar:
> > https://github.com/wangfenjin/arrow-datafusion/pull/1/files#diff-d942c264020a5d47b87deaca1b1064b53f3819a8f90764fad8fa3c2b9ccf6225R82-R92
> >
> > This module would allow writing the below:
> >
> > DiceSeries series = DiceSeries.newBuilder()
> > .addRoll(DiceRoll.newBuilder()
> > .setValue(5))
> > .addRoll(DiceRoll.newBuilder()
> > .setValue(20)
> > .setNickname("critical hit"))
> > .build()
> >
> > As:
> >
> > val series = diceSeries {
> >   rolls = listOf(
> > diceRoll { value = 5 },
> > diceRoll {
> >   value = 20
> >   nickname = "critical hit"
> > }
> >   )
> > }
>


Re: [FlightSQL] Non-gRPC interop (IE REST) possible with SerializeToString() [C++] / serialize() [Java]?

2022-03-07 Thread Gavin Ray
Ah brilliant! Yeah, Websockets (or anything that's a basic transport and
doesn't require a language-specific SDK) would be fantastic.

In my case, streaming wouldn't be a requirement, at least not for some time
(more of a nice-to-have).
It'd be mostly OLTP-style workloads, with small response sizes (10-1,000kB).

By the way -- wanted to thank yourself and the others from the mailing list
for all the help.
Last night I was able to get a basic FlightSQL server implementation
working based on the feedback I'd got here.

Now the only challenge is not being familiar with the Arrow format +
APIs/working with vector-based data
Majority of the time was in trying to figure out how to translate JVM
arrays/objects into Arrow values.

The one thing I did have to do is override dependencies due to a problem in
netty/grpc with an
incompatible constructor signature for "PooledByteBufAllocator"

// workaround for bug with PooledByteBufAllocator
implementation("io.grpc", "grpc-netty").version {
    strictly("1.44.1")
}
implementation("io.netty", "netty-all").version {
    strictly("4.1.74.Final")
}
implementation("io.netty", "netty-codec").version {
    strictly("4.1.74.Final")
}

On Mon, Mar 7, 2022 at 9:39 AM David Li  wrote:

> No worries about questions, it's always good to see how people are using
> Arrow.
>
> For tunneling Flight/gRPC over HTTP: this has been a long-standing
> question. I believe some people have had success with one of the various
> gRPC-HTTP proxies. In particular, I recall Deephaven has done this
> successfully (with some workaround for the lack of streaming methods). If
> Nate is around, maybe he can describe what they've done.
>
> There's also an ongoing effort to enable alternative transports in Flight
> [1], which would let us implement (say) a native WebSocket transport.
>
> For these methods specifically: they basically wrap Protobuf
> SerializeToString/ParseFromString so you could use them to try to implement
> your own protocol using HTTP, yes.
>
> [1]: https://github.com/apache/arrow/pull/12465
>
> -David
>
> On Mon, Mar 7, 2022, at 09:24, Gavin Ray wrote:
> > Due to the current implementation status of FlightSQL (C++/Rust/JVM only)
> >
> > I am trying to see whether it's possible to allow FlightSQL over
> something
> > like HTTP/REST so that arbitrary languages can be used.
> >
> > In the codebase, I saw these (and their deserialize counterparts):
> >
> >   /// \brief Get the wire-format representation of this type.
> >   /// Useful when interoperating with non-Flight systems (e.g. REST
> >   /// services) that may want to return Flight types.
> >   arrow::Result<std::string> SerializeToString() const;
> >
> >   /**
> >* Get the serialized form of this protocol message.
> >* Intended to help interoperability by allowing non-Flight services
> > to still return Flight types.
> >*/
> >   public ByteBuffer serialize() {
> > return ByteBuffer.wrap(toProtocol().toByteArray());
> >   }
> >
> > I know this is probably very low-priority at the moment, but just wanted
> to
> > ask about whether it's even possible.
> > Thank you, and sorry for spamming the mailing list with so many questions
> > lately =)
>


[FlightSQL] Non-gRPC interop (IE REST) possible with SerializeToString() [C++] / serialize() [Java]?

2022-03-07 Thread Gavin Ray
Due to the current implementation status of FlightSQL (C++/Rust/JVM only)

I am trying to see whether it's possible to allow FlightSQL over something
like HTTP/REST so that arbitrary languages can be used.

In the codebase, I saw these (and their deserialize counterparts):

  /// \brief Get the wire-format representation of this type.
  /// Useful when interoperating with non-Flight systems (e.g. REST
  /// services) that may want to return Flight types.
  arrow::Result<std::string> SerializeToString() const;

  /**
   * Get the serialized form of this protocol message.
   * Intended to help interoperability by allowing non-Flight services
to still return Flight types.
   */
  public ByteBuffer serialize() {
return ByteBuffer.wrap(toProtocol().toByteArray());
  }

I know this is probably very low-priority at the moment, but just wanted to
ask about whether it's even possible.
Thank you, and sorry for spamming the mailing list with so many questions
lately =)
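In principle it is: both methods produce plain Protobuf wire bytes, so any HTTP client can carry them. A hedged client-side sketch — the endpoint URL and payload here are placeholders; there is no standardized Flight-over-HTTP protocol, so the server side would be a custom service that parses the bytes back into Flight types:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FlightOverHttpSketch {
    public static HttpRequest buildRequest(byte[] serializedFlightType) {
        // POST the Protobuf wire bytes (e.g. the output of serialize() /
        // SerializeToString()) to a hypothetical REST endpoint.
        return HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/flight/get-flight-info")) // placeholder
                .header("Content-Type", "application/x-protobuf")
                .POST(HttpRequest.BodyPublishers.ofByteArray(serializedFlightType))
                .build();
    }

    public static void main(String[] args) {
        byte[] payload = new byte[] {1, 2, 3}; // stand-in for real serialized bytes
        HttpRequest req = buildRequest(payload);
        System.out.println(req.method() + " " + req.uri());
        // A real client would then send it:
        // HttpResponse<byte[]> resp = HttpClient.newHttpClient()
        //         .send(req, HttpResponse.BodyHandlers.ofByteArray());
    }
}
```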


[FlightSQL] "flightsql-kotlin" submodule for Kotlin protobuf/gRPC codegen?

2022-03-07 Thread Gavin Ray
I'm curious whether folks think it would be reasonable to upstream an
optional Kotlin submodule that uses the Kotlin code generator for FlightSQL?

Or would this be better off as a personal repository?

The Rust FlightSQL API is a fair bit nicer due to the syntax.
The Kotlin Protobuf plugin produces an API very similar:
https://github.com/wangfenjin/arrow-datafusion/pull/1/files#diff-d942c264020a5d47b87deaca1b1064b53f3819a8f90764fad8fa3c2b9ccf6225R82-R92

This module would allow writing the below:

DiceSeries series = DiceSeries.newBuilder()
    .addRoll(DiceRoll.newBuilder()
        .setValue(5))
    .addRoll(DiceRoll.newBuilder()
        .setValue(20)
        .setNickname("critical hit"))
    .build();

As:

val series = diceSeries {
  rolls = listOf(
diceRoll { value = 5 },
diceRoll {
  value = 20
  nickname = "critical hit"
}
  )
}


Re: [FlightSQL] Structured/Serialized representation of query (like JSON) rather than SQL string possible?

2022-03-06 Thread Gavin Ray
Got it, thank you David!
I started prototyping the implementation last night, hopefully I will make
some good progress and have something basic functioning soon.

RE: The metadata thing -- I think both Calcite and Teiid have solid
interfaces for defining what capabilities a datasource has.
https://github.com/teiid/teiid/blob/8e9057a46be009d68b2d67701781f1f8c175baa7/api/src/main/java/org/teiid/translator/ExecutionFactory.java#L349-L1528

It's probably not possible to make something universal, but it seems like
you could get pretty close to most common functionality/capabilities


On Sat, Mar 5, 2022 at 11:48 PM Kyle Porter 
wrote:

> Yes, we should, where possible, avoid any one of metadata. This is where
> other standards fail in that applications must be custom built for each
> data source, if we standardize the metadata then applications can at least
> be built to adapt.
>
> On Sat., Mar. 5, 2022, 6:54 p.m. David Li,  wrote:
>
> > Yes, GetSqlInfo reserves a range of metadata IDs for Flight SQL's use, so
> > the application can use others for its own purposes. That said if they
> seem
> > commonly applicable maybe we should try to standardize them.
> >
> > I think what you are doing should be reasonable. You may not need _all_
> of
> > the capabilities in Flight SQL for this (e.g. all the various metadata
> > calls, or prepared statements, perhaps) but I don't see why it wouldn't
> > work for you.
> >
> > On Fri, Mar 4, 2022, at 19:03, Gavin Ray wrote:
> > > To touch on the question about supported features -- is it possible to
> > > advertise arbitrary/custom "capabilites" in GetSqlInfo?
> > > Say that you want to represent some set of behaviors that FlightSQL
> > > services can support.
> > >
> > > Stuff like "Supports grouping by multiple distinct aggregates",
> "Supports
> > > self-joins on aliased tables" etc
> > > This is going to be unique to each implementation, but I couldn't
> > determine
> > > whether there was a way to express arbitrary capabilities
> > >
> > > Also, in case it's helpful I put together an ASCII diagram of what I'm
> > > trying to do with FlightSQL
> > > If anyone has a moment, would appreciate input on whether it's
> feasible/a
> > > good idea
> > >
> > > https://pastebin.com/raw/VF2r0F3f
> > >
> > > Thank you =)
> > >
> > >
> > > On Fri, Mar 4, 2022 at 2:37 PM David Li  wrote:
> > >
> > >> We could also add say CommandSubstraitQuery as a distinct message, and
> > >> older servers would just reject it as an unknown request type.
> > >>
> > >> -David
> > >>
> > >> On Fri, Mar 4, 2022, at 17:01, Micah Kornfield wrote:
> > >> >>
> > >> >> 1. How does a server report that it supports each command type?
> > Initial
> > >> >> thought is a property in GetSqlInfo.
> > >> >
> > >> >
> > >> > This sounds reasonable.
> > >> >
> > >> >
> > >> >> What happens to client code written prior to changing the command
> > type
> > >> >> to be a oneOf field? Same for servers.
> > >> >
> > >> >
> > >> > It is transparent from older clients (I'm 99% sure the wire protocol
> > >> > doesn't change).  Servers is a little harder.  The one saving grace
> > is I
> > >> > don't think an empty/not-present SQL string would be something most
> > >> servers
> > >> > could handle, so they would probably error with something that while
> > >> > not-obvious would give a clue to the clients (but hopefully this
> would
> > >> be a
> > >> > non-issue because the capabilities would be checked for clients
> > wishing
> > >> to
> > >> > to use this feature first).
> > >> >
> > >> > -Micah
> > >> >
> > >> > On Fri, Mar 4, 2022 at 1:50 PM James Duong  > >> .invalid>
> > >> > wrote:
> > >> >
> > >> >> It sounds like an interesting and useful project to use Subtstrait
> > as an
> > >> >> alternative to SQL strings.
> > >> >>
> > >> >> Important aspects to spec out are:
> > >> >> 1. How does a server report that it supports each command type?
> > Initial
> > >> >> thought is a property in GetSqlInfo.
> > >> >> 2. What happens to c

Re: Is 7.0.0 release missing the Java arrow-flight POM?

2022-03-06 Thread Gavin Ray
Hey all,

I wanted to start prototyping a project with FlightSQL, so I have written a
script to extract the artifacts from the nightlies
and published the assets from 03/03 on my personal GitHub.

You can use this repo as a Gradle/Maven repository if you want, while we
wait for the next release containing the FlightSQL POM.
Instructions for use are in the README:

https://github.com/GavinRay97/arrow-nightlies-repo

Hope this helps =)

On Thu, Mar 3, 2022 at 9:23 PM Kun Liu  wrote:

> Hi all,
>
> We also meet this issue:
>
> "Failed to read artifact descriptor for
> org.apache.arrow:flight-core:jar:7.0.0: Could not find artifact
> org.apache.arrow:arrow-flight:pom:7.0.0"
>
> and we work around this issue by adding the file to the local maven repo.
>
> From this ticket: https://issues.apache.org/jira/browse/ARROW-15746, I see
> that this issue will be fixed in the `7.0.1` or `8.0.0` and I want to know
> when will the `7.0.1` be released?
>
> Thanks,
> Kun
>
>
>
> Bryan Cutler  wrote on Tue, Feb 22, 2022 at 01:32:
>
> > Thanks Kou, sounds like if it's in that list then the upload should work.
> > It would be good if we could make that not so brittle, but in the
> meantime
> > I'll open a pr to add the pom. Made
> > https://issues.apache.org/jira/browse/ARROW-15746 to track.
> >
> > On Sat, Feb 19, 2022 at 9:39 PM Sutou Kouhei  wrote:
> >
> > > Hi,
> > >
> > > I found that "dev/release/04-binary-download.sh 7.0.0 10
> > > --task-filter 'java-jars'" doesn't download
> > > arrow-flight*.pom.
> > >
> > > I think that we need to add arrow-flight*.pom to
> > > https://github.com/apache/arrow/blob/master/dev/tasks/tasks.yml#L761
> > > .
> > >
> > >
> > > Thanks,
> > > --
> > > kou
> > >
> > > In <20220220.142345.1095495044811966896@clear-code.com>
> > >   "Re: Is 7.0.0 release missing the Java arrow-flight POM?" on Sun, 20
> > Feb
> > > 2022 14:23:45 +0900 (JST),
> > >   Sutou Kouhei  wrote:
> > >
> > > > Hi,
> > > >
> > > > I tried "dev/release/06-java-upload.sh 7.0.0 10" and upload
> > > > the log to
> > > > https://gist.github.com/kou/b6d8aa2b9420baa086a7cf0763a9bf37
> > > > . It seems that arrow-flight isn't uploaded...
> > > >
> > > > You can see uploaded files at
> > > > https://repository.apache.org/#stagingRepositories with your
> > > > ASF account.
> > > > Note that you MUST not press the "Close" button! I'll remove
> > > > them by pressing "Drop" button when we fix this.
> > > >
> > > >
> > > > Thanks,
> > > > --
> > > > kou
> > > >
> > > > In  13ckywgw7...@mail.gmail.com
> > >
> > > >   "Re: Is 7.0.0 release missing the Java arrow-flight POM?" on Fri,
> 18
> > > Feb 2022 12:51:59 -0800,
> > > >   Bryan Cutler  wrote:
> > > >
> > > >> I wasn't able to able to run the entire process, so I downloaded a
> few
> > > >> artifacts from the nightly java-jars and pointed the script there to
> > see
> > > >> the output:
> > > >>
> > > >> dev/release/06-java-upload.sh 7.0.0 10
> > > >>
> > > >> deploy:deploy-file -Durl=
> > > >> https://repository.apache.org/service/local/staging/deploy/maven2
> > > >> -DrepositoryId=apache.releases.https
> > > >> -DpomFile=./java-jars/arrow-flight-8.0.0.dev82.pom
> > > >> -Dfile=./java-jars/arrow-flight-8.0.0.dev82.pom -Dfiles= -Dtypes=
> > > >> -Dclassifiers=
> > > >>
> > > >> deploy:deploy-file -Durl=
> > > >> https://repository.apache.org/service/local/staging/deploy/maven2
> > > >> -DrepositoryId=apache.releases.https
> > > >> -DpomFile=./java-jars/arrow-java-root-8.0.0.dev82.pom
> > > >> -Dfile=./java-jars/arrow-java-root-8.0.0.dev82.pom -Dfiles= -Dtypes=
> > > >> -Dclassifiers=
> > > >>
> > > >> deploy:deploy-file -Durl=
> > > >> https://repository.apache.org/service/local/staging/deploy/maven2
> > > >> -DrepositoryId=apache.releases.https
> > > >> -DpomFile=./java-jars/flight-core-8.0.0.dev82.pom
> > > >> -Dfile=./java-jars/flight-core-8.0.0.dev82.jar -Dfiles= -Dtypes=
> > > >> -Dclassifiers=
> > > >>
> > > >> Based on that, it looks like the maven command is correct, it should
> > > deploy
> > > >> it just like the arrow-java-root pom. Is there any way to see the
> log
> > > >> output for when the release artifacts were deployed?
> > > >>
> > > >> On Thu, Feb 17, 2022 at 10:06 PM Bryan Cutler 
> > > wrote:
> > > >>
> > > >>> Sure, I'll take a look at the script.
> > > >>>
> > > >>> On Thu, Feb 17, 2022 at 4:39 PM Sutou Kouhei 
> > > wrote:
> > > >>>
> > >  Hi,
> > > 
> > >  Ah, arrow-flight-*.pom exists on our CI artifacts:
> > > 
> > > 
> > >
> >
> https://github.com/ursacomputing/crossbow/releases/tag/nightly-2022-02-17-0-github-java-jars
> > > 
> > >  I don't know why our upload script
> > > 
> > >
> >
> https://github.com/apache/arrow/blob/master/dev/release/06-java-upload.sh
> > >  doesn't upload it...
> > > 
> > >  Could you take a look at it?
> > > 
> > > 
> > >  Thanks,
> > >  --
> > >  kou
> > > 
> > >  In <
> > > cabr4zata2bwatbhoic_enfwemzpynvhyys895+5rjxpcgur...@mail.gmail.com>
> > >    "Re: Is 7.0.0 release 

Re: [FlightSQL] Structured/Serialized representation of query (like JSON) rather than SQL string possible?

2022-03-04 Thread Gavin Ray
To touch on the question about supported features -- is it possible to
advertise arbitrary/custom "capabilities" in GetSqlInfo?
Say that you want to represent some set of behaviors that FlightSQL
services can support.

Stuff like "Supports grouping by multiple distinct aggregates", "Supports
self-joins on aliased tables" etc
This is going to be unique to each implementation, but I couldn't determine
whether there was a way to express arbitrary capabilities

Also, in case it's helpful I put together an ASCII diagram of what I'm
trying to do with FlightSQL
If anyone has a moment, would appreciate input on whether it's feasible/a
good idea

https://pastebin.com/raw/VF2r0F3f

Thank you =)
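For what it's worth, the kind of capability advertisement described above could, in principle, ride on GetSqlInfo's key/value results. Here is a minimal sketch of the idea; all numeric IDs and the reserved-range cutoff below are invented for illustration (the real reserved SqlInfo IDs are defined in the Flight SQL protobuf), and real code would use Flight SQL's actual types rather than plain dicts:

```python
# Sketch of advertising app-specific capabilities alongside standard
# GetSqlInfo metadata. Everything numeric here is invented: Flight SQL
# reserves a range of SqlInfo IDs for the spec, and an application can
# publish its own entries under IDs outside that range.

SQL_INFO_RESERVED_MAX = 9_999           # hypothetical cutoff, not the real one
APP_INFO_BASE = 10_000                  # app-defined IDs start here

# App-specific capability IDs (made up for this example).
SUPPORTS_MULTI_DISTINCT_AGG = APP_INFO_BASE + 1
SUPPORTS_ALIASED_SELF_JOIN = APP_INFO_BASE + 2

def build_sql_info():
    """Server side: merge spec-defined and custom entries into one map."""
    standard = {1: "ExampleDB", 2: "0.1.0"}   # e.g. server name / version
    custom = {
        SUPPORTS_MULTI_DISTINCT_AGG: True,
        SUPPORTS_ALIASED_SELF_JOIN: False,
    }
    assert all(k <= SQL_INFO_RESERVED_MAX for k in standard)
    assert all(k > SQL_INFO_RESERVED_MAX for k in custom)
    return {**standard, **custom}

def supports(info, capability_id):
    """Client side: an absent ID (e.g. an older server) means 'unsupported'."""
    return bool(info.get(capability_id, False))

info = build_sql_info()
print(supports(info, SUPPORTS_MULTI_DISTINCT_AGG))  # True
print(supports(info, SUPPORTS_ALIASED_SELF_JOIN))   # False
```

The nice property is that older clients simply never ask for the custom IDs, and older servers simply never return them, so "absent" degrades to "unsupported".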


On Fri, Mar 4, 2022 at 2:37 PM David Li  wrote:

> We could also add say CommandSubstraitQuery as a distinct message, and
> older servers would just reject it as an unknown request type.
>
> -David
>
> On Fri, Mar 4, 2022, at 17:01, Micah Kornfield wrote:
> >>
> >> 1. How does a server report that it supports each command type? Initial
> >> thought is a property in GetSqlInfo.
> >
> >
> > This sounds reasonable.
> >
> >
> >> What happens to client code written prior to changing the command type
> >> to be a oneOf field? Same for servers.
> >
> >
> > It is transparent from older clients (I'm 99% sure the wire protocol
> > doesn't change).  Servers is a little harder.  The one saving grace is I
> > don't think an empty/not-present SQL string would be something most
> servers
> > could handle, so they would probably error with something that while
> > not-obvious would give a clue to the clients (but hopefully this would
> be a
> > non-issue because the capabilities would be checked for clients wishing
> to
> > to use this feature first).
> >
> > -Micah
> >
> > On Fri, Mar 4, 2022 at 1:50 PM James Duong  .invalid>
> > wrote:
> >
> >> It sounds like an interesting and useful project to use Subtstrait as an
> >> alternative to SQL strings.
> >>
> >> Important aspects to spec out are:
> >> 1. How does a server report that it supports each command type? Initial
> >> thought is a property in GetSqlInfo.
> >> 2. What happens to client code written prior to changing the command
> type
> >> to be a oneOf field? Same for servers.
> >> More generally, how should backward compatibility work, and what should
> >> happen if a client sends an unsupported
> >> command type to a server.
> >> 3. Should inputs to catalog RPC calls also accept Substrait structures?
> >>
> >> On Thu, Mar 3, 2022 at 11:00 PM Gavin Ray 
> wrote:
> >>
> >> > @James Duong 
> >> >
> >> > You are absolutely right, I realized this and confirmed whether this
> >> > would be possible with Jacques to double-check.
> >> > It would amount to what I might call "dollar-store Substrait." It's
> not
> >> > elegant or a good solution, but definitely presents a good duct-tape
> hack
> >> > and is a crafty idea.
> >> >
> >> > I agree with Jacques -- when you think about FlightSQL, what you are
> >> > attempting with a query isn't necessarily SQL, but a general
> data-compute
> >> > operation.
> >> > SQL just so happens to be a fairly universal way to express them,
> with an
> >> > ANSI standard, but FlightSQL doesn't recognize any particular subset
> of
> >> it
> >> > and for all intents and purposes it doesn't matter what the operation
> >> > string contains.
> >> >
> >> > Substrait would make a fantastic logical next-feature because it's
> >> > targeted as a specification for expressing relational algebra and
> >> > data-compute operations
> >> > This more-or-less equates to SQL strings (in my mind at least) with a
> >> much
> >> > better toolkit and Dev UX. If there is anything I can do to help move
> >> this
> >> > forward, please let me know because I am extremely motivated to do so.
> >> >
> >> > @David Li 
> >> >
> >> > Also agreed. Substrait is put together by folks much smarter than
> myself,
> >> > and if I had to hedge my bets, I'd put money on it being the future of
> >> > data-compute interop.
> >> > I would love nothing more than to adopt this technology and push it
> >> along.
> >> >
> >> > Your project does sound interesting - basically, it sounds like a
> tabular
> >

Re: [FlightSQL] Structured/Serialized representation of query (like JSON) rather than SQL string possible?

2022-03-03 Thread Gavin Ray
@James Duong 

You are absolutely right, I realized this and confirmed whether this
would be possible with Jacques to double-check.
It would amount to what I might call "dollar-store Substrait." It's not
elegant or a good solution, but definitely presents a good duct-tape hack
and is a crafty idea.

I agree with Jacques -- when you think about FlightSQL, what you are
attempting with a query isn't necessarily SQL, but a general data-compute
operation.
SQL just so happens to be a fairly universal way to express them, with an
ANSI standard, but FlightSQL doesn't recognize any particular subset of it
and for all intents and purposes it doesn't matter what the operation
string contains.

Substrait would make a fantastic logical next-feature because it's targeted
as a specification for expressing relational algebra and data-compute
operations
This more-or-less equates to SQL strings (in my mind at least) with a much
better toolkit and Dev UX. If there is anything I can do to help move this
forward, please let me know because I am extremely motivated to do so.

@David Li 

Also agreed. Substrait is put together by folks much smarter than myself,
and if I had to hedge my bets, I'd put money on it being the future of
data-compute interop.
I would love nothing more than to adopt this technology and push it along.

> Your project does sound interesting - basically, it sounds like a tabular
> data storage service with query pushdown?
>

Yeah this is more or less the details of it (my personal email, with
discretion assumed, is always open)

Imagine an environment where each backend wants to advertise some kind of
schema/data catalog.

A central service then introspects these backends and dynamically
generates an API from the data catalogs/schemas, and requests get
proxied to the underlying backend service for each schema to actually be
executed.

In text, the flow would look something like:


                                             <-> Data Provider Backend 0
Client <-> Central Service <-> Generated API <-> Data Provider Backend 1
                                             <-> Data Provider Backend 2



On Thu, Mar 3, 2022 at 5:52 PM David Li  wrote:

> Gavin, thanks for sharing. I'm not so sure you'll find an alternative to
> Substrait, at least one that isn't even more nascent or one that's very
> tied to a particular language, so perhaps it might be better to get
> involved in Substrait and see if it suits your needs? Convincing a team to
> try something new can be hard, though, and it is somewhat of a moving
> target - but Flight SQL is in a similar spot, I think, as it's still
> getting enhancements.
>
> Your project does sound interesting - basically, it sounds like a tabular
> data storage service with query pushdown?
>
> On Thu, Mar 3, 2022, at 19:58, Jacques Nadeau wrote:
> > James, I agree that you could use JSON but that feels a bit hacky
> > (mis-use
> > of the paradigm). Instead, I'd really like to do something like David is
> > suggesting: support Substrait as an alternative to a SQL string.
> > Something like this:
> >
> https://github.com/jacques-n/arrow/commit/e22674fa882e77c2889cf95f69f6e3701db362bc
> >
> > It would be great if someone wanted to pick this up. It would be a nice
> > enhancement to FlightSQL (and provide a structured way to express
> > operations).
> >
> >
> >
> > On Thu, Mar 3, 2022 at 4:56 PM James Duong  .invalid>
> > wrote:
> >
> >> In the same way that you could write an ODBC driver that takes in text
> >> that's not SQL, you could write a Flight SQL server that takes in text
> >> that's JSON.
> >> Flight SQL doesn't parse the query, so you could create commands that
> are
> >> just JSON text.
> >>
> >> Is that the only bit you need, Gavin?
> >>
> >> On Thu, Mar 3, 2022 at 4:26 PM Gavin Ray  wrote:
> >>
> >> > I am enthusiastic about Substrait and have followed its progress
> eagerly
> >> > =D
> >> >
> >> > When I presented it as a tentative option, there were reservations
> >> because
> >> > of the project/spec being young and the functionality still being
> >> > fleshed out.
> >> > I think if I were having this conversation in say, 8-16 months, it
> would
> >> > have been an easy choice, no doubt.
> >> >
> >> > On a public mailing list (and I can share more details in private if
> >> you're
> >> > curious), the gist of it is this:
> >> >
> >> > Some well-defined/backed-by-mature tech solution for expressing data
> >> > compute operations between services would be a useful thing to have
> >> > (Especially if it's language-agnostic)
> >> >

Re: [FlightSQL] Structured/Serialized representation of query (like JSON) rather than SQL string possible?

2022-03-03 Thread Gavin Ray
I am enthusiastic about Substrait and have followed its progress eagerly =D

When I presented it as a tentative option, there were reservations because
of the project/spec being young and the functionality still being
fleshed out.
I think if I were having this conversation in say, 8-16 months, it would
have been an easy choice, no doubt.

On a public mailing list (and I can share more details in private if you're
curious), the gist of it is this:

Some well-defined/backed-by-mature tech solution for expressing data
compute operations between services would be a useful thing to have
(Especially if it's language-agnostic)

The goal is for an "implementing service" to have:
- An introspectable schema (IE, "describe yourself to me")
- A query/operation execution endpoint (IE: "perform this operation on your
data")

With FlightSQL this is possible, I believe, but it requires the operation to
be expressed as a SQL string, which isn't ideal.

Working with a programmatic, structured object that has the same semantics
as a SQL query would have (a "logical plan", or whatnot) would be a better
experience.
(Jacques is on to something here!)

This interface between services would be somewhat the equivalent of an
"SDK", so it would be nice to have a strongly-typed library for expressing
and building up query/data-compute ops.
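The two endpoints described above can be illustrated with a toy in-memory example. All class and field names here are invented, and a real system would speak Flight/FlightSQL rather than call Python methods; this only shows the introspect-then-execute shape:

```python
# Toy, in-memory illustration of the two endpoints described above:
# "describe yourself to me" and "perform this operation on your data".
# All names are invented; a real system would speak Flight/FlightSQL.

class DataProvider:
    def __init__(self, name, tables):
        self.name = name
        self._tables = tables               # {table_name: [column, ...]}

    def describe(self):
        """Introspectable schema endpoint."""
        return dict(self._tables)

    def execute(self, table, projection):
        """Query/operation execution endpoint (projection only, for brevity)."""
        cols = self._tables[table]
        return [c for c in projection if c in cols]

class CentralService:
    """Introspects every provider once, then routes operations by table."""
    def __init__(self, providers):
        self._routes = {}
        for p in providers:
            for table in p.describe():
                self._routes[table] = p

    def query(self, table, projection):
        return self._routes[table].execute(table, projection)

svc = CentralService([
    DataProvider("backend-0", {"user": ["id", "name"]}),
    DataProvider("backend-1", {"order": ["id", "total"]}),
])
print(svc.query("user", ["name"]))          # ['name'], served by backend-0
```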


On Thu, Mar 3, 2022 at 3:17 PM David Li  wrote:

> You probably want Substrait: https://substrait.io/
>
> Which is being worked on by several people, including Arrow community
> members.
>
> It might be interesting to generalize Flight SQL to include support for
> Substrait. I'm curious what your application, if you're able to share more.
>
> -David
>
> On Thu, Mar 3, 2022, at 18:05, Gavin Ray wrote:
> > Hiya,
> >
> > I am drafting a proposal for a way to enable services to express data
> > compute operations to each other.
> >
> > However I think it'll be difficult to get buy-in if the only
> representation
> > for queries is as SQL strings.
> >
> > Is there any kind of lower-level API that can be used to express
> operations?
> >
> > IE instead of "SELECT name FROM user"
> >
> > A structured representation like:
> > {
> >   "op": "query",
> >   "schema": "user",
> >   "project": ["name"]
> > }
> >
> > Or maybe this is a bad idea/doesn't make sense?
> >
> > Thank you =)
>


[FlightSQL] Structured/Serialized representation of query (like JSON) rather than SQL string possible?

2022-03-03 Thread Gavin Ray
Hiya,

I am drafting a proposal for a way to enable services to express data
compute operations to each other.

However, I think it'll be difficult to get buy-in if the only representation
for queries is as SQL strings.

Is there any kind of lower-level API that can be used to express operations?

IE instead of "SELECT name FROM user"

A structured representation like:
{
  "op": "query",
  "schema": "user",
  "project": ["name"]
}

Or maybe this is a bad idea/doesn't make sense?

Thank you =)
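For illustration, here is a toy sketch of how a server might consume such a structured representation by lowering it to a SQL string. The field names follow the JSON example above; identifier quoting and validation are omitted, and this is only one possible server-side strategy:

```python
import json

# Toy server-side lowering of the structured representation to a SQL
# string. Field names follow the JSON example above; identifier quoting
# and validation are omitted for brevity.

def to_sql(op):
    if op.get("op") != "query":
        raise ValueError("unsupported op: %r" % op.get("op"))
    projection = ", ".join(op["project"])
    return "SELECT %s FROM %s" % (projection, op["schema"])

request = json.loads('{"op": "query", "schema": "user", "project": ["name"]}')
print(to_sql(request))                      # SELECT name FROM user
```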


Re: Arrow sync call March 2 at 12:00 US/Eastern, 17:00 UTC

2022-03-02 Thread Gavin Ray
Ah got it, thank you! =)

On Wed, Mar 2, 2022 at 10:33 AM Micah Kornfield 
wrote:

> Hi Gavin,
> This was mostly discussing the current discussion on the e-mail thread
> titled "Flight/FlightSQL Optimization for Small Results?".  The main thing
> discussed was whether we think adding the complexity around negotiation is
> necessary.  The consensus was it probably depends on how big the results
> will be.  The other item discussed is whether the results should be inlined
> on a ticket instead of FlightInfo.  The main complexity there is how much
> custom parsing we want to do, which again depends on intended size of
> inlined data.
>
> -Micah
>
> On Wed, Mar 2, 2022 at 10:22 AM Gavin Ray  wrote:
>
>> Particularly curious about the small-results FlightSQL optimizations and
>> general FlightSQL developments, if there was anything anyone felt was worth
>> noting outside of the general outline.
>>
>> Thank you =)
>>
>> On Wed, Mar 2, 2022 at 10:12 AM Micah Kornfield 
>> wrote:
>>
>>> It was not.  Is there anything you would like more context on?
>>>
>>> On Wed, Mar 2, 2022 at 10:10 AM Gavin Ray  wrote:
>>>
>>> > Was this recorded by any chance? No worries if not.
>>> >
>>> > On Wed, Mar 2, 2022 at 9:58 AM Alessandro Molina <
>>> > alessan...@ursacomputing.com> wrote:
>>> >
>>> > > Attendees:
>>> > >
>>> > >
>>> > > Alessandro Molina
>>> > >
>>> > > Micah Kornfield
>>> > >
>>> > > David Li
>>> > >
>>> > > Joris Van Den Bossche
>>> > >
>>> > >
>>> > >
>>> > > Discussion:
>>> > >
>>> > >
>>> > > Flight SQL Optimization for Small Results
>>> > >
>>> > >  - Reference to
>>> > >
>>> > >
>>> >
>>> https://databricks.com/blog/2021/08/11/how-we-achieved-high-bandwidth-connectivity-with-bi-tools.html
>>> > >
>>> > >
>>> > >  - Building directly in Flight as flight will benefit from it too.
>>> > >
>>> > >
>>> > > Binary VS String KeyValue Metadata
>>> > >
>>> > >   - Resuscitate the discussion on ML -> [DISCUSS] Binary Values in
>>> Key
>>> > > value pairs
>>> > >
>>> > >   - It is technically a breaking change and older clients expect the
>>> > > previous format.
>>> > >
>>> > >
>>> > > GeoData JSON vs Binary
>>> > >
>>> > >   - Binary doesn’t seem very hard in the majority of cases
>>> > >
>>> > >   - base64 might add unnecessary overhead
>>> > >
>>> > >
>>> > > FlightSQL
>>> > >
>>> > >   - Lack of documentation, not referenced from the website
>>> > >
>>> > >   - Probably needing another community developer acting as champion
>>> for
>>> > the
>>> > > FlightSQL, David Li has been overseeing efforts, but the current
>>> > > contributions seem more focused on internal usage.
>>> > >
>>> > > On Wed, Mar 2, 2022 at 1:03 PM Ian Cook 
>>> wrote:
>>> > >
>>> > > > Hi all,
>>> > > >
>>> > > > Our biweekly sync call is today at 12:00 noon Eastern time.
>>> > > >
>>> > > > The Zoom meeting URL for this and other biweekly Arrow sync calls
>>> is:
>>> > > > https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
>>> > > >
>>> > > > Alternatively, enter this information into the Zoom website or app
>>> to
>>> > > > join the call:
>>> > > > Meeting ID: 876 4903 3008
>>> > > > Passcode: 958092
>>> > > >
>>> > > > Thanks,
>>> > > > Ian
>>> > > >
>>> > >
>>> >
>>>
>>


Re: Arrow sync call March 2 at 12:00 US/Eastern, 17:00 UTC

2022-03-02 Thread Gavin Ray
Particularly curious about the small-results FlightSQL optimizations and
general FlightSQL developments, if there was anything anyone felt was worth
noting outside of the general outline.

Thank you =)

On Wed, Mar 2, 2022 at 10:12 AM Micah Kornfield 
wrote:

> It was not.  Is there anything you would like more context on?
>
> On Wed, Mar 2, 2022 at 10:10 AM Gavin Ray  wrote:
>
> > Was this recorded by any chance? No worries if not.
> >
> > On Wed, Mar 2, 2022 at 9:58 AM Alessandro Molina <
> > alessan...@ursacomputing.com> wrote:
> >
> > > Attendees:
> > >
> > >
> > > Alessandro Molina
> > >
> > > Micah Kornfield
> > >
> > > David Li
> > >
> > > Joris Van Den Bossche
> > >
> > >
> > >
> > > Discussion:
> > >
> > >
> > > Flight SQL Optimization for Small Results
> > >
> > >  - Reference to
> > >
> > >
> >
> https://databricks.com/blog/2021/08/11/how-we-achieved-high-bandwidth-connectivity-with-bi-tools.html
> > >
> > >
> > >  - Building directly in Flight as flight will benefit from it too.
> > >
> > >
> > > Binary VS String KeyValue Metadata
> > >
> > >   - Resuscitate the discussion on ML -> [DISCUSS] Binary Values in Key
> > > value pairs
> > >
> > >   - It is technically a breaking change and older clients expect the
> > > previous format.
> > >
> > >
> > > GeoData JSON vs Binary
> > >
> > >   - Binary doesn’t seem very hard in the majority of cases
> > >
> > >   - base64 might add unnecessary overhead
> > >
> > >
> > > FlightSQL
> > >
> > >   - Lack of documentation, not referenced from the website
> > >
> > >   - Probably needing another community developer acting as champion for
> > the
> > > FlightSQL, David Li has been overseeing efforts, but the current
> > > contributions seem more focused on internal usage.
> > >
> > > On Wed, Mar 2, 2022 at 1:03 PM Ian Cook  wrote:
> > >
> > > > Hi all,
> > > >
> > > > Our biweekly sync call is today at 12:00 noon Eastern time.
> > > >
> > > > The Zoom meeting URL for this and other biweekly Arrow sync calls is:
> > > > https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
> > > >
> > > > Alternatively, enter this information into the Zoom website or app to
> > > > join the call:
> > > > Meeting ID: 876 4903 3008
> > > > Passcode: 958092
> > > >
> > > > Thanks,
> > > > Ian
> > > >
> > >
> >
>


Re: Arrow sync call March 2 at 12:00 US/Eastern, 17:00 UTC

2022-03-02 Thread Gavin Ray
Was this recorded by any chance? No worries if not.

On Wed, Mar 2, 2022 at 9:58 AM Alessandro Molina <
alessan...@ursacomputing.com> wrote:

> Attendees:
>
>
> Alessandro Molina
>
> Micah Kornfield
>
> David Li
>
> Joris Van Den Bossche
>
>
>
> Discussion:
>
>
> Flight SQL Optimization for Small Results
>
>  - Reference to
>
> https://databricks.com/blog/2021/08/11/how-we-achieved-high-bandwidth-connectivity-with-bi-tools.html
>
>
>  - Building directly in Flight as flight will benefit from it too.
>
>
> Binary VS String KeyValue Metadata
>
>   - Resuscitate the discussion on ML -> [DISCUSS] Binary Values in Key
> value pairs
>
>   - It is technically a breaking change and older clients expect the
> previous format.
>
>
> GeoData JSON vs Binary
>
>   - Binary doesn’t seem very hard in the majority of cases
>
>   - base64 might add unnecessary overhead
>
>
> FlightSQL
>
>   - Lack of documentation, not referenced from the website
>
>   - Probably needing another community developer acting as champion for the
> FlightSQL, David Li has been overseeing efforts, but the current
> contributions seem more focused on internal usage.
>
> On Wed, Mar 2, 2022 at 1:03 PM Ian Cook  wrote:
>
> > Hi all,
> >
> > Our biweekly sync call is today at 12:00 noon Eastern time.
> >
> > The Zoom meeting URL for this and other biweekly Arrow sync calls is:
> > https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
> >
> > Alternatively, enter this information into the Zoom website or app to
> > join the call:
> > Meeting ID: 876 4903 3008
> > Passcode: 958092
> >
> > Thanks,
> > Ian
> >
>


Re: [PROPOSAL] New Proposals for FlightSQL

2022-02-24 Thread Gavin Ray
My opinion isn't worth much, but any extra metadata, and the utility
methods/classes to work with it, are incredibly useful for tools that do
dynamic/programmatic generation of UIs or codegen.

Imagining a service that takes a FlightSQL connection and generates a web
UI for
CRUD dynamically using the schema/table/column metadata, or similar tools.
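A toy sketch of that kind of dynamic UI generation, assuming column metadata arrives as simple dicts; the metadata keys and widget names below are invented for illustration, not the actual ColumnMetadata fields proposed in the linked PRs:

```python
# Toy sketch of driving UI generation from column metadata. The metadata
# keys and widget names below are invented for illustration.

WIDGETS = {"INT": "number-input", "VARCHAR": "text-input", "BOOLEAN": "checkbox"}

def form_fields(columns):
    """Map column descriptions to form-field specs for a generated CRUD UI."""
    fields = []
    for col in columns:
        fields.append({
            "label": col["name"].replace("_", " ").title(),
            "widget": WIDGETS.get(col["type"], "text-input"),
            "required": not col.get("nullable", True),
        })
    return fields

cols = [
    {"name": "id", "type": "INT", "nullable": False},
    {"name": "display_name", "type": "VARCHAR"},
]
print(form_fields(cols))
```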

On Thu, Feb 24, 2022 at 2:06 PM José Almeida <
jose.alme...@simbioseventures.com> wrote:

> Hi all,
>
> My team and I are submitting some additions to the FlightSQL protocol.
>
> We are proposing to add a new method called GetXdbcTypeInfo which will
> have the responsibility to get all the information of the datatype
> related to the source [1].
> We are also proposing a ColumnMetadata return in operations such as
> Prepared Statement, GetTables. [2]
>
> We already received some feedback from David Li and Antonie Pitrou.
> However, we would like some more feedback so we can get these changes
> over the line.
>
> These changes will be crucial to the construction of the JDBC (that's
> already being
> developed[3]) and ODBC driver on top of the FlightSQL.
>
> [1] https://github.com/apache/arrow/pull/11982
> [2] https://github.com/apache/arrow/pull/11999
> [3] https://github.com/apache/arrow/pull/12254
>
> Thanks,
> Jose Almeida
>


[FlightSQL] Flight as a cross-language JDBC driver?

2022-02-22 Thread Gavin Ray
Hello all,

Perhaps a bit of a dumb question, but going through the Flight codebase I
noticed this:
https://github.com/apache/arrow/blob/5680d209fd870f99134e2d7299b47acd90fabb8e/java/flight/flight-sql/src/test/java/org/apache/arrow/flight/sql/example/FlightSqlExample.java#L182-L183

public class FlightSqlExample implements FlightSqlProducer, AutoCloseable {
  private static final String DATABASE_URI = "jdbc:derby:target/derbyDB";

I wanted to ask whether there was any reason someone couldn't abstract this
into a "FlightSqlJDBCServer" that took a JDBC connection URI and exposed a
FlightSQL server over it.

This would in theory give any language that has a FlightSQL client library
the ability to re-use existing JDBC drivers, albeit at a bit of a
performance cost, right?

Seems like it could be useful, but I thought there might be a reason why
someone hadn't done this (Chesterton's Fence)
Thank you =)

Gavin Ray.
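As a postscript, the delegation pattern described above can be sketched in miniature with Python's DB-API and sqlite3 standing in for JDBC and Derby. The class name is hypothetical, and the conversion of result sets into Arrow record batches that a real FlightSQL server would perform is elided:

```python
import sqlite3

# Miniature version of the bridge idea, with Python's DB-API and sqlite3
# standing in for JDBC and Derby. The class name is hypothetical; a real
# "FlightSqlJDBCServer" would also convert result sets into Arrow record
# batches and serve them over Flight, which is elided here.

class GenericSqlBridge:
    def __init__(self, database_uri):
        # Any DB-API driver could be swapped in here, just as any JDBC
        # driver could back the hypothetical server.
        self._conn = sqlite3.connect(database_uri)

    def execute(self, sql):
        cur = self._conn.execute(sql)
        if cur.description is None:          # DDL/DML: no result set
            self._conn.commit()
            return [], []
        columns = [d[0] for d in cur.description]
        return columns, cur.fetchall()

bridge = GenericSqlBridge(":memory:")
bridge.execute("CREATE TABLE t (x INT)")
bridge.execute("INSERT INTO t VALUES (1), (2)")
cols, rows = bridge.execute("SELECT x FROM t ORDER BY x")
print(cols, rows)                            # ['x'] [(1,), (2,)]
```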