[DISCUSS][Rust][DataFusion][HiveMetaStore] Possible Metastore integration with Data Fusion

2023-10-03 Thread Kothapalli, Vamsi
Hi devs,

I would like to start a discussion about a possible integration of Hive
Metastore with DataFusion.

Along the lines of the Glue catalog integration discussed in
https://github.com/apache/arrow-datafusion/issues/2206, can anyone suggest
how the code should look if I or someone else would like to work on building
Hive Metastore support as a catalog provider feature in DataFusion?

Thanks,
Vamsi



quivr - a new library built on pyarrow

2023-10-03 Thread Spencer Nelson
Hi all - I'd like to share a library I've been working on for a few months
which is built on top of Arrow. It's called quivr (like a bundle of arrows)
and it could be thought of as tools to wrap up PyArrow Tables and extend
their capabilities.

I work on scientific software. A lot of the initial scientific work is done
in Jupyter notebooks with dataframes. When it's time to build larger
production systems on top of that work, the flexibility of dataframes
becomes a liability. It's hard to write structured code because dataframes
can be so variably typed and permissive.

But if you try to use normal tools for this (Python objects, lists,
dictionaries), you get crushed by performance issues. I wanted an
array-oriented framework, but with a more structured model than any
dataframe library out there.

So, quivr fills that need. You write a *Table* definition, which
corresponds closely to a pyarrow Table schema. You do that by writing a
Python class, with class attributes signaling the types and names of your
columns. And then you can attach methods to describe computation.

By using Arrow's struct types, Tables can be composed. You might have a
Table which defines a "Location" - and has sophisticated logic for that
purpose - and reuse that Location within other, higher-order tables. The
compositional approach has really been working extremely well so far in our
work.

I've written a little blog post describing the motivations and showing it
in use, and docs are up too. quivr is still in a pretty molten state, so I'm
very interested in any feedback or broader interest in this from anyone who
might find it useful. I'd love to work closer with the Arrow team as well -
I have a growing wishlist of features around PyArrow which I'd be interested
in working on.

Thanks,
Spencer


[RESULT] [VOTE] [Format] Add app_metadata to FlightInfo and FlightEndpoint

2023-10-03 Thread Matt Topol
@Antoine I'll update the comment in the .proto file to make it explicit
that there's no implied/inherent relationship required between the
app_metadata for a FlightInfo and the corresponding
FlightEndpoint/FlightData messages.

That gives 3 +1 votes, no -1 votes. So the vote passes. Thanks everyone!
I'll add the comment and then merge the PR.

--Matt

On Tue, Oct 3, 2023 at 12:17 PM Antoine Pitrou wrote:

>
> +1 from me. It might be worth spelling out whether any relationship is
> expected between the `app_metadata` for a FlightInfo and any of the
> corresponding `FlightEndpoint`s and `FlightData` chunks.
>
>
> On 12/09/2023 at 17:48, Matt Topol wrote:
> > Hey all,
> >
> > I would like to propose adding a new app_metadata field to both the
> > FlightInfo and FlightEndpoint message types of the Arrow Flight protocol.
> > There has been discussion of doing so for a while and has now been
> > brought back up in regards to [1]. More specifically, this enables adding
> > application defined metadata for FlightSQL (by way of FlightInfo) which
> > can then be utilized to pass information such as QueryID, QueryCost, etc.
> >
> > I've put up a PR to add this at [2].
> >
> > The vote will be open for at least 24 hours:
> >
> > [ ] +1 Add these fields to the Arrow Flight RPC protocol
> > [ ] +0
> > [ ] -1 Do not add these fields to the Arrow Flight RPC protocol because
> >
> > Thanks much!
> > --Matt
> >
> > [1]: https://github.com/apache/arrow/issues/37635
> > [2]: https://github.com/apache/arrow/pull/37679
> >
>


Re: [VOTE] [Format] Add app_metadata to FlightInfo and FlightEndpoint

2023-10-03 Thread Antoine Pitrou



+1 from me. It might be worth spelling out whether any relationship is 
expected between the `app_metadata` for a FlightInfo and any of the 
corresponding `FlightEndpoint`s and `FlightData` chunks.



On 12/09/2023 at 17:48, Matt Topol wrote:

> Hey all,
>
> I would like to propose adding a new app_metadata field to both the
> FlightInfo and FlightEndpoint message types of the Arrow Flight protocol.
> There has been discussion of doing so for a while and has now been brought
> back up in regards to [1]. More specifically, this enables adding
> application defined metadata for FlightSQL (by way of FlightInfo) which can
> then be utilized to pass information such as QueryID, QueryCost, etc.
>
> I've put up a PR to add this at [2].
>
> The vote will be open for at least 24 hours:
>
> [ ] +1 Add these fields to the Arrow Flight RPC protocol
> [ ] +0
> [ ] -1 Do not add these fields to the Arrow Flight RPC protocol because
>
> Thanks much!
> --Matt
>
> [1]: https://github.com/apache/arrow/issues/37635
> [2]: https://github.com/apache/arrow/pull/37679



Re: [VOTE] [Format] Add app_metadata to FlightInfo and FlightEndpoint

2023-10-03 Thread Sutou Kouhei
+1

I've reviewed it and fixed some test-related problems
directly.

Thanks,
-- 
kou

In "Re: [VOTE] [Format] Add app_metadata to FlightInfo and FlightEndpoint"
on Mon, 25 Sep 2023 10:17:52 -0400, "David Li" wrote:

> I see integration tests are added now, thanks Matt!
> 
> +1
> 
> On Thu, Sep 14, 2023, at 22:56, Sutou Kouhei wrote:
>> Hi,
>>
>> Could you add integration tests for this as David said at
>> https://github.com/apache/arrow/pull/37679#issuecomment-1720143047 ?
>>
>> See also:
>> https://arrow.apache.org/docs/dev/format/Changing.html
>>
>>
>> Thanks,
>> -- 
>> kou
>>
>> In "Re: [VOTE] [Format] Add app_metadata to FlightInfo and FlightEndpoint"
>> on Thu, 14 Sep 2023 15:55:14 -0400, Matt Topol wrote:
>>
>>> The PR has been updated for a bit with both C++ and Go implementations,
>>> hopefully I can get some more votes on this thread?
>>> 
>>> On Tue, Sep 12, 2023 at 12:16 PM Matt Topol wrote:
>>>
>>>> The C++ code gets auto-generated during build right? Ah, fair point the
>>>> C++ still uses its own objects. I'll update the PR with a C++
>>>> implementation.
>>>>
>>>> On Tue, Sep 12, 2023 at 12:03 PM David Li wrote:
>>>>
>>>>> Don't we need another implementation (if we count the Go codegen as one
>>>>> implementation)?
>>>>>
>>>>> On Tue, Sep 12, 2023, at 11:48, Matt Topol wrote:
>>>>> > Hey all,
>>>>> >
>>>>> > I would like to propose adding a new app_metadata field to both the
>>>>> > FlightInfo and FlightEndpoint message types of the Arrow Flight
>>>>> > protocol. There has been discussion of doing so for a while and has
>>>>> > now been brought back up in regards to [1]. More specifically, this
>>>>> > enables adding application defined metadata for FlightSQL (by way of
>>>>> > FlightInfo) which can then be utilized to pass information such as
>>>>> > QueryID, QueryCost, etc.
>>>>> >
>>>>> > I've put up a PR to add this at [2].
>>>>> >
>>>>> > The vote will be open for at least 24 hours:
>>>>> >
>>>>> > [ ] +1 Add these fields to the Arrow Flight RPC protocol
>>>>> > [ ] +0
>>>>> > [ ] -1 Do not add these fields to the Arrow Flight RPC protocol because
>>>>> >
>>>>> > Thanks much!
>>>>> > --Matt
>>>>> >
>>>>> > [1]: https://github.com/apache/arrow/issues/37635
>>>>> > [2]: https://github.com/apache/arrow/pull/37679



Re: [DISCUSS][C++] Raw pointer string views

2023-10-03 Thread Antoine Pitrou



On 03/10/2023 at 01:36, Matt Topol wrote:

> The cost of conversion is actually significantly higher than the actual
> overhead of simply accessing the values in either representation, leading
> to a high potential for bottleneck. For systems like Velox and DuckDB where
> it's important to be able to return results as fast as possible, if they
> have an operation with a throughput of several hundred MB/s or even GB/s,
> this conversion cost would become a huge bottleneck to returning results
> given several cases of converting Raw Pointer views to the offset-based
> views go as low as ~22MB/s.


I think you misread the benchmark numbers. It's 22 MItems/s, not 22 MB/s.
Since that number is for the kLongAndSeldomInlineable case, I assume the
MB/s figure would be two or three orders of magnitude higher.


Regards

Antoine.
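
As a rough illustration of the distinction drawn in the reply above (not taken from the benchmark itself): a rate in items/s becomes a rate in bytes/s only once it is multiplied by the average item size. The average string lengths below are assumed values, chosen to show how two to three orders of magnitude could arise; the actual lengths used in the kLongAndSeldomInlineable case are not stated in this thread.

```python
def items_to_mb_per_s(mitems_per_s, avg_bytes_per_item):
    """Convert millions of items per second to MB/s (1 MB = 10**6 bytes)."""
    # (mitems_per_s * 10**6 items/s) * (bytes/item) / (10**6 bytes/MB)
    return mitems_per_s * avg_bytes_per_item


# Assumed average string lengths in bytes; hypothetical, for illustration only.
for avg_len in (100, 1000):
    mb_per_s = items_to_mb_per_s(22, avg_len)
    print(f"{avg_len} bytes/item -> {mb_per_s} MB/s")
```

Under these assumed lengths, 22 MItems/s corresponds to roughly 2.2 to 22 GB/s rather than 22 MB/s.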