Traffic control and cancel detect on flight do_get in python implementation

2023-07-06 Thread Wenbo Hu
Hi,
I'm using Arrow Flight to transfer data in a distributed system, but
the transfer is fast enough that both the client and the server run
out of memory.
For the do_put and do_exchange methods, the protocol provides a stream
metadata reader/writer that lets the client and server exchange control
messages alongside the data stream.
But do_get only returns a FlightDataStream, with no extra channel for
control messages between the two sides.
Also, the returned FlightDataStream is unaware of client cancellation,
whereas Java has a cancel callback.
My solution is to use do_exchange instead of do_get for client
downloads. Is there a better way to implement this?

-- 
-
Best Regards,
Wenbo Hu,
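
One way to picture traffic control over a bidirectional stream such as
do_exchange is a credit scheme: the client grants a fixed number of
credits, the server sends one batch per credit, and the client grants a
new credit only after it has consumed a batch. The sketch below models
that idea with plain Python queues rather than real Flight calls. The
names `server`, `client`, `run`, `window` and `limit` are hypothetical,
and it is only an assumption that in a real service the credit and
end-of-stream messages would travel as application metadata on the
do_exchange stream.

```python
import queue
import threading

END = object()  # sentinel: end of stream, or "client is done / cancelled"


def server(batches, control, data):
    """Send one batch per credit; stop when the client closes its side."""
    source = iter(batches)
    while True:
        msg = control.get()          # block until the client grants a credit
        if msg is END:               # client finished or cancelled: stop producing
            return
        batch = next(source, END)
        data.put(batch)              # at most `window` items are ever queued
        if batch is END:             # nothing left to send
            return


def client(control, data, window, limit=None):
    """Consume batches, keeping at most `window` credits outstanding."""
    received = []
    for _ in range(window):
        control.put(1)               # initial credits
    while True:
        batch = data.get()
        if batch is END:
            break
        received.append(batch)
        if limit is not None and len(received) >= limit:
            control.put(END)         # cancel: tell the server to stop
            break
        control.put(1)               # batch consumed, grant one more credit
    return received


def run(batches, window=2, limit=None):
    control = queue.Queue()
    data = queue.Queue(maxsize=window)   # stands in for bounded memory
    worker = threading.Thread(target=server, args=(batches, control, data))
    worker.start()
    received = client(control, data, window, limit)
    worker.join()
    return received
```

With `run(range(100), window=3, limit=4)` the client stops after four
batches and the server stops producing shortly after, which is the
cancel behaviour the question asks for. The window size bounds how much
data can be buffered between the two sides.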


Re: [ANNOUNCE] New Arrow committer: Kevin Gurney

2023-07-06 Thread Jacob Wujciak-Jens
Congratulations and welcome!

On Wed, Jul 5, 2023 at 10:49 PM Kevin Gurney  wrote:

> Thank you all for the kind words and warm welcome!
>
> I feel honored and excited to be part of this vibrant community! I look
> forward to continuing to collaborate with all of you!
>
> Best Regards,
>
> Kevin Gurney
> 
> From: Alenka Frim 
> Sent: Wednesday, July 5, 2023 12:22 AM
> To: dev@arrow.apache.org 
> Subject: Re: [ANNOUNCE] New Arrow committer: Kevin Gurney
>
> Congratulations!
>
> On Tue, Jul 4, 2023 at 9:41 PM Dewey Dunnington wrote:
>
> > Congrats!
> >
> > On Tue, Jul 4, 2023 at 2:08 PM Matt Topol wrote:
> > >
> > > Welcome!
> > >
> > > On Tue, Jul 4, 2023, 11:06 AM Joris Van den Bossche <
> > > jorisvandenboss...@gmail.com> wrote:
> > >
> > > > Congrats Kevin!
> > > >
> > > > On Tue, 4 Jul 2023 at 13:47, David Li  wrote:
> > > > >
> > > > > Welcome Kevin!
> > > > >
> > > > > On Tue, Jul 4, 2023, at 05:55, Raúl Cumplido wrote:
> > > > > > Congratulations Kevin!!!
> > > > > >
> > > > > > > On Tue, 4 Jul 2023 at 3:32, Weston Pace
> > > > > > > (<weston.p...@gmail.com>) wrote:
> > > > > >>
> > > > > >> Congratulations Kevin!
> > > > > >>
> > > > > > >> On Mon, Jul 3, 2023 at 5:18 PM Sutou Kouhei wrote:
> > > > > > >>
> > > > > > >> > On behalf of the Arrow PMC, I'm happy to announce that Kevin
> > > > > > >> > Gurney has accepted an invitation to become a committer on
> > > > > > >> > Apache Arrow. Welcome, and thank you for your contributions!
> > > > > >> >
> > > > > >> > --
> > > > > >> > kou
> > > > > >> >
> > > >
> >
>


Re: [VOTE][RUST][DataFusion] Release DataFusion Python Bindings 27.0.0 RC1

2023-07-06 Thread Andrew Lamb
+1 (binding)

Verified on x86 mac.

Thank you Andy for keeping these releases going



On Wed, Jul 5, 2023 at 1:13 PM Andy Grove  wrote:

> Hi,
>
> I would like to propose a release of Apache Arrow DataFusion Python
> Bindings, version 27.0.0.
>
> This release candidate is based on commit:
> 3f81513d6c5fd109bdf8c509f81c0a587924d354 [1]
> The proposed release tarball and signatures are hosted at [2].
> The changelog is located at [3].
> The Python wheels are located at [4].
>
> Please download, verify checksums and signatures, run the unit tests, and
> vote on the release. The vote will be open for at least 72 hours.
>
> Only votes from PMC members are binding, but all members of the community
> are encouraged to test the release and vote with "(non-binding)".
>
> The standard verification procedure is documented at
> https://github.com/apache/arrow-datafusion-python/blob/main/dev/release/README.md#verifying-release-candidates
>
> [ ] +1 Release this as Apache Arrow DataFusion Python 27.0.0
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow DataFusion Python 27.0.0
> because...
>
> Here is my vote:
>
> +1
>
> [1]: https://github.com/apache/arrow-datafusion-python/tree/3f81513d6c5fd109bdf8c509f81c0a587924d354
> [2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-datafusion-python-27.0.0-rc1
> [3]: https://github.com/apache/arrow-datafusion-python/blob/3f81513d6c5fd109bdf8c509f81c0a587924d354/CHANGELOG.md
> [4]: https://test.pypi.org/project/datafusion/27.0.0/
>


Re: Webassembly?

2023-07-06 Thread Tim Paine
I can help. We use emscripten-compiled Arrow for perspective
(https://github.com/finos/perspective), and we now compile perspective's
Python side for pyodide, so I have an interest in a fully functional
pyarrow/pandas in pyodide on an ongoing basis.

Tim Paine
tim.paine.nyc
908-721-1185

> On Jul 6, 2023, at 11:14, Antoine Pitrou  wrote:
> 
> 
> Hi Joe,
> 
> Thank you for working on that.
> 
> The one question I have is: are you willing to help us maintain Arrow C++ in
> the long term? The logic you're adding in
> https://github.com/apache/arrow/pull/35672 is quite delicate; also I don't
> think anyone among us is a WebAssembly expert, which means that we might
> break things unintentionally. So while it would be great to get Arrow C++ to
> work with WASM, a dedicated expert is needed to help maintain and debug WASM
> support in the future.
> 
> Regards
> 
> Antoine.
> 
> 
>> On 03/07/2023 at 17:29, Joe Marshall wrote:
>> Hi,
>> I'm a pyodide developer amongst other things (WebAssembly CPython
>> interpreter) and I've got some PRs in progress on Arrow relating to
>> WebAssembly support. I wondered if it might be worth discussing my broader
>> ideas for this on the list or at the biweekly development meeting?
>> So far I have 35176 in, which makes arrow run on a single thread. This is 
>> needed because in a lot of webassembly environments (browsers at least, 
>> pyodide), threading isn't available or is heavily constrained.
>> With that I've aimed to make it relatively transparent to users, so that 
>> things like datasets and acero mostly just work (but slower obviously). It's 
>> kind of fiddly in the arrow code but working, and means users can port 
>> things easily.
>> Once that is in, the plan is to submit a following pr that adds cmake 
>> presets for emscripten which can build the cpp libraries and pyarrow for 
>> pyodide. I've hacked this together in a build already, it's a bit fiddly and 
>> needs a load of tidying up, but I'm confident it can be done.
>> Essentially, I'm wanting to get this stuff in because pandas is moving
>> towards arrow as a pretty much required dependency, and webassembly is a
>> pandas platform, as well as being an official python platform, so it would
>> be great to get it working in pyodide without us needing to maintain a load
>> of patches. I guess it could also come in handy with various container
>> platforms that are moving to webassembly.
>> Basically I thought it's probably worth a bit of a heads up relating to 
>> this, as I know the bigger picture of things is often hard to see from just 
>> pull requests.
>> Thanks
>> Joe
>> This message and any attachment are intended solely for the addressee
>> and may contain confidential information. If you have received this
>> message in error, please contact the sender and delete the email and
>> attachment.
>> Any views or opinions expressed by the author of this email do not
>> necessarily reflect the views of the University of Nottingham. Email
>> communications with the University of Nottingham may be monitored
>> where permitted by law.


Re: [DISCUSS][Format] Draft implementation of string view array format

2023-07-06 Thread Benjamin Kietzman
@Andrew:

Restricting these arrays to a single buffer will severely decrease their
utility. Since the character data is stored in multiple character buffers,
writing a Utf8View array can proceed without resizing allocations,
which is a major overhead when writing Utf8 arrays. Furthermore, since the
character buffers have no restrictions on their size, it's straightforward
to reuse an existing buffer as a character buffer rather than always
allocating a new one. In the case of creating an array which shares a lot of
data with another (for example, appending some strings) we can reuse most of
the character buffers from the original. Finally, Utf8View is well adapted
for efficiently wrapping non-arrow string data for ingestion by a kernel,
even if the string data's full extent is not known ahead of time and is
spread across multiple non-contiguous buffers.
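
To make the multiple-buffer argument concrete, here is a minimal Python
sketch of a view builder. It assumes a 16-byte view holding a 4-byte
length followed by either 12 inline bytes or a 4-byte prefix, a 4-byte
buffer index and a 4-byte offset. The class name `ViewBuilder`, the
12-byte inline threshold, the unsigned little-endian fields and the
block size are illustrative assumptions, not details taken from the
draft.

```python
import struct

INLINE_MAX = 12  # assumed: strings of up to 12 bytes live inside the view


class ViewBuilder:
    """Toy builder: fixed-width views plus a list of character buffers."""

    def __init__(self, block_size=32):
        self.block_size = block_size
        self.views = []    # one 16-byte view per string
        self.buffers = []  # character buffers; full ones are never touched again

    def append(self, s):
        data = s.encode("utf-8")
        if len(data) <= INLINE_MAX:
            # length + inline bytes, zero padded to 16 bytes in total
            self.views.append(struct.pack("<I12s", len(data), data))
            return
        # Start a new block instead of growing (and copying) the old one.
        # A real implementation would preallocate each block; a bytearray
        # is used here only to keep the sketch short.
        if not self.buffers or len(self.buffers[-1]) + len(data) > self.block_size:
            self.buffers.append(bytearray())
        buf = self.buffers[-1]
        offset = len(buf)
        buf.extend(data)
        view = struct.pack("<I4sII", len(data), data[:4], len(self.buffers) - 1, offset)
        self.views.append(view)

    def get(self, i):
        view = self.views[i]
        (length,) = struct.unpack_from("<I", view)
        if length <= INLINE_MAX:
            return view[4:4 + length].decode("utf-8")
        index, offset = struct.unpack_from("<II", view, 8)
        return bytes(self.buffers[index][offset:offset + length]).decode("utf-8")
```

Appending a long string never moves bytes that were already written:
either it fits in the current block or a new block is started, and
earlier views keep pointing at the same buffer index and offset.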

@Raphael:

> branch on access

The branch-on-access is unavoidable since a primary feature of the Utf8View
format is keeping short strings inline in the fixed width portion of data.
It's worth noting that the inline prefix allows skipping the branch entirely
for common cases of comparison, for example when the strings to be compared
differ within the first 4 bytes.
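
The prefix fast path can be sketched as follows. This is a toy model
under the same assumed 16-byte layout (4-byte length, then 12 inline
bytes or a 4-byte prefix, buffer index and offset); `make_view`,
`view_bytes`, `views_equal` and the `stats` counter are hypothetical
helpers used only to show when the character buffers are touched.

```python
import struct

INLINE_MAX = 12  # assumed inline threshold


def make_view(data, buffers):
    """Encode bytes as a 16-byte view, storing long strings in `buffers`."""
    if len(data) <= INLINE_MAX:
        return struct.pack("<I12s", len(data), data)
    buffers.append(data)  # one buffer per long string, for simplicity
    return struct.pack("<I4sII", len(data), data[:4], len(buffers) - 1, 0)


def view_bytes(view, buffers):
    """Return the full string bytes a view refers to."""
    (length,) = struct.unpack_from("<I", view)
    if length <= INLINE_MAX:
        return view[4:4 + length]
    index, offset = struct.unpack_from("<II", view, 8)
    return buffers[index][offset:offset + length]


def views_equal(a, b, buffers, stats):
    # In both forms the first 8 bytes are the length and the first 4
    # bytes of the string, so a mismatch here needs no buffer access
    # and no inline/out-of-line branch.
    if a[:8] != b[:8]:
        stats["fast"] += 1
        return False
    (length,) = struct.unpack_from("<I", a)
    if length <= INLINE_MAX:
        return a[8:] == b[8:]  # inline tails are zero padded
    stats["slow"] += 1
    return view_bytes(a, buffers) == view_bytes(b, buffers)
```

Only strings that share both length and prefix fall through to the
slow path that dereferences a character buffer.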

In benchmarking (for example while building a hash table) I have not
observed that this branch overly pessimizes access. Although I can't
guarantee every Utf8View array will be more efficient than any Utf8 array,
it is certainly faster for many relevant cases. Specifically, sorting and
equality comparison benefit significantly from the prefix comparison fast
path, so I'd anticipate that multi-column sorting and aggregations would as
well. If there are any other benchmarks which would help to justify Utf8View
in your mind, I'd be happy to try writing them.

> UTF-8 validation for StringArray can be done very efficiently by first
> verifying the entire buffer, and then verifying the offsets correspond to
> the start of a UTF-8 codepoint

For non-inlined strings, the character buffers do always contain the entire
string's data and not just the last `len - 4` bytes. Thus the approach you
describe for validating an entire character buffer as UTF-8 then checking
offsets will be just as valid for Utf8View arrays as for Utf8 arrays.
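
A minimal Python sketch of that two-step validation, written for a
Utf8-style offsets array, is shown below. The function name
`validate_utf8_array` is hypothetical, and the sketch assumes the
strings cover the whole buffer; for Utf8View the same check would be
applied per character buffer, using each view's offset.

```python
def validate_utf8_array(data, offsets):
    """Validate the buffer once, then check that every offset is a boundary."""
    try:
        data.decode("utf-8")  # single pass over all character data
    except UnicodeDecodeError:
        return False
    for off in offsets:
        if not 0 <= off <= len(data):
            return False      # offset outside the buffer
        # Continuation bytes look like 0b10xxxxxx; a string may not start
        # (or end) in the middle of a multi-byte code point.
        if off < len(data) and (data[off] & 0xC0) == 0x80:
            return False
    return True
```

The per-offset check is a single byte test, so the cost is dominated by
the one pass over the buffer.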

> it does seem inconsistent to use unsigned types

It is indeed more typical for the arrow format to use signed integers for
offsets and other quantities. In this case there is prior art in other
engines with which we can remain compatible by using unsigned integers
instead. Since this is only a break with convention within the format and
shouldn't be difficult for any implementation to accommodate, I would argue
that it's worthwhile to avoid pushing change onto existing implementers.

> I presume that StringView will behave similarly to dictionaries in that
> the selection kernels will not recompute the underlying value buffers.

The Utf8View format itself is not prescriptive of selection operations on
the array; kernels are free to reuse character buffers (which produces an
implicit selection vector) or to recompute them. Furthermore, unlike an
explicit selection vector, a kernel may decide to copy and densify
dynamically if it detects that output is getting sparse or fragmented. It's
also worth noting that, unlike an explicit selection vector, a Utf8View array
(however sparse or fragmented) will still benefit from the prefix comparison
fast path.
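
As a rough illustration of that choice, the sketch below models views
as `(buffer index, offset, length)` tuples and ignores inlined strings.
The helper names and the `min_utilization` threshold are hypothetical;
the format does not prescribe any of this.

```python
def take(views, indices):
    """Select rows by copying views only; character buffers are shared."""
    return [views[i] for i in indices]


def utilization(views, buffers):
    """Fraction of character bytes referenced, assuming views do not overlap."""
    used = sum(length for _, _, length in views)
    total = sum(len(b) for b in buffers)
    return used / total if total else 1.0


def compact(views, buffers):
    """Copy the referenced bytes into one dense buffer and rewrite the views."""
    out = bytearray()
    new_views = []
    for index, offset, length in views:
        new_views.append((0, len(out), length))
        out.extend(buffers[index][offset:offset + length])
    return new_views, [bytes(out)]


def take_adaptive(views, buffers, indices, min_utilization=0.5):
    """Reuse buffers when the selection is dense, densify when it is sparse."""
    selected = take(views, indices)
    if utilization(selected, buffers) < min_utilization:
        return compact(selected, buffers)
    return selected, buffers
```

A selection that keeps most rows returns the original buffers
untouched, while a very selective one pays for a copy so that the
result does not pin large, mostly unused buffers.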

Sincerely,
Ben Kietzman

On Sun, Jul 2, 2023 at 8:01 AM Raphael Taylor-Davies wrote:

> > I would be interested in hearing some input from the Rust community.
>
>  A couple of thoughts:
>
> The variable number of buffers would definitely pose some challenges for
> the Rust implementation; the closest thing we currently have is possibly
> UnionArray, but even then the number of buffers is still determined
> statically by the DataType. I therefore also wonder about the possibility
> of always having a single backing buffer that stores the character data,
> including potentially a copy of the prefix. This would also avoid forcing a
> branch on access, which I would have expected to hurt performance for some
> kernels quite significantly.
>
> Whilst not really a concern for Rust, which supports unsigned types, it
> does seem inconsistent to use unsigned types where the rest of the format
> encourages the use of signed offsets, etc...
>
> It isn't clearly specified whether a null should have a valid set of
> offsets, etc... I think it is an important property of the current array
> layouts that, with the exception of dictionaries, the data in null slots is
> arbitrary, i.e. can take any value, but not undefined. This allows for
> separate handling of the null mask and values, which can be important for
> some kernels and APIs.
>
> More an observation than an issue, but UTF-8 validation for StringArray
> can be done very efficiently by first verifying the entire buffer, and then
> verifying the offsets correspond to the start of a UTF-8 codepoint. Thi

Re: Webassembly?

2023-07-06 Thread Antoine Pitrou



Hi Joe,

Thank you for working on that.

The one question I have is: are you willing to help us maintain Arrow
C++ in the long term? The logic you're adding in
https://github.com/apache/arrow/pull/35672 is quite delicate; also I
don't think anyone among us is a WebAssembly expert, which means that we
might break things unintentionally. So while it would be great to get Arrow
C++ to work with WASM, a dedicated expert is needed to help maintain and
debug WASM support in the future.


Regards

Antoine.


On 03/07/2023 at 17:29, Joe Marshall wrote:

Hi,

I'm a pyodide developer amongst other things (WebAssembly CPython interpreter)
and I've got some PRs in progress on Arrow relating to WebAssembly support. I
wondered if it might be worth discussing my broader ideas for this on the list
or at the biweekly development meeting?

So far I have 35176 in, which makes arrow run on a single thread. This is 
needed because in a lot of webassembly environments (browsers at least, 
pyodide), threading isn't available or is heavily constrained.

With that I've aimed to make it relatively transparent to users, so that things 
like datasets and acero mostly just work (but slower obviously). It's kind of 
fiddly in the arrow code but working, and means users can port things easily.

Once that is in, the plan is to submit a following pr that adds cmake presets 
for emscripten which can build the cpp libraries and pyarrow for pyodide. I've 
hacked this together in a build already, it's a bit fiddly and needs a load of 
tidying up, but I'm confident it can be done.

Essentially, I'm wanting to get this stuff in because pandas is moving towards
arrow as a pretty much required dependency, and webassembly is a pandas
platform, as well as being an official python platform, so it would be great to
get it working in pyodide without us needing to maintain a load of patches. I
guess it could also come in handy with various container platforms that are
moving to webassembly.

Basically I thought it's probably worth a bit of a heads up relating to this, 
as I know the bigger picture of things is often hard to see from just pull 
requests.

Thanks
Joe


