Traffic control and cancel detect on flight do_get in python implementation
Hi,

I'm using Arrow Flight to transfer data in a distributed system, but it is fast enough that both the client and the server run into out-of-memory issues. For the do_put and do_exchange methods, the protocol provides stream metadata readers/writers that let the client and server exchange control messages alongside the data stream. do_get, however, only returns a FlightDataStream, with no extra control channel the two sides can use to communicate with each other. Also, the returned FlightDataStream is unaware of client cancellation, whereas Java has a cancel callback.

My current solution is to use do_exchange instead of do_get when the client downloads data. Is there a better way to implement this?

--
Best Regards,
Wenbo Hu
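The flow-control idea behind the do_exchange workaround above can be sketched without any Flight code at all. The following is a minimal, transport-agnostic illustration of a credit (window) scheme: the sender keeps at most `window` unacknowledged batches outstanding, and the receiver returns an "ack" or a "cancel" on the control channel. The names `WindowedSender` and `download` and the `b"ack"` / `b"cancel"` messages are hypothetical and are not part of pyarrow; in a real service these messages would presumably travel over the stream metadata channel mentioned above.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, Optional, Tuple


class WindowedSender:
    """Hypothetical sender that allows at most `window` unacknowledged batches."""

    def __init__(self, batches: Iterable[bytes], window: int) -> None:
        self._batches: Iterator[bytes] = iter(batches)
        self._credits = window   # batches that may still be sent without an ack
        self.cancelled = False   # set once the receiver asks to stop

    def on_control(self, message: bytes) -> None:
        # Assumed control vocabulary: b"ack" returns one credit, b"cancel" stops.
        if message == b"ack":
            self._credits += 1
        elif message == b"cancel":
            self.cancelled = True

    def poll(self) -> List[bytes]:
        """Return the batches that may be sent right now (possibly none)."""
        out: List[bytes] = []
        while self._credits > 0 and not self.cancelled:
            batch = next(self._batches, None)
            if batch is None:
                break
            self._credits -= 1
            out.append(batch)
        return out


def download(
    batches: Iterable[bytes], window: int, stop_after: Optional[int] = None
) -> Tuple[List[bytes], int]:
    """Simulate a receiver; returns (received batches, peak number in flight)."""
    sender = WindowedSender(batches, window)
    in_flight: Deque[bytes] = deque(sender.poll())
    received: List[bytes] = []
    peak = 0
    while in_flight:
        peak = max(peak, len(in_flight))
        received.append(in_flight.popleft())      # "process" one batch
        if stop_after is not None and len(received) >= stop_after:
            sender.on_control(b"cancel")          # receiver cancels early
            break
        sender.on_control(b"ack")                 # hand one credit back
        in_flight.extend(sender.poll())
    return received, peak
```

With a window of 3, no more than three batches are buffered between the two sides at any time, which is what bounds memory on both ends. The cancel message gives the sender an explicit signal to stop producing data.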
Re: [ANNOUNCE] New Arrow committer: Kevin Gurney
Congratulations and welcome!

On Wed, Jul 5, 2023 at 10:49 PM Kevin Gurney wrote:

> Thank you all for the kind words and warm welcome!
>
> I feel honored and excited to be part of this vibrant community! I look
> forward to continuing to collaborate with all of you!
>
> Best Regards,
>
> Kevin Gurney
>
> From: Alenka Frim
> Sent: Wednesday, July 5, 2023 12:22 AM
> To: dev@arrow.apache.org
> Subject: Re: [ANNOUNCE] New Arrow committer: Kevin Gurney
>
> Congratulations!
>
> On Tue, Jul 4, 2023 at 9:41 PM Dewey Dunnington wrote:
>
> > Congrats!
> >
> > On Tue, Jul 4, 2023 at 2:08 PM Matt Topol wrote:
> > >
> > > Welcome!
> > >
> > > On Tue, Jul 4, 2023, 11:06 AM Joris Van den Bossche <
> > > jorisvandenboss...@gmail.com> wrote:
> > >
> > > > Congrats Kevin!
> > > >
> > > > On Tue, 4 Jul 2023 at 13:47, David Li wrote:
> > > > >
> > > > > Welcome Kevin!
> > > > >
> > > > > On Tue, Jul 4, 2023, at 05:55, Raúl Cumplido wrote:
> > > > > > Congratulations Kevin!!!
> > > > > >
> > > > > > On Tue, Jul 4, 2023 at 3:32, Weston Pace (<weston.p...@gmail.com>) wrote:
> > > > > >>
> > > > > >> Congratulations Kevin!
> > > > > >>
> > > > > >> On Mon, Jul 3, 2023 at 5:18 PM Sutou Kouhei wrote:
> > > > > >>
> > > > > >> > On behalf of the Arrow PMC, I'm happy to announce that Kevin Gurney
> > > > > >> > has accepted an invitation to become a committer on Apache
> > > > > >> > Arrow. Welcome, and thank you for your contributions!
> > > > > >> >
> > > > > >> > --
> > > > > >> > kou
Re: [VOTE][RUST][DataFusion] Release DataFusion Python Bindings 27.0.0 RC1
+1 (binding)

Verified on x86 mac.

Thank you Andy for keeping these releases going.

On Wed, Jul 5, 2023 at 1:13 PM Andy Grove wrote:

> Hi,
>
> I would like to propose a release of Apache Arrow DataFusion Python Bindings,
> version 27.0.0.
>
> This release candidate is based on commit:
> 3f81513d6c5fd109bdf8c509f81c0a587924d354 [1]
> The proposed release tarball and signatures are hosted at [2].
> The changelog is located at [3].
> The Python wheels are located at [4].
>
> Please download, verify checksums and signatures, run the unit tests, and vote
> on the release. The vote will be open for at least 72 hours.
>
> Only votes from PMC members are binding, but all members of the community are
> encouraged to test the release and vote with "(non-binding)".
>
> The standard verification procedure is documented at
> https://github.com/apache/arrow-datafusion-python/blob/main/dev/release/README.md#verifying-release-candidates
>
> [ ] +1 Release this as Apache Arrow DataFusion Python 27.0.0
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow DataFusion Python 27.0.0 because...
>
> Here is my vote:
>
> +1
>
> [1]: https://github.com/apache/arrow-datafusion-python/tree/3f81513d6c5fd109bdf8c509f81c0a587924d354
> [2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-datafusion-python-27.0.0-rc1
> [3]: https://github.com/apache/arrow-datafusion-python/blob/3f81513d6c5fd109bdf8c509f81c0a587924d354/CHANGELOG.md
> [4]: https://test.pypi.org/project/datafusion/27.0.0/
Re: Webassembly?
I can help. We use emscripten-compiled arrow for perspective (https://github.com/finos/perspective), and we now compile perspective's python side for pyodide, so I have an interest in a fully functional pyarrow/pandas in pyodide on an ongoing basis.

Tim Paine
tim.paine.nyc
908-721-1185

> On Jul 6, 2023, at 11:14, Antoine Pitrou wrote:
>
> Hi Joe,
>
> Thank you for working on that.
>
> The one question I have is: are you willing to help us maintain Arrow C++ on
> the long term? The logic you're adding in
> https://github.com/apache/arrow/pull/35672 is quite delicate; also I don't
> think anyone among us is a Webassembly expert, which means that we might
> break things unwillingly. So while it would be great to get Arrow C++ to work
> with WASM, a dedicated expert is needed to help maintain and debug WASM
> support in the future.
>
> Regards
>
> Antoine.
>
>> On 03/07/2023 at 17:29, Joe Marshall wrote:
>> Hi,
>> I'm a pyodide developer amongst other things (webassembly cpython
>> interpreter) and I've got some PRs in progress on arrow relating to
>> webassembly support. I wondered if it might be worth discussing my broader
>> ideas for this on the list or at the biweekly development meeting?
>>
>> So far I have 35176 in, which makes arrow run on a single thread. This is
>> needed because in a lot of webassembly environments (browsers at least,
>> pyodide), threading isn't available or is heavily constrained.
>>
>> With that I've aimed to make it relatively transparent to users, so that
>> things like datasets and acero mostly just work (but slower obviously). It's
>> kind of fiddly in the arrow code but working, and means users can port
>> things easily.
>>
>> Once that is in, the plan is to submit a following pr that adds cmake
>> presets for emscripten which can build the cpp libraries and pyarrow for
>> pyodide. I've hacked this together in a build already, it's a bit fiddly and
>> needs a load of tidying up, but I'm confident it can be done.
>>
>> Essentially, I'm wanting to get this stuff in because pandas is moving
>> towards arrow as a pretty much required dependency, and webassembly is a
>> pandas platform, as well as being an official python platform, so it would
>> be great to get it working in pyodide without us needing to maintain a load
>> of patches. I guess it could also come in handy with various container
>> platforms that are moving to webassembly.
>>
>> Basically I thought it's probably worth a bit of a heads up relating to
>> this, as I know the bigger picture of things is often hard to see from just
>> pull requests.
>>
>> Thanks
>> Joe
>>
>> This message and any attachment are intended solely for the addressee
>> and may contain confidential information. If you have received this
>> message in error, please contact the sender and delete the email and
>> attachment.
>> Any views or opinions expressed by the author of this email do not
>> necessarily reflect the views of the University of Nottingham. Email
>> communications with the University of Nottingham may be monitored
>> where permitted by law.
Re: [DISCUSS][Format] Draft implementation of string view array format
@Andrew:

Restricting these arrays to a single buffer will severely decrease their utility. Since the character data is stored in multiple character buffers, writing a Utf8View array can proceed without resizing allocations, which is a major overhead when writing Utf8 arrays. Furthermore, since the character buffers have no restrictions on their size, it's straightforward to reuse an existing buffer as a character buffer rather than always allocating a new one. In the case of creating an array which shares a lot of data with another (for example, appending some strings), we can reuse most of the character buffers from the original. Finally, Utf8View is well adapted for efficiently wrapping non-arrow string data for ingestion by a kernel, even if the string data's full extent is not known ahead of time and is spread across multiple non-contiguous buffers.

@Raphael:

> branch on access

The branch-on-access is unavoidable, since a primary feature of the Utf8View format is keeping short strings inline in the fixed-width portion of the data. It's worth noting that the inline prefix allows skipping the branch entirely for common cases of comparison, for example when the strings to be compared differ within the first 4 bytes. In benchmarking (for example while building a hash table) I have not observed that this branch overly pessimizes access. Although I can't guarantee every Utf8View array will be more efficient than any Utf8 array, it is certainly faster for many relevant cases. Specifically, sorting and equality comparison benefit significantly from the prefix comparison fast path, so I'd anticipate that multi-column sorting and aggregations would as well. If there are any other benchmarks which would help to justify Utf8View in your mind, I'd be happy to try writing them.
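To make the prefix fast path concrete, here is a minimal sketch of an equality check over view-style string entries. It is an illustration only and not the Arrow implementation. It assumes a 16-byte view holding a 4-byte unsigned length followed by either up to 12 inline bytes or a 4-byte prefix, a buffer index and an offset. The helper names `make_view`, `load` and `views_equal` are hypothetical.

```python
import struct
from typing import List

INLINE_MAX = 12  # assumption: strings of up to 12 bytes are stored inline


def make_view(data: bytes, buffers: List[bytearray]) -> bytes:
    """Encode `data` as a 16-byte view; long strings go into the last buffer."""
    if len(data) <= INLINE_MAX:
        # 4-byte length, then the bytes themselves, zero-padded to 12 bytes.
        return struct.pack("<I12s", len(data), data)
    buf = buffers[-1]
    offset = len(buf)
    buf.extend(data)  # the character buffer holds the whole string
    # 4-byte length, 4-byte prefix, buffer index, offset into that buffer.
    return struct.pack("<I4sII", len(data), data[:4], len(buffers) - 1, offset)


def load(view: bytes, buffers: List[bytearray]) -> bytes:
    """Recover the full string bytes for a view."""
    (length,) = struct.unpack_from("<I", view)
    if length <= INLINE_MAX:
        return view[4:4 + length]
    _, _, index, offset = struct.unpack("<I4sII", view)
    return bytes(buffers[index][offset:offset + length])


def views_equal(a: bytes, b: bytes, buffers: List[bytearray]) -> bool:
    # Fast path: length and the first 4 bytes sit in the same place in both
    # layouts, so a mismatch here is decided without touching any buffer.
    if a[:8] != b[:8]:
        return False
    (length,) = struct.unpack_from("<I", a)
    if length <= INLINE_MAX:
        return a[8:] == b[8:]  # rest of the inline bytes (zero-padded)
    # Slow path: same length and prefix, so compare the out-of-line data.
    return load(a, buffers) == load(b, buffers)
```

In this sketch the first 8 bytes of a view are laid out identically for inline and out-of-line strings, which is why the length-and-prefix check needs no branch on the string's size. Only strings that agree on both fall through to the slower comparison.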
> UTF-8 validation for StringArray can be done very efficiently by first verifying the entire buffer, and then verifying the offsets correspond to the start of a UTF-8 codepoint

For non-inlined strings, the character buffers do always contain the entire string's data and not just the last `len - 4` bytes. Thus the approach you describe for validating an entire character buffer as UTF-8 then checking offsets will be just as valid for Utf8View arrays as for Utf8 arrays.

> it does seem inconsistent to use unsigned types

It is indeed more typical for the arrow format to use signed integers for offsets and other quantities. In this case there is prior art in other engines with which we can remain compatible by using unsigned integers instead. Since this is only a break with convention within the format and shouldn't be difficult for any implementation to accommodate, I would argue that it's worthwhile to avoid pushing change onto existing implementers.

> I presume that StringView will behave similarly to dictionaries in that the selection kernels will not recompute the underlying value buffers.

The Utf8View format itself is not prescriptive of selection operations on the array; kernels are free to reuse character buffers (which produces an implicit selection vector) or to recompute them. Furthermore, unlike an explicit selection vector, a kernel may decide to copy and densify dynamically if it detects that output is getting sparse or fragmented. It's also worth noting that, unlike an explicit selection vector, a Utf8View array (however sparse or fragmented) will still benefit from the prefix comparison fast path.

Sincerely,
Ben Kietzman

On Sun, Jul 2, 2023 at 8:01 AM Raphael Taylor-Davies wrote:

> I would be interested in hearing some input from the Rust community.
>
> A couple of thoughts:
>
> The variable number of buffers would definitely pose some challenges for
> the Rust implementation, the closest thing we currently have is possibly
> UnionArray, but even then the number of buffers is still determined
> statically by the DataType. I therefore also wonder about the possibility
> of always having a single backing buffer that stores the character data,
> including potentially a copy of the prefix. This would also avoid forcing a
> branch on access, which I would have expected to hurt performance for some
> kernels quite significantly.
>
> Whilst not really a concern for Rust, which supports unsigned types, it
> does seem inconsistent to use unsigned types where the rest of the format
> encourages the use of signed offsets, etc...
>
> It isn't clearly specified whether a null should have a valid set of
> offsets, etc... I think it is an important property of the current array
> layouts that, with exception to dictionaries, the data in null slots is
> arbitrary, i.e. can take any value, but not undefined. This allows for
> separate handling of the null mask and values, which can be important for
> some kernels and APIs.
>
> More an observation than an issue, but UTF-8 validation for StringArray
> can be done very efficiently by first verifying the entire buffer, and then
> verifying the offsets correspond to the start of a UTF-8 codepoint. Thi
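The two-step validation described in the quoted message can be sketched as follows. This is an illustration of the idea only and is not taken from any Arrow implementation. The function name `validate_utf8_array` is hypothetical, and the bounds check on the offsets is an extra assumption added for safety.

```python
from typing import Sequence


def validate_utf8_array(values: bytes, offsets: Sequence[int]) -> bool:
    """Validate a string array given its value buffer and offsets."""
    # Step 1: validate the whole value buffer in a single pass.
    try:
        values.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # Step 2: every offset must fall on a codepoint boundary, i.e. it must
    # not point at a continuation byte of the form 0b10xxxxxx.
    for off in offsets:
        if off < 0 or off > len(values):
            return False  # assumed extra check: offset outside the buffer
        if off < len(values) and (values[off] & 0xC0) == 0x80:
            return False
    return True
```

For example, with the buffer `"héllo".encode("utf-8")` the offsets `[0, 1, 3, 6]` split the data into "h", "é" and "llo" and pass, while `[0, 2, 6]` fail because offset 2 lands in the middle of the two-byte encoding of "é". Step 1 alone would not catch that second case, since the buffer as a whole is valid UTF-8.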
Re: Webassembly?
Hi Joe,

Thank you for working on that.

The one question I have is: are you willing to help us maintain Arrow C++ on the long term? The logic you're adding in https://github.com/apache/arrow/pull/35672 is quite delicate; also I don't think anyone among us is a Webassembly expert, which means that we might break things unwillingly. So while it would be great to get Arrow C++ to work with WASM, a dedicated expert is needed to help maintain and debug WASM support in the future.

Regards

Antoine.

On 03/07/2023 at 17:29, Joe Marshall wrote:

> Hi,
> I'm a pyodide developer amongst other things (webassembly cpython
> interpreter) and I've got some PRs in progress on arrow relating to
> webassembly support. I wondered if it might be worth discussing my broader
> ideas for this on the list or at the biweekly development meeting?
>
> So far I have 35176 in, which makes arrow run on a single thread. This is
> needed because in a lot of webassembly environments (browsers at least,
> pyodide), threading isn't available or is heavily constrained.
>
> With that I've aimed to make it relatively transparent to users, so that
> things like datasets and acero mostly just work (but slower obviously). It's
> kind of fiddly in the arrow code but working, and means users can port
> things easily.
>
> Once that is in, the plan is to submit a following pr that adds cmake
> presets for emscripten which can build the cpp libraries and pyarrow for
> pyodide. I've hacked this together in a build already, it's a bit fiddly and
> needs a load of tidying up, but I'm confident it can be done.
>
> Essentially, I'm wanting to get this stuff in because pandas is moving
> towards arrow as a pretty much required dependency, and webassembly is a
> pandas platform, as well as being an official python platform, so it would
> be great to get it working in pyodide without us needing to maintain a load
> of patches. I guess it could also come in handy with various container
> platforms that are moving to webassembly.
>
> Basically I thought it's probably worth a bit of a heads up relating to
> this, as I know the bigger picture of things is often hard to see from just
> pull requests.
>
> Thanks
> Joe
>
> This message and any attachment are intended solely for the addressee
> and may contain confidential information. If you have received this
> message in error, please contact the sender and delete the email and
> attachment.
> Any views or opinions expressed by the author of this email do not
> necessarily reflect the views of the University of Nottingham. Email
> communications with the University of Nottingham may be monitored
> where permitted by law.