Re: [DISCUSS][C++] Raw pointer string views
Could you please simply describe the layout of DuckDB and Velox so we can know what kind of conversion is required from the raw pointer variant? If any engine simply represents string array in the form of something like std::vector, should we provide a similar variant in C++ to minimize the conversion cost? Best, Gang On Wed, Sep 27, 2023 at 7:09 AM Raphael Taylor-Davies wrote: > I'm confused why this would need to copy string data, assuming the > pointers are into defined memory regions, something necessary for the C > data interface's ownership semantics regardless, why can't these memory > regions just be used as buffers as is? This would therefore require just > rewriting the views buffer to subtract the base pointer of the given > buffer, which should be extremely fast? > > On 26 September 2023 23:34:54 BST, Matt Topol > wrote: > >I believe the motivation is to avoid the cost of the data copy that would > >have to happen to convert from a pointer based to offset based scenario. > >Allowing the pointer-based implementation will ensure that we can maintain > >zero-copy communication with both DuckDB and Velox in a common workflow > >scenario. > > > >Converting to the offset-based version would have a cost of having to copy > >strings from their locations to contiguous buffers which could end up > being > >very significant depending on the shape and size of the data. The pointer > >-based solution wouldn't be allowed in IPC though, only across the C Data > >interface (correct me if I'm wrong). > > > >--Matt > > > >On Tue, Sep 26, 2023, 6:09 PM Raphael Taylor-Davies > > wrote: > > > >> Hi, > >> > >> Is the motivation here to avoid DuckDB and Velox having to duplicate the > >> conversion logic from pointer-based to offset-based, or to allow > >> arrow-cpp to operate directly on pointer-based arrays? > >> > >> If it is the former, I personally wouldn't have thought the conversion > >> logic sufficiently complex to really warrant this? 
> >> > >> If it is the latter, I wonder if you have some benchmark numbers for > >> converting between and operating on the differing representations? In > >> the absence of a strong performance case, it's hard in my opinion to > >> justify adding what will be an arrow-cpp specific extension that isn't > >> part of the standard, with all the potential for confusion and > >> interoperability challenges that entails. > >> > >> Kind Regards, > >> > >> Raphael > >> > >> On 26/09/2023 21:34, Benjamin Kietzman wrote: > >> > Hello all, > >> > > >> > In the PR to add support for Utf8View to the c++ implementation, > >> > I've taken the approach of allowing raw pointer views [1] alongside > the > >> > index/offset views described in the spec [2]. This was done to ease > >> > communication with other engines such as DuckDB and Velox whose native > >> > string representation is the raw pointer view. In order to be usable > >> > as a utility for writing IPC files and other operations on arrow > >> > formatted data, it is useful for the library to be able to directly > >> > import raw pointer arrays even when immediately converting these to > >> > the index/offset representation. > >> > > >> > However there has been objection in review [3] since the raw pointer > >> > representation is not part of the official format. Since data > visitation > >> > utilities are generic, IMHO this hybrid approach does not add > >> > significantly to the complexity of the C++ library, and I feel the > >> > aforementioned interoperability is a high priority when adding this > >> > feature to the C++ library. It's worth noting that this > interoperability > >> > has been a stated goal of the Utf8Type since its original proposal [4] > >> > and throughout the discussion of its adoption [5]. 
> >> > > >> > Sincerely, > >> > Ben Kietzman > >> > > >> > [1]: > >> > > >> > https://github.com/apache/arrow/pull/37792/files#diff-814ac6f43345f7d2f33e9249a1abf092c8078c62ec44cd782c49b676b94ec302R731-R752 > >> > [2]: > >> > > >> > https://github.com/apache/arrow/blob/9d6d501/docs/source/format/Columnar.rst#L369-L379 > >> > [3]: > https://github.com/apache/arrow/pull/37792#discussion_r1336010665 > >> > [4]: https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq > >> > [5]: https://lists.apache.org/thread/8mofy7khfvy3g1m9pmjshbty3cmvb4w4 > >> > > >> >
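To make Gang's layout question concrete: the two 16-byte view layouts under discussion can be sketched as below. This is an illustrative guess at the struct shapes (the spec's index/offset variant vs. a DuckDB/Velox-style raw pointer variant), not code from the PR; it shows the long-string case (strings of 12 bytes or fewer inline their data instead), plus the base-pointer subtraction Raphael describes.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Illustrative names only -- not the PR's actual types.

struct OffsetView {      // index/offset variant, per the Arrow spec [2]
  int32_t length;        // string length in bytes
  char prefix[4];        // first 4 bytes, for fast comparisons
  int32_t buffer_index;  // which character buffer holds the data
  int32_t offset;        // byte offset into that buffer
};

struct PointerView {     // raw pointer variant (DuckDB/Velox style)
  int32_t length;
  char prefix[4];
  const char* data;      // points directly at the string bytes
};

static_assert(sizeof(OffsetView) == 16, "views are 16 bytes");
static_assert(sizeof(PointerView) == 16, "views are 16 bytes");

// Raphael's point: if every pointer targets a known memory region that can
// be exported as a buffer, conversion is just subtracting that buffer's
// base pointer -- no string data is copied.
OffsetView ToOffsetView(const PointerView& v, const char* buffer_base,
                        int32_t buffer_index) {
  OffsetView out;
  out.length = v.length;
  std::memcpy(out.prefix, v.prefix, sizeof(out.prefix));
  out.buffer_index = buffer_index;
  out.offset = static_cast<int32_t>(v.data - buffer_base);
  return out;
}
```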
Re: [VOTE] Release Apache Arrow nanoarrow 0.3.0 - RC0
+1 (non-binding) full verification with conda arrow 13.0.0 R 4.3 on pop_os 23.04, cmake 3.27, gcc 11 On Wed, Sep 27, 2023 at 1:26 AM Bryce Mecum wrote: > +1 (non-binding) > > Verified with `./verify-release-candidate.sh 0.3.0 0` on: > - Windows 10, x86_64, libarrow-main, MSVC 17 2022, R 4.3.1, Rtools 43 > - macOS 13.6, aarch64, libarrow 13.0.0, R 4.3.1 > - Ubuntu 23.04, aarch64, libarrow 13.0.0, R 4.2.2 >
Re: [DISCUSS][Gandiva] External function registry proposal
> I think the key idea is to let users call Gandiva functions to register functions and pass necessary info explicitly to Gandiva, rather than letting Gandiva discover them by itself. That makes sense. Thanks Jin and Antoine for your valuable feedback. I will revise the proposal accordingly later. Regards, Yue On Wed, Sep 27, 2023 at 4:29 AM Jin Shang wrote: > I agree with Antoine that we don't need to define a JSON format or a > directory structure for Gandiva. > To support external functions, we essentially need two things: > 1. Gandiva's function registry needs to be aware of the function metadata: > We can achieve this by having a > `FunctionRegistry::AddFunction(NativeFunction* func)` function. The > `NativeFunction` can come from whatever source the user wants or even hard > coded, not necessarily from JSON files. The function registry is a > singleton so this should be easy. > 2. The LLVM engine needs access to the function IR definition: Users should > be able to register a string representation of IR bytecode similar to what the > current `Engine::LoadPreCompiledIR` does. So something like > `LoadExternalIR(std::string_view)` is enough. Although `Engine` is not a > singleton, we can create a global object holding external IRs and Engines > can link them on construction. > I think the key idea is to let users call Gandiva functions to register > functions and pass necessary info explicitly to Gandiva, rather than > letting Gandiva discover them by itself. > > On Tue, Sep 26, 2023 at 2:14 AM Yue Ni wrote: > > > > The definition of an external function registry can certainly belong in > > Gandiva, but how it's populated should be left to third-party projects > > > > Are you proposing a more general approach, like incorporating the > following > > APIs into Gandiva? (Please note that the function names/signatures are > > tentative and just meant for illustrative purposes.) 
> > 1) AddExternalFunctionRegistry(ExternalFunctionRegistry > function_registry) > > 2) AddFunctionBitcodeLoader(FunctionBitcodeLoader bitcode_loader) > > Where `ExternalFunctionRegistry` can return a list of function > definitions > > and `FunctionBitcodeLoader` can return a list of bitcode buffers, so that > > the specific metadata/bitcode data population logic can be moved out of > > Gandiva? Thanks. > > > > Regards, > > Yue > > > > On Tue, Sep 26, 2023 at 12:25 AM Antoine Pitrou > > wrote: > > > > > > > > Hi Yue, > > > > > > Le 25/09/2023 à 18:15, Yue Ni a écrit : > > > > > > > >> a CMake entrypoint (for example a function) making it easy for > > > > third-party projects to compile their own functions > > > > I can come up with a minimum CMake template so that users can compile > > C++ > > > > based functions, and I think if the integration happens at the LLVM > IR > > > > level, it is possible to author the functions beyond C++ languages, > > such > > > as > > > > Rust/Zig as long as the compiler can generate LLVM IR (there are > other > > > > issues that need to be addressed from the Rust experiment I made, but > > > that > > > > can be another proposal/PR). If we make that work, CMake is probably > > not > > > so > > > > important either since other languages can use their own build tools > > such > > > > as Cargo/zig build, and we just need some documentation to describe > how > > > it > > > > should be interfaced typically. > > > > > > As long as there's a well-known and supported way to generate the code > > > for external functions, then it's fine to me. 
> > > > > > (also the required signature for these functions should be documented > > > somewhere) > > > > > > >> The rest of the proposal (a specific JSON file format, a bunch of > > > functions > > > > to iterate directory entries in a specific layout) is IMHO off-topic > > for > > > > Gandiva, and each third-party project can implement their own idioms > > for > > > > the discovery of external functions > > > > > > > > Could you give some more guidance on how this should work without an > > > > external function registry containing metadata? As far as I know, for > > > each > > > > pre-compiled function used in an expression, Gandiva needs to lookup > > its > > > > signature from the function registry, which currently is a C++ class > > that > > > > is hard coded to contain 6 categories of built-in functions > > > > (arithmetic/datetime/hash/mathops/string/datetime arithmetic). If a > > third > > > > party function cannot be found in the registry, it cannot be used in > > the > > > > expression. If we don't load the pre-compiled function metadata from > > > > external files, how do we avoid Gandiva rejecting the expression > when a > > > > third party function cannot be found in the function registry? > Thanks. > > > > > > What I'm saying is that code to load function metadata from JSON and > > > walk directories of .bc files does not belong in Gandiva. The > definition > > > of an external function registry can certainly belong in Gandiva, but > > > how it's
> > > populated should be left to third-party projects (which then don't have to use JSON or a given directory layout).
[DISCUSS][Flight SQL] Adding Ingest Support for Flight SQL
Hi devs, I would like to open a discussion around adding support for a native "ingest" command to the Flight SQL specification. The initial motivating use-case for this is to be able to support ADBC ingest when using the Flight SQL driver, which is currently not possible because the specific UPDATE semantics cannot be generalized across all possible Flight SQL backends. Specifically, I am proposing to extend the Flight SQL protobuf specification with a "CommandStatementIngest" message type. The GH issue [1] includes a sample message definition for this command. This command would be included in the FlightDescriptor of a DoPut call to the server, after which the subsequent FlightData stream could be handled as a single bulk ingest. I would greatly appreciate thoughts and feedback on this proposal. Thank you, Joel Lubinitsky [1] https://github.com/apache/arrow-adbc/issues/1107
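The sample definition itself lives in the GH issue [1]; for orientation, a purely hypothetical sketch of the general shape such a protobuf message might take (every name and field below is an illustrative guess, not the definition from the issue) is:

```protobuf
// Hypothetical sketch only -- see [1] for the actual proposed definition.
// Carried in the FlightDescriptor of a DoPut call; the subsequent
// FlightData stream is then handled as a single bulk ingest.
message CommandStatementIngest {
  string table = 1;            // illustrative: destination table name
  optional string schema = 2;  // illustrative: optional catalog schema
  bool temporary = 3;          // illustrative: target is a temporary table
}
```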
Re: [VOTE] Release Apache Arrow nanoarrow 0.3.0 - RC0
+1 (non-binding) Verified with `./verify-release-candidate.sh 0.3.0 0` on: - Windows 10, x86_64, libarrow-main, MSVC 17 2022, R 4.3.1, Rtools 43 - macOS 13.6, aarch64, libarrow 13.0.0, R 4.3.1 - Ubuntu 23.04, aarch64, libarrow 13.0.0, R 4.2.2
Re: [DISCUSS][C++] Raw pointer string views
I'm confused why this would need to copy string data, assuming the pointers are into defined memory regions, something necessary for the C data interface's ownership semantics regardless, why can't these memory regions just be used as buffers as is? This would therefore require just rewriting the views buffer to subtract the base pointer of the given buffer, which should be extremely fast? On 26 September 2023 23:34:54 BST, Matt Topol wrote: >I believe the motivation is to avoid the cost of the data copy that would >have to happen to convert from a pointer based to offset based scenario. >Allowing the pointer-based implementation will ensure that we can maintain >zero-copy communication with both DuckDB and Velox in a common workflow >scenario. > >Converting to the offset-based version would have a cost of having to copy >strings from their locations to contiguous buffers which could end up being >very significant depending on the shape and size of the data. The pointer >-based solution wouldn't be allowed in IPC though, only across the C Data >interface (correct me if I'm wrong). > >--Matt > >On Tue, Sep 26, 2023, 6:09 PM Raphael Taylor-Davies > wrote: > >> Hi, >> >> Is the motivation here to avoid DuckDB and Velox having to duplicate the >> conversion logic from pointer-based to offset-based, or to allow >> arrow-cpp to operate directly on pointer-based arrays? >> >> If it is the former, I personally wouldn't have thought the conversion >> logic sufficiently complex to really warrant this? >> >> If it is the latter, I wonder if you have some benchmark numbers for >> converting between and operating on the differing representations? In >> the absence of a strong performance case, it's hard in my opinion to >> justify adding what will be an arrow-cpp specific extension that isn't >> part of the standard, with all the potential for confusion and >> interoperability challenges that entails. 
>> >> Kind Regards, >> >> Raphael >> >> On 26/09/2023 21:34, Benjamin Kietzman wrote: >> > Hello all, >> > >> > In the PR to add support for Utf8View to the c++ implementation, >> > I've taken the approach of allowing raw pointer views [1] alongside the >> > index/offset views described in the spec [2]. This was done to ease >> > communication with other engines such as DuckDB and Velox whose native >> > string representation is the raw pointer view. In order to be usable >> > as a utility for writing IPC files and other operations on arrow >> > formatted data, it is useful for the library to be able to directly >> > import raw pointer arrays even when immediately converting these to >> > the index/offset representation. >> > >> > However there has been objection in review [3] since the raw pointer >> > representation is not part of the official format. Since data visitation >> > utilities are generic, IMHO this hybrid approach does not add >> > significantly to the complexity of the C++ library, and I feel the >> > aforementioned interoperability is a high priority when adding this >> > feature to the C++ library. It's worth noting that this interoperability >> > has been a stated goal of the Utf8Type since its original proposal [4] >> > and throughout the discussion of its adoption [5]. >> > >> > Sincerely, >> > Ben Kietzman >> > >> > [1]: >> > >> https://github.com/apache/arrow/pull/37792/files#diff-814ac6f43345f7d2f33e9249a1abf092c8078c62ec44cd782c49b676b94ec302R731-R752 >> > [2]: >> > >> https://github.com/apache/arrow/blob/9d6d501/docs/source/format/Columnar.rst#L369-L379 >> > [3]: https://github.com/apache/arrow/pull/37792#discussion_r1336010665 >> > [4]: https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq >> > [5]: https://lists.apache.org/thread/8mofy7khfvy3g1m9pmjshbty3cmvb4w4 >> > >>
Re: [DISCUSS][C++] Raw pointer string views
I believe the motivation is to avoid the cost of the data copy that would have to happen to convert from a pointer-based to offset-based scenario. Allowing the pointer-based implementation will ensure that we can maintain zero-copy communication with both DuckDB and Velox in a common workflow scenario. Converting to the offset-based version would have a cost of having to copy strings from their locations to contiguous buffers which could end up being very significant depending on the shape and size of the data. The pointer-based solution wouldn't be allowed in IPC though, only across the C Data interface (correct me if I'm wrong). --Matt On Tue, Sep 26, 2023, 6:09 PM Raphael Taylor-Davies wrote: > Hi, > > Is the motivation here to avoid DuckDB and Velox having to duplicate the > conversion logic from pointer-based to offset-based, or to allow > arrow-cpp to operate directly on pointer-based arrays? > > If it is the former, I personally wouldn't have thought the conversion > logic sufficiently complex to really warrant this? > > If it is the latter, I wonder if you have some benchmark numbers for > converting between and operating on the differing representations? In > the absence of a strong performance case, it's hard in my opinion to > justify adding what will be an arrow-cpp specific extension that isn't > part of the standard, with all the potential for confusion and > interoperability challenges that entails. > > Kind Regards, > > Raphael > > On 26/09/2023 21:34, Benjamin Kietzman wrote: > > Hello all, > > > > In the PR to add support for Utf8View to the c++ implementation, > > I've taken the approach of allowing raw pointer views [1] alongside the > > index/offset views described in the spec [2]. This was done to ease > > communication with other engines such as DuckDB and Velox whose native > > string representation is the raw pointer view. 
In order to be usable > > as a utility for writing IPC files and other operations on arrow > > formatted data, it is useful for the library to be able to directly > > import raw pointer arrays even when immediately converting these to > > the index/offset representation. > > > > However there has been objection in review [3] since the raw pointer > > representation is not part of the official format. Since data visitation > > utilities are generic, IMHO this hybrid approach does not add > > significantly to the complexity of the C++ library, and I feel the > > aforementioned interoperability is a high priority when adding this > > feature to the C++ library. It's worth noting that this interoperability > > has been a stated goal of the Utf8Type since its original proposal [4] > > and throughout the discussion of its adoption [5]. > > > > Sincerely, > > Ben Kietzman > > > > [1]: > > > https://github.com/apache/arrow/pull/37792/files#diff-814ac6f43345f7d2f33e9249a1abf092c8078c62ec44cd782c49b676b94ec302R731-R752 > > [2]: > > > https://github.com/apache/arrow/blob/9d6d501/docs/source/format/Columnar.rst#L369-L379 > > [3]: https://github.com/apache/arrow/pull/37792#discussion_r1336010665 > > [4]: https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq > > [5]: https://lists.apache.org/thread/8mofy7khfvy3g1m9pmjshbty3cmvb4w4 > > >
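The copy cost Matt describes can be sketched concretely: when the pointed-to strings do not already live in exportable buffers, each string must be appended to a freshly allocated contiguous buffer, with an offset recorded per element, costing time proportional to the total string bytes. A hedged illustration (struct and function names are invented for this sketch):

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Illustrative: a raw pointer view referencing string bytes somewhere in memory.
struct PointerView {
  int32_t length;
  const char* data;
};

// The conversion Matt describes: copy every string into one contiguous
// character buffer and record per-element offsets. The cost is O(total
// string bytes), which can dominate for large or string-heavy data.
std::pair<std::string, std::vector<int32_t>> Compact(
    const std::vector<PointerView>& views) {
  std::string buffer;            // single contiguous character buffer
  std::vector<int32_t> offsets;
  offsets.reserve(views.size());
  for (const PointerView& v : views) {
    offsets.push_back(static_cast<int32_t>(buffer.size()));
    buffer.append(v.data, static_cast<size_t>(v.length));  // the copy
  }
  return {std::move(buffer), std::move(offsets)};
}
```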
Re: [VOTE] Release Apache Arrow nanoarrow 0.3.0 - RC0
+1 Tested on Ubuntu 20.04 LTS/x86_64, R 4.3.1 On Tue, Sep 26, 2023, at 18:05, Dane Pitkin wrote: > +1 (non-binding) > > I verified successfully on MacOS 13.5 (aarch64) with: > > cd dev/release && ./verify-release-candidate.sh 0.3.0 0 > > > > On Tue, Sep 26, 2023 at 5:30 PM Sutou Kouhei wrote: > >> +1 >> >> I ran the following command line on Debian GNU/Linux sid: >> >> CMAKE_PREFIX_PATH=/tmp/local \ >> dev/release/verify-release-candidate.sh 0.3.0 0 >> >> with: >> >> * Apache Arrow C++ main >> * gcc (Debian 13.2.0-4) 13.2.0 >> * R version 4.3.1 (2023-06-16) -- "Beagle Scouts" >> >> Thanks, >> -- >> kou >> >> In >> "[VOTE] Release Apace Arrow nanoarrow 0.3.0 - RC0" on Tue, 26 Sep 2023 >> 12:23:52 -0300, >> Dewey Dunnington wrote: >> >> > Hello, >> > >> > I would like to propose the following release candidate (rc0) of >> > Apache Arrow nanoarrow [0] version 0.3.0. This is an initial release >> > consisting of 42 resolved GitHub issues from 4 contributors [1]. >> > >> > This release candidate is based on commit: >> > c00cd7707bcddb4dab9a7d19bf63e87c06d36c63 [2] >> > >> > The source release rc0 is hosted at [3]. >> > The changelog is located at [4]. >> > >> > Please download, verify checksums and signatures, run the unit tests, >> > and vote on the release. See [5] for how to validate a release >> > candidate. >> > >> > See also a successful suite of verification runs at [6]. >> > >> > The vote will be open for at least 72 hours. >> > >> > [ ] +1 Release this as Apache Arrow nanoarrow 0.3.0 >> > [ ] +0 >> > [ ] -1 Do not release this as Apache Arrow nanoarrow 0.3.0 because... 
>> > >> > [0] https://github.com/apache/arrow-nanoarrow >> > [1] https://github.com/apache/arrow-nanoarrow/milestone/3?closed=1 >> > [2] >> https://github.com/apache/arrow-nanoarrow/tree/apache-arrow-nanoarrow-0.3.0-rc0 >> > [3] >> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-nanoarrow-0.3.0-rc0/ >> > [4] >> https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.3.0-rc0/CHANGELOG.md >> > [5] >> https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/README.md >> > [6] https://github.com/apache/arrow-nanoarrow/actions/runs/6314579940 >>
Re: [DISCUSS][C++] Raw pointer string views
Hi, Is the motivation here to avoid DuckDB and Velox having to duplicate the conversion logic from pointer-based to offset-based, or to allow arrow-cpp to operate directly on pointer-based arrays? If it is the former, I personally wouldn't have thought the conversion logic sufficiently complex to really warrant this? If it is the latter, I wonder if you have some benchmark numbers for converting between and operating on the differing representations? In the absence of a strong performance case, it's hard in my opinion to justify adding what will be an arrow-cpp specific extension that isn't part of the standard, with all the potential for confusion and interoperability challenges that entails. Kind Regards, Raphael On 26/09/2023 21:34, Benjamin Kietzman wrote: Hello all, In the PR to add support for Utf8View to the c++ implementation, I've taken the approach of allowing raw pointer views [1] alongside the index/offset views described in the spec [2]. This was done to ease communication with other engines such as DuckDB and Velox whose native string representation is the raw pointer view. In order to be usable as a utility for writing IPC files and other operations on arrow formatted data, it is useful for the library to be able to directly import raw pointer arrays even when immediately converting these to the index/offset representation. However there has been objection in review [3] since the raw pointer representation is not part of the official format. Since data visitation utilities are generic, IMHO this hybrid approach does not add significantly to the complexity of the C++ library, and I feel the aforementioned interoperability is a high priority when adding this feature to the C++ library. It's worth noting that this interoperability has been a stated goal of the Utf8Type since its original proposal [4] and throughout the discussion of its adoption [5]. 
Sincerely, Ben Kietzman [1]: https://github.com/apache/arrow/pull/37792/files#diff-814ac6f43345f7d2f33e9249a1abf092c8078c62ec44cd782c49b676b94ec302R731-R752 [2]: https://github.com/apache/arrow/blob/9d6d501/docs/source/format/Columnar.rst#L369-L379 [3]: https://github.com/apache/arrow/pull/37792#discussion_r1336010665 [4]: https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq [5]: https://lists.apache.org/thread/8mofy7khfvy3g1m9pmjshbty3cmvb4w4
Re: [VOTE] Release Apache Arrow nanoarrow 0.3.0 - RC0
+1 (non-binding) I verified successfully on MacOS 13.5 (aarch64) with: cd dev/release && ./verify-release-candidate.sh 0.3.0 0 On Tue, Sep 26, 2023 at 5:30 PM Sutou Kouhei wrote: > +1 > > I ran the following command line on Debian GNU/Linux sid: > > CMAKE_PREFIX_PATH=/tmp/local \ > dev/release/verify-release-candidate.sh 0.3.0 0 > > with: > > * Apache Arrow C++ main > * gcc (Debian 13.2.0-4) 13.2.0 > * R version 4.3.1 (2023-06-16) -- "Beagle Scouts" > > Thanks, > -- > kou > > In > "[VOTE] Release Apace Arrow nanoarrow 0.3.0 - RC0" on Tue, 26 Sep 2023 > 12:23:52 -0300, > Dewey Dunnington wrote: > > > Hello, > > > > I would like to propose the following release candidate (rc0) of > > Apache Arrow nanoarrow [0] version 0.3.0. This is an initial release > > consisting of 42 resolved GitHub issues from 4 contributors [1]. > > > > This release candidate is based on commit: > > c00cd7707bcddb4dab9a7d19bf63e87c06d36c63 [2] > > > > The source release rc0 is hosted at [3]. > > The changelog is located at [4]. > > > > Please download, verify checksums and signatures, run the unit tests, > > and vote on the release. See [5] for how to validate a release > > candidate. > > > > See also a successful suite of verification runs at [6]. > > > > The vote will be open for at least 72 hours. > > > > [ ] +1 Release this as Apache Arrow nanoarrow 0.3.0 > > [ ] +0 > > [ ] -1 Do not release this as Apache Arrow nanoarrow 0.3.0 because... 
> > > > [0] https://github.com/apache/arrow-nanoarrow > > [1] https://github.com/apache/arrow-nanoarrow/milestone/3?closed=1 > > [2] > https://github.com/apache/arrow-nanoarrow/tree/apache-arrow-nanoarrow-0.3.0-rc0 > > [3] > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-nanoarrow-0.3.0-rc0/ > > [4] > https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.3.0-rc0/CHANGELOG.md > > [5] > https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/README.md > > [6] https://github.com/apache/arrow-nanoarrow/actions/runs/6314579940 >
Re: [VOTE] Release Apache Arrow nanoarrow 0.3.0 - RC0
+1 I ran the following command line on Debian GNU/Linux sid: CMAKE_PREFIX_PATH=/tmp/local \ dev/release/verify-release-candidate.sh 0.3.0 0 with: * Apache Arrow C++ main * gcc (Debian 13.2.0-4) 13.2.0 * R version 4.3.1 (2023-06-16) -- "Beagle Scouts" Thanks, -- kou In "[VOTE] Release Apace Arrow nanoarrow 0.3.0 - RC0" on Tue, 26 Sep 2023 12:23:52 -0300, Dewey Dunnington wrote: > Hello, > > I would like to propose the following release candidate (rc0) of > Apache Arrow nanoarrow [0] version 0.3.0. This is an initial release > consisting of 42 resolved GitHub issues from 4 contributors [1]. > > This release candidate is based on commit: > c00cd7707bcddb4dab9a7d19bf63e87c06d36c63 [2] > > The source release rc0 is hosted at [3]. > The changelog is located at [4]. > > Please download, verify checksums and signatures, run the unit tests, > and vote on the release. See [5] for how to validate a release > candidate. > > See also a successful suite of verification runs at [6]. > > The vote will be open for at least 72 hours. > > [ ] +1 Release this as Apache Arrow nanoarrow 0.3.0 > [ ] +0 > [ ] -1 Do not release this as Apache Arrow nanoarrow 0.3.0 because... > > [0] https://github.com/apache/arrow-nanoarrow > [1] https://github.com/apache/arrow-nanoarrow/milestone/3?closed=1 > [2] > https://github.com/apache/arrow-nanoarrow/tree/apache-arrow-nanoarrow-0.3.0-rc0 > [3] > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-nanoarrow-0.3.0-rc0/ > [4] > https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.3.0-rc0/CHANGELOG.md > [5] https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/README.md > [6] https://github.com/apache/arrow-nanoarrow/actions/runs/6314579940
[DISCUSS][C++] Raw pointer string views
Hello all, In the PR to add support for Utf8View to the c++ implementation, I've taken the approach of allowing raw pointer views [1] alongside the index/offset views described in the spec [2]. This was done to ease communication with other engines such as DuckDB and Velox whose native string representation is the raw pointer view. In order to be usable as a utility for writing IPC files and other operations on arrow formatted data, it is useful for the library to be able to directly import raw pointer arrays even when immediately converting these to the index/offset representation. However there has been objection in review [3] since the raw pointer representation is not part of the official format. Since data visitation utilities are generic, IMHO this hybrid approach does not add significantly to the complexity of the C++ library, and I feel the aforementioned interoperability is a high priority when adding this feature to the C++ library. It's worth noting that this interoperability has been a stated goal of the Utf8Type since its original proposal [4] and throughout the discussion of its adoption [5]. Sincerely, Ben Kietzman [1]: https://github.com/apache/arrow/pull/37792/files#diff-814ac6f43345f7d2f33e9249a1abf092c8078c62ec44cd782c49b676b94ec302R731-R752 [2]: https://github.com/apache/arrow/blob/9d6d501/docs/source/format/Columnar.rst#L369-L379 [3]: https://github.com/apache/arrow/pull/37792#discussion_r1336010665 [4]: https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq [5]: https://lists.apache.org/thread/8mofy7khfvy3g1m9pmjshbty3cmvb4w4
Re: [DISCUSS][Gandiva] External function registry proposal
I agree with Antoine that we don't need to define a JSON format or a directory structure for Gandiva. To support external functions, we essentially need two things: 1. Gandiva's function registry needs to be aware of the function metadata: We can achieve this by having a `FunctionRegistry::AddFunction(NativeFunction* func)` function. The `NativeFunction` can come from whatever source the user wants or even hard coded, not necessarily from JSON files. The function registry is a singleton so this should be easy. 2. The LLVM engine needs access to the function IR definition: Users should be able to register a string representation of IR bytecode similar to what the current `Engine::LoadPreCompiledIR` does. So something like `LoadExternalIR(std::string_view)` is enough. Although `Engine` is not a singleton, we can create a global object holding external IRs and Engines can link them on construction. I think the key idea is to let users call Gandiva functions to register functions and pass necessary info explicitly to Gandiva, rather than letting Gandiva discover them by itself. On Tue, Sep 26, 2023 at 2:14 AM Yue Ni wrote: > > The definition of an external function registry can certainly belong in > Gandiva, but how it's populated should be left to third-party projects > > Are you proposing a more general approach, like incorporating the following > APIs into Gandiva? (Please note that the function names/signatures are > tentative and just meant for illustrative purposes.) > 1) AddExternalFunctionRegistry(ExternalFunctionRegistry function_registry) > 2) AddFunctionBitcodeLoader(FunctionBitcodeLoader bitcode_loader) > Where `ExternalFunctionRegistry` can return a list of function definitions > and `FunctionBitcodeLoader` can return a list of bitcode buffers, so that > the specific metadata/bitcode data population logic can be moved out of > Gandiva? Thanks. 
> > Regards, > Yue > > On Tue, Sep 26, 2023 at 12:25 AM Antoine Pitrou > wrote: > > > > > Hi Yue, > > > > Le 25/09/2023 à 18:15, Yue Ni a écrit : > > > > > >> a CMake entrypoint (for example a function) making it easy for > > > third-party projects to compile their own functions > > > I can come up with a minimum CMake template so that users can compile > C++ > > > based functions, and I think if the integration happens at the LLVM IR > > > level, it is possible to author the functions beyond C++ languages, > such > > as > > > Rust/Zig as long as the compiler can generate LLVM IR (there are other > > > issues that need to be addressed from the Rust experiment I made, but > > that > > > can be another proposal/PR). If we make that work, CMake is probably > not > > so > > > important either since other languages can use their own build tools > such > > > as Cargo/zig build, and we just need some documentation to describe how > > it > > > should be interfaced typically. > > > > As long as there's a well-known and supported way to generate the code > > for external functions, then it's fine to me. > > > > (also the required signature for these functions should be documented > > somewhere) > > > > >> The rest of the proposal (a specific JSON file format, a bunch of > > functions > > > to iterate directory entries in a specific layout) is IMHO off-topic > for > > > Gandiva, and each third-party project can implement their own idioms > for > > > the discovery of external functions > > > > > > Could you give some more guidance on how this should work without an > > > external function registry containing metadata? As far as I know, for > > each > > > pre-compiled function used in an expression, Gandiva needs to lookup > its > > > signature from the function registry, which currently is a C++ class > that > > > is hard coded to contain 6 categories of built-in functions > > > (arithmetic/datetime/hash/mathops/string/datetime arithmetic). 
If a > third > > > party function cannot be found in the registry, it cannot be used in > the > > > expression. If we don't load the pre-compiled function metadata from > > > external files, how do we avoid Gandiva rejecting the expression when a > > > third party function cannot be found in the function registry? Thanks. > > > > What I'm saying is that code to load function metadata from JSON and > > walk directories of .bc files does not belong in Gandiva. The definition > > of an external function registry can certainly belong in Gandiva, but > > how it's populated should be left to third-party projects (which then > > don't have to use JSON or a given directory layout). > > > > Regards > > > > Antoine. > > >
Re: [VOTE][RUST] Release Apache Arrow Rust Object Store 0.7.1 RC1
+1 (binding)

Verified on mac x86_64. Looks like a good release to me -- thank you Raphael

Andrew

On Tue, Sep 26, 2023 at 12:05 PM Raphael Taylor-Davies wrote:
> Hi,
>
> I would like to propose a release of Apache Arrow Rust Object Store Implementation, version 0.7.1.
>
> This release candidate is based on commit: 4ef7917bd57b701e30def8511b5fd8a7961f2fcf [1]
>
> The proposed release tarball and signatures are hosted at [2].
>
> The changelog is located at [3].
>
> Please download, verify checksums and signatures, run the unit tests, and vote on the release. There is a script [4] that automates some of the verification.
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Release this as Apache Arrow Rust Object Store
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow Rust Object Store because...
>
> [1]: https://github.com/apache/arrow-rs/tree/4ef7917bd57b701e30def8511b5fd8a7961f2fcf
> [2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.7.1-rc1
> [3]: https://github.com/apache/arrow-rs/blob/4ef7917bd57b701e30def8511b5fd8a7961f2fcf/object_store/CHANGELOG.md
> [4]: https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh
Re: [VOTE][RUST] Release Apache Arrow Rust Object Store 0.7.1 RC1
+1 (binding)

Verified on M1 Mac. Thanks Raphael.

On Tue, Sep 26, 2023 at 9:05 AM Raphael Taylor-Davies wrote:
>
> Hi,
>
> I would like to propose a release of Apache Arrow Rust Object Store Implementation, version 0.7.1.
>
> This release candidate is based on commit: 4ef7917bd57b701e30def8511b5fd8a7961f2fcf [1]
>
> The proposed release tarball and signatures are hosted at [2].
>
> The changelog is located at [3].
>
> Please download, verify checksums and signatures, run the unit tests, and vote on the release. There is a script [4] that automates some of the verification.
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Release this as Apache Arrow Rust Object Store
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow Rust Object Store because...
>
> [1]: https://github.com/apache/arrow-rs/tree/4ef7917bd57b701e30def8511b5fd8a7961f2fcf
> [2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.7.1-rc1
> [3]: https://github.com/apache/arrow-rs/blob/4ef7917bd57b701e30def8511b5fd8a7961f2fcf/object_store/CHANGELOG.md
> [4]: https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh
Re: [Format] C Data Interface integration testing
Thank you for setting this up! I look forward to adding nanoarrow as soon as time allows.

Cheers,

-dewey

On Tue, Sep 26, 2023 at 9:48 AM Antoine Pitrou wrote:
>
> Hello,
>
> We have added some infrastructure for integration testing of the C Data Interface between Arrow implementations. We are now testing the C++ and Go implementations, but the goal in the future is for all major implementations to be tested there (perhaps including nanoarrow).
>
> - PR to add the testing infrastructure and enable the C++ implementation:
>   https://github.com/apache/arrow/pull/37769
>
> - PR to enable the Go implementation:
>   https://github.com/apache/arrow/pull/37788
>
> Feel free to ask any questions.
>
> Regards
>
> Antoine.
[VOTE][RUST] Release Apache Arrow Rust Object Store 0.7.1 RC1
Hi,

I would like to propose a release of Apache Arrow Rust Object Store Implementation, version 0.7.1.

This release candidate is based on commit: 4ef7917bd57b701e30def8511b5fd8a7961f2fcf [1]

The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests, and vote on the release. There is a script [4] that automates some of the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust Object Store
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust Object Store because...

[1]: https://github.com/apache/arrow-rs/tree/4ef7917bd57b701e30def8511b5fd8a7961f2fcf
[2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.7.1-rc1
[3]: https://github.com/apache/arrow-rs/blob/4ef7917bd57b701e30def8511b5fd8a7961f2fcf/object_store/CHANGELOG.md
[4]: https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh
[VOTE] Release Apache Arrow nanoarrow 0.3.0 - RC0
Hello,

I would like to propose the following release candidate (rc0) of Apache Arrow nanoarrow [0] version 0.3.0. This is an initial release consisting of 42 resolved GitHub issues from 4 contributors [1].

This release candidate is based on commit: c00cd7707bcddb4dab9a7d19bf63e87c06d36c63 [2]

The source release rc0 is hosted at [3].

The changelog is located at [4].

Please download, verify checksums and signatures, run the unit tests, and vote on the release. See [5] for how to validate a release candidate. See also a successful suite of verification runs at [6].

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow nanoarrow 0.3.0
[ ] +0
[ ] -1 Do not release this as Apache Arrow nanoarrow 0.3.0 because...

[0] https://github.com/apache/arrow-nanoarrow
[1] https://github.com/apache/arrow-nanoarrow/milestone/3?closed=1
[2] https://github.com/apache/arrow-nanoarrow/tree/apache-arrow-nanoarrow-0.3.0-rc0
[3] https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-nanoarrow-0.3.0-rc0/
[4] https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.3.0-rc0/CHANGELOG.md
[5] https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/README.md
[6] https://github.com/apache/arrow-nanoarrow/actions/runs/6314579940
[Format] C Data Interface integration testing
Hello,

We have added some infrastructure for integration testing of the C Data Interface between Arrow implementations. We are now testing the C++ and Go implementations, but the goal in the future is for all major implementations to be tested there (perhaps including nanoarrow).

- PR to add the testing infrastructure and enable the C++ implementation:
  https://github.com/apache/arrow/pull/37769

- PR to enable the Go implementation:
  https://github.com/apache/arrow/pull/37788

Feel free to ask any questions.

Regards

Antoine.