Re: [VOTE] Release Apache Arrow 13.0.0 - RC0
Hi,

Sorry to reply without a vote. I tried to run the
verify-release-candidate.sh script on my Mac M1 but it took me forever
to fix various environment issues. Is it better to verify this in a
pure Docker environment instead?

Thanks,
Gang

On Tue, Jul 25, 2023 at 12:30 PM Yibo Cai wrote:

> +1.
>
> Verified c++/python/go source on Ubuntu-22.04 aarch64.
>
> TEST_DEFAULT=0 TEST_CPP=1 TEST_PYTHON=1 TEST_GO=1 \
>   dev/release/verify-release-candidate.sh 13.0.0 0
>
> Met with a non-blocking issue:
> https://github.com/apache/arrow/issues/36860
>
> On 7/21/23 17:49, Raúl Cumplido wrote:
> > Hi,
> >
> > As discussed during the community calls I have also triggered the
> > benchmark tests on the Pull Request for RC 0 [1].
> >
> > I am trying to get the conbench comparison between the 13.0.0 RC0 and
> > 12.0.1 RC1 (latest release) by having a chat with the conbench
> > maintainers. I'll share as soon as I have it.
> >
> > I wanted to share the Verification email as soon as possible so we can
> > start running the verification process.
> >
> > Thanks,
> > Raúl
> >
> > [1] https://github.com/apache/arrow/pull/36775#issuecomment-1645088676
> >
> > On Fri, Jul 21, 2023 at 11:45, Raúl Cumplido wrote:
> >>
> >> Hi,
> >>
> >> I would like to propose the following release candidate (RC0) of Apache
> >> Arrow version 13.0.0. This is a release consisting of 428
> >> resolved GitHub issues[1].
> >>
> >> This release candidate is based on commit:
> >> ac2d207611ce25c91fb9fc90d5eaff2933609660 [2]
> >>
> >> The source release rc0 is hosted at [3].
> >> The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
> >> The changelog is located at [12].
> >>
> >> Please download, verify checksums and signatures, run the unit tests,
> >> and vote on the release. See [13] for how to validate a release
> >> candidate.
> >>
> >> See also a verification result on GitHub pull request [14].
> >>
> >> The vote will be open for at least 72 hours.
> >>
> >> [ ] +1 Release this as Apache Arrow 13.0.0
> >> [ ] +0
> >> [ ] -1 Do not release this as Apache Arrow 13.0.0 because...
> >>
> >> [1]: https://github.com/apache/arrow/issues?q=is%3Aissue+milestone%3A13.0.0+is%3Aclosed
> >> [2]: https://github.com/apache/arrow/tree/ac2d207611ce25c91fb9fc90d5eaff2933609660
> >> [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-13.0.0-rc0
> >> [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> >> [5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
> >> [6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
> >> [7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> >> [8]: https://apache.jfrog.io/artifactory/arrow/java-rc/13.0.0-rc0
> >> [9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/13.0.0-rc0
> >> [10]: https://apache.jfrog.io/artifactory/arrow/python-rc/13.0.0-rc0
> >> [11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> >> [12]: https://github.com/apache/arrow/blob/ac2d207611ce25c91fb9fc90d5eaff2933609660/CHANGELOG.md
> >> [13]: https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
> >> [14]: https://github.com/apache/arrow/pull/36775
Re: [VOTE] Release Apache Arrow 13.0.0 - RC0
+1.

Verified c++/python/go source on Ubuntu-22.04 aarch64.

TEST_DEFAULT=0 TEST_CPP=1 TEST_PYTHON=1 TEST_GO=1 \
  dev/release/verify-release-candidate.sh 13.0.0 0

Met with a non-blocking issue:
https://github.com/apache/arrow/issues/36860

On 7/21/23 17:49, Raúl Cumplido wrote:

> Hi,
>
> As discussed during the community calls I have also triggered the
> benchmark tests on the Pull Request for RC 0 [1].
>
> I am trying to get the conbench comparison between the 13.0.0 RC0 and
> 12.0.1 RC1 (latest release) by having a chat with the conbench
> maintainers. I'll share as soon as I have it.
>
> I wanted to share the Verification email as soon as possible so we can
> start running the verification process.
>
> Thanks,
> Raúl
>
> [1] https://github.com/apache/arrow/pull/36775#issuecomment-1645088676
>
> On Fri, Jul 21, 2023 at 11:45, Raúl Cumplido wrote:
> > Hi,
> >
> > I would like to propose the following release candidate (RC0) of Apache
> > Arrow version 13.0.0. This is a release consisting of 428
> > resolved GitHub issues[1].
> >
> > This release candidate is based on commit:
> > ac2d207611ce25c91fb9fc90d5eaff2933609660 [2]
> >
> > The source release rc0 is hosted at [3].
> > The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
> > The changelog is located at [12].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. See [13] for how to validate a release candidate.
> >
> > See also a verification result on GitHub pull request [14].
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow 13.0.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow 13.0.0 because...
> >
> > [1]: https://github.com/apache/arrow/issues?q=is%3Aissue+milestone%3A13.0.0+is%3Aclosed
> > [2]: https://github.com/apache/arrow/tree/ac2d207611ce25c91fb9fc90d5eaff2933609660
> > [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-13.0.0-rc0
> > [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> > [5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
> > [6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
> > [7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> > [8]: https://apache.jfrog.io/artifactory/arrow/java-rc/13.0.0-rc0
> > [9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/13.0.0-rc0
> > [10]: https://apache.jfrog.io/artifactory/arrow/python-rc/13.0.0-rc0
> > [11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> > [12]: https://github.com/apache/arrow/blob/ac2d207611ce25c91fb9fc90d5eaff2933609660/CHANGELOG.md
> > [13]: https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
> > [14]: https://github.com/apache/arrow/pull/36775
Re: [VOTE] Release Apache Arrow 13.0.0 - RC0
Hi,

I will create a new RC with the C# fix
(https://github.com/apache/arrow/issues/36812) and some other minor
issues identified to help with the release, like:

* https://github.com/apache/arrow/issues/35292
* https://github.com/apache/arrow/issues/36832
* https://github.com/apache/arrow/issues/36839
* https://github.com/apache/arrow/pull/36801

Thanks all!

On Mon, Jul 24, 2023 at 9:50, Sutou Kouhei wrote:

> +1
>
> I ran the followings on Debian GNU/Linux sid:
>
>   * TEST_DEFAULT=0 \
>       TEST_SOURCE=1 \
>       LANG=C \
>       TZ=UTC \
>       CUDAToolkit_ROOT=/usr \
>       ARROW_CMAKE_OPTIONS="-DBoost_NO_BOOST_CMAKE=ON -Dxsimd_SOURCE=BUNDLED" \
>       dev/release/verify-release-candidate.sh 13.0.0 0
>
>   * TEST_DEFAULT=0 \
>       TEST_APT=1 \
>       LANG=C \
>       dev/release/verify-release-candidate.sh 13.0.0 0
>
>   * TEST_DEFAULT=0 \
>       TEST_BINARY=1 \
>       LANG=C \
>       dev/release/verify-release-candidate.sh 13.0.0 0
>
>   * TEST_DEFAULT=0 \
>       TEST_JARS=1 \
>       LANG=C \
>       dev/release/verify-release-candidate.sh 13.0.0 0
>
>   * TEST_DEFAULT=0 \
>       TEST_PYTHON_VERSIONS=3.11 \
>       TEST_WHEELS=1 \
>       LANG=C \
>       dev/release/verify-release-candidate.sh 13.0.0 0
>
>   * TEST_DEFAULT=0 \
>       TEST_YUM=1 \
>       LANG=C \
>       dev/release/verify-release-candidate.sh 13.0.0 0
>
> with:
>
>   * .NET SDK (6.0.411)
>   * Python 3.11.4
>   * gcc (Debian 12.3.0-4) 12.3.0
>   * nvidia-cuda-dev 11.8.99~11.8.0-4
>   * openjdk version "17.0.7" 2023-04-18
>   * ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux-gnu]
>
> Notes:
>
>   * I needed https://github.com/apache/arrow/pull/36836
>     but it's still a workaround.
>     We need a response from INFRA:
>     https://issues.apache.org/jira/browse/INFRA-24569
>
>   * https://github.com/apache/arrow/pull/36833 isn't a
>     blocker but it's a nice-to-have when we need RC1.
>
>
> Thanks,
> --
> kou
>
> In "[VOTE] Release Apache Arrow 13.0.0 - RC0" on Fri, 21 Jul 2023 11:45:21 +0200,
>   Raúl Cumplido wrote:
>
> > Hi,
> >
> > I would like to propose the following release candidate (RC0) of Apache
> > Arrow version 13.0.0. This is a release consisting of 428
> > resolved GitHub issues[1].
> >
> > This release candidate is based on commit:
> > ac2d207611ce25c91fb9fc90d5eaff2933609660 [2]
> >
> > The source release rc0 is hosted at [3].
> > The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
> > The changelog is located at [12].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. See [13] for how to validate a release candidate.
> >
> > See also a verification result on GitHub pull request [14].
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow 13.0.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow 13.0.0 because...
> >
> > [1]: https://github.com/apache/arrow/issues?q=is%3Aissue+milestone%3A13.0.0+is%3Aclosed
> > [2]: https://github.com/apache/arrow/tree/ac2d207611ce25c91fb9fc90d5eaff2933609660
> > [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-13.0.0-rc0
> > [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> > [5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
> > [6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
> > [7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> > [8]: https://apache.jfrog.io/artifactory/arrow/java-rc/13.0.0-rc0
> > [9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/13.0.0-rc0
> > [10]: https://apache.jfrog.io/artifactory/arrow/python-rc/13.0.0-rc0
> > [11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> > [12]: https://github.com/apache/arrow/blob/ac2d207611ce25c91fb9fc90d5eaff2933609660/CHANGELOG.md
> > [13]: https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
> > [14]: https://github.com/apache/arrow/pull/36775
Re: hashing Arrow structures
> Also, I don't understand why there are two versions of the hash table
> ("hashing32" and "hashing64" apparently). What's the rationale? How is
> the user meant to choose between them? Say a Substrait plan is being
> executed: which hashing variant is chosen and why?

It's not user-configurable. The hash-join and hash-group-by always use
the 32-bit variant. The asof-join always uses the 64-bit variant. I
wouldn't stress too much about the hash-join. It is a very
memory-intensive operation, and my guess is that by the time you have
enough keys to worry about hash uniqueness you should probably be doing
an out-of-core join anyway. The hash-join implementation is also fairly
tolerant of duplicate keys. I believe our hash-join performance is
unlikely to be the bottleneck in most cases.

It might make more sense to use the 64-bit variant for the group-by, as
we are normally only storing the hash-to-group-id table itself in those
cases. Solid benchmarking would probably be needed regardless.

On Mon, Jul 24, 2023 at 1:19 AM Antoine Pitrou wrote:

> Hi,
>
> On 21/07/2023 at 15:58, Yaron Gvili wrote:
> > A first approach I found is using `Hashing32` and `Hashing64`. This
> > approach seems to be useful for hashing the fields composing a key of
> > multiple rows when joining. However, it has a couple of drawbacks. One
> > drawback is that if the number of distinct keys is large (like in the
> > scale of a million or so) then the probability of hash collision may no
> > longer be acceptable for some applications, more so when using
> > `Hashing32`. Another drawback that I noticed in my experiments is that
> > the common `N/A` and `0` integer values both hash to 0 and thus collide.
>
> Ouch... so if N/A does have the same hash value as a common non-null
> value (0), this should be fixed.
>
> Also, I don't understand why there are two versions of the hash table
> ("hashing32" and "hashing64" apparently). What's the rationale? How is
> the user meant to choose between them? Say a Substrait plan is being
> executed: which hashing variant is chosen and why?
>
> I don't think 32-bit hashing is a good idea when operating on large
> data. Unless the hash function is exceptionally good, you may get lots
> of hash collisions. It's nice to have a SIMD-accelerated hash table, but
> less so if access times degenerate to O(n)...
>
> So IMHO we should only have one hashing variant with a 64-bit output.
> And make sure it doesn't have trivial collisions on common data patterns
> (such as nulls and zeros, or clustered integer ranges).
>
> > A second approach I found is by serializing the Arrow structures
> > (possibly by streaming) and hashing using functions in `util/hashing.h`.
> > I didn't yet look into what properties these hash functions have except
> > for the documented high performance. In particular, I don't know whether
> > they have unfortunate hash collisions and, more generally, what is the
> > probability of hash collision. I also don't know whether they are
> > designed for efficient use in the context of joining.
>
> Those hash functions shouldn't have unfortunate hash, but they were not
> exercised on real-world data at the time. I have no idea whether they
> are efficient in the context of joining, as they have been written much
> earlier than our joining implementation.
>
> Regards
>
> Antoine.
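
For a rough sense of scale on the collision concern in this thread, here is a minimal sketch of the birthday-bound estimate. It assumes an ideal, uniformly distributed hash, which is an assumption of the sketch and not a measured property of `Hashing32` or `Hashing64`. The function name `collision_probability` is made up for illustration.

```python
import math


def collision_probability(num_keys: int, hash_bits: int) -> float:
    """Approximate chance that at least two of `num_keys` distinct keys
    share a hash value, assuming an ideal uniform hash (an assumption).

    Uses the birthday-bound approximation 1 - exp(-n(n-1) / 2^(b+1)).
    """
    space = 2.0 ** hash_bits
    expected_pairs = num_keys * (num_keys - 1) / (2.0 * space)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-expected_pairs)


# Compare the two hash widths for the "million or so" keys mentioned above.
for bits in (32, 64):
    print(bits, collision_probability(1_000_000, bits))
```

Under that idealized assumption, a million distinct keys give about 116 expected colliding pairs with a 32-bit hash, so at least one collision is all but certain, while the probability of any collision with a 64-bit hash is about 2.7e-8. A real hash function with weak spots on common data patterns would do worse than this estimate.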
Re: [VOTE][RUST][DataFusion] Release Apache Arrow DataFusion 28.0.0 RC1
+1 (binding)

Verified on x86_64 Mac.

Thank you very much Andy.

Andrew

On Sun, Jul 23, 2023 at 9:31 AM vin jake wrote:

> +1 (binding)
>
> Verified on my M1 macbook.
>
> Thanks Andy
Re: hashing Arrow structures
Hi,

On 21/07/2023 at 15:58, Yaron Gvili wrote:

> A first approach I found is using `Hashing32` and `Hashing64`. This
> approach seems to be useful for hashing the fields composing a key of
> multiple rows when joining. However, it has a couple of drawbacks. One
> drawback is that if the number of distinct keys is large (like in the
> scale of a million or so) then the probability of hash collision may no
> longer be acceptable for some applications, more so when using
> `Hashing32`. Another drawback that I noticed in my experiments is that
> the common `N/A` and `0` integer values both hash to 0 and thus collide.

Ouch... so if N/A does have the same hash value as a common non-null
value (0), this should be fixed.

Also, I don't understand why there are two versions of the hash table
("hashing32" and "hashing64" apparently). What's the rationale? How is
the user meant to choose between them? Say a Substrait plan is being
executed: which hashing variant is chosen and why?

I don't think 32-bit hashing is a good idea when operating on large
data. Unless the hash function is exceptionally good, you may get lots
of hash collisions. It's nice to have a SIMD-accelerated hash table, but
less so if access times degenerate to O(n)...

So IMHO we should only have one hashing variant with a 64-bit output.
And make sure it doesn't have trivial collisions on common data patterns
(such as nulls and zeros, or clustered integer ranges).

> A second approach I found is by serializing the Arrow structures
> (possibly by streaming) and hashing using functions in `util/hashing.h`.
> I didn't yet look into what properties these hash functions have except
> for the documented high performance. In particular, I don't know whether
> they have unfortunate hash collisions and, more generally, what is the
> probability of hash collision. I also don't know whether they are
> designed for efficient use in the context of joining.

Those hash functions shouldn't have unfortunate hash collisions, but
they were not exercised on real-world data at the time. I have no idea
whether they are efficient in the context of joining, as they were
written much earlier than our joining implementation.

Regards

Antoine.
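
The null-versus-zero collision described above can be illustrated with a toy example. This is not Arrow's `Hashing32`/`Hashing64` code: `naive_hash64` and `null_aware_hash64` are hypothetical helpers, and BLAKE2b merely stands in for whatever hash function a real implementation would use. The sketch only shows that folding a validity flag into the hashed bytes keeps a null from landing on the same hash as 0.

```python
import hashlib
import struct
from typing import Optional


def naive_hash64(value: Optional[int]) -> int:
    """Toy 64-bit hash that ignores validity: a null is hashed as if it
    were the integer 0, so the two collide by construction."""
    raw = 0 if value is None else value
    digest = hashlib.blake2b(struct.pack("<q", raw), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def null_aware_hash64(value: Optional[int]) -> int:
    """Toy 64-bit hash that prefixes a validity byte to the value bytes,
    so a null and a valid 0 are hashed from different inputs."""
    valid = value is not None
    payload = struct.pack("<?q", valid, value if valid else 0)
    digest = hashlib.blake2b(payload, digest_size=8).digest()
    return int.from_bytes(digest, "little")


# The naive variant cannot tell a null from 0; the null-aware one can.
print(naive_hash64(None) == naive_hash64(0))
print(null_aware_hash64(None) == null_aware_hash64(0))
```

Whether a fix in Arrow would take this shape is an open question in the thread; the point here is only that the validity information has to reach the hash input somehow.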
Re: [VOTE] Release Apache Arrow 13.0.0 - RC0
+1

I ran the following on Debian GNU/Linux sid:

  * TEST_DEFAULT=0 \
      TEST_SOURCE=1 \
      LANG=C \
      TZ=UTC \
      CUDAToolkit_ROOT=/usr \
      ARROW_CMAKE_OPTIONS="-DBoost_NO_BOOST_CMAKE=ON -Dxsimd_SOURCE=BUNDLED" \
      dev/release/verify-release-candidate.sh 13.0.0 0

  * TEST_DEFAULT=0 \
      TEST_APT=1 \
      LANG=C \
      dev/release/verify-release-candidate.sh 13.0.0 0

  * TEST_DEFAULT=0 \
      TEST_BINARY=1 \
      LANG=C \
      dev/release/verify-release-candidate.sh 13.0.0 0

  * TEST_DEFAULT=0 \
      TEST_JARS=1 \
      LANG=C \
      dev/release/verify-release-candidate.sh 13.0.0 0

  * TEST_DEFAULT=0 \
      TEST_PYTHON_VERSIONS=3.11 \
      TEST_WHEELS=1 \
      LANG=C \
      dev/release/verify-release-candidate.sh 13.0.0 0

  * TEST_DEFAULT=0 \
      TEST_YUM=1 \
      LANG=C \
      dev/release/verify-release-candidate.sh 13.0.0 0

with:

  * .NET SDK (6.0.411)
  * Python 3.11.4
  * gcc (Debian 12.3.0-4) 12.3.0
  * nvidia-cuda-dev 11.8.99~11.8.0-4
  * openjdk version "17.0.7" 2023-04-18
  * ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux-gnu]

Notes:

  * I needed https://github.com/apache/arrow/pull/36836
    but it's still a workaround.
    We need a response from INFRA:
    https://issues.apache.org/jira/browse/INFRA-24569

  * https://github.com/apache/arrow/pull/36833 isn't a
    blocker but it's a nice-to-have when we need RC1.

Thanks,
--
kou

In "[VOTE] Release Apache Arrow 13.0.0 - RC0" on Fri, 21 Jul 2023 11:45:21 +0200,
  Raúl Cumplido wrote:

> Hi,
>
> I would like to propose the following release candidate (RC0) of Apache
> Arrow version 13.0.0. This is a release consisting of 428
> resolved GitHub issues[1].
>
> This release candidate is based on commit:
> ac2d207611ce25c91fb9fc90d5eaff2933609660 [2]
>
> The source release rc0 is hosted at [3].
> The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
> The changelog is located at [12].
>
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [13] for how to validate a release candidate.
>
> See also a verification result on GitHub pull request [14].
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Release this as Apache Arrow 13.0.0
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow 13.0.0 because...
>
> [1]: https://github.com/apache/arrow/issues?q=is%3Aissue+milestone%3A13.0.0+is%3Aclosed
> [2]: https://github.com/apache/arrow/tree/ac2d207611ce25c91fb9fc90d5eaff2933609660
> [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-13.0.0-rc0
> [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> [5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
> [6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
> [7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> [8]: https://apache.jfrog.io/artifactory/arrow/java-rc/13.0.0-rc0
> [9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/13.0.0-rc0
> [10]: https://apache.jfrog.io/artifactory/arrow/python-rc/13.0.0-rc0
> [11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> [12]: https://github.com/apache/arrow/blob/ac2d207611ce25c91fb9fc90d5eaff2933609660/CHANGELOG.md
> [13]: https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
> [14]: https://github.com/apache/arrow/pull/36775