Re: [VOTE] Release Apache Arrow 13.0.0 - RC0

2023-07-24 Thread Gang Wu
Hi,

Sorry to reply without a vote.

I tried to run the verify-release-candidate.sh script on my M1 Mac,
but it took me forever to fix various environment issues. Would it be
better to verify this in a pure Docker environment instead?

Thanks,
Gang

On Tue, Jul 25, 2023 at 12:30 PM Yibo Cai wrote:

> +1.
>
> Verified C++/Python/Go source on Ubuntu 22.04 aarch64.
>
> TEST_DEFAULT=0 TEST_CPP=1 TEST_PYTHON=1 TEST_GO=1 \
> dev/release/verify-release-candidate.sh 13.0.0 0
>
> Ran into a non-blocking issue:
> https://github.com/apache/arrow/issues/36860
>
> On 7/21/23 17:49, Raúl Cumplido wrote:
> > Hi,
> >
> > As discussed during the community calls I have also triggered the
> > benchmark tests on the Pull Request for RC 0 [1].
> >
> > I am trying to get the conbench comparison between the 13.0.0 RC0 and
> > 12.0.1 RC1 (latest release) by having a chat with the conbench
> > maintainers. I'll share as soon as I have it.
> >
> > I wanted to share the Verification email as soon as possible so we can
> > start running the verification process.
> >
> > Thanks,
> > Raúl
> >
> > [1] https://github.com/apache/arrow/pull/36775#issuecomment-1645088676
> >
> > On Fri, 21 Jul 2023 at 11:45, Raúl Cumplido wrote:
> >>
> >> Hi,
> >>
> >> I would like to propose the following release candidate (RC0) of Apache
> >> Arrow version 13.0.0. This is a release consisting of 428
> >> resolved GitHub issues[1].
> >>
> >> This release candidate is based on commit:
> >> ac2d207611ce25c91fb9fc90d5eaff2933609660 [2]
> >>
> >> The source release rc0 is hosted at [3].
> >> The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
> >> The changelog is located at [12].
> >>
> >> Please download, verify checksums and signatures, run the unit tests,
> >> and vote on the release. See [13] for how to validate a release candidate.
> >>
> >> See also a verification result on GitHub pull request [14].
> >>
> >> The vote will be open for at least 72 hours.
> >>
> >> [ ] +1 Release this as Apache Arrow 13.0.0
> >> [ ] +0
> >> [ ] -1 Do not release this as Apache Arrow 13.0.0 because...
> >>
> >> [1]: https://github.com/apache/arrow/issues?q=is%3Aissue+milestone%3A13.0.0+is%3Aclosed
> >> [2]: https://github.com/apache/arrow/tree/ac2d207611ce25c91fb9fc90d5eaff2933609660
> >> [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-13.0.0-rc0
> >> [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> >> [5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
> >> [6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
> >> [7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> >> [8]: https://apache.jfrog.io/artifactory/arrow/java-rc/13.0.0-rc0
> >> [9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/13.0.0-rc0
> >> [10]: https://apache.jfrog.io/artifactory/arrow/python-rc/13.0.0-rc0
> >> [11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> >> [12]: https://github.com/apache/arrow/blob/ac2d207611ce25c91fb9fc90d5eaff2933609660/CHANGELOG.md
> >> [13]: https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
> >> [14]: https://github.com/apache/arrow/pull/36775
>


Re: [VOTE] Release Apache Arrow 13.0.0 - RC0

2023-07-24 Thread Yibo Cai

+1.

Verified C++/Python/Go source on Ubuntu 22.04 aarch64.

TEST_DEFAULT=0 TEST_CPP=1 TEST_PYTHON=1 TEST_GO=1 \
dev/release/verify-release-candidate.sh 13.0.0 0

Ran into a non-blocking issue:
https://github.com/apache/arrow/issues/36860

On 7/21/23 17:49, Raúl Cumplido wrote:

Hi,

As discussed during the community calls I have also triggered the
benchmark tests on the Pull Request for RC 0 [1].

I am trying to get the conbench comparison between the 13.0.0 RC0 and
12.0.1 RC1 (latest release) by having a chat with the conbench
maintainers. I'll share as soon as I have it.

I wanted to share the Verification email as soon as possible so we can
start running the verification process.

Thanks,
Raúl

[1] https://github.com/apache/arrow/pull/36775#issuecomment-1645088676

On Fri, 21 Jul 2023 at 11:45, Raúl Cumplido wrote:


Hi,

I would like to propose the following release candidate (RC0) of Apache
Arrow version 13.0.0. This is a release consisting of 428
resolved GitHub issues[1].

This release candidate is based on commit:
ac2d207611ce25c91fb9fc90d5eaff2933609660 [2]

The source release rc0 is hosted at [3].
The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
The changelog is located at [12].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. See [13] for how to validate a release candidate.

See also a verification result on GitHub pull request [14].

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow 13.0.0
[ ] +0
[ ] -1 Do not release this as Apache Arrow 13.0.0 because...

[1]: https://github.com/apache/arrow/issues?q=is%3Aissue+milestone%3A13.0.0+is%3Aclosed
[2]: https://github.com/apache/arrow/tree/ac2d207611ce25c91fb9fc90d5eaff2933609660
[3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-13.0.0-rc0
[4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
[5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
[6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
[7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
[8]: https://apache.jfrog.io/artifactory/arrow/java-rc/13.0.0-rc0
[9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/13.0.0-rc0
[10]: https://apache.jfrog.io/artifactory/arrow/python-rc/13.0.0-rc0
[11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
[12]: https://github.com/apache/arrow/blob/ac2d207611ce25c91fb9fc90d5eaff2933609660/CHANGELOG.md
[13]: https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
[14]: https://github.com/apache/arrow/pull/36775
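
The "verify checksums" step requested in the vote email can be illustrated
with a minimal Python sketch. It assumes each artifact ships with a checksum
file whose first token is the hex SHA-256 digest; the function names are
hypothetical, and this is not how verify-release-candidate.sh itself is
implemented. Signature verification is a separate step that is not covered
here.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(artifact, checksum_file):
    """Compare an artifact's digest with the first token of a checksum file.

    Assumes the checksum file looks like "<hex digest>  <file name>".
    """
    expected = Path(checksum_file).read_text().split()[0].lower()
    return sha256_of(artifact) == expected
```

Reading in chunks keeps memory use flat for large source tarballs.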


Re: [VOTE] Release Apache Arrow 13.0.0 - RC0

2023-07-24 Thread Raúl Cumplido
Hi,

I will create a new RC with the C# fix
(https://github.com/apache/arrow/issues/36812) and fixes for some other minor
issues that were identified and will help with the release, such as:
* https://github.com/apache/arrow/issues/35292
* https://github.com/apache/arrow/issues/36832
* https://github.com/apache/arrow/issues/36839
* https://github.com/apache/arrow/pull/36801

Thanks all!

On Mon, 24 Jul 2023 at 9:50, Sutou Kouhei wrote:
>
> +1
>
> I ran the following on Debian GNU/Linux sid:
>
>   * TEST_DEFAULT=0 \
>     TEST_SOURCE=1 \
>     LANG=C \
>     TZ=UTC \
>     CUDAToolkit_ROOT=/usr \
>     ARROW_CMAKE_OPTIONS="-DBoost_NO_BOOST_CMAKE=ON -Dxsimd_SOURCE=BUNDLED" \
>     dev/release/verify-release-candidate.sh 13.0.0 0
>
>   * TEST_DEFAULT=0 \
>     TEST_APT=1 \
>     LANG=C \
>     dev/release/verify-release-candidate.sh 13.0.0 0
>
>   * TEST_DEFAULT=0 \
>     TEST_BINARY=1 \
>     LANG=C \
>     dev/release/verify-release-candidate.sh 13.0.0 0
>
>   * TEST_DEFAULT=0 \
>     TEST_JARS=1 \
>     LANG=C \
>     dev/release/verify-release-candidate.sh 13.0.0 0
>
>   * TEST_DEFAULT=0 \
>     TEST_PYTHON_VERSIONS=3.11 \
>     TEST_WHEELS=1 \
>     LANG=C \
>     dev/release/verify-release-candidate.sh 13.0.0 0
>
>   * TEST_DEFAULT=0 \
>     TEST_YUM=1 \
>     LANG=C \
>     dev/release/verify-release-candidate.sh 13.0.0 0
>
> with:
>
>   * .NET SDK (6.0.411)
>   * Python 3.11.4
>   * gcc (Debian 12.3.0-4) 12.3.0
>   * nvidia-cuda-dev 11.8.99~11.8.0-4
>   * openjdk version "17.0.7" 2023-04-18
>   * ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux-gnu]
>
> Notes:
>
>   * I needed https://github.com/apache/arrow/pull/36836
>     but it's still a workaround.
>     We need a response from INFRA:
>     https://issues.apache.org/jira/browse/INFRA-24569
>
>   * https://github.com/apache/arrow/pull/36833 isn't a
>     blocker but it's a nice-to-have when we need RC1.
>
>
> Thanks,
> --
> kou
>


Re: hashing Arrow structures

2023-07-24 Thread Weston Pace
> Also, I don't understand why there are two versions of the hash table
> ("hashing32" and "hashing64" apparently). What's the rationale? How is
> the user meant to choose between them? Say a Substrait plan is being
> executed: which hashing variant is chosen and why?

It's not user-configurable.  The hash-join and hash-group-by always use the
32-bit variant.  The asof-join always uses the 64-bit variant.  I wouldn't
stress too much about the hash-join.  It is a very memory-intensive
operation, and my guess is that by the time you have enough keys to worry
about hash uniqueness you should probably be doing an out-of-core join
anyway.  The hash-join implementation is also fairly tolerant of duplicate
keys.  I believe our hash-join performance is unlikely to be the
bottleneck in most cases.

It might make more sense to use the 64-bit variant for the group-by, as we
are normally only storing the hash-to-group-id table itself in those
cases.  Solid benchmarking would probably be needed regardless.
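
To put rough numbers on "enough keys to worry about hash uniqueness", here
is a small Python sketch of the standard birthday-bound estimate. It assumes
an ideal, uniformly distributed hash, which is an optimistic assumption and
says nothing about the actual `Hashing32`/`Hashing64` functions; the helper
names are made up for illustration.

```python
import math


def expected_colliding_pairs(num_keys, hash_bits):
    """Expected number of key pairs sharing a hash under a uniform hash."""
    pairs = num_keys * (num_keys - 1) / 2
    return pairs / 2.0 ** hash_bits


def collision_probability(num_keys, hash_bits):
    """Birthday-bound approximation of P(at least one collision)."""
    # expm1 keeps precision when the probability is tiny (the 64-bit case).
    return -math.expm1(-expected_colliding_pairs(num_keys, hash_bits))


for bits in (32, 64):
    pairs = expected_colliding_pairs(1_000_000, bits)
    prob = collision_probability(1_000_000, bits)
    print(f"{bits}-bit: ~{pairs:.3g} expected colliding pairs, P(any) ~ {prob:.3g}")
```

With a million distinct keys, this estimate gives on the order of a hundred
colliding pairs for a 32-bit hash, while for a 64-bit hash a single collision
is already very unlikely.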

On Mon, Jul 24, 2023 at 1:19 AM Antoine Pitrou wrote:

>
> Hi,
>
> On 21/07/2023 at 15:58, Yaron Gvili wrote:
> > A first approach I found is using `Hashing32` and `Hashing64`. This
> > approach seems to be useful for hashing the fields composing a key of
> > multiple rows when joining. However, it has a couple of drawbacks. One
> > drawback is that if the number of distinct keys is large (on the order
> > of a million or so) then the probability of hash collision may no longer
> > be acceptable for some applications, more so when using `Hashing32`.
> > Another drawback that I noticed in my experiments is that the common
> > `N/A` and `0` integer values both hash to 0 and thus collide.
>
> Ouch... so if N/A does have the same hash value as a common non-null
> value (0), this should be fixed.
>
> Also, I don't understand why there are two versions of the hash table
> ("hashing32" and "hashing64" apparently). What's the rationale? How is
> the user meant to choose between them? Say a Substrait plan is being
> executed: which hashing variant is chosen and why?
>
> I don't think 32-bit hashing is a good idea when operating on large
> data. Unless the hash function is exceptionally good, you may get lots
> of hash collisions. It's nice to have a SIMD-accelerated hash table, but
> less so if access times degenerate to O(n)...
>
> So IMHO we should only have one hashing variant with a 64-bit output.
> And make sure it doesn't have trivial collisions on common data patterns
> (such as nulls and zeros, or clustered integer ranges).
>
> > A second approach I found is serializing the Arrow structures
> > (possibly by streaming) and hashing using functions in `util/hashing.h`.
> > I haven't yet looked into what properties these hash functions have
> > except for the documented high performance. In particular, I don't know
> > whether they have unfortunate hash collisions and, more generally, what
> > the probability of hash collision is. I also don't know whether they are
> > designed for efficient use in the context of joining.
>
> Those hash functions shouldn't have unfortunate hash collisions, but they
> were not exercised on real-world data at the time. I have no idea whether
> they are efficient in the context of joining, as they were written much
> earlier than our joining implementation.
>
> Regards
>
> Antoine.
>


Re: [VOTE][RUST][DataFusion] Release Apache Arrow DataFusion 28.0.0 RC1

2023-07-24 Thread Andrew Lamb
+1 (binding)

Verified on an x86_64 Mac

Thank you very much Andy.
Andrew

On Sun, Jul 23, 2023 at 9:31 AM vin jake wrote:

> +1 (binding)
>
> Verified on my M1 MacBook.
>
> Thanks Andy
>


Re: hashing Arrow structures

2023-07-24 Thread Antoine Pitrou



Hi,

On 21/07/2023 at 15:58, Yaron Gvili wrote:

> A first approach I found is using `Hashing32` and `Hashing64`. This
> approach seems to be useful for hashing the fields composing a key of
> multiple rows when joining. However, it has a couple of drawbacks. One
> drawback is that if the number of distinct keys is large (on the order
> of a million or so) then the probability of hash collision may no longer
> be acceptable for some applications, more so when using `Hashing32`.
> Another drawback that I noticed in my experiments is that the common
> `N/A` and `0` integer values both hash to 0 and thus collide.


Ouch... so if N/A does have the same hash value as a common non-null 
value (0), this should be fixed.


Also, I don't understand why there are two versions of the hash table 
("hashing32" and "hashing64" apparently). What's the rationale? How is 
the user meant to choose between them? Say a Substrait plan is being 
executed: which hashing variant is chosen and why?


I don't think 32-bit hashing is a good idea when operating on large 
data. Unless the hash function is exceptionally good, you may get lots 
of hash collisions. It's nice to have a SIMD-accelerated hash table, but 
less so if access times degenerate to O(n)...


So IMHO we should only have one hashing variant with a 64-bit output. 
And make sure it doesn't have trivial collisions on common data patterns 
(such as nulls and zeros, or clustered integer ranges).



> A second approach I found is serializing the Arrow structures
> (possibly by streaming) and hashing using functions in `util/hashing.h`.
> I haven't yet looked into what properties these hash functions have
> except for the documented high performance. In particular, I don't know
> whether they have unfortunate hash collisions and, more generally, what
> the probability of hash collision is. I also don't know whether they are
> designed for efficient use in the context of joining.


Those hash functions shouldn't have unfortunate hash collisions, but they
were not exercised on real-world data at the time. I have no idea whether
they are efficient in the context of joining, as they were written much
earlier than our joining implementation.


Regards

Antoine.
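
The null-versus-zero collision discussed above can be illustrated with a toy
Python sketch. Everything here is hypothetical: `mix64` is a SplitMix64-style
finalizer chosen only for illustration, and neither function reflects how
`Hashing32`/`Hashing64` are actually implemented in Arrow.

```python
MASK64 = (1 << 64) - 1


def mix64(x):
    """SplitMix64-style finalizer: scrambles a 64-bit integer."""
    x &= MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)


def naive_hash(value):
    """Treats a null (None) like 0, so the two collide by construction."""
    return mix64(0 if value is None else value)


# Arbitrary constant reserved for nulls (an assumption of this sketch).
NULL_HASH = mix64(0x9E3779B97F4A7C15)


def null_aware_hash(value):
    """Give nulls a dedicated hash; hash valid values as before."""
    return NULL_HASH if value is None else mix64(value)
```

In this sketch `naive_hash(None)` and `naive_hash(0)` are equal, while
`null_aware_hash` separates them. Note that a dedicated null hash only moves
the collision from the very common value 0 to one arbitrary, unlikely value;
it does not remove collisions altogether.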


Re: [VOTE] Release Apache Arrow 13.0.0 - RC0

2023-07-24 Thread Sutou Kouhei
+1

I ran the following on Debian GNU/Linux sid:

  * TEST_DEFAULT=0 \
    TEST_SOURCE=1 \
    LANG=C \
    TZ=UTC \
    CUDAToolkit_ROOT=/usr \
    ARROW_CMAKE_OPTIONS="-DBoost_NO_BOOST_CMAKE=ON -Dxsimd_SOURCE=BUNDLED" \
    dev/release/verify-release-candidate.sh 13.0.0 0

  * TEST_DEFAULT=0 \
    TEST_APT=1 \
    LANG=C \
    dev/release/verify-release-candidate.sh 13.0.0 0

  * TEST_DEFAULT=0 \
    TEST_BINARY=1 \
    LANG=C \
    dev/release/verify-release-candidate.sh 13.0.0 0

  * TEST_DEFAULT=0 \
    TEST_JARS=1 \
    LANG=C \
    dev/release/verify-release-candidate.sh 13.0.0 0

  * TEST_DEFAULT=0 \
    TEST_PYTHON_VERSIONS=3.11 \
    TEST_WHEELS=1 \
    LANG=C \
    dev/release/verify-release-candidate.sh 13.0.0 0

  * TEST_DEFAULT=0 \
    TEST_YUM=1 \
    LANG=C \
    dev/release/verify-release-candidate.sh 13.0.0 0

with:

  * .NET SDK (6.0.411)
  * Python 3.11.4
  * gcc (Debian 12.3.0-4) 12.3.0
  * nvidia-cuda-dev 11.8.99~11.8.0-4
  * openjdk version "17.0.7" 2023-04-18
  * ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux-gnu]

Notes:

  * I needed https://github.com/apache/arrow/pull/36836
    but it's still a workaround.
    We need a response from INFRA:
    https://issues.apache.org/jira/browse/INFRA-24569

  * https://github.com/apache/arrow/pull/36833 isn't a
    blocker but it's a nice-to-have when we need RC1.


Thanks,
-- 
kou
