Re: [VOTE] Release Apache Arrow ADBC 0.5.1 - RC1

2023-06-23 Thread Dewey Dunnington
+1!

I ran USE_CONDA=1 TEST_APT=0 TEST_YUM=0 ./verify-release-candidate.sh
0.5.1 1 on macOS M1.

On Fri, Jun 23, 2023 at 8:50 PM David Li  wrote:
>
> My vote: +1 (Ubuntu Linux 20.04/x86_64; macOS 13.4/AArch64)
>
> On Fri, Jun 23, 2023, at 17:51, Matt Topol wrote:
> > +1 tested on Pop!_OS 22.04 with go 1.19
> >
> > On Fri, Jun 23, 2023, 4:52 PM Sutou Kouhei  wrote:
> >
> >> +1
> >>
> >> I ran the following on Debian GNU/Linux sid:
> >>
> >>   JAVA_HOME=/usr/lib/jvm/default-java \
> >> dev/release/verify-release-candidate.sh 0.5.1 1
> >>
> >> with:
> >>
> >>   * Python 3.11.4
> >>   * g++ (Debian 12.3.0-4) 12.3.0
> >>   * go version go1.20.5 linux/amd64
> >>   * openjdk version "17.0.7" 2023-04-18
> >>   * ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux-gnu]
> >>   * R version 4.3.1 (2023-06-16) -- "Beagle Scouts"
> >>
> >> Thanks,
> >> --
> >> kou
> >>
> >> In <321c4e07-60a1-402d-9574-8437b462e...@app.fastmail.com>
> >>   "[VOTE] Release Apache Arrow ADBC 0.5.1 - RC1" on Thu, 22 Jun 2023
> >> 22:08:56 -0400,
> >>   "David Li"  wrote:
> >>
> >> > (I originally sent this with the wrong email, but it appears to have
> >> been swallowed. Apologies if this ends up being a duplicate.)
> >> >
> >> > I would like to propose the following release candidate (RC1) of Apache
> >> Arrow ADBC version 0.5.1. This is a release consisting of 8 resolved GitHub
> >> issues [1]. The main motivation is to release a fix in the Snowflake
> >> driver, as mentioned in an earlier thread.
> >> >
> >> > This release candidate is based on commit:
> >> 01c2f1eb281e8fb003f2d32096a6b0fe336128a9 [2]
> >> > (Note I had to manually patch one script; this will be resolved in
> >> future releases.)
> >> >
> >> > The source release rc1 is hosted at [3].
> >> > The binary artifacts are hosted at [4][5][6][7][8].
> >> > The changelog is located at [9].
> >> >
> >> > Please download, verify checksums and signatures, run the unit tests,
> >> and vote on the release. See [10] for how to validate a release candidate.
> >> >
> >> > See also a verification result on GitHub Actions [11].
> >> >
> >> > The vote will be open for at least 72 hours.
> >> >
> >> > [ ] +1 Release this as Apache Arrow ADBC 0.5.1
> >> > [ ] +0
> >> > [ ] -1 Do not release this as Apache Arrow ADBC 0.5.1 because...
> >> >
> >> > Note: to verify APT/YUM packages on macOS/AArch64, you must `export
> >> DOCKER_DEFAULT_PLATFORM=linux/amd64`. (Or skip this step by `export
> >> TEST_APT=0 TEST_YUM=0`.)
> >> >
> >> > [1]:
> >> https://github.com/apache/arrow-adbc/issues?q=is%3Aissue+milestone%3A%22ADBC+Libraries+0.5.1%22+is%3Aclosed
> >> > [2]:
> >> https://github.com/apache/arrow-adbc/commit/01c2f1eb281e8fb003f2d32096a6b0fe336128a9
> >> > [3]:
> >> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-adbc-0.5.1-rc1/
> >> > [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> >> > [5]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> >> > [6]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> >> > [7]:
> >> https://repository.apache.org/content/repositories/staging/org/apache/arrow/adbc/
> >> > [8]:
> >> https://github.com/apache/arrow-adbc/releases/tag/apache-arrow-adbc-0.5.1-rc1
> >> > [9]:
> >> https://github.com/apache/arrow-adbc/blob/apache-arrow-adbc-0.5.1-rc1/CHANGELOG.md
> >> > [10]:
> >> https://arrow.apache.org/adbc/main/development/releasing.html#how-to-verify-release-candidates
> >> > [11]: https://github.com/apache/arrow-adbc/actions/runs/5351160439
> >>


Re: [VOTE] Release Apache Arrow ADBC 0.5.1 - RC1

2023-06-23 Thread David Li
My vote: +1 (Ubuntu Linux 20.04/x86_64; macOS 13.4/AArch64)

On Fri, Jun 23, 2023, at 17:51, Matt Topol wrote:
> +1 tested on Pop!_OS 22.04 with go 1.19
>
> On Fri, Jun 23, 2023, 4:52 PM Sutou Kouhei  wrote:
>
>> +1
>>
>> I ran the following on Debian GNU/Linux sid:
>>
>>   JAVA_HOME=/usr/lib/jvm/default-java \
>> dev/release/verify-release-candidate.sh 0.5.1 1
>>
>> with:
>>
>>   * Python 3.11.4
>>   * g++ (Debian 12.3.0-4) 12.3.0
>>   * go version go1.20.5 linux/amd64
>>   * openjdk version "17.0.7" 2023-04-18
>>   * ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux-gnu]
>>   * R version 4.3.1 (2023-06-16) -- "Beagle Scouts"
>>
>> Thanks,
>> --
>> kou
>>
>> In <321c4e07-60a1-402d-9574-8437b462e...@app.fastmail.com>
>>   "[VOTE] Release Apache Arrow ADBC 0.5.1 - RC1" on Thu, 22 Jun 2023
>> 22:08:56 -0400,
>>   "David Li"  wrote:
>>
>> > (I originally sent this with the wrong email, but it appears to have
>> been swallowed. Apologies if this ends up being a duplicate.)
>> >
>> > I would like to propose the following release candidate (RC1) of Apache
>> Arrow ADBC version 0.5.1. This is a release consisting of 8 resolved GitHub
>> issues [1]. The main motivation is to release a fix in the Snowflake
>> driver, as mentioned in an earlier thread.
>> >
>> > This release candidate is based on commit:
>> 01c2f1eb281e8fb003f2d32096a6b0fe336128a9 [2]
>> > (Note I had to manually patch one script; this will be resolved in
>> future releases.)
>> >
>> > The source release rc1 is hosted at [3].
>> > The binary artifacts are hosted at [4][5][6][7][8].
>> > The changelog is located at [9].
>> >
>> > Please download, verify checksums and signatures, run the unit tests,
>> and vote on the release. See [10] for how to validate a release candidate.
>> >
>> > See also a verification result on GitHub Actions [11].
>> >
>> > The vote will be open for at least 72 hours.
>> >
>> > [ ] +1 Release this as Apache Arrow ADBC 0.5.1
>> > [ ] +0
>> > [ ] -1 Do not release this as Apache Arrow ADBC 0.5.1 because...
>> >
>> > Note: to verify APT/YUM packages on macOS/AArch64, you must `export
>> DOCKER_DEFAULT_PLATFORM=linux/amd64`. (Or skip this step by `export
>> TEST_APT=0 TEST_YUM=0`.)
>> >
>> > [1]:
>> https://github.com/apache/arrow-adbc/issues?q=is%3Aissue+milestone%3A%22ADBC+Libraries+0.5.1%22+is%3Aclosed
>> > [2]:
>> https://github.com/apache/arrow-adbc/commit/01c2f1eb281e8fb003f2d32096a6b0fe336128a9
>> > [3]:
>> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-adbc-0.5.1-rc1/
>> > [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
>> > [5]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
>> > [6]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
>> > [7]:
>> https://repository.apache.org/content/repositories/staging/org/apache/arrow/adbc/
>> > [8]:
>> https://github.com/apache/arrow-adbc/releases/tag/apache-arrow-adbc-0.5.1-rc1
>> > [9]:
>> https://github.com/apache/arrow-adbc/blob/apache-arrow-adbc-0.5.1-rc1/CHANGELOG.md
>> > [10]:
>> https://arrow.apache.org/adbc/main/development/releasing.html#how-to-verify-release-candidates
>> > [11]: https://github.com/apache/arrow-adbc/actions/runs/5351160439
>>


Re: [VOTE] Release Apache Arrow ADBC 0.5.1 - RC1

2023-06-23 Thread Matt Topol
+1 tested on Pop!_OS 22.04 with go 1.19

On Fri, Jun 23, 2023, 4:52 PM Sutou Kouhei  wrote:

> +1
>
> I ran the following on Debian GNU/Linux sid:
>
>   JAVA_HOME=/usr/lib/jvm/default-java \
> dev/release/verify-release-candidate.sh 0.5.1 1
>
> with:
>
>   * Python 3.11.4
>   * g++ (Debian 12.3.0-4) 12.3.0
>   * go version go1.20.5 linux/amd64
>   * openjdk version "17.0.7" 2023-04-18
>   * ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux-gnu]
>   * R version 4.3.1 (2023-06-16) -- "Beagle Scouts"
>
> Thanks,
> --
> kou
>
> In <321c4e07-60a1-402d-9574-8437b462e...@app.fastmail.com>
>   "[VOTE] Release Apache Arrow ADBC 0.5.1 - RC1" on Thu, 22 Jun 2023
> 22:08:56 -0400,
>   "David Li"  wrote:
>
> > (I originally sent this with the wrong email, but it appears to have
> been swallowed. Apologies if this ends up being a duplicate.)
> >
> > I would like to propose the following release candidate (RC1) of Apache
> Arrow ADBC version 0.5.1. This is a release consisting of 8 resolved GitHub
> issues [1]. The main motivation is to release a fix in the Snowflake
> driver, as mentioned in an earlier thread.
> >
> > This release candidate is based on commit:
> 01c2f1eb281e8fb003f2d32096a6b0fe336128a9 [2]
> > (Note I had to manually patch one script; this will be resolved in
> future releases.)
> >
> > The source release rc1 is hosted at [3].
> > The binary artifacts are hosted at [4][5][6][7][8].
> > The changelog is located at [9].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [10] for how to validate a release candidate.
> >
> > See also a verification result on GitHub Actions [11].
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow ADBC 0.5.1
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow ADBC 0.5.1 because...
> >
> > Note: to verify APT/YUM packages on macOS/AArch64, you must `export
> DOCKER_DEFAULT_PLATFORM=linux/amd64`. (Or skip this step by `export
> TEST_APT=0 TEST_YUM=0`.)
> >
> > [1]:
> https://github.com/apache/arrow-adbc/issues?q=is%3Aissue+milestone%3A%22ADBC+Libraries+0.5.1%22+is%3Aclosed
> > [2]:
> https://github.com/apache/arrow-adbc/commit/01c2f1eb281e8fb003f2d32096a6b0fe336128a9
> > [3]:
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-adbc-0.5.1-rc1/
> > [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> > [5]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> > [6]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> > [7]:
> https://repository.apache.org/content/repositories/staging/org/apache/arrow/adbc/
> > [8]:
> https://github.com/apache/arrow-adbc/releases/tag/apache-arrow-adbc-0.5.1-rc1
> > [9]:
> https://github.com/apache/arrow-adbc/blob/apache-arrow-adbc-0.5.1-rc1/CHANGELOG.md
> > [10]:
> https://arrow.apache.org/adbc/main/development/releasing.html#how-to-verify-release-candidates
> > [11]: https://github.com/apache/arrow-adbc/actions/runs/5351160439
>


Re: [VOTE] Release Apache Arrow ADBC 0.5.1 - RC1

2023-06-23 Thread Sutou Kouhei
+1

I ran the following on Debian GNU/Linux sid:

  JAVA_HOME=/usr/lib/jvm/default-java \
dev/release/verify-release-candidate.sh 0.5.1 1

with:

  * Python 3.11.4
  * g++ (Debian 12.3.0-4) 12.3.0
  * go version go1.20.5 linux/amd64
  * openjdk version "17.0.7" 2023-04-18
  * ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux-gnu]
  * R version 4.3.1 (2023-06-16) -- "Beagle Scouts"

Thanks,
-- 
kou

In <321c4e07-60a1-402d-9574-8437b462e...@app.fastmail.com>
  "[VOTE] Release Apache Arrow ADBC 0.5.1 - RC1" on Thu, 22 Jun 2023 22:08:56 
-0400,
  "David Li"  wrote:

> (I originally sent this with the wrong email, but it appears to have been 
> swallowed. Apologies if this ends up being a duplicate.)
> 
> I would like to propose the following release candidate (RC1) of Apache Arrow 
> ADBC version 0.5.1. This is a release consisting of 8 resolved GitHub issues 
> [1]. The main motivation is to release a fix in the Snowflake driver, as 
> mentioned in an earlier thread.
> 
> This release candidate is based on commit: 
> 01c2f1eb281e8fb003f2d32096a6b0fe336128a9 [2]
> (Note I had to manually patch one script; this will be resolved in future 
> releases.)
> 
> The source release rc1 is hosted at [3].
> The binary artifacts are hosted at [4][5][6][7][8].
> The changelog is located at [9].
> 
> Please download, verify checksums and signatures, run the unit tests, and 
> vote on the release. See [10] for how to validate a release candidate.
> 
> See also a verification result on GitHub Actions [11].
> 
> The vote will be open for at least 72 hours.
> 
> [ ] +1 Release this as Apache Arrow ADBC 0.5.1
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow ADBC 0.5.1 because...
> 
> Note: to verify APT/YUM packages on macOS/AArch64, you must `export 
> DOCKER_DEFAULT_PLATFORM=linux/amd64`. (Or skip this step by `export 
> TEST_APT=0 TEST_YUM=0`.)
> 
> [1]: 
> https://github.com/apache/arrow-adbc/issues?q=is%3Aissue+milestone%3A%22ADBC+Libraries+0.5.1%22+is%3Aclosed
> [2]: 
> https://github.com/apache/arrow-adbc/commit/01c2f1eb281e8fb003f2d32096a6b0fe336128a9
> [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-adbc-0.5.1-rc1/
> [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> [5]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> [6]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> [7]: 
> https://repository.apache.org/content/repositories/staging/org/apache/arrow/adbc/
> [8]: 
> https://github.com/apache/arrow-adbc/releases/tag/apache-arrow-adbc-0.5.1-rc1
> [9]: 
> https://github.com/apache/arrow-adbc/blob/apache-arrow-adbc-0.5.1-rc1/CHANGELOG.md
> [10]: 
> https://arrow.apache.org/adbc/main/development/releasing.html#how-to-verify-release-candidates
> [11]: https://github.com/apache/arrow-adbc/actions/runs/5351160439
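
The "verify checksums" step requested above can be sketched in a few lines of Python. This is only an illustration, not a replacement for verify-release-candidate.sh: it assumes a checksum file whose first whitespace-separated token is the hex SHA-512 digest (the layout written by `shasum -a 512`), it uses a stand-in file instead of a real release artifact, and it does not cover GPG signature verification. The helper names `sha512_of` and `checksum_matches` are made up for this example.

```python
import hashlib
import tempfile
from pathlib import Path


def sha512_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large tarballs do not need to fit in memory."""
    digest = hashlib.sha512()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(artifact: Path, checksum_file: Path) -> bool:
    """Compare the artifact's digest with the first token of the checksum file."""
    # Assumed layout: "<hex digest>  <file name>".
    expected = checksum_file.read_text().split()[0].lower()
    return sha512_of(artifact) == expected


# Self-contained demo with a stand-in file; a real check would point at the
# downloaded source tarball and its checksum file instead.
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "example.tar.gz"
    artifact.write_bytes(b"not a real release artifact")
    checksum_file = Path(tmp) / "example.tar.gz.sha512"
    checksum_file.write_text(f"{sha512_of(artifact)}  {artifact.name}\n")

    ok = checksum_matches(artifact, checksum_file)

    # Changing the artifact after the checksum was recorded must be detected.
    artifact.write_bytes(b"tampered")
    tampered_ok = checksum_matches(artifact, checksum_file)

print(ok, tampered_ok)
```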


Re: [Python][Discuss] PyArrow Dataset as a Python protocol

2023-06-23 Thread Weston Pace
> The trouble is that Dataset was not designed to serve as a
> general-purpose unmaterialized dataframe. For example, the PyArrow
> Dataset constructor [5] exposes options for specifying a list of
> source files and a partitioning scheme, which are irrelevant for many
> of the applications that Will anticipates. And some work is needed to
> reconcile the methods of the PyArrow Dataset object [6] with the
> methods of the Table object. Some methods like filter() are exposed by
> both and behave lazily on Datasets and eagerly on Tables, as a user
> might expect. But many other Table methods are not implemented for
> Dataset though they potentially could be, and it is unclear where we
> should draw the line between adding methods to Dataset vs. encouraging
> new scanner implementations to expose options controlling what lazy
> operations should be performed as they see fit.

In my mind there is a distinction between the "compute domain" (e.g. a
pandas dataframe or something like ibis or SQL) and the "data domain" (e.g.
pyarrow datasets).  I think, in a perfect world, you could push any and all
compute up and down the chain as far as possible.  However, in practice, I
think there is a healthy set of tools and libraries that say "simple column
projection and filtering is good enough".  I would argue that there is room
for both APIs and while the temptation is always present to "shove as much
compute as you can" I think pyarrow datasets seem to have found a balance
between the two that users like.

So I would argue that this protocol may never become a general-purpose
unmaterialized dataframe and that isn't necessarily a bad thing.

> they are splittable and serializable, so that fragments can be distributed
> amongst processes / workers.

Just to clarify, the proposal currently only requires the fragments to be
serializable, correct?

On Fri, Jun 23, 2023 at 11:48 AM Will Jones  wrote:

> Thanks Ian for your extensive feedback.
>
> > I strongly agree with the comments made by David,
> > Weston, and Dewey arguing that we should avoid any use of PyArrow
> > expressions in this API. Expressions are an implementation detail of
> > PyArrow, not a part of the Arrow standard. It would be much safer for
> > the initial version of this protocol to not define *any*
> > methods/arguments that take expressions.
> >
>
> I would agree with this point, if we were starting from scratch. But one of
> my goals is for this protocol to be descriptive of the existing dataset
> integrations in the ecosystem, which all currently rely on PyArrow
> expressions. For example, you'll notice in the PR that there are unit tests
> to verify the current PyArrow Dataset classes conform to this protocol,
> without changes.
>
> I think there are three routes we can go here:
>
> 1. We keep PyArrow expressions in the API initially, but once we have
> Substrait-based alternatives we deprecate the PyArrow expression support.
> This is what I intended with the current design, and I think it provides
> the most obvious migration paths for existing producers and consumers.
> 2. We keep the overall dataset API, but don't introduce the filter and
> projection arguments until we have Substrait support. I'm not sure what the
> migration path looks like for producers and consumers, but I think this
> just implicitly becomes the same as (1), but with worse documentation.
> 3. We write a protocol completely from scratch, that doesn't try to
> describe the existing dataset API. Producers and consumers would then
> migrate to use the new protocol and deprecate their existing dataset
> integrations. We could introduce a dunder method in that API (sort of like
> __arrow_array__) that would make the migration seamless from the end-user
> perspective.
>
> *Which do you all think is the best path forward?*
>
> > Another concern I have is that we have not fully explained why we want
> > to use Dataset instead of RecordBatchReader [9] as the basis of this
> > protocol. I would like to see an explanation of why RecordBatchReader
> > is not sufficient for this. RecordBatchReader seems like another
> > possible way to represent "unmaterialized dataframes" and there are
> > some parallels between RecordBatch/RecordBatchReader and
> > Fragment/Dataset.
> >
>
> This is a good point. I can add a section describing the differences. The
> main ones I can think of are that: (1) Datasets are "pruneable": one can
> select a subset of columns and apply a filter on rows to avoid IO and (2)
> they are splittable and serializable, so that fragments can be distributed
> amongst processes / workers.
>
> Best,
>
> Will Jones
>
> On Fri, Jun 23, 2023 at 10:48 AM Ian Cook  wrote:
>
> > Thanks Will for this proposal!
> >
> > For anyone familiar with PyArrow, this idea has a clear intuitive
> > logic to it. It provides an expedient solution to the current lack of
> > a practical means for interchanging "unmaterialized dataframes"
> > between different Python libraries.
> >
> > To elaborate on 

Re: [Python][Discuss] PyArrow Dataset as a Python protocol

2023-06-23 Thread Will Jones
Thanks Ian for your extensive feedback.

> I strongly agree with the comments made by David,
> Weston, and Dewey arguing that we should avoid any use of PyArrow
> expressions in this API. Expressions are an implementation detail of
> PyArrow, not a part of the Arrow standard. It would be much safer for
> the initial version of this protocol to not define *any*
> methods/arguments that take expressions.
>

I would agree with this point, if we were starting from scratch. But one of
my goals is for this protocol to be descriptive of the existing dataset
integrations in the ecosystem, which all currently rely on PyArrow
expressions. For example, you'll notice in the PR that there are unit tests
to verify the current PyArrow Dataset classes conform to this protocol,
without changes.

I think there are three routes we can go here:

1. We keep PyArrow expressions in the API initially, but once we have
Substrait-based alternatives we deprecate the PyArrow expression support.
This is what I intended with the current design, and I think it provides
the most obvious migration paths for existing producers and consumers.
2. We keep the overall dataset API, but don't introduce the filter and
projection arguments until we have Substrait support. I'm not sure what the
migration path looks like for producers and consumers, but I think this
just implicitly becomes the same as (1), but with worse documentation.
3. We write a protocol completely from scratch, that doesn't try to
describe the existing dataset API. Producers and consumers would then
migrate to use the new protocol and deprecate their existing dataset
integrations. We could introduce a dunder method in that API (sort of like
__arrow_array__) that would make the migration seamless from the end-user
perspective.

*Which do you all think is the best path forward?*

> Another concern I have is that we have not fully explained why we want
> to use Dataset instead of RecordBatchReader [9] as the basis of this
> protocol. I would like to see an explanation of why RecordBatchReader
> is not sufficient for this. RecordBatchReader seems like another
> possible way to represent "unmaterialized dataframes" and there are
> some parallels between RecordBatch/RecordBatchReader and
> Fragment/Dataset.
>

This is a good point. I can add a section describing the differences. The
main ones I can think of are that: (1) Datasets are "pruneable": one can
select a subset of columns and apply a filter on rows to avoid IO and (2)
they are splittable and serializable, so that fragments can be distributed
amongst processes / workers.
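
To make the two properties above concrete, here is a minimal, standard-library-only sketch. It is not the protocol from the PR: the names `Fragment`, `Dataset`, `ListFragment`, `ListDataset`, `scan`, `serialize`, and `scan_on_worker` are hypothetical, rows are plain dicts rather than Arrow record batches, and a plain Python callable stands in for the filter expression that this thread is still debating.

```python
import json
from dataclasses import dataclass
from typing import Callable, Iterator, Optional, Protocol, Sequence

Row = dict  # one row as {column name: value}; a stand-in for Arrow data


class Fragment(Protocol):
    """Hypothetical unit of work that can be shipped to another process."""

    def scan(
        self,
        columns: Optional[Sequence[str]] = None,
        predicate: Optional[Callable[[Row], bool]] = None,
    ) -> Iterator[Row]: ...

    def serialize(self) -> bytes: ...


class Dataset(Protocol):
    """Hypothetical protocol: a dataset is something that yields fragments."""

    def get_fragments(self) -> Iterator[Fragment]: ...


@dataclass
class ListFragment:
    """Toy fragment backed by an in-memory list of rows."""

    rows: list

    def scan(self, columns=None, predicate=None):
        for row in self.rows:
            # "Pruneable", part 1: skip rows the consumer did not ask for.
            if predicate is not None and not predicate(row):
                continue
            # "Pruneable", part 2: hand back only the requested columns.
            yield row if columns is None else {name: row[name] for name in columns}

    def serialize(self) -> bytes:
        return json.dumps(self.rows).encode("utf-8")

    @classmethod
    def deserialize(cls, payload: bytes) -> "ListFragment":
        return cls(json.loads(payload.decode("utf-8")))


@dataclass
class ListDataset:
    fragments: list

    def get_fragments(self):
        return iter(self.fragments)


def scan_on_worker(payload: bytes, columns, predicate) -> list:
    """Pretend worker: rebuilds one fragment from bytes and scans only that piece."""
    fragment = ListFragment.deserialize(payload)
    return list(fragment.scan(columns=columns, predicate=predicate))


dataset = ListDataset(
    [
        ListFragment([{"id": 1, "score": 0.5}, {"id": 2, "score": 0.9}]),
        ListFragment([{"id": 3, "score": 0.1}, {"id": 4, "score": 0.7}]),
    ]
)

# "Splittable and serializable": each fragment becomes its own payload, so the
# pieces could be sent to different workers.
payloads = [fragment.serialize() for fragment in dataset.get_fragments()]
results = [scan_on_worker(p, ["id"], lambda row: row["score"] > 0.6) for p in payloads]
print(results)
```

A RecordBatchReader, by contrast, is a single stream of batches, so there is no comparable per-piece handle to send to a worker.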

Best,

Will Jones

On Fri, Jun 23, 2023 at 10:48 AM Ian Cook  wrote:

> Thanks Will for this proposal!
>
> For anyone familiar with PyArrow, this idea has a clear intuitive
> logic to it. It provides an expedient solution to the current lack of
> a practical means for interchanging "unmaterialized dataframes"
> between different Python libraries.
>
> To elaborate on that: If you look at how people use the Arrow Dataset
> API—which is implemented in the Arrow C++ library [1] and has bindings
> not just for Python [2] but also for Java [3] and R [4]—you'll see
> that Dataset is often used simply as a "virtual" variant of Table. It
> is used in cases when the data is larger than memory or when it is
> desirable to defer reading (materializing) the data into memory.
>
> So we can think of a Table as a materialized dataframe and a Dataset
> as an unmaterialized dataframe. That aspect of Dataset is I think what
> makes it most attractive as a protocol for enabling interoperability:
> it allows libraries to easily "speak Arrow" in cases where
> materializing the full data in memory upfront is impossible or
> undesirable.
>
> The trouble is that Dataset was not designed to serve as a
> general-purpose unmaterialized dataframe. For example, the PyArrow
> Dataset constructor [5] exposes options for specifying a list of
> source files and a partitioning scheme, which are irrelevant for many
> of the applications that Will anticipates. And some work is needed to
> reconcile the methods of the PyArrow Dataset object [6] with the
> methods of the Table object. Some methods like filter() are exposed by
> both and behave lazily on Datasets and eagerly on Tables, as a user
> might expect. But many other Table methods are not implemented for
> Dataset though they potentially could be, and it is unclear where we
> should draw the line between adding methods to Dataset vs. encouraging
> new scanner implementations to expose options controlling what lazy
> operations should be performed as they see fit.
>
> Will, I see that you've already addressed this issue to some extent in
> your proposal. For example, you mention that we should initially
> define this protocol to include only a minimal subset of the Dataset
> API. I agree, but I think there are some loose ends we should be
> careful to tie up. I strongly agree with the comments made by David,
> Weston, and Dewey arguing that we 

[DISCUSS] Possibility of 12.0.2 release

2023-06-23 Thread Bryan Cutler
Hi All,

I recently became aware of a CVE
(https://github.com/advisories/GHSA-6mjq-h674-j845) in the Java Netty
libraries. Upgrading to the fixed Netty version, 4.1.94.Final,
required a patch to Arrow, which has already been merged in
https://github.com/apache/arrow/issues/36209.

I know the freeze for 13.0.0 is not too far away, but I wanted to check
whether there is interest in a 12.0.2 release in the meantime, and whether
there are any other pending issues that would make a patch release worthwhile.

Thanks,
Bryan


Re: [RESULT][VOTE] Release Apache Arrow nanoarrow 0.2.0 - RC1

2023-06-23 Thread Dewey Dunnington
Ok! Post-release tasks are complete. Thank you all!

[x] Closed GitHub milestone
[x] Added release to Apache Reporter System
[x] Uploaded artifacts to Subversion
[x] Created GitHub release
[x] Submitted R package to CRAN
[x] Sent announcement to annou...@apache.org
[x] Release blog post [1]
[x] Removed old artifacts from SVN
[x] Bumped versions on main

[1] https://arrow.apache.org/blog/2023/06/22/nanoarrow-0.120-release/

On Fri, Jun 23, 2023 at 9:28 AM Dewey Dunnington  wrote:
>
> Thanks for offering! Sorry for being slow to update the thread...David
> Li ran the upload script yesterday.
>
> -dewey
>
> On Thu, Jun 22, 2023 at 11:59 PM Sutou Kouhei  wrote:
> >
> > Hi,
> >
> > > I believe the upload step requires a PMC member to run the script
> >
> > I can do it. Can I run
> > https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/post-01-upload.sh
> > ?
> >
> >
> > Thanks,
> > --
> > kou
> >
> > In 
> >   "[RESULT][VOTE] Release Apache Arrow nanoarrow 0.2.0 - RC1" on Thu, 22 
> > Jun 2023 16:05:50 -0300,
> >   Dewey Dunnington  wrote:
> >
> > > Thank you everybody for verifying and voting! With 3 binding +1s and 3
> > > non-binding +1s, the vote passes! I have opened a PR to improve the
> > > verification instructions (particularly on conda where most problems
> > > occurred) [1].
> > >
> > > Apache Arrow nanoarrow 0.2.0 has the following post-release tasks. I
> > > believe the upload step requires a PMC member to run the script but
> > > the rest I'm happy to take care of!
> > >
> > > [x] Closed GitHub milestone
> > > [ ] Added release to Apache Reporter System
> > > [ ] Uploaded artifacts to Subversion
> > > [ ] Created GitHub release
> > > [ ] Submit R package to CRAN
> > > [ ] Sent announcement to annou...@apache.org
> > > [ ] Release blog post [2]
> > > [ ] Removed old artifacts from SVN
> > > [ ] Bumped versions on main
> > >
> > > [1] https://github.com/apache/arrow-nanoarrow/pull/243
> > > [2] https://github.com/apache/arrow-site/pull/364


Re: [ANNOUNCE] New Arrow PMC member: Dewey Dunnington

2023-06-23 Thread Dewey Dunnington
Thank you everybody for the welcome! I'm honoured!

On Fri, Jun 23, 2023 at 2:41 PM David Li  wrote:
>
> Welcome Dewey!
>
> On Fri, Jun 23, 2023, at 13:37, Weston Pace wrote:
> > Congrats Dewey!
> >
> > On Fri, Jun 23, 2023 at 9:00 AM Antoine Pitrou  wrote:
> >
> >>
> >> Welcome to the PMC Dewey!
> >>
> >>
> >> Le 23/06/2023 à 16:59, Joris Van den Bossche a écrit :
> >> > Congrats Dewey!
> >> >
> >> > On Fri, 23 Jun 2023 at 16:54, Jacob Wujciak-Jens
> >> >  wrote:
> >> >>
> >> >> Well deserved! Congratulations Dewey!
> >> >>
> >> >> Ian Cook  schrieb am Fr., 23. Juni 2023, 16:32:
> >> >>
> >> >>> Congratulations Dewey!
> >> >>>
> >> >>> On Fri, Jun 23, 2023 at 10:03 AM Matt Topol 
> >> >>> wrote:
> >> 
> >>  Congrats Dewey!!
> >> 
> >>  On Fri, Jun 23, 2023, 9:35 AM Dane Pitkin
> >> 
> >>  wrote:
> >> 
> >> > Congrats Dewey!
> >> >
> >> > On Fri, Jun 23, 2023 at 9:15 AM Nic Crane 
> >> wrote:
> >> >
> >> >> Well-deserved Dewey, congratulations!
> >> >>
> >> >> On Fri, 23 Jun 2023 at 11:53, Vibhatha Abeykoon
> >> >> wrote:
> >> >>
> >> >>> Congratulations Dewey!
> >> >>>
> >> >>> On Fri, Jun 23, 2023 at 4:16 PM Alenka Frim <
> >> >>> ale...@voltrondata.com
> >> >>> .invalid>
> >> >>> wrote:
> >> >>>
> >>  Congratulations Dewey!! 
> >> 
> >>  On Fri, Jun 23, 2023 at 12:10 PM Raúl Cumplido <
> >> > raulcumpl...@gmail.com
> >> >>>
> >>  wrote:
> >> 
> >> > Congratulations Dewey!
> >> >
> >> > El vie, 23 jun 2023, 11:55, Andrew Lamb 
> >> >>> escribió:
> >> >
> >> >> The Project Management Committee (PMC) for Apache Arrow has
> >> > invited
> >> >> Dewey Dunnington (paleolimbot) to become a PMC member and we
> >> >>> are
> >>  pleased
> >> > to
> >> >> announce
> >> >> that Dewey Dunnington has accepted.
> >> >>
> >> >> Congratulations and welcome!
> >> >>
> >> >
> >> 
> >> >>>
> >> >>
> >> >
> >> >>>
> >>


[ANNOUNCE] Apache Arrow nanoarrow 0.2.0 Released

2023-06-23 Thread Dewey Dunnington
The Apache Arrow community is pleased to announce the 0.2.0 release of
Apache Arrow nanoarrow. This release covers 19 resolved issues
from 6 contributors[1].

The release is available now from [2].

Release notes are available at:
https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.2.0/CHANGELOG.md

What is Apache Arrow?
---------------------
Apache Arrow is a columnar in-memory analytics layer designed to
accelerate big data. It houses a set of canonical in-memory
representations of flat and hierarchical data along with multiple
language-bindings for structure manipulation. It also provides
low-overhead streaming and batch messaging, zero-copy interprocess
communication (IPC), and vectorized in-memory analytics libraries.
Languages currently supported include C, C++, C#, Go, Java,
JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust.

What is Apache Arrow nanoarrow?
-------------------------------
Apache Arrow nanoarrow is a small C library for building and
interpreting Arrow C Data interface structures with bindings for users
of the R programming language. The vision of nanoarrow is that it
should be trivial for a library or application to implement an
Arrow-based interface. The library provides helpers to create types,
schemas, and metadata, an API for building arrays element-wise,
and an API to extract elements element-wise from an array. For a more
detailed description of the features nanoarrow provides and motivation
for its development, see [3].

Please report any feedback to the mailing lists ([4], [5]).

Regards,
The Apache Arrow Community

[1]: 
https://github.com/apache/arrow-nanoarrow/issues?q=is%3Aissue+milestone%3A%22nanoarrow+0.2.0%22+is%3Aclosed
[2]: https://www.apache.org/dyn/closer.cgi/arrow/apache-arrow-nanoarrow-0.2.0
[3]: https://github.com/apache/arrow-nanoarrow
[4]: https://lists.apache.org/list.html?u...@arrow.apache.org
[5]: https://lists.apache.org/list.html?dev@arrow.apache.org


Re: [Python][Discuss] PyArrow Dataset as a Python protocol

2023-06-23 Thread Ian Cook
Thanks Will for this proposal!

For anyone familiar with PyArrow, this idea has a clear intuitive
logic to it. It provides an expedient solution to the current lack of
a practical means for interchanging "unmaterialized dataframes"
between different Python libraries.

To elaborate on that: If you look at how people use the Arrow Dataset
API—which is implemented in the Arrow C++ library [1] and has bindings
not just for Python [2] but also for Java [3] and R [4]—you'll see
that Dataset is often used simply as a "virtual" variant of Table. It
is used in cases when the data is larger than memory or when it is
desirable to defer reading (materializing) the data into memory.

So we can think of a Table as a materialized dataframe and a Dataset
as an unmaterialized dataframe. That aspect of Dataset is I think what
makes it most attractive as a protocol for enabling interoperability:
it allows libraries to easily "speak Arrow" in cases where
materializing the full data in memory upfront is impossible or
undesirable.

The trouble is that Dataset was not designed to serve as a
general-purpose unmaterialized dataframe. For example, the PyArrow
Dataset constructor [5] exposes options for specifying a list of
source files and a partitioning scheme, which are irrelevant for many
of the applications that Will anticipates. And some work is needed to
reconcile the methods of the PyArrow Dataset object [6] with the
methods of the Table object. Some methods like filter() are exposed by
both and behave lazily on Datasets and eagerly on Tables, as a user
might expect. But many other Table methods are not implemented for
Dataset though they potentially could be, and it is unclear where we
should draw the line between adding methods to Dataset vs. encouraging
new scanner implementations to expose options controlling what lazy
operations should be performed as they see fit.

Will, I see that you've already addressed this issue to some extent in
your proposal. For example, you mention that we should initially
define this protocol to include only a minimal subset of the Dataset
API. I agree, but I think there are some loose ends we should be
careful to tie up. I strongly agree with the comments made by David,
Weston, and Dewey arguing that we should avoid any use of PyArrow
expressions in this API. Expressions are an implementation detail of
PyArrow, not a part of the Arrow standard. It would be much safer for
the initial version of this protocol to not define *any*
methods/arguments that take expressions. This will allow us to take
some more time to finish up the Substrait expression implementation
work that is underway [7][8], then introduce Substrait-based
expressions in a later version of this protocol. This approach will
better position this protocol to be implemented in other languages
besides Python.

Another concern I have is that we have not fully explained why we want
to use Dataset instead of RecordBatchReader [9] as the basis of this
protocol. I would like to see an explanation of why RecordBatchReader
is not sufficient for this. RecordBatchReader seems like another
possible way to represent "unmaterialized dataframes" and there are
some parallels between RecordBatch/RecordBatchReader and
Fragment/Dataset. We should help developers and users understand why
Arrow needs both of these.
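
One possible distinction, offered as an assumption on my part and not as
the answer to the question above: a reader is a one-shot stream of
batches, while a dataset can be scanned repeatedly and can accept a column
selection per scan. The sketch below uses made-up classes (BatchStream,
FragmentedDataset) in plain Python; none of this is the PyArrow API.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Sequence

Batch = Dict[str, List[int]]  # column name -> values; a toy "record batch"


class BatchStream:
    """One-shot iterator of batches (the role a RecordBatchReader plays)."""

    def __init__(self, batches: Iterable[Batch]) -> None:
        self._it = iter(batches)

    def __iter__(self) -> Iterator[Batch]:
        # Always the same underlying iterator: a second pass finds it empty.
        return self._it


class FragmentedDataset:
    """Re-scannable collection of fragments (the role a Dataset plays)."""

    def __init__(self, fragments: Sequence[Sequence[Batch]]) -> None:
        self._fragments = fragments

    def scan(self, columns: Optional[List[str]] = None) -> BatchStream:
        # Each scan starts fresh and may push a column selection down,
        # so unrequested columns never reach the consumer.
        def project(batch: Batch) -> Batch:
            if columns is None:
                return batch
            return {name: batch[name] for name in columns}

        return BatchStream(project(b) for frag in self._fragments for b in frag)


dataset = FragmentedDataset([
    [{"a": [1, 2], "b": [10, 20]}],
    [{"a": [3], "b": [30]}],
])
stream = dataset.scan(columns=["a"])
first_pass = list(stream)
second_pass = list(stream)                  # the stream is already consumed
rescan = list(dataset.scan(columns=["a"]))  # the dataset can be scanned again
```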

Thanks Will for your thoughtful prose explanations about this proposed
API. After we arrive at a decision about this, I think we should
reproduce some of these explanations in docs, blog posts, cookbook
recipes, etc., because there is some nuance here that integrators of
this API will need to understand.

Ian

[1] https://arrow.apache.org/docs/cpp/api/dataset.html
[2] https://arrow.apache.org/docs/python/dataset.html
[3] https://arrow.apache.org/docs/java/dataset.html
[4] https://arrow.apache.org/docs/r/articles/dataset.html
[5] https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html#pyarrow.dataset.dataset
[6] https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html
[7] https://github.com/apache/arrow/issues/33985
[8] https://github.com/apache/arrow/issues/34252
[9] https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html

On Wed, Jun 21, 2023 at 2:09 PM Will Jones  wrote:
>
> Hello Arrow devs,
>
> I have drafted a PR defining an experimental protocol which would allow
> third-party libraries to imitate the PyArrow Dataset API [5]. This protocol
> is intended to endorse an integration pattern that is starting to be used
> in the Python ecosystem, where some libraries are providing their own
> scanners with this API, while query engines are accepting these as
> duck-typed objects.
>
> To give some background: back at the end of 2021, we collaborated with
> DuckDB to be able to read datasets (an Arrow C++ concept), supporting
> column selection and filter pushdown. This was accomplished by having
> DuckDB manipulating Python (or R) objects to get a 

Re: [ANNOUNCE] New Arrow PMC member: Dewey Dunnington

2023-06-23 Thread David Li
Welcome Dewey!

On Fri, Jun 23, 2023, at 13:37, Weston Pace wrote:
> Congrats Dewey!
>
> On Fri, Jun 23, 2023 at 9:00 AM Antoine Pitrou  wrote:
>
>>
>> Welcome to the PMC Dewey!
>>
>>
>> Le 23/06/2023 à 16:59, Joris Van den Bossche a écrit :
>> > Congrats Dewey!
>> >
>> > On Fri, 23 Jun 2023 at 16:54, Jacob Wujciak-Jens
>> >  wrote:
>> >>
>> >> Well deserved! Congratulations Dewey!
>> >>
>> >> Ian Cook  schrieb am Fr., 23. Juni 2023, 16:32:
>> >>
>> >>> Congratulations Dewey!
>> >>>
>> >>> On Fri, Jun 23, 2023 at 10:03 AM Matt Topol 
>> >>> wrote:
>> >>>> Congrats Dewey!!
>> >>>>
>> >>>> On Fri, Jun 23, 2023, 9:35 AM Dane Pitkin wrote:
>> >>>>> Congrats Dewey!
>> >>>>>
>> >>>>> On Fri, Jun 23, 2023 at 9:15 AM Nic Crane wrote:
>> >>>>>> Well-deserved Dewey, congratulations!
>> >>>>>>
>> >>>>>> On Fri, 23 Jun 2023 at 11:53, Vibhatha Abeykoon wrote:
>> >>>>>>> Congratulations Dewey!
>> >>>>>>>
>> >>>>>>> On Fri, Jun 23, 2023 at 4:16 PM Alenka Frim
>> >>>>>>> <ale...@voltrondata.com.invalid> wrote:
>> >>>>>>>> Congratulations Dewey!!
>> >>>>>>>>
>> >>>>>>>> On Fri, Jun 23, 2023 at 12:10 PM Raúl Cumplido
>> >>>>>>>> <raulcumpl...@gmail.com> wrote:
>> >>>>>>>>> Congratulations Dewey!
>> >>>>>>>>>
>> >>>>>>>>> El vie, 23 jun 2023, 11:55, Andrew Lamb escribió:
>> >>>>>>>>>> The Project Management Committee (PMC) for Apache Arrow has
>> >>>>>>>>>> invited Dewey Dunnington (paleolimbot) to become a PMC member
>> >>>>>>>>>> and we are pleased to announce that Dewey Dunnington has
>> >>>>>>>>>> accepted.
>> >>>>>>>>>>
>> >>>>>>>>>> Congratulations and welcome!


Re: [DISCUSS][Format][Flight] Result set expiration support

2023-06-23 Thread David Li
That sort of thing can be handled by the client, though (and note it says that 
the error is if the statement is closed, not finished). So it doesn't seem 
strictly necessary, though it would allow a client to express intent.

On Fri, Jun 23, 2023, at 13:25, Weston Pace wrote:
> One small difference seems to be that Close is idempotent and Cancel is not.
>
>> void cancel()
>>  throws SQLException
>>
>> Cancels this Statement object if both the DBMS and driver support
> aborting an SQL statement. This method can be used by one thread to cancel
> a statement that is being executed by another thread.
>>
>> Throws:
>> SQLException - if a database access error occurs or this method is
> called on a closed Statement
>
> In other words, with cancel, you can display an error to the user if the
> statement is already finished (and thus was not able to be canceled).
> However, I don't know if that is significant at all.
>
> On Fri, Jun 23, 2023 at 12:17 AM Sutou Kouhei  wrote:
>
>> Hi,
>>
>> Thanks for sharing your thoughts.
>>
>> OK. I'll change the current specifications/implementations
>> to the followings:
>>
>> * Remove CloseFlightInfo (if nobody objects it)
>> * RefreshFlightEndpoint ->
>>   RenewFlightEndpoint
>> * RenewFlightEndpoint(FlightEndpoint) ->
>>   RenewFlightEndpoint(RenewFlightEndpointRequest)
>> * CancelFlightInfo(FlightInfo) ->
>>   CancelFlightInfo(CancelFlightInfoRequest)
>>
>>
>> Thanks,
>> --
>> kou
>>
>> In 
>>   "Re: [DISCUSS][Format][Flight] Result set expiration support" on Thu, 22
>> Jun 2023 12:51:55 -0400,
>>   Matt Topol  wrote:
>>
>> >> That said, I think it's reasonable to only have Cancel at the protocol
>> > level.
>> >
>> > I'd be in favor of only having Cancel too. In theory calling Cancel on
>> > something that has already completed should just be equivalent to calling
>> > Close anyways rather than requiring a client to guess and call Close if
>> > Cancel errors or something.
>> >
>> >> So this may not be needed for now. How about accepting a
>> >> specific request message instead of FlightEndpoint directly
>> >> as "PersistFlightEndpoint" input?
>> >
>> > I'm also in favor of this.
>> >
>> >> I think Refresh was fine, but if there's confusion, I like Kou's
>> > suggestion of Renew the best.
>> >
>> > I'm in the same boat as David here, I think Refresh was fine but like the
>> > suggestion of Renew best if we want to avoid any confusion.
>> >
>> >
>> >
>> > On Thu, Jun 22, 2023 at 2:55 AM Antoine Pitrou 
>> wrote:
>> >
>> >>
>> >> Doesn't protobuf ensure forwards compatibility? Why would it break?
>> >>
>> >> At worse, you can include the changes necessary for it to compile
>> >> cleanly, without adding support for the new fields/methods?
>> >>
>> >>
>> >> Le 22/06/2023 à 02:16, Sutou Kouhei a écrit :
>> >> > Hi,
>> >> >
>> >> > The following part in the original e-mail is the one:
>> >> >
>> >> >> https://github.com/apache/arrow/pull/36009 is an
>> >> >> implementation of this proposal. The pull requests has the
>> >> >> followings:
>> >> >>
>> >> >> 1. Format changes:
>> >> >> * format/Flight.proto
>> >> >>
>> >>
>> https://github.com/apache/arrow/pull/36009/files#diff-53b6c132dcc789483c879f667a1c675792b77aae9a056b257d6b20287bb09dba
>> >> >> * format/FlightSql.proto
>> >> >>
>> >>
>> https://github.com/apache/arrow/pull/36009/files#diff-fd4e5266a841a2b4196aadca76a4563b6770c91d400ee53b6235b96da628a01e
>> >> >>
>> >> >> 2. Documentation changes:
>> >> >> docs/source/format/Flight.rst
>> >> >>
>> >>
>> https://github.com/apache/arrow/pull/36009/files#diff-839518fb41e923de682e8587f0b6fdb00eb8f3361d360c2f7249284a136a7d89
>> >> >
>> >> > We can split the part to a separated pull request. But if we
>> >> > split the part and merge the pull requests for format
>> >> > related changes and implementation related changes
>> >> > separately, our CI will be broken temporary. Because our
>> >> > implementations use auto-generated sources that are based on
>> >> > *.proto.
>> >> >
>> >> >
>> >> > Thanks,
>> >>
>>


Re: [ANNOUNCE] New Arrow PMC member: Dewey Dunnington

2023-06-23 Thread Weston Pace
Congrats Dewey!

On Fri, Jun 23, 2023 at 9:00 AM Antoine Pitrou  wrote:

>
> Welcome to the PMC Dewey!
>
>
> Le 23/06/2023 à 16:59, Joris Van den Bossche a écrit :
> > Congrats Dewey!
> >
> > On Fri, 23 Jun 2023 at 16:54, Jacob Wujciak-Jens
> >  wrote:
> >>
> >> Well deserved! Congratulations Dewey!
> >>
> >> Ian Cook  schrieb am Fr., 23. Juni 2023, 16:32:
> >>
> >>> Congratulations Dewey!
> >>>
> >>> On Fri, Jun 23, 2023 at 10:03 AM Matt Topol 
> >>> wrote:
> >>>> Congrats Dewey!!
> >>>>
> >>>> On Fri, Jun 23, 2023, 9:35 AM Dane Pitkin wrote:
> >>>>> Congrats Dewey!
> >>>>>
> >>>>> On Fri, Jun 23, 2023 at 9:15 AM Nic Crane wrote:
> >>>>>> Well-deserved Dewey, congratulations!
> >>>>>>
> >>>>>> On Fri, 23 Jun 2023 at 11:53, Vibhatha Abeykoon wrote:
> >>>>>>> Congratulations Dewey!
> >>>>>>>
> >>>>>>> On Fri, Jun 23, 2023 at 4:16 PM Alenka Frim
> >>>>>>> <ale...@voltrondata.com.invalid> wrote:
> >>>>>>>> Congratulations Dewey!!
> >>>>>>>>
> >>>>>>>> On Fri, Jun 23, 2023 at 12:10 PM Raúl Cumplido
> >>>>>>>> <raulcumpl...@gmail.com> wrote:
> >>>>>>>>> Congratulations Dewey!
> >>>>>>>>>
> >>>>>>>>> El vie, 23 jun 2023, 11:55, Andrew Lamb escribió:
> >>>>>>>>>> The Project Management Committee (PMC) for Apache Arrow has
> >>>>>>>>>> invited Dewey Dunnington (paleolimbot) to become a PMC member
> >>>>>>>>>> and we are pleased to announce that Dewey Dunnington has
> >>>>>>>>>> accepted.
> >>>>>>>>>>
> >>>>>>>>>> Congratulations and welcome!


Re: [DISCUSS][Format][Flight] Result set expiration support

2023-06-23 Thread Weston Pace
One small difference seems to be that Close is idempotent and Cancel is not.

> void cancel()
>  throws SQLException
>
> Cancels this Statement object if both the DBMS and driver support
aborting an SQL statement. This method can be used by one thread to cancel
a statement that is being executed by another thread.
>
> Throws:
> SQLException - if a database access error occurs or this method is
called on a closed Statement

In other words, with cancel, you can display an error to the user if the
statement is already finished (and thus was not able to be canceled).
However, I don't know if that is significant at all.
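
As a rough illustration of that difference, here is a toy sketch in
Python. The Statement class and StatementClosedError are hypothetical and
only mimic the JDBC wording quoted above; they are not part of JDBC,
Flight, or Flight SQL.

```python
class StatementClosedError(Exception):
    """Hypothetical error, playing the role of JDBC's SQLException here."""


class Statement:
    """Toy statement handle; not a real JDBC or Flight SQL class."""

    def __init__(self) -> None:
        self.closed = False
        self.cancelled = False

    def close(self) -> None:
        # Idempotent: closing an already closed statement has no effect.
        self.closed = True

    def cancel(self) -> None:
        # Not idempotent with respect to close: cancelling a closed
        # statement is reported as an error instead of being ignored.
        if self.closed:
            raise StatementClosedError("cannot cancel a closed statement")
        self.cancelled = True


stmt = Statement()
stmt.close()
stmt.close()  # second close succeeds silently
try:
    stmt.cancel()
    cancel_outcome = "cancelled"
except StatementClosedError:
    cancel_outcome = "error: already closed"
```

Whether a client would actually want to surface that error is the open
question.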

On Fri, Jun 23, 2023 at 12:17 AM Sutou Kouhei  wrote:

> Hi,
>
> Thanks for sharing your thoughts.
>
> OK. I'll change the current specifications/implementations
> to the followings:
>
> * Remove CloseFlightInfo (if nobody objects it)
> * RefreshFlightEndpoint ->
>   RenewFlightEndpoint
> * RenewFlightEndpoint(FlightEndpoint) ->
>   RenewFlightEndpoint(RenewFlightEndpointRequest)
> * CancelFlightInfo(FlightInfo) ->
>   CancelFlightInfo(CancelFlightInfoRequest)
>
>
> Thanks,
> --
> kou
>
> In 
>   "Re: [DISCUSS][Format][Flight] Result set expiration support" on Thu, 22
> Jun 2023 12:51:55 -0400,
>   Matt Topol  wrote:
>
> >> That said, I think it's reasonable to only have Cancel at the protocol
> > level.
> >
> > I'd be in favor of only having Cancel too. In theory calling Cancel on
> > something that has already completed should just be equivalent to calling
> > Close anyways rather than requiring a client to guess and call Close if
> > Cancel errors or something.
> >
> >> So this may not be needed for now. How about accepting a
> >> specific request message instead of FlightEndpoint directly
> >> as "PersistFlightEndpoint" input?
> >
> > I'm also in favor of this.
> >
> >> I think Refresh was fine, but if there's confusion, I like Kou's
> > suggestion of Renew the best.
> >
> > I'm in the same boat as David here, I think Refresh was fine but like the
> > suggestion of Renew best if we want to avoid any confusion.
> >
> >
> >
> > On Thu, Jun 22, 2023 at 2:55 AM Antoine Pitrou 
> wrote:
> >
> >>
> >> Doesn't protobuf ensure forwards compatibility? Why would it break?
> >>
> >> At worse, you can include the changes necessary for it to compile
> >> cleanly, without adding support for the new fields/methods?
> >>
> >>
> >> Le 22/06/2023 à 02:16, Sutou Kouhei a écrit :
> >> > Hi,
> >> >
> >> > The following part in the original e-mail is the one:
> >> >
> >> >> https://github.com/apache/arrow/pull/36009 is an
> >> >> implementation of this proposal. The pull requests has the
> >> >> followings:
> >> >>
> >> >> 1. Format changes:
> >> >> * format/Flight.proto
> >> >>
> >>
> https://github.com/apache/arrow/pull/36009/files#diff-53b6c132dcc789483c879f667a1c675792b77aae9a056b257d6b20287bb09dba
> >> >> * format/FlightSql.proto
> >> >>
> >>
> https://github.com/apache/arrow/pull/36009/files#diff-fd4e5266a841a2b4196aadca76a4563b6770c91d400ee53b6235b96da628a01e
> >> >>
> >> >> 2. Documentation changes:
> >> >> docs/source/format/Flight.rst
> >> >>
> >>
> https://github.com/apache/arrow/pull/36009/files#diff-839518fb41e923de682e8587f0b6fdb00eb8f3361d360c2f7249284a136a7d89
> >> >
> >> > We can split the part to a separated pull request. But if we
> >> > split the part and merge the pull requests for format
> >> > related changes and implementation related changes
> >> > separately, our CI will be broken temporary. Because our
> >> > implementations use auto-generated sources that are based on
> >> > *.proto.
> >> >
> >> >
> >> > Thanks,
> >>
>


Re: [ANNOUNCE] New Arrow PMC member: Dewey Dunnington

2023-06-23 Thread Antoine Pitrou



Welcome to the PMC Dewey!

Le 23/06/2023 à 16:59, Joris Van den Bossche a écrit :
> Congrats Dewey!
>
> On Fri, 23 Jun 2023 at 16:54, Jacob Wujciak-Jens wrote:
> > Well deserved! Congratulations Dewey!
> >
> > Ian Cook schrieb am Fr., 23. Juni 2023, 16:32:
> > > Congratulations Dewey!
> > >
> > > On Fri, Jun 23, 2023 at 10:03 AM Matt Topol wrote:
> > > > Congrats Dewey!!
> > > >
> > > > On Fri, Jun 23, 2023, 9:35 AM Dane Pitkin wrote:
> > > > > Congrats Dewey!
> > > > >
> > > > > On Fri, Jun 23, 2023 at 9:15 AM Nic Crane wrote:
> > > > > > Well-deserved Dewey, congratulations!
> > > > > >
> > > > > > On Fri, 23 Jun 2023 at 11:53, Vibhatha Abeykoon wrote:
> > > > > > > Congratulations Dewey!
> > > > > > >
> > > > > > > On Fri, Jun 23, 2023 at 4:16 PM Alenka Frim
> > > > > > > <ale...@voltrondata.com.invalid> wrote:
> > > > > > > > Congratulations Dewey!!
> > > > > > > >
> > > > > > > > On Fri, Jun 23, 2023 at 12:10 PM Raúl Cumplido
> > > > > > > > <raulcumpl...@gmail.com> wrote:
> > > > > > > > > Congratulations Dewey!
> > > > > > > > >
> > > > > > > > > El vie, 23 jun 2023, 11:55, Andrew Lamb escribió:
> > > > > > > > > > The Project Management Committee (PMC) for Apache Arrow
> > > > > > > > > > has invited Dewey Dunnington (paleolimbot) to become a
> > > > > > > > > > PMC member and we are pleased to announce that Dewey
> > > > > > > > > > Dunnington has accepted.
> > > > > > > > > >
> > > > > > > > > > Congratulations and welcome!


Re: [ANNOUNCE] New Arrow PMC member: Dewey Dunnington

2023-06-23 Thread Joris Van den Bossche
Congrats Dewey!

On Fri, 23 Jun 2023 at 16:54, Jacob Wujciak-Jens
 wrote:
>
> Well deserved! Congratulations Dewey!
>
> Ian Cook  schrieb am Fr., 23. Juni 2023, 16:32:
>
> > Congratulations Dewey!
> >
> > On Fri, Jun 23, 2023 at 10:03 AM Matt Topol 
> > wrote:
> > >
> > > Congrats Dewey!!
> > >
> > > On Fri, Jun 23, 2023, 9:35 AM Dane Pitkin 
> > > wrote:
> > >
> > > > Congrats Dewey!
> > > >
> > > > On Fri, Jun 23, 2023 at 9:15 AM Nic Crane  wrote:
> > > >
> > > > > Well-deserved Dewey, congratulations!
> > > > >
> > > > > On Fri, 23 Jun 2023 at 11:53, Vibhatha Abeykoon 
> > > > > wrote:
> > > > >
> > > > > > Congratulations Dewey!
> > > > > >
> > > > > > On Fri, Jun 23, 2023 at 4:16 PM Alenka Frim <
> > ale...@voltrondata.com
> > > > > > .invalid>
> > > > > > wrote:
> > > > > >
> > > > > > > Congratulations Dewey!! 
> > > > > > >
> > > > > > > On Fri, Jun 23, 2023 at 12:10 PM Raúl Cumplido <
> > > > raulcumpl...@gmail.com
> > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Congratulations Dewey!
> > > > > > > >
> > > > > > > > El vie, 23 jun 2023, 11:55, Andrew Lamb 
> > > > > > escribió:
> > > > > > > >
> > > > > > > > > The Project Management Committee (PMC) for Apache Arrow has
> > > > invited
> > > > > > > > > Dewey Dunnington (paleolimbot) to become a PMC member and we
> > are
> > > > > > > pleased
> > > > > > > > to
> > > > > > > > > announce
> > > > > > > > > that Dewey Dunnington has accepted.
> > > > > > > > >
> > > > > > > > > Congratulations and welcome!
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> >


Re: [ANNOUNCE] New Arrow PMC member: Dewey Dunnington

2023-06-23 Thread Jacob Wujciak-Jens
Well deserved! Congratulations Dewey!

Ian Cook  schrieb am Fr., 23. Juni 2023, 16:32:

> Congratulations Dewey!
>
> On Fri, Jun 23, 2023 at 10:03 AM Matt Topol 
> wrote:
> >
> > Congrats Dewey!!
> >
> > On Fri, Jun 23, 2023, 9:35 AM Dane Pitkin 
> > wrote:
> >
> > > Congrats Dewey!
> > >
> > > On Fri, Jun 23, 2023 at 9:15 AM Nic Crane  wrote:
> > >
> > > > Well-deserved Dewey, congratulations!
> > > >
> > > > On Fri, 23 Jun 2023 at 11:53, Vibhatha Abeykoon 
> > > > wrote:
> > > >
> > > > > Congratulations Dewey!
> > > > >
> > > > > On Fri, Jun 23, 2023 at 4:16 PM Alenka Frim <
> ale...@voltrondata.com
> > > > > .invalid>
> > > > > wrote:
> > > > >
> > > > > > Congratulations Dewey!! 
> > > > > >
> > > > > > On Fri, Jun 23, 2023 at 12:10 PM Raúl Cumplido <
> > > raulcumpl...@gmail.com
> > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Congratulations Dewey!
> > > > > > >
> > > > > > > El vie, 23 jun 2023, 11:55, Andrew Lamb 
> > > > > escribió:
> > > > > > >
> > > > > > > > The Project Management Committee (PMC) for Apache Arrow has
> > > invited
> > > > > > > > Dewey Dunnington (paleolimbot) to become a PMC member and we
> are
> > > > > > pleased
> > > > > > > to
> > > > > > > > announce
> > > > > > > > that Dewey Dunnington has accepted.
> > > > > > > >
> > > > > > > > Congratulations and welcome!
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
>


Re: [ANNOUNCE] New Arrow PMC member: Dewey Dunnington

2023-06-23 Thread Ian Cook
Congratulations Dewey!

On Fri, Jun 23, 2023 at 10:03 AM Matt Topol  wrote:
>
> Congrats Dewey!!
>
> On Fri, Jun 23, 2023, 9:35 AM Dane Pitkin 
> wrote:
>
> > Congrats Dewey!
> >
> > On Fri, Jun 23, 2023 at 9:15 AM Nic Crane  wrote:
> >
> > > Well-deserved Dewey, congratulations!
> > >
> > > On Fri, 23 Jun 2023 at 11:53, Vibhatha Abeykoon 
> > > wrote:
> > >
> > > > Congratulations Dewey!
> > > >
> > > > On Fri, Jun 23, 2023 at 4:16 PM Alenka Frim wrote:
> > > >
> > > > > Congratulations Dewey!! 
> > > > >
> > > > > On Fri, Jun 23, 2023 at 12:10 PM Raúl Cumplido <
> > raulcumpl...@gmail.com
> > > >
> > > > > wrote:
> > > > >
> > > > > > Congratulations Dewey!
> > > > > >
> > > > > > El vie, 23 jun 2023, 11:55, Andrew Lamb 
> > > > escribió:
> > > > > >
> > > > > > > The Project Management Committee (PMC) for Apache Arrow has
> > invited
> > > > > > > Dewey Dunnington (paleolimbot) to become a PMC member and we are
> > > > > pleased
> > > > > > to
> > > > > > > announce
> > > > > > > that Dewey Dunnington has accepted.
> > > > > > >
> > > > > > > Congratulations and welcome!
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >


Re: [ANNOUNCE] New Arrow PMC member: Dewey Dunnington

2023-06-23 Thread Matt Topol
Congrats Dewey!!

On Fri, Jun 23, 2023, 9:35 AM Dane Pitkin 
wrote:

> Congrats Dewey!
>
> On Fri, Jun 23, 2023 at 9:15 AM Nic Crane  wrote:
>
> > Well-deserved Dewey, congratulations!
> >
> > On Fri, 23 Jun 2023 at 11:53, Vibhatha Abeykoon 
> > wrote:
> >
> > > Congratulations Dewey!
> > >
> > > On Fri, Jun 23, 2023 at 4:16 PM Alenka Frim wrote:
> > >
> > > > Congratulations Dewey!! 
> > > >
> > > > On Fri, Jun 23, 2023 at 12:10 PM Raúl Cumplido <
> raulcumpl...@gmail.com
> > >
> > > > wrote:
> > > >
> > > > > Congratulations Dewey!
> > > > >
> > > > > El vie, 23 jun 2023, 11:55, Andrew Lamb 
> > > escribió:
> > > > >
> > > > > > The Project Management Committee (PMC) for Apache Arrow has
> invited
> > > > > > Dewey Dunnington (paleolimbot) to become a PMC member and we are
> > > > pleased
> > > > > to
> > > > > > announce
> > > > > > that Dewey Dunnington has accepted.
> > > > > >
> > > > > > Congratulations and welcome!
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: [ANNOUNCE] New Arrow PMC member: Dewey Dunnington

2023-06-23 Thread Dane Pitkin
Congrats Dewey!

On Fri, Jun 23, 2023 at 9:15 AM Nic Crane  wrote:

> Well-deserved Dewey, congratulations!
>
> On Fri, 23 Jun 2023 at 11:53, Vibhatha Abeykoon 
> wrote:
>
> > Congratulations Dewey!
> >
> > On Fri, Jun 23, 2023 at 4:16 PM Alenka Frim wrote:
> >
> > > Congratulations Dewey!! 
> > >
> > > On Fri, Jun 23, 2023 at 12:10 PM Raúl Cumplido wrote:
> > >
> > > > Congratulations Dewey!
> > > >
> > > > El vie, 23 jun 2023, 11:55, Andrew Lamb 
> > escribió:
> > > >
> > > > > The Project Management Committee (PMC) for Apache Arrow has invited
> > > > > Dewey Dunnington (paleolimbot) to become a PMC member and we are
> > > pleased
> > > > to
> > > > > announce
> > > > > that Dewey Dunnington has accepted.
> > > > >
> > > > > Congratulations and welcome!
> > > > >
> > > >
> > >
> >
>


Re: [ANNOUNCE] New Arrow PMC member: Dewey Dunnington

2023-06-23 Thread Nic Crane
Well-deserved Dewey, congratulations!

On Fri, 23 Jun 2023 at 11:53, Vibhatha Abeykoon  wrote:

> Congratulations Dewey!
>
> On Fri, Jun 23, 2023 at 4:16 PM Alenka Frim wrote:
>
> > Congratulations Dewey!! 
> >
> > On Fri, Jun 23, 2023 at 12:10 PM Raúl Cumplido 
> > wrote:
> >
> > > Congratulations Dewey!
> > >
> > > El vie, 23 jun 2023, 11:55, Andrew Lamb 
> escribió:
> > >
> > > > The Project Management Committee (PMC) for Apache Arrow has invited
> > > > Dewey Dunnington (paleolimbot) to become a PMC member and we are
> > pleased
> > > to
> > > > announce
> > > > that Dewey Dunnington has accepted.
> > > >
> > > > Congratulations and welcome!
> > > >
> > >
> >
>


Re: [RESULT][VOTE] Release Apache Arrow nanoarrow 0.2.0 - RC1

2023-06-23 Thread Dewey Dunnington
Thanks for offering! Sorry for being slow to update the thread...David
Li ran the upload script yesterday.

-dewey

On Thu, Jun 22, 2023 at 11:59 PM Sutou Kouhei  wrote:
>
> Hi,
>
> > I believe the upload step requires a PMC member to run the script
>
> I can do it. Can I run
> https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/post-01-upload.sh
> ?
>
>
> Thanks,
> --
> kou
>
> In 
>   "[RESULT][VOTE] Release Apache Arrow nanoarrow 0.2.0 - RC1" on Thu, 22 Jun 
> 2023 16:05:50 -0300,
>   Dewey Dunnington  wrote:
>
> > Thank you everybody for verifying and voting! With 3 binding +1s and 3
> > non-binding +1s, the vote passes! I have opened a PR to improve the
> > verification instructions (particularly on conda where most problems
> > occurred) [1].
> >
> > Apache Arrow nanoarrow 0.2.0 has the following post-release tasks. I
> > believe the upload step requires a PMC member to run the script but
> > the rest I'm happy to take care of!
> >
> > [x] Closed GitHub milestone
> > [ ] Added release to Apache Reporter System
> > [ ] Uploaded artifacts to Subversion
> > [ ] Created GitHub release
> > [ ] Submit R package to CRAN
> > [ ] Sent announcement to annou...@apache.org
> > [ ] Release blog post [2]
> > [ ] Removed old artifacts from SVN
> > [ ] Bumped versions on main
> >
> > [1] https://github.com/apache/arrow-nanoarrow/pull/243
> > [2] https://github.com/apache/arrow-site/pull/364


Re: [ANNOUNCE] New Arrow PMC member: Dewey Dunnington

2023-06-23 Thread Vibhatha Abeykoon
Congratulations Dewey!

On Fri, Jun 23, 2023 at 4:16 PM Alenka Frim 
wrote:

> Congratulations Dewey!! 
>
> On Fri, Jun 23, 2023 at 12:10 PM Raúl Cumplido 
> wrote:
>
> > Congratulations Dewey!
> >
> > El vie, 23 jun 2023, 11:55, Andrew Lamb  escribió:
> >
> > > The Project Management Committee (PMC) for Apache Arrow has invited
> > > Dewey Dunnington (paleolimbot) to become a PMC member and we are
> pleased
> > to
> > > announce
> > > that Dewey Dunnington has accepted.
> > >
> > > Congratulations and welcome!
> > >
> >
>


Re: [ANNOUNCE] New Arrow PMC member: Dewey Dunnington

2023-06-23 Thread Alenka Frim
Congratulations Dewey!! 

On Fri, Jun 23, 2023 at 12:10 PM Raúl Cumplido 
wrote:

> Congratulations Dewey!
>
> El vie, 23 jun 2023, 11:55, Andrew Lamb  escribió:
>
> > The Project Management Committee (PMC) for Apache Arrow has invited
> > Dewey Dunnington (paleolimbot) to become a PMC member and we are pleased
> to
> > announce
> > that Dewey Dunnington has accepted.
> >
> > Congratulations and welcome!
> >
>


Re: [ANNOUNCE] New Arrow PMC member: Dewey Dunnington

2023-06-23 Thread Raúl Cumplido
Congratulations Dewey!

El vie, 23 jun 2023, 11:55, Andrew Lamb  escribió:

> The Project Management Committee (PMC) for Apache Arrow has invited
> Dewey Dunnington (paleolimbot) to become a PMC member and we are pleased to
> announce
> that Dewey Dunnington has accepted.
>
> Congratulations and welcome!
>


[ANNOUNCE] New Arrow PMC member: Dewey Dunnington

2023-06-23 Thread Andrew Lamb
The Project Management Committee (PMC) for Apache Arrow has invited
Dewey Dunnington (paleolimbot) to become a PMC member and we are pleased to
announce
that Dewey Dunnington has accepted.

Congratulations and welcome!


Re: [DISCUSS][Format][Flight] Result set expiration support

2023-06-23 Thread Sutou Kouhei
Hi,

Thanks for sharing your thoughts.

OK. I'll change the current specification and implementations
as follows:

* Remove CloseFlightInfo (if nobody objects)
* RefreshFlightEndpoint ->
  RenewFlightEndpoint
* RenewFlightEndpoint(FlightEndpoint) ->
  RenewFlightEndpoint(RenewFlightEndpointRequest)
* CancelFlightInfo(FlightInfo) ->
  CancelFlightInfo(CancelFlightInfoRequest)
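
A small sketch of why a dedicated request message can be useful, written
as Python dataclasses. The field names and the renew_flight_endpoint
helper are assumptions made for illustration; they are not taken from
Flight.proto. The optional expiration hint mirrors the
"suggested_expiration" idea raised earlier in this thread and is
hypothetical as well.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FlightEndpoint:
    """Toy stand-in for the Flight message; only a ticket is modelled."""
    ticket: bytes


@dataclass(frozen=True)
class RenewFlightEndpointRequest:
    """Dedicated request message wrapping the endpoint to renew."""
    endpoint: FlightEndpoint
    # Hypothetical optional field: a wrapper leaves room for additions like
    # this later without changing the type that the action accepts.
    suggested_expiration_seconds: Optional[int] = None


def renew_flight_endpoint(request: RenewFlightEndpointRequest,
                          default_ttl_seconds: int = 60) -> int:
    """Return the TTL a toy server grants; it may cap or ignore the hint."""
    if request.suggested_expiration_seconds is None:
        return default_ttl_seconds
    return min(request.suggested_expiration_seconds, 10 * default_ttl_seconds)


endpoint = FlightEndpoint(ticket=b"query-1")
plain = renew_flight_endpoint(RenewFlightEndpointRequest(endpoint))
hinted = renew_flight_endpoint(RenewFlightEndpointRequest(endpoint, 86400))
```

Callers that pass only the endpoint keep working when the optional field
is added, which is the compatibility property the wrapper is meant to buy.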


Thanks,
-- 
kou

In 
  "Re: [DISCUSS][Format][Flight] Result set expiration support" on Thu, 22 Jun 
2023 12:51:55 -0400,
  Matt Topol  wrote:

>> That said, I think it's reasonable to only have Cancel at the protocol
> level.
> 
> I'd be in favor of only having Cancel too. In theory calling Cancel on
> something that has already completed should just be equivalent to calling
> Close anyways rather than requiring a client to guess and call Close if
> Cancel errors or something.
> 
>> So this may not be needed for now. How about accepting a
>> specific request message instead of FlightEndpoint directly
>> as "PersistFlightEndpoint" input?
> 
> I'm also in favor of this.
> 
>> I think Refresh was fine, but if there's confusion, I like Kou's
> suggestion of Renew the best.
> 
> I'm in the same boat as David here, I think Refresh was fine but like the
> suggestion of Renew best if we want to avoid any confusion.
> 
> 
> 
> On Thu, Jun 22, 2023 at 2:55 AM Antoine Pitrou  wrote:
> 
>>
>> Doesn't protobuf ensure forwards compatibility? Why would it break?
>>
>> At worse, you can include the changes necessary for it to compile
>> cleanly, without adding support for the new fields/methods?
>>
>>
>> Le 22/06/2023 à 02:16, Sutou Kouhei a écrit :
>> > Hi,
>> >
>> > The following part in the original e-mail is the one:
>> >
>> >> https://github.com/apache/arrow/pull/36009 is an
>> >> implementation of this proposal. The pull requests has the
>> >> followings:
>> >>
>> >> 1. Format changes:
>> >> * format/Flight.proto
>> >>
>> https://github.com/apache/arrow/pull/36009/files#diff-53b6c132dcc789483c879f667a1c675792b77aae9a056b257d6b20287bb09dba
>> >> * format/FlightSql.proto
>> >>
>> https://github.com/apache/arrow/pull/36009/files#diff-fd4e5266a841a2b4196aadca76a4563b6770c91d400ee53b6235b96da628a01e
>> >>
>> >> 2. Documentation changes:
>> >> docs/source/format/Flight.rst
>> >>
>> https://github.com/apache/arrow/pull/36009/files#diff-839518fb41e923de682e8587f0b6fdb00eb8f3361d360c2f7249284a136a7d89
>> >
>> > We can split the part to a separated pull request. But if we
>> > split the part and merge the pull requests for format
>> > related changes and implementation related changes
>> > separately, our CI will be broken temporary. Because our
>> > implementations use auto-generated sources that are based on
>> > *.proto.
>> >
>> >
>> > Thanks,
>>


Re: [DISCUSS][Format][Flight] Result set expiration support

2023-06-23 Thread Sutou Kouhei
Hi,

Could someone who is familiar with JDBC explain the behavior
of cancel/close in the UPDATE/INSERT case?

If we can't find a useful use case for providing both
cancel and close, we'll provide only cancel in this
proposal. If we find a useful use case for it, we can add
close later.


Thanks,
-- 
kou

In 
  "Re: [DISCUSS][Format][Flight] Result set expiration support" on Wed, 21 Jun 
2023 15:53:54 +0200,
  Antoine Pitrou  wrote:

> 
> Ah... in JDBC, if the statement is something like an UPDATE or INSERT,
> then cancelling the statement is not the same thing as closing the
> result set? The latter would probably just discard the result set but
> still commit the results?
> 
> The problem is that Flight RPC doesn't have separate notions of
> queries and results sets...
> 
> 
> Le 21/06/2023 à 15:49, David Li a écrit :
>> There is a PR linked in the original message, but here it is again:
>> https://github.com/apache/arrow/pull/36009
>> Cancel and Close are close semantically, but Cancel is meant for when
>> the (client thinks that) computation is still ongoing, while Close is
>> meant to free server resources after reading a result set. (For
>> example, JDBC has Statement#cancel [1] and ResultSet#close [2].)
>> That said, I think it's reasonable to only have Cancel at the protocol
>> level.
>> [1]:
>> https://docs.oracle.com/javase/8/docs/api/java/sql/Statement.html#cancel--
>> [2]:
>> https://docs.oracle.com/javase/8/docs/api/java/sql/ResultSet.html#close--
>> On Wed, Jun 21, 2023, at 09:35, Antoine Pitrou wrote:
>>> Hi Kou,
>>>
>>> Can we have an actual PR with the proposed gRPC field, method and
>>> docstring additions?
>>>
>>> Regardless, I have some comments and questions:
>>>
>>> * "RefreshFlightEndpoint" suggests the server will recompute (refresh)
>>> the results; instead I would suggest "PersistFlightEndpoint"
>>>
>>> * Perhaps "PersistFlightEndpoint" can take an optional
>>> "suggested_expiration" timestamp, which the server is free to ignore
>>> (some clients may only need to extend the expiration by two minutes,
>>> others by two days...)
>>>
>>> * Does the client potentially have to call "PersistFlightEndpoint" on
>>> each returned endpoint? Can it pass several endpoints at once?
>>>
>>> * What is the expected difference between "CancelFlightInfo" and
>>> "CloseFlightInfo"? Both seem to have a similar effect, and the exact
>>> behaviour will probably be server-dependent anyway ("cancel" and
>>> "close"
>>> may have meaningful differences when putting/uploading data, not so
>>> much
>>> when getting/downloading data, IMHO?).
>>>
>>> Regards
>>>
>>> Antoine.
>>>
>>>
>>>
>>> Le 21/06/2023 à 02:28, Sutou Kouhei a écrit :
>>>> Hi,
>>>>
>>>> David provided the Java implementation. Thanks!
>>>>
>>>> If anyone has any comments about this proposal, please share
>>>> them.
>>>>
>>>>
>>>> Thanks,


Re: [DISCUSS][Format][Flight] Result set expiration support

2023-06-23 Thread Sutou Kouhei
Hi,

Sorry. I was wrong. I tried it locally and got no build
error. We added "deprecated" metadata in this case, so I
thought that we would get deprecation warnings and that they
would be treated as errors in CI.

> At worse, you can include the changes necessary for it to compile
> cleanly, without adding support for the new fields/methods?

Why do we want to split the format/ changes even when that requires
additional changes? To make review easier?

I can understand that we can review specification changes
without implementations. But some problems may be found by
implementing the specification changes. I think that this is
the reason why we require at least two reference
implementations to change our specifications.

So I think that we should not split specification changes
from their implementations without a good reason.

If we should review/merge specification changes and then
review/merge their implementations, how about updating our
process for changing the format?
https://arrow.apache.org/docs/dev/format/Changing.html


Thanks,
-- 
kou

In 
  "Re: [DISCUSS][Format][Flight] Result set expiration support" on Thu, 22 Jun 
2023 08:55:33 +0200,
  Antoine Pitrou  wrote:

> 
> Doesn't protobuf ensure forwards compatibility? Why would it break?
> 
> At worse, you can include the changes necessary for it to compile
> cleanly, without adding support for the new fields/methods?
> 
> 
> Le 22/06/2023 à 02:16, Sutou Kouhei a écrit :
>> Hi,
>> The following part in the original e-mail is the one:
>> 
>>> https://github.com/apache/arrow/pull/36009 is an
>>> implementation of this proposal. The pull requests has the
>>> followings:
>>>
>>> 1. Format changes:
>>> * format/Flight.proto
>>>   
>>> https://github.com/apache/arrow/pull/36009/files#diff-53b6c132dcc789483c879f667a1c675792b77aae9a056b257d6b20287bb09dba
>>> * format/FlightSql.proto
>>>   
>>> https://github.com/apache/arrow/pull/36009/files#diff-fd4e5266a841a2b4196aadca76a4563b6770c91d400ee53b6235b96da628a01e
>>>
>>> 2. Documentation changes:
>>> docs/source/format/Flight.rst
>>> 
>>> https://github.com/apache/arrow/pull/36009/files#diff-839518fb41e923de682e8587f0b6fdb00eb8f3361d360c2f7249284a136a7d89
>> We can split the part to a separated pull request. But if we
>> split the part and merge the pull requests for format
>> related changes and implementation related changes
>> separately, our CI will be broken temporary. Because our
>> implementations use auto-generated sources that are based on
>> *.proto.
>> Thanks,