Re: [DISCUSS] Statistics through the C data interface

2024-06-06 Thread Dewey Dunnington
Thank you for collecting all of our opinions on this! I also agree
that (4) is the best option.

> Fields:
>
> | Name   | Type  | Comments |
> |--------|-------|----------|
> | column | utf8  | (2)  |

The utf8 type would presume that column names are unique (although I
like it better than referring to columns by integer position).

> If null, then the statistic applies to the entire table.

Perhaps the NULL column value could also be used for the other
statistics in addition to a row count if the array is not a struct
array?


On Thu, Jun 6, 2024 at 6:42 AM Antoine Pitrou  wrote:
>
>
> Hi Kou,
>
> Thanks for pushing for this!
>
> Le 06/06/2024 à 11:27, Sutou Kouhei a écrit :
> > 4. Standardize Apache Arrow schema for statistics and
> > transmit statistics via separated API call that uses the
> > C data interface
> [...]
> >
> > I think that 4. is the best approach in these candidates.
>
> I agree.
>
> > If we select 4., we need to standardize Apache Arrow schema
> > for statistics. How about the following schema?
> >
> > 
> > Metadata:
> >
> > | Name   | Value | Comments |
> > |----------------------------|-------|----------|
> > | ARROW::statistics::version | 1.0.0 | (1)  |
>
> I'm not sure this is useful, but it doesn't hurt.
>
> Nit: this should be "ARROW:statistics:version" for consistency with
> https://arrow.apache.org/docs/format/Columnar.html#extension-types
>
> > Fields:
> >
> > | Name   | Type  | Comments |
> > |--------|---------------|----------|
> > | column | utf8  | (2)  |
> > | key| utf8 not null | (3)  |
>
> 1. Should the key be something like `dictionary(int32, utf8)` to make
> the representation more efficient where there are many columns?
>
> 2. Should the statistics perhaps be nested as a map type under each
> column to avoid repeating `column`, or is that overkill?
>
> 3. Should there also be room for multi-column statistics (such as
> cardinality of a given column pair), or is it too complex for now?
>
> Regards
>
> Antoine.


Re: [DISCUSS] Migration of parquet-cpp issues to GitHub

2024-05-29 Thread Dewey Dunnington
> INFRA tickets are required before migration.

Perhaps this is different for existing repositories, but just a note
that it may also be possible by editing .asf.yaml (e.g. [1])

[1] 
https://github.com/apache/arrow-nanoarrow/blob/81711045e8bb4ded1cb3b5a6fa354b35f18aa4e7/.asf.yaml#L24-L25
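For reference, the relevant stanza looks roughly like this (paraphrased from the linked file; exact surrounding keys may differ per repository):

```yaml
# .asf.yaml (sketch): enabling GitHub issues without an INFRA ticket
github:
  features:
    issues: true
```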

On Wed, May 29, 2024 at 10:39 PM Gang Wu  wrote:
>
> Just want to mention that these apache/parquet-* Github repositories
> have not yet enabled issues and INFRA tickets are required before
> migration.
>
> Best,
> Gang
>
> On Thu, May 30, 2024 at 1:55 AM Micah Kornfield 
> wrote:
>
> > SGTM +1
> >
> > On Wed, May 29, 2024 at 10:50 AM Rok Mihevc  wrote:
> >
> > > On Wed, May 29, 2024 at 4:39 PM Fokko Driesprong 
> > wrote:
> > >
> > > > Hey Rok,
> > > >
> > > > Thanks for bringing this up. I'm also very much in favor of Github.
> > Once
> > > > we've migrated cpp, I think migrating the other repositories is a great
> > > > idea. Let me know if I can help!
> > >
> > >
> > > Perfect! A question I think we want to answer is where to move which
> > > issues. My mapping by component would be:
> > >
> > > jira/parquet-avro   --> github/parquet-java
> > > jira/parquet-cascading  --> github/parquet-java
> > > jira/parquet-cli   --> github/parquet-java
> > > jira/parquet-cpp--> github/arrow
> > > jira/parquet-format--> github/parquet-format
> > > jira/parquet-hadoop  --> github/parquet-java
> > > jira/parquet-mr  --> github/parquet-java
> > > jira/parquet-pig --> github/parquet-java
> > > jira/parquet-protobuf --> github/parquet-java
> > > jira/parquet-site --> github/parquet-site
> > > jira/parquet-testing--> github/parquet-testing
> > > jira/parquet-thrift--> github/parquet-java
> > >
> > > Would this be ok for everyone?
> > >
> > > Rok
> > >
> >


Re: [RESULT] Release Apache Arrow nanoarrow 0.5.0

2024-05-29 Thread Dewey Dunnington
All post-release tasks are now complete!

[x] Closed GitHub milestone
[x] Added release to the Apache Reporter System
[x] Uploaded artifacts to Subversion
[x] Created GitHub release
[x] Submit R package to CRAN
[x] Submit Python package to PyPI
[x] Update Python package on conda-forge
[x] Finish release blog post at https://github.com/apache/arrow-site/pull/525
[x] Sent announcement to annou...@apache.org
[x] Removed old artifacts from SVN
[x] Bumped versions on main
[x] Update WrapDB entry

Thanks to Will Ayd for creating the WrapDB entry for Meson build users!

On Sat, May 25, 2024 at 9:57 PM Dewey Dunnington  wrote:
>
> The vote carries with 4 binding +1s and 3 non-binding +1s. Thank you
> everybody for voting!
>
> There are still a few post-release tasks to complete that I will take
> care of this week:
>
> [x] Closed GitHub milestone
> [x] Added release to the Apache Reporter System
> [x] Uploaded artifacts to Subversion
> [x] Created GitHub release
> [ ] Submit R package to CRAN
> [x] Submit Python package to PyPI
> [ ] Update Python package on conda-forge
> [ ] Finish release blog post at https://github.com/apache/arrow-site/pull/525
> [ ] Sent announcement to annou...@apache.org
> [x] Removed old artifacts from SVN
> [x] Bumped versions on main
>
>
> On Sat, May 25, 2024 at 9:08 PM Dewey Dunnington  
> wrote:
> >
> > +1 (binding)
> >
> > I ran ./verify-release-candidate.sh 0.5.0 0 on macOS (M1). Also see a
> > suite of successful verification runs from CI [1] and a matrix of Python
> > wheel builds [2].
> >
> > [1] https://github.com/apache/arrow-nanoarrow/actions/runs/9194767396
> > [2] https://github.com/apache/arrow-nanoarrow/actions/runs/9173434897
> >
> > On Fri, May 24, 2024 at 12:45 AM Vibhatha Abeykoon  
> > wrote:
> > >
> > > +1 (non-binding)
> > >
> > > I have tested on Ubuntu 22.04
> > >
> > > ./verify-release-candidate.sh 0.5.0 0
> > >
> > > With Regards,
> > > Vibhatha Abeykoon
> > >
> > >
> > > On Thu, May 23, 2024 at 3:21 PM Raúl Cumplido  wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > I've tested successfully on Ubuntu 22.04 without R.
> > > >
> > > > TEST_R=0 ./verify-release-candidate.sh 0.5.0 0
> > > >
> > > > Regards,
> > > > Raúl
> > > >
> > > > El jue, 23 may 2024 a las 6:49, David Li () 
> > > > escribió:
> > > > >
> > > > > +1 (binding)
> > > > >
> > > > > Tested on Debian 12 'bookworm'
> > > > >
> > > > > On Thu, May 23, 2024, at 11:03, Sutou Kouhei wrote:
> > > > > > +1 (binding)
> > > > > >
> > > > > > I ran the following command line on Debian GNU/Linux sid:
> > > > > >
> > > > > >   dev/release/verify-release-candidate.sh 0.5.0 0
> > > > > >
> > > > > > with:
> > > > > >
> > > > > >   * Apache Arrow C++ main
> > > > > >   * gcc (Debian 13.2.0-23) 13.2.0
> > > > > >   * R version 4.3.3 (2024-02-29) -- "Angel Food Cake"
> > > > > >   * Python 3.11.9
> > > > > >
> > > > > > Thanks,
> > > > > > --
> > > > > > kou
> > > > > >
> > > > > >
> > > > > > In 
> > > > > >  > > > >
> > > > > >   "[VOTE] Release Apache Arrow nanoarrow 0.5.0" on Wed, 22 May 2024
> > > > > > 15:17:40 -0300,
> > > > > >   Dewey Dunnington  wrote:
> > > > > >
> > > > > >> Hello,
> > > > > >>
> > > > > >> I would like to propose the following release candidate (rc0) of
> > > > > >> Apache Arrow nanoarrow [0] version 0.5.0. This is a release
> > > > > >> consisting of 79 resolved GitHub issues from 9 contributors [1].
> > > > > >>
> > > > > >> This release candidate is based on commit:
> > > > > >> c5fb10035c17b598e6fd688ad9eb7b874c7c631b [2]
> > > > > >>
> > > > > >> The source release rc0 is hosted at [3].
> > > > > >> The changelog is located at [4].
> > > > > >>
> > > > > >> Please download, verify checksums and signatures, run the unit 
> > > > > >> tests,
> > > > > >> and vote on the release. See [5] for how to validate a release
> > > > > >> candidate.
> > > > > >>
> > > > > >> The vote will be open for at least 72 hours.
> > > > > >>
> > > > > >> [ ] +1 Release this as Apache Arrow nanoarrow 0.5.0
> > > > > >> [ ] +0
> > > > > >> [ ] -1 Do not release this as Apache Arrow nanoarrow 0.5.0 
> > > > > >> because...
> > > > > >>
> > > > > >> [0] https://github.com/apache/arrow-nanoarrow
> > > > > >> [1] https://github.com/apache/arrow-nanoarrow/milestone/5?closed=1
> > > > > >> [2]
> > > > https://github.com/apache/arrow-nanoarrow/tree/apache-arrow-nanoarrow-0.5.0-rc0
> > > > > >> [3]
> > > > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-nanoarrow-0.5.0-rc0/
> > > > > >> [4]
> > > > https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.5.0-rc0/CHANGELOG.md
> > > > > >> [5]
> > > > https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/README.md
> > > >


[ANNOUNCE] Apache Arrow nanoarrow 0.5.0 Released

2024-05-29 Thread Dewey Dunnington
The Apache Arrow community is pleased to announce the 0.5.0 release of
Apache Arrow nanoarrow. This release covers 79 resolved issues
from 9 contributors[1].

The release is available now from [2], release notes are available at
[3], and a blog post highlighting new features and breaking changes is
available at [4].

What is Apache Arrow?
-
Apache Arrow is a columnar in-memory analytics layer designed to
accelerate big data. It houses a set of canonical in-memory
representations of flat and hierarchical data along with multiple
language-bindings for structure manipulation. It also provides
low-overhead streaming and batch messaging, zero-copy interprocess
communication (IPC), and vectorized in-memory analytics libraries.
Languages currently supported include C, C++, C#, Go, Java,
JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust.

What is Apache Arrow nanoarrow?
--
Apache Arrow nanoarrow is a C library for building and interpreting
Arrow C Data interface structures with bindings for users of R and
Python. The vision of nanoarrow is that it should be trivial for a
library or application to implement an Arrow-based interface. The
library provides helpers to create types, schemas, and metadata, an
API for building arrays element-wise,
and an API to extract elements element-wise from an array. For a more
detailed description of the features nanoarrow provides and motivation
for its development, see [5].

Please report any feedback to the mailing lists ([6], [7]).

Regards,
The Apache Arrow Community

[1] 
https://github.com/apache/arrow-nanoarrow/issues?q=milestone%3A%22nanoarrow+0.5.0%22+is%3Aclosed
[2] https://www.apache.org/dyn/closer.cgi/arrow/apache-arrow-nanoarrow-0.5.0
[3] 
https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.5.0/CHANGELOG.md
[4] https://arrow.apache.org/blog/2024/05/27/nanoarrow-0.5.0-release/
[5] https://arrow.apache.org/nanoarrow/
[6] https://lists.apache.org/list.html?u...@arrow.apache.org
[7] https://lists.apache.org/list.html?dev@arrow.apache.org


[RESULT] Release Apache Arrow nanoarrow 0.5.0

2024-05-25 Thread Dewey Dunnington
The vote carries with 4 binding +1s and 3 non-binding +1s. Thank you
everybody for voting!

There are still a few post-release tasks to complete that I will take
care of this week:

[x] Closed GitHub milestone
[x] Added release to the Apache Reporter System
[x] Uploaded artifacts to Subversion
[x] Created GitHub release
[ ] Submit R package to CRAN
[x] Submit Python package to PyPI
[ ] Update Python package on conda-forge
[ ] Finish release blog post at https://github.com/apache/arrow-site/pull/525
[ ] Sent announcement to annou...@apache.org
[x] Removed old artifacts from SVN
[x] Bumped versions on main


On Sat, May 25, 2024 at 9:08 PM Dewey Dunnington  wrote:
>
> +1 (binding)
>
> I ran ./verify-release-candidate.sh 0.5.0 0 on macOS (M1). Also see a
> suite of successful verification runs from CI [1] and a matrix of Python
> wheel builds [2].
>
> [1] https://github.com/apache/arrow-nanoarrow/actions/runs/9194767396
> [2] https://github.com/apache/arrow-nanoarrow/actions/runs/9173434897
>
> On Fri, May 24, 2024 at 12:45 AM Vibhatha Abeykoon  wrote:
> >
> > +1 (non-binding)
> >
> > I have tested on Ubuntu 22.04
> >
> > ./verify-release-candidate.sh 0.5.0 0
> >
> > With Regards,
> > Vibhatha Abeykoon
> >
> >
> > On Thu, May 23, 2024 at 3:21 PM Raúl Cumplido  wrote:
> >
> > > +1 (binding)
> > >
> > > I've tested successfully on Ubuntu 22.04 without R.
> > >
> > > TEST_R=0 ./verify-release-candidate.sh 0.5.0 0
> > >
> > > Regards,
> > > Raúl
> > >
> > > El jue, 23 may 2024 a las 6:49, David Li () escribió:
> > > >
> > > > +1 (binding)
> > > >
> > > > Tested on Debian 12 'bookworm'
> > > >
> > > > On Thu, May 23, 2024, at 11:03, Sutou Kouhei wrote:
> > > > > +1 (binding)
> > > > >
> > > > > I ran the following command line on Debian GNU/Linux sid:
> > > > >
> > > > >   dev/release/verify-release-candidate.sh 0.5.0 0
> > > > >
> > > > > with:
> > > > >
> > > > >   * Apache Arrow C++ main
> > > > >   * gcc (Debian 13.2.0-23) 13.2.0
> > > > >   * R version 4.3.3 (2024-02-29) -- "Angel Food Cake"
> > > > >   * Python 3.11.9
> > > > >
> > > > > Thanks,
> > > > > --
> > > > > kou
> > > > >
> > > > >
> > > > > In  > > >
> > > > >   "[VOTE] Release Apache Arrow nanoarrow 0.5.0" on Wed, 22 May 2024
> > > > > 15:17:40 -0300,
> > > > >   Dewey Dunnington  wrote:
> > > > >
> > > > >> Hello,
> > > > >>
> > > > >> I would like to propose the following release candidate (rc0) of
> > > > >> Apache Arrow nanoarrow [0] version 0.5.0. This is a release
> > > > >> consisting of 79 resolved GitHub issues from 9 contributors [1].
> > > > >>
> > > > >> This release candidate is based on commit:
> > > > >> c5fb10035c17b598e6fd688ad9eb7b874c7c631b [2]
> > > > >>
> > > > >> The source release rc0 is hosted at [3].
> > > > >> The changelog is located at [4].
> > > > >>
> > > > >> Please download, verify checksums and signatures, run the unit tests,
> > > > >> and vote on the release. See [5] for how to validate a release
> > > > >> candidate.
> > > > >>
> > > > >> The vote will be open for at least 72 hours.
> > > > >>
> > > > >> [ ] +1 Release this as Apache Arrow nanoarrow 0.5.0
> > > > >> [ ] +0
> > > > >> [ ] -1 Do not release this as Apache Arrow nanoarrow 0.5.0 because...
> > > > >>
> > > > >> [0] https://github.com/apache/arrow-nanoarrow
> > > > >> [1] https://github.com/apache/arrow-nanoarrow/milestone/5?closed=1
> > > > >> [2]
> > > https://github.com/apache/arrow-nanoarrow/tree/apache-arrow-nanoarrow-0.5.0-rc0
> > > > >> [3]
> > > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-nanoarrow-0.5.0-rc0/
> > > > >> [4]
> > > https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.5.0-rc0/CHANGELOG.md
> > > > >> [5]
> > > https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/README.md
> > >


Re: [VOTE] Release Apache Arrow nanoarrow 0.5.0

2024-05-25 Thread Dewey Dunnington
+1 (binding)

I ran ./verify-release-candidate.sh 0.5.0 0 on macOS (M1). Also see a
suite of successful verification runs from CI [1] and a matrix of Python
wheel builds [2].

[1] https://github.com/apache/arrow-nanoarrow/actions/runs/9194767396
[2] https://github.com/apache/arrow-nanoarrow/actions/runs/9173434897

On Fri, May 24, 2024 at 12:45 AM Vibhatha Abeykoon  wrote:
>
> +1 (non-binding)
>
> I have tested on Ubuntu 22.04
>
> ./verify-release-candidate.sh 0.5.0 0
>
> With Regards,
> Vibhatha Abeykoon
>
>
> On Thu, May 23, 2024 at 3:21 PM Raúl Cumplido  wrote:
>
> > +1 (binding)
> >
> > I've tested successfully on Ubuntu 22.04 without R.
> >
> > TEST_R=0 ./verify-release-candidate.sh 0.5.0 0
> >
> > Regards,
> > Raúl
> >
> > El jue, 23 may 2024 a las 6:49, David Li () escribió:
> > >
> > > +1 (binding)
> > >
> > > Tested on Debian 12 'bookworm'
> > >
> > > On Thu, May 23, 2024, at 11:03, Sutou Kouhei wrote:
> > > > +1 (binding)
> > > >
> > > > I ran the following command line on Debian GNU/Linux sid:
> > > >
> > > >   dev/release/verify-release-candidate.sh 0.5.0 0
> > > >
> > > > with:
> > > >
> > > >   * Apache Arrow C++ main
> > > >   * gcc (Debian 13.2.0-23) 13.2.0
> > > >   * R version 4.3.3 (2024-02-29) -- "Angel Food Cake"
> > > >   * Python 3.11.9
> > > >
> > > > Thanks,
> > > > --
> > > > kou
> > > >
> > > >
> > > > In  > >
> > > >   "[VOTE] Release Apache Arrow nanoarrow 0.5.0" on Wed, 22 May 2024
> > > > 15:17:40 -0300,
> > > >   Dewey Dunnington  wrote:
> > > >
> > > >> Hello,
> > > >>
> > > >> I would like to propose the following release candidate (rc0) of
> > > >> Apache Arrow nanoarrow [0] version 0.5.0. This is a release
> > > >> consisting of 79 resolved GitHub issues from 9 contributors [1].
> > > >>
> > > >> This release candidate is based on commit:
> > > >> c5fb10035c17b598e6fd688ad9eb7b874c7c631b [2]
> > > >>
> > > >> The source release rc0 is hosted at [3].
> > > >> The changelog is located at [4].
> > > >>
> > > >> Please download, verify checksums and signatures, run the unit tests,
> > > >> and vote on the release. See [5] for how to validate a release
> > > >> candidate.
> > > >>
> > > >> The vote will be open for at least 72 hours.
> > > >>
> > > >> [ ] +1 Release this as Apache Arrow nanoarrow 0.5.0
> > > >> [ ] +0
> > > >> [ ] -1 Do not release this as Apache Arrow nanoarrow 0.5.0 because...
> > > >>
> > > >> [0] https://github.com/apache/arrow-nanoarrow
> > > >> [1] https://github.com/apache/arrow-nanoarrow/milestone/5?closed=1
> > > >> [2]
> > https://github.com/apache/arrow-nanoarrow/tree/apache-arrow-nanoarrow-0.5.0-rc0
> > > >> [3]
> > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-nanoarrow-0.5.0-rc0/
> > > >> [4]
> > https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.5.0-rc0/CHANGELOG.md
> > > >> [5]
> > https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/README.md
> >


Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Dewey Dunnington
> > files). In some sense, the statistics actually *are* a property of the 
> > stream.
> >
> > In systems that I work on, we already use schema metadata to communicate 
> > information that is unrelated to the structure of the data. From my reading 
> > of the documentation [1], this sounds like a reasonable (and perhaps 
> > intended?) use of metadata, and nowhere is it mentioned that metadata must 
> > be used to determine schema equivalence. Unless there are other ways of 
> > producing stream-level application metadata outside of the schema/field 
> > metadata, the lack of purity was not a concern for me to begin with.
> >
> > I would appreciate an approach that communicates statistics via schema 
> > metadata, or at least in some in-band fashion that is consistent across the 
> > IPC and C data specifications. This would make it much easier to uniformly 
> > and transparently plumb statistics through applications, regardless of 
> > where they source Arrow data from. As developers are likely to create 
> > bespoke conventions for this anyways, it seems reasonable to standardize it 
> > as canonical metadata.
> >
> > I say this all as a happy user of DuckDB's Arrow scan functionality that is 
> > excited to see better query optimization capabilities. It's just that, in 
> > its current form, the changes in this proposal are not something I could 
> > foreseeably integrate with.
> >
> > Best,
> > Shoumyo
> >
> > [1]: 
> > https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
> >
> > From: dev@arrow.apache.org At: 05/23/24 10:10:51 UTC-4:00 To:
> > dev@arrow.apache.org
> > Subject: Re: [DISCUSS] Statistics through the C data interface
> >
> > I want to +1 on what Dewey is saying here and some comments.
> >
> > Sutou Kouhei wrote:
> > > ADBC may be a bit larger to use only for transmitting statistics. ADBC has
> > statistics related APIs but it has more other APIs.
> >
> > It's impossible to keep the responsibility of communication protocols
> > cleanly separated, but IMO, we should strive to keep the C Data
> > Interface more of a Transport Protocol than an Application Protocol.
> >
> > Statistics are application dependent and can complicate the
> > implementation of importers/exporters which would hinder the adoption
> > of the C Data Interface. Statistics also bring in security concerns
> > that are application-specific. e.g. can an algorithm trust min/max
> > stats and risk producing incorrect results if the statistics are
> > incorrect? A question that can't really be answered at the C Data
> > Interface level.
> >
> > The need for more sophisticated statistics only grows with time, so
> > there is no such thing as a "simple statistics schema".
> >
> > Protocols that produce/consume statistics might want to use the C Data
> > Interface as a primitive for passing Arrow arrays of statistics.
> >
> > ADBC might be too big of a leap in complexity now, but "we just need C
> > Data Interface + statistics" is unlikely to remain true for very long
> > as projects grow in complexity.
> >
> > --
> > Felipe
> >
> > On Thu, May 23, 2024 at 9:57 AM Dewey Dunnington
> >  wrote:
> > >
> > > Thank you for the background! I understand that these statistics are
> > > important for query planning; however, I am not sure that I follow why
> > > we are constrained to the ArrowSchema to represent them. The examples
> > > given seem to be going through Python...would it be easier to request
> > > statistics at a higher level of abstraction? There would already need
> > > to be a separate mechanism to request an ArrowArrayStream with
> > > statistics (unless the PyCapsule `requested_schema` argument would
> > > suffice).
> > >
> > > > ADBC may be a bit larger to use only for transmitting
> > > > statistics. ADBC has statistics related APIs but it has more
> > > > other APIs.
> > >
> > > Some examples of producers given in the linked threads (Delta Lake,
> > > Arrow Dataset) are well-suited to being wrapped by an ADBC driver. One
> > > can implement an ADBC driver without defining all the methods (where
> > > the producer could call AdbcConnectionGetStatistics(), although
> > > AdbcStatementGetStatistics() might be more relevant here and doesn't
> > > exist). One example listed (using an Arrow Table as a source) seems a
> > > bit light to wrap in an ADBC driver; however, it would not take much

Re: [VOTE] Release Apache Arrow ADBC 12 - RC4

2024-05-23 Thread Dewey Dunnington
The adbcdrivermanager, adbcsqlite, and adbcpostgresql packages are all
updated on CRAN!

On Tue, May 21, 2024 at 10:41 PM David Li  wrote:
>
> [x] Close the GitHub milestone/project
> [x] Add the new release to the Apache Reporter System
> [x] Upload source release artifacts to Subversion
> [x] Create the final GitHub release
> [x] Update website
> [x] Upload wheels/sdist to PyPI
> [x] Publish Maven packages
> [x] Update tags for Go modules
> [x] Deploy APT/Yum repositories
> [ ] Update R packages
> [x] Upload Ruby packages to RubyGems
> [x] Upload C#/.NET packages to NuGet
> [x] Update conda-forge packages
> [x] Announce the new release
> [x] Remove old artifacts
> [x] Bump versions
> [IN PROGRESS] Publish release blog post [2]
>
> @Dewey, I'd appreciate your help as always with the R packages :)
>
> [1]: https://github.com/apache/arrow-site/pull/523
>
> On Tue, May 21, 2024, at 09:00, Sutou Kouhei wrote:
> > +1 (binding)
> >
> > I ran the following on Debian GNU/Linux sid:
> >
> >   TEST_DEFAULT=0 \
> > TEST_SOURCE=1 \
> > LANG=C \
> > TZ=UTC \
> > JAVA_HOME=/usr/lib/jvm/default-java \
> > dev/release/verify-release-candidate.sh 12 4
> >
> >   TEST_DEFAULT=0 \
> > TEST_APT=1 \
> > LANG=C \
> > dev/release/verify-release-candidate.sh 12 4
> >
> >   TEST_DEFAULT=0 \
> > TEST_BINARY=1 \
> > LANG=C \
> > dev/release/verify-release-candidate.sh 12 4
> >
> >   TEST_DEFAULT=0 \
> > TEST_JARS=1 \
> > LANG=C \
> > dev/release/verify-release-candidate.sh 12 4
> >
> >   TEST_DEFAULT=0 \
> > TEST_WHEELS=1 \
> > TEST_PYTHON_VERSIONS=3.11 \
> > LANG=C \
> > TZ=UTC \
> > dev/release/verify-release-candidate.sh 12 4
> >
> >   TEST_DEFAULT=0 \
> > TEST_YUM=1 \
> > LANG=C \
> > dev/release/verify-release-candidate.sh 12 4
> >
> > with:
> >
> >   * g++ (Debian 13.2.0-23) 13.2.0
> >   * go version go1.22.2 linux/amd64
> >   * openjdk version "17.0.11" 2024-04-16
> >   * Python 3.11.9
> >   * ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux-gnu]
> >   * R version 4.3.3 (2024-02-29) -- "Angel Food Cake"
> >   * Apache Arrow 17.0.0-SNAPSHOT
> >
> > Note:
> >
> > I needed to install arrow-glib-devel explicitly to verify
> > Yum repository:
> >
> > 
> > diff --git a/dev/release/verify-yum.sh b/dev/release/verify-yum.sh
> > index f7f023611..ff30176f1 100755
> > --- a/dev/release/verify-yum.sh
> > +++ b/dev/release/verify-yum.sh
> > @@ -170,6 +170,7 @@ echo "::endgroup::"
> >
> >  echo "::group::Test ADBC Arrow GLib"
> >
> > +${install_command} --enablerepo=epel arrow-glib-devel
> >  ${install_command} --enablerepo=epel 
> > adbc-arrow-glib-devel-${package_version}
> >  ${install_command} --enablerepo=epel adbc-arrow-glib-doc-${package_version}
> >
> > 
> >
> > adbc-arrow-glib-devel depends on "pkgconfig(arrow-glib)" and
> > libarrow-glib-devel provided by EPEL also provides it:
> >
> > $ sudo dnf repoquery --deplist adbc-arrow-glib-devel-12
> > Last metadata expiration check: 2:01:21 ago on Mon May 20 21:17:44 2024.
> > package: adbc-arrow-glib-devel-12-1.el9.x86_64
> > ...
> >   dependency: pkgconfig(arrow-glib)
> >provider: arrow-glib-devel-16.1.0-1.el9.x86_64
> >provider: libarrow-glib-devel-9.0.0-11.el9.x86_64
> > ...
> >
> >
> > If I don't install arrow-glib-devel explicitly,
> > libarrow-glib-devel may be installed. We may need to add
> > "Conflicts: libarrow-glib-devel" to Apache Arrow's
> > arrow-glib-devel to resolve this case automatically. Anyway,
> > this is not a ADBC problem. So it's not a blocker.
> >
> >
> >
> > Thanks,
> > --
> > kou
> >
> >
> > In 
> >   "[VOTE] Release Apache Arrow ADBC 12 - RC4" on Wed, 15 May 2024
> > 14:00:33 +0900,
> >   "David Li"  wrote:
> >
> >> Hello,
> >>
> >> I would like to propose the following release candidate (RC4) of Apache 
> >> Arrow ADBC version 12. This is a release consisting of 56 resolved GitHub 
> >> issues [1].
> >>
> >> Please note that the versioning scheme has changed.  This is the 12th 
> >> release of ADBC, and so is called version "12".  The subcomponents, 
> >> however, are versioned independently:
> >>
> >> - C/C++/GLib/Go/Python/Ruby: 1.0.0
> >> - C#: 0.12.0
> >> - Java: 0.12.0
> >> - R: 0.12.0
> >> - Rust: 0.12.0
> >>
> >> These are the versions you will see in the source and in actual packages.  
> >> The next release will be "13", and the subcomponents will increment their 
> >> versions independently (to either 1.1.0, 0.13.0, or 1.0.0).  At this 
> >> point, there is no plan to release subcomponents independently from the 
> >> project as a whole.
> >>
> >> Please note that there is a known issue when using the Flight SQL and 
> >> Snowflake drivers at the same time on x86_64 macOS [12].
> >>
> >> This release candidate is based on commit: 
> >> 50cb9de621c4d72f4aefd18237cb4b73b82f4a0e [2]
> >>
> >> The source release rc4 is hosted at [3].
> >> The binary artifacts are hosted at [4][5][6][7][8].
> >> The changelog is located at [9].
> >>
> >> 

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Dewey Dunnington
Thank you for the background! I understand that these statistics are
important for query planning; however, I am not sure that I follow why
we are constrained to the ArrowSchema to represent them. The examples
given seem to be going through Python...would it be easier to request
statistics at a higher level of abstraction? There would already need
to be a separate mechanism to request an ArrowArrayStream with
statistics (unless the PyCapsule `requested_schema` argument would
suffice).

> ADBC may be a bit larger to use only for transmitting
> statistics. ADBC has statistics related APIs but it has more
> other APIs.

Some examples of producers given in the linked threads (Delta Lake,
Arrow Dataset) are well-suited to being wrapped by an ADBC driver. One
can implement an ADBC driver without defining all the methods (where
the producer could call AdbcConnectionGetStatistics(), although
AdbcStatementGetStatistics() might be more relevant here and doesn't
exist). One example listed (using an Arrow Table as a source) seems a
bit light to wrap in an ADBC driver; however, it would not take much
code to do so and the overhead of getting the reader via ADBC is
something like 100 microseconds (tested via the ADBC R package's
"monkey driver" which wraps an existing stream as a statement). In any
case, the bulk of the code is building the statistics array.

> How about the following schema for the
> statistics ArrowArray? It's based on ADBC.

Whatever format for statistics is decided on, I imagine it should be
exactly the same as the ADBC standard? (Perhaps pushing changes
upstream if needed?).

On Thu, May 23, 2024 at 3:21 AM Sutou Kouhei  wrote:
>
> Hi,
>
> > Why not simply pass the statistics ArrowArray separately in your
> > producer API of choice
>
> It seems that we should use the approach because all
> feedback said so. How about the following schema for the
> statistics ArrowArray? It's based on ADBC.
>
> | Field Name   | Field Type| Comments |
> |--------------------------|-----------------------|----------|
> | column_name  | utf8  | (1)  |
> | statistic_key| utf8 not null | (2)  |
> | statistic_value  | VALUE_SCHEMA not null |  |
> | statistic_is_approximate | bool not null | (3)  |
>
> 1. If null, then the statistic applies to the entire table.
>It's for "row_count".
> 2. We'll provide pre-defined keys such as "max", "min",
>"byte_width" and "distinct_count" but users can also use
>application specific keys.
> 3. If true, then the value is approximate or best-effort.
>
> VALUE_SCHEMA is a dense union with members:
>
> | Field Name | Field Type |
> |------------|------------|
> | int64  | int64  |
> | uint64 | uint64 |
> | float64| float64|
> | binary | binary |
>
> If a column is an int32 column, it uses int64 for
> "max"/"min". We don't provide all types here. Users should
> use a compatible type (int64 for a int32 column) instead.
>
>
> Thanks,
> --
> kou
>
> In 
>   "Re: [DISCUSS] Statistics through the C data interface" on Wed, 22 May 2024 
> 17:04:57 +0200,
>   Antoine Pitrou  wrote:
>
> >
> > Hi Kou,
> >
> > I agree that Dewey that this is overstretching the capabilities of the
> > C Data Interface. In particular, stuffing a pointer as metadata value
> > and decreeing it immortal doesn't sound like a good design decision.
> >
> > Why not simply pass the statistics ArrowArray separately in your
> > producer API of choice (Dewey mentioned ADBC but it is of course just
> > a possible API among others)?
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 22/05/2024 à 04:37, Sutou Kouhei a écrit :
> >> Hi,
> >> We're discussing how to provide statistics through the C
> >> data interface at:
> >> https://github.com/apache/arrow/issues/38837
> >> If you're interested in this feature, could you share your
> >> comments?
> >> Motivation:
> >> We can interchange Apache Arrow data by the C data interface
> >> in the same process. For example, we can pass Apache Arrow
> >> data read by Apache Arrow C++ (provider) to DuckDB
> >> (consumer) through the C data interface.
> >> A provider may know Apache Arrow data statistics. For
> >> example, a provider can know statistics when it reads Apache
> >> Parquet data because Apache Parquet may provide statistics.
> >> But a consumer can't know statistics that are known by a
> >> producer. Because there isn't a standard way to provide
> >> statistics through the C data interface. If a consumer can
> >> know statistics, it can process Apache Arrow data faster
> >> based on statistics.
> >> Proposal:
> >> https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
> >> How about providing statistics as a metadata in ArrowSchema?
> >> We reserve "ARROW" namespace for internal Apache Arrow use:
> >> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
> >>
> >>> The ARROW pattern is a reserved 

[VOTE] Release Apache Arrow nanoarrow 0.5.0

2024-05-22 Thread Dewey Dunnington
Hello,

I would like to propose the following release candidate (rc0) of
Apache Arrow nanoarrow [0] version 0.5.0. This is an initial release
consisting of 79 resolved GitHub issues from 9 contributors [1].

This release candidate is based on commit:
c5fb10035c17b598e6fd688ad9eb7b874c7c631b [2]

The source release rc0 is hosted at [3].
The changelog is located at [4].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. See [5] for how to validate a release
candidate.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow nanoarrow 0.5.0
[ ] +0
[ ] -1 Do not release this as Apache Arrow nanoarrow 0.5.0 because...

[0] https://github.com/apache/arrow-nanoarrow
[1] https://github.com/apache/arrow-nanoarrow/milestone/5?closed=1
[2] 
https://github.com/apache/arrow-nanoarrow/tree/apache-arrow-nanoarrow-0.5.0-rc0
[3] 
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-nanoarrow-0.5.0-rc0/
[4] 
https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.5.0-rc0/CHANGELOG.md
[5] https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/README.md


Re: [DISCUSS] Statistics through the C data interface

2024-05-22 Thread Dewey Dunnington
I am definitely in favor of adding (or adopting an existing)
ABI-stable way to transmit statistics (the one that comes up most
frequently for me is just the number of values that are about to show
up in an ArrowArrayStream, since the producer often knows this and the
consumer often would like to preallocate).

I am skeptical of using the existing C ArrowSchema ABI to do this. The
ArrowSchema is exceptionally good at representing Arrow data types
(which in the presence of dictionaries, nested types, and extensions,
is difficult to do); however, using it to handle all aspects of a
consumer request/producer response I think dilutes its ability to do
this well.

If I'm understanding the proposal (and I may not be), the ArrowSchema
will be used to encode data-dependent values, which means the same
ArrowSchema is very tightly paired to a particular array stream (or
array). This means that one could no longer (e.g.) consume an array
stream and blindly assign each array in the stream the schema that was
returned by get_schema(). This is not impossible to work around but it
is a conceptual departure from the role the ArrowSchema has had in the
past. Encoding pointers as strings in metadata is also a departure
from what we have done previously.

It is possible to condense the boilerplate of an ADBC driver to about
10 lines of code [1]. Is there a reason we can't use ADBC (or an
extension to that standard) to more precisely handle those types of
requests/responses (and extensions to them that come up in the
future)? It is also not the first time it has come up to encode
data-dependent information in a schema (e.g., encoding scalar/record
batch-ness), so perhaps there is a need for another type of array
stream or descriptor struct?

[1] 
https://github.com/apache/arrow-adbc/blob/a40cf88408d6cb776cedeaa4d1d0945675c156cc/c/driver/common/driver_test.cc#L56-L66

On Wed, May 22, 2024 at 8:15 AM Raphael Taylor-Davies
 wrote:
>
> Hi,
>
> One potential challenge with encoding statistics in the schema metadata
> is that some systems may consider this metadata as part of assessing
> schema equivalence.
>
> However, I think the bigger question is what the intended use-case for
> these statistics is? Often query engines want to collect statistics from
> multiple containers in one go, as this allows for efficient vectorised
> pruning across multiple files, row groups, etc... I therefore wonder if
> the solution is simply to return separate arrays of min, max, etc...
> potentially even grouped together into a single StructArray?
>
> This would have the benefit of not needing specification changes, whilst
> being significantly more efficient than an approach centered on scalar
> statistics. FWIW this is the approach taken by DataFusion for pruning
> statistics [1], and in arrow-rs we represent scalars as arrays to avoid
> needing to define a parallel serialization standard [2].
>
> Kind Regards,
>
> Raphael
>
> [1]:
> https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/trait.PruningStatistics.html
> [2]: https://github.com/apache/arrow-rs/pull/4393
>
> On 22/05/2024 03:37, Sutou Kouhei wrote:
> > Hi,
> >
> > We're discussing how to provide statistics through the C
> > data interface at:
> > https://github.com/apache/arrow/issues/38837
> >
> > If you're interested in this feature, could you share your
> > comments?
> >
> >
> > Motivation:
> >
> > We can interchange Apache Arrow data by the C data interface
> > in the same process. For example, we can pass Apache Arrow
> > data read by Apache Arrow C++ (provider) to DuckDB
> > (consumer) through the C data interface.
> >
> > A provider may know Apache Arrow data statistics. For
> > example, a provider can know statistics when it reads Apache
> > Parquet data because Apache Parquet may provide statistics.
> >
> > But a consumer can't know statistics that are known by a
> > producer, because there isn't a standard way to provide
> > statistics through the C data interface. If a consumer can
> > know statistics, it can process Apache Arrow data faster
> > based on statistics.
> >
> >
> > Proposal:
> >
> > https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
> >
> > How about providing statistics as metadata in ArrowSchema?
> >
> > We reserve "ARROW" namespace for internal Apache Arrow use:
> >
> > https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
> >
> >> The ARROW pattern is a reserved namespace for internal
> >> Arrow use in the custom_metadata fields. For example,
> >> ARROW:extension:name.
> > So we can use "ARROW:statistics" for the metadata key.
> >
> > We can represent statistics as an ArrowArray like ADBC does.
> >
> > Here is an example ArrowSchema that is for a record batch
> > that has "int32 column1" and "string column2":
> >
> > ArrowSchema {
> >.format = "+siu",
> >.metadata = {
> >  "ARROW:statistics" => ArrowArray*, /* table-level statistics such as 
> > row count */
> >},
> >.children 

Re: [VOTE] Release Apache Arrow ADBC 12 - RC4

2024-05-17 Thread Dewey Dunnington
+1 (binding)

Tested with MacOS M1 using TEST_YUM=0 TEST_APT=0 USE_CONDA=1
./verify-release-candidate.sh 12 4

On Fri, May 17, 2024 at 9:46 AM Jean-Baptiste Onofré  wrote:
>
> +1 (non binding)
>
> Testing on MacOS M2.
>
> Regards
> JB
>
> On Wed, May 15, 2024 at 7:00 AM David Li  wrote:
> >
> > Hello,
> >
> > I would like to propose the following release candidate (RC4) of Apache 
> > Arrow ADBC version 12. This is a release consisting of 56 resolved GitHub 
> > issues [1].
> >
> > Please note that the versioning scheme has changed.  This is the 12th 
> > release of ADBC, and so is called version "12".  The subcomponents, 
> > however, are versioned independently:
> >
> > - C/C++/GLib/Go/Python/Ruby: 1.0.0
> > - C#: 0.12.0
> > - Java: 0.12.0
> > - R: 0.12.0
> > - Rust: 0.12.0
> >
> > These are the versions you will see in the source and in actual packages.  
> > The next release will be "13", and the subcomponents will increment their 
> > versions independently (to either 1.1.0, 0.13.0, or 1.0.0).  At this point, 
> > there is no plan to release subcomponents independently from the project as 
> > a whole.
> >
> > Please note that there is a known issue when using the Flight SQL and 
> > Snowflake drivers at the same time on x86_64 macOS [12].
> >
> > This release candidate is based on commit: 
> > 50cb9de621c4d72f4aefd18237cb4b73b82f4a0e [2]
> >
> > The source release rc4 is hosted at [3].
> > The binary artifacts are hosted at [4][5][6][7][8].
> > The changelog is located at [9].
> >
> > Please download, verify checksums and signatures, run the unit tests, and 
> > vote on the release. See [10] for how to validate a release candidate.
> >
> > See also a verification result on GitHub Actions [11].
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow ADBC 12
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow ADBC 12 because...
> >
> > Note: to verify APT/YUM packages on macOS/AArch64, you must `export 
> > DOCKER_DEFAULT_PLATFORM=linux/amd64`. (Or skip this step by `export 
> > TEST_APT=0 TEST_YUM=0`.)
> >
> > [1]: 
> > https://github.com/apache/arrow-adbc/issues?q=is%3Aissue+milestone%3A%22ADBC+Libraries+12%22+is%3Aclosed
> > [2]: 
> > https://github.com/apache/arrow-adbc/commit/50cb9de621c4d72f4aefd18237cb4b73b82f4a0e
> > [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-adbc-12-rc4/
> > [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> > [5]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> > [6]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> > [7]: 
> > https://repository.apache.org/content/repositories/staging/org/apache/arrow/adbc/
> > [8]: 
> > https://github.com/apache/arrow-adbc/releases/tag/apache-arrow-adbc-12-rc4
> > [9]: 
> > https://github.com/apache/arrow-adbc/blob/apache-arrow-adbc-12-rc4/CHANGELOG.md
> > [10]: 
> > https://arrow.apache.org/adbc/main/development/releasing.html#how-to-verify-release-candidates
> > [11]: https://github.com/apache/arrow-adbc/actions/runs/9089931356
> > [12]: https://github.com/apache/arrow-adbc/issues/1841


Re: [ANNOUNCE] New Arrow committer: Dane Pitkin

2024-05-07 Thread Dewey Dunnington
Congrats!

On Tue, May 7, 2024 at 11:55 AM Raúl Cumplido  wrote:
>
> Congratulations Dane!
>
> El mar, 7 may 2024, 16:32, Weston Pace  escribió:
>
> > Congrats Dane!
> >
> > On Tue, May 7, 2024, 7:30 AM Nic Crane  wrote:
> >
> > > Congrats Dane, well deserved!
> > >
> > > On Tue, 7 May 2024 at 15:16, Gang Wu  wrote:
> > > >
> > > > Congratulations Dane!
> > > >
> > > > Best,
> > > > Gang
> > > >
> > > > On Tue, May 7, 2024 at 10:12 PM Ian Cook  wrote:
> > > >
> > > > > Congratulations Dane!
> > > > >
> > > > > On Tue, May 7, 2024 at 10:10 AM Alenka Frim wrote:
> > > > >
> > > > > > Yay, congratulations Dane!!
> > > > > >
> > > > > > On Tue, May 7, 2024 at 4:00 PM Rok Mihevc 
> > > wrote:
> > > > > >
> > > > > > > Congrats Dane!
> > > > > > >
> > > > > > > Rok
> > > > > > >
> > > > > > > On Tue, May 7, 2024 at 3:57 PM wish maple <
> > maplewish...@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Congrats!
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Xuwei Fu
> > > > > > > >
> > > > > > > > Joris Van den Bossche wrote on Tue, May 7, 2024 at 21:53:
> > > > > > > >
> > > > > > > > > On behalf of the Arrow PMC, I'm happy to announce that Dane
> > > Pitkin
> > > > > > has
> > > > > > > > > accepted an invitation to become a committer on Apache Arrow.
> > > > > > Welcome,
> > > > > > > > > and thank you for your contributions!
> > > > > > > > >
> > > > > > > > > Joris
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > >
> >


Re: [Discuss] Extension types based on canonical extension types?

2024-04-30 Thread Dewey Dunnington
I don't think there is any current barrier to using implementation
features of one extension type to help with another. In Python, for
example, one might be able to do:

class GeoJSONExtensionType(pa.ExtensionType):

    def __init__(self):
        self._json_ext = pa.JSONExtensionType()

    def some_action(self):
        return self._json_ext.some_action()

One could do something similar with the Array/Scalar classes. I am not
sure there is anything "automatic" that any current implementation
would be able to offer even if this information were machine
parseable. The only thing I can think of is that implementations like
Arrow C++ that aggressively drop extension information might be able
to drop the extension type by assigning a different one; however, I am
not sure that it would be useful enough to ever be implemented.

-dewey

On Tue, Apr 30, 2024 at 1:31 PM Ian Cook  wrote:
>
> But consider that a user might want to define a
> non-canonical HLLSKETCH extension type and make use of Arrow
> implementations' features for handling JSON canonical extension type
> columns in order to handle HLLSKETCH extension type columns. The spec
> currently does not provide any means to enable this. I wonder if we should
> consider incorporating something like this into the spec.
>
> For example, maybe the colon character could have the special meaning
> "represented as" in extension type names, so that implementations would
> recognize "hllsketch:arrow.json" as meaning: a column with extension type
> hllsketch, which is represented as in the JSON canonical extension type.
>
> Ian
>
> On Tue, Apr 30, 2024 at 11:51 AM Weston Pace  wrote:
>
> > I think "inheritance" and "composition" are more concerns for
> > implementations than they are for spec (I could be wrong here).
> >
> > So it seems that it would be sufficient to write the HLLSKETCH's canonical
> > definition as "this is an extension of the JSON logical type and supports
> > all the same storage types" and then allow implementations to use whatever
> > inheritance / composition scheme they want to behind the scenes.
> >
> > On Tue, Apr 30, 2024 at 7:47 AM Matt Topol  wrote:
> >
> > > I think the biggest blocker to doing this is the way that we pass
> > extension
> > > types through IPC. Extension types are sent as their underlying storage
> > > type with metadata key-value pairs of specific keys
> > "ARROW:extension:name"
> > > and "ARROW:extension:metadata". Since you can't have multiple values for
> > > the same key in the metadata, this would prevent the ability to define an
> > > extension type in terms of another extension type as you wouldn't be able
> > > to include the metadata for the second-level extension part.
> > >
> > > i.e. you'd be able to have "ARROW:extension:name" => "HLLSKETCH", but you
> > > wouldn't be able to *also* have "ARROW:extension:name" => "JSON" for its
> > > storage type. So the storage type needs to be a valid core Arrow data
> > type
> > > for this reason.
> > >
> > > On Tue, Apr 30, 2024 at 10:16 AM Ian Cook  wrote:
> > >
> > > > The vote on adding a JSON canonical extension type [1] got me
> > wondering:
> > > Is
> > > > it possible to define an extension type that is based on a canonical
> > > > extension type? If so, how?
> > > >
> > > > For example, say I wanted to define a (non-canonical) HLLSKETCH
> > extension
> > > > type that corresponds to the type that Redshift uses for HyperLogLog
> > > > sketches and is represented as JSON [2]. Is there a way to do this by
> > > > building on the JSON canonical extension type?
> > > >
> > > > [1] https://lists.apache.org/thread/4dw3dnz6rjp5wz2240mn299p51d5tvtq
> > > > [2]
> > https://docs.aws.amazon.com/redshift/latest/dg/r_HLLSKTECH_type.html
> > > >
> > > > Ian
> > > >
> > >
> >


Re: ADBC - OS-level driver manager

2024-04-24 Thread Dewey Dunnington
I definitely see the problem here: we don't currently provide a way
for something like a Microsoft Excel or PowerBI or Tableau to use ADBC
drivers without bundling all of the ones they want to support or
requiring/embedding Python or R. I also see how this is a particular
problem for Windows and MacOS: on Linux we already provide system
packages, although I don't know the extent to which they are used in
the wild. I can also see that while it's "easy" to bundle/redistribute
the drivers in arrow-adbc for multiple package managers, some
proprietary database vendor might not want to do that (and I think
that we *do* want proprietary database vendors to distribute ADBC
drivers).

I am not sure that the behaviour should be baked into the driver
manager itself, but it does seem reasonable to establish a convention,
even if that convention is "wherever the system looks for shared
libraries" (e.g., Matt's LD_LIBRARY_PATH comment). I can't envision us
providing Windows-specific installers for drivers in the arrow-adbc
directory, but if somebody else is going to do this it might make
sense to establish that convention ourselves.

-dewey
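As a sketch of what an environment-variable search-path convention could look like — the variable name, delimiter, and precedence below are all hypothetical, not an ADBC API:

```python
import os
import tempfile

def find_adbc_driver(name, env_var="ADBC_DRIVER_SEARCH_PATH"):
    """Resolve a bare driver filename against a delimited list of
    directories, falling back to the bare name so the system loader
    can apply its own search (e.g. LD_LIBRARY_PATH)."""
    if os.path.isabs(name):
        return name  # full paths are used as-is
    for directory in os.environ.get(env_var, "").split(os.pathsep):
        candidate = os.path.join(directory, name)
        if directory and os.path.exists(candidate):
            return candidate
    return name

# Demonstrate resolution against a temporary "driver" file.
with tempfile.TemporaryDirectory() as tmp:
    driver_path = os.path.join(tmp, "libadbc_driver_example.so")
    open(driver_path, "w").close()
    os.environ["ADBC_DRIVER_SEARCH_PATH"] = tmp
    resolved = find_adbc_driver("libadbc_driver_example.so")
```

The application-specific / user-level / system-level fallback Ian describes would just be a fixed list of directories appended after the environment variable's entries.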

On Tue, Apr 23, 2024 at 10:50 PM Ian Cook  wrote:
>
> Ha—no, I was thinking of a special ADBC-specific environment variable,
> which would work irrespective of the OS.
>
> On Tue, Apr 23, 2024 at 21:38 Matt Topol  wrote:
>
> > An environment variable like LD_LIBRARY_PATH perhaps? =p
> >
> > On Tue, Apr 23, 2024, 8:40 PM Ian Cook  wrote:
> >
> > > What if the driver managers respected an environment variable containing
> > a
> > > delimited list of driver search paths? I think that would get us closer
> > to
> > > having true system-level configurability while mostly avoiding surprises
> > > and inflexibility.
> > >
> > > Ian
> > >
> > > On Tue, Apr 23, 2024 at 8:22 PM David Li  wrote:
> > >
> > > > I'd rather not hard code it directly into the manager, both because
> > this
> > > > may surprise applications that don't want it and would be inflexible
> > for
> > > > applications who are looking to use it, but providing an additional
> > list
> > > of
> > > > search paths that (say) Excel can configure + some platform-specific
> > > > guidance on a standard list seems reasonable.
> > > >
> > > > On Wed, Apr 24, 2024, at 02:45, Ian Cook wrote:
> > > > > I wonder if there is a relatively simple way to solve this problem.
> > The
> > > > > ADBC driver manager libraries already make it possible to dynamically
> > > > load
> > > > > drivers, and I believe these libraries already allow the user to
> > > specify
> > > > > which driver to use by passing either a bare filename or a full file
> > > > path.
> > > > >
> > > > > So perhaps we could simply establish an ordered list of standard
> > > > directory
> > > > > locations in which the ADBC driver manager will look for drivers when
> > > > they
> > > > > are specified by bare filename. We would have to specify this
> > > differently
> > > > > for each mainstream type of OS, but I think that is doable. This
> > could
> > > be
> > > > > codified in the ADBC docs and implemented in the ADBC driver
> > managers.
> > > > > Anyone looking to achieve system-wide ADBC driver "registration"
> > could
> > > > take
> > > > > advantage of this, whereas anyone who prefers application-specific
> > > > > implementation could safely ignore it.
> > > > >
> > > > > I suspect that we would want the driver manager to look first in
> > > > > application-specific directories (which might vary depending on which
> > > > ADBC
> > > > > driver language library one is using), then fall back on user-level
> > > > config
> > > > > directories, then finally fall back on system-level config
> > directories.
> > > > >
> > > > > I believe that Windows, macOS, and Linux distros all have standard
> > > > > user-level and system-level config directories that are often used
> > for
> > > > this
> > > > > type of thing.
> > > > >
> > > > > Does this seem reasonable? Are there any gotchas that would prevent
> > an
> > > > > approach like this from working?
> > > > >
> > > > > Ian
> > > > >
> > > > > On Mon, Apr 1, 2024 at 5:44 PM Curt Hagenlocher <
> > c...@hagenlocher.org>
> > > > > wrote:
> > > > >
> > > > >> The advantage to system-wide registration of drivers (however that's
> > > > >> accomplished) is of course that it allows driver authors to provide
> > a
> > > > >> single installer or set of instructions for the driver to be
> > installed
> > > > >> without regard for different usage scenarios. So if Tableau and
> > Excel
> > > > can
> > > > >> both use ODBC drivers, then I (as a hypothetical author of a niche
> > > > driver)
> > > > >> don't have to solve N installation problems for N possible use
> > cases.
> > > > And
> > > > >> my spouse (as a non-developer finance user) can just run one
> > installer
> > > > and
> > > > >> know that the data source will be available in multiple tools. Or at
> > > > least
> > > > >> that's the principle.
> > > > >>
> > > > >> For a 

Re: [DISCUSSION] New Flags for Arrow C Interface Schema

2024-04-22 Thread Dewey Dunnington
Thank you for the background!

I still wonder if these distinctions are the responsibility of the
ArrowSchema to communicate (although perhaps links to the specific
discussions would help highlight use-cases that I am not envisioning).
I think these distinctions are definitely important in the contexts
you mentioned; however, I am not sure that the FFI layer is going to
be helpful.

> In the libcudf situation, it came up with what happens if you pass a 
> non-struct
> column to the from_arrow_device method which returns a cudf::table? Should
> it error, or should it create a table with a single column?

I suppose that I would have expected two functions (one to create a
table and one to create a column). As a consumer I can't envision a
situation where I would want to import an ArrowDeviceArray but where I
would want some piece of run-time information to decide what the
return type of the function would be? (With apologies if I am missing
a piece of the discussion).

> If A and B have different lengths, this is invalid

I believe several array implementations (e.g., numpy, R) are able to
broadcast/recycle a length-1 array. Run-end-encoding is also an option
that would make that broadcast explicit without expanding the scalar.

> Depending on the function in question, it could be valid to pass a struct 
> column vs a record batch with different results.

If this is an important distinction for an FFI signature of a UDF,
there would probably be a struct definition for the UDF where there
would be an opportunity to make this distinction (and perhaps others
that are relevant) without loading this concept onto the existing
structs.

> If no flags are set, then the behavior shouldn't change
> from what it is now. If the ARROW_FLAG_RECORD_BATCH flag is set, then it
> should error unless calling ImportRecordBatch.

I am not sure I would have expected that (since a struct array has an
unambiguous interpretation as a record batch and as a user I've very
explicitly decided that I want one, since I'm using that function).

In the other direction, I am not sure a producer would be able to set
these flags without breaking backwards compatibility with earlier
producers that did not set them (since earlier threads have suggested
that it is good practice to error when an unsupported flag is
encountered).

On Sun, Apr 21, 2024 at 6:16 PM Matt Topol  wrote:
>
> First, I forgot a flag in my examples. There should also be an
> ARROW_FLAG_SCALAR too!
>
> The motivation for this distinction came up from discussions during adding
> support for ArrowDeviceArray to libcudf in order to better indicate the
> difference between a cudf::table and a cudf::column which are handled quite
> differently. This also relates to the fact that we currently need external
> context like the explicit ImportArray() and ImportRecordBatch() functions
> since we can't determine which a given ArrowArray is on its own. In the
> libcudf situation, it came up with what happens if you pass a non-struct
> column to the from_arrow_device method which returns a cudf::table? Should
> it error, or should it create a table with a single column?
>
> The other motivation for this distinction is with UDFs in an engine that
> uses the C data interface. When dealing with queries and engines, it
> becomes important to be able to distinguish between a record batch, a
> column and a scalar. For example, take the expression A + B:
>
> If A and B have different lengths, this is invalid, unless one of them
> is a Scalar. This is because Scalars are broadcastable; columns are not.
>
> Depending on the function in question, it could be valid to pass a struct
> column vs a record batch with different results. It also resolves some
> ambiguity for UDFs and processing. For instance, given a single ArrowArray
> of length 1, which is a struct: Is that a Struct Column? A Record Batch? or
> is it a scalar? There's no way to know what the producer's intention was or
> the context without these flags or having to side-channel the information
> somehow.
>
> > It seems like it may cause some ambiguous
> situations...should C++'s ImportArray() error, for example, if the
> schema has a ARROW_FLAG_RECORD_BATCH flag?
>
> I would argue yes. If no flags are set, then the behavior shouldn't change
> from what it is now. If the ARROW_FLAG_RECORD_BATCH flag is set, then it
> should error unless calling ImportRecordBatch. It allows the producer to
> provide context as to the source and intention of the structure of the data.
>
> --Matt
>
> On Fri, Apr 19, 2024 at 8:24 PM Dewey Dunnington
>  wrote:
>
> > Thanks for bringing this up!
> >
> > Could you share the motivation where this distinction is important in
> > the context of transfer across the C data interface? The "struct ==
> > recor

Re: [DISCUSSION] New Flags for Arrow C Interface Schema

2024-04-19 Thread Dewey Dunnington
Thanks for bringing this up!

Could you share the motivation where this distinction is important in
the context of transfer across the C data interface? The "struct ==
record batch" concept has always made sense to me because in R, a
data.frame can have a column that is also a data.frame and there is no
distinction between the two. It seems like it may cause some ambiguous
situations...should C++'s ImportArray() error, for example, if the
schema has a ARROW_FLAG_RECORD_BATCH flag?

Cheers,

-dewey

On Fri, Apr 19, 2024 at 6:34 PM Matt Topol  wrote:
>
> Hey everyone,
>
> With some of the other developments surrounding libraries adopting the
> Arrow C Data interfaces, there's been a consistent question about handling
> tables (record batch) vs columns vs scalars.
>
> Right now, a Record Batch is sent through the C interface as a struct
> column whose children are the individual columns of the batch and a Scalar
> would be sent through as just an array of length 1. Applications would have
> to create their own contextual way of indicating whether the Array being
> passed should be interpreted as just a single array/column or should be
> treated as a full table/record batch.
>
> Rather than introducing new members or otherwise complicating the structs,
> I wanted to gauge how people felt about introducing new flags for the
> ArrowSchema object.
>
> Right now, we only have 3 defined flags:
>
> ARROW_FLAG_DICTIONARY_ORDERED
> ARROW_FLAG_NULLABLE
> ARROW_FLAG_MAP_KEYS_SORTED
>
> The flags member of the struct is an int64, so we have another 61 bits to
> play with! If no one has any strong objections, I wanted to propose adding
> at least 2 new flags:
>
> ARROW_FLAG_RECORD_BATCH
> ARROW_FLAG_SINGLE_COLUMN
>
> If neither flag is set, then it is contextual as to whether it should be
> expected that the corresponding data is a table or a single column. If
> ARROW_FLAG_RECORD_BATCH is set, then the corresponding data MUST be a
> struct array and should be interpreted as a record batch by any consumers
> (erroring otherwise). If ARROW_FLAG_SINGLE_COLUMN is set, then the
> corresponding ArrowArray should be interpreted and utilized as a single
> array/column regardless of its type.
>
> This provides a standardized way for producers of Arrow data to indicate in
> the schema to consumers how the data they produced should be used (as a
> table or column) rather than forcing everyone to come up with their own
> contextualized way of handling things (extra arguments, differently named
> functions for RecordBatch / Array, etc.).
>
> If there's no objections to this, I'll take a pass at implementing these
> flags in C++ and Go to put up a PR and make a Vote thread. I just wanted to
> see what others on the mailing list thought before I go ahead and put
> effort into this.
>
> Thanks everyone! Take care!
>
> --Matt


Re: [ANNOUNCE] New Arrow committer: Sarah Gilmore

2024-04-11 Thread Dewey Dunnington
Congratulations!

On Thu, Apr 11, 2024 at 2:23 PM Alenka Frim
 wrote:
>
> Congratulations Sarah!
>
> On Thu, Apr 11, 2024 at 6:21 PM Ruoxi Sun  wrote:
>
> > Congrats!
> >
> > *Regards,*
> > *Rossi SUN*
> >
> >
> > Weston Pace wrote on Fri, Apr 12, 2024 at 00:13:
> >
> > > Congratulations!
> > >
> > > On Thu, Apr 11, 2024 at 9:12 AM wish maple 
> > wrote:
> > >
> > > > Congrats!
> > > >
> > > > Best,
> > > > Xuwei Fu
> > > >
> > > > Kevin Gurney wrote on Thu, Apr 11, 2024 at 23:22:
> > > >
> > > > > Congratulations, Sarah!! Well deserved!
> > > > > 
> > > > > From: Jacob Wujciak 
> > > > > Sent: Thursday, April 11, 2024 11:14 AM
> > > > > To: dev@arrow.apache.org 
> > > > > Subject: Re: [ANNOUNCE] New Arrow committer: Sarah Gilmore
> > > > >
> > > > > Congratulations and welcome!
> > > > >
> > > > > On Thu, Apr 11, 2024 at 17:11, Raúl Cumplido (rau...@apache.org) wrote:
> > > > >
> > > > > > Congratulations Sarah!
> > > > > >
> > > > > > On Thu, Apr 11, 2024 at 13:13, Sutou Kouhei wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > On behalf of the Arrow PMC, I'm happy to announce that Sarah
> > > > > > > Gilmore has accepted an invitation to become a committer on
> > > > > > > Apache Arrow. Welcome, and thank you for your contributions!
> > > > > > >
> > > > > > > Thanks,
> > > > > > > --
> > > > > > > kou
> > > > > >
> > > > >
> > > >
> > >
> >


Re: Unsupported/Other Type

2024-04-11 Thread Dewey Dunnington
Depending where your Arrow-encoded data is used, either extension
types or generic field metadata are options. We have this problem in
the ADBC Postgres driver, where we can convert *most* Postgres types
to an Arrow type but there are some others where we can't or don't
know or don't implement a conversion. Currently for these we return
opaque binary (the Postgres COPY representation of the value) but put
field metadata so that a consumer can implement a workaround for an
unsupported type. It would be arguably better to have implemented this
as an extension type; however, field metadata felt like less of a
commitment when I first worked on this.

Cheers,

-dewey

On Thu, Apr 11, 2024 at 1:20 PM Norman Jordan
 wrote:
>
> I was using UUID as an example. It looks like extension types cover my 
> original request.
> 
> From: Felipe Oliveira Carvalho 
> Sent: Thursday, April 11, 2024 7:15 AM
> To: dev@arrow.apache.org 
> Subject: Re: Unsupported/Other Type
>
> The OP used UUID as an example. Would that be enough or the request is for
> a flexible mechanism that allows the creation of one-off nominal types for
> very specific use-cases?
>
> —
> Felipe
>
> On Thu, 11 Apr 2024 at 05:06 Antoine Pitrou  wrote:
>
> >
> > Yes, JSON and UUID are obvious candidates for new canonical extension
> > types. XML also comes to mind, but I'm not sure there's much of a use
> > case for it.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 10/04/2024 à 22:55, Wes McKinney a écrit :
> > > In the past we have discussed adding a canonical type for UUID and JSON.
> > I
> > > still think this is a good idea and could improve ergonomics in
> > downstream
> > > language bindings (e.g. by exposing JSON querying function or
> > automatically
> > > boxing UUIDs in built-in UUID types, like the Python uuid library). Has
> > > anyone done any work on this to anyone's knowledge?
> > >
> > > On Wed, Apr 10, 2024 at 3:05 PM Micah Kornfield 
> > > wrote:
> > >
> > >> Hi Norman,
> > >> Arrow has a concept of extension types [1] along with the possibility of
> > >> proposing new canonical extension types [2].  This seems to cover the
> > >> use-cases you mention but I might be misunderstanding?
> > >>
> > >> Thanks,
> > >> Micah
> > >>
> > >> [1]
> > >>
> > >>
> > https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types
> > >> [2] https://arrow.apache.org/docs/format/CanonicalExtensions.html
> > >>
> > >> On Wed, Apr 10, 2024 at 11:44 AM Norman Jordan
> > >>  wrote:
> > >>
> > >>> Problem Description
> > >>>
> > >>> Currently Arrow schemas can only contain columns of types supported by
> > >>> Arrow. In some cases an Arrow schema maps to an external schema. This
> > can
> > >>> result in the Arrow schema not being able to support all the columns
> > from
> > >>> the external schema.
> > >>>
> > >>> Consider an external system that contains a column of type UUID. To
> > model
> > >>> the schema in Arrow, the user has two choices:
> > >>>
> > >>>1.  Do not include the UUID column in the Arrow schema
> > >>>
> > >>>2.  Map the column to an existing Arrow type. This will not include
> > the
> > >>> original type information. A UUID can be mapped to a FixedSizeBinary,
> > but
> > >>> consumers of the Arrow schema will be unable to distinguish a
> > >>> FixedSizeBinary field from a UUID field.
> > >>>
> > >>> Possible Solution
> > >>>
> > >>>*   Add a new type code that represents unsupported types
> > >>>
> > >>>*   Values for the new type are represented as variable length
> > binary
> > >>>
> > >>> Some drivers can expose data even when they don’t understand the data
> > >>> type. For example, the PostgreSQL driver will return the raw bytes for
> > >>> fields of an unknown type. Using an explicit type lets clients know
> > that
> > >>> they should convert values if they were able to determine the actual
> > data
> > >>> type.
> > >>>
> > >>> Questions
> > >>>
> > >>>*   What is the impact on existing clients when they encounter
> > fields
> > >> of
> > >>> the unsupported type?
> > >>>
> > >>>*   Is it safe to assume that all unsupported values can safely be
> > >>> converted to a variable length binary?
> > >>>
> > >>>*   How can we preserve information about the original type?
> > >>>
> > >>>
> > >>
> > >
> >


Re: Arrow community meeting April 10 at 16:00 UTC

2024-04-10 Thread Dewey Dunnington
Hi Ian,

I'll be attending and I'm happy to run the meeting.

Cheers!

-dewey

On Tue, Apr 9, 2024 at 9:41 PM Ian Cook  wrote:
>
> Our next biweekly Arrow community meeting is tomorrow at 16:00 UTC / 12:00
> EDT.
>
> I will not be able to attend tomorrow. Could someone please volunteer to
> lead the meeting and take notes in the Google Doc? The Zoom meeting will
> work as usual; it does not require a host to start it.
>
> Zoom meeting URL:
> https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
> Meeting ID: 876 4903 3008
> Passcode: 958092
>
> Meeting notes will be captured in this Google Doc:
> https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/
> If you plan to attend this meeting, you are welcome to edit the document to
> add the topics that you would like to discuss.
>
> Thanks,
> Ian


Re: [DISCUSS] Versioning and releases for apache/arrow components

2024-04-03 Thread Dewey Dunnington
Thank you Jacob for bringing this up! I am also in favor of decoupling
versions (provided that the release managers are also in favor of
this, since their time is required to implement this and because the
ongoing consequences of separate releases disproportionately affects
them).

The vote fatigue is, I think, partly due to the complexity of
releasing all of the components at the same time. Running the script
for ADBC, nanoarrow, Rust, and Julia is fairly straightforward
because those subprojects have a more limited scope. In contrast, I am
rarely successful running the Arrow verification script without
running into an error I don't understand and have become hesitant to
vote (or try) as a cumulative result of many releases worth of this
happening (and because R has never been a part of verification, which
is the component that I unofficially verify anyway). Voting on a batch
of version numbers seems like a good first step.

I am also not concerned about messaging of different versions of
different components. The fact that integration tests pass at the
moment of the release may be meaningful for those familiar with the
repo, but I don't think that many people are aware of which components
are tested in that way. As Weston noted, even for components that use
Arrow C++, the implementation of Arrow C++ features may lag behind or
be completely unrelated (Python being the exception).

On Fri, Mar 29, 2024 at 9:47 AM Weston Pace  wrote:
>
> Thank you for bringing this up.  I'm in favor of this.  I think there are
> several motivations but the main ones are:
>
>  1. Decoupling the versions will allow components to have no release, or
> only a minor release, when there are no breaking changes
>  2. We do have some vote fatigue, I think, and we don't want to make that
> worse.
>  3. Anything we can do to ease the burden of release managers is good
>
> If I understand what you are describing then I think it satisfies points 1
> & 2.  I am not familiar enough with the release management process to speak
> to #3.
>
> > Voting in one thread on
> > all components/a subset of components per voter and the surrounding
> > technicalities is something I would like to hear some opinions on.
>
> I am in favor of decoupling the version numbers.  I do think batched
> quarterly releases are still a good thing to avoid vote fatigue.  Perhaps
> we can have a single vote on a batch of version numbers (e.g. please vote
> on the batched release containing CPP version X, Go version Y, JS version
> Z).
>
> > A more meta question is about the messaging that different versioning
> > schemes carry, as it might no longer be obvious on first glance which
> > versions are compatible or have the newest features.
>
> I am not concerned about this.  One of the advantages of Arrow is that we
> have a stable C ABI (C Data Interface) and a stable IPC mechanism (IPC
> serialization) and this means that version compatibility is rarely a
> difficulty or major concern.  Plus, regarding individual features, our
> solution already requires a compatibility table (
> https://arrow.apache.org/docs/status.html).  Changing the versioning
> strategy will not make this any worse.
>
> On Thu, Mar 28, 2024 at 1:42 PM Jacob Wujciak  wrote:
>
> > Hello Everyone!
> >
> > I would like to resurface the discussion of separate
> > versioning/releases/voting for monorepo components. We have previously
> > touched on this topic mostly in the community meetings and spread across
> > multiple, only tangential related threads. I think a focused discussion can
> > be a bit more results oriented, especially now that we almost regularly
> > deviate from the quarterly release cadence with minor releases. My hope is
> > that discussing this and adapting our process can lower the amount of work
> > required and ease the pressure on our release managers (Thank you Raúl and
> > Kou!).
> >
> > I think the base of the topic is the separate versioning for components as
> > otherwise separate releases only have limited value. From a technical
> > perspective standalone implementations like Go or JS are the easiest to
> > handle in that regard, they can just follow their ecosystem standards,
> > which has been requested by users already (major releases in Go require
> > manual editing across a code base as dependencies are usually pinned to a
> > major version).
> >
> > For Arrow C++ bindings like Arrow R and PyArrow having distinct versions
> > would require additional work to both enable the use of different versions
> > and ensure version compatibility is monitored and potentially updated if
> > needed.
> >
> > For Arrow R we have already implemented these changes for different reasons
> > and have backwards compatibility with  libarrow >= 13.0.0. From a user
> > standpoint of PyArrow this is likely irrelevant as most users get binary
> > wheels from pypi, if a user regularly builds PyArrow from source they are
> > also capable of managing potentially different 

Re: [VOTE] Release Apache Arrow ADBC 0.11.0 - RC0

2024-03-28 Thread Dewey Dunnington
+1!

I ran:
export DOCKER_DEFAULT_PLATFORM=linux/amd64
USE_CONDA=1 dev/release/verify-release-candidate.sh 0.11.0 0

Matt - could you open an issue? The R package is not supposed to run
those tests unless some very specific environment variables are
defined in ~/.Renviron.

On Thu, Mar 28, 2024 at 3:27 PM Matt Topol  wrote:
>
> +1 (binding)
>
> Verified on PopOS! 22.04 amd64 using Conda with:
>
> USE_CONDA=1 ./dev/release/verify-release-candidate.sh 0.11.0 0
>
> Though there's one issue that i don't think should block the release:
>
> > Running the tests in ‘tests/testthat.R’ failed.
> > Last 13 lines of output:
> >  ── Error ('test-adbcsnowflake-package.R:28:3'): default options can open
> a database and execute a query ──
> >  
> >  Error in `force(code)`: IO: 390100 (08004): Incorrect username or
> password was specified.
> >  Backtrace:
> >  ▆
> >   1. ├─adbcdrivermanager::adbc_connection_init(db) at
> test-adbcsnowflake-package.R:28:3
> >   2. └─adbcsnowflake:::adbc_connection_init.adbcsnowflake_database(db)
> >   3.   └─adbcdrivermanager::adbc_connection_init_default(...)
> >   4. ├─adbcdrivermanager::with_adbc(...)
> >   5. │ └─base::force(code)
> >   6. └─adbcdrivermanager:::stop_for_error(status, error)
> >
> >  [ FAIL 1 | WARN 0 | SKIP 0 | PASS 2 ]
> >  Error: Test failures
> >  Execution halted
> > * DONE
>
> It looks like the R package needs to skip snowflake tests when the
> username/password aren't provided.
>
> On Thu, Mar 28, 2024 at 12:40 PM Dane Pitkin 
> wrote:
>
> > +1 (non-binding)
> >
> > Verified on MacOS 14.4.1 aarch64 with Conda using:
> >
> > DOCKER_DEFAULT_PLATFORM=linux/amd64 USE_CONDA=1
> > ./dev/release/verify-release-candidate.sh 0.11.0 0
> >
> > On Thu, Mar 28, 2024 at 11:07 AM David Li  wrote:
> >
> > > Hello,
> > >
> > > I would like to propose the following release candidate (RC0) of Apache
> > > Arrow ADBC version 0.11.0. This is a release consisting of 36 resolved
> > > GitHub issues [1].
> > >
> > > This release candidate is based on commit:
> > > 3cb5825bf551ae93d0e9ed2f64be226b569b27a7 [2]
> > >
> > > The source release rc0 is hosted at [3].
> > > The binary artifacts are hosted at [4][5][6][7][8].
> > > The changelog is located at [9].
> > >
> > > Please download, verify checksums and signatures, run the unit tests, and
> > > vote on the release. See [10] for how to validate a release candidate.
> > >
> > > See also a verification result on GitHub Actions [11].
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 Release this as Apache Arrow ADBC 0.11.0
> > > [ ] +0
> > > [ ] -1 Do not release this as Apache Arrow ADBC 0.11.0 because...
> > >
> > > Note: to verify APT/YUM packages on macOS/AArch64, you must `export
> > > DOCKER_DEFAULT_PLATFORM=linux/amd64`. (Or skip this step by `export
> > > TEST_APT=0 TEST_YUM=0`.)
> > >
> > > [1]:
> > >
> > https://github.com/apache/arrow-adbc/issues?q=is%3Aissue+milestone%3A%22ADBC+Libraries+0.11.0%22+is%3Aclosed
> > > [2]:
> > >
> > https://github.com/apache/arrow-adbc/commit/3cb5825bf551ae93d0e9ed2f64be226b569b27a7
> > > [3]:
> > >
> > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-adbc-0.11.0-rc0/
> > > [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> > > [5]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> > > [6]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> > > [7]:
> > >
> > https://repository.apache.org/content/repositories/staging/org/apache/arrow/adbc/
> > > [8]:
> > >
> > https://github.com/apache/arrow-adbc/releases/tag/apache-arrow-adbc-0.11.0-rc0
> > > [9]:
> > >
> > https://github.com/apache/arrow-adbc/blob/apache-arrow-adbc-0.11.0-rc0/CHANGELOG.md
> > > [10]:
> > >
> > https://arrow.apache.org/adbc/main/development/releasing.html#how-to-verify-release-candidates
> > > [11]: https://github.com/apache/arrow-adbc/actions/runs/8468352632
> > >
> >


Re: [ANNOUNCE] New Arrow committer: Bryce Mecum

2024-03-19 Thread Dewey Dunnington
Congratulations Bryce! And thank you!

On Mon, Mar 18, 2024 at 2:16 PM Wes McKinney  wrote:
>
> Congrats!
>
> On Mon, Mar 18, 2024 at 12:15 PM James Duong
>  wrote:
>
> > Congratulations Bryce!
> >
> > From: Dane Pitkin 
> > Date: Monday, March 18, 2024 at 7:28 AM
> > To: dev@arrow.apache.org 
> > Subject: Re: [ANNOUNCE] New Arrow committer: Bryce Mecum
> > Congratulations, Bryce!!
> >
> > On Mon, Mar 18, 2024 at 9:18 AM David Li  wrote:
> >
> > > Congrats Bryce!
> > >
> > > On Mon, Mar 18, 2024, at 08:52, Ian Cook wrote:
> > > > Congratulations Bryce!
> > > >
> > > > Ian
> > > >
> > > > On Sun, Mar 17, 2024 at 22:24 Nic Crane  wrote:
> > > >
> > > >> On behalf of the Arrow PMC, I'm happy to announce that Bryce Mecum has
> > > >> accepted an invitation to become a committer on Apache Arrow. Welcome,
> > > and
> > > >> thank you for your contributions!
> > > >>
> > > >> Nic
> > > >>
> > >
> >


Re: [VOTE] Move Arrow DataFusion Subproject to new Top Level Apache Project

2024-03-02 Thread Dewey Dunnington
+1 (binding)

On Sat, Mar 2, 2024 at 8:08 AM vin jake  wrote:
>
> +1 (binding)
>
> On Fri, Mar 1, 2024 at 7:33 PM Andrew Lamb  wrote:
>
> > Hello,
> >
> > As we have discussed[1][2] I would like to vote on the proposal to
> > create a new Apache Top Level Project for DataFusion. The text of the
> > proposed resolution and background document is copy/pasted below
> >
> > If the community is in favor of this, we plan to submit the resolution
> > to the ASF board for approval with the next Arrow report (for the
> > April 2024 board meeting).
> >
> > The vote will be open for at least 7 days.
> >
> > [ ] +1 Accept this Proposal
> > [ ] +0
> > [ ] -1 Do not accept this proposal because...
> >
> > Andrew
> >
> > [1] https://lists.apache.org/thread/c150t1s1x0kcb3r03cjyx31kqs5oc341
> > [2] https://github.com/apache/arrow-datafusion/discussions/6475
> >
> > -- Proposed Resolution -
> >
> > Resolution to Create the Apache DataFusion Project from the Apache
> > Arrow DataFusion Sub Project
> >
> > =
> >
> > X. Establish the Apache DataFusion Project
> >
> > WHEREAS, the Board of Directors deems it to be in the best
> > interests of the Foundation and consistent with the
> > Foundation's purpose to establish a Project Management
> > Committee charged with the creation and maintenance of
> > open-source software related to an extensible query engine
> > for distribution at no charge to the public.
> >
> > NOW, THEREFORE, BE IT RESOLVED, that a Project Management
> > Committee (PMC), to be known as the "Apache DataFusion Project",
> > be and hereby is established pursuant to Bylaws of the
> > Foundation; and be it further
> >
> > RESOLVED, that the Apache DataFusion Project be and hereby is
> > responsible for the creation and maintenance of software
> > related to an extensible query engine; and be it further
> >
> > RESOLVED, that the office of "Vice President, Apache DataFusion" be
> > and hereby is created, the person holding such office to
> > serve at the direction of the Board of Directors as the chair
> > of the Apache DataFusion Project, and to have primary responsibility
> > for management of the projects within the scope of
> > responsibility of the Apache DataFusion Project; and be it further
> >
> > RESOLVED, that the persons listed immediately below be and
> > hereby are appointed to serve as the initial members of the
> > Apache DataFusion Project:
> >
> > * Andy Grove (agr...@apache.org)
> > * Andrew Lamb (al...@apache.org)
> > * Daniël Heres (dhe...@apache.org)
> > * Jie Wen (jake...@apache.org)
> > * Kun Liu (liu...@apache.org)
> > * Liang-Chi Hsieh (vii...@apache.org)
> > * Qingping Hou: (ho...@apache.org)
> > * Wes McKinney(w...@apache.org)
> > * Will Jones (wjones...@apache.org)
> >
> > RESOLVED, that the Apache DataFusion Project be and hereby
> > is tasked with the migration and rationalization of the Apache
> > Arrow DataFusion sub-project; and be it further
> >
> > RESOLVED, that all responsibilities pertaining to the Apache
> > Arrow DataFusion sub-project encumbered upon the
> > Apache Arrow Project are hereafter discharged.
> >
> > NOW, THEREFORE, BE IT FURTHER RESOLVED, that Andrew Lamb
> > be appointed to the office of Vice President, Apache DataFusion, to
> > serve in accordance with and subject to the direction of the
> > Board of Directors and the Bylaws of the Foundation until
> > death, resignation, retirement, removal or disqualification,
> > or until a successor is appointed.
> > =
> >
> >
> > ---
> >
> >
> > Summary:
> >
> > We propose creating a new top level project, Apache DataFusion, from
> > an existing sub project of Apache Arrow to facilitate additional
> > community and project growth.
> >
> > Abstract
> >
> > Apache Arrow DataFusion[1]  is a very fast, extensible query engine
> > for building high-quality data-centric systems in Rust, using the
> > Apache Arrow in-memory format. DataFusion offers SQL and Dataframe
> > APIs, excellent performance, built-in support for CSV, Parquet, JSON,
> > and Avro, extensive customization, and a great community.
> >
> > [1] https://arrow.apache.org/datafusion/
> >
> >
> > Proposal
> >
> > We propose creating a new top level ASF project, Apache DataFusion,
> > governed initially by a subset of the Apache Arrow project’s PMC and
> > committers. The project’s code is in five existing git repositories,
> > currently governed by Apache Arrow which would transfer to the new top
> > level project.
> >
> > Background
> >
> > When DataFusion was initially donated to the Arrow project, it did not
> > have a strong enough community to stand on its own. It has since grown
> > significantly, and benefited immensely from being part of Arrow and
> > nurturing of the Apache Way, and now has a community strong enough to
> > stand on its own and that would benefit from focused governance
> > attention.
> >
> > The 

Re: [RESULT][VOTE] Release Apache Arrow ADBC 0.10.0 - RC1

2024-02-23 Thread Dewey Dunnington
[x] All the driver packages currently on CRAN have been updated!
(adbcdrivermanager, adbcpostgresql, adbcsqlite)

On Thu, Feb 22, 2024 at 10:44 AM David Li  wrote:
>
> [x] Close the GitHub milestone/project
> [x] Add the new release to the Apache Reporter System
> [x] Upload source release artifacts to Subversion
> [x] Create the final GitHub release
> [x] Update website
> [x] Upload wheels/sdist to PyPI
> [x] Publish Maven packages
> [x] Update tags for Go modules
> [x] Deploy APT/Yum repositories
> [x] Upload Ruby packages to RubyGems
> [IN PROGRESS] Update conda-forge packages [1]
> [x] Announce the new release
> [x] Remove old artifacts
> [IN PROGRESS] Bump versions [2]
> [IN PROGRESS] Publish release blog post [3]
>
> [1]: https://github.com/conda-forge/arrow-adbc-split-feedstock/pull/21
> [2]: https://github.com/apache/arrow-adbc/pull/1560
> [3]: https://github.com/apache/arrow-site/pull/477
>
> On Thu, Feb 22, 2024, at 09:12, David Li wrote:
> > The vote passes with 4 binding, 2 non-binding +1 votes.
> >
> > I'll take care of the release tasks.
> >
> > On Wed, Feb 21, 2024, at 19:02, Dane Pitkin wrote:
> >> +1 (non-binding)
> >>
> >> Verified on Mac M1 using conda.
> >>
> >> On Tue, Feb 20, 2024 at 11:27 PM Dewey Dunnington
> >>  wrote:
> >>
> >>> +1!
> >>>
> >>> I ran USE_CONDA=1 dev/release/verify-release-candidate.sh 0.10.0 1 on
> >>> MacOS Sonoma (M1).
> >>>
> >>> On Tue, Feb 20, 2024 at 9:43 AM Jean-Baptiste Onofré 
> >>> wrote:
> >>> >
> >>> > +1 (non binding)
> >>> >
> >>> > I quickly tested on MacOS arm64.
> >>> >
> >>> > Regards
> >>> > JB
> >>> >
> >>> > On Sun, Feb 18, 2024 at 9:47 PM David Li  wrote:
> >>> > >
> >>> > > Hello,
> >>> > >
> >>> > > I would like to propose the following release candidate (RC1) of
> >>> Apache Arrow ADBC version 0.10.0. This is a release consisting of 30
> >>> resolved GitHub issues [1].
> >>> > >
> >>> > > This release candidate is based on commit:
> >>> 9a8e44cc62f23a68ffc0d3d4c7362214b221bea0 [2]
> >>> > >
> >>> > > The source release rc1 is hosted at [3].
> >>> > > The binary artifacts are hosted at [4][5][6][7][8].
> >>> > > The changelog is located at [9].
> >>> > >
> >>> > > Please download, verify checksums and signatures, run the unit tests,
> >>> and vote on the release. See [10] for how to validate a release candidate.
> >>> > >
> >>> > > See also a verification result on GitHub Actions [11].
> >>> > >
> >>> > > The vote will be open for at least 72 hours.
> >>> > >
> >>> > > [ ] +1 Release this as Apache Arrow ADBC 0.10.0
> >>> > > [ ] +0
> >>> > > [ ] -1 Do not release this as Apache Arrow ADBC 0.10.0 because...
> >>> > >
> >>> > > Note: to verify APT/YUM packages on macOS/AArch64, you must `export
> >>> DOCKER_DEFAULT_PLATFORM=linux/amd64`. (Or skip this step by `export
> >>> TEST_APT=0 TEST_YUM=0`.)
> >>> > >
> >>> > > [1]:
> >>> https://github.com/apache/arrow-adbc/issues?q=is%3Aissue+milestone%3A%22ADBC+Libraries+0.10.0%22+is%3Aclosed
> >>> > > [2]:
> >>> https://github.com/apache/arrow-adbc/commit/9a8e44cc62f23a68ffc0d3d4c7362214b221bea0
> >>> > > [3]:
> >>> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-adbc-0.10.0-rc1/
> >>> > > [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> >>> > > [5]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> >>> > > [6]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> >>> > > [7]:
> >>> https://repository.apache.org/content/repositories/staging/org/apache/arrow/adbc/
> >>> > > [8]:
> >>> https://github.com/apache/arrow-adbc/releases/tag/apache-arrow-adbc-0.10.0-rc1
> >>> > > [9]:
> >>> https://github.com/apache/arrow-adbc/blob/apache-arrow-adbc-0.10.0-rc1/CHANGELOG.md
> >>> > > [10]:
> >>> https://arrow.apache.org/adbc/main/development/releasing.html#how-to-verify-release-candidates
> >>> > > [11]: https://github.com/apache/arrow-adbc/actions/runs/7951302316
> >>>


Re: [VOTE] Release Apache Arrow ADBC 0.10.0 - RC1

2024-02-20 Thread Dewey Dunnington
+1!

I ran USE_CONDA=1 dev/release/verify-release-candidate.sh 0.10.0 1 on
MacOS Sonoma (M1).

On Tue, Feb 20, 2024 at 9:43 AM Jean-Baptiste Onofré  wrote:
>
> +1 (non binding)
>
> I quickly tested on MacOS arm64.
>
> Regards
> JB
>
> On Sun, Feb 18, 2024 at 9:47 PM David Li  wrote:
> >
> > Hello,
> >
> > I would like to propose the following release candidate (RC1) of Apache 
> > Arrow ADBC version 0.10.0. This is a release consisting of 30 resolved 
> > GitHub issues [1].
> >
> > This release candidate is based on commit: 
> > 9a8e44cc62f23a68ffc0d3d4c7362214b221bea0 [2]
> >
> > The source release rc1 is hosted at [3].
> > The binary artifacts are hosted at [4][5][6][7][8].
> > The changelog is located at [9].
> >
> > Please download, verify checksums and signatures, run the unit tests, and 
> > vote on the release. See [10] for how to validate a release candidate.
> >
> > See also a verification result on GitHub Actions [11].
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow ADBC 0.10.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow ADBC 0.10.0 because...
> >
> > Note: to verify APT/YUM packages on macOS/AArch64, you must `export 
> > DOCKER_DEFAULT_PLATFORM=linux/amd64`. (Or skip this step by `export 
> > TEST_APT=0 TEST_YUM=0`.)
> >
> > [1]: 
> > https://github.com/apache/arrow-adbc/issues?q=is%3Aissue+milestone%3A%22ADBC+Libraries+0.10.0%22+is%3Aclosed
> > [2]: 
> > https://github.com/apache/arrow-adbc/commit/9a8e44cc62f23a68ffc0d3d4c7362214b221bea0
> > [3]: 
> > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-adbc-0.10.0-rc1/
> > [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> > [5]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> > [6]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> > [7]: 
> > https://repository.apache.org/content/repositories/staging/org/apache/arrow/adbc/
> > [8]: 
> > https://github.com/apache/arrow-adbc/releases/tag/apache-arrow-adbc-0.10.0-rc1
> > [9]: 
> > https://github.com/apache/arrow-adbc/blob/apache-arrow-adbc-0.10.0-rc1/CHANGELOG.md
> > [10]: 
> > https://arrow.apache.org/adbc/main/development/releasing.html#how-to-verify-release-candidates
> > [11]: https://github.com/apache/arrow-adbc/actions/runs/7951302316


Re: [ANNOUNCE] Apache Arrow nanoarrow 0.4.0 Released

2024-02-16 Thread Dewey Dunnington
Thanks for the suggestion! I opened up a PR to update that language [1].

Cheers!

-dewey

[1] https://github.com/apache/arrow-nanoarrow/pull/389

On Mon, Feb 12, 2024 at 2:57 PM Antoine Pitrou  wrote:
>
>
> Hi Dewey,
>
> On 12/02/2024 at 15:01, Dewey Dunnington wrote:
> > Apache Arrow nanoarrow is a small C library for building and
> > interpreting Arrow C Data interface structures with bindings for users
> > of the R programming language.
>
> Do you want to reconsider this sentence? It seems nanoarrow is starting
> to be more versatile now.
>
> Regards
>
> Antoine.


Re: [RESULT] Release Apache Arrow nanoarrow 0.4.0 - RC0

2024-02-12 Thread Dewey Dunnington
Apologies for the delay...these are all done now!

[x] Closed GitHub milestone
[x] Added release to the Apache Reporter System
[x] Uploaded artifacts to Subversion
[x] Created GitHub release
[x] Submit R package to CRAN
[x] Submit Python package to PyPI
[x] Update Python package on conda-forge
[x] Release blog post
[x] Sent announcement to annou...@apache.org
[x] Removed old artifacts from SVN
[x] Bumped versions on main

On Thu, Feb 1, 2024 at 3:21 PM Dewey Dunnington  wrote:
>
> With 4 binding +1 and 1 non-binding +1, the vote carries!
>
> If somebody is up for reviewing the release blog post [1] it would be
> much appreciated!
>
> I'll take care of the following release tasks:
>
> [x] Closed GitHub milestone
> [x] Added release to the Apache Reporter System
> [x] Uploaded artifacts to Subversion
> [x] Created GitHub release
> [ ] Submit R package to CRAN
> [ ] Submit Python package to PyPI
> [ ] Update Python package on conda-forge
> [ ] Release blog post [1] (will do after review)
> [ ] Sent announcement to annou...@apache.org (will do after blog post)
> [x] Removed old artifacts from SVN
> [x] Bumped versions on main
>
> [1] https://github.com/apache/arrow-site/pull/469
>
> On Wed, Jan 31, 2024 at 1:50 AM Sutou Kouhei  wrote:
> >
> > +1
> >
> > I ran the following command line on Debian GNU/Linux sid:
> >
> >   dev/release/verify-release-candidate.sh 0.4.0 0
> >
> > with:
> >
> >   * Apache Arrow C++ main
> >   * gcc (Debian 13.2.0-9) 13.2.0
> >   * R version 4.3.2 (2023-10-31) -- "Eye Holes"
> >   * Python 3.11.7
> >
> > Thanks,
> > --
> > kou
> >
> > In 
> >   "[VOTE] Release Apache Arrow nanoarrow 0.4.0 - RC0" on Mon, 29 Jan 2024 
> > 11:10:37 -0400,
> >   Dewey Dunnington  wrote:
> >
> > > Hello,
> > >
> > > I would like to propose the following release candidate (rc0) of
> > > Apache Arrow nanoarrow [0] version 0.4.0. This release consists of 46
> > > resolved GitHub issues from 5 contributors [1].
> > >
> > > This release candidate is based on commit:
> > > 3f83f4c48959f7a51053074672b7a330888385b1 [2]
> > >
> > > The source release rc0 is hosted at [3].
> > > The changelog is located at [4].
> > >
> > > Please download, verify checksums and signatures, run the unit tests,
> > > and vote on the release. See [5] for how to validate a release
> > > candidate. Note also a successful verification CI run at [6].
> > >
> > > This release contains experimental Python bindings to the nanoarrow C
> > > library. This vote is on the source tarball only; however, wheels have
> > > also been prepared and tested for convenience and are available from
> > > [7].
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 Release this as Apache Arrow nanoarrow 0.4.0
> > > [ ] +0
> > > [ ] -1 Do not release this as Apache Arrow nanoarrow 0.4.0 because...
> > >
> > > [0] https://github.com/apache/arrow-nanoarrow
> > > [1] https://github.com/apache/arrow-nanoarrow/milestone/4?closed=1
> > > [2] 
> > > https://github.com/apache/arrow-nanoarrow/tree/apache-arrow-nanoarrow-0.4.0-rc0
> > > [3] 
> > > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-nanoarrow-0.4.0-rc0/
> > > [4] 
> > > https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.4.0-rc0/CHANGELOG.md
> > > [5] 
> > > https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/README.md
> > > [6] https://github.com/apache/arrow-nanoarrow/actions/runs/7697719271
> > > [7] https://github.com/apache/arrow-nanoarrow/actions/runs/7697710625


[ANNOUNCE] Apache Arrow nanoarrow 0.4.0 Released

2024-02-12 Thread Dewey Dunnington
The Apache Arrow community is pleased to announce the 0.4.0 release of
Apache Arrow nanoarrow. This release covers 44 resolved issues
from 5 contributors [1].

The release is available now from [2], release notes are available at
[3], and a blog post documenting new contributions is available at
[4].

What is Apache Arrow?
-
Apache Arrow is a columnar in-memory analytics layer designed to
accelerate big data. It houses a set of canonical in-memory
representations of flat and hierarchical data along with multiple
language-bindings for structure manipulation. It also provides
low-overhead streaming and batch messaging, zero-copy interprocess
communication (IPC), and vectorized in-memory analytics libraries.
Languages currently supported include C, C++, C#, Go, Java,
JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust.

What is Apache Arrow nanoarrow?
--
Apache Arrow nanoarrow is a small C library for building and
interpreting Arrow C Data interface structures with bindings for users
of the R programming language. The vision of nanoarrow is that it
should be trivial for a library or application to implement an
Arrow-based interface. The library provides helpers to create types,
schemas, and metadata, an API for building arrays element-wise,
and an API for extracting elements element-wise from an array. For a
more detailed description of the features nanoarrow provides and
motivation for its development, see [5].

Please report any feedback to the mailing lists ([6], [7]).

Regards,
The Apache Arrow Community

[1]: 
https://github.com/apache/arrow-nanoarrow/issues?q=is%3Aissue+milestone%3A%22nanoarrow+0.4.0%22+is%3Aclosed
[2]: https://www.apache.org/dyn/closer.cgi/arrow/apache-arrow-nanoarrow-0.4.0
[3]: 
https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.4.0/CHANGELOG.md
[4]: https://arrow.apache.org/blog/2024/01/29/nanoarrow-0.4.0-release/
[5]: https://github.com/apache/arrow-nanoarrow
[6]: https://lists.apache.org/list.html?u...@arrow.apache.org
[7]: https://lists.apache.org/list.html?dev@arrow.apache.org


[RESULT] Release Apache Arrow nanoarrow 0.4.0 - RC0

2024-02-01 Thread Dewey Dunnington
With 4 binding +1 and 1 non-binding +1, the vote carries!

If somebody is up for reviewing the release blog post [1] it would be
much appreciated!

I'll take care of the following release tasks:

[x] Closed GitHub milestone
[x] Added release to the Apache Reporter System
[x] Uploaded artifacts to Subversion
[x] Created GitHub release
[ ] Submit R package to CRAN
[ ] Submit Python package to PyPI
[ ] Update Python package on conda-forge
[ ] Release blog post [1] (will do after review)
[ ] Sent announcement to annou...@apache.org (will do after blog post)
[x] Removed old artifacts from SVN
[x] Bumped versions on main

[1] https://github.com/apache/arrow-site/pull/469

On Wed, Jan 31, 2024 at 1:50 AM Sutou Kouhei  wrote:
>
> +1
>
> I ran the following command line on Debian GNU/Linux sid:
>
>   dev/release/verify-release-candidate.sh 0.4.0 0
>
> with:
>
>   * Apache Arrow C++ main
>   * gcc (Debian 13.2.0-9) 13.2.0
>   * R version 4.3.2 (2023-10-31) -- "Eye Holes"
>   * Python 3.11.7
>
> Thanks,
> --
> kou
>
> In 
>   "[VOTE] Release Apache Arrow nanoarrow 0.4.0 - RC0" on Mon, 29 Jan 2024 
> 11:10:37 -0400,
>   Dewey Dunnington  wrote:
>
> > Hello,
> >
> > I would like to propose the following release candidate (rc0) of
> > Apache Arrow nanoarrow [0] version 0.4.0. This release consists of 46
> > resolved GitHub issues from 5 contributors [1].
> >
> > This release candidate is based on commit:
> > 3f83f4c48959f7a51053074672b7a330888385b1 [2]
> >
> > The source release rc0 is hosted at [3].
> > The changelog is located at [4].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. See [5] for how to validate a release
> > candidate. Note also a successful verification CI run at [6].
> >
> > This release contains experimental Python bindings to the nanoarrow C
> > library. This vote is on the source tarball only; however, wheels have
> > also been prepared and tested for convenience and are available from
> > [7].
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow nanoarrow 0.4.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow nanoarrow 0.4.0 because...
> >
> > [0] https://github.com/apache/arrow-nanoarrow
> > [1] https://github.com/apache/arrow-nanoarrow/milestone/4?closed=1
> > [2] 
> > https://github.com/apache/arrow-nanoarrow/tree/apache-arrow-nanoarrow-0.4.0-rc0
> > [3] 
> > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-nanoarrow-0.4.0-rc0/
> > [4] 
> > https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.4.0-rc0/CHANGELOG.md
> > [5] 
> > https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/README.md
> > [6] https://github.com/apache/arrow-nanoarrow/actions/runs/7697719271
> > [7] https://github.com/apache/arrow-nanoarrow/actions/runs/7697710625


Re: [VOTE][Julia] Release Apache Arrow Julia 2.7.1 RC1

2024-01-31 Thread Dewey Dunnington
+1

Tested on MacOS Sonoma (aarch64). I ran

export PATH="/Applications/Julia-1.9.app/Contents/Resources/julia/bin:${PATH}" &&
dev/release/verify_rc.sh 2.7.1 1

On Wed, Jan 31, 2024 at 2:01 PM Jacob Quinn  wrote:
>
> +1, tested on macos.
>
> -Jacob
>
> On Wed, Jan 31, 2024 at 10:11 AM Ben Baumgold  wrote:
>
> > Hi,
> >
> > I would like to propose the following release candidate (RC1) of
> > Apache Arrow Julia version 2.7.1.
> >
> > This release candidate is based on commit:
> > ac199b0e377502ea0f1fa5ced7fda897a01b82a9 [1]
> >
> > The source release rc1 is hosted at [2].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. See [3] for how to validate a release candidate.
> >
> > The vote will be open for at least 24 hours.
> >
> > [ ] +1 Release this as Apache Arrow Julia 2.7.1
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow Julia 2.7.1 because...
> >
> > [1]:
> >
> > https://github.com/apache/arrow-julia/tree/ac199b0e377502ea0f1fa5ced7fda897a01b82a9
> > [2]:
> > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-julia-2.7.1-rc1/
> > [3]:
> >
> > https://github.com/apache/arrow-julia/blob/main/dev/release/README.md#verify
> >


Re: [DISCUSS] Status and future of @ApacheArrow Twitter account

2024-01-30 Thread Dewey Dunnington
I also find it a useful tool to follow other projects...there may be a
good replacement for it at some point but in the meantime I would love
to see releases + blog posts tweeted (or retweeted by) the official
account.

-dewey

On Tue, Jan 30, 2024 at 6:01 AM Raúl Cumplido  wrote:
>
> El lun, 29 ene 2024 a las 20:24, Felipe Oliveira Carvalho
> () escribió:
> >
> > > I have found Twitter an extremely effective way for an open-source
> > project to communicate with the “exo-community” — people who are interested
> > in the project but not so invested that they join the email list. An open
> > source project needs to perform pretty much all of the functions of a
> > for-profit company, and Twitter fulfills the marketing function. Clearly
> > Twitter is not what it used to be, but I don’t know what, if anything, has
> > replaced it.
> >
> > +1.
>
> I also agree with this view. I find it a very good communication tool
> for lots of users that are interested in the project but not that
> close.
>
> I am happy to tweet about the releases of Arrow, blog posts, etcetera
> on the official account.
>
> Raúl
>
> >
> > Unfortunately, all the alternatives to Twitter haven't reached the tipping
> > point yet. Saw some people trying to push tech content on Threads but the
> > audience is simply not there yet. @ApacheArrow has 12K followers on Twitter
> > making it a great tool for spreading news and getting people excited about
> > the project.
> >
> > --
> > Felipe
> >
> > On Mon, Jan 29, 2024 at 4:07 PM Julian Hyde  wrote:
> >
> > > The easiest thing is to share the Twitter credentials with any PMC member
> > > who is interested in sending tweets (which is usually a very small 
> > > number).
> > >
> > > To answer Antoine’s point. I have found Twitter an extremely effective way
> > > for an open-source project to communicate with the “exo-community” — 
> > > people
> > > who are interested in the project but not so invested that they join the
> > > email list. An open source project needs to perform pretty much all of the
> > > functions of a for-profit company, and Twitter fulfills the marketing
> > > function. Clearly Twitter is not what it used to be, but I don’t know 
> > > what,
> > > if anything, has replaced it.
> > >
> > > Julian
> > >
> > >
> > > > On Jan 29, 2024, at 10:50 AM, Wes McKinney  wrote:
> > > >
> > > > Is there a different tool other than TweetDeck available that can
> > > > synchronize posts that go out on different social channels (LinkedIn,
> > > > Twitter, Mastodon, etc.)? I've heard of things like Hootsuite but that's
> > > > pretty expensive and definitely overkill for an open source project, but
> > > > perhaps there is a more modest tool that would help with mirroring
> > > content
> > > > onto different platforms.
> > > >
> > > > On Sat, Jan 27, 2024 at 5:39 PM Antoine Pitrou 
> > > wrote:
> > > >
> > > >>
> > > >> My 2 cents : I don't understand what an open source project gains by
> > > >> publishing on a microblogging platform.
> > > >>
> > > >> As for Twitter specifically, its recent governance changes would be 
> > > >> good
> > > >> reason for terminating the @ApacheArrow account, IMHO.
> > > >>
> > > >> Regards
> > > >>
> > > >> Antoine.
> > > >>
> > > >>
> > > >> Le 27/01/2024 à 23:06, Bryce Mecum a écrit :
> > > >>> I noticed that the @ApacheArrow Twitter account [1] hasn't posted
> > > >>> since June 2023 which is around the time of the Arrow 12 release. When
> > > >>> I asked on Zulip [2] about who runs or has access to post as that
> > > >>> account, Kou indicated the account was managed using TweetDeck [3] and
> > > >>> that this may no longer be an option due to subscription changes.
> > > >>>
> > > >>> I'm writing to get a sense of who currently has access and how the
> > > >>> community would like to move forward with using the account. I'm also
> > > >>> volunteering to help manage it.
> > > >>>
> > > >>> My questions are:
> > > >>>
> > > >>> - Who has access to @ApacheArrow [1]?
> > > >>> - Is the community still interested in engaging on Twitter?
> > > >>> - Is the community interested in other platforms, potentially just
> > > >>> engaging with them through cross-posting?
> > > >>>
> > > >>> Thanks,
> > > >>> Bryce
> > > >>>
> > > >>> [1] https://twitter.com/ApacheArrow
> > > >>> [2]
> > > >>
> > > https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/ApacheArrow.20Twitter.20account/near/418346643
> > > >>> [3] https://en.wikipedia.org/wiki/Tweetdeck
> > > >>
> > >
> > >


Re: [VOTE] Release Apache Arrow nanoarrow 0.4.0 - RC0

2024-01-30 Thread Dewey Dunnington
+1

Verified on MacOS Ventura.

I also put together a draft release blog post with highlights included
in this release! https://github.com/apache/arrow-site/pull/469/files

On Tue, Jan 30, 2024 at 6:37 AM Raúl Cumplido  wrote:
>
> +1 (binding)
>
> Verified on Ubuntu 22.04.
>
> El mar, 30 ene 2024 a las 0:30, David Li () escribió:
> >
> > +1 (binding)
> >
> > Tested on Debian Linux 'bookworm'
> >
> > On Mon, Jan 29, 2024, at 10:45, Dane Pitkin wrote:
> > > +1 (non-binding)
> > >
> > > Verified on MacOS 14 using conda.
> > >
> > > On Mon, Jan 29, 2024 at 10:11 AM Dewey Dunnington
> > >  wrote:
> > >
> > >> Hello,
> > >>
> > >> I would like to propose the following release candidate (rc0) of
> > >> Apache Arrow nanoarrow [0] version 0.4.0. This release consists of 46
> > >> resolved GitHub issues from 5 contributors [1].
> > >>
> > >> This release candidate is based on commit:
> > >> 3f83f4c48959f7a51053074672b7a330888385b1 [2]
> > >>
> > >> The source release rc0 is hosted at [3].
> > >> The changelog is located at [4].
> > >>
> > >> Please download, verify checksums and signatures, run the unit tests,
> > >> and vote on the release. See [5] for how to validate a release
> > >> candidate. Note also a successful verification CI run at [6].
> > >>
> > >> This release contains experimental Python bindings to the nanoarrow C
> > >> library. This vote is on the source tarball only; however, wheels have
> > >> also been prepared and tested for convenience and are available from
> > >> [7].
> > >>
> > >> The vote will be open for at least 72 hours.
> > >>
> > >> [ ] +1 Release this as Apache Arrow nanoarrow 0.4.0
> > >> [ ] +0
> > >> [ ] -1 Do not release this as Apache Arrow nanoarrow 0.4.0 because...
> > >>
> > >> [0] https://github.com/apache/arrow-nanoarrow
> > >> [1] https://github.com/apache/arrow-nanoarrow/milestone/4?closed=1
> > >> [2]
> > >> https://github.com/apache/arrow-nanoarrow/tree/apache-arrow-nanoarrow-0.4.0-rc0
> > >> [3]
> > >> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-nanoarrow-0.4.0-rc0/
> > >> [4]
> > >> https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.4.0-rc0/CHANGELOG.md
> > >> [5]
> > >> https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/README.md
> > >> [6] https://github.com/apache/arrow-nanoarrow/actions/runs/7697719271
> > >> [7] https://github.com/apache/arrow-nanoarrow/actions/runs/7697710625
> > >>


[VOTE] Release Apache Arrow nanoarrow 0.4.0 - RC0

2024-01-29 Thread Dewey Dunnington
Hello,

I would like to propose the following release candidate (rc0) of
Apache Arrow nanoarrow [0] version 0.4.0. This release consists of 46
resolved GitHub issues from 5 contributors [1].

This release candidate is based on commit:
3f83f4c48959f7a51053074672b7a330888385b1 [2]

The source release rc0 is hosted at [3].
The changelog is located at [4].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. See [5] for how to validate a release
candidate. Note also a successful verification CI run at [6].

This release contains experimental Python bindings to the nanoarrow C
library. This vote is on the source tarball only; however, wheels have
also been prepared and tested for convenience and are available from
[7].

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow nanoarrow 0.4.0
[ ] +0
[ ] -1 Do not release this as Apache Arrow nanoarrow 0.4.0 because...

[0] https://github.com/apache/arrow-nanoarrow
[1] https://github.com/apache/arrow-nanoarrow/milestone/4?closed=1
[2] 
https://github.com/apache/arrow-nanoarrow/tree/apache-arrow-nanoarrow-0.4.0-rc0
[3] 
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-nanoarrow-0.4.0-rc0/
[4] 
https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.4.0-rc0/CHANGELOG.md
[5] https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/README.md
[6] https://github.com/apache/arrow-nanoarrow/actions/runs/7697719271
[7] https://github.com/apache/arrow-nanoarrow/actions/runs/7697710625


Re: [VOTE] Release Apache Arrow ADBC 0.9.0 - RC0

2024-01-04 Thread Dewey Dunnington
+1

I ran: export DOCKER_DEFAULT_PLATFORM=linux/amd64 && USE_CONDA=1
dev/release/verify-release-candidate.sh 0.9.0 0

...on MacOS M1 Ventura

On Thu, Jan 4, 2024 at 9:47 AM Jean-Baptiste Onofré  wrote:
>
> +1 (non binding)
>
> I checked:
> - LICENSE is OK, but it may be worth keeping only LICENSE.txt at the root
> level and moving license.tpl into the dev folder (I can propose a PR about that)
> - NOTICE.txt looks a bit short; we should mention Included Software /
> Used Software. I will do a more detailed pass. Definitely not a
> release blocker.
> - Java build is OK
> - ASF headers are present in source files, but some files don't have
> them (they are listed in the exclude list, which seems weird to me; I will
> do a pass as well). NB: dev/run-rat.sh needs some fixes and updates (I did
> several improvements for RAT 0.16).
> - Quickly tested on my blog samples (new blog post in preparation
> following 
> https://nanthrax.blogspot.com/2023/12/exposing-apache-karaf-configurations.html)
>
> Thanks !
> Regards
> JB
>
> On Wed, Jan 3, 2024 at 9:54 PM David Li  wrote:
> >
> > Hello,
> >
> > I would like to propose the following release candidate (RC0) of Apache 
> > Arrow ADBC version 0.9.0. This is a release consisting of 34 resolved 
> > GitHub issues [1].
> >
> > This release candidate is based on commit: 
> > 37a27717fb94fb84211f1b17486cc8f0be7df59c [2]
> >
> > The source release rc0 is hosted at [3].
> > The binary artifacts are hosted at [4][5][6][7][8].
> > The changelog is located at [9].
> >
> > Please download, verify checksums and signatures, run the unit tests, and 
> > vote on the release. See [10] for how to validate a release candidate.
> >
> > See also a verification result on GitHub Actions [11].
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow ADBC 0.9.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow ADBC 0.9.0 because...
> >
> > Note: to verify APT/YUM packages on macOS/AArch64, you must `export 
> > DOCKER_DEFAULT_PLATFORM=linux/amd64`. (Or skip this step by `export 
> > TEST_APT=0 TEST_YUM=0`.)
> >
> > [1]: 
> > https://github.com/apache/arrow-adbc/issues?q=is%3Aissue+milestone%3A%22ADBC+Libraries+0.9.0%22+is%3Aclosed
> > [2]: 
> > https://github.com/apache/arrow-adbc/commit/37a27717fb94fb84211f1b17486cc8f0be7df59c
> > [3]: 
> > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-adbc-0.9.0-rc0/
> > [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> > [5]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> > [6]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> > [7]: 
> > https://repository.apache.org/content/repositories/staging/org/apache/arrow/adbc/
> > [8]: 
> > https://github.com/apache/arrow-adbc/releases/tag/apache-arrow-adbc-0.9.0-rc0
> > [9]: 
> > https://github.com/apache/arrow-adbc/blob/apache-arrow-adbc-0.9.0-rc0/CHANGELOG.md
> > [10]: 
> > https://arrow.apache.org/adbc/main/development/releasing.html#how-to-verify-release-candidates
> > [11]: https://github.com/apache/arrow-adbc/actions/runs/7401997231


Re: [VOTE] Release Apache Arrow 14.0.2 - RC3

2023-12-15 Thread Dewey Dunnington
+1

I ran TEST_DEFAULT=0 TEST_CPP=1
dev/release/verify-release-candidate.sh 14.0.2 3 on MacOS M1. I do get
one failing test (gandiva-internals-test) but this has failed for me
for the last three versions.

Note that the R bindings will have to patch the static libraries we
host for convenience installation of the R package with
https://github.com/apache/arrow/pull/39186 because of the warning
flags that are set on the CRAN builder. I consider R a "packaging"
step, and because we (and other downstream maintainers) often apply
tweaks between versions as post-release steps, I think this is fine.
However, I did want to highlight that here in case anybody feels
strongly that adding two static_casts<>s into the R package warrants
another release candidate.

On Fri, Dec 15, 2023 at 1:36 AM wish maple  wrote:
>
> +1 (binding)
>
> Verified C++ and Python in my M1 MacOS
>
> Best,
> Xuwei Fu
>
> Jean-Baptiste Onofré  于2023年12月15日周五 00:19写道:
>
> > +1 (non binding)
> >
> > I checked:
> > - hash and signature are OK
> > - build is OK once submodules are added (see the discussion on
> > another thread)
> > - LICENSE and NOTICE look good (maybe worth updating copyright date)
> > - I checked RAT, and some files in the exclude should actually contain
> > ASF header. I will propose a PR to improve this, not a blocker for
> > release though.
> >
> > Thanks !
> > Regards
> > JB
> >
> > On Wed, Dec 13, 2023 at 10:32 PM Raúl Cumplido  wrote:
> > >
> > > Hi,
> > >
> > > I would like to propose the following release candidate (RC3) of Apache
> > > Arrow version 14.0.2. This is a release consisting of 30
> > > resolved GitHub issues[1].
> > >
> > > This release candidate is based on commit:
> > > 740889f413af9b1ae1d81eb1e5a4a9fb4ce9cf97 [2]
> > >
> > > The source release rc3 is hosted at [3].
> > > The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
> > > The changelog is located at [12].
> > >
> > > Please download, verify checksums and signatures, run the unit tests,
> > > and vote on the release. See [13] for how to validate a release
> > candidate.
> > >
> > > See also a verification result on GitHub pull request [14].
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 Release this as Apache Arrow 14.0.2
> > > [ ] +0
> > > [ ] -1 Do not release this as Apache Arrow 14.0.2 because...
> > >
> > > [1]:
> > https://github.com/apache/arrow/issues?q=is%3Aissue+milestone%3A14.0.2+is%3Aclosed
> > > [2]:
> > https://github.com/apache/arrow/tree/740889f413af9b1ae1d81eb1e5a4a9fb4ce9cf97
> > > [3]:
> > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-14.0.2-rc3
> > > [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> > > [5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
> > > [6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
> > > [7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> > > [8]: https://apache.jfrog.io/artifactory/arrow/java-rc/14.0.2-rc3
> > > [9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/14.0.2-rc3
> > > [10]: https://apache.jfrog.io/artifactory/arrow/python-rc/14.0.2-rc3
> > > [11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> > > [12]:
> > https://github.com/apache/arrow/blob/740889f413af9b1ae1d81eb1e5a4a9fb4ce9cf97/CHANGELOG.md
> > > [13]:
> > https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
> > > [14]: https://github.com/apache/arrow/pull/39193
> >


Re: [DISCUSS] Semantics of extension types

2023-12-15 Thread Dewey Dunnington
I also like these equivalence traits...in addition to being easy for
extension type authors to specify when registering an extension type
in Arrow C++, implementations that allow registration like pyarrow and
arrow/R would be able to specify them easily, whereas implementing
methods, compute functions, or overloads to handle it (e.g., like is
done in vctrs with vec_proxy_equal, which often just returns its
input) would have performance implications (since the methods might
have to be defined in R or Python).

It may also be worth adding a compute function for "force storage" (a
no-op for anything except an extension array), which is maybe safer
than a cast (which implies, I think, some logical equivalence between
the input and the result). That would let a user work around a
situation where the extension type author didn't handle a case that
the user expected to work.
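As a purely illustrative sketch (hypothetical names, not the actual Arrow C++ or pyarrow API), the trait-bitmask idea quoted below could reduce to a simple engine-side check before a kernel is dispatched to the storage array:

```python
# Hypothetical flags mirroring the draft quoted later in this thread;
# this is an illustration only, not a real Arrow API.
K_EQUALITY = 1    # equality testing and hashing
K_ORDERING = 2    # ordered comparisons
K_SELECTION = 4   # filter, take, etc.
K_ARITHMETIC = 8  # arithmetic kernels
K_CASTING = 16    # explicit casts

def can_use_storage_for(declared: int, required: int) -> bool:
    """Engine-side check: did the extension type opt into this operation?"""
    return (declared & required) == required

# A UUID-like type: equality/hash and selection are meaningful on the
# storage bytes, arithmetic is not.
uuid_traits = K_EQUALITY | K_SELECTION
```

A "force storage" function as described above would simply bypass this check and hand back the storage array unconditionally.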

Cheers!

-dewey

On Fri, Dec 15, 2023 at 3:13 AM Jin Shang  wrote:
>
> I'm in favor of Antoine's proposal of storage equivalence traits[1]. For
> the sake of clarity I'll paste it here:
>
> I would suggest we perhaps need a more general semantic description of
> > storage type equivalence.
> > Draft:
> > class ExtensionType {
> > public:
> > // Storage equivalence for equality testing and hashing
> > static constexpr uint32_t kEquality = 1;
> > // Storage equivalence for ordered comparisons
> > static constexpr uint32_t kOrdering = 2;
> > // Storage equivalence for selections (filter, take, etc.)
> > static constexpr uint32_t kSelection = 4;
> > // Storage equivalence for arithmetic
> > static constexpr uint32_t kArithmetic = 8;
> > // Storage equivalence for explicit casts
> > static constexpr uint32_t kCasting = 16;
> > // Storage equivalence for all operations
> > static constexpr uint32_t kAny = std::numeric_limits<uint32_t>::max();
> > // By default, an extension type can be implicitly handled as its storage
> > type
> > // for selections, equality testing and hashing.
> > virtual uint32_t storage_equivalence() const { return kEquality |
> > kSelection; }
> >
>
> I think this is well balanced between convenience and safety. The default
> option ensures the "normal" operations like take, group-by, unique... just
> work, and extension type authors can easily opt into additional functions.
>
> It also requires minimum engineering efforts. Each function only needs to
> specify what traits it requires, rather than the actual types.
>
> BTW I've checked every C++ compute function and I think the only traits
> missing here are one for string operations, and one for generation such as
> `random`.
>
> [1]  https://github.com/apache/arrow/pull/39200#issuecomment-1852307954
>
> Best,
> Jin
>
> On Thu, Dec 14, 2023 at 10:04 PM Weston Pace  wrote:
>
> > I agree engines can use their own strategy.  Requiring explicit casts is
> > probably ok as long as it is well documented but I think I lean slightly
> > towards implicitly falling back to the storage type.  I do think
> > people still shy away from extension types.  Adding the extension type to
> > an implicit cast registry is another hurdle to their use, albeit a small
> > one.
> >
> > Substrait has a similar consideration for extension types.  They can be
> > declared "inherits" (meaning the storage type can be used implicitly in
> > compute functions) or "separate" (meaning the storage type cannot be used
> > implicitly in compute functions).  This would map nicely to an Arrow
> > metadata field.
> >
> > Unfortunately, I think the truth is more nuanced than a simple
> > separate/inherits flag.  Take UUID for example (everyone's favorite fixed
> > size binary extension type).  We would definitely want to implicitly reuse
> > the hash, equality, and sorting functions.
> >
> > However, for other functions it gets trickier.  Imagine you have a
> > `replace_slice` function.  Should it return a new UUID (change some bytes
> > in a UUID and you have a new UUID) or not (once you start changing bytes in
> > a UUID you no longer have a UUID).  Or what if there was a `slice`
> > function.  This function should either be prohibited (you can't slice a
> > UUID) or should return a fixed size binary string (you can still slice it
> > but you no longer have a UUID).
> >
> > Given the complication I think users will always need to carefully consider
> > each use of an extension function no matter how smart a system is.  I'm not
> > convinced any metadata exists that could define the right approach in a
> > consistent number of cases.  This means our choice is whether we force
> > users to explicitly declare each such decision or we just trust that they
> > are doing the proper consideration when they design their plan.  I'm not
> > sure there is a right answer.  One can point to the vast diversity of ways
> > that programming languages have approached implicit vs explicit integer
> > casts.
> >
> > My last concern is that we rely on compute functions in operators other
> > than project/filter.  For example, to use a 

Re: [DISCUSS] Semantics of extension types

2023-12-13 Thread Dewey Dunnington
Thank you for opening the discussion here and opening it up!

I agree that attaching semantics as metadata and/or documenting them
in a central repository is an unreasonable burden to put on extension
type authors and Arrow implementations in general.

I also agree that operations other than filter/take/concatenate should
error by default: just because a storage type happens to be an
integer, it doesn't necessarily mean that arithmetic (for example) is
meaningful. (For example, an extension type implementing a bitpacked
uint64 such as an S2 cell or H3 index would result in an invalid value
for "plus one" or "times three").

For query engines and/or implementations with extensive compute
capability like Arrow C++, it is useful to be able to leverage those
for extension types: for the S2/H3 index example, it would be very
cool to allow a group_by + aggregate to "just work" (since ==/hash
*is* valid for this example), although I don't imagine it's a
development priority for anybody right now. I agree with Antoine that
implementations should be able to choose how/if extension type authors
can leverage other capabilities of the engine.

If this is pursued further, it might be worth checking out a
particularly successful extensible vector system implemented in R via
the vctrs package ( https://vctrs.r-lib.org/ ). "vector" class authors
can implement one or more S3 methods (i.e., traits):

- vec_proxy(x) (get me the storage array)
- vec_ptype2(type1, type2) (given two types, get me a type that I can
cast both to or error)
- vec_cast(x, type) (perform a lossless cast to type or error)
- vec_proxy_equal(x) (get me storage array where == does the right thing)
- vec_proxy_order(x) (get me a storage array that sorts in the correct order)
- vec_math(op, x) (perform unary math ops like sum)
- vec_arith(op, lhs, rhs) (perform binary math ops like +, -, etc.)
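A rough Python analogue of the vec_proxy_equal() pattern (hypothetical names; the real vctrs API is R S3 methods) might look like:

```python
# Hypothetical analogue of vctrs' vec_proxy_equal(): an extension
# "vector" exposes a storage proxy on which generic equality is defined.
class PercentVector:
    def __init__(self, values):
        self.values = values  # storage: plain floats

    def proxy_equal(self):
        # as noted above, this often "just returns its input"
        return self.values

def vec_equal(a, b):
    # generic elementwise equality, defined once against the proxy
    return [x == y for x, y in zip(a.proxy_equal(), b.proxy_equal())]
```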

Cheers!

-dewey

On Wed, Dec 13, 2023 at 12:39 PM Benjamin Kietzman  wrote:
>
> The main problem I see with adding properties to ExtensionType is I'm not
> sure where that information would reside. Allowing type authors to declare
> in which ways the type is equivalent (or not) to its storage is appealing,
> but it seems to need an official extension field like
> ARROW:extension:semantics. Otherwise I think each extension type's
> semantics would need to be maintained within every implementation as well
> as in a central reference (probably in Columnar.rst), which seems
> unreasonable to expect of extension type authors. I'm also skeptical that
> useful information could be packed into an ARROW:extension:semantics field;
> even if the type can declare that ordering-as-with-storage is invalid I
> don't think it'd be feasible to specify the correct ordering.
>
> If we cannot attach this information to extension types, the question
> becomes which defaults are most reasonable for engines and how can the
> engine most usefully be configured outside those defaults. My own
> preference would be to refuse operations other than selection or
> casting-to-storage, with a runtime extensible registry of allowed implicit
> casts. This will allow users of the engine to configure their extension
> types as they need, and the error message raised when an implicit
> cast-to-storage is not allowed could include the suggestion to register the
> implicit cast. For applications built against a specific engine, this
> approach would allow recovering much of the advantage of attaching
> properties to an ExtensionType by including registration of implicit casts
> in the ExtensionType's initialization.
>
> On Wed, Dec 13, 2023 at 10:46 AM Benjamin Kietzman 
> wrote:
>
> > Hello all,
> >
> > Recently, a PR to arrow c++ [1] was opened to allow implicit casting from
> > any extension type to its storage type in acero. This raises questions
> > about the validity of applying operations to an extension array's storage.
> > For example, some extension type authors may intend different ordering for
> > arrays of their new type than would be applied to the array's storage or
> > may not intend for the type to participate in arithmetic even though its
> > storage could.
> >
> > Suggestions/observations from discussion on that PR included:
> > - Extension types could provide general semantic description of storage
> > type equivalence [2], so that a flag on the extension type enables ordering
> > by storage but disables arithmetic on it
> > - Compute functions or kernels could be augmented with a filter declaring
> > which extension types are supported [3].
> > - Currently arrow-rs considers extension types metadata only [4], so all
> > kernels treat extension arrays equivalently to their storage.
> > - Currently arrow c++ only supports explicitly casting from an extension
> > type to its storage (and from storage to ext), so any operation can be
> > performed on an extension array's storage but it requires opting in.
> >
> > Sincerely,
> > Ben Kietzman
> >
> > [1] 

Re: [VOTE][Julia] Release Apache Arrow Julia 2.7.0 RC1

2023-12-08 Thread Dewey Dunnington
+1

I ran

export PATH="/Applications/Julia-1.9.app/Contents/Resources/julia/bin:$PATH"
dev/release/verify_rc.sh 2.7.0 1

...on MacOS M1 Ventura

On Tue, Dec 5, 2023 at 4:38 PM Sutou Kouhei  wrote:
>
> Hi,
>
> I would like to propose the following release candidate (RC1) of
> Apache Arrow Julia version 2.7.0.
>
> This release candidate is based on commit:
> 37122911c24f44318e6d4a0840408adb3364cf2a [1]
>
> The source release rc1 is hosted at [2].
>
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [3] for how to validate a release candidate.
>
> The vote will be open for at least 24 hours.
>
> [ ] +1 Release this as Apache Arrow Julia 2.7.0
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow Julia 2.7.0 because...
>
> [1]: 
> https://github.com/apache/arrow-julia/tree/37122911c24f44318e6d4a0840408adb3364cf2a
> [2]: 
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-julia-2.7.0-rc1/
> [3]: 
> https://github.com/apache/arrow-julia/blob/main/dev/release/README.md#verify
>
>
> Thanks,
> --
> kou


Re: [ANNOUNCE] New Arrow committer: Felipe Oliveira Carvalho

2023-12-07 Thread Dewey Dunnington
Congrats!

On Thu, Dec 7, 2023 at 4:28 PM Andrew Lamb  wrote:
>
> Congratulations!
>
> On Thu, Dec 7, 2023 at 3:09 PM Kevin Gurney 
> wrote:
>
> > Congratulations, Felipe!
> > 
> > From: Daniël Heres 
> > Sent: Thursday, December 7, 2023 2:59 PM
> > To: dev@arrow.apache.org 
> > Subject: Re: [ANNOUNCE] New Arrow committer: Felipe Oliveira Carvalho
> >
> > Congrats!
> >
> > Op do 7 dec 2023 om 20:52 schreef Ben Harkins  > >:
> >
> > > Congrats, Felipe!
> > >
> > > On Thu, Dec 7, 2023 at 2:00 PM Vibhatha Abeykoon 
> > > wrote:
> > >
> > > > Congratulations Felipe.
> > > >
> > > > Vibhatha Abeykoon
> > > >
> > > >
> > > > On Fri, Dec 8, 2023 at 12:25 AM David Li  wrote:
> > > >
> > > > > Congrats Felipe!
> > > > >
> > > > > On Thu, Dec 7, 2023, at 13:02, Raúl Cumplido wrote:
> > > > > > Congratulations Felipe!
> > > > > >
> > > > > > El jue, 7 dic 2023, 18:02, Dane Pitkin
> >  > > >
> > > > > > escribió:
> > > > > >
> > > > > >> Congrats, Felipe!
> > > > > >>
> > > > > >> On Thu, Dec 7, 2023 at 11:41 AM hsseo0501 
> > > > wrote:
> > > > > >>
> > > > > >> > Congrats, Felipe :)
> > > > > >> >
> > > > > >> > Sent from my Galaxy
> > > > > >> >
> > > > > >> > -------- Original email --------
> > > > > >> > From: Ian Cook
> > > > > >> > Date: 2023-12-08 1:24 AM (GMT+09:00)
> > > > > >> > To: dev@arrow.apache.org
> > > > > >> > Subject: Re: [ANNOUNCE] New Arrow committer: Felipe Oliveira Carvalho
> > > > > >> >
> > > > > >> > Congratulations Felipe!!!
> > > > > >> >
> > > > > >> > On Thu, Dec 7, 2023 at 10:43 AM Benjamin Kietzman wrote:
> > > > > >> > > On behalf of the Arrow PMC, I'm happy to announce that Felipe
> > > > > >> > > Oliveira Carvalho has accepted an invitation to become a committer
> > > > > >> > > on Apache Arrow. Welcome, and thank you for your contributions!
> > > > > >> > >
> > > > > >> > > Ben Kietzman
> > > > >
> > > >
> > >
> >
> >
> > --
> > Daniël Heres
> >


Re: [DISCUSS] Protocol for exchanging Arrow data over REST APIs

2023-11-21 Thread Dewey Dunnington
I also think a set of best practices for Arrow over HTTP would be a
valuable resource for the community...even if it never becomes a
specification of its own, it will be beneficial for API developers and
consumers of those APIs to have a place to look to understand how
Arrow can help improve throughput/latency/maybe other things. Possibly
something like httpbin.org but for requests/responses that use Arrow
would be helpful as well. Thank you Ian for leading this effort!

It has mostly been covered already, but in the (ubiquitous) situation
where a response contains some schema/table and some non-schema/table
information there is some tension between throughput (best served by a
JSON response plus one or more IPC stream responses) and latency (best
served by a single HTTP response? JSON? IPC with metadata/header?). In
addition to Antoine's list, I would add:

- How to serve the same table in multiple requests (e.g., to saturate
a network connection, or because separate worker nodes are generating
results anyway).
- How to inline a small schema/table into a single request with other
metadata (I have seen this done as base64-encoded IPC in JSON, but
perhaps there is a better way)

If anybody is interested in experimenting, I repurposed a previous
experiment I had as a flask app that can stream IPC to a client:
https://github.com/paleolimbot/2023-11-21_arrow-over-http-scratchpad/pull/1/files
.

> - recommendations about compression

Just a note that there is also Content-Encoding: gzip (for consumers
like Arrow JS that don't currently support buffer compression but that
can leverage the facilities of the browser/http library)

Cheers!

-dewey


On Mon, Nov 20, 2023 at 8:30 PM Sutou Kouhei  wrote:
>
> Hi,
>
> > But how is the performance?
>
> It's faster than the original JSON based API.
>
> I implemented Apache Arrow support for a C# client, so I
> measured only with the Apache Arrow C# client, but the Apache
> Arrow based API is faster than the JSON based API.
>
> > Have you measured the throughput of this approach to see
> > if it is comparable to using Flight SQL?
>
> Sorry. I didn't measure the throughput. In this case, the elapsed
> time of one request/response pair is more important than
> throughput. And it was faster than the JSON based API, with
> sufficient performance.
>
> I couldn't compare to a Flight SQL based approach because
> Groonga doesn't support Flight SQL yet.
>
> > Is this approach able to saturate a fast network
> > connection?
>
> I think that we can't measure this with the Groonga case
> because Groonga doesn't send data continuously. Here is one of
> its request patterns:
>
> 1. Groonga has log data partitioned by day
> 2. Groonga does full text search against one partition (2023-11-01)
> 3. Groonga sends the result to client as Apache Arrow
>streaming format record batches
> 4. Groonga does full text search against the next partition (2023-11-02)
> 5. Groonga sends the result to client as Apache Arrow
>streaming format record batches
> 6. ...
>
> In this case, the result data aren't being sent continuously
> (search -> send -> search -> send -> ...), so it doesn't saturate a
> fast network connection.
>
> (3. and 4. can be parallel but it's not implemented yet.)
>
> If we optimize this approach, it may be able to
> saturate a fast network connection.
>
> > And what about the case in which the server wants to begin sending batches
> > to the client before the total number of result batches / records is known?
>
> Ah, sorry. I forgot to explain the case. Groonga uses the
> above approach for it.
>
> > - server should not return the result data in the body of a response to a
> > query request; instead server should return a response body that gives
> > URI(s) at which clients can GET the result data
>
> If we want to do this, the standard "Location" HTTP headers
> may be suitable.
>
> > - transmit result data in chunks (Transfer-Encoding: chunked), with
> > recommendations about chunk size
>
> Ah, sorry. I forgot to explain this case too. Groonga uses
> "Transfer-Encoding: chunked". But the recommended chunk size may
> be case-by-case... If a server can produce data quickly enough,
> a larger chunk size may be faster. Otherwise, a larger chunk
> size may be slower.
>
> > - support range requests (Accept-Range: bytes) to allow clients to request
> > result ranges (or not?)
>
> In the Groonga case, it's not supported, because Groonga
> drops the result after one request/response pair. Groonga
> can't return just the specified range of the result after the
> response is returned.
>
> > - recommendations about compression
>
> In the case that network is the bottleneck, LZ4 or Zstandard
> compression will improve total performance.
>
> > - recommendations about TCP receive window size
> > - recommendation to open multiple TCP connections on very fast networks
> > (e.g. >25 Gbps) where a CPU thread could be the throughput bottleneck
>
> HTTP/3 may be better for these cases.
>
>
> Thanks,
> --
> kou
>
> In 
>   

Re: [ANNOUNCE] New Arrow PMC member: Raúl Cumplido

2023-11-13 Thread Dewey Dunnington
Congrats, Raùl!

On Mon, Nov 13, 2023 at 3:54 PM Dane Pitkin
 wrote:
>
> Congrats, Raul!
>
> On Mon, Nov 13, 2023 at 2:45 PM Kevin Gurney 
> wrote:
>
> > Congratulations, Raúl!
> >
> > 
> > From: Nic Crane 
> > Sent: Monday, November 13, 2023 2:31 PM
> > To: dev@arrow.apache.org 
> > Subject: Re: [ANNOUNCE] New Arrow PMC member: Raúl Cumplido
> >
> > Congrats Raul!
> >
> > On Tue, 14 Nov 2023, 03:28 Andrew Lamb,  wrote:
> >
> > > The Project Management Committee (PMC) for Apache Arrow has invited
> > > Raúl Cumplido  to become a PMC member and we are pleased to announce
> > > that  Raúl Cumplido has accepted.
> > >
> > > Please join me in congratulating them.
> > >
> > > Andrew
> > >
> >


Re: [DISCUSS][MATLAB] Proposal for incremental point releases of the MATLAB interface

2023-11-07 Thread Dewey Dunnington
For argument's sake, I might suggest that the process you described in
your initial note would probably work best in another repo: you would
be able to iterate faster and release/version at your own pace. The
flexibility you get from moving to a separate repo comes at the cost
of extra responsibility: you have to set up your own CI, manage your
own issues, and set up your own release verification scripts + release
votes on the mailing list. Because you bind Arrow C++, you would have
to take sufficient steps to ensure that the Arrow C++ developers are
made aware of changes that break the Matlab bindings and vice versa
(i.e., test against dev Arrow C++ in a CI job).

Setting up that infrastructure for apache/arrow-nanoarrow took ~a week
of development time, and it now takes ~half a day to release a new
version (it took more for the first few versions, and the MATLAB
interface has considerably higher complexity). Probably the biggest
barrier to releasing from another repo is that you have to ensure a
critical mass of PMC members can/will run your release verification
script and vote.

I happen to feel that it's the PMC's/wider community's responsibility
to help language binding contributors adopt a workflow that suits
their needs. If active Matlab contributors agree that they want to
release version 0.1 from another repo, (I feel that) we're here to
help you do that. If the active contributors want to stay in
apache/arrow, there is less flexibility about what you release and
when; however, the release process is well-defined.

On Tue, Nov 7, 2023 at 8:43 PM Sutou Kouhei  wrote:
>
> Hi,
>
> > As a point of reference, we noticed that PyArrow is on
> > version 14.0.0, but it feels "misleading" to say that the
> > MATLAB interface is at version 14.0.0 when we haven't yet
> > implemented or stabilized all core Arrow APIs.
>
> I can understand this but I suggest that we use the same
> version as other packages in apache/arrow. Because:
>
> * Using an isolated version increases release complexity.
> * Using an isolated version may introduce another kind of
>   confusion: for example, "the MATLAB interface 1.0.0 uses
>   Apache Arrow C++ 20.0.0" may be misleading:
>   * The MATLAB interface 1.0.0 doesn't use Apache Arrow C++
> 1.0.0.
>   * It may be difficult to find the corresponding
> Apache Arrow C++ version from the MATLAB interface
> version.
>
> Can we just mention "This is not stable yet!!!" in the
> documentation instead of using isolated version?
>
> We may want to use the status page for it:
> https://arrow.apache.org/docs/status.html
>
> > 1. Manually build the MATLAB interface on Windows, macOS, and Linux
>
> It's better that we use CI for this like other binary
> packages such as .deb/.rpm/.wheel/.jar/...
>
> If we release the MATLAB interface separately, which Apache
> Arrow C++ version is used? If we release the MATLAB
> interface right now, is Apache Arrow C++ 14.0.0 (the latest
> release) used, or is Apache Arrow C++ main (not released yet)
> used? Since the MATLAB interface on main will depend on Apache
> Arrow C++ main, we may not be able to use the latest release
> for the MATLAB interface on main.
>
> > 2. Combine all of the cross platform build artifacts into
> >a single MLTBX file [1] for distribution
>
> Does the MLTBX file include Apache Arrow C++ binaries too
> like .wheel/.jar?
>
> > 3. Host the MLTBX somewhere that is easily accessible for download
>
> MATLAB doesn't provide the official package repository such
> as PyPI for Python and https://rubygems.org/ for Ruby, right?
>
> > 1. Is there a recommended location where we can host the MLTBX file? e.g. 
> > GitHub Releases [2], JFrog [3], etc.?
>
> If the official package repository for MATLAB doesn't exist,
> JFrog is better because the MLTBX file will be large (Apache
> Arrow C++ binaries are large).
>
> > 2. Is there a recommended location for hosting release notes?
>
> How about creating https://arrow.apache.org/docs/matlab/ ?
> We can use Sphinx like the Python docs
> https://arrow.apache.org/docs/python/ or another
> documentation tools like the R docs
> https://arrow.apache.org/docs/r/ .
> If we use Sphinx, we can create
> https://github.com/apache/arrow/tree/main/docs/source/matlab/
> .
>
> > 3. Is there a recommended cadence for incremental point releases?
>
> I suggest avoiding separated release as above.
>
> > 4. Are there any notable ASF procedures [4] [5] (e.g. voting on a new 
> > release proposal) that we should be aware of as we consider creating an 
> > initial release?
>
> We don't need any additional tasks for an initial release.
>
> > 5. How should the Arrow project release (i.e. 14.0.0)
> >relate to the MATLAB interface version (i.e. 0.1)? As a
> >point of reference, we noticed that PyArrow is on
> >version 14.0.0, but it feels "misleading" to say that
> >the MATLAB interface is at version 14.0.0 when we
> >haven't yet implemented or stabilized all core Arrow
> >APIs. Is there 

Re: [VOTE] Release Apache Arrow ADBC 0.8.0 - RC0

2023-11-07 Thread Dewey Dunnington
+1!

I ran: TEST_APT=0 TEST_YUM=0 USE_CONDA=1
dev/release/verify-release-candidate.sh 0.8.0 0

On Fri, Nov 3, 2023 at 12:18 PM David Li  wrote:
>
> Hello,
>
> I would like to propose the following release candidate (RC0) of Apache Arrow 
> ADBC version 0.8.0. This is a release consisting of 42 resolved GitHub issues 
> [1].
>
> This release candidate is based on commit: 
> 95f13231f49494bcf78df45de1f65aa25620981b [2]
>
> The source release rc0 is hosted at [3].
> The binary artifacts are hosted at [4][5][6][7][8].
> The changelog is located at [9].
>
> Please download, verify checksums and signatures, run the unit tests, and 
> vote on the release. See [10] for how to validate a release candidate.
>
> See also a verification result on GitHub Actions [11].
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Release this as Apache Arrow ADBC 0.8.0
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow ADBC 0.8.0 because...
>
> Note: to verify APT/YUM packages on macOS/AArch64, you must `export 
> DOCKER_DEFAULT_PLATFORM=linux/amd64`. (Or skip this step by `export 
> TEST_APT=0 TEST_YUM=0`.)
>
> Note: it is not currently possible to verify with Conda and Python 3.12 (some 
> test dependencies do not yet have a Python 3.12 build available). The 
> verification script defaults to Python 3.11. Binary artifacts are available 
> for 3.12.
>
> [1]: 
> https://github.com/apache/arrow-adbc/issues?q=is%3Aissue+milestone%3A%22ADBC+Libraries+0.8.0%22+is%3Aclosed
> [2]: 
> https://github.com/apache/arrow-adbc/commit/95f13231f49494bcf78df45de1f65aa25620981b
> [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-adbc-0.8.0-rc0/
> [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> [5]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> [6]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> [7]: 
> https://repository.apache.org/content/repositories/staging/org/apache/arrow/adbc/
> [8]: 
> https://github.com/apache/arrow-adbc/releases/tag/apache-arrow-adbc-0.8.0-rc0
> [9]: 
> https://github.com/apache/arrow-adbc/blob/apache-arrow-adbc-0.8.0-rc0/CHANGELOG.md
> [10]: 
> https://arrow.apache.org/adbc/main/development/releasing.html#how-to-verify-release-candidates
> [11]: https://github.com/apache/arrow-adbc/actions/runs/6746653191


Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-29 Thread Dewey Dunnington
In the absence of a general solution to the C data interface omitting
buffer sizes, I think the original proposal is the best way
forward...this is the first type to be added whose buffer sizes cannot
be calculated without looping over every element of the array; the
buffer sizes are needed to efficiently serialize the imported array to
IPC if imported by a consumer that cares about buffer sizes.
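To illustrate why the proposed trailing sizes buffer helps: without it, recovering valid sizes for a Utf8View array's variadic data buffers requires scanning all of the 16-byte view structs. A rough pure-Python sketch of that scan, over a hand-built views buffer (this is an illustration of the columnar layout, not any library's API):

```python
import struct

# A toy Utf8View "views" buffer with two 16-byte elements:
# element 0: the string "hi" (length <= 12, so stored inline)
# element 1: a length-20 string stored in variadic buffer 0 at offset 5
views = struct.pack("<i12s", 2, b"hi".ljust(12, b"\x00"))
views += struct.pack("<i4sii", 20, b"long", 0, 5)

def referenced_sizes(views_buf, n_variadic):
    """Smallest valid size of each variadic data buffer: requires a
    full O(n) scan of every view element, which is the cost the
    extra sizes buffer in the C data interface would avoid."""
    sizes = [0] * n_variadic
    for i in range(0, len(views_buf), 16):
        (length,) = struct.unpack_from("<i", views_buf, i)
        if length > 12:  # non-inline view: read buffer index and offset
            buf_idx, offset = struct.unpack_from("<ii", views_buf, i + 8)
            sizes[buf_idx] = max(sizes[buf_idx], offset + length)
    return sizes

assert referenced_sizes(views, 1) == [25]  # offset 5 + length 20
```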

Using a schema's flags to indicate something about a specific paired
array (particularly one that, if misinterpreted, would lead to a
crash) is a precedent that is probably not worth introducing for just
one type. Currently a schema is completely independent of any
particular ArrowArray, and I think that is a feature that is worth
preserving. My gripes about not having buffer sizes on the CPU to more
efficiently copy between devices is a concept almost certainly better
suited to the ArrowDeviceArray struct.

On Fri, Oct 27, 2023 at 12:45 PM Benjamin Kietzman  wrote:
>
> > This begs the question of what happens if a consumer receives an unknown
> > flag value.
>
> It seems to me that ignoring unknown flags is the primary case to consider
> at
> this point, since consumers may ignore unknown flags. Since that is the
> case,
> it seems adding any flag which would break such a consumer would be
> tantamount to an ABI breakage. I don't think this can be averted unless all
> consumers are required to error out on unknown flag values.
>
> In the specific case of Utf8View it seems certain that consumers would add
> support for the buffer sizes flag simultaneously with adding support for the
> new type (since Utf8View is difficult to import otherwise), so any consumer
> which would error out on the new flag would already be erroring out on an
> unsupported data type.
>
> > I might be the only person who has implemented
> > a deep copy of an ArrowSchema in C, but it does blindly pass along a
> > schema's flag value
>
> I think passing a schema's flag value including unknown flags is an error.
> The ABI defines moving structures but does not define deep copying. I think
> in order to copy deeply in terms of operations which *are* specified: we
> import then export the schema. Since this includes an export step, it
> should not
> include flags which are not supported by the exporter.
>
> On Thu, Oct 26, 2023 at 6:40 PM Antoine Pitrou  wrote:
>
> >
> > Le 26/10/2023 à 20:02, Benjamin Kietzman a écrit :
> > >> Is this buffer lengths buffer only present if the array type is
> > Utf8View?
> > >
> > > IIUC, the proposal would add the buffer lengths buffer for all types if
> > the
> > > schema's
> > > flags include ARROW_FLAG_BUFFER_LENGTHS. I do find it appealing to avoid
> > > the special case and that `n_buffers` would continue to be consistent
> > with
> > > IPC.
> >
> > This begs the question of what happens if a consumer receives an unknown
> > flag value. We haven't specified that unknown flag values should be
> > ignored, so a consumer could judiciously choose to error out instead of
> > potentially misinterpreting the data.
> >
> > All in all, personally I'd rather we make a special case for Utf8View
> > instead of adding a flag that can lead to worse interoperability.
> >
> > Regards
> >
> > Antoine.
> >


Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Dewey Dunnington
> This begs the question of what happens if a consumer receives an unknown flag 
> value

That's a great point...I might be the only person who has implemented
a deep copy of an ArrowSchema in C, but it does blindly pass along a
schema's flag value (which in the scenario I proposed could lead to a
consumer accessing a pointer that didn't exist).

I do think there is utility in considering buffer sizes more
generically in the future...if it is apparently so essential that
every Arrow implementation implements them in this way, it seems like
an oversight to have producers constantly omitting buffer sizes and
consumers constantly recalculating them.

On Thu, Oct 26, 2023 at 4:35 PM Dewey Dunnington  wrote:
>
> I'm afraid I've derailed the discussion into solving a bigger problem
> than strictly necessary. I don't think this is the time to solve the
> general problem of the C data interface having no way to communicate
> buffer sizes, particularly since there's no immediate agreement on its
> utility or implementation, but perhaps it is possible to solve it in a
> way that does not preclude implementing it in some generic way in the
> future.
>
> I think Ben's initial proposal of incrementing n_buffers by one and
> appending an int64_t* pointing to the buffer sizes accomplishes that,
> so consider me a +1. It might perhaps be more general if it included
> all buffer sizes (not just variadic ones), but given that it would
> only be useful for a few other types I don't think that is a game
> changer.
>
> It is probably also worth noting whether we expect the buffer
> containing the sizes to live on the CPU device always or whether we
> want it to live on the same device as the data buffers.
>
> On Thu, Oct 26, 2023 at 4:34 PM Antoine Pitrou  wrote:
> >
> >
> > Le 26/10/2023 à 20:02, Benjamin Kietzman a écrit :
> > >> Is this buffer lengths buffer only present if the array type is Utf8View?
> > >
> > > IIUC, the proposal would add the buffer lengths buffer for all types if 
> > > the
> > > schema's
> > > flags include ARROW_FLAG_BUFFER_LENGTHS. I do find it appealing to avoid
> > > the special case and that `n_buffers` would continue to be consistent with
> > > IPC.
> >
> > This begs the question of what happens if a consumer receives an unknown
> > flag value. We haven't specified that unknown flag values should be
> > ignored, so a consumer could judiciously choose to error out instead of
> > potentially misinterpreting the data.
> >
> > All in all, personally I'd rather we make a special case for Utf8View
> > instead of adding a flag that can lead to worse interoperability.
> >
> > Regards
> >
> > Antoine.


Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Dewey Dunnington
I'm afraid I've derailed the discussion into solving a bigger problem
than strictly necessary. I don't think this is the time to solve the
general problem of the C data interface having no way to communicate
buffer sizes, particularly since there's no immediate agreement on its
utility or implementation, but perhaps it is possible to solve it in a
way that does not preclude implementing it in some generic way in the
future.

I think Ben's initial proposal of incrementing n_buffers by one and
appending an int64_t* pointing to the buffer sizes accomplishes that,
so consider me a +1. It might perhaps be more general if it included
all buffer sizes (not just variadic ones), but given that it would
only be useful for a few other types I don't think that is a game
changer.

It is probably also worth noting whether we expect the buffer
containing the sizes to live on the CPU device always or whether we
want it to live on the same device as the data buffers.

On Thu, Oct 26, 2023 at 4:34 PM Antoine Pitrou  wrote:
>
>
> Le 26/10/2023 à 20:02, Benjamin Kietzman a écrit :
> >> Is this buffer lengths buffer only present if the array type is Utf8View?
> >
> > IIUC, the proposal would add the buffer lengths buffer for all types if the
> > schema's
> > flags include ARROW_FLAG_BUFFER_LENGTHS. I do find it appealing to avoid
> > the special case and that `n_buffers` would continue to be consistent with
> > IPC.
>
> This begs the question of what happens if a consumer receives an unknown
> flag value. We haven't specified that unknown flag values should be
> ignored, so a consumer could judiciously choose to error out instead of
> potentially misinterpreting the data.
>
> All in all, personally I'd rather we make a special case for Utf8View
> instead of adding a flag that can lead to worse interoperability.
>
> Regards
>
> Antoine.


Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Dewey Dunnington
> I expect C code to not be much longer than this :-)

nanoarrow's buffer-length-calculation and validation concepts are
(perhaps inadvisably) intertwined...even with both it is not that much
code (perhaps I was remembering how much time it took me to figure out
which 35 lines to write :-))

> That sounds a bit hackish to me.

Including only *some* buffer sizes in array->buffers[array->n_buffers]
special-cased for only two types (or altering the number of buffers
required by the IPC format vs. the number of buffers required by the C
Data interface) seem equally hackish to me (not that I'm opposed to
either necessarily...the alternatives really are very bad).

> How can you *not* care about buffer sizes, if you for example need to send 
> the buffers over IPC?

I think IPC is the *only* operation that requires that information?
(Other than perhaps copying to another device?) I don't think there's
any barrier to accessing the content of all the array elements but I
could be mistaken.

On Thu, Oct 26, 2023 at 1:04 PM Antoine Pitrou  wrote:
>
>
> Le 26/10/2023 à 17:45, Dewey Dunnington a écrit :
> > The lack of buffer sizes is something that has come up for me a few
> > times working with nanoarrow (which dedicates a significant amount of
> > code to calculating buffer sizes, which it uses to do validation and
> > more efficient copying).
>
> By the way, this is a bit surprising since it's really 35 lines of code
> in C++ currently:
>
> https://github.com/apache/arrow/blob/57f643c2cecca729109daae18c7a64f3a37e76e4/cpp/src/arrow/c/bridge.cc#L1721-L1754
>
> I expect C code to not be much longer than this :-)
>
> Regards
>
> Antoine.


Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Dewey Dunnington
Ben kindly explained to me offline that the need for the buffer sizes
is because when Arrow C++ imports an Array it creates Buffer class
wrappers around the imported pointers. Arrow C++ does not have a
notion of a buffer of unknown size to my knowledge, which leaves two
undesirable alternatives: (1) loop over every string to calculate the
maximum referenced buffer size for each buffer or (2) overhaul the
Buffer class to allow unknown buffer sizes and suffer the
corresponding performance/support issues when doing something with the
array data that would otherwise be type-agnostic (e.g., converting to
IPC).

The lack of buffer sizes is something that has come up for me a few
times working with nanoarrow (which dedicates a significant amount of
code to calculating buffer sizes, which it uses to do validation and
more efficient copying). The most recent issue I have had was when
implementing the Arrow C Device Interface: for string and binary (+
the large counterparts) it is necessary to access the buffers to
calculate the sizes, which makes it difficult to write
generic/performant code copying an entire array between devices.

A potential alternative might be to allow any ArrowArray to declare
its buffer sizes in array->buffers[array->n_buffers], perhaps with a
new flag in schema->flags to advertise that capability. I'm happy to
defer that discussion to another time but if there is no opposition,
it might be cleaner to include sooner than later (because it does not
involve special-casing specific types).

> We might want to keep the variadic buffers at the end and instead export
> the buffer sizes as buffer #2? Though that's mostly stylistic...

I would prefer the buffer sizes to be after as it preserves the
connection between Columnar/IPC format and the C Data interface...the
need for buffer_sizes is more of a convenience for implementations
that care about this kind of thing than something inherent to the
array data.

Cheers!

-dewey

On Wed, Oct 25, 2023 at 1:47 PM Antoine Pitrou  wrote:
>
>
> Hello,
>
> We might want to keep the variadic buffers at the end and instead export
> the buffer sizes as buffer #2? Though that's mostly stylistic...
>
> Regards
>
> Antoine.
>
>
> Le 25/10/2023 à 18:36, Benjamin Kietzman a écrit :
> > Hello all,
> >
> > The C ABI does not store buffer lengths explicitly, which presents a
> > problem for Utf8View since buffer lengths are not trivially extractable
> > from other data in the array. A potential solution is to store the lengths
> > in an extra buffer after the variadic data buffers. I've adopted this
> > approach in my (currently draft) PR [1] to add c++ library import/export
> > for Utf8VIew, but I thought this warranted raising on the ML in case anyone
> > has a better idea.
> >
> > Sincerely,
> > Ben Kietzman
> >
> > [1]
> > https://github.com/bkietz/arrow/compare/37710-cxx-impl-string-view..36099-string-view-c-abi#diff-3907fc8e8c9fa4ed7268f6baa5b919e8677fb99947b7384a9f8f001174ab66eaR549
> >


Re: [VOTE][Format] C data interface format strings for Utf8View and BinaryView

2023-10-18 Thread Dewey Dunnington
+1!

On Wed, Oct 18, 2023 at 2:14 PM Matt Topol  wrote:
>
> +1
>
> On Wed, Oct 18, 2023 at 1:05 PM Antoine Pitrou  wrote:
>
> > +1
> >
> > Le 18/10/2023 à 19:02, Benjamin Kietzman a écrit :
> > > Hello all,
> > >
> > > I propose "vu" and "vz" as format strings for the Utf8View and
> > > BinaryView types in the Arrow C data interface [1].
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 - I'm in favor of these new C data format strings
> > > [ ] +0
> > > [ ] -1 - I'm against adding these new format strings because
> > >
> > > Ben Kietzman
> > >
> > > [1] https://arrow.apache.org/docs/format/CDataInterface.html
> > >
> >


Re: Apache Arrow file format

2023-10-18 Thread Dewey Dunnington
Plenty of opinions here already, but I happen to think that IPC
streams and/or Arrow File/Feather are wildly underutilized. For the
use-case where you're mostly just going to read an entire file into R
or Python, it's a bit faster (and far superior to CSV, pickling, or
.rds files in R).

> you're going to read all the columns for a record batch in the file, no 
> matter what

The metadata for every column in every record batch has to be
read, but there's nothing inherent about the format that prevents
selectively loading into memory only the required buffers. (I don't
know off the top of my head if any reader implementation actually does
this).

On Wed, Oct 18, 2023 at 12:02 AM wish maple  wrote:
>
> The Arrow IPC file is great; it focuses on in-memory representation and
> direct computation.
> Basically, it supports compression and dictionary encoding, and can
> zero-copy
> deserialize the file to the in-memory Arrow format.
>
> Parquet provides some strong functionality, like statistics, which can
> help prune
> unnecessary data during scanning and avoid CPU and IO cost. And it has
> highly efficient
> encoding, which can make a Parquet file smaller than the Arrow IPC file
> for the same
> data. However, some Arrow data types cannot currently be converted to
> corresponding Parquet types
> in the arrow-cpp implementation. You can refer to the Arrow
> documentation for details.
>
> Adam Lippai  于2023年10月18日周三 10:50写道:
>
> > Also there is
> > https://github.com/lancedb/lance between the two formats. Depending on the
> > use case it can be a great choice.
> >
> > Best regards
> > Adam Lippai
> >
> > On Tue, Oct 17, 2023 at 22:44 Matt Topol  wrote:
> >
> > > One benefit of the feather format (i.e. Arrow IPC file format) is the
> > > ability to mmap the file to easily handle reading sections of a larger
> > than
> > > memory file of data. Since, as Felipe mentioned, the format is focused on
> > > in-memory representation, you can easily and simply mmap the file and use
> > > the raw bytes directly. For a large file that you only want to read
> > > sections of, this can be beneficial for IO and memory usage.
> > >
> > > Unfortunately, you are correct that it doesn't allow for easy column
> > > projecting (you're going to read all the columns for a record batch in
> > the
> > > file, no matter what). So it's going to be a trade off based on your
> > needs
> > > as to whether it makes sense, or if you should use a file format like
> > > Parquet instead.
> > >
> > > -Matt
> > >
> > >
> > > On Tue, Oct 17, 2023, 10:31 PM Felipe Oliveira Carvalho <
> > > felipe...@gmail.com>
> > > wrote:
> > >
> > > > It’s not the best since the format is really focused on in- memory
> > > > representation and direct computation, but you can do it:
> > > >
> > > > https://arrow.apache.org/docs/python/feather.html
> > > >
> > > > —
> > > > Felipe
> > > >
> > > > On Tue, 17 Oct 2023 at 23:26 Nara 
> > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Is it a good idea to use Apache Arrow as a file format? Looks like
> > > > > projecting columns isn't available by default.
> > > > >
> > > > > One of the benefits of Parquet file format is column projection,
> > where
> > > > the
> > > > > IO is limited to just the columns projected.
> > > > >
> > > > > Regards ,
> > > > > Nara
> > > > >
> > > >
> > >
> >


Re: [ANNOUNCE] New Arrow PMC member: Jonathan Keane

2023-10-15 Thread Dewey Dunnington
Congrats, Jon!

On Sun, Oct 15, 2023 at 7:53 AM Nic Crane  wrote:
>
> Congrats Jon!
>
> On Sun, 15 Oct 2023, 05:52 Jacob Wujciak-Jens,
>  wrote:
>
> > Congratulations !
> >
> > Raúl Cumplido  schrieb am So., 15. Okt. 2023,
> > 00:58:
> >
> > > Congratulations Jon!
> > >
> > > El dom, 15 oct 2023, 0:05, Antoine Pitrou  escribió:
> > >
> > > >
> > > > Welcome to the PMC, Jon!
> > > >
> > > > Le 14/10/2023 à 19:42, David Li a écrit :
> > > > > Congrats Jon!
> > > > >
> > > > > On Sat, Oct 14, 2023, at 13:25, Ian Cook wrote:
> > > > >> Congratulations Jonathan!
> > > > >>
> > > > >> On Sat, Oct 14, 2023 at 13:24 Andrew Lamb 
> > > wrote:
> > > > >>
> > > > >>> The Project Management Committee (PMC) for Apache Arrow has invited
> > > > >>> Jonathan Keane to become a PMC member and we are pleased to
> > announce
> > > > >>> that Jonathan Keane has accepted.
> > > > >>>
> > > > >>> Congratulations and welcome!
> > > > >>>
> > > > >>> Andrew
> > > > >>>
> > > >
> > >
> >


Re: [DISCUSS][Swift] repo for swift similar to arrow-rs

2023-10-10 Thread Dewey Dunnington
Hi Alva,

I would encourage you to do whatever will make life more pleasant for
you and other contributors to the Swift Arrow implementation. I have
found development of an Arrow subproject (nanoarrow) in a separate
repository very pleasant. While I don't run integration tests there,
it's not because of any technical limitation (instead of pulling one
repo in your CI job, just pull two).

For the R bindings to Arrow, which do depend on the C++ bindings, we
do have some benefit because Arrow C++ changes that break R tend to
get fixed by the C++ contributor in their PR, rather than that
responsibility always falling on us. That said, it doesn't happen very
often, and we have informally toyed with the idea of moving out of the
monorepo to make it less intimidating for outside contributors.

Cheers,

-dewey

On Tue, Oct 10, 2023 at 2:33 PM Antoine Pitrou  wrote:
>
>
> Hi Alva,
>
> I'll let others give their opinions on the repo.
>
> Regards
>
> Antoine.
>
>
> Le 10/10/2023 à 19:25, Alva Bandy a écrit :
> > Hi Antoine,
> >
> > Thanks for the reply.
> >
> > It would be great to get the Swift implementation added to the integration 
> > test.  I have a task for adding the C Data Interface and I will work on 
> > getting the integration test running for Swift after that task.  Can we 
> > move forward with setting up the repo as long as there is a task/issue to 
> > ensure the integration test will be run against Swift soon or would this be 
> > a blocker?
> >
> > Also, I am not sure about Julia, I have not looked into Julia’s 
> > implementation.
> >
> > Thank you,
> > Alva Bandy
> >
> > On 2023/10/10 08:54:30 Antoine Pitrou wrote:
> >>
> >> Hello Alva,
> >>
> >> This is a reasonable request, but it might come with its own drawbacks
> >> as well.
> >>
> >> One significant drawback is that adding the Swift implementation to the
> >> cross-implementation integration tests will be slightly more complicated.
> >> It is very important that all Arrow implementations are
> >> integration-tested against each other, otherwise we only have a
> >> theoretical guarantee that they are compatible. See how this is done here:
> >> https://arrow.apache.org/docs/dev/format/Integration.html
> >>
> >> Unless I'm mistaken, neither Swift nor Julia are running the integration
> >> tests.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >>
> >> Le 09/10/2023 à 22:26, Alva Bandy a écrit :
> >>> Hi,
> >>>
> >>> I would like to request a repo for Arrow Swift (similar to arrow-rs).  
> >>> Swift arrow is currently fully Swift and doesn't leverage the C++ 
> >>> libraries. One of the goals of Arrow Swift was to provide a fully Swift 
> >>> impl and splitting them now would help ensure that Swift Arrow stays on 
> >>> this path.
> >>>
> >>> Also, the Swift Package Manager uses a git repo url to pull down a 
> >>> package.  This can lead to a large download since the entire arrow repo 
> >>> will be pulled down just to include Arrow Swift.  It would be great to 
> >>> make this change before registering Swift Arrow with a Swift registry 
> >>> (such as Swift Package Registry).
> >>>
> >>> Please let me know if this is possible and if so, what would be the 
> >>> process going forward.
> >>>
> >>> Thank you,
> >>> Alva Bandy
> >>>


Re: [Vote][Format] (new proposal) C data interface format string for ListView and LargeListView arrays

2023-10-06 Thread Dewey Dunnington
+1!

On Fri, Oct 6, 2023, 8:03 PM Matt Topol  wrote:

> +1
>
> On Fri, Oct 6, 2023, 6:55 PM Benjamin Kietzman 
> wrote:
>
> > +1
> >
> > On Fri, Oct 6, 2023, 17:27 Felipe Oliveira Carvalho  >
> > wrote:
> >
> > > Hello,
> > >
> > > I'm writing to propose "+vl" and "+vL" as format strings for list-view
> > and
> > > large list-view arrays passing through the Arrow C data interface [1].
> > >
> > > The previous proposal was considered a bad idea because existing
> parsers
> > of
> > > these format strings might be looking at only the first `l` (or `L`)
> > after
> > > the `+` and assuming the classic list format from that alone, so now
> I'm
> > > proposing we start with a `+v` as this prefix is not shared with any
> > other
> > > existing type so far.
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 - I'm in favor of this new C Data Format string
> > > [ ] +0
> > > [ ] -1 - I'm against adding this new format string because
> > >
> > > Thanks everyone!
> > >
> > > --
> > > Felipe
> > >
> > > [1] https://arrow.apache.org/docs/format/CDataInterface.html
> > >
> >
>


Re: [Vote][Format] C data interface format string for ListView and LargeListView arrays

2023-10-05 Thread Dewey Dunnington
I won't belabour the point any more, but the difference in layout
between a list and a list view is consequential enough to deserve its
own top-level character in my opinion. My vote would be +1 for +vl and
+vL.

On Thu, Oct 5, 2023 at 6:40 PM Felipe Oliveira Carvalho
 wrote:
>
> > Union format strings share enough properties that having them in the
> > same switch case doesn't result in additional complexity...lists and
> > list views are completely different types (for the purposes of parsing
> > the format string).
>
> Dense and sparse union differ a bit more than list and list-view.
>
> Not starting with `+l` for list-views would be a deviation from this
> pattern started by unions.
>
> ++---++
> | ``+ud:I,J,...``| dense union with type ids I,J...
>  ||
> ++---++
> | ``+us:I,J,...``| sparse union with type ids I,J...
>   ||
> ++---++
>
> Is sharing prefixes an issue?
>
> To make this more concrete, these are the parser changes for supporting
> `+lv` and `+Lv` as I proposed in the beginning:
>
> @@ -1097,9 +1101,9 @@ struct SchemaImporter {
>  RETURN_NOT_OK(f_parser_.CheckHasNext());
>  switch (f_parser_.Next()) {
>case 'l':
> -return ProcessListLike<ListType>();
> +return ProcessVarLengthList<false>();
>case 'L':
> -return ProcessListLike<LargeListType>();
> +return ProcessVarLengthList<true>();
>case 'w':
>  return ProcessFixedSizeList();
>case 's':
> @@ -1195,12 +1199,30 @@ struct SchemaImporter {
>  return CheckNoChildren(type);
>}
>
> -  template <typename ListType>
> -  Status ProcessListLike() {
> -RETURN_NOT_OK(f_parser_.CheckAtEnd());
> -RETURN_NOT_OK(CheckNumChildren(1));
> -ARROW_ASSIGN_OR_RAISE(auto field, MakeChildField(0));
> -type_ = std::make_shared<ListType>(field);
> +  template <bool is_large_variation>
> +  Status ProcessVarLengthList() {
> +if (f_parser_.AtEnd()) {
> +  RETURN_NOT_OK(CheckNumChildren(1));
> +  ARROW_ASSIGN_OR_RAISE(auto field, MakeChildField(0));
> +  if constexpr (is_large_variation) {
> +type_ = large_list(field);
> +  } else {
> +type_ = list(field);
> +  }
> +} else {
> +  if (f_parser_.Next() == 'v') {
> +RETURN_NOT_OK(CheckNumChildren(1));
> +ARROW_ASSIGN_OR_RAISE(auto field, MakeChildField(0));
> +if constexpr (is_large_variation) {
> +  type_ = large_list_view(field);
> +} else {
> +  type_ = list_view(field);
> +}
> +  } else {
> +return f_parser_.Invalid();
> +  }
> +}
> +
>  return Status::OK();
>}
>
> --
> Felipe
>
>
> On Thu, Oct 5, 2023 at 5:26 PM Antoine Pitrou  wrote:
>
> >
> > I don't think the parsing will be a problem even in C. It's not like you
> > have to backtrack anyway.
> >
> > +1 from me on Felipe's proposal.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 05/10/2023 à 20:33, Felipe Oliveira Carvalho a écrit :
> > > This mailing list thread is going to be the discussion.
> > >
> > > The union types also use two characters, so I didn’t think it would be a
> > > problem.
> > >
> > > —
> > > Felipe
> > >
> > > On Thu, 5 Oct 2023 at 15:26 Dewey Dunnington
> > 
> > > wrote:
> > >
> > >> I'm sorry for missing earlier discussion on this or a PR into the
> > >> format where this discussion may have occurred...is there a reason
> > >> that +lv and +Lv were chosen over a single-character version (i.e.,
> > >> maybe +v and +V)? A single-character version is (slightly) easier to
> > >> parse in C.
> > >>
> > >> On Thu, Oct 5, 2023 at 2:00 PM Felipe Oliveira Carvalho
> > >>  wrote:
> > >>>
> > >>> Hello,
> > >>>
> > >>> I'm writing to propose "+lv" and "+Lv" as format strings for list-view
> > >> and
> > >>> large list-view arrays passing through the Arrow C data interface [1].
> > >>>
> > >>> The vote will be open for at least 72 hours.
> > >>>
> > >>> [ ] +1 - I'm in favor of this new C Data Format string
> > >>> [ ] +0
> > >>> [ ] -1 - I'm against adding this new format string because
> > >>>
> > >>> Thanks everyone!
> > >>>
> > >>> --
> > >>> Felipe
> > >>>
> > >>> [1] https://arrow.apache.org/docs/format/CDataInterface.html
> > >>
> > >
> >


Re: [Vote][Format] C data interface format string for ListView and LargeListView arrays

2023-10-05 Thread Dewey Dunnington
+vl and +vL sound good to me!

On Thu, Oct 5, 2023 at 5:06 PM Ben Harkins  wrote:
>
> Not sure how consequential it'd be in practice, but my first thought is
> that "+vl" and "+vL" (or "+v"/"+V") would require fewer logic changes and
> extra checks for parsers. Plus, establishing a v-prefixed convention for
> views would avoid those downsides for plain binary types when BinaryView
> and Utf8View are added (e.g. "vz"/"vZ", "vu"/"vU"), assuming they'll be
> getting format strings as well.
>
> On Thu, Oct 5, 2023 at 1:00 PM Felipe Oliveira Carvalho 
> wrote:
>
> > Hello,
> >
> > I'm writing to propose "+lv" and "+Lv" as format strings for list-view and
> > large list-view arrays passing through the Arrow C data interface [1].
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 - I'm in favor of this new C Data Format string
> > [ ] +0
> > [ ] -1 - I'm against adding this new format string because
> >
> > Thanks everyone!
> >
> > --
> > Felipe
> >
> > [1] https://arrow.apache.org/docs/format/CDataInterface.html
> >


Re: [Vote][Format] C data interface format string for ListView and LargeListView arrays

2023-10-05 Thread Dewey Dunnington
Union format strings share enough properties that having them in the
same switch case doesn't result in additional complexity...lists and
list views are completely different types (for the purposes of parsing
the format string). Is there any reason *not* to use +v and +V? The
switch statements used to parse the format string are already rather
unwieldy...it would be a nice quality-of-life improvement (although by
no means a required one) to use a separate character.

On Thu, Oct 5, 2023 at 3:34 PM Felipe Oliveira Carvalho
 wrote:
>
> This mailing list thread is going to be the discussion.
>
> The union types also use two characters, so I didn’t think it would be a
> problem.
>
> —
> Felipe
>
> On Thu, 5 Oct 2023 at 15:26 Dewey Dunnington 
> wrote:
>
> > I'm sorry for missing earlier discussion on this or a PR into the
> > format where this discussion may have occurred...is there a reason
> > that +lv and +Lv were chosen over a single-character version (i.e.,
> > maybe +v and +V)? A single-character version is (slightly) easier to
> > parse in C.
> >
> > On Thu, Oct 5, 2023 at 2:00 PM Felipe Oliveira Carvalho
> >  wrote:
> > >
> > > Hello,
> > >
> > > I'm writing to propose "+lv" and "+Lv" as format strings for list-view
> > and
> > > large list-view arrays passing through the Arrow C data interface [1].
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 - I'm in favor of this new C Data Format string
> > > [ ] +0
> > > [ ] -1 - I'm against adding this new format string because
> > >
> > > Thanks everyone!
> > >
> > > --
> > > Felipe
> > >
> > > [1] https://arrow.apache.org/docs/format/CDataInterface.html
> >


Re: [Vote][Format] C data interface format string for ListView and LargeListView arrays

2023-10-05 Thread Dewey Dunnington
I'm sorry for missing earlier discussion on this or a PR into the
format where this discussion may have occurred...is there a reason
that +lv and +Lv were chosen over a single-character version (i.e.,
maybe +v and +V)? A single-character version is (slightly) easier to
parse in C.

On Thu, Oct 5, 2023 at 2:00 PM Felipe Oliveira Carvalho
 wrote:
>
> Hello,
>
> I'm writing to propose "+lv" and "+Lv" as format strings for list-view and
> large list-view arrays passing through the Arrow C data interface [1].
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 - I'm in favor of this new C Data Format string
> [ ] +0
> [ ] -1 - I'm against adding this new format string because
>
> Thanks everyone!
>
> --
> Felipe
>
> [1] https://arrow.apache.org/docs/format/CDataInterface.html


Re: [RESULT] Release Apache Arrow nanoarrow 0.3.0 - RC0

2023-10-04 Thread Dewey Dunnington
All post-release tasks complete! Thanks all for voting!

[x] Closed GitHub milestone
[x] Added release to the Apache Reporter System
[x] Uploaded artifacts to Subversion
[x] Created GitHub release
[x] Submit R package to CRAN
[x] Release blog post
[x] Sent announcement to annou...@apache.org
[x] Removed old artifacts from SVN
[x] Bumped versions on main

On Fri, Sep 29, 2023 at 1:16 PM Dewey Dunnington  wrote:
>
> The vote passes with 4 +1 binding and 4 +1 non-binding votes!
>
> I will take care of the following post-release tasks:
>
> [ ] Closed GitHub milestone
> [ ] Added release to the Apache Reporter System
> [ ] Uploaded artifacts to Subversion
> [ ] Created GitHub release
> [ ] Submit R package to CRAN
> [ ] Release blog post
> [ ] Sent announcement to annou...@apache.org
> [ ] Removed old artifacts from SVN
> [ ] Bumped versions on main
>
> On Fri, Sep 29, 2023 at 1:14 PM Dewey Dunnington  
> wrote:
> >
> > My vote is +1 (verified on MacOS 13.6 aarch64)
> >
> > On Fri, Sep 29, 2023 at 10:04 AM Jean-Baptiste Onofré  
> > wrote:
> > >
> > > +1 (non binding)
> > >
> > > Tested on MacOS 13.5 (aarch64).
> > >
> > > Regards
> > > JB
> > >
> > > On Tue, Sep 26, 2023 at 5:23 PM Dewey Dunnington
> > >  wrote:
> > > >
> > > > Hello,
> > > >
> > > > I would like to propose the following release candidate (rc0) of
> > > > Apache Arrow nanoarrow [0] version 0.3.0. This is an initial release
> > > > consisting of 42 resolved GitHub issues from 4 contributors [1].
> > > >
> > > > This release candidate is based on commit:
> > > > c00cd7707bcddb4dab9a7d19bf63e87c06d36c63 [2]
> > > >
> > > > The source release rc0 is hosted at [3].
> > > > The changelog is located at [4].
> > > >
> > > > Please download, verify checksums and signatures, run the unit tests,
> > > > and vote on the release. See [5] for how to validate a release
> > > > candidate.
> > > >
> > > > See also a successful suite of verification runs at [6].
> > > >
> > > > The vote will be open for at least 72 hours.
> > > >
> > > > [ ] +1 Release this as Apache Arrow nanoarrow 0.3.0
> > > > [ ] +0
> > > > [ ] -1 Do not release this as Apache Arrow nanoarrow 0.3.0 because...
> > > >
> > > > [0] https://github.com/apache/arrow-nanoarrow
> > > > [1] https://github.com/apache/arrow-nanoarrow/milestone/3?closed=1
> > > > [2] 
> > > > https://github.com/apache/arrow-nanoarrow/tree/apache-arrow-nanoarrow-0.3.0-rc0
> > > > [3] 
> > > > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-nanoarrow-0.3.0-rc0/
> > > > [4] 
> > > > https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.3.0-rc0/CHANGELOG.md
> > > > [5] 
> > > > https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/README.md
> > > > [6] https://github.com/apache/arrow-nanoarrow/actions/runs/6314579940


[ANNOUNCE] Apache Arrow nanoarrow 0.3.0 Released

2023-09-29 Thread Dewey Dunnington
The Apache Arrow community is pleased to announce the 0.3.0 release of
Apache Arrow nanoarrow. This release covers 42 resolved issues from 4
contributors[1].

The release is available now from [2].

Release notes are available at:
https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.3.0/CHANGELOG.md

What is Apache Arrow?
---------------------
Apache Arrow is a columnar in-memory analytics layer designed to
accelerate big data. It houses a set of canonical in-memory
representations of flat and hierarchical data along with multiple
language-bindings for structure manipulation. It also provides
low-overhead streaming and batch messaging, zero-copy interprocess
communication (IPC), and vectorized in-memory analytics libraries.
Languages currently supported include C, C++, C#, Go, Java,
JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust.

What is Apache Arrow nanoarrow?
-------------------------------
Apache Arrow nanoarrow is a small C library for building and
interpreting Arrow C Data interface structures with bindings for users
of the R programming language. The vision of nanoarrow is that it
should be trivial for a library or application to implement an
Arrow-based interface. The library provides helpers to create types,
schemas, and metadata, an API for building arrays element-wise,
and an API to extract elements element-wise from an array. For a more
detailed description of the features nanoarrow provides and motivation
for its development, see [3].

Please report any feedback to the mailing lists ([4], [5]).

Regards,
The Apache Arrow Community

[1]: 
https://github.com/apache/arrow-nanoarrow/issues?q=is%3Aissue+milestone%3A%22nanoarrow+0.3.0%22+is%3Aclosed
[2]: https://www.apache.org/dyn/closer.cgi/arrow/apache-arrow-nanoarrow-0.3.0
[3]: https://github.com/apache/arrow-nanoarrow
[4]: https://lists.apache.org/list.html?u...@arrow.apache.org
[5]: https://lists.apache.org/list.html?dev@arrow.apache.org


[RESULT] Release Apache Arrow nanoarrow 0.3.0 - RC0

2023-09-29 Thread Dewey Dunnington
The vote passes with 4 +1 binding and 4 +1 non-binding votes!

I will take care of the following post-release tasks:

[ ] Closed GitHub milestone
[ ] Added release to the Apache Reporter System
[ ] Uploaded artifacts to Subversion
[ ] Created GitHub release
[ ] Submit R package to CRAN
[ ] Release blog post
[ ] Sent announcement to annou...@apache.org
[ ] Removed old artifacts from SVN
[ ] Bumped versions on main

On Fri, Sep 29, 2023 at 1:14 PM Dewey Dunnington  wrote:
>
> My vote is +1 (verified on MacOS 13.6 aarch64)
>
> On Fri, Sep 29, 2023 at 10:04 AM Jean-Baptiste Onofré  
> wrote:
> >
> > +1 (non binding)
> >
> > Tested on MacOS 13.5 (aarch64).
> >
> > Regards
> > JB
> >
> > On Tue, Sep 26, 2023 at 5:23 PM Dewey Dunnington
> >  wrote:
> > >
> > > Hello,
> > >
> > > I would like to propose the following release candidate (rc0) of
> > > Apache Arrow nanoarrow [0] version 0.3.0. This is an initial release
> > > consisting of 42 resolved GitHub issues from 4 contributors [1].
> > >
> > > This release candidate is based on commit:
> > > c00cd7707bcddb4dab9a7d19bf63e87c06d36c63 [2]
> > >
> > > The source release rc0 is hosted at [3].
> > > The changelog is located at [4].
> > >
> > > Please download, verify checksums and signatures, run the unit tests,
> > > and vote on the release. See [5] for how to validate a release
> > > candidate.
> > >
> > > See also a successful suite of verification runs at [6].
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 Release this as Apache Arrow nanoarrow 0.3.0
> > > [ ] +0
> > > [ ] -1 Do not release this as Apache Arrow nanoarrow 0.3.0 because...
> > >
> > > [0] https://github.com/apache/arrow-nanoarrow
> > > [1] https://github.com/apache/arrow-nanoarrow/milestone/3?closed=1
> > > [2] 
> > > https://github.com/apache/arrow-nanoarrow/tree/apache-arrow-nanoarrow-0.3.0-rc0
> > > [3] 
> > > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-nanoarrow-0.3.0-rc0/
> > > [4] 
> > > https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.3.0-rc0/CHANGELOG.md
> > > [5] 
> > > https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/README.md
> > > [6] https://github.com/apache/arrow-nanoarrow/actions/runs/6314579940


Re: [VOTE] Release Apache Arrow nanoarrow 0.3.0 - RC0

2023-09-29 Thread Dewey Dunnington
My vote is +1 (verified on MacOS 13.6 aarch64)

On Fri, Sep 29, 2023 at 10:04 AM Jean-Baptiste Onofré  wrote:
>
> +1 (non binding)
>
> Tested on MacOS 13.5 (aarch64).
>
> Regards
> JB
>
> On Tue, Sep 26, 2023 at 5:23 PM Dewey Dunnington
>  wrote:
> >
> > Hello,
> >
> > I would like to propose the following release candidate (rc0) of
> > Apache Arrow nanoarrow [0] version 0.3.0. This is an initial release
> > consisting of 42 resolved GitHub issues from 4 contributors [1].
> >
> > This release candidate is based on commit:
> > c00cd7707bcddb4dab9a7d19bf63e87c06d36c63 [2]
> >
> > The source release rc0 is hosted at [3].
> > The changelog is located at [4].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. See [5] for how to validate a release
> > candidate.
> >
> > See also a successful suite of verification runs at [6].
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow nanoarrow 0.3.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow nanoarrow 0.3.0 because...
> >
> > [0] https://github.com/apache/arrow-nanoarrow
> > [1] https://github.com/apache/arrow-nanoarrow/milestone/3?closed=1
> > [2] 
> > https://github.com/apache/arrow-nanoarrow/tree/apache-arrow-nanoarrow-0.3.0-rc0
> > [3] 
> > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-nanoarrow-0.3.0-rc0/
> > [4] 
> > https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.3.0-rc0/CHANGELOG.md
> > [5] 
> > https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/README.md
> > [6] https://github.com/apache/arrow-nanoarrow/actions/runs/6314579940


Re: [VOTE][Format] Variable shape tensor canonical extension type

2023-09-29 Thread Dewey Dunnington
+1! Thank you for iterating on this with all of us!

On Fri, Sep 29, 2023 at 11:28 AM Alenka Frim
 wrote:
>
> +1
> Thanks for pushing this through!
>
> On Wed, Sep 27, 2023 at 2:44 PM Rok Mihevc  wrote:
>
> > Hi all,
> >
> > Following the discussion [1][2] I would like to propose a vote to add
> > variable shape tensor canonical extension type language to
> > CanonicalExtensions.rst [3] as written below.
> > A draft C++ implementation and a Python wrapper can be seen here [2].
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Accept this proposal
> > [ ] +0
> > [ ] -1 Do not accept this proposal because...
> >
> >
> > [1] https://lists.apache.org/thread/qc9qho0fg5ph1dns4hjq56hp4tj7rk1k
> > [2] https://github.com/apache/arrow/pull/37166
> > [3]
> >
> > https://github.com/apache/arrow/blob/main/docs/source/format/CanonicalExtensions.rst
> >
> >
> > Variable shape tensor
> > =
> >
> > * Extension name: `arrow.variable_shape_tensor`.
> >
> > * The storage type of the extension is: ``StructArray`` where struct
> >   is composed of **data** and **shape** fields describing a single
> >   tensor per row:
> >
> >   * **data** is a ``List`` holding tensor elements of a single tensor.
> > Data type of the list elements is uniform across the entire column.
> >   * **shape** is a ``FixedSizeList[ndim]`` of the tensor shape where
> > the size of the list ``ndim`` is equal to the number of dimensions
> > of the tensor.
> >
> > * Extension type parameters:
> >
> >   * **value_type** = the Arrow data type of individual tensor elements.
> >
> >   Optional parameters describing the logical layout:
> >
> >   * **dim_names** = explicit names of tensor dimensions
> > as an array. The length of it should be equal to the shape
> > length and equal to the number of dimensions.
> >
> > ``dim_names`` can be used if the dimensions have well-known
> > names and they map to the physical layout (row-major).
> >
> >   * **permutation**  = indices of the desired ordering of the
> > original dimensions, defined as an array.
> >
> > The indices contain a permutation of the values [0, 1, .., N-1] where
> > N is the number of dimensions. The permutation indicates which
> > dimension of the logical layout corresponds to which dimension of the
> > physical tensor (the i-th dimension of the logical view corresponds
> > to the dimension with number ``permutations[i]`` of the physical
> > tensor).
> >
> > Permutation can be useful in case the logical order of
> > the tensor is a permutation of the physical order (row-major).
> >
> > When logical and physical layout are equal, the permutation will always
> > be ([0, 1, .., N-1]) and can therefore be left out.
> >
> >   * **uniform_dimensions** = indices of dimensions whose sizes are
> > guaranteed to remain constant. Indices are a subset of all possible
> > dimension indices ([0, 1, .., N-1]).
> > The uniform dimensions must still be represented in the ``shape`` field,
> > and must always be the same value for all tensors in the array -- this
> > allows code to interpret the tensor correctly without accounting for
> > uniform dimensions while still permitting optional optimizations that
> > take advantage of the uniformity. ``uniform_dimensions`` can be left
> > out, in which case it is assumed that all dimensions might be variable.
> >
> >   * **uniform_shape** = shape of the dimensions that are guaranteed to stay
> > constant over all tensors in the array, with the shape of the ragged
> > dimensions set to 0.
> > An array containing a tensor with shape (2, 3, 4) and
> > ``uniform_dimensions`` (0, 2) would have ``uniform_shape`` (2, 0, 4).
> >
> > * Description of the serialization:
> >
> >   The metadata must be a valid JSON object, that optionally includes
> >   dimension names with keys **"dim_names"**, ordering of
> >   dimensions with key **"permutation"**, indices of dimensions whose sizes
> >   are guaranteed to remain constant with key **"uniform_dimensions"** and
> >   shape of those dimensions with key **"uniform_shape"**.
> >   Minimal metadata is an empty JSON object.
> >
> >   - Example of minimal metadata is:
> >
> > ``{}``
> >
> >   - Example with ``dim_names`` metadata for NCHW ordered data:
> >
> > ``{ "dim_names": ["C", "H", "W"] }``
> >
> >   - Example with ``uniform_dimensions`` metadata for a set of color images
> > with variable width:
> >
> > ``{ "dim_names": ["H", "W", "C"], "uniform_dimensions": [1] }``
> >
> >   - Example of permuted 3-dimensional tensor:
> >
> > ``{ "permutation": [2, 0, 1] }``
> >
> > Here the physical layout shape is the stored shape: given an
> > individual tensor of physical shape [100, 200, 500], the shape of
> > the logical layout would be ``[500, 100, 200]``.
> >
> > .. note::
> >
> >   With the exception of permutation all other parameters and storage
> >   of 

Re: [Format] C Data Interface integration testing

2023-09-26 Thread Dewey Dunnington
Thank you for setting this up! I look forward to adding nanoarrow as
soon as time allows.

Cheers,

-dewey

On Tue, Sep 26, 2023 at 9:48 AM Antoine Pitrou  wrote:
>
>
> Hello,
>
> We have added some infrastructure for integration testing of the C Data
> Interface between Arrow implementations. We are now testing the C++ and
> Go implementations, but the goal in the future is for all major
> implementations to be tested there (perhaps including nanoarrow).
>
> - PR to add the testing infrastructure and enable the C++ implementation:
> https://github.com/apache/arrow/pull/37769
>
> - PR to enable the Go implementation
> https://github.com/apache/arrow/pull/37788
>
> Feel free to ask any questions.
>
> Regards
>
> Antoine.
>
>
>


[VOTE] Release Apache Arrow nanoarrow 0.3.0 - RC0

2023-09-26 Thread Dewey Dunnington
Hello,

I would like to propose the following release candidate (rc0) of
Apache Arrow nanoarrow [0] version 0.3.0. This is an initial release
consisting of 42 resolved GitHub issues from 4 contributors [1].

This release candidate is based on commit:
c00cd7707bcddb4dab9a7d19bf63e87c06d36c63 [2]

The source release rc0 is hosted at [3].
The changelog is located at [4].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. See [5] for how to validate a release
candidate.

See also a successful suite of verification runs at [6].

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow nanoarrow 0.3.0
[ ] +0
[ ] -1 Do not release this as Apache Arrow nanoarrow 0.3.0 because...

[0] https://github.com/apache/arrow-nanoarrow
[1] https://github.com/apache/arrow-nanoarrow/milestone/3?closed=1
[2] 
https://github.com/apache/arrow-nanoarrow/tree/apache-arrow-nanoarrow-0.3.0-rc0
[3] 
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-nanoarrow-0.3.0-rc0/
[4] 
https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.3.0-rc0/CHANGELOG.md
[5] https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/README.md
[6] https://github.com/apache/arrow-nanoarrow/actions/runs/6314579940


Re: [VOTE] Release Apache Arrow ADBC 0.7.0 - RC0

2023-09-22 Thread Dewey Dunnington
+1! I ran:

export DOCKER_DEFAULT_PLATFORM=linux/amd64 && \
  USE_CONDA=1 dev/release/verify-release-candidate.sh 0.7.0 0

On Thu, Sep 21, 2023 at 12:52 AM Sutou Kouhei  wrote:
>
> +1
>
> I ran the following on Debian GNU/Linux sid:
>
>   JAVA_HOME=/usr/lib/jvm/default-java \
> TEST_PYTHON=0 \
> TEST_WHEELS=0 \
> dev/release/verify-release-candidate.sh 0.7.0 0
>
> with:
>
>   * g++ (Debian 13.2.0-2) 13.2.0
>   * go version go1.21.0 linux/amd64
>   * openjdk version "17.0.9-ea" 2023-10-17
>   * ruby 3.3.0dev (2023-08-30T23:37:11Z master 0aa404b957) [x86_64-linux]
>   * R version 4.3.1 (2023-06-16) -- "Beagle Scouts"
>
>
> Thanks,
> --
> kou
>
> In <7471dd14-ea4c-4953-bb68-79f11d4a4...@app.fastmail.com>
>   "[VOTE] Release Apache Arrow ADBC 0.7.0 - RC0" on Wed, 20 Sep 2023 13:03:43 
> -0400,
>   "David Li"  wrote:
>
> > Hello,
> >
> > I would like to propose the following release candidate (RC0) of Apache 
> > Arrow ADBC version 0.7.0. This is a release consisting of 50 resolved 
> > GitHub issues [1].
> >
> > This release candidate is based on commit: 
> > efb72b4729e0f99c7d1f6723c1a966e011fa478f [2]
> > This is the first release using API specification 1.1.0.
> >
> > The source release rc0 is hosted at [3].
> > The binary artifacts are hosted at [4][5][6][7][8].
> > The changelog is located at [9].
> >
> > Please download, verify checksums and signatures, run the unit tests, and 
> > vote on the release. See [10] for how to validate a release candidate.
> >
> > See also a verification result on GitHub Actions [11].
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow ADBC 0.7.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow ADBC 0.7.0 because...
> >
> > Note: to verify APT/YUM packages on macOS/AArch64, you must `export 
> > DOCKER_DEFAULT_PLATFORM=linux/amd64`. (Or skip this step by `export 
> > TEST_APT=0 TEST_YUM=0`.)
> >
> > [1]: 
> > https://github.com/apache/arrow-adbc/issues?q=is%3Aissue+milestone%3A%22ADBC+Libraries+0.7.0%22+is%3Aclosed
> > [2]: 
> > https://github.com/apache/arrow-adbc/commit/efb72b4729e0f99c7d1f6723c1a966e011fa478f
> > [3]: 
> > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-adbc-0.7.0-rc0/
> > [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> > [5]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> > [6]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> > [7]: 
> > https://repository.apache.org/content/repositories/staging/org/apache/arrow/adbc/
> > [8]: 
> > https://github.com/apache/arrow-adbc/releases/tag/apache-arrow-adbc-0.7.0-rc0
> > [9]: 
> > https://github.com/apache/arrow-adbc/blob/apache-arrow-adbc-0.7.0-rc0/CHANGELOG.md
> > [10]: 
> > https://arrow.apache.org/adbc/main/development/releasing.html#how-to-verify-release-candidates
> > [11]: https://github.com/apache/arrow-adbc/actions/runs/6251522630


Re: [LAST CALL][DISCUSS] Unsigned integers in Utf8View

2023-09-19 Thread Dewey Dunnington
Hi all,

Sorry for the late reply!

I would lean towards signed integers because we don't use unsigned
integers anywhere in the existing specification (other than as a data
type). While they are allowed as dictionary index values, the spec
specifically discourages their use [1]. If the times have changed and
this is no longer the case perhaps there should be a wider effort to
support unsigned values in other places that extend beyond a single
type?

-dewey

[1] https://github.com/apache/arrow/blob/main/format/Schema.fbs#L453-L457

On Tue, Sep 19, 2023 at 5:26 PM Benjamin Kietzman  wrote:
>
> Hello again all,
>
> It seems there hasn't been much interest in this point so I'm leaning
> toward keeping unsigned integers. If anyone has a concern please respond
> here and/or on the PR [1].
>
> Sincerely,
> Ben Kietzman
>
> [1] https://github.com/apache/arrow/pull/37526#discussion_r1323029022
>
> On Thu, Sep 14, 2023 at 9:31 AM David Li  wrote:
>
> > I think Java was usually raised as the odd child out when this has come up
> > before. Since Java 8 there are standard library methods to manipulate
> > signed integers as if they were unsigned, so in principle Java shouldn't be
> > a blocker anymore.
> >
> > That said, ByteBuffer is still indexed by int so in practice Java wouldn't
> > be able to handle more than 2 GB in a single buffer, at least until we can
> > use the Java 21+ APIs (MemorySegment is finally indexed by (signed) long).
> >
> > On Tue, Sep 12, 2023, at 11:40, Benjamin Kietzman wrote:
> > > Hello all,
> > >
> > > Utf8View was recently accepted [1] and I've opened a PR to add the
> > > spec/schema changes [2]. In review [3], it was requested that signed 32
> > bit
> > > integers be used for the fields of view structs instead of 32 bit
> > unsigned.
> > >
> > > This divergence has been discussed on the ML previously [4], but in light
> > > of my reviewer's request for a change it should be raised again for
> > focused
> > > discussion. (At this stage, I don't *think* the change would require
> > > another vote.) I'll enumerate the motivations for signed and unsigned as
> > I
> > > understand them.
> > >
> > > Signed:
> > > - signed integers are conventional in the arrow format
> > > - unsigned integers may cause some difficulty of implementation in
> > > languages which don't natively support them
> > >
> > > Unsigned:
> > > - unsigned integers are used by engines which already implement Utf8View
> > >
> > > My own bias is toward compatibility with existing implementers, but using
> > > signed integers will only affect the case of arrays which include data
> > > buffers larger than 2GB. For reference, the default buffer size in velox
> > is
> > > 32KB so such a massive data buffer would only occur when a single slot
> > of a
> > > string array has 2.1GB of characters. This seems sufficiently unlikely
> > that
> > > I wouldn't consider it a blocker.
> > >
> > > Sincerely,
> > > Ben Kietzman
> > >
> > > [1] https://lists.apache.org/thread/wt9j3q7qd59cz44kyh1zkts8s6wo1dn6
> > > [2] https://github.com/apache/arrow/pull/37526
> > > [3] https://github.com/apache/arrow/pull/37526#discussion_r1323029022
> > > [4] https://lists.apache.org/thread/w88tpz76ox8h3rxkjl4so6rg3f1rv7wt
> > > [5]
> > >
> > https://github.com/facebookincubator/velox/blob/947d98c99a7cf05bfa4e409b1542abc89a28cb29/velox/vector/FlatVector.h#L46-L50
> >
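To make the size trade-off in this thread concrete, here is a small stdlib-only sketch (not from the thread, and not Arrow code) of the per-buffer caps implied by signed versus unsigned 32-bit view fields:

```python
# Maximum addressable bytes in a single Utf8View data buffer,
# depending on whether the view struct's offset field is signed.
SIGNED_INT32_MAX = 2**31 - 1    # cap with signed 32-bit offsets
UNSIGNED_INT32_MAX = 2**32 - 1  # cap with unsigned 32-bit offsets

def max_buffer_gb(signed: bool) -> float:
    """Return the per-buffer cap in decimal gigabytes."""
    cap = SIGNED_INT32_MAX if signed else UNSIGNED_INT32_MAX
    return cap / 1e9

# With signed offsets, a single data buffer tops out around 2.1 GB,
# which is the "2.1GB of characters in one slot" case mentioned above;
# implementations simply split data across multiple buffers.
print(round(max_buffer_gb(signed=True), 2))   # 2.15
print(round(max_buffer_gb(signed=False), 2))  # 4.29
```

This is only arithmetic on the field widths; the actual view-struct layout is defined in the spec PR linked above.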


Re: [DISCUSS] Proposal to add VariableShapeTensor Canonical Extension Type

2023-09-01 Thread Dewey Dunnington
Thank you for proposing this! I left a comment on the PR as well, but
I'm excited for this to standardize a few concepts that I have run
into whilst working on ADBC and GeoArrow:

- Properly returning an array with >1 dimension from the PostgreSQL ADBC driver
- As the basis for encoding raster tiles as rows in a table (e.g.,
http://www.geopackage.org/spec/#_tile_matrix_introduction )

Excited to see the PR progress!

-dewey

On Thu, Aug 17, 2023 at 9:54 AM Rok Mihevc  wrote:
>
> Hey all!
>
>
> Besides the recently added FixedShapeTensor [1] canonical extension type
> there appears to be a need for an already proposed VariableShapeTensor
> [2]. VariableShapeTensor
> would store tensors of variable shapes but uniform number of
> dimensions, dimension names and dimension permutations.
>
> There are examples of such types: Ray implements
> ArrowVariableShapedTensorType [3] and pytorch implements torch.nested [4].
>
> I propose we discuss adding the below text to
> format/CanonicalExtensions.rst to read as [5] and a C++/Python
> implementation as proposed in [6]. A vote can be called after a discussion
> here.
>
> Variable shape tensor
>
> =
>
> * Extension name: `arrow.variable_shape_tensor`.
>
> * The storage type of the extension is: ``StructArray`` where struct
>
>   is composed of **data** and **shape** fields describing a single
>
>   tensor per row:
>
>   * **data** is a ``List`` holding tensor elements of a single tensor.
>
> Data type of the list elements is uniform across the entire column
>
> and also provided in metadata.
>
>   * **shape** is a ``FixedSizeList`` of the tensor shape where
>
> the size of the list is equal to the number of dimensions of the
>
> tensor.
>
> * Extension type parameters:
>
>   * **value_type** = the Arrow data type of individual tensor elements.
>
>   * **ndim** = the number of dimensions of the tensor.
>
>   Optional parameters describing the logical layout:
>
>   * **dim_names** = explicit names to tensor dimensions
>
> as an array. Its length should be equal to the number of
>
> dimensions of the tensor.
>
> ``dim_names`` can be used if the dimensions have well-known
>
> names and they map to the physical layout (row-major).
>
>   * **permutation**  = indices of the desired ordering of the
>
> original dimensions, defined as an array.
>
> The indices contain a permutation of the values [0, 1, .., N-1] where
>
> N is the number of dimensions. The permutation indicates which
>
> dimension of the logical layout corresponds to which dimension of the
>
> physical tensor (the i-th dimension of the logical view corresponds
>
> to the dimension with number ``permutations[i]`` of the physical
> tensor).
>
> Permutation can be useful in case the logical order of
>
> the tensor is a permutation of the physical order (row-major).
>
> When logical and physical layout are equal, the permutation will always
>
> be ([0, 1, .., N-1]) and can therefore be left out.
>
> * Description of the serialization:
>
>   The metadata must be a valid JSON object including number of
>
>   dimensions of the contained tensors as an integer with key **"ndim"**
>
>   plus optional dimension names with keys **"dim_names"** and ordering of
>
>   the dimensions with key **"permutation"**.
>
>   - Example: ``{ "ndim": 2}``
>
>   - Example with ``dim_names`` metadata for NCHW ordered data:
>
> ``{ "ndim": 3, "dim_names": ["C", "H", "W"]}``
>
>   - Example of permuted 3-dimensional tensor:
>
> ``{ "ndim": 3, "permutation": [2, 0, 1]}``
>
> This is the physical layout shape; given an individual tensor
>
> with physical shape ``[100, 200, 500]``, the shape of the logical
>
> layout would be ``[500, 100, 200]``.
>
> .. note::
>
>   Elements in a variable shape tensor extension array are stored
>
>   in row-major/C-contiguous order.
>
>
> [1] https://github.com/apache/arrow/issues/33924
>
> [2] https://github.com/apache/arrow/issues/24868
>
> [3]
> https://github.com/ray-project/ray/blob/ada5db71db36f672301639a61b5849fd4fd5914e/python/ray/air/util/tensor_extensions/arrow.py#L528-L809
>
> [4] https://pytorch.org/docs/stable/nested.html
>
> [5]
> https://github.com/apache/arrow/blob/db8d764ac3e47fa22df13b32fa77b3ad53166d58/docs/source/format/CanonicalExtensions.rst#variable-shape-tensor
>
> [6] https://github.com/apache/arrow/pull/37166
>
>
>
> Best,
>
> Rok
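The permutation semantics in the proposed spec text can be illustrated with a short stdlib-only sketch (illustrative only, not part of the proposal): the i-th logical dimension is dimension ``permutation[i]`` of the physical tensor.

```python
def logical_shape(physical_shape, permutation):
    """Compute the logical (view) shape of a tensor stored with the
    given physical shape: the i-th logical dimension corresponds to
    dimension permutation[i] of the physical tensor."""
    return [physical_shape[p] for p in permutation]

# The proposal's example: permutation [2, 0, 1] applied to an
# individual tensor with physical shape [100, 200, 500].
print(logical_shape([100, 200, 500], [2, 0, 1]))  # [500, 100, 200]

# The identity permutation leaves the shape unchanged, which is why
# [0, 1, ..., N-1] can be omitted from the metadata.
print(logical_shape([100, 200, 500], [0, 1, 2]))  # [100, 200, 500]
```

The data itself stays row-major in storage; only the logical interpretation of the dimensions changes.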


Re: [VOTE] Release Apache Arrow ADBC 0.6.0 - RC0

2023-08-26 Thread Dewey Dunnington
My vote: +1

I ran USE_CONDA=1 dev/release/verify-release-candidate.sh 0.6.0 0

...on Ubuntu 22.04.

On Thu, Aug 24, 2023 at 12:13 PM David Li  wrote:
>
> My vote: +1
>
> Tested sources, binaries on macOS 13.5/AArch64
>
> On Thu, Aug 24, 2023, at 07:02, Raúl Cumplido wrote:
> > +1 non-binding
> >
> > I did run:
> > USE_CONDA=1 dev/release/verify-release-candidate.sh 0.6.0 0
> >
> > on Ubuntu 22.04.
> >
> > Thanks,
> > Raúl
> >
> > El jue, 24 ago 2023 a las 8:13, Sutou Kouhei () 
> > escribió:
> >>
> >> +1
> >>
> >> I ran the following on Debian GNU/Linux sid:
> >>
> >>   JAVA_HOME=/usr/lib/jvm/default-java \
> >> TEST_PYTHON=0 \
> >> TEST_WHEELS=0 \
> >> dev/release/verify-release-candidate.sh 0.6.0 0
> >>
> >> with:
> >>
> >>   * g++ (Debian 13.2.0-2) 13.2.0
> >>   * go version go1.20.7 linux/amd64
> >>   * openjdk version "17.0.8" 2023-07-18
> >>   * ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux-gnu]
> >>   * R version 4.3.1 (2023-06-16) -- "Beagle Scouts"
> >>
> >> Note:
> >>
> >> I disabled Python related tests because I got segmentation
> >> faults. It seems that my Python is broken because Apache
> >> Arrow 13.0.0 tests were crashed too.
> >>
> >>   https://github.com/apache/arrow/issues/37297
> >>
> >>
> >> Thanks,
> >> --
> >> kou
> >>
> >>
> >> In <43932c10-78d2-4aa5-ba60-a9a017647...@app.fastmail.com>
> >>   "[VOTE] Release Apache Arrow ADBC 0.6.0 - RC0" on Wed, 23 Aug 2023 
> >> 18:15:16 -0400,
> >>   "David Li"  wrote:
> >>
> >> > Hello,
> >> >
> >> > I would like to propose the following release candidate (RC0) of Apache 
> >> > Arrow ADBC version 0.6.0. This is a release consisting of 46 resolved 
> >> > GitHub issues [1].
> >> >
> >> > This release candidate is based on commit: 
> >> > 40ff492c5f78ff3867a2b768bc46c2c17017a76f [2]
> >> >
> >> > The source release rc0 is hosted at [3].
> >> > The binary artifacts are hosted at [4][5][6][7][8].
> >> > The changelog is located at [9].
> >> >
> >> > Please download, verify checksums and signatures, run the unit tests, 
> >> > and vote on the release. See [10] for how to validate a release 
> >> > candidate.
> >> >
> >> > See also a verification result on GitHub Actions [11].
> >> >
> >> > The vote will be open for at least 72 hours.
> >> >
> >> > [ ] +1 Release this as Apache Arrow ADBC 0.6.0
> >> > [ ] +0
> >> > [ ] -1 Do not release this as Apache Arrow ADBC 0.6.0 because...
> >> >
> >> > Note: to verify APT/YUM packages on macOS/AArch64, you must `export 
> >> > DOCKER_DEFAULT_ARCHITECTURE=linux/amd64`. (Or skip this step by `export 
> >> > TEST_APT=0 TEST_YUM=0`.)
> >> >
> >> > [1]: 
> >> > https://github.com/apache/arrow-adbc/issues?q=is%3Aissue+milestone%3A%22ADBC+Libraries+0.6.0%22+is%3Aclosed
> >> > [2]: 
> >> > https://github.com/apache/arrow-adbc/commit/40ff492c5f78ff3867a2b768bc46c2c17017a76f
> >> > [3]: 
> >> > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-adbc-0.6.0-rc0/
> >> > [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> >> > [5]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> >> > [6]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> >> > [7]: 
> >> > https://repository.apache.org/content/repositories/staging/org/apache/arrow/adbc/
> >> > [8]: 
> >> > https://github.com/apache/arrow-adbc/releases/tag/apache-arrow-adbc-0.6.0-rc0
> >> > [9]: 
> >> > https://github.com/apache/arrow-adbc/blob/apache-arrow-adbc-0.6.0-rc0/CHANGELOG.md
> >> > [10]: 
> >> > https://arrow.apache.org/adbc/main/development/releasing.html#how-to-verify-release-candidates
> >> > [11]: https://github.com/apache/arrow-adbc/actions/runs/5956799244


Re: [Discuss] Do we need a release verification script?

2023-08-23 Thread Dewey Dunnington
I can certainly empathize with the difficulty of maintaining a release
verification script and also the difficulty of remembering which
combination of environment variables are needed on my system to make
it work. The general sentiment of "anybody should be able to check
that this release actually works" is a good one, and (at least for R)
we have encountered many errors by running tests locally that were
never surfaced (and might never have been surfaced) on CI.

I wonder if the difficulty of maintaining the release script is at
least partially tied to the fact that we are trying to release so many
components simultaneously? Is it time to start releasing components
independently?

On Tue, Aug 22, 2023 at 10:11 PM Sutou Kouhei  wrote:
>
> Hi,
>
> We can ask the ASF about this. memb...@apache.org?
>
>
> Thanks,
> --
> kou
>
> In <367f6bb7-d8a5-ea49-f5d8-e4fa1afae...@python.org>
>   "Re: [Discuss] Do we need a release verification script?" on Tue, 22 Aug 
> 2023 17:18:02 +0200,
>   Antoine Pitrou  wrote:
>
> >
> > And of course this is a bit pedantic, and only important if we want to
> > comply with *the letter* of the ASF policies. My own personal opinion
> > is that complying in spirit is enough (but I'm also not sure I
> > understand the ASF's spirit :-)).
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 22/08/2023 à 17:10, Antoine Pitrou a écrit :
> >> Hmm... perhaps Flatbuffers compilation is usually more deterministic
> >> than compiling C++ code into machine code, but that's mainly (AFAIK)
> >> because the transformation step is much simpler in the former case,
> >> than
> >> in the latter. The Flatbuffers compiler also has a range of options
> >> that
> >> influence code generation, certainly with less variation than a C++
> >> compiler, but still.
> >> In other words, I don't think being deterministic is a good criterion
> >> to
> >> know what "compiled code" means. There is a growing movement towards
> >> making generation of machine code artifacts deterministic:
> >> https://reproducible-builds.org/
> >> Regards
> >> Antoine.
> >> Le 22/08/2023 à 16:47, Adam Lippai a écrit :
> >>> Compiled code usually means binaries you can’t derive in a
> >>> deterministic,
> >>> verifiable way from the source code *shipped next to it*. So in this
> >>> case
> >>> any developer should be able to reproduce the flatbuffers output from
> >>> the
> >>> release package only.
> >>>
> >>> “Caches”, multi stage compilation etc should be ok.
> >>>
> >>> Best regards,
> >>> Adam Lippai
> >>>
> >>> On Tue, Aug 22, 2023 at 10:40 Antoine Pitrou 
> >>> wrote:
> >>>
> 
>  If the main impetus for the verification script is to comply with ASF
>  requirements, probably the script can be made much simpler, such as
>  just
>  verify the GPG signatures are valid? Or perhaps this can be achieved
>  without a script at all.
> 
>  The irony is that, however complex, our verification script doesn't
>  seem
>  to check the actual ASF requirements on artifacts.
> 
>  For example, we don't check that """a source release SHOULD not
>  contain
>  compiled code""" (also, what does "compiled code" mean? does generated
>  code, e.g. by the Flatbuffers compiler, apply?)
> 
>  Checking that the release """MUST be sufficient for a user to build
>  and
>  test the release provided they have access to the appropriate platform
>  and tools""" is ill-defined and potentially tautologic, because the
>  "appropriate platform and tools" is too imprecise and contextual (can
>  the "appropriate platform and tools" contain a bunch of proprietary
>  software that gets linked with the binaries? Well, it can, otherwise
>  you
>  can't build on Windows).
> 
>  Regards
> 
>  Antoine.
> 
> 
> 
>  Le 22/08/2023 à 12:31, Raúl Cumplido a écrit :
> > Hi,
> >
> > I do agree that currently verifying the release locally provides
> > little benefit for the effort we have to put in but I thought this was
> > required as per Apache policy:
> > https://www.apache.org/legal/release-policy.html#release-approval
> >
> > Copying the important bit:
> > """
> > Before casting +1 binding votes, individuals are REQUIRED to download
> > all signed source code packages onto their own hardware, verify that
> > they meet all requirements of ASF policy on releases as described
> > below, validate all cryptographic signatures, compile as provided, and
> > test the result on their own platform.
> > """
> >
> > I also think we should try and challenge those.
> >
> > In the past we have identified some minor issues on the local
> > verification but I don't recall any of them being blockers for the
> > release.
> >
> > Thanks,
> > Raúl
> >
> > El mar, 22 ago 2023 a las 11:46, Andrew Lamb ()
>  escribió:
> >>
> >> The Rust arrow implementation 

Re: [Vote][Format] C Data Interface Format string for REE

2023-08-16 Thread Dewey Dunnington
+1! Looking forward to implementing this in nanoarrow!

On Wed, Aug 16, 2023 at 11:18 AM Ian Cook  wrote:
>
> +1 (non-binding)
>
> On Wed, Aug 16, 2023 at 10:16 AM Matt Topol
>  wrote:
> >
> > Hey All,
> >
> > As proposed by Felipe [1] I'm starting a vote on the proposed update to the
> > Format Spec of adding "+r" as the format string for passing Run-End Encoded
> > arrays through the Arrow C Data Interface.
> >
> > A PR containing an update to the C++ Arrow implementation to add support
> > for this format string along with documentation updates can be found here
> > [2].
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 - I'm in favor of this new C Data Format string
> > [ ] +0
> > [ ] -1 - I'm against adding this new format string because
> >
> > Thanks everyone!
> >
> > --Matt
> >
> > [1]: https://lists.apache.org/thread/smco2mcmw2ob2msoyo84wd4oz8z5f781
> > [2]: https://github.com/apache/arrow/pull/37174
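For readers unfamiliar with the layout behind the "+r" format string: a run-end encoded array stores each run's value once, alongside the row index at which the run ends. A small illustrative sketch of that idea (stdlib only, not the Arrow implementation):

```python
def run_end_encode(values):
    """Encode a sequence as (run_ends, run_values); run_ends[i] is the
    exclusive end index of run i, mirroring the REE children layout."""
    run_ends, run_values = [], []
    for i, v in enumerate(values):
        if not run_values or v != run_values[-1]:
            run_values.append(v)   # start a new run
            run_ends.append(i + 1)
        else:
            run_ends[-1] = i + 1   # extend the current run
    return run_ends, run_values

def run_end_decode(run_ends, run_values):
    """Expand (run_ends, run_values) back to the logical sequence."""
    out, start = [], 0
    for end, v in zip(run_ends, run_values):
        out.extend([v] * (end - start))
        start = end
    return out

ends, vals = run_end_encode(["a", "a", "a", "b", "c", "c"])
print(ends, vals)                  # [3, 4, 6] ['a', 'b', 'c']
print(run_end_decode(ends, vals))  # ['a', 'a', 'a', 'b', 'c', 'c']
```

In the C Data Interface, the "+r" parent carries the two children (run ends and values) as described in the linked PR; the sketch above only shows the logical encoding.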


Re: [VOTE] Apache Arrow ADBC (API) 1.1.0

2023-08-15 Thread Dewey Dunnington
+1! Thank you for your work on this!

On Mon, Aug 14, 2023 at 11:00 PM Sutou Kouhei  wrote:
>
> +1
>
> In <2a98c33e-60c1-4df9-80db-350939eb8...@app.fastmail.com>
>   "[VOTE] Apache Arrow ADBC (API) 1.1.0" on Mon, 14 Aug 2023 13:39:46 -0400,
>   "David Li"  wrote:
>
> > Hello,
> >
> > We have been discussing revisions [1] to the ADBC APIs, which we formerly 
> > decided to treat as a specification [2]. These revisions clean up various 
> > missing features (e.g. cancellation, error metadata) and better position 
> > ADBC to help different data systems interoperate (e.g. by exposing more 
> > metadata, like table/column statistics).
> >
> > For details, see the PR at [3]. (The main file to read through is adbc.h.)
> >
> > I would like to propose that the Arrow project adopt this RFC, along with 
> > the linked PR, as version 1.1.0 of the ADBC API standard.
> >
> > Please vote to adopt the specification as described above. This is not a 
> > vote to release any packages; the first package release to support version 
> > 1.1.0 of the APIs will be 0.7.0 of the packages. (So I will not merge the 
> > linked PR until after we release ADBC 0.6.0.)
> >
> > This vote will be open for at least 72 hours.
> >
> > [ ] +1 Adopt the ADBC 1.1.0 specification
> > [ ]  0
> > [ ] -1 Do not adopt the specification because...
> >
> > Thanks to Sutou Kouhei, Matt Topol, Dewey Dunnington, Antoine Pitrou, Will 
> > Ayd, and Will Jones for feedback on the design and various work-in-progress 
> > PRs.
> >
> > [1]: https://github.com/apache/arrow-adbc/milestone/3
> > [2]: https://lists.apache.org/thread/s8m4l9hccfh5kqvvd2x3gxn3ry0w1ryo
> > [3]: https://github.com/apache/arrow-adbc/pull/971
> >
> > Thank you,
> > David


Re: [ANNOUNCE] New Arrow committer: Kevin Gurney

2023-07-04 Thread Dewey Dunnington
Congrats!

On Tue, Jul 4, 2023 at 2:08 PM Matt Topol  wrote:
>
> Welcome!
>
> On Tue, Jul 4, 2023, 11:06 AM Joris Van den Bossche <
> jorisvandenboss...@gmail.com> wrote:
>
> > Congrats Kevin!
> >
> > On Tue, 4 Jul 2023 at 13:47, David Li  wrote:
> > >
> > > Welcome Kevin!
> > >
> > > On Tue, Jul 4, 2023, at 05:55, Raúl Cumplido wrote:
> > > > Congratulations Kevin!!!
> > > >
> > > > El mar, 4 jul 2023 a las 3:32, Weston Pace ()
> > escribió:
> > > >>
> > > >> Congratulations Kevin!
> > > >>
> > > >> On Mon, Jul 3, 2023 at 5:18 PM Sutou Kouhei 
> > wrote:
> > > >>
> > > >> > On behalf of the Arrow PMC, I'm happy to announce that Kevin Gurney
> > > >> > has accepted an invitation to become a committer on Apache
> > > >> > Arrow. Welcome, and thank you for your contributions!
> > > >> >
> > > >> > --
> > > >> > kou
> > > >> >
> >


Re: Do we need CODEOWNERS ?

2023-07-04 Thread Dewey Dunnington
Just a note that for me, the main problem is that I get automatic
review requests for PRs that have nothing to do with R (I think this
happens when a rebase occurs that contained an R commit). Because that
happens a lot, it means I miss actual review requests and sometimes
mentions because they blend in. I think CODEOWNERS results in me
reviewing more PRs than if I had to set up some kind of custom
notification filter but I agree that it's not perfect.

Cheers,

-dewey

On Tue, Jul 4, 2023 at 10:04 AM Antoine Pitrou  wrote:
>
>
> Hello,
>
> Some time ago we added a `.github/CODEOWNERS` file in the main Arrow
> repo. The idea is that, when specific files or directories are touched
> by a PR, specific people are asked for review.
>
> Unfortunately, it seems that, most of the time, this produces the
> following effects:
>
> 1) the people who are automatically queried for review don't show up
> (perhaps they simply ignore those automatic notifications)
> 2) when several people are assigned for review, each designated reviewer
> seems to hope that the other ones will be doing the work, instead of
> doing it themselves
> 3) contributors expect those people to show up and are therefore
> bewildered when nobody comes to review their PR
>
> Do we want to keep CODEOWNERS? If we still think it can be beneficial,
> we should institute a policy where people who are listed in that file
> promise to respond to review requests: 1) either by doing a review 2) or
> by de-assigning themselves, and if possible pinging another core developer.
>
> What do you think?
>
> Regards
>
> Antoine.


Re: [VOTE] Release Apache Arrow ADBC 0.5.1 - RC1

2023-06-23 Thread Dewey Dunnington
+1!

I ran USE_CONDA=1 TEST_APT=0 TEST_YUM=0 ./verify-release-candidate.sh
0.5.1 1 on MacOS M1.

On Fri, Jun 23, 2023 at 8:50 PM David Li  wrote:
>
> My vote: +1 (Ubuntu Linux 20.04/x86_64; macOS 13.4/AArch64)
>
> On Fri, Jun 23, 2023, at 17:51, Matt Topol wrote:
> > +1 tested on Pop!_Os 22.04 with go 1.19
> >
> > On Fri, Jun 23, 2023, 4:52 PM Sutou Kouhei  wrote:
> >
> >> +1
> >>
> >> I ran the following on Debian GNU/Linux sid:
> >>
> >>   JAVA_HOME=/usr/lib/jvm/default-java \
> >> dev/release/verify-release-candidate.sh 0.5.1 1
> >>
> >> with:
> >>
> >>   * Python 3.11.4
> >>   * g++ (Debian 12.3.0-4) 12.3.0
> >>   * go version go1.20.5 linux/amd64
> >>   * openjdk version "17.0.7" 2023-04-18
> >>   * ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux-gnu]
> >>   * R version 4.3.1 (2023-06-16) -- "Beagle Scouts"
> >>
> >> Thanks,
> >> --
> >> kou
> >>
> >> In <321c4e07-60a1-402d-9574-8437b462e...@app.fastmail.com>
> >>   "[VOTE] Release Apache Arrow ADBC 0.5.1 - RC1" on Thu, 22 Jun 2023
> >> 22:08:56 -0400,
> >>   "David Li"  wrote:
> >>
> >> > (I originally sent this with the wrong email, but it appears to have
> >> been swallowed. Apologies if this ends up being a duplicate.)
> >> >
> >> > I would like to propose the following release candidate (RC1) of Apache
> >> Arrow ADBC version 0.5.1. This is a release consisting of 8 resolved GitHub
> >> issues [1]. The main motivation is to release a fix in the Snowflake
> >> driver, as mentioned in an earlier thread.
> >> >
> >> > This release candidate is based on commit:
> >> 01c2f1eb281e8fb003f2d32096a6b0fe336128a9 [2]
> >> > (Note I had to manually patch one script; this will be resolved in
> >> future releases.)
> >> >
> >> > The source release rc1 is hosted at [3].
> >> > The binary artifacts are hosted at [4][5][6][7][8].
> >> > The changelog is located at [9].
> >> >
> >> > Please download, verify checksums and signatures, run the unit tests,
> >> and vote on the release. See [10] for how to validate a release candidate.
> >> >
> >> > See also a verification result on GitHub Actions [11].
> >> >
> >> > The vote will be open for at least 72 hours.
> >> >
> >> > [ ] +1 Release this as Apache Arrow ADBC 0.5.1
> >> > [ ] +0
> >> > [ ] -1 Do not release this as Apache Arrow ADBC 0.5.1 because...
> >> >
> >> > Note: to verify APT/YUM packages on macOS/AArch64, you must `export
> >> DOCKER_DEFAULT_ARCHITECTURE=linux/amd64`. (Or skip this step by `export
> >> TEST_APT=0 TEST_YUM=0`.)
> >> >
> >> > [1]:
> >> https://github.com/apache/arrow-adbc/issues?q=is%3Aissue+milestone%3A%22ADBC+Libraries+0.5.1%22+is%3Aclosed
> >> > [2]:
> >> https://github.com/apache/arrow-adbc/commit/01c2f1eb281e8fb003f2d32096a6b0fe336128a9
> >> > [3]:
> >> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-adbc-0.5.1-rc1/
> >> > [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> >> > [5]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> >> > [6]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> >> > [7]:
> >> https://repository.apache.org/content/repositories/staging/org/apache/arrow/adbc/
> >> > [8]:
> >> https://github.com/apache/arrow-adbc/releases/tag/apache-arrow-adbc-0.5.1-rc1
> >> > [9]:
> >> https://github.com/apache/arrow-adbc/blob/apache-arrow-adbc-0.5.1-rc1/CHANGELOG.md
> >> > [10]:
> >> https://arrow.apache.org/adbc/main/development/releasing.html#how-to-verify-release-candidates
> >> > [11]: https://github.com/apache/arrow-adbc/actions/runs/5351160439
> >>


Re: [RESULT][VOTE] Release Apache Arrow nanoarrow 0.2.0 - RC1

2023-06-23 Thread Dewey Dunnington
Ok! Post-release tasks are complete. Thank you all!

[x] Closed GitHub milestone
[x] Added release to Apache Reporter System
[x] Uploaded artifacts to Subversion
[x] Created GitHub release
[x] Submit R package to CRAN
[x] Sent announcement to annou...@apache.org
[x] Release blog post [1]
[x] Removed old artifacts from SVN
[x] Bumped versions on main

[1] https://arrow.apache.org/blog/2023/06/22/nanoarrow-0.120-release/

On Fri, Jun 23, 2023 at 9:28 AM Dewey Dunnington  wrote:
>
> Thanks for offering! Sorry for being slow to update the thread...David
> Li ran the upload script yesterday.
>
> -dewey
>
> On Thu, Jun 22, 2023 at 11:59 PM Sutou Kouhei  wrote:
> >
> > Hi,
> >
> > > I believe the upload step requires a PMC member to run the script
> >
> > I can do it. Can I run
> > https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/post-01-upload.sh
> > ?
> >
> >
> > Thanks,
> > --
> > kou
> >
> > In 
> >   "[RESULT][VOTE] Release Apache Arrow nanoarrow 0.2.0 - RC1" on Thu, 22 
> > Jun 2023 16:05:50 -0300,
> >   Dewey Dunnington  wrote:
> >
> > > Thank you everybody for verifying and voting! With 3 binding +1s and 3
> > > non-binding +1s, the vote passes! I have opened a PR to improve the
> > > verification instructions (particularly on conda where most problems
> > > occurred) [1].
> > >
> > > Apache Arrow nanoarrow 0.2.0 has the following post-release tasks. I
> > > believe the upload step requires a PMC member to run the script but
> > > the rest I'm happy to take care of!
> > >
> > > [x] Closed GitHub milestone
> > > [ ] Added release to Apache Reporter System
> > > [ ] Uploaded artifacts to Subversion
> > > [ ] Created GitHub release
> > > [ ] Submit R package to CRAN
> > > [ ] Sent announcement to annou...@apache.org
> > > [ ] Release blog post [2]
> > > [ ] Removed old artifacts from SVN
> > > [ ] Bumped versions on main
> > >
> > > [1] https://github.com/apache/arrow-nanoarrow/pull/243
> > > [2] https://github.com/apache/arrow-site/pull/364


Re: [ANNOUNCE] New Arrow PMC member: Dewey Dunnington

2023-06-23 Thread Dewey Dunnington
Thank you everybody for the welcome! I'm honoured!

On Fri, Jun 23, 2023 at 2:41 PM David Li  wrote:
>
> Welcome Dewey!
>
> On Fri, Jun 23, 2023, at 13:37, Weston Pace wrote:
> > Congrats Dewey!
> >
> > On Fri, Jun 23, 2023 at 9:00 AM Antoine Pitrou  wrote:
> >
> >>
> >> Welcome to the PMC Dewey!
> >>
> >>
> >> Le 23/06/2023 à 16:59, Joris Van den Bossche a écrit :
> >> > Congrats Dewey!
> >> >
> >> > On Fri, 23 Jun 2023 at 16:54, Jacob Wujciak-Jens
> >> >  wrote:
> >> >>
> >> >> Well deserved! Congratulations Dewey!
> >> >>
> >> >> Ian Cook  schrieb am Fr., 23. Juni 2023, 16:32:
> >> >>
> >> >>> Congratulations Dewey!
> >> >>>
> >> >>> On Fri, Jun 23, 2023 at 10:03 AM Matt Topol 
> >> >>> wrote:
> >> >>>>
> >> >>>> Congrats Dewey!!
> >> >>>>
> >> >>>> On Fri, Jun 23, 2023, 9:35 AM Dane Pitkin
> >> 
> >> >>>> wrote:
> >> >>>>
> >> >>>>> Congrats Dewey!
> >> >>>>>
> >> >>>>> On Fri, Jun 23, 2023 at 9:15 AM Nic Crane 
> >> wrote:
> >> >>>>>
> >> >>>>>> Well-deserved Dewey, congratulations!
> >> >>>>>>
> >> >>>>>> On Fri, 23 Jun 2023 at 11:53, Vibhatha Abeykoon  >> >
> >> >>>>>> wrote:
> >> >>>>>>
> >> >>>>>>> Congratulations Dewey!
> >> >>>>>>>
> >> >>>>>>> On Fri, Jun 23, 2023 at 4:16 PM Alenka Frim <
> >> >>> ale...@voltrondata.com
> >> >>>>>>> .invalid>
> >> >>>>>>> wrote:
> >> >>>>>>>
> >> >>>>>>>> Congratulations Dewey!! 
> >> >>>>>>>>
> >> >>>>>>>> On Fri, Jun 23, 2023 at 12:10 PM Raúl Cumplido <
> >> >>>>> raulcumpl...@gmail.com
> >> >>>>>>>
> >> >>>>>>>> wrote:
> >> >>>>>>>>
> >> >>>>>>>>> Congratulations Dewey!
> >> >>>>>>>>>
> >> >>>>>>>>> El vie, 23 jun 2023, 11:55, Andrew Lamb 
> >> >>>>>>> escribió:
> >> >>>>>>>>>
> >> >>>>>>>>>> The Project Management Committee (PMC) for Apache Arrow has
> >> >>>>> invited
> >> >>>>>>>>>> Dewey Dunnington (paleolimbot) to become a PMC member and we
> >> >>> are
> >> >>>>>>>> pleased
> >> >>>>>>>>> to
> >> >>>>>>>>>> announce
> >> >>>>>>>>>> that Dewey Dunnington has accepted.
> >> >>>>>>>>>>
> >> >>>>>>>>>> Congratulations and welcome!
> >> >>>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>
> >> >>>>>>
> >> >>>>>
> >> >>>
> >>


[ANNOUNCE] Apache Arrow nanoarrow 0.2.0 Released

2023-06-23 Thread Dewey Dunnington
The Apache Arrow community is pleased to announce the 0.2.0 release of
Apache Arrow nanoarrow. This release covers 19 resolved issues
from 6 contributors[1].

The release is available now from [2].

Release notes are available at:
https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.2.0/CHANGELOG.md

What is Apache Arrow?
---------------------
Apache Arrow is a columnar in-memory analytics layer designed to
accelerate big data. It houses a set of canonical in-memory
representations of flat and hierarchical data along with multiple
language-bindings for structure manipulation. It also provides
low-overhead streaming and batch messaging, zero-copy interprocess
communication (IPC), and vectorized in-memory analytics libraries.
Languages currently supported include C, C++, C#, Go, Java,
JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust.

What is Apache Arrow nanoarrow?
-------------------------------
Apache Arrow nanoarrow is a small C library for building and
interpreting Arrow C Data interface structures with bindings for users
of the R programming language. The vision of nanoarrow is that it
should be trivial for a library or application to implement an
Arrow-based interface. The library provides helpers to create types,
schemas, and metadata, an API for building arrays element-wise,
and an API to extract elements element-wise from an array. For a more
detailed description of the features nanoarrow provides and motivation
for its development, see [3].

Please report any feedback to the mailing lists ([4], [5]).

Regards,
The Apache Arrow Community

[1]: 
https://github.com/apache/arrow-nanoarrow/issues?q=is%3Aissue+milestone%3A%22nanoarrow+0.2.0%22+is%3Aclosed
[2]: https://www.apache.org/dyn/closer.cgi/arrow/apache-arrow-nanoarrow-0.2.0
[3]: https://github.com/apache/arrow-nanoarrow
[4]: https://lists.apache.org/list.html?u...@arrow.apache.org
[5]: https://lists.apache.org/list.html?dev@arrow.apache.org


Re: [RESULT][VOTE] Release Apache Arrow nanoarrow 0.2.0 - RC1

2023-06-23 Thread Dewey Dunnington
Thanks for offering! Sorry for being slow to update the thread...David
Li ran the upload script yesterday.

-dewey

On Thu, Jun 22, 2023 at 11:59 PM Sutou Kouhei  wrote:
>
> Hi,
>
> > I believe the upload step requires a PMC member to run the script
>
> I can do it. Can I run
> https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/post-01-upload.sh
> ?
>
>
> Thanks,
> --
> kou
>
> In 
>   "[RESULT][VOTE] Release Apache Arrow nanoarrow 0.2.0 - RC1" on Thu, 22 Jun 
> 2023 16:05:50 -0300,
>   Dewey Dunnington  wrote:
>
> > Thank you everybody for verifying and voting! With 3 binding +1s and 3
> > non-binding +1s, the vote passes! I have opened a PR to improve the
> > verification instructions (particularly on conda where most problems
> > occurred) [1].
> >
> > Apache Arrow nanoarrow 0.2.0 has the following post-release tasks. I
> > believe the upload step requires a PMC member to run the script but
> > the rest I'm happy to take care of!
> >
> > [x] Closed GitHub milestone
> > [ ] Added release to Apache Reporter System
> > [ ] Uploaded artifacts to Subversion
> > [ ] Created GitHub release
> > [ ] Submit R package to CRAN
> > [ ] Sent announcement to annou...@apache.org
> > [ ] Release blog post [2]
> > [ ] Removed old artifacts from SVN
> > [ ] Bumped versions on main
> >
> > [1] https://github.com/apache/arrow-nanoarrow/pull/243
> > [2] https://github.com/apache/arrow-site/pull/364


[RESULT][VOTE] Release Apache Arrow nanoarrow 0.2.0 - RC1

2023-06-22 Thread Dewey Dunnington
Thank you everybody for verifying and voting! With 3 binding +1s and 3
non-binding +1s, the vote passes! I have opened a PR to improve the
verification instructions (particularly on conda where most problems
occurred) [1].

Apache Arrow nanoarrow 0.2.0 has the following post-release tasks. I
believe the upload step requires a PMC member to run the script but
the rest I'm happy to take care of!

[x] Closed GitHub milestone
[ ] Added release to Apache Reporter System
[ ] Uploaded artifacts to Subversion
[ ] Created GitHub release
[ ] Submit R package to CRAN
[ ] Sent announcement to annou...@apache.org
[ ] Release blog post [2]
[ ] Removed old artifacts from SVN
[ ] Bumped versions on main

[1] https://github.com/apache/arrow-nanoarrow/pull/243
[2] https://github.com/apache/arrow-site/pull/364


Re: [VOTE] Release Apache Arrow nanoarrow 0.2.0 - RC1

2023-06-20 Thread Dewey Dunnington
Thanks for verifying!

I don't *think* there is anything non-standard about the
`find_package(Arrow)` / `target_link_libraries(..., arrow_shared)`
sequence used to link the tests (although clearly they aren't working
as intended!). You can pass extra arguments to CMake to help it find
the right Arrow using export NANOARROW_CMAKE_OPTIONS="-DArrow_DIR=..."
but here it sounds like it's finding the .so but failing to link the
dependencies. There are also instructions on creating a conda
environment with all required dependencies at [1].

[1] 
https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/README.md#conda-linux-and-macos
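
For concreteness, the NANOARROW_CMAKE_OPTIONS hint might look like the following before invoking the verification script (the install path here is hypothetical; Arrow_DIR is the standard CMake package-location variable and should point at the directory containing ArrowConfig.cmake):

```shell
# Hypothetical prefix: point the nanoarrow verification build at a
# specific Arrow C++ install instead of whatever CMake finds first.
export NANOARROW_CMAKE_OPTIONS="-DArrow_DIR=$HOME/arrow-dist/lib/cmake/Arrow"

# Then run the verification script as usual, e.g.:
#   dev/release/verify-release-candidate.sh 0.2.0 1
echo "$NANOARROW_CMAKE_OPTIONS"
```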

On Tue, Jun 20, 2023 at 9:32 AM Antoine Pitrou  wrote:
>
>
> Ok, now running from the right repo :-), I get linker errors against
> Arrow C++ dependencies:
>
> [ 44%] Linking CXX executable utils_test
> /home/antoine/mambaforge/envs/pyarrow/bin/../lib/gcc/x86_64-conda-linux-gnu/12.2.0/../../../../x86_64-conda-linux-gnu/bin/ld:
> warning: libcrypto.so.3, needed by
> /home/antoine/mambaforge/envs/pyarrow/lib/libarrow.so.1300.0.0, not
> found (try using -rpath or -rpath-link)
>
> (etc.)
>
> https://gist.github.com/pitrou/3e6e9621e3b6cc2aff932eafdafef82b
>
> Note that Arrow C++ is compiled by myself inside a conda environment
> (which is activated when running the verification script).
>
> Regards
>
> Antoine.
>
>
>
> > On 20/06/2023 at 12:38, Raúl Cumplido wrote:
> > +1 (non-binding)
> >
> > I've run:
> > ./verify-release-candidate.sh 0.2.0 1
> >
> > on Ubuntu 22.04 with conda:
> > * arrow-cpp 12.0.0
> > * gcc (conda-forge gcc 11.4.0-0) 11.4.0
> > * r-base  4.2.3
> >
> > Thanks,
> > Raúl
> >
> > On Tue, Jun 20, 2023 at 1:55, Sutou Kouhei () wrote:
> >>
> >> +1
> >>
> >> I ran the following command line on Debian GNU/Linux sid:
> >>
> >>CMAKE_PREFIX_PATH=/tmp/local \
> >>  dev/release/verify-release-candidate.sh 0.2.0 1
> >>
> >> with:
> >>
> >>* Apache Arrow C++ main
> >>* gcc (Debian 12.2.0-14) 12.2.0
> >>* R version 4.3.0 (2023-04-21) -- "Already Tomorrow"
> >>
> >>
> >> Thanks,
> >> --
> >> kou
> >>
> >> In 
> >>"[VOTE] Release Apache Arrow nanoarrow 0.2.0 - RC1" on Mon, 19 Jun 2023 
> >> 15:58:45 -0300,
> >>Dewey Dunnington  wrote:
> >>
> >>> Hello,
> >>>
> >>> I would like to propose the following release candidate (RC1) of
> >>> Apache Arrow nanoarrow version 0.2.0. This release consists of 17
> >>> resolved GitHub issues [1].
> >>>
> >>> This release candidate is based on commit:
> >>> f71063605e288d9a8dd73cfdd9578773519b6743 [2]
> >>>
> >>> The source release rc1 is hosted at [3].
> >>> The changelog is located at [4].
> >>> The draft release post is located at [5].
> >>>
> >>> Please download, verify checksums and signatures, run the unit tests,
> >>> and vote on the release. See [6] for how to validate a release
> >>> candidate.
> >>>
> >>> The vote will be open for at least 72 hours.
> >>>
> >>> [ ] +1 Release this as Apache Arrow nanoarrow 0.2.0
> >>> [ ] +0
> >>> [ ] -1 Do not release this as Apache Arrow nanoarrow 0.2.0 because...
> >>>
> >>> [0] https://github.com/apache/arrow-nanoarrow
> >>> [1] https://github.com/apache/arrow-nanoarrow/milestone/2?closed=1
> >>> [2] 
> >>> https://github.com/apache/arrow-nanoarrow/tree/apache-arrow-nanoarrow-0.2.0-rc1
> >>> [3] 
> >>> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-nanoarrow-0.2.0-rc1/
> >>> [4] 
> >>> https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.2.0-rc1/CHANGELOG.md
> >>> [5] https://github.com/apache/arrow-site/pull/364
> >>> [6] 
> >>> https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/README.md


Re: [VOTE] Release Apache Arrow ADBC 0.5.0 - RC0

2023-06-19 Thread Dewey Dunnington
+1 (non-binding)

I ran the following on MacOS M1:

USE_CONDA=1 TEST_APT=0 TEST_YUM=0 ./verify-release-candidate.sh 0.5.0 0

On Mon, Jun 19, 2023 at 12:12 PM Jean-Baptiste Onofré  wrote:
>
> +1 (non binding)
>
> Regards
> JB
>
> On Fri, Jun 16, 2023 at 2:06 AM David Li  wrote:
> >
> > Hello,
> >
> > I would like to propose the following release candidate (RC0) of Apache 
> > Arrow ADBC version 0.5.0. This is a release consisting of 36 resolved 
> > GitHub issues [1].
> >
> > This release candidate is based on commit: 
> > ac0e0ef8bd83787f65e53d421fce6ad490d9a37d [2]
> >
> > The source release rc0 is hosted at [3].
> > The binary artifacts are hosted at [4][5][6][7][8].
> > The changelog is located at [9].
> >
> > Please download, verify checksums and signatures, run the unit tests, and 
> > vote on the release. See [10] for how to validate a release candidate.
> >
> > See also a verification result on GitHub Actions [11].
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow ADBC 0.5.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow ADBC 0.5.0 because...
> >
> > Note: to verify APT/YUM packages on macOS/AArch64, you must `export 
> > DOCKER_DEFAULT_ARCHITECTURE=linux/amd64`. (Or skip this step by `export 
> > TEST_APT=0 TEST_YUM=0`.)
> >
> > [1]: 
> > https://github.com/apache/arrow-adbc/issues?q=is%3Aissue+milestone%3A%22ADBC+Libraries+0.5.0%22+is%3Aclosed
> > [2]: 
> > https://github.com/apache/arrow-adbc/commit/ac0e0ef8bd83787f65e53d421fce6ad490d9a37d
> > [3]: 
> > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-adbc-0.5.0-rc0/
> > [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> > [5]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> > [6]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> > [7]: 
> > https://repository.apache.org/content/repositories/staging/org/apache/arrow/adbc/
> > [8]: 
> > https://github.com/apache/arrow-adbc/releases/tag/apache-arrow-adbc-0.5.0-rc0
> > [9]: 
> > https://github.com/apache/arrow-adbc/blob/apache-arrow-adbc-0.5.0-rc0/CHANGELOG.md
> > [10]: 
> > https://arrow.apache.org/adbc/main/development/releasing.html#how-to-verify-release-candidates
> > [11]: https://github.com/apache/arrow-adbc/actions/runs/5284608862


Re: [VOTE] Release Apache Arrow nanoarrow 0.2.0 - RC1

2023-06-19 Thread Dewey Dunnington
My vote is +1 (non-binding). Verified on MacOS M1 (both Homebrew and Conda).

On Mon, Jun 19, 2023 at 3:58 PM Dewey Dunnington  wrote:
>
> Hello,
>
> I would like to propose the following release candidate (RC1) of
> Apache Arrow nanoarrow version 0.2.0. This release consists of 17
> resolved GitHub issues [1].
>
> This release candidate is based on commit:
> f71063605e288d9a8dd73cfdd9578773519b6743 [2]
>
> The source release rc1 is hosted at [3].
> The changelog is located at [4].
> The draft release post is located at [5].
>
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [6] for how to validate a release
> candidate.
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Release this as Apache Arrow nanoarrow 0.2.0
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow nanoarrow 0.2.0 because...
>
> [0] https://github.com/apache/arrow-nanoarrow
> [1] https://github.com/apache/arrow-nanoarrow/milestone/2?closed=1
> [2] 
> https://github.com/apache/arrow-nanoarrow/tree/apache-arrow-nanoarrow-0.2.0-rc1
> [3] 
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-nanoarrow-0.2.0-rc1/
> [4] 
> https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.2.0-rc1/CHANGELOG.md
> [5] https://github.com/apache/arrow-site/pull/364
> [6] https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/README.md


Re: [VOTE] Release Apache Arrow nanoarrow 0.2.0 - RC0

2023-06-19 Thread Dewey Dunnington
Hi all,

Thank you all for verifying!

The issue uncovered during verification on MacOS M1/Conda (Thanks!)
has been fixed [1] and a new release candidate has been issued [2].

Cheers,

-dewey

[1] https://github.com/apache/arrow-nanoarrow/pull/242
[2] https://lists.apache.org/thread/027xxw9vfv7dnf6lhhyzqofoyqnnbnf2

On Sun, Jun 18, 2023 at 6:05 PM Sutou Kouhei  wrote:
>
> +1
>
> I ran the following command line on Debian GNU/Linux sid:
>
>   CMAKE_PREFIX_PATH=/tmp/local \
> dev/release/verify-release-candidate.sh 0.2.0 0
>
> with:
>
>   * Apache Arrow C++ main
>   * gcc (Debian 12.2.0-14) 12.2.0
>   * R version 4.3.0 (2023-04-21) -- "Already Tomorrow"
>
>
> Thanks,
> --
> kou
>
> In 
>   "[VOTE] Release Apache Arrow nanoarrow 0.2.0 - RC0" on Fri, 16 Jun 2023 
> 17:15:41 -0300,
>   Dewey Dunnington  wrote:
>
> > Hello,
> >
> > I would like to propose the following release candidate (RC0) of
> > Apache Arrow nanoarrow version 0.2.0. This release consists of 17
> > resolved GitHub issues [1].
> >
> > This release candidate is based on commit:
> > a7b824de6cb99ce458e1a5cd311d69588ceb0570 [2]
> >
> > The source release rc0 is hosted at [3].
> > The changelog is located at [4].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. See [5] for how to validate a release
> > candidate.
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow nanoarrow 0.2.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow nanoarrow 0.2.0 because...
> >
> > [0] https://github.com/apache/arrow-nanoarrow
> > [1] https://github.com/apache/arrow-nanoarrow/milestone/2?closed=1
> > [2] 
> > https://github.com/apache/arrow-nanoarrow/tree/apache-arrow-nanoarrow-0.2.0-rc0
> > [3] 
> > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-nanoarrow-0.2.0-rc0/
> > [4] 
> > https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.2.0-rc0/CHANGELOG.md
> > [5] 
> > https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/README.md


[VOTE] Release Apache Arrow nanoarrow 0.2.0 - RC1

2023-06-19 Thread Dewey Dunnington
Hello,

I would like to propose the following release candidate (RC1) of
Apache Arrow nanoarrow version 0.2.0. This release consists of 17
resolved GitHub issues [1].

This release candidate is based on commit:
f71063605e288d9a8dd73cfdd9578773519b6743 [2]

The source release rc1 is hosted at [3].
The changelog is located at [4].
The draft release post is located at [5].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. See [6] for how to validate a release
candidate.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow nanoarrow 0.2.0
[ ] +0
[ ] -1 Do not release this as Apache Arrow nanoarrow 0.2.0 because...

[0] https://github.com/apache/arrow-nanoarrow
[1] https://github.com/apache/arrow-nanoarrow/milestone/2?closed=1
[2] 
https://github.com/apache/arrow-nanoarrow/tree/apache-arrow-nanoarrow-0.2.0-rc1
[3] 
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-nanoarrow-0.2.0-rc1/
[4] 
https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.2.0-rc1/CHANGELOG.md
[5] https://github.com/apache/arrow-site/pull/364
[6] https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/README.md


[VOTE] Release Apache Arrow nanoarrow 0.2.0 - RC0

2023-06-16 Thread Dewey Dunnington
Hello,

I would like to propose the following release candidate (RC0) of
Apache Arrow nanoarrow version 0.2.0. This release consists of 17
resolved GitHub issues [1].

This release candidate is based on commit:
a7b824de6cb99ce458e1a5cd311d69588ceb0570 [2]

The source release rc0 is hosted at [3].
The changelog is located at [4].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. See [5] for how to validate a release
candidate.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow nanoarrow 0.2.0
[ ] +0
[ ] -1 Do not release this as Apache Arrow nanoarrow 0.2.0 because...

[0] https://github.com/apache/arrow-nanoarrow
[1] https://github.com/apache/arrow-nanoarrow/milestone/2?closed=1
[2] 
https://github.com/apache/arrow-nanoarrow/tree/apache-arrow-nanoarrow-0.2.0-rc0
[3] 
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-nanoarrow-0.2.0-rc0/
[4] 
https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.2.0-rc0/CHANGELOG.md
[5] https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/README.md


Re: [VOTE] Release Apache Arrow 12.0.1 - RC1

2023-06-12 Thread Dewey Dunnington
+1! I ran

TEST_DEFAULT=0 TEST_CPP=1
ARROW_CMAKE_OPTIONS="-DProtobuf_SOURCE=BUNDLED -DARROW_FLIGHT=OFF
-DARROW_FLIGHT_SQL=OFF"  ./verify-release-candidate.sh

...on MacOS Ventura aarch64. (Flight disabled because of protobuf issues).

On Mon, Jun 12, 2023 at 10:28 AM Joris Van den Bossche
 wrote:
>
> +1 (verified source release on Ubuntu 20.04, using conda)
>
> On Sat, 10 Jun 2023 at 22:31, Sutou Kouhei  wrote:
> >
> > +1
> >
> > I ran the followings on Debian GNU/Linux sid:
> >
> >   * TEST_DEFAULT=0 \
> >   TEST_SOURCE=1 \
> >   LANG=C \
> >   TZ=UTC \
> >   CUDAToolkit_ROOT=/usr \
> >   ARROW_CMAKE_OPTIONS="-DBoost_NO_BOOST_CMAKE=ON 
> > -Dxsimd_SOURCE=BUNDLED" \
> >   dev/release/verify-release-candidate.sh 12.0.1 1
> >
> >   * TEST_DEFAULT=0 \
> >   TEST_APT=1 \
> >   LANG=C \
> >   dev/release/verify-release-candidate.sh 12.0.1 1
> >
> >   * TEST_DEFAULT=0 \
> >   TEST_BINARY=1 \
> >   LANG=C \
> >   dev/release/verify-release-candidate.sh 12.0.1 1
> >
> >   * TEST_DEFAULT=0 \
> >   TEST_JARS=1 \
> >   LANG=C \
> >   dev/release/verify-release-candidate.sh 12.0.1 1
> >
> >   * TEST_DEFAULT=0 \
> >   TEST_PYTHON_VERSIONS=3.11 \
> >   TEST_WHEEL_PLATFORM_TAGS=manylinux_2_17_x86_64.manylinux2014_x86_64 \
> >   TEST_WHEELS=1 \
> >   LANG=C \
> >   dev/release/verify-release-candidate.sh 12.0.1 1
> >
> >   * TEST_DEFAULT=0 \
> >   TEST_YUM=1 \
> >   LANG=C \
> >   dev/release/verify-release-candidate.sh 12.0.1 1
> >
> > with:
> >
> >   * .NET SDK (6.0.408)
> >   * Python 3.11.2
> >   * gcc (Debian 12.2.0-14) 12.2.0
> >   * nvidia-cuda-dev 11.8.89~11.8.0-3
> >   * openjdk version "18.0.2-ea" 2022-07-19
> >   * ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux-gnu]
> >
> >
> > Thanks,
> > --
> > kou
> >
> > In 
> >   "[VOTE] Release Apache Arrow 12.0.1 - RC1" on Fri, 9 Jun 2023 14:32:26 
> > +0200,
> >   Raúl Cumplido  wrote:
> >
> > > Hi,
> > >
> > > I would like to propose the following release candidate (RC1) of Apache
> > > Arrow version 12.0.1. This is a release consisting of 29
> > > resolved GitHub issues[1].
> > >
> > > This release candidate is based on commit:
> > > 6af660f48472b8b45a5e01b7136b9b040b185eb1 [2]
> > >
> > > The source release rc1 is hosted at [3].
> > > The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
> > > The changelog is located at [12].
> > >
> > > Please download, verify checksums and signatures, run the unit tests,
> > > and vote on the release. See [13] for how to validate a release candidate.
> > >
> > > See also a verification result on GitHub pull request [14].
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 Release this as Apache Arrow 12.0.1
> > > [ ] +0
> > > [ ] -1 Do not release this as Apache Arrow 12.0.1 because...
> > >
> > > [1]: 
> > > https://github.com/apache/arrow/issues?q=is%3Aissue+milestone%3A12.0.1+is%3Aclosed
> > > [2]: 
> > > https://github.com/apache/arrow/tree/6af660f48472b8b45a5e01b7136b9b040b185eb1
> > > [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-12.0.1-rc1
> > > [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> > > [5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
> > > [6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
> > > [7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> > > [8]: https://apache.jfrog.io/artifactory/arrow/java-rc/12.0.1-rc1
> > > [9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/12.0.1-rc1
> > > [10]: https://apache.jfrog.io/artifactory/arrow/python-rc/12.0.1-rc1
> > > [11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> > > [12]: 
> > > https://github.com/apache/arrow/blob/6af660f48472b8b45a5e01b7136b9b040b185eb1/CHANGELOG.md
> > > [13]: 
> > > https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
> > > [14]: https://github.com/apache/arrow/pull/35967


Re: [VOTE][Format] Add experimental ArrowDeviceArray to C-Data API

2023-06-02 Thread Dewey Dunnington
I've already given my vote here, but wanted to share a
proof-of-concept C implementation (== copy an arbitrary valid
ArrowArray to given a suitable device implementation) of the proposed
spec that includes Apple Metal [1] and could include CUDA as well (I
did Metal first since Matt already worked up an example with CUDA).
The proposed structures were great to work with!

[1] https://github.com/apache/arrow-nanoarrow/pull/205

On Wed, May 31, 2023 at 2:19 PM Ian Cook  wrote:
>
> +1 (non-binding).
>
> Thanks very much Matt for all the work you did here to solicit input from
> other stakeholder communities.
>
> On Mon, May 22, 2023 at 12:02 PM Matt Topol  wrote:
>
> > Hello,
> >
> > Now that there's a rough consensus and a toy example POC[1], I would like
> > to propose an official enhancement to the Arrow C-Data API specification as
> > described in the PR[2]. The new ArrowDeviceArray/ArrowDeviceArrayStream
> > structs would be considered "experimental" and the documentation would
> > label them as such for the time being.
> >
> > Please comment, ask questions, and look at the PR and toy example POC as
> > needed.
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Add this to the C-Data API
> > [ ] +0
> > [ ] -1 Do not add this to the C-Data API because...
> >
> > Thank you very much everyone!
> > -- Matt
> >
> > [1]: https://github.com/zeroshade/arrow-non-cpu
> > [2]: https://github.com/apache/arrow/pull/34972
> >


Re: [VOTE][Format] Add experimental ArrowDeviceArray to C-Data API

2023-05-29 Thread Dewey Dunnington
+1 (non-binding)! Reading the discussion on that PR is illuminating as
to how difficult this can be...thank you!

On Fri, May 26, 2023 at 3:54 PM Benjamin Kietzman  wrote:
>
> +1, thanks for all your work on this!
>
> On Fri, May 26, 2023 at 11:09 AM Matt Topol  wrote:
>
> > That makes 1 binding and one non-binding +1, as 3 binding votes are
> > necessary I'm sending this to hopefully request more eyes here and get some
> > more votes.
> >
> > Thanks all!
> >
> > On Thu, May 25, 2023 at 11:38 AM Felipe Oliveira Carvalho <
> > felipe...@gmail.com> wrote:
> >
> > > +1 for me.
> > >
> > > The C structs are clean and leave good room for extension.
> > >
> > > --
> > > Felipe
> > >
> > > On Thu, May 25, 2023 at 12:04 PM David Li  wrote:
> > >
> > > > +1 for me.
> > > >
> > > > (Heads up: on the PR, there was some discussion since the last email
> > and
> > > > the meaning of 'experimental' was clarified.)
> > > >
> > > > On Tue, May 23, 2023, at 16:56, Matt Topol wrote:
> > > > > To clarify:
> > > > >
> > > > >> Depends on what we're voting on?
> > > > >
> > > > > Voting on adopting the spec and adding it (while still leaving it
> > > labeled
> > > > > as "experimental" in the docs) to the format.
> > > > >
> > > > > --Matt
> > > > >
> > > > > On Tue, May 23, 2023 at 3:29 PM Matthew Topol
> > > > 
> > > > > wrote:
> > > > >
> > > > >> @Antoine: I've updated the PR with a prose description of the C
> > Device
> > > > Data
> > > > >> interface. Sorry for the lack of that in the first place.
> > > > >>
> > > > >> --Matt
> > > > >>
> > > > >> On Tue, May 23, 2023 at 10:34 AM Antoine Pitrou wrote:
> > > > >>
> > > > >> >
> > > > >> > Also, I forgot to say, but thanks a lot for doing this! We can
> > hope
> > > > this
> > > > >> > will drastically improve interoperability between non-CPU data
> > > > >> > frameworks and libraries.
> > > > >> >
> > > > >> > Regards
> > > > >> >
> > > > >> > Antoine.
> > > > >> >
> > > > >> >
> > > > >> > > On 23/05/2023 at 16:32, Antoine Pitrou wrote:
> > > > >> > >
> > > > >> > > Depends on what we're voting on?
> > > > >> > >
> > > > >> > > The C declarations seem fine to me (I'm a bit lukewarm on the
> > > > reserved
> > > > >> > > bits, but I understand the motivation), however I've posted
> > > > comments as
> > > > >> > > to how to document the interface. The current PR entirely lacks
> > a
> > > > prose
> > > > >> > > description of the C Device Data Interface.
> > > > >> > >
> > > > >> > > Regards
> > > > >> > >
> > > > >> > > Antoine.
> > > > >> > >
> > > > >> > >
> > > > >> > >> On 22/05/2023 at 18:02, Matt Topol wrote:
> > > > >> > >> Hello,
> > > > >> > >>
> > > > >> > >> Now that there's a rough consensus and a toy example POC[1], I
> > > > would
> > > > >> > like
> > > > >> > >> to propose an official enhancement to the Arrow C-Data API
> > > > >> > specification as
> > > > >> > >> described in the PR[2]. The new
> > > > >> ArrowDeviceArray/ArrowDeviceArrayStream
> > > > >> > >> structs would be considered "experimental" and the
> > documentation
> > > > would
> > > > >> > >> label them as such for the time being.
> > > > >> > >>
> > > > >> > >> Please comment, ask questions, and look at the PR and toy
> > example
> > > > POC
> > > > >> as
> > > > >> > >> needed.
> > > > >> > >>
> > > > >> > >> The vote will be open for at least 72 hours.
> > > > >> > >>
> > > > >> > >> [ ] +1 Add this to the C-Data API
> > > > >> > >> [ ] +0
> > > > >> > >> [ ] -1 Do not add this to the C-Data API because...
> > > > >> > >>
> > > > >> > >> Thank you very much everyone!
> > > > >> > >> -- Matt
> > > > >> > >>
> > > > >> > >> [1]: https://github.com/zeroshade/arrow-non-cpu
> > > > >> > >> [2]: https://github.com/apache/arrow/pull/34972
> > > > >> > >>
> > > > >> >
> > > > >>
> > > >
> > >
> >


Re: [DISCUSS][Format] Draft implementation of string view array format

2023-05-16 Thread Dewey Dunnington
Very cool!

In addition to performance mentioned above, I could see this being
useful for the R bindings - we already have a global string pool and a
mechanism for keeping a vector of them alive.

I don't see the C Data interface in the PR although I may have missed
it - is that a part of the proposal? It seems like it would be
possible to use raw pointers as long as they can be guaranteed to be
valid until the release callback is called?

On Tue, May 16, 2023 at 8:43 PM Jacob Wujciak
 wrote:
>
> Hello Everyone,
> I think keeping interoperability with the large ecosystem is a very
> important goal for arrow so I am overall in favor of this proposal!
>
> You mention benchmarks multiple times, are these results published
> somewhere?
>
> Thanks
>
> On Tue, May 16, 2023 at 11:39 PM Benjamin Kietzman 
> wrote:
>
> > Hello all,
> >
> > As previously discussed on this list [1], an UmbraDB/DuckDB/Velox
> > compatible
> > "string view" type could bring several performance benefits to access and
> > authoring of string data in the arrow format [2]. Additionally better
> > interoperability with engines already using this format could be
> > established.
> >
> > PR #0 [3] adds Utf8View and BinaryView types to the C++ implementation and
> > to
> > the IPC format. For the purposes of IPC raw pointers are not used. Instead,
> > each view contains a pair of 32 bit unsigned integers which encode the
> > index of
> > a character buffer (string view arrays may consist of a variable number of
> > such buffers) and the offset of a view's data within that buffer
> > respectively.
> > Benefits of this substitution include:
> > - This makes explicit the guarantee that lifetime of all character data is
> > equal
> >   to that of the array which views it, which is critical for confident
> >   consumption across an interface boundary.
> > - As with other types in the arrow format, such arrays are serializable and
> >   venue agnostic; directly usable in shared memory without modification.
> > - Indices and offsets are easily validated.
> >
> > Accessing the data requires some trivial pointer arithmetic, but in
> > benchmarking
> > this had negligible impact on sequential access and only minor impact on
> > random
> > access.
> >
> > In the C++ implementation, raw pointer string views are supported as an
> > extended
> > case of the Utf8View type: `utf8_view(/*has_raw_pointers=*/true)`.
> > Branching on
> > this access pattern bit at the data type level has negligible impact on
> > performance since the branch resides outside any hot loops. Utility
> > functions
> > are provided for efficient (potentially in-place) conversion between raw
> > pointer
> > and index offset views. For example, the C++ implementation could zero copy
> > a raw pointer array from Velox, filter it, then convert to index/offset for
> > serialization. Other implementations may choose to accommodate or eschew
> > raw
> > pointer views as their communities direct.
> >
> > Where desirous in a rigorously controlled context this still enables
> > construction
> > and safe consumption of string view arrays which reference memory not
> > directly bound to the lifetime of the array. I'm not sure when or if we
> > would
> > find it useful to have arrays like this; I do not introduce any in [3]. I
> > mention
> > this possibility to highlight that if benchmarking demonstrates that such
> > an
> > approach brings a significant performance benefit to some operation, the
> > only
> > barrier to its adoption would be code review. Likewise if more intensive
> > benchmarking determines that raw pointer views critically outperform
> > index/offset
> > views for real-world analytics tasks, prioritizing raw pointer string views
> > for usage within the C++ implementation will be straightforward.
> >
> > See also the proposal to Velox that their string view vector be refactored
> > in a similar vein [4].
> >
> > Sincerely,
> > Ben Kietzman
> >
> > [1] https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
> > [2] http://cidrdb.org/cidr2020/papers/p29-neumann-cidr20.pdf
> > [3] https://github.com/apache/arrow/pull/35628
> > [4] https://github.com/facebookincubator/velox/discussions/4362
> >


Re: [VOTE] Release Apache Arrow ADBC 0.4.0 - RC0

2023-05-12 Thread Dewey Dunnington
+1!

On Windows/Git Bash I ran:
PATH="$PATH:/c/PROGRA~1/R/R-43~1.0/bin" TEST_DEFAULT=0 TEST_R=1
dev/release/verify-release-candidate.sh 0.4.0 0

On Ubuntu 22.04 I ran:
TEST_APT=0 TEST_YUM=0 USE_CONDA=1
dev/release/verify-release-candidate.sh 0.4.0 0

I wasn't able to verify on MacOS/aarch64 because of the Snowflake
driver issue (but have verified Matt's fix to be included in the next
release locally). I also wasn't able to get the PowerShell
verification script working but will pursue that and/or add some
instructions as I have time before the next release.

On Wed, May 10, 2023 at 8:22 PM David Li  wrote:
>
> My vote: +1
>
> I verified on Ubuntu 22.04/Conda/AMD64 and macOS 13/Conda/AArch64 (sources 
> only there). I had to disable all the Snowflake driver builds for macOS as 
> those aren't working on non-AMD64 platforms (that was fixed earlier today by 
> Matt for the next release).
>
> Re: missing packages, I agree we should install them all via Conda.
>
> How long do we expect to need the Arch workaround? We don't currently test 
> with Arch in the first place so that might bitrot quickly (especially as it's 
> Arch).
>
> On Wed, May 10, 2023, at 18:12, Will Jones wrote:
> > +1 (binding)
> >
> > Verified on Ubuntu 22 with USE_CONDA=1
> > dev/release/verify-release-candidate.sh 0.4.0 0
> >
> > On Wed, May 10, 2023 at 2:27 PM Matt Topol  wrote:
> >
> >> Using a manjaro linux image (in honor of the issues we found for Arrow v12
> >> rc) I ran:
> >> USE_CONDA=1 ./dev/release/verify-release-candidate.sh 0.4.0 0
> >>
> >> My first attempt failed because the default base image doesn't have make
> >> and such installed. should we install that via conda too since we install
> >> the compilers and toolchains through conda when USE_CONDA=1?
> >>
> >> After installing `base-devel` package which gives make/autoconf/etc
> >> everything ran properly and worked just fine for verifying the candidate.
> >>
> >> So I'm +1 on the release (I'm fine with requiring that base-devel package
> >> installed) but I wanted to bring up/suggest the idea of installing make
> >> through conda also. That said, it still has the same libcrypt.so.1 issue
> >> that we saw with the Arrow v12 release, maybe we should add a note in the
> >> documentation that the `libxcrypt-compat` package is needed to build on any
> >> `pacman` / ArchLinux based systems?
> >>
> >> On Wed, May 10, 2023 at 7:03 AM Raúl Cumplido 
> >> wrote:
> >>
> >> > +1
> >> >
> >> > I ran the following on Ubuntu 22.04:
> >> > USE_CONDA=1 ./dev/release/verify-release-candidate.sh 0.4.0 0
> >> >
> >> > On Wed, May 10, 2023 at 9:59, Sutou Kouhei () wrote:
> >> > >
> >> > > +1
> >> > >
> >> > > I ran the following on Debian GNU/Linux sid:
> >> > >
> >> > >   JAVA_HOME=/usr/lib/jvm/default-java \
> >> > > TEST_PYTHON_VERSIONS=3.11 \
> >> > > dev/release/verify-release-candidate.sh 0.4.0 0
> >> > >
> >> > > with:
> >> > >
> >> > >   * Python 3.11.2
> >> > >   * g++ (Debian 12.2.0-14) 12.2.0
> >> > >   * go version go1.19.8 linux/amd64
> >> > >   * openjdk version "17.0.6" 2023-01-17
> >> > >   * ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux-gnu]
> >> > >   * R version 4.3.0 (2023-04-21) -- "Already Tomorrow"
> >> > >
> >> > > Thanks,
> >> > > --
> >> > > kou
> >> > >
> >> > >
> >> > > In <831038c5-0ab3-4dae-80e3-07c882dce...@app.fastmail.com>
> >> > >   "[VOTE] Release Apache Arrow ADBC 0.4.0 - RC0" on Tue, 09 May 2023
> >> > 21:46:48 -0400,
> >> > >   "David Li"  wrote:
> >> > >
> >> > > > Hello,
> >> > > >
> >> > > > I would like to propose the following release candidate (RC0) of
> >> > Apache Arrow ADBC version 0.4.0. This is a release consisting of 47
> >> > resolved GitHub issues [1].
> >> > > >
> >> > > > This release candidate is based on commit:
> >> > cdb8fba8f6ca26647863224fb7fd9fc74097 [2]
> >> > > >
> >> > > > The source release rc0 is hosted at [3].
> >> > > > The binary artifacts are hosted at [4][5][6][7][8].
> >> > > > The changelog is located at [9].
> >> > > >
> >> > > > Please download, verify checksums and signatures, run the unit tests,
> >> > and vote on the release. See [10] for how to validate a release
> >> candidate.
> >> > > >
> >> > > > See also a verification result on GitHub Actions [11].
> >> > > >
> >> > > > The vote will be open for at least 72 hours.
> >> > > >
> >> > > > [ ] +1 Release this as Apache Arrow ADBC 0.4.0
> >> > > > [ ] +0
> >> > > > [ ] -1 Do not release this as Apache Arrow ADBC 0.4.0 because...
> >> > > >
> >> > > > Note: to verify APT/YUM packages on macOS/AArch64, you must `export
> >> > DOCKER_DEFAULT_ARCHITECTURE=linux/amd64`. (Or skip this step by `export
> >> > TEST_APT=0 TEST_YUM=0`.)
> >> > > >
> >> > > > [1]:
> >> >
> >> https://github.com/apache/arrow-adbc/issues?q=is%3Aissue+milestone%3A%22ADBC+Libraries+0.4.0%22+is%3Aclosed
> >> > > > [2]:
> >> >
> >> https://github.com/apache/arrow-adbc/commit/cdb8fba8f6ca26647863224fb7fd9fc74097
> >> > > > [3]:
> >> >
> >> 

Re: [ANNOUNCE] New Arrow PMC member: Matt Topol

2023-05-04 Thread Dewey Dunnington
Congrats!

On Thu, May 4, 2023 at 6:31 AM Alenka Frim
 wrote:
>
> Congratulations Matt!!
>
> On Thu, May 4, 2023 at 9:22 AM Joris Van den Bossche <
> jorisvandenboss...@gmail.com> wrote:
>
> > Congrats Matt!
> >
> > On Thu, 4 May 2023 at 06:31, Nic Crane  wrote:
> > >
> > > Congratulations!
> > >
> > > On Thu, 4 May 2023, 05:24 Vibhatha Abeykoon,  wrote:
> > >
> > > > Congratulations Matt!
> > > >
> > > > On Thu, May 4, 2023 at 7:35 AM Ian Cook  wrote:
> > > >
> > > > > Congratulations Matt!!!
> > > > >
> > > > > On Wed, May 3, 2023 at 9:55 PM Yibo Cai  wrote:
> > > > > >
> > > > > > Congrats Matt!
> > > > > >
> > > > > > On 5/4/23 07:07, Krisztián Szűcs wrote:
> > > > > > > Congrats Matt!
> > > > > > >
> > > > > > > On Wed, May 3, 2023 at 11:44 PM Rok Mihevc wrote:
> > > > > > >>
> > > > > > >> Congrats Matt. Well deserved!
> > > > > > >>
> > > > > > >> Rok
> > > > > > >>
> > > > > > >> On Wed, May 3, 2023 at 11:03 PM David Li 
> > > > wrote:
> > > > > > >>
> > > > > > >>> Congrats Matt!
> > > > > > >>>
> > > > > > >>> On Wed, May 3, 2023, at 16:06, Neal Richardson wrote:
> > > > > >  Congratulations!
> > > > > > 
> > > > > >  On Wed, May 3, 2023 at 1:58 PM Jacob Wujciak
> > > > > > >>> 
> > > > > >  wrote:
> > > > > > 
> > > > > > > Congratulations, well deserved!
> > > > > > >
> > > > > > > On Wed, May 3, 2023 at 7:48 PM Weston Pace <
> > > > weston.p...@gmail.com>
> > > > > > >>> wrote:
> > > > > > >
> > > > > > >> Congratulations!
> > > > > > >>
> > > > > > >> On Wed, May 3, 2023 at 10:47 AM Raúl Cumplido <
> > > > > raulcumpl...@gmail.com
> > > > > > 
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >>> Congratulations Matt!
> > > > > > >>>
> > > > > > >>> El mié, 3 may 2023, 19:44, vin jake 
> > > > > > >>> escribió:
> > > > > > >>>
> > > > > >  Congratulations, Matt!
> > > > > > 
> > > > > >  Felipe Oliveira Carvalho  于
> > 2023年5月4日周四
> > > > > > > 01:42写道:
> > > > > > 
> > > > > > > Congratulations, Matt!
> > > > > > >
> > > > > > > On Wed, 3 May 2023 at 14:37 Andrew Lamb <
> > > > al...@influxdata.com>
> > > > > > >> wrote:
> > > > > > >
> > > > > > >> The Project Management Committee (PMC) for Apache Arrow
> > has
> > > > > > > invited
> > > > > > >> Matt Topol (zeroshade) to become a PMC member and we are
> > > > > > >>> pleased
> > > > > > > to
> > > > > > >> announce
> > > > > > >> that Matt has accepted.
> > > > > > >>
> > > > > > >> Congratulations and welcome!
> > > > > > >>
> > > > > > >
> > > > > > 
> > > > > > >>>
> > > > > > >>
> > > > > > >
> > > > > > >>>
> > > > >
> > > >
> >


Re: [DISCUSS] Migrate s390x from Travis to ASF Jenkins

2023-04-20 Thread Dewey Dunnington
I wonder if the main use of the s390x job is to ensure that Arrow
works on big endian? If there are any claims about working with/on big
endian in the code base (e.g., the existence of a
ARROW_IS_LITTLE_ENDIAN macro) I think it's essential that it is tested
somewhere. Deciding to abandon big endian is, of course, a choice,
although I imagine it would be more work/require more input to do so
than to migrate a CI job.

I use Arrow on s380x, although it's a bit of circular logic because
I'm using it to make sure nanoarrow works on big endian.

On Thu, Apr 20, 2023 at 4:07 PM Matt Topol  wrote:
>
> I just wanted to add on that there was a Go on s390x job too that needs to
> get migrated and wasn't on the list in Raul's original email.
>
> On Thu, Apr 20, 2023 at 2:42 PM Benson Muite 
> wrote:
>
> > Might also consider testing farm for Centos Stream, Fedora and/or RHEL
> > builds[1][2].
> >
> > 1) https://docs.testing-farm.io/general/0.1/test-environment.html
> > 2)
> >
> > https://fedoramagazine.org/test-github-projects-with-github-actions-and-testing-farm/
> >
> > On 4/20/23 19:43, Antoine Pitrou wrote:
> > >
> > > Hi Raul,
> > >
> > > I'm a bit lukewarm about this. We currently don't use Jenkins and it's
> > > quite different from the CI services we have. Adding Jenkins jobs for
> > > s390x sounds like significant additional maintenance for a little-used
> > > platform. Has someone been asking for this?
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > Le 20/04/2023 à 13:00, Raúl Cumplido a écrit :
> > >> Hi,
> > >>
> > >> As discussed on this mailing list thread [1], one month and a half ago
> > >> we migrated the ARM 64 jobs from Travis to self-hosted runners [2].
> > >>
> > >> We are still missing to migrate the s390x jobs that we run on Travis
> > [3].
> > >>
> > >> - name: "Java on s390x"
> > >> - name: "C++ on s390x"
> > >> - name: "Python on s390x"
> > >>
> > >> As we don't have other s390x hosts I will try and set up new Jenkins
> > >> jobs for those using the ASF provided infrastructure.
> > >>  From what I can read on the ASF wiki [4] I might require some PMC to
> > >> help me get access to Jenkins via the whimsy tool to be added to the
> > >> hudson-jobadmin group in order to have access to set up jobs on
> > >> Jenkins.
> > >>
> > >> I wanted to validate that this is ok and would like to ask if someone
> > >> can help me with the access.
> > >>
> > >> As a reminder all Apache projects were supposed to migrate from Travis
> > >> CI by the end of 2022 [5].
> > >>
> > >> Thanks,
> > >> Raúl
> > >>
> > >> [1] https://lists.apache.org/thread/mskpqwpdq65t1wpj4f5klfq9217ljodw
> > >> [2] https://github.com/apache/arrow/pull/34482
> > >> [3]
> > >>
> > https://github.com/apache/arrow/blob/f2cc0b41fe9fb1d8d2bdb1d2abf676278e273f55/.travis.yml
> > >> [4] https://cwiki.apache.org/confluence/display/INFRA/Jenkins
> > >> [5] https://github.com/apache/arrow/issues/20496
> >
> >


Re: [DISCUSSION] C-Data API for Non-CPU Use Cases

2023-04-10 Thread Dewey Dunnington
I left some comments on the PR as well...I think this is an important
addition and I'm excited to see this discussion!

If there is further information that needs to be passed along in the
future, schema metadata could be used. Even with schema metadata, the
device type and ID will always need to be communicated and the use of
the ArrowDeviceArray/stream would be necessary as an opt-in to
multi-device support.

I do wonder if the stream interface is a little CUDA-specific...my
first reaction was wondering if it shouldn't live in a CUDA header (or
connector library including a CUDA header) since it contains direct
references to types in that header. If this is not true there should
be strong documentation supporting its utility or perhaps an example
from another library (I know you're working on this!).
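For readers following along without the PR open, the proposal extends the existing C data interface ABI roughly as follows. This is a sketch based on my reading of the PR under discussion; field names and the exact layout may differ from what is eventually merged:

```c
#include <assert.h>
#include <stdint.h>

/* The existing Arrow C data interface array (stable ABI). */
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void **buffers;
  struct ArrowArray **children;
  struct ArrowArray *dictionary;
  void (*release)(struct ArrowArray *);
  void *private_data;
};

/* Sketch of the proposed device-aware wrapper: the array plus the
 * device type/ID that consumers must check before touching buffers,
 * an optional synchronization event, and reserved space for future
 * evolution (the 128 bytes referenced in the thread). */
struct ArrowDeviceArray {
  struct ArrowArray array;
  int64_t device_id;   /* e.g. a CUDA device ordinal */
  int32_t device_type; /* enum mirroring dlpack-style device types */
  void *sync_event;    /* event/stream handle to synchronize on */
  int64_t reserved[3];
};
```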

On Mon, Apr 10, 2023 at 5:26 PM Weston Pace  wrote:
>
> Sorry, I meant:
>
> I am *now* a solid +1
>
> On Mon, Apr 10, 2023 at 1:26 PM Weston Pace  wrote:
>
> > I am not a solid +1 and I can see the usefulness.  Matt and I spoke on
> > this externally and I think Matt has written a great summary.  There were a
> > few more points that came up in the discussion that I think are
> > particularly compelling.
> >
> > * Avoiding device location is generally fatal
> >
> > In other cases when we have generic metadata it is typically ok for
> > libraries to ignore this and just make sure they forward it on properly.
> > For example, when writing parquet files or IPC files we need to be sure we
> > persist the generic metadata and load it from the disk.  However, if you
> > ignore the device location, and it is not CPU, and you attempt to access
> > the buffer (e.g. to persist the data) you will probably crash.
> >
> > This is not true for most of the examples I gave (one could argue that
> > accessing a NUMA buffer from the wrong socket is a bad idea, but at least
> > it is not fatal).  It's possible to contrive other cases that meet this
> > criteria (e.g. a custom allocator that can temporarily persist buffers to
> > disk and requires checkout before using the pointers or, arguably, the numa
> > node example given above) but they can probably be mapped to the device id
> > pattern and they aren't actively encountered today.
> >
> > * We need this internally
> >
> > As mentioned in Matt's previous email.  We could make use of this field
> > today to avoid doing things (e.g. validating arrays) in flight / IPC when
> > we know the data is not on the CPU.
> >
> > * Others are actively working around this problem today
> >
> > There are libraries today that have encountered this problem and have
> > proposed similar workarounds.
> >
> >  * The changes to the stream interface are more than just "metadata"
> >
> > I did not look closely enough and realize that these changes are more
> > substantial than just switching to ArrowDeviceArray.  These changes
> > introduce a concept of queues to the stream which mirrors concepts found in
> > mainstream GPU libraries (e.g. CUDA's "streams")
> >
> > On Mon, Apr 10, 2023 at 12:51 PM Matt Topol 
> > wrote:
> >
> >> > There's nothing in the spec today that
> >> prevents users from creating `ArrowDeviceArray` and
> >> `ArrowDeviceArrayStream` themselves
> >>
> >> True, but third-party applications aren't going to be the only downstream
> >> users of this API. We also want to build on this within Arrow itself to
> >> enable easier usage of non-CPU memory in the higher-level interfaces (IPC,
> >> Flight). Right now there are the buffer and memory manager classes, but
> >> there are plenty of areas that still inspect the buffers instead of first
> >> checking the `is_cpu` flag on them first. Plus we would want to be able to
> >> expose the device_type and device_id information at those higher level
> >> interfaces too.
> >>
> >> Even if we don't standardize on the device type list from dlpack, having a
> >> standardized way for libraries to pass this device information alongside
> >> the Arrow Arrays themselves is still beneficial and I think it's better
> >> for
> >> us to define it rather than wait for others to do so. The next step after
> >> agreeing on this format change is going to be building helpers (similar to
> >> the existing C Data helpers) around this to ensure safe usage and
> >> conversion to C++ Arrow Array's and buffers, etc.
> >>
> >> > Would we rather come up with a mechanism for arbitrary key/value
> >> metadata
> >> (e.g. UTF-8 encoded JSON string) to accompany arrays?  Something similar
> >> to
> >> record batch metadata in the IPC format?
> >>
> >> A generic key/value metadata layer at the Array level for the C-Data API
> >> could work in most cases *except* for the change to `get_next` for passing
> >> the stream/queue pointer object that the producer needs to make the data
> >> safe to access on. If a need for generic metadata at the array level IS
> >> needed in the future, the reserved 128 bytes could be easily evolved into
> >> a
> >> const char* + a 

Re: [VOTE] Release Apache Arrow ADBC 0.3.0 - RC1

2023-03-20 Thread Dewey Dunnington
+1 (non-binding)!

I ran `USE_CONDA=1 TEST_APT=0 TEST_YUM=0
dev/release/verify-release-candidate.sh 0.3.0 1` successfully on MacOS
Monterey (M1) and MacOS Ventura (M1).

Additionally, I verified R with and without conda  (`TEST_DEFAULT=0
TEST_R=1 dev/release/verify-release-candidate.sh 0.3.0 1` and
`TEST_DEFAULT=0 TEST_R=1 USE_CONDA=1
dev/release/verify-release-candidate.sh 0.3.0 1`) on MacOS Monterey
(M1) and MacOS El Capitan (x86). I was also able to verify R without
conda on a Windows VM (x86) via Git bash.

I did run into a missing symbol error using Conda on MacOS El Capitan
[1] but this seems related to conda environments in verification and
not the release itself.

[1] https://github.com/apache/arrow-adbc/issues/530


On Sat, Mar 18, 2023 at 11:19 AM SHIMA Tatsuya  wrote:
>
> +1 (non-binding)
>
> I ran `export USE_CONDA=1 && ./dev/release/verify-release-candidate.sh
> 0.3.0 1` on Ubuntu 20.04
>
> On 2023/03/18 21:58, David Li wrote:
> > My vote: +1
> >
> > Verified with Conda on macOS 13.2/AArch64 and Ubuntu Linux 18.04/x86_64.
> >
> > On Fri, Mar 17, 2023, at 23:40, Sutou Kouhei wrote:
> >> +1
> >>
> >> I ran the following on Debian GNU/Linux sid:
> >>
> >>dev/release/verify-release-candidate.sh 0.3.0 1
> >>
> >> with:
> >>
> >>* Python 3.11.2
> >>* g++ (Debian 12.2.0-14) 12.2.0
> >>* go version go1.19.6 linux/amd64
> >>* openjdk version "18.0.2-ea" 2022-07-19
> >>* ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux-gnu]
> >>* R version 4.2.2 Patched (2022-11-10 r83330) -- "Innocent and Trusting"
> >>
> >> Thanks,
> >> --
> >> kou
> >>
> >> In <35d590f7-d9ad-4d1f-81c6-bcfbab251...@app.fastmail.com>
> >>"[VOTE] Release Apache Arrow ADBC 0.3.0 - RC1" on Fri, 17 Mar 2023
> >> 11:22:11 -0400,
> >>"David Li"  wrote:
> >>
> >>> Hello,
> >>>
> >>> I would like to propose the following release candidate (RC1) of Apache 
> >>> Arrow ADBC version 0.3.0. This is a release consisting of 24 resolved 
> >>> GitHub issues [1].
> >>>
> >>> This release candidate is based on commit: 
> >>> ebcb87d8df41798d82171d81b7650b6bdfbe295a [2]
> >>>
> >>> The source release rc1 is hosted at [3].
> >>> The binary artifacts are hosted at [4][5][6][7][8].
> >>> The changelog is located at [9].
> >>>
> >>> Please download, verify checksums and signatures, run the unit tests, and 
> >>> vote on the release. See [10] for how to validate a release candidate.
> >>>
> >>> See also a verification result on GitHub Actions [11].
> >>>
> >>> The vote will be open for at least 72 hours.
> >>>
> >>> [ ] +1 Release this as Apache Arrow ADBC 0.3.0
> >>> [ ] +0
> >>> [ ] -1 Do not release this as Apache Arrow ADBC 0.3.0 because...
> >>>
> >>> Note: to verify APT/YUM packages on macOS/AArch64, you must `export 
> >>> DOCKER_DEFAULT_ARCHITECTURE=linux/amd64`. (Or skip this step by `export 
> >>> TEST_APT=0 TEST_YUM=0`.)
> >>>
> >>> Thanks to Kou for helping prepare the release.
> >>>
> >>> [1]: 
> >>> https://github.com/apache/arrow-adbc/issues?q=is%3Aissue+milestone%3A%22ADBC+Libraries+0.3.0%22+is%3Aclosed
> >>> [2]: 
> >>> https://github.com/apache/arrow-adbc/commit/ebcb87d8df41798d82171d81b7650b6bdfbe295a
> >>> [3]: 
> >>> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-adbc-0.3.0-rc1/
> >>> [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> >>> [5]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> >>> [6]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> >>> [7]: 
> >>> https://repository.apache.org/content/repositories/staging/org/apache/arrow/adbc/
> >>> [8]: 
> >>> https://github.com/apache/arrow-adbc/releases/tag/apache-arrow-adbc-0.3.0-rc1
> >>> [9]: 
> >>> https://github.com/apache/arrow-adbc/blob/apache-arrow-adbc-0.3.0-rc1/CHANGELOG.md
> >>> [10]: 
> >>> https://arrow.apache.org/adbc/main/development/releasing.html#how-to-verify-release-candidates
> >>> [11]: https://github.com/apache/arrow-adbc/actions/runs/4448256840


Re: [ANNOUNCE] New Arrow PMC member: Will Jones

2023-03-13 Thread Dewey Dunnington
Congrats, Will!

On Mon, Mar 13, 2023 at 3:07 PM Matt Topol  wrote:
>
> Congrats Will!
>
> On Mon, Mar 13, 2023, 2:02 PM Jacob Wujciak 
> wrote:
>
> > Congratulations Will, well deserved!
> >
> > On Mon, Mar 13, 2023 at 6:58 PM Andrew Lamb  wrote:
> >
> > > The Project Management Committee (PMC) for Apache Arrow has invited
> > > Will Jones to become a PMC member and we are pleased to announce
> > > that Will Jones has accepted.
> > >
> > > Congratulations and welcome!
> > >
> >


Re: [RESULT][VOTE] Release Apache Arrow nanoarrow 0.1.0 - RC1

2023-03-09 Thread Dewey Dunnington
Absolutely! By the next time this happens I hope to be better at this :-)

The post-release tasks are all complete!
[x] Closed GitHub milestone
[x] Added release to Apache Reporter System (Thanks David!)
[x] Uploaded artifacts to Subversion (Thanks David!)
[x] Created GitHub release
[x] Submit R package to CRAN (fingers crossed!)
[x] Sent announcement to annou...@apache.org
[x] Release blog post at https://github.com/apache/arrow-site/pull/288
(arrow-site build is currently failing but post should be live when it
is resurrected)
[x] Removed old artifacts from SVN
[x] Bumped versions on main

Thank you all!

On Tue, Mar 7, 2023 at 5:28 PM Sutou Kouhei  wrote:
>
> Hi,
>
> I prepended "[RESULT]" to the subject.
>
> Dewey, could you use "[RESULT][VOTE] ..." subject for a vote
> result e-mail next time? It makes results easier to find.
>
> A recent example:
>   [RESULT][VOTE][RUST] Release Apache Arrow Rust Object Store 0.5.5 RC1
>   https://lists.apache.org/thread/ybqvx5k2lnyotnh6yq5xzzp80x09fl1c
>
>
> Thanks,
> --
> kou
>
> In 
>   "Re: [VOTE] Release Apache Arrow nanoarrow 0.1.0 - RC1" on Tue, 7 Mar 2023 
> 12:16:12 -0400,
>   Dewey Dunnington  wrote:
>
> > Thank you everybody for verifying and voting! With 4 binding +1s and 5
> > non-binding +1s, the vote passes!
> >
> > The first release of Apache Arrow nanoarrow has the following post-release
> > tasks. I believe the upload step requires a PMC member to run the script
> > but the rest I'm happy to take care of!
> >
> > [x] Closed GitHub milestone
> > [ ] Added release to Apache Reporter System
> > [ ] Uploaded artifacts to Subversion
> > [ ] Created GitHub release
> > [ ] Submit R package to CRAN
> > [ ] Sent announcement to annou...@apache.org
> > [ ] Release blog post at https://github.com/apache/arrow-site/pull/288
> > [ ] Removed old artifacts from SVN
> > [ ] Bumped versions on main
> >
> >
> >
> > On Thu, Mar 2, 2023 at 11:35 PM Jacob Wujciak 
> > 
> > wrote:
> >
> >> +1 (non-binding) verified on manjaro
> >>
> >> On Fri, Mar 3, 2023 at 1:37 AM Sutou Kouhei  wrote:
> >>
> >> > +1
> >> >
> >> > This was because I already installed the old 'arrow' R
> >> > package (8.0.0). I upgraded it to 11.0.0 and it
> >> > worked. Thanks to Dewey for helping me!
> >> >
> >> > Thanks,
> >> > --
> >> > kou
> >> >
> >> > In 
> >> >   "Re: [VOTE] Release Apache Arrow nanoarrow 0.1.0 - RC1" on Wed, 1 Mar
> >> > 2023 22:35:00 -0400,
> >> >   Dewey Dunnington  wrote:
> >> >
> >> > > Thank you!
> >> > >
> >> > > I've followed up on an issue [1], but I believe the problem is that the
> >> > > version of the 'arrow' R package that is installed is before 10.0.0
> >> when
> >> > > the tested features were added. In the R package DESCRIPTION the
> >> version
> >> > > constraint is >= 9.0.0, which should be updated (or those tests should
> >> be
> >> > > skipped for old versions of arrow).
> >> > >
> >> > > [1] https://github.com/apache/arrow-nanoarrow/issues/141
> >> > >
> >> > > On Wed, Mar 1, 2023 at 7:53 PM Sutou Kouhei 
> >> wrote:
> >> > >
> >> > >> -0
> >> > >>
> >> > >> I ran the following command line on Debian GNU/Linux sid:
> >> > >>
> >> > >>   NANOARROW_CMAKE_OPTIONS=-DCMAKE_PREFIX_PATH=/tmp/local \
> >> > >> dev/release/verify-release-candidate.sh 0.1.0 1
> >> > >>
> >> > >> with:
> >> > >>
> >> > >>   * Apache Arrow C++ main
> >> > >>   * R version 4.2.2 Patched (2022-11-10 r83330)
> >> > >>
> >> > >> I got 4 R failures:
> >> > >>
> >> > >>
> >> > >>
> >> >
> >> https://gist.github.com/kou/1194cf28cb8e70fe309d0f07e6f49b3b#file-testthat-rout-fail
> >> > >>
> >> > >> I'm not sure whether they are caused by my wrong setup or
> >> > >> not. Could someone check the failures?
> >> > >>
> >> > >> Thanks,
> >> > >> --
> >> > >> kou
> >> > >>
> >> > >> In <
> >> cafb7qsfx3jcb6wnxcmpnqj_ieey8rxfee8e+zf2gudyyjxi...@mail.gmai

Re: [VOTE] Release Apache Arrow nanoarrow 0.1.0 - RC1

2023-03-07 Thread Dewey Dunnington
Thank you everybody for verifying and voting! With 4 binding +1s and 5
non-binding +1s, the vote passes!

The first release of Apache Arrow nanoarrow has the following post-release
tasks. I believe the upload step requires a PMC member to run the script
but the rest I'm happy to take care of!

[x] Closed GitHub milestone
[ ] Added release to Apache Reporter System
[ ] Uploaded artifacts to Subversion
[ ] Created GitHub release
[ ] Submit R package to CRAN
[ ] Sent announcement to annou...@apache.org
[ ] Release blog post at https://github.com/apache/arrow-site/pull/288
[ ] Removed old artifacts from SVN
[ ] Bumped versions on main



On Thu, Mar 2, 2023 at 11:35 PM Jacob Wujciak 
wrote:

> +1 (non-binding) verified on manjaro
>
> On Fri, Mar 3, 2023 at 1:37 AM Sutou Kouhei  wrote:
>
> > +1
> >
> > This was because I already installed the old 'arrow' R
> > package (8.0.0). I upgraded it to 11.0.0 and it
> > worked. Thanks to Dewey for helping me!
> >
> > Thanks,
> > --
> > kou
> >
> > In 
> >   "Re: [VOTE] Release Apache Arrow nanoarrow 0.1.0 - RC1" on Wed, 1 Mar
> > 2023 22:35:00 -0400,
> >   Dewey Dunnington  wrote:
> >
> > > Thank you!
> > >
> > > I've followed up on an issue [1], but I believe the problem is that the
> > > version of the 'arrow' R package that is installed is before 10.0.0
> when
> > > the tested features were added. In the R package DESCRIPTION the
> version
> > > constraint is >= 9.0.0, which should be updated (or those tests should
> be
> > > skipped for old versions of arrow).
> > >
> > > [1] https://github.com/apache/arrow-nanoarrow/issues/141
> > >
> > > On Wed, Mar 1, 2023 at 7:53 PM Sutou Kouhei 
> wrote:
> > >
> > >> -0
> > >>
> > >> I ran the following command line on Debian GNU/Linux sid:
> > >>
> > >>   NANOARROW_CMAKE_OPTIONS=-DCMAKE_PREFIX_PATH=/tmp/local \
> > >> dev/release/verify-release-candidate.sh 0.1.0 1
> > >>
> > >> with:
> > >>
> > >>   * Apache Arrow C++ main
> > >>   * R version 4.2.2 Patched (2022-11-10 r83330)
> > >>
> > >> I got 4 R failures:
> > >>
> > >>
> > >>
> >
> https://gist.github.com/kou/1194cf28cb8e70fe309d0f07e6f49b3b#file-testthat-rout-fail
> > >>
> > >> I'm not sure whether they are caused by my wrong setup or
> > >> not. Could someone check the failures?
> > >>
> > >> Thanks,
> > >> --
> > >> kou
> > >>
> > >> In <
> cafb7qsfx3jcb6wnxcmpnqj_ieey8rxfee8e+zf2gudyyjxi...@mail.gmail.com>
> > >>   "[VOTE] Release Apache Arrow nanoarrow 0.1.0 - RC1" on Wed, 1 Mar
> 2023
> > >> 13:03:44 -0400,
> > >>   Dewey Dunnington  wrote:
> > >>
> > >> > Hello,
> > >> >
> > >> > I would like to propose the following release candidate (RC1) of
> > Apache
> > >> > Arrow nanoarrow [0] version 0.1.0. This is an initial release
> > consisting
> > >> of
> > >> > 31 resolved GitHub issues [1].
> > >> >
> > >> > Special thanks to David Li for his reviews and support during the
> > >> > preparation of this initial release candidate!
> > >> >
> > >> > This release candidate is based on commit:
> > >> > 341279af1b2fdede36871d212f339083ffbd75eb [2]
> > >> >
> > >> > The source release rc1 is hosted at [3].
> > >> > The changelog is located at [4].
> > >> >
> > >> > Please download, verify checksums and signatures, run the unit
> tests,
> > and
> > >> > vote on the release. See [5] for how to validate a release
> candidate.
> > >> >
> > >> > The vote will be open for at least 72 hours.
> > >> >
> > >> > [ ] +1 Release this as Apache Arrow nanoarrow 0.1.0
> > >> > [ ] +0
> > >> > [ ] -1 Do not release this as Apache Arrow nanoarrow 0.1.0
> because...
> > >> >
> > >> > [0] https://github.com/apache/arrow-nanoarrow
> > >> > [1] https://github.com/apache/arrow-nanoarrow/milestone/1?closed=1
> > >> > [2]
> > >> >
> > >>
> >
> https://github.com/apache/arrow-nanoarrow/tree/apache-arrow-nanoarrow-0.1.0-rc1
> > >> > [3]
> > >> >
> > >>
> >
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-nanoarrow-0.1.0-rc1/
> > >> > [4]
> > >> >
> > >>
> >
> https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.1.0-rc1/CHANGELOG.md
> > >> > [5]
> > >> >
> > >>
> >
> https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/README.md
> > >>
> >
>


Re: [VOTE][Format] Fixed shape tensor Canonical Extension Type

2023-03-06 Thread Dewey Dunnington
+1 (non-binding)!

On Mon, Mar 6, 2023 at 9:59 AM Nic Crane  wrote:

> +1
>
> On Mon, 6 Mar 2023 at 12:41, Alenka Frim 
> wrote:
>
> > Hi all,
> >
> > I am starting a new voting thread with this email as the first voting
> > thread [1] opened up new
> > comments and suggestions and we wanted to take time to see how that
> > evolves.
> >
> > *I would like to propose we vote on adding the fixed shape tensor
> canonical
> > extension type*
> > *with the following specification:*
> >
> > Fixed shape tensor
> > ==
> >
> > * Extension name: `arrow.fixed_shape_tensor`.
> >
> > * The storage type of the extension: ``FixedSizeList`` where:
> >
> >   * **value_type** is the data type of individual tensor elements.
> >   * **list_size** is the product of all the elements in tensor shape.
> >
> > * Extension type parameters:
> >
> >   * **value_type** = the Arrow data type of individual tensor elements.
> >   * **shape** = the physical shape of the contained tensors
> > as an array.
> >
> >   Optional parameters describing the logical layout:
> >
> >   * **dim_names** = explicit names to tensor dimensions
> > as an array. The length of it should be equal to the shape
> > length and equal to the number of dimensions.
> >
> > ``dim_names`` can be used if the dimensions have well-known
> > names and they map to the physical layout (row-major).
> >
> >   * **permutation**  = indices of the desired ordering of the
> > original dimensions, defined as an array.
> >
> > The indices contain a permutation of the values [0, 1, .., N-1] where
> > N is the number of dimensions. The permutation indicates which
> > dimension of the logical layout corresponds to which dimension of the
> > physical tensor (the i-th dimension of the logical view corresponds
> > to the dimension with number ``permutations[i]`` of the physical
> > tensor).
> >
> > Permutation can be useful in case the logical order of
> > the tensor is a permutation of the physical order (row-major).
> >
> > When logical and physical layout are equal, the permutation will
> always
> > be ([0, 1, .., N-1]) and can therefore be left out.
> >
> > * Description of the serialization:
> >
> >   The metadata must be a valid JSON object including shape of
> >   the contained tensors as an array with key **"shape"** plus optional
> >   dimension names with keys **"dim_names"** and ordering of the
> >   dimensions with key **"permutation"**.
> >
> >   - Example: ``{ "shape": [2, 5]}``
> >   - Example with ``dim_names`` metadata for NCHW ordered data:
> >
> > ``{ "shape": [100, 200, 500], "dim_names": ["C", "H", "W"]}``
> >
> >   - Example of permuted 3-dimensional tensor:
> >
> > ``{ "shape": [100, 200, 500], "permutation": [2, 0, 1]}``
> >
> > This is the physical layout shape and the shape of the logical
> > layout would in this case be ``[500, 100, 200]``.
> >
> > .. note::
> >
> >   Elements in a fixed shape tensor extension array are stored
> >   in row-major/C-contiguous order.
> >
> > * The specification is submitted as a PR [2] to Canonical Extension Types
> > document under the
> >format specifications directory [3].
> >
> > There are also two implementations submitted to Apache Arrow repository:
> > * C++ implementation of the proposed specification [4]
> > * Python example implementation of the proposed specification and usage
> > (only illustrative) [5]
> >
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Accept this proposal
> > [ ] +0
> > [ ] -1 Do not accept this proposal because...
> >
> >
> > Regards, Alenka
> >
> > [1]: https://lists.apache.org/thread/3cj0cr44hg3t2rn0kxly8td82yfob1nd
> > [2]: https://github.com/apache/arrow/pull/33925/files
> > [3]:
> >
> >
> https://github.com/apache/arrow/blob/main/docs/source/format/CanonicalExtensions.rst
> >
> > [4]: https://github.com/apache/arrow/pull/8510/files
> > [5]: https://github.com/apache/arrow/pull/33948/files
> >
>
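The storage-size and permutation rules in the proposal above can be sketched in a few lines of plain Python (illustrative only; the `storage_list_size` and `logical_shape` helpers are my own names, not part of the proposed specification or any Arrow API):

```python
from math import prod

# Storage is FixedSizeList where list_size is the product of the
# elements in the tensor shape.
def storage_list_size(shape):
    return prod(shape)

# "The i-th dimension of the logical view corresponds to the dimension
# with number permutation[i] of the physical tensor."
def logical_shape(physical_shape, permutation):
    return [physical_shape[p] for p in permutation]

print(storage_list_size([2, 5]))                  # 10
print(logical_shape([100, 200, 500], [2, 0, 1]))  # [500, 100, 200]
```

The second call reproduces the permuted 3-dimensional tensor example from the specification text: a physical shape of ``[100, 200, 500]`` with permutation ``[2, 0, 1]`` yields a logical shape of ``[500, 100, 200]``.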


Re: [VOTE] Release Apache Arrow nanoarrow 0.1.0 - RC1

2023-03-01 Thread Dewey Dunnington
Thank you!

I've followed up on an issue [1], but I believe the problem is that the
version of the 'arrow' R package that is installed is before 10.0.0 when
the tested features were added. In the R package DESCRIPTION the version
constraint is >= 9.0.0, which should be updated (or those tests should be
skipped for old versions of arrow).

[1] https://github.com/apache/arrow-nanoarrow/issues/141

On Wed, Mar 1, 2023 at 7:53 PM Sutou Kouhei  wrote:

> -0
>
> I ran the following command line on Debian GNU/Linux sid:
>
>   NANOARROW_CMAKE_OPTIONS=-DCMAKE_PREFIX_PATH=/tmp/local \
> dev/release/verify-release-candidate.sh 0.1.0 1
>
> with:
>
>   * Apache Arrow C++ main
>   * R version 4.2.2 Patched (2022-11-10 r83330)
>
> I got 4 R failures:
>
>
> https://gist.github.com/kou/1194cf28cb8e70fe309d0f07e6f49b3b#file-testthat-rout-fail
>
> I'm not sure whether they are caused by my wrong setup or
> not. Could someone check the failures?
>
> Thanks,
> --
> kou
>
> In 
>   "[VOTE] Release Apache Arrow nanoarrow 0.1.0 - RC1" on Wed, 1 Mar 2023
> 13:03:44 -0400,
>   Dewey Dunnington  wrote:
>
> > Hello,
> >
> > I would like to propose the following release candidate (RC1) of Apache
> > Arrow nanoarrow [0] version 0.1.0. This is an initial release consisting
> of
> > 31 resolved GitHub issues [1].
> >
> > Special thanks to David Li for his reviews and support during the
> > preparation of this initial release candidate!
> >
> > This release candidate is based on commit:
> > 341279af1b2fdede36871d212f339083ffbd75eb [2]
> >
> > The source release rc1 is hosted at [3].
> > The changelog is located at [4].
> >
> > Please download, verify checksums and signatures, run the unit tests, and
> > vote on the release. See [5] for how to validate a release candidate.
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow nanoarrow 0.1.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow nanoarrow 0.1.0 because...
> >
> > [0] https://github.com/apache/arrow-nanoarrow
> > [1] https://github.com/apache/arrow-nanoarrow/milestone/1?closed=1
> > [2]
> >
> https://github.com/apache/arrow-nanoarrow/tree/apache-arrow-nanoarrow-0.1.0-rc1
> > [3]
> >
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-nanoarrow-0.1.0-rc1/
> > [4]
> >
> https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.1.0-rc1/CHANGELOG.md
> > [5]
> >
> https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/README.md
>


  1   2   >