Re: [VOTE] Release Apache Arrow nanoarrow 0.5.0

2024-05-23 Thread Vibhatha Abeykoon
+1 (non-binding)

I have tested on Ubuntu 22.04

./verify-release-candidate.sh 0.5.0 0

With Regards,
Vibhatha Abeykoon


On Thu, May 23, 2024 at 3:21 PM Raúl Cumplido  wrote:

> +1 (binding)
>
> I've tested successfully on Ubuntu 22.04 without R.
>
> TEST_R=0 ./verify-release-candidate.sh 0.5.0 0
>
> Regards,
> Raúl
>
> El jue, 23 may 2024 a las 6:49, David Li () escribió:
> >
> > +1 (binding)
> >
> > Tested on Debian 12 'bookworm'
> >
> > On Thu, May 23, 2024, at 11:03, Sutou Kouhei wrote:
> > > +1 (binding)
> > >
> > > I ran the following command line on Debian GNU/Linux sid:
> > >
> > >   dev/release/verify-release-candidate.sh 0.5.0 0
> > >
> > > with:
> > >
> > >   * Apache Arrow C++ main
> > >   * gcc (Debian 13.2.0-23) 13.2.0
> > >   * R version 4.3.3 (2024-02-29) -- "Angel Food Cake"
> > >   * Python 3.11.9
> > >
> > > Thanks,
> > > --
> > > kou
> > >
> > >
> > > In  >
> > >   "[VOTE] Release Apache Arrow nanoarrow 0.5.0" on Wed, 22 May 2024
> > > 15:17:40 -0300,
> > >   Dewey Dunnington  wrote:
> > >
> > >> Hello,
> > >>
> > >> I would like to propose the following release candidate (rc0) of
> > >> Apache Arrow nanoarrow [0] version 0.5.0. This is an initial release
> > >> consisting of 79 resolved GitHub issues from 9 contributors [1].
> > >>
> > >> This release candidate is based on commit:
> > >> c5fb10035c17b598e6fd688ad9eb7b874c7c631b [2]
> > >>
> > >> The source release rc0 is hosted at [3].
> > >> The changelog is located at [4].
> > >>
> > >> Please download, verify checksums and signatures, run the unit tests,
> > >> and vote on the release. See [5] for how to validate a release
> > >> candidate.
> > >>
> > >> The vote will be open for at least 72 hours.
> > >>
> > >> [ ] +1 Release this as Apache Arrow nanoarrow 0.5.0
> > >> [ ] +0
> > >> [ ] -1 Do not release this as Apache Arrow nanoarrow 0.5.0 because...
> > >>
> > >> [0] https://github.com/apache/arrow-nanoarrow
> > >> [1] https://github.com/apache/arrow-nanoarrow/milestone/5?closed=1
> > >> [2] https://github.com/apache/arrow-nanoarrow/tree/apache-arrow-nanoarrow-0.5.0-rc0
> > >> [3] https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-nanoarrow-0.5.0-rc0/
> > >> [4] https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.5.0-rc0/CHANGELOG.md
> > >> [5] https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/README.md
>


Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Aldrin
For what it's worth, DuckDB accesses Arrow data via IPC in an extension and then 
exports it to the C data interface to call into code in its core.
Also, assumptions about when query optimization occurs relative to data access 
potentially break down in scenarios involving views, distributed tables, 
Substrait and decomposed query engines, and a few others.
Sent from Proton Mail for iOS

On Thu, May 23, 2024 at 13:28, Shoumyo Chakravorti (BLOOMBERG/ 120 PARK) wrote:

Appreciate the additional context!

> use cases where you want to know the schema *before*
> the data is produced

I think my understanding aligns with Dewey's on this point.
I guess I'm struggling to imagine a scenario where a query
planner would want the schema but not the statistics. Because
by the time the query engine starts consuming data, the plan
should've already been optimized, which implies that the
statistics should've come earlier at some point (so having
it in the schema wouldn't hurt per se). But please correct
me if I misunderstood.

> It is usually fine but occasionally ends up with schema
> metadata that is lying

This is a totally valid point and I'm definitely aware
of it - there would be an onus on developers to make sure
that they're not plumbing around nonsensical metadata. And
to your point, making the production of statistics opt-in
would make this decision explicit.

I guess the other saving grace is that query optimization
should never affect the *correctness* of a query, only its
performance. However, I can appreciate that it would be
difficult to diagnose a query being slow just because of bad
metadata.

> Technically there is message-level metadata in the IPC
> flatbuffers... That mechanism isn't available from an
> ArrowArrayStream and so it might not help with the specific
> case at hand.

Gotcha. So it sounds like the schema and field metadata are
the only ones available at the "top" level in Arrow IPC
streams or files; glad to know we didn't miss something :)

As mentioned earlier, my understanding is that query
optimization happens in its entirety before the query engine
consumes any actual data. So I believe the schema- and
field-level metadata are the only ones relevant for the
use-case being considered anyway.

Taking a step back -- my thought process was that if there
is a case for transmitting statistics over Arrow IPC, then
it would be nice to have a consistent solution in the C data
interface as well. Using schema metadata just seemed like
one approach that would achieve this goal.

Best,
Shoumyo

From: dev@arrow.apache.org At: 05/23/24 14:16:32 UTC-4:00 To: dev@arrow.apache.org
Subject: Re: [DISCUSS] Statistics through the C data interface

Thanks Shoumyo for bringing this up!

Using a schema to transmit statistics/data-dependent values is also
something we do in GeoParquet (whose schema also finds its way into
pyarrow and the C data interface when reading). It is usually fine but
occasionally ends up with schema metadata that is lying (e.g., when
unifying schemas from multiple files in a dataset, I believe pyarrow
will sometimes assign metadata from one file to the entire dataset
and/or propagate it through projections/filters).

I imagine statistics would be opt-in (i.e., a consumer would have to
explicitly request them), in which case that consumer could possibly
be required to remove them. With the custom format string that was
proposed I think this is unlikely to happen; however, that a consumer
might want to know statistics over IPC too is an excellent point.

> Unless there are other ways of producing stream-level application metadata
> outside of the schema/field metadata

Technically there is message-level metadata in the IPC flatbuffers,
although I don't believe it is accessible from most IPC readers. That
mechanism isn't available from an ArrowArrayStream and so it might not
help with the specific case at hand.

> nowhere is it mentioned that metadata must be used to determine schema
> equivalence

I am only familiar with a few implementations, but at least Arrow C++
and nanoarrow have options to ignore metadata and/or nullability
and/or possibly field names (e.g., for a list type) depending on what
type of type/schema equivalence is required.

> use cases where you want to know the schema *before* the data is produced.

I may be understanding it incorrectly, but I think it's generally
possible to emit a schema with metadata before emitting record
batches. I suppose you would have already started downloading the
stream, though.

> I think what we are slowly converging on is the need for a spec to
> describe the encoding of Arrow array statistics as Arrow arrays.

+1 (this will be helpful however we decide to transmit statistics)

On Thu, May 23, 2024 at 1:57 PM Antoine Pitrou  wrote:
>
>
> Hi Shoumyo,
>
> The problem with communicating data statistics through schema metadata
> is that it's not compatible with use cases where you want to know the
> schema *before* the data is produced.

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)
Appreciate the additional context!

> use cases where you want to know the schema *before*
> the data is produced

I think my understanding aligns with Dewey's on this point.
I guess I'm struggling to imagine a scenario where a query
planner would want the schema but not the statistics. Because
by the time the query engine starts consuming data, the plan
should've already been optimized, which implies that the
statistics should've come earlier at some point (so having
it in the schema wouldn't hurt per se). But please correct
me if I misunderstood.

> It is usually fine but occasionally ends up with schema
> metadata that is lying

This is a totally valid point and I'm definitely aware
of it - there would be an onus on developers to make sure
that they're not plumbing around nonsensical metadata. And
to your point, making the production of statistics opt-in
would make this decision explicit.

I guess the other saving grace is that query optimization
should never affect the *correctness* of a query, only its
performance. However, I can appreciate that it would be
difficult to diagnose a query being slow just because of bad
metadata.

> Technically there is message-level metadata in the IPC
> flatbuffers... That mechanism isn't available from an
> ArrowArrayStream and so it might not help with the specific
> case at hand.

Gotcha. So it sounds like the schema and field metadata are
the only ones available at the "top" level in Arrow IPC
streams or files; glad to know we didn't miss something :)

As mentioned earlier, my understanding is that query
optimization happens in its entirety before the query engine
consumes any actual data. So I believe the schema- and
field-level metadata are the only ones relevant for the
use-case being considered anyway.

Taking a step back -- my thought process was that if there
is a case for transmitting statistics over Arrow IPC, then
it would be nice to have a consistent solution in the C data
interface as well. Using schema metadata just seemed like
one approach that would achieve this goal.

Best,
Shoumyo

From: dev@arrow.apache.org At: 05/23/24 14:16:32 UTC-4:00 To: dev@arrow.apache.org
Subject: Re: [DISCUSS] Statistics through the C data interface

Thanks Shoumyo for bringing this up!

Using a schema to transmit statistics/data-dependent values is also
something we do in GeoParquet (whose schema also finds its way into
pyarrow and the C data interface when reading). It is usually fine but
occasionally ends up with schema metadata that is lying (e.g., when
unifying schemas from multiple files in a dataset, I believe pyarrow
will sometimes assign metadata from one file to the entire dataset
and/or propagate it through projections/filters).

I imagine statistics would be opt-in (i.e., a consumer would have to
explicitly request them), in which case that consumer could possibly
be required to remove them. With the custom format string that was
proposed I think this is unlikely to happen; however, that a consumer
might want to know statistics over IPC too is an excellent point.

> Unless there are other ways of producing stream-level application metadata 
> outside of the schema/field metadata

Technically there is message-level metadata in the IPC flatbuffers,
although I don't believe it is accessible from most IPC readers. That
mechanism isn't available from an ArrowArrayStream and so it might not
help with the specific case at hand.

> nowhere is it mentioned that metadata must be used to determine schema 
> equivalence

I am only familiar with a few implementations, but at least Arrow C++
and nanoarrow have options to ignore metadata and/or nullability
and/or possibly field names (e.g., for a list type) depending on what
type of type/schema equivalence is required.

> use cases where you want to know the schema *before* the data is produced.

I may be understanding it incorrectly, but I think it's generally
possible to emit a schema with metadata before emitting record
batches. I suppose you would have already started downloading the
stream, though.

> I think what we are slowly converging on is the need for a spec to
> describe the encoding of Arrow array statistics as Arrow arrays.

+1 (this will be helpful however we decide to transmit statistics)

On Thu, May 23, 2024 at 1:57 PM Antoine Pitrou  wrote:
>
>
> Hi Shoumyo,
>
> The problem with communicating data statistics through schema metadata
> is that it's not compatible with use cases where you want to know the
> schema *before* the data is produced.
>
> Regards
>
> Antoine.
>
>
> On Thu, 23 May 2024 14:28:43 -
> "Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)"
>  wrote:
> > This is a really exciting development, thank you for putting together this 
> > proposal!
> >
> > It looks like this thread and the linked GitHub issue have lots of input 
> > from folks who work with Arrow at a low level and have better familiarity with 
> > the Arrow specifications than I do, so I'll refrain from commenting on the 
> > technicalities of the proposal

[C++] Thread deadlock in ObjectOutputStream

2024-05-23 Thread Li Jin
Hello,

I am seeing a deadlock when destructing an ObjectOutputStream. I have
attached the stack trace.

I did some debugging and found that the issue seems to be that the mutex in
question is already held by this thread (I checked the __owner field in the
pthread_mutex_t, which points to the hanging thread).

Unfortunately the stack trace doesn’t show exactly which mutex it is trying
to lock. I wonder if someone more familiar with the IO code has ideas about
what the issue might be and where to dig deeper?

Appreciate the help,
Li
Thread 39 (Thread 0xe2199eee700 (LWP 1392) "python3.10"):
#0  __lll_lock_wait (futex=futex@entry=0xe2158016c60, private=0) at 
lowlevellock.c:52
#1  0x0e223fe14843 in __GI___pthread_mutex_lock (mutex=0xe2158016c60) at 
../nptl/pthread_mutex_lock.c:80
#2  0x0e223a4c7be3 in virtual thunk to arrow::fs::(anonymous 
namespace)::ObjectOutputStream::Close() () at 
/build/build/ext/public/apache/arrow/15/0/0/apache-arrow/cpp/src/arrow/status.h:140
#3  0x0e223993eaef in arrow::io::internal::CloseFromDestructor 
(file=file@entry=0xe2158005c80) at 
/build/build/ext/public/apache/arrow/15/0/0/apache-arrow/cpp/src/arrow/io/interfaces.cc:284
#4  0x0e223a4b9250 in arrow::fs::(anonymous 
namespace)::ObjectOutputStream::~ObjectOutputStream (this=0xe2158005b40, 
__in_chrg=, __vtt_parm=) at 
/build/build/ext/public/apache/arrow/15/0/0/apache-arrow/cpp/src/arrow/filesystem/s3fs.cc:1398
#5  __gnu_cxx::new_allocator::destroy (__p=0xe2158005b40, this=0xe2158005b40) at 
/build/build/ext/public/gpl3/gnu/gcc/11/dist/include/c++/11.3.0/ext/new_allocator.h:168
#6  std::allocator_traits >::destroy (__p=0xe2158005b40, __a=...) at 
/build/build/ext/public/gpl3/gnu/gcc/11/dist/include/c++/11.3.0/bits/alloc_traits.h:535
#7  std::_Sp_counted_ptr_inplace, (__gnu_cxx::_Lock_policy)2>::_M_dispose 
(this=0xe2158005b30) at 
/build/build/ext/public/gpl3/gnu/gcc/11/dist/include/c++/11.3.0/bits/shared_ptr_base.h:528
#8  0x0e223c41ddda in 
std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release 
(this=0xe2158005b30) at 
/build/build/ext/public/gpl3/gnu/gcc/11/dist/include/c++/11.3.0/bits/shared_ptr_base.h:168
#9  0x0e223bbb62a8 in 
std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count 
(this=0xe21580f2b40, __in_chrg=) at 
/tmp/build/ext/public/gpl3/gnu/gcc/11/dist/include/c++/11.3.0/bits/shared_ptr_base.h:705
#10 std::__shared_ptr::~__shared_ptr (this=0xe21580f2b38, 
__in_chrg=) at 
/tmp/build/ext/public/gpl3/gnu/gcc/11/dist/include/c++/11.3.0/bits/shared_ptr_base.h:1154
#11 std::shared_ptr::~shared_ptr (this=0xe21580f2b38, 
__in_chrg=) at 
/tmp/build/ext/public/gpl3/gnu/gcc/11/dist/include/c++/11.3.0/bits/shared_ptr.h:122
#12 arrow::dataset::FileWriter::~FileWriter (this=0xe21580f2b10, 
__in_chrg=) at 
/tmp/build/ts/arrow/dataset/c/src/arrow/dataset/file_base.h:378
#13 arrow::dataset::ParquetFileWriter::~ParquetFileWriter (this=0xe21580f2b10, 
__in_chrg=) at 
/tmp/build/ts/arrow/dataset/c/src/arrow/dataset/file_parquet.h:282
#14 arrow::dataset::ParquetFileWriter::~ParquetFileWriter (this=0xe21580f2b10, 
__in_chrg=) at 
/tmp/build/ts/arrow/dataset/c/src/arrow/dataset/file_parquet.h:282
#15 0x0e223c41ddda in 
std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release 
(this=0xe21580b9a10) at 
/build/build/ext/public/gpl3/gnu/gcc/11/dist/include/c++/11.3.0/bits/shared_ptr_base.h:168
#16 0x0e223bae3f1c in 
std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count 
(this=0x5b665b834690, __in_chrg=) at 
/tmp/build/ext/public/gpl3/gnu/gcc/11/dist/include/c++/11.3.0/bits/shared_ptr_base.h:705
#17 std::__shared_ptr::~__shared_ptr (this=0x5b665b834688, 
__in_chrg=) at 
/tmp/build/ext/public/gpl3/gnu/gcc/11/dist/include/c++/11.3.0/bits/shared_ptr_base.h:1154
#18 std::shared_ptr::~shared_ptr 
(this=0x5b665b834688, __in_chrg=) at 
/tmp/build/ext/public/gpl3/gnu/gcc/11/dist/include/c++/11.3.0/bits/shared_ptr.h:122
#19 arrow::dataset::internal::(anonymous 
namespace)::DatasetWriterFileQueue::~DatasetWriterFileQueue 
(this=0x5b665b834670, __in_chrg=) at 
/tmp/build/ts/arrow/dataset/c/src/arrow/dataset/dataset_writer.cc:140
#20 std::default_delete::operator() (__ptr=0x5b665b834670, 
this=) at 
/tmp/build/ext/public/gpl3/gnu/gcc/11/dist/include/c++/11.3.0/bits/unique_ptr.h:85
#21 std::unique_ptr >::~unique_ptr (this=0x5b665b833800, 
__in_chrg=) at 
/tmp/build/ext/public/gpl3/gnu/gcc/11/dist/include/c++/11.3.0/bits/unique_ptr.h:361
#22 ~ (this=0x5b665b8337f8, __in_chrg=) at 
/tmp/build/ts/arrow/dataset/c/src/arrow/dataset/dataset_writer.cc:364
#23 
arrow::internal::FnOnce::FnImpl >::~FnImpl (this=0x5b665b8337f0, __in_chrg=) at 
/tmp/build/ext/public/apache/arrow/15/0/0/dist/include/arrow/util/functional.h:150
#24 
arrow::internal::FnOnce::FnImpl >::~FnImpl(void) (this=0x5b665b8337f0, 
__in_chrg=) at 
/tmp/build/ext/public/apache/arrow/15/0/0/dist/include/arrow/util/functional.h:150
#25 0x0e223995cd6e in 
std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release 
(this=0x5b665

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Dewey Dunnington
Thanks Shoumyo for bringing this up!

Using a schema to transmit statistics/data-dependent values is also
something we do in GeoParquet (whose schema also finds its way into
pyarrow and the C data interface when reading). It is usually fine but
occasionally ends up with schema metadata that is lying (e.g., when
unifying schemas from multiple files in a dataset, I believe pyarrow
will sometimes assign metadata from one file to the entire dataset
and/or propagate it through projections/filters).
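
For instance, a rough sketch of how that metadata surfaces on the reading
side (the file path is hypothetical; "geo" is the metadata key GeoParquet
uses, and "primary_column"/"bbox" are fields of its JSON payload):

  import json
  import pyarrow.parquet as pq

  # GeoParquet stores its metadata under the "geo" key of the Parquet file
  # metadata; pyarrow surfaces it as Arrow schema metadata when reading.
  schema = pq.read_schema("example.parquet")  # hypothetical file
  geo = json.loads(schema.metadata[b"geo"])

  # Data-dependent values such as the bounding box then travel with the
  # schema through pyarrow and the C data interface.
  print(geo["columns"][geo["primary_column"]].get("bbox"))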

I imagine statistics would be opt-in (i.e., a consumer would have to
explicitly request them), in which case that consumer could possibly
be required to remove them. With the custom format string that was
proposed I think this is unlikely to happen; however, that a consumer
might want to know statistics over IPC too is an excellent point.

> Unless there are other ways of producing stream-level application metadata 
> outside of the schema/field metadata

Technically there is message-level metadata in the IPC flatbuffers,
although I don't believe it is accessible from most IPC readers. That
mechanism isn't available from an ArrowArrayStream and so it might not
help with the specific case at hand.

> nowhere is it mentioned that metadata must be used to determine schema 
> equivalence

I am only familiar with a few implementations, but at least Arrow C++
and nanoarrow have options to ignore metadata and/or nullability
and/or possibly field names (e.g., for a list type) depending on what
type of type/schema equivalence is required.
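
In pyarrow, for instance, that choice is an explicit flag on schema
comparison; a tiny sketch (the metadata key here is made up):

  import pyarrow as pa

  base = pa.schema([pa.field("x", pa.int32())])
  with_stats = base.with_metadata({"example.statistics": "..."})  # hypothetical key

  # Metadata-sensitive vs. metadata-insensitive equivalence.
  assert not base.equals(with_stats, check_metadata=True)
  assert base.equals(with_stats, check_metadata=False)  # the default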

> use cases where you want to know the schema *before* the data is produced.

I may be understanding it incorrectly, but I think it's generally
possible to emit a schema with metadata before emitting record
batches. I suppose you would have already started downloading the
stream, though.
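
A minimal IPC-stream sketch of that ordering, with a made-up metadata key
standing in for whatever statistics encoding ends up being chosen:

  import pyarrow as pa
  import pyarrow.ipc as ipc

  schema = pa.schema([pa.field("x", pa.int64())]).with_metadata(
      {"example.statistics": '{"row_count": 2}'}  # hypothetical key/payload
  )

  sink = pa.BufferOutputStream()
  with ipc.new_stream(sink, schema) as writer:  # schema (and metadata) goes out first
      writer.write_batch(pa.record_batch([pa.array([1, 2])], schema=schema))

  # A reader sees the metadata as soon as it opens the stream, before it has
  # consumed any record batch.
  reader = ipc.open_stream(sink.getvalue())
  print(reader.schema.metadata)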

> I think what we are slowly converging on is the need for a spec to
> describe the encoding of Arrow array statistics as Arrow arrays.

+1 (this will be helpful however we decide to transmit statistics)

On Thu, May 23, 2024 at 1:57 PM Antoine Pitrou  wrote:
>
>
> Hi Shoumyo,
>
> The problem with communicating data statistics through schema metadata
> is that it's not compatible with use cases where you want to know the
> schema *before* the data is produced.
>
> Regards
>
> Antoine.
>
>
> On Thu, 23 May 2024 14:28:43 -
> "Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)"
>  wrote:
> > This is a really exciting development, thank you for putting together this 
> > proposal!
> >
> > It looks like this thread and the linked GitHub issue have lots of input 
> > from folks who work with Arrow at a low level and have better familiarity 
> > with the Arrow specifications than I do, so I'll refrain from commenting on 
> > the technicalities of the proposal. I would, however, like to share my 
> > perspective as an application developer that heavily uses Arrow at higher 
> > levels for composing data systems.
> >
> > My main concern with the direction of this proposal is that it seems too 
> > narrowly focused on what the integration with DuckDB will look like (how 
> > the statistics can be fed into DuckDB). In many applications, executing the 
> > query is often the "last mile", and it's important to consider where the 
> > statistics will actually come from. To start, data might be sourced in 
> > various manners:
> >
> > - Arrow IPC files may be mapped from shared memory
> > - Arrow IPC streams may be received via some RPC framework (à la Flight)
> > - The Arrow libraries may be used to read from file formats like Parquet or 
> > CSV
> > - ADBC drivers may be used to read from databases
> >
> > Note that in at least the first two cases, the system _executing the query_ 
> > will not be able to provide statistics simply because it is not actually 
> > the data producer. As an example, if Process A writes an Arrow IPC file to 
> > shared memory, and Process B wants to run a query on it -- how is Process B 
> > supposed to get the statistics for query planning? There are a few 
> > approaches that I anticipate application developers might consider:
> >
> > 1. Design an out-of-band mechanism for Process B to fetch statistics from 
> > Process A.
> > 2. Design an encoding that is a superset of Arrow IPC and includes 
> > statistics information, allowing statistics to be communicated in-band.
> > 3. Use custom schema metadata to communicate statistics in-band.
> >
> > Options 1 and 2 require considerably more effort than Option 3. Also, 
> > Option 3 feels somewhat natural because it makes sense for the statistics 
> > to come with the data (similar to how statistics are embedded in Parquet 
> > files). In some sense, the statistics actually *are* a property of the 
> > stream.
> >
> > In systems that I work on, we already use schema metadata to communicate 
> > information that is unrelated to the structure of the data. From my reading 
> > of the documentation [1], this sounds like a reasonable (and perhaps 
> > int

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Antoine Pitrou


Hi Shoumyo,

The problem with communicating data statistics through schema metadata
is that it's not compatible with use cases where you want to know the
schema *before* the data is produced.

Regards

Antoine.


On Thu, 23 May 2024 14:28:43 -
"Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)"
 wrote:
> This is a really exciting development, thank you for putting together this 
> proposal!
> 
> It looks like this thread and the linked GitHub issue have lots of input from 
> folks who work with Arrow at a low level and have better familiarity with the 
> Arrow specifications than I do, so I'll refrain from commenting on the 
> technicalities of the proposal. I would, however, like to share my 
> perspective as an application developer that heavily uses Arrow at higher 
> levels for composing data systems.
> 
> My main concern with the direction of this proposal is that it seems too 
> narrowly focused on what the integration with DuckDB will look like (how the 
> statistics can be fed into DuckDB). In many applications, executing the query 
> is often the "last mile", and it's important to consider where the statistics 
> will actually come from. To start, data might be sourced in various manners:
> 
> - Arrow IPC files may be mapped from shared memory
> - Arrow IPC streams may be received via some RPC framework (à la Flight)
> - The Arrow libraries may be used to read from file formats like Parquet or 
> CSV
> - ADBC drivers may be used to read from databases
> 
> Note that in at least the first two cases, the system _executing the query_ 
> will not be able to provide statistics simply because it is not actually the 
> data producer. As an example, if Process A writes an Arrow IPC file to shared 
> memory, and Process B wants to run a query on it -- how is Process B supposed 
> to get the statistics for query planning? There are a few approaches that I 
> anticipate application developers might consider:
> 
> 1. Design an out-of-band mechanism for Process B to fetch statistics from 
> Process A.
> 2. Design an encoding that is a superset of Arrow IPC and includes statistics 
> information, allowing statistics to be communicated in-band.
> 3. Use custom schema metadata to communicate statistics in-band.
> 
> Options 1 and 2 require considerably more effort than Option 3. Also, Option 
> 3 feels somewhat natural because it makes sense for the statistics to come 
> with the data (similar to how statistics are embedded in Parquet files). In 
> some sense, the statistics actually *are* a property of the stream.
> 
> In systems that I work on, we already use schema metadata to communicate 
> information that is unrelated to the structure of the data. From my reading 
> of the documentation [1], this sounds like a reasonable (and perhaps 
> intended?) use of metadata, and nowhere is it mentioned that metadata must be 
> used to determine schema equivalence. Unless there are other ways of 
> producing stream-level application metadata outside of the schema/field 
> metadata, the lack of purity was not a concern for me to begin with.
> 
> I would appreciate an approach that communicates statistics via schema 
> metadata, or at least in some in-band fashion that is consistent across the 
> IPC and C data specifications. This would make it much easier to uniformly 
> and transparently plumb statistics through applications, regardless of where 
> they source Arrow data from. As developers are likely to create bespoke 
> conventions for this anyways, it seems reasonable to standardize it as 
> canonical metadata.
> 
> I say this all as a happy user of DuckDB's Arrow scan functionality that is 
> excited to see better query optimization capabilities. It's just that, in its 
> current form, the changes in this proposal are not something I could 
> foreseeably integrate with.
> 
> Best,
> Shoumyo
> 
> [1]: 
> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
> 
> From: dev@arrow.apache.org At: 05/23/24 10:10:51 UTC-4:00 To: dev@arrow.apache.org
> Subject: Re: [DISCUSS] Statistics through the C data interface
> 
> I want to +1 on what Dewey is saying here and some comments.
> 
> Sutou Kouhei wrote:
> > ADBC may be a bit larger to use only for transmitting statistics. ADBC has
> > statistics related APIs but it has more other APIs.
> 
> It's impossible to keep the responsibility of communication protocols
> cleanly separated, but IMO, we should strive to keep the C Data
> Interface more of a Transport Protocol than an Application Protocol.
> 
> Statistics are application dependent and can complicate the
> implementation of importers/exporters which would hinder the adoption
> of the C Data Interface. Statistics also bring in security concerns
> that are application-specific. e.g. can an algorithm trust min/max
> stats and risk producing incorrect results if the statistics are
> incorrect? A question that can't really be answered at the C Data
> Interface level.
> 
> The need for more sophisticated statistics only grows with time, so
> there is no such thing as a "simple statistics schema".

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)
This is a really exciting development, thank you for putting together this 
proposal!

It looks like this thread and the linked GitHub issue have lots of input from 
folks who work with Arrow at a low level and have better familiarity with the 
Arrow specifications than I do, so I'll refrain from commenting on the 
technicalities of the proposal. I would, however, like to share my perspective 
as an application developer that heavily uses Arrow at higher levels for 
composing data systems.

My main concern with the direction of this proposal is that it seems too 
narrowly focused on what the integration with DuckDB will look like (how the 
statistics can be fed into DuckDB). In many applications, executing the query 
is often the "last mile", and it's important to consider where the statistics 
will actually come from. To start, data might be sourced in various manners:

- Arrow IPC files may be mapped from shared memory
- Arrow IPC streams may be received via some RPC framework (à la Flight)
- The Arrow libraries may be used to read from file formats like Parquet or CSV
- ADBC drivers may be used to read from databases

Note that in at least the first two cases, the system _executing the query_ 
will not be able to provide statistics simply because it is not actually the 
data producer. As an example, if Process A writes an Arrow IPC file to shared 
memory, and Process B wants to run a query on it -- how is Process B supposed 
to get the statistics for query planning? There are a few approaches that I 
anticipate application developers might consider:

1. Design an out-of-band mechanism for Process B to fetch statistics from 
Process A.
2. Design an encoding that is a superset of Arrow IPC and includes statistics 
information, allowing statistics to be communicated in-band.
3. Use custom schema metadata to communicate statistics in-band.

Options 1 and 2 require considerably more effort than Option 3. Also, Option 3 
feels somewhat natural because it makes sense for the statistics to come with 
the data (similar to how statistics are embedded in Parquet files). In some 
sense, the statistics actually *are* a property of the stream.
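
To make Option 3 concrete, here is a minimal sketch of what it could look
like (the metadata key and the JSON shape are made up for illustration, not
an existing convention):

  import json
  import pyarrow as pa

  batch = pa.RecordBatch.from_pydict({"id": [1, 2, 3]})
  stats = {"row_count": 3, "columns": {"id": {"min": 1, "max": 3}}}  # hypothetical shape

  # Attach the statistics as custom schema metadata so they travel in-band
  # with the data, over IPC and the C data interface alike.
  batch = batch.replace_schema_metadata({"example.statistics": json.dumps(stats)})
  print(batch.schema.metadata)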

In systems that I work on, we already use schema metadata to communicate 
information that is unrelated to the structure of the data. From my reading of 
the documentation [1], this sounds like a reasonable (and perhaps intended?) 
use of metadata, and nowhere is it mentioned that metadata must be used to 
determine schema equivalence. Unless there are other ways of producing 
stream-level application metadata outside of the schema/field metadata, the 
lack of purity was not a concern for me to begin with.

I would appreciate an approach that communicates statistics via schema 
metadata, or at least in some in-band fashion that is consistent across the IPC 
and C data specifications. This would make it much easier to uniformly and 
transparently plumb statistics through applications, regardless of where they 
source Arrow data from. As developers are likely to create bespoke conventions 
for this anyways, it seems reasonable to standardize it as canonical metadata.

I say this all as a happy user of DuckDB's Arrow scan functionality that is 
excited to see better query optimization capabilities. It's just that, in its 
current form, the changes in this proposal are not something I could 
foreseeably integrate with.

Best,
Shoumyo

[1]: 
https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata

From: dev@arrow.apache.org At: 05/23/24 10:10:51 UTC-4:00 To: dev@arrow.apache.org
Subject: Re: [DISCUSS] Statistics through the C data interface

I want to +1 on what Dewey is saying here and some comments.

Sutou Kouhei wrote:
> ADBC may be a bit larger to use only for transmitting statistics. ADBC has 
statistics related APIs but it has more other APIs.

It's impossible to keep the responsibility of communication protocols
cleanly separated, but IMO, we should strive to keep the C Data
Interface more of a Transport Protocol than an Application Protocol.

Statistics are application dependent and can complicate the
implementation of importers/exporters which would hinder the adoption
of the C Data Interface. Statistics also bring in security concerns
that are application-specific. e.g. can an algorithm trust min/max
stats and risk producing incorrect results if the statistics are
incorrect? A question that can't really be answered at the C Data
Interface level.

The need for more sophisticated statistics only grows with time, so
there is no such thing as a "simple statistics schema".

Protocols that produce/consume statistics might want to use the C Data
Interface as a primitive for passing Arrow arrays of statistics.

ADBC might be too big of a leap in complexity now, but "we just need C
Data Interface + statistics" is unlikely to remain true for very long
as projects grow in complexity.

--
Felipe

On Thu, May 23, 2024 at 9:57 AM Dewey Dunnington
 wrote:
>
> Than

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Antoine Pitrou



Le 23/05/2024 à 16:09, Felipe Oliveira Carvalho a écrit :


Protocols that produce/consume statistics might want to use the C Data
Interface as a primitive for passing Arrow arrays of statistics.


This is also my opinion.

I think what we are slowly converging on is the need for a spec to 
describe the encoding of Arrow array statistics as Arrow arrays.
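
For illustration only, here is a sketch of one possible encoding in that
spirit, reusing the column_name / statistic_key / statistic_value /
statistic_is_approximate layout Sutou Kouhei proposed earlier in the thread
(pyarrow is used purely as a convenient way to spell out the arrays; nothing
here is a settled spec):

  import pyarrow as pa

  value_type = pa.dense_union([
      pa.field("int64", pa.int64()),
      pa.field("uint64", pa.uint64()),
      pa.field("float64", pa.float64()),
      pa.field("binary", pa.binary()),
  ])
  statistics_schema = pa.schema([
      pa.field("column_name", pa.utf8()),                    # null => table-level statistic
      pa.field("statistic_key", pa.utf8(), nullable=False),  # e.g. "row_count", "max"
      pa.field("statistic_value", value_type, nullable=False),
      pa.field("statistic_is_approximate", pa.bool_(), nullable=False),
  ])

  # Two rows: a table-level row count and a per-column maximum, both stored
  # in the int64 child (type code 0) of the dense union.
  values = pa.UnionArray.from_dense(
      pa.array([0, 0], type=pa.int8()),   # type codes
      pa.array([0, 1], type=pa.int32()),  # offsets into the int64 child
      [
          pa.array([1_000_000, 42], type=pa.int64()),
          pa.array([], type=pa.uint64()),
          pa.array([], type=pa.float64()),
          pa.array([], type=pa.binary()),
      ],
      field_names=["int64", "uint64", "float64", "binary"],
  )
  statistics = pa.record_batch(
      [
          pa.array([None, "id"], type=pa.utf8()),
          pa.array(["row_count", "max"], type=pa.utf8()),
          values,
          pa.array([False, False]),
      ],
      schema=statistics_schema,
  )

A batch like this can then be handed to a consumer through whatever producer
API is chosen, separately from the data itself.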


Regards

Antoine.


Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Felipe Oliveira Carvalho
I want to +1 on what Dewey is saying here and some comments.

Sutou Kouhei wrote:
> ADBC may be a bit larger to use only for transmitting statistics. ADBC has 
> statistics related APIs but it has more other APIs.

It's impossible to keep the responsibility of communication protocols
cleanly separated, but IMO, we should strive to keep the C Data
Interface more of a Transport Protocol than an Application Protocol.

Statistics are application dependent and can complicate the
implementation of importers/exporters which would hinder the adoption
of the C Data Interface. Statistics also bring in security concerns
that are application-specific. e.g. can an algorithm trust min/max
stats and risk producing incorrect results if the statistics are
incorrect? A question that can't really be answered at the C Data
Interface level.

The need for more sophisticated statistics only grows with time, so
there is no such thing as a "simple statistics schema".

Protocols that produce/consume statistics might want to use the C Data
Interface as a primitive for passing Arrow arrays of statistics.
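
For instance (the layout below is illustrative only, and assumes a pyarrow
recent enough to implement the Arrow PyCapsule protocol, i.e. >= 14.0):

  import pyarrow as pa

  # A toy statistics batch: a single table-level row count.
  statistics = pa.RecordBatch.from_pydict({
      "column_name": pa.array([None], type=pa.utf8()),
      "statistic_key": pa.array(["row_count"], type=pa.utf8()),
      "statistic_value": pa.array([3], type=pa.int64()),
      "statistic_is_approximate": pa.array([False]),
  })

  # Because this is ordinary Arrow data, the C Data Interface can carry it as
  # a separate ArrowSchema/ArrowArray pair alongside the data stream; any
  # consumer imports the pair with its usual C-data import routine.
  schema_capsule, array_capsule = statistics.__arrow_c_array__()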

ADBC might be too big of a leap in complexity now, but "we just need C
Data Interface + statistics" is unlikely to remain true for very long
as projects grow in complexity.

--
Felipe

On Thu, May 23, 2024 at 9:57 AM Dewey Dunnington
 wrote:
>
> Thank you for the background! I understand that these statistics are
> important for query planning; however, I am not sure that I follow why
> we are constrained to the ArrowSchema to represent them. The examples
> given seem to be going through Python...would it be easier to request
> statistics at a higher level of abstraction? There would already need
> to be a separate mechanism to request an ArrowArrayStream with
> statistics (unless the PyCapsule `requested_schema` argument would
> suffice).
>
> > ADBC may be a bit larger to use only for transmitting
> > statistics. ADBC has statistics related APIs but it has more
> > other APIs.
>
> Some examples of producers given in the linked threads (Delta Lake,
> Arrow Dataset) are well-suited to being wrapped by an ADBC driver. One
> can implement an ADBC driver without defining all the methods (where
> the producer could call AdbcConnectionGetStatistics(), although
> AdbcStatementGetStatistics() might be more relevant here and doesn't
> exist). One example listed (using an Arrow Table as a source) seems a
> bit light to wrap in an ADBC driver; however, it would not take much
> code to do so, and the overhead of getting the reader via ADBC is
> something like 100 microseconds (tested via the ADBC R package's
> "monkey driver" which wraps an existing stream as a statement). In any
> case, the bulk of the code is building the statistics array.
>
> > How about the following schema for the
> > statistics ArrowArray? It's based on ADBC.
>
> Whatever format for statistics is decided on, I imagine it should be
> exactly the same as the ADBC standard? (Perhaps pushing changes
> upstream if needed?).
>
> On Thu, May 23, 2024 at 3:21 AM Sutou Kouhei  wrote:
> >
> > Hi,
> >
> > > Why not simply pass the statistics ArrowArray separately in your
> > > producer API of choice
> >
> > It seems that we should use the approach because all
> > feedback said so. How about the following schema for the
> > statistics ArrowArray? It's based on ADBC.
> >
> > | Field Name   | Field Type| Comments |
> > |--|---|  |
> > | column_name  | utf8  | (1)  |
> > | statistic_key| utf8 not null | (2)  |
> > | statistic_value  | VALUE_SCHEMA not null |  |
> > | statistic_is_approximate | bool not null | (3)  |
> >
> > 1. If null, then the statistic applies to the entire table.
> >It's for "row_count".
> > 2. We'll provide pre-defined keys such as "max", "min",
> >"byte_width" and "distinct_count" but users can also use
> >application specific keys.
> > 3. If true, then the value is approximate or best-effort.
> >
> > VALUE_SCHEMA is a dense union with members:
> >
> > | Field Name | Field Type |
> > |||
> > | int64  | int64  |
> > | uint64 | uint64 |
> > | float64| float64|
> > | binary | binary |
> >
> > If a column is an int32 column, it uses int64 for
> > "max"/"min". We don't provide all types here. Users should
> > use a compatible type (int64 for a int32 column) instead.
> >
> >
> > Thanks,
> > --
> > kou
> >
> > In 
> >   "Re: [DISCUSS] Statistics through the C data interface" on Wed, 22 May 
> > 2024 17:04:57 +0200,
> >   Antoine Pitrou  wrote:
> >
> > >
> > > Hi Kou,
> > >
> > > I agree with Dewey that this is overstretching the capabilities of the
> > > C Data Interface. In particular, stuffing a pointer as metadata value
> > > and decreeing it immortal doesn't sound like a good design decision.
> > >
> > > Why not simply pass the statistics ArrowArray separately in your
> > > producer API of choice (Dewey mentioned ADBC but it is of course just
> > > a possible API among others)?

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Curt Hagenlocher
> would it be easier to request statistics at a higher level of
> abstraction?

What if there were a "single table provider" level of abstraction between
ADBC and ArrowArrayStream as a C API; something that can report statistics
and apply simple predicates?

On Thu, May 23, 2024 at 5:57 AM Dewey Dunnington
 wrote:

> Thank you for the background! I understand that these statistics are
> important for query planning; however, I am not sure that I follow why
> we are constrained to the ArrowSchema to represent them. The examples
> given seem to be going through Python...would it be easier to request
> statistics at a higher level of abstraction? There would already need
> to be a separate mechanism to request an ArrowArrayStream with
> statistics (unless the PyCapsule `requested_schema` argument would
> suffice).
>
> > ADBC may be a bit larger to use only for transmitting
> > statistics. ADBC has statistics related APIs but it has more
> > other APIs.
>
> Some examples of producers given in the linked threads (Delta Lake,
> Arrow Dataset) are well-suited to being wrapped by an ADBC driver. One
> can implement an ADBC driver without defining all the methods (where
> the producer could call AdbcConnectionGetStatistics(), although
> AdbcStatementGetStatistics() might be more relevant here and doesn't
> exist). One example listed (using an Arrow Table as a source) seems a
> bit light to wrap in an ADBC driver; however, it would not take much
> code to do so, and the overhead of getting the reader via ADBC is
> something like 100 microseconds (tested via the ADBC R package's
> "monkey driver" which wraps an existing stream as a statement). In any
> case, the bulk of the code is building the statistics array.
>
> > How about the following schema for the
> > statistics ArrowArray? It's based on ADBC.
>
> Whatever format for statistics is decided on, I imagine it should be
> exactly the same as the ADBC standard? (Perhaps pushing changes
> upstream if needed?).
>
> On Thu, May 23, 2024 at 3:21 AM Sutou Kouhei  wrote:
> >
> > Hi,
> >
> > > Why not simply pass the statistics ArrowArray separately in your
> > > producer API of choice
> >
> > It seems that we should use the approach because all
> > feedback said so. How about the following schema for the
> > statistics ArrowArray? It's based on ADBC.
> >
> > | Field Name   | Field Type| Comments |
> > |--|---|  |
> > | column_name  | utf8  | (1)  |
> > | statistic_key| utf8 not null | (2)  |
> > | statistic_value  | VALUE_SCHEMA not null |  |
> > | statistic_is_approximate | bool not null | (3)  |
> >
> > 1. If null, then the statistic applies to the entire table.
> >It's for "row_count".
> > 2. We'll provide pre-defined keys such as "max", "min",
> >"byte_width" and "distinct_count" but users can also use
> >application specific keys.
> > 3. If true, then the value is approximate or best-effort.
> >
> > VALUE_SCHEMA is a dense union with members:
> >
> > | Field Name | Field Type |
> > |||
> > | int64  | int64  |
> > | uint64 | uint64 |
> > | float64| float64|
> > | binary | binary |
> >
> > If a column is an int32 column, it uses int64 for
> > "max"/"min". We don't provide all types here. Users should
> > use a compatible type (int64 for a int32 column) instead.
> >
> >
> > Thanks,
> > --
> > kou
> >
> > In 
> >   "Re: [DISCUSS] Statistics through the C data interface" on Wed, 22 May
> 2024 17:04:57 +0200,
> >   Antoine Pitrou  wrote:
> >
> > >
> > > Hi Kou,
> > >
> > > I agree with Dewey that this is overstretching the capabilities of the
> > > C Data Interface. In particular, stuffing a pointer as metadata value
> > > and decreeing it immortal doesn't sound like a good design decision.
> > >
> > > Why not simply pass the statistics ArrowArray separately in your
> > > producer API of choice (Dewey mentioned ADBC but it is of course just
> > > a possible API among others)?
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > Le 22/05/2024 à 04:37, Sutou Kouhei a écrit :
> > >> Hi,
> > >> We're discussing how to provide statistics through the C
> > >> data interface at:
> > >> https://github.com/apache/arrow/issues/38837
> > >> If you're interested in this feature, could you share your
> > >> comments?
> > >> Motivation:
> > >> We can interchange Apache Arrow data by the C data interface
> > >> in the same process. For example, we can pass Apache Arrow
> > >> data read by Apache Arrow C++ (provider) to DuckDB
> > >> (consumer) through the C data interface.
> > >> A provider may know Apache Arrow data statistics. For
> > >> example, a provider can know statistics when it reads Apache
> > >> Parquet data because Apache Parquet may provide statistics.
> > >> But a consumer can't know statistics that are known by a
> > >> producer. Because there isn't a standard way to provide
> > >> statistics through the C data interface.

Re: [VOTE] Release Apache Arrow ADBC 12 - RC4

2024-05-23 Thread Dewey Dunnington
The adbcdrivermanager, adbcsqlite, and adbcpostgresql packages are all
updated on CRAN!

On Tue, May 21, 2024 at 10:41 PM David Li  wrote:
>
> [x] Close the GitHub milestone/project
> [x] Add the new release to the Apache Reporter System
> [x] Upload source release artifacts to Subversion
> [x] Create the final GitHub release
> [x] Update website
> [x] Upload wheels/sdist to PyPI
> [x] Publish Maven packages
> [x] Update tags for Go modules
> [x] Deploy APT/Yum repositories
> [ ] Update R packages
> [x] Upload Ruby packages to RubyGems
> [x] Upload C#/.NET packages to NuGet
> [x] Update conda-forge packages
> [x] Announce the new release
> [x] Remove old artifacts
> [x] Bump versions
> [IN PROGRESS] Publish release blog post [2]
>
> @Dewey, I'd appreciate your help as always with the R packages :)
>
> [1]: https://github.com/apache/arrow-site/pull/523
>
> On Tue, May 21, 2024, at 09:00, Sutou Kouhei wrote:
> > +1 (binding)
> >
> > I ran the following on Debian GNU/Linux sid:
> >
> >   TEST_DEFAULT=0 \
> > TEST_SOURCE=1 \
> > LANG=C \
> > TZ=UTC \
> > JAVA_HOME=/usr/lib/jvm/default-java \
> > dev/release/verify-release-candidate.sh 12 4
> >
> >   TEST_DEFAULT=0 \
> > TEST_APT=1 \
> > LANG=C \
> > dev/release/verify-release-candidate.sh 12 4
> >
> >   TEST_DEFAULT=0 \
> > TEST_BINARY=1 \
> > LANG=C \
> > dev/release/verify-release-candidate.sh 12 4
> >
> >   TEST_DEFAULT=0 \
> > TEST_JARS=1 \
> > LANG=C \
> > dev/release/verify-release-candidate.sh 12 4
> >
> >   TEST_DEFAULT=0 \
> > TEST_WHEELS=1 \
> > TEST_PYTHON_VERSIONS=3.11 \
> > LANG=C \
> > TZ=UTC \
> > dev/release/verify-release-candidate.sh 12 4
> >
> >   TEST_DEFAULT=0 \
> > TEST_YUM=1 \
> > LANG=C \
> > dev/release/verify-release-candidate.sh 12 4
> >
> > with:
> >
> >   * g++ (Debian 13.2.0-23) 13.2.0
> >   * go version go1.22.2 linux/amd64
> >   * openjdk version "17.0.11" 2024-04-16
> >   * Python 3.11.9
> >   * ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux-gnu]
> >   * R version 4.3.3 (2024-02-29) -- "Angel Food Cake"
> >   * Apache Arrow 17.0.0-SNAPSHOT
> >
> > Note:
> >
> > I needed to install arrow-glib-devel explicitly to verify
> > Yum repository:
> >
> > 
> > diff --git a/dev/release/verify-yum.sh b/dev/release/verify-yum.sh
> > index f7f023611..ff30176f1 100755
> > --- a/dev/release/verify-yum.sh
> > +++ b/dev/release/verify-yum.sh
> > @@ -170,6 +170,7 @@ echo "::endgroup::"
> >
> >  echo "::group::Test ADBC Arrow GLib"
> >
> > +${install_command} --enablerepo=epel arrow-glib-devel
> >  ${install_command} --enablerepo=epel adbc-arrow-glib-devel-${package_version}
> >  ${install_command} --enablerepo=epel adbc-arrow-glib-doc-${package_version}
> >
> > 
> >
> > adbc-arrow-glib-devel depends on "pkgconfig(arrow-glib)", and
> > libarrow-glib-devel provided by EPEL also provides it:
> >
> > $ sudo dnf repoquery --deplist adbc-arrow-glib-devel-12
> > Last metadata expiration check: 2:01:21 ago on Mon May 20 21:17:44 2024.
> > package: adbc-arrow-glib-devel-12-1.el9.x86_64
> > ...
> >   dependency: pkgconfig(arrow-glib)
> >provider: arrow-glib-devel-16.1.0-1.el9.x86_64
> >provider: libarrow-glib-devel-9.0.0-11.el9.x86_64
> > ...
> >
> >
> > If I don't install arrow-glib-devel explicitly,
> > libarrow-glib-devel may be installed. We may need to add
> > "Conflicts: libarrow-glib-devel" to Apache Arrow's
> > arrow-glib-devel to resolve this case automatically. Anyway,
> > this is not an ADBC problem, so it's not a blocker.
> >
> >
> >
> > Thanks,
> > --
> > kou
> >
> >
> > In 
> >   "[VOTE] Release Apache Arrow ADBC 12 - RC4" on Wed, 15 May 2024
> > 14:00:33 +0900,
> >   "David Li"  wrote:
> >
> >> Hello,
> >>
> >> I would like to propose the following release candidate (RC4) of Apache 
> >> Arrow ADBC version 12. This is a release consisting of 56 resolved GitHub 
> >> issues [1].
> >>
> >> Please note that the versioning scheme has changed.  This is the 12th 
> >> release of ADBC, and so is called version "12".  The subcomponents, 
> >> however, are versioned independently:
> >>
> >> - C/C++/GLib/Go/Python/Ruby: 1.0.0
> >> - C#: 0.12.0
> >> - Java: 0.12.0
> >> - R: 0.12.0
> >> - Rust: 0.12.0
> >>
> >> These are the versions you will see in the source and in actual packages.  
> >> The next release will be "13", and the subcomponents will increment their 
> >> versions independently (to either 1.1.0, 0.13.0, or 1.0.0).  At this 
> >> point, there is no plan to release subcomponents independently from the 
> >> project as a whole.
> >>
> >> Please note that there is a known issue when using the Flight SQL and 
> >> Snowflake drivers at the same time on x86_64 macOS [12].
> >>
> >> This release candidate is based on commit: 
> >> 50cb9de621c4d72f4aefd18237cb4b73b82f4a0e [2]
> >>
> >> The source release rc4 is hosted at [3].
> >> The binary artifacts are hosted at [4][5][6][7][8].
> >> The changelog is located at [9].
> >>
> >> 

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Dewey Dunnington
Thank you for the background! I understand that these statistics are
important for query planning; however, I am not sure that I follow why
we are constrained to the ArrowSchema to represent them. The examples
given seem to be going through Python...would it be easier to request
statistics at a higher level of abstraction? There would already need
to be a separate mechanism to request an ArrowArrayStream with
statistics (unless the PyCapsule `requested_schema` argument would
suffice).

> ADBC may be a bit larger to use only for transmitting
> statistics. ADBC has statistics related APIs but it has more
> other APIs.

Some examples of producers given in the linked threads (Delta Lake,
Arrow Dataset) are well-suited to being wrapped by an ADBC driver. One
can implement an ADBC driver without defining all the methods (where
the producer could call AdbcConnectionGetStatistics(), although
AdbcStatementGetStatistics() might be more relevant here and doesn't
exist). One example listed (using an Arrow Table as a source) seems a
bit light to wrap in an ADBC driver; however, it would not take much
code to do so, and the overhead of getting the reader via ADBC is
something like 100 microseconds (tested via the ADBC R package's
"monkey driver" which wraps an existing stream as a statement). In any
case, the bulk of the code is building the statistics array.

> How about the following schema for the
> statistics ArrowArray? It's based on ADBC.

Whatever format for statistics is decided on, I imagine it should be
exactly the same as the ADBC standard? (Perhaps pushing changes
upstream if needed?).

On Thu, May 23, 2024 at 3:21 AM Sutou Kouhei  wrote:
>
> Hi,
>
> > Why not simply pass the statistics ArrowArray separately in your
> > producer API of choice
>
> It seems that we should use the approach because all
> feedback said so. How about the following schema for the
> statistics ArrowArray? It's based on ADBC.
>
> | Field Name   | Field Type| Comments |
> |--|---|  |
> | column_name  | utf8  | (1)  |
> | statistic_key| utf8 not null | (2)  |
> | statistic_value  | VALUE_SCHEMA not null |  |
> | statistic_is_approximate | bool not null | (3)  |
>
> 1. If null, then the statistic applies to the entire table.
>It's for "row_count".
> 2. We'll provide pre-defined keys such as "max", "min",
>"byte_width" and "distinct_count" but users can also use
>application specific keys.
> 3. If true, then the value is approximate or best-effort.
>
> VALUE_SCHEMA is a dense union with members:
>
> | Field Name | Field Type |
> |||
> | int64  | int64  |
> | uint64 | uint64 |
> | float64| float64|
> | binary | binary |
>
> If a column is an int32 column, it uses int64 for
> "max"/"min". We don't provide all types here. Users should
> use a compatible type (int64 for a int32 column) instead.
>
>
> Thanks,
> --
> kou
>
> In 
>   "Re: [DISCUSS] Statistics through the C data interface" on Wed, 22 May 2024 
> 17:04:57 +0200,
>   Antoine Pitrou  wrote:
>
> >
> > Hi Kou,
> >
> > I agree with Dewey that this is overstretching the capabilities of the
> > C Data Interface. In particular, stuffing a pointer as metadata value
> > and decreeing it immortal doesn't sound like a good design decision.
> >
> > Why not simply pass the statistics ArrowArray separately in your
> > producer API of choice (Dewey mentioned ADBC but it is of course just
> > a possible API among others)?
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 22/05/2024 à 04:37, Sutou Kouhei a écrit :
> >> Hi,
> >> We're discussing how to provide statistics through the C
> >> data interface at:
> >> https://github.com/apache/arrow/issues/38837
> >> If you're interested in this feature, could you share your
> >> comments?
> >> Motivation:
> >> We can interchange Apache Arrow data by the C data interface
> >> in the same process. For example, we can pass Apache Arrow
> >> data read by Apache Arrow C++ (provider) to DuckDB
> >> (consumer) through the C data interface.
> >> A provider may know Apache Arrow data statistics. For
> >> example, a provider can know statistics when it reads Apache
> >> Parquet data because Apache Parquet may provide statistics.
> >> But a consumer can't know statistics that are known by a
> >> producer. Because there isn't a standard way to provide
> >> statistics through the C data interface. If a consumer can
> >> know statistics, it can process Apache Arrow data faster
> >> based on statistics.
> >> Proposal:
> >> https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
> >> How about providing statistics as a metadata in ArrowSchema?
> >> We reserve "ARROW" namespace for internal Apache Arrow use:
> >> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
> >>
> >>> The ARROW pattern is a reserved namespace

Re: [VOTE] Release Apache Arrow nanoarrow 0.5.0

2024-05-23 Thread Raúl Cumplido
+1 (binding)

I've tested successfully on Ubuntu 22.04 without R.

TEST_R=0 ./verify-release-candidate.sh 0.5.0 0

Regards,
Raúl

El jue, 23 may 2024 a las 6:49, David Li () escribió:
>
> +1 (binding)
>
> Tested on Debian 12 'bookworm'
>
> On Thu, May 23, 2024, at 11:03, Sutou Kouhei wrote:
> > +1 (binding)
> >
> > I ran the following command line on Debian GNU/Linux sid:
> >
> >   dev/release/verify-release-candidate.sh 0.5.0 0
> >
> > with:
> >
> >   * Apache Arrow C++ main
> >   * gcc (Debian 13.2.0-23) 13.2.0
> >   * R version 4.3.3 (2024-02-29) -- "Angel Food Cake"
> >   * Python 3.11.9
> >
> > Thanks,
> > --
> > kou
> >
> >
> > In 
> >   "[VOTE] Release Apache Arrow nanoarrow 0.5.0" on Wed, 22 May 2024
> > 15:17:40 -0300,
> >   Dewey Dunnington  wrote:
> >
> >> Hello,
> >>
> >> I would like to propose the following release candidate (rc0) of
> >> Apache Arrow nanoarrow [0] version 0.5.0. This is an initial release
> >> consisting of 79 resolved GitHub issues from 9 contributors [1].
> >>
> >> This release candidate is based on commit:
> >> c5fb10035c17b598e6fd688ad9eb7b874c7c631b [2]
> >>
> >> The source release rc0 is hosted at [3].
> >> The changelog is located at [4].
> >>
> >> Please download, verify checksums and signatures, run the unit tests,
> >> and vote on the release. See [5] for how to validate a release
> >> candidate.
> >>
> >> The vote will be open for at least 72 hours.
> >>
> >> [ ] +1 Release this as Apache Arrow nanoarrow 0.5.0
> >> [ ] +0
> >> [ ] -1 Do not release this as Apache Arrow nanoarrow 0.5.0 because...
> >>
> >> [0] https://github.com/apache/arrow-nanoarrow
> >> [1] https://github.com/apache/arrow-nanoarrow/milestone/5?closed=1
> >> [2] 
> >> https://github.com/apache/arrow-nanoarrow/tree/apache-arrow-nanoarrow-0.5.0-rc0
> >> [3] 
> >> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-nanoarrow-0.5.0-rc0/
> >> [4] 
> >> https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.5.0-rc0/CHANGELOG.md
> >> [5] 
> >> https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/README.md