Re: [DISCUSS] Statistics through the C data interface

2024-06-20 Thread Sutou Kouhei
Hi,

Here is an updated summary so far:


Use cases:

* Optimize query plan: e.g. JOIN for DuckDB

Out of scope:

* Transmit statistics through anything other than the C data interface
  Examples:
  * Transmit statistics through Apache Arrow IPC file
  * Transmit statistics through Apache Arrow Flight
* Multi-column statistics
* Constraints information
* Indexes information

Discussing approach:

Standardize an Apache Arrow schema for statistics and transmit
statistics via a separate API call that uses the C data
interface.

This also works for per-batch statistics.

Candidate schema:

map<
  // The column index or null if the statistics refer to whole table or batch.
  column: int32,
  // Statistics key is int32.
  // Different keys are assigned for exact value and
  // approximate value.
  map<key: int32, value: dense_union<...>>
>
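
For illustration, here is a minimal Arrow C++ sketch of this candidate
schema (the dense union members follow the VALUE_SCHEMA discussed
earlier in this thread; everything else is an assumption, not a
finished proposal):

  #include <arrow/api.h>

  #include <memory>

  std::shared_ptr<arrow::DataType> MakeStatisticsType() {
    // VALUE_SCHEMA: the dense union proposed earlier in this thread.
    auto value_type = arrow::dense_union({
        arrow::field("int64", arrow::int64()),
        arrow::field("uint64", arrow::uint64()),
        arrow::field("float64", arrow::float64()),
        arrow::field("binary", arrow::binary()),
    });
    // Inner map: statistic key (int32) -> statistic value.
    auto per_target_stats = arrow::map(arrow::int32(), value_type);
    // Outer map: column index (the proposal allows null for the whole
    // table/batch) -> that column's statistics.
    return arrow::map(arrow::int32(), per_target_stats);
  }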

Discussions:

1. Can we use int32 for statistic keys?
   Should we use utf8 (or dictionary) for
   statistic keys?
2. How to support non-standard (vendor-specific)
   statistic keys?


Here is my idea:

1. We can use int32 for statistic keys.
2. We can reserve a specific range for non-standard
   statistic keys (see the sketch below). Prerequisites of this:
   * There is no use case to merge some statistics for
     the same data.
   * We can't merge statistics for different data.
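
A minimal sketch of the reserved-range idea (the boundary value below
is only an example, not a proposed constant):

  #include <cstdint>

  // Keys below this assumed boundary would be standard; keys at or
  // above it would be producer-defined (vendor-specific).
  constexpr int32_t kVendorSpecificStart = 1 << 16;  // example only

  inline bool IsVendorSpecificStatKey(int32_t key) {
    return key >= kVendorSpecificStart;
  }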

If the prerequisites aren't satisfied:

1. We should use utf8 (or dictionary) for
   statistic keys.
2. We can use a reserved prefix such as "ARROW:"/"arrow." for
   standard statistic keys and a prefix such as
   "vendor1:"/"vendor1." for non-standard statistic keys
   (see the sketch below).

Here is Felipe's idea:
https://lists.apache.org/thread/gr2nmlrwr7d5wkz3zgq6vy5q0ow8xof2

1. We can use int32 for statistic keys.
2. We can use a special statistic key plus a string identifier
   for non-standard statistic keys (see the sketch below).
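
A minimal sketch of this idea (the key value and field names are
assumptions, not part of any spec):

  #include <cstdint>
  #include <string>

  // One reserved key acts as an escape hatch; its string identifier
  // names the non-standard statistic (e.g. "vendor1.quantile_sketch").
  constexpr int32_t kStatKeyVendorSpecific = 0;  // assumed value

  struct StatisticKey {
    int32_t key;             // a standard key, or kStatKeyVendorSpecific
    std::string identifier;  // only used for non-standard statistics
  };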


What do you think about this?


Thanks,
-- 
kou

In <20240606.182727.1004633558059795207@clear-code.com>
  "Re: [DISCUSS] Statistics through the C data interface" on Thu, 06 Jun 2024 
18:27:27 +0900 (JST),
  Sutou Kouhei  wrote:

> Hi,
> 
> Thanks for sharing your comments. Here is a summary so far:
> 
> 
> 
> Use cases:
> 
> * Optimize query plan: e.g. JOIN for DuckDB
> 
> Out of scope:
> 
> * Transmit statistics through not the C data interface
>   Examples:
>   * Transmit statistics through Apache Arrow IPC file
>   * Transmit statistics through Apache Arrow Flight
> 
> Candidate approaches:
> 
> 1. Pass statistics (encoded as an Apache Arrow data) via
>ArrowSchema metadata
>* This embeds statistics address into metadata
>* It's for avoiding using Apache Arrow IPC format with
>  the C data interface
> 2. Embed statistics (encoded as an Apache Arrow data) into
>ArrowSchema metadata
>* This adds statistics to metadata in Apache Arrow IPC
>  format
> 3. Embed statistics (encoded as JSON) into ArrowArray
>metadata
> 4. Standardize Apache Arrow schema for statistics and
>transmit statistics via separated API call that uses the
>C data interface
> 5. Use ADBC
> 
> 
> 
> I think that 4. is the best approach in these candidates.
> 
> 1. Embedding statistics address is tricky.
> 2. Consumers need to parse Apache Arrow IPC format data.
>(The C data interface consumers may not have the
>feature.)
> 3. This will work but 4. is more generic.
> 5. ADBC is too large to use only for statistics.
> 
> What do you think about this?
> 
> 
> If we select 4., we need to standardize Apache Arrow schema
> for statistics. How about the following schema?
> 
> 
> Metadata:
> 
> | Name                       | Value | Comments |
> |----------------------------|-------|----------|
> | ARROW::statistics::version | 1.0.0 | (1)      |
> 
> (1) This follows semantic versioning.
> 
> Fields:
> 
> | Name           | Type                  | Comments |
> |----------------|-----------------------|----------|
> | column         | utf8                  | (2)      |
> | key            | utf8 not null         | (3)      |
> | value          | VALUE_SCHEMA not null |          |
> | is_approximate | bool not null         | (4)      |
> 
> (2) If null, then the statistic applies to the entire table.
> It's for "row_count".
> (3) We'll provide pre-defined keys such as "max", "min",
> "byte_width" and "distinct_count" but users can also use
> application specific keys.
> (4) If true, then the value is approximate or best-effort.
> 
> VALUE_SCHEMA is a dense union with members:
> 
> | Name    | Type    |
> |---------|---------|
> | int64   | int64   |
> | uint64  | uint64  |
> | float64 | float64 |
> | binary  | binary  |
> 
> If a column is an int32 column, it uses int64 for
> "max"/"min". We don't provide all types here. Users should
> use a compatible type (int64 for an int32 column) instead.

Re: [DISCUSS] Statistics through the C data interface

2024-06-14 Thread Felipe Oliveira Carvalho
On Sun, Jun 9, 2024 at 7:53 PM Sutou Kouhei  wrote:
>
> Hi,
>
> In 
>   "Re: [DISCUSS] Statistics through the C data interface" on Sun, 9 Jun 2024 
> 22:11:54 +0200,
>   Antoine Pitrou  wrote:
>
> >>>> Fields:
> >>>> | Name   | Type  | Comments |
> >>>> ||---|  |
> >>>> | column | utf8  | (2)  |
> >>>> | key| utf8 not null | (3)  |
> >>>
> >>> 1. Should the key be something like `dictionary(int32, utf8)` to make
> >>> the representation more efficient where there are many columns?
> >> Dictionary is more efficient. But we need to standardize not
> >> only key but also ID -> key mapping.
> >
> > I don't get why we would need to standardize ID -> key mapping. The
> > key names would be significant, the dictionary mapping is just for
> > efficiency.
>
> Ah, space efficiency was only discussed here, right? I
> thought that computational efficiency is also discussed
> here. If we standardize ID -> key mapping, consumers don't
> need to compare key names.
>
> Example: We want to find "distinct_count" statistics.
>
> If we standardize ID -> key mapping (1 -> "distinct_count"),
> consumers can find "distinct_count" statistics by finding ID
> 1 entry.

A consumer should decode the statistics array into a structure of its
choice before using it.

my_specific_stats_map.Clear();  // all stats are unknown
for (const auto& item : statistics_array) {
  switch (item.stat_kind) {
    case ARRAY_STAT_X: {
      auto value_from_the_dense_union = /* ...extract X's value... */;
      my_specific_stats_map.SetStatX(value_from_the_dense_union);
      break;
    }
    case ARRAY_STAT_Y:
      // ...
      break;
    default:
      // Just ignore statistics that a producer might add, but that this
      // consumer doesn't take advantage of.
      break;
  }
}

Maps on the wire (network wire, or shared memory "wire") are better
represented as lists that become a hash table (or simple structs) at
the receiving end. We don't have to standardize positions in the
statistics array, which allows a small statistics array when not many
statistics are available. It also allows the number of standardized
statistics to grow without creating overhead unless a provider opts in
to producing that statistic.
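
A minimal sketch of that point, with the statistic values simplified
to double (in the proposal they are a dense union): the wire carries a
flat list of (key, value) pairs and the receiver builds whatever
lookup structure it prefers.

  #include <cstdint>
  #include <unordered_map>
  #include <utility>
  #include <vector>

  // Turn the transmitted (stat kind, value) pairs into a hash table on
  // the consumer side; unknown keys are simply ignored or carried along.
  std::unordered_map<int32_t, double> ToLookupTable(
      const std::vector<std::pair<int32_t, double>>& wire_stats) {
    return {wire_stats.begin(), wire_stats.end()};
  }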

> If we don't standardize ID -> key mapping, consumers need to
> compare key name to find "distinct_count" statistics.
>
>
> Anyway, this (string comparison) will not be a large
> overhead because (1) statistics data will not be large data
> and (2) consumers can cache ID -> key mapping to avoid
> duplicated string comparisons. So standardizing ID -> key
> mapping isn't required.
>
>
> Thanks,
> --
> kou


Re: [DISCUSS] Statistics through the C data interface

2024-06-11 Thread Sutou Kouhei
Hi,

Hmm. I think that this should not be covered by this
approach. Constraint/index information isn't similar to
statistics information. If we mix them, the resulting data
will be difficult to use.

BTW, how do you want to use constraint and index
information? Indexes can't be interchanged through the C
data interface. Do you want to use them to determine whether
pushdown is used or not?


Thanks,
-- 
kou

In 
  "Re: [DISCUSS] Statistics through the C data interface" on Sun, 9 Jun 2024 
18:29:19 -0400,
  Adam Lippai  wrote:

> It’s not strictly statistics, but would this also cover constraints and
> indexes? Table, recordbatch and column primary keys, unique keys, sort
> keys, bloom filters, hnsw index and shape (ndarray for keys xyz).
> 
> Not sure which backends (DB, parquet, lance) expose which natively, but
> it might be worth considering for a minute.
> 
> Best regards,
> Adam Lippai
> 
> On Sun, Jun 9, 2024 at 17:36 Sutou Kouhei  wrote:
> 
>> Hi,
>>
>> In 
>>   "Re: [DISCUSS] Statistics through the C data interface" on Sun, 9 Jun
>> 2024 22:11:54 +0200,
>>   Antoine Pitrou  wrote:
>>
>> >>>> Fields:
>> >>>> | Name   | Type  | Comments |
>> >>>> ||---|  |
>> >>>> | column | utf8  | (2)  |
>> >>>> | key| utf8 not null | (3)  |
>> >>>
>> >>> 1. Should the key be something like `dictionary(int32, utf8)` to make
>> >>> the representation more efficient where there are many columns?
>> >> Dictionary is more efficient. But we need to standardize not
>> >> only key but also ID -> key mapping.
>> >
>> > I don't get why we would need to standardize ID -> key mapping. The
>> > key names would be significant, the dictionary mapping is just for
>> > efficiency.
>>
>> Ah, space efficiency was only discussed here, right? I
>> thought that computational efficiency is also discussed
>> here. If we standardize ID -> key mapping, consumers don't
>> need to compare key names.
>>
>> Example: We want to find "distinct_count" statistics.
>>
>> If we standardize ID -> key mapping (1 -> "distinct_count"),
>> consumers can find "distinct_count" statistics by finding ID
>> 1 entry.
>>
>> If we don't standardize ID -> key mapping, consumers need to
>> compare key name to find "distinct_count" statistics.
>>
>>
>> Anyway, this (string comparison) will not be a large
>> overhead because (1) statistics data will not be large data
>> and (2) consumers can cache ID -> key mapping to avoid
>> duplicated string comparisons. So standardizing ID -> key
>> mapping isn't required.
>>
>>
>> Thanks,
>> --
>> kou
>>


Re: [DISCUSS] Statistics through the C data interface

2024-06-11 Thread Sutou Kouhei
Hi,

In 
  "Re: [DISCUSS] Statistics through the C data interface" on Sun, 9 Jun 2024 
22:07:01 +0200,
  Antoine Pitrou  wrote:

>> How about reserving a specific range (e.g. 1-2) for
>> vendor-specific statistics?
> 
> This would be quite annoying to work with, especially if several
> vendors
> want to converge on a shared statistic.

Hmm. I don't think that I understand the use case you
described...

Does this use case merge some statistics for the same data
from multiple vendors into one statistic? (But why do we
need to get statistics for the same data from multiple
vendors?)

Or does this use case merge some statistics for different
data from multiple vendors into one statistic? In general,
I think that we can't merge statistics for different
data. For example, merging "distinct_count" statistics for
different data is meaningless.


Thanks,
-- 
kou


Re: [DISCUSS] Statistics through the C data interface

2024-06-09 Thread Adam Lippai
It’s not strictly statistics, but would this also cover constraints and
indexes? Table, recordbatch and column primary keys, unique keys, sort
keys, bloom filters, hnsw index and shape (ndarray for keys xyz).

Not sure which backends (DB, Parquet, Lance) expose which natively, but
it might be worth considering for a minute.

Best regards,
Adam Lippai

On Sun, Jun 9, 2024 at 17:36 Sutou Kouhei  wrote:

> Hi,
>
> In 
>   "Re: [DISCUSS] Statistics through the C data interface" on Sun, 9 Jun
> 2024 22:11:54 +0200,
>   Antoine Pitrou  wrote:
>
> >>>> Fields:
> >>>> | Name   | Type  | Comments |
> >>>> ||---|  |
> >>>> | column | utf8  | (2)  |
> >>>> | key| utf8 not null | (3)  |
> >>>
> >>> 1. Should the key be something like `dictionary(int32, utf8)` to make
> >>> the representation more efficient where there are many columns?
> >> Dictionary is more efficient. But we need to standardize not
> >> only key but also ID -> key mapping.
> >
> > I don't get why we would need to standardize ID -> key mapping. The
> > key names would be significant, the dictionary mapping is just for
> > efficiency.
>
> Ah, space efficiency was only discussed here, right? I
> thought that computational efficiency is also discussed
> here. If we standardize ID -> key mapping, consumers don't
> need to compare key names.
>
> Example: We want to find "distinct_count" statistics.
>
> If we standardize ID -> key mapping (1 -> "distinct_count"),
> consumers can find "distinct_count" statistics by finding ID
> 1 entry.
>
> If we don't standardize ID -> key mapping, consumers need to
> compare key name to find "distinct_count" statistics.
>
>
> Anyway, this (string comparison) will not be a large
> overhead because (1) statistics data will not be large data
> and (2) consumers can cache ID -> key mapping to avoid
> duplicated string comparisons. So standardizing ID -> key
> mapping isn't required.
>
>
> Thanks,
> --
> kou
>


Re: [DISCUSS] Statistics through the C data interface

2024-06-09 Thread Sutou Kouhei
Hi,

In 
  "Re: [DISCUSS] Statistics through the C data interface" on Sun, 9 Jun 2024 
22:11:54 +0200,
  Antoine Pitrou  wrote:

>>>> Fields:
>>>> | Name   | Type  | Comments |
>>>> ||---|  |
>>>> | column | utf8  | (2)  |
>>>> | key| utf8 not null | (3)  |
>>>
>>> 1. Should the key be something like `dictionary(int32, utf8)` to make
>>> the representation more efficient where there are many columns?
>> Dictionary is more efficient. But we need to standardize not
>> only key but also ID -> key mapping.
> 
> I don't get why we would need to standardize ID -> key mapping. The
> key names would be significant, the dictionary mapping is just for
> efficiency.

Ah, space efficiency was only discussed here, right? I
thought that computational efficiency is also discussed
here. If we standardize ID -> key mapping, consumers don't
need to compare key names.

Example: We want to find "distinct_count" statistics.

If we standardize ID -> key mapping (1 -> "distinct_count"),
consumers can find "distinct_count" statistics by finding ID
1 entry.

If we don't standardize ID -> key mapping, consumers need to
compare key name to find "distinct_count" statistics.


Anyway, this (string comparison) will not be a large
overhead because (1) statistics data will not be large data
and (2) consumers can cache ID -> key mapping to avoid
duplicated string comparisons. So standardizing ID -> key
mapping isn't required.
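
For illustration, a minimal sketch of that caching (Arrow C++; the
dictionary layout is an assumption):

  #include <arrow/api.h>

  #include <cstdint>
  #include <string>
  #include <unordered_map>

  // Build the dictionary-index -> key-name table once per statistics
  // payload; later lookups such as "distinct_count" then avoid repeated
  // string comparisons. `key_dictionary` is the utf8 dictionary of the
  // "key" column.
  std::unordered_map<int32_t, std::string> BuildIdToKey(
      const arrow::StringArray& key_dictionary) {
    std::unordered_map<int32_t, std::string> id_to_key;
    for (int64_t i = 0; i < key_dictionary.length(); ++i) {
      id_to_key.emplace(static_cast<int32_t>(i), key_dictionary.GetString(i));
    }
    return id_to_key;
  }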


Thanks,
-- 
kou


Re: [DISCUSS] Statistics through the C data interface

2024-06-09 Thread Antoine Pitrou



Le 09/06/2024 à 08:33, Sutou Kouhei a écrit :



Fields:
| Name   | Type  | Comments |
||---|  |
| column | utf8  | (2)  |
| key| utf8 not null | (3)  |


1. Should the key be something like `dictionary(int32, utf8)` to make
the representation more efficient where there are many columns?


Dictionary is more efficient. But we need to standardize not
only key but also ID -> key mapping.


I don't get why we would need to standardize ID -> key mapping. The key 
names would be significant, the dictionary mapping is just for efficiency.



It may be complex to support full multi-column statistics
use cases. How about standardizing this without
multi-columns statistics support for the first version?


That's ok with me.

Regards

Antoine.


Re: [DISCUSS] Statistics through the C data interface

2024-06-09 Thread Antoine Pitrou



Le 09/06/2024 à 09:01, Sutou Kouhei a écrit :

Hi,


One thing that a plain integer makes more difficult is representing
non-standard statistics. For example, some engine might want to expose
elaborate quantile-based statistics even if they are not officially defined
here. With a `utf8` or `dictionary(int32, utf8)` field, that is quite
easy with some prefixing to ensure uniqueness. With an `int32` field,
the spec would have to mention a mechanism to ensure global uniqueness
of vendor-specific statistics.


How about reserving a specific range (e.g. 1-2) for
vendor-specific statistics?


This would be quite annoying to work with, especially if several vendors
want to converge on a shared statistic.

Regards

Antoine.


Re: [DISCUSS] Statistics through the C data interface

2024-06-09 Thread Sutou Kouhei
Hi,

> One thing that a plain integer makes more difficult is representing
> non-standard statistics. For example some engine might want to expose
> elaborate quantile-based statistics even if it not officially defined
> here. With a `utf8` or `dictionary(int32, utf8)` field, that is quite
> easy with some prefixing to ensure uniqueness. With a `int32` field,
> the spec would have to mention a mechanism to ensure global uniqueness
> of vendor-specific statistics.

How about reserving a specific range (e.g. 1-2) for
vendor-specific statistics? Statistics in that range aren't
globally unique, but global uniqueness may not be needed in a
specific producer-consumer communication.


Thanks,
-- 
kou

In 
  "Re: [DISCUSS] Statistics through the C data interface" on Fri, 7 Jun 2024 
10:05:48 +0200,
  Antoine Pitrou  wrote:

> 
> Le 07/06/2024 à 04:27, Felipe Oliveira Carvalho a écrit :
>> I've been thinking about how to encode statistics on Arrow arrays and
>> how to keep the set of statistics known by both producers and
>> consumers (i.e. standardized).
>> The statistics array(s) could be a
>>    map<
>>      // the column index or null if the statistics refer to whole table or batch
>>      column: int32,
>>      map<key: ArrowStatKind, value: dense_union<...depends on the stat keys...>>
>>    >
>> The keys would be defined as part of the standard:
>> // Statistics values are identified by specified int32-valued keys
>> // so that producers and consumers can agree on physical
>> // encoding and semantics. Statistics can be about a column,
>> // a record batch, or both.
>> typedef ArrowStatKind int32_t;
> 
> One thing that a plain integer makes more difficult is representing
> non-standard statistics. For example some engine might want to expose
> elaborate quantile-based statistics even if it not officially defined
> here. With a `utf8` or `dictionary(int32, utf8)` field, that is quite
> easy with some prefixing to ensure uniqueness. With a `int32` field,
> the spec would have to mention a mechanism to ensure global uniqueness
> of vendor-specific statistics.
> 
>> Version markers in two-sided protocols never work well long term:
>> see Parquet files lying about the version of the encoder so the files
>> can be read and web browsers lying on their User-Agent strings so
>> websites don't break. It's better to allow probing for individual
>> feature support (in this case, the presence of a specific stat kind in
>> the array).
> 
> +1 on this.
> 
> Regards
> 
> Antoine.


Re: [DISCUSS] Statistics through the C data interface

2024-06-09 Thread Sutou Kouhei
Hi,

> The exact types inside the dense_union would be chosen when encoding.

Ah, this approach doesn't standardize VALUE_SCHEMA (i.e. it
doesn't use a fixed VALUE_SCHEMA). If it works in the real
world, it's more flexible.

>  Version markers in two-sided protocols never work well long term:
> see Parquet files lying about the version of the encoder so the files
> can be read and web browsers lying on their User-Agent strings so
> websites don't break.

Could you explain more? I can understand the latter
case. But are there version-information-based compatibility
problems in Parquet files?


>   It's better to allow probing for individual
> feature support (in this case, the presence of a specific stat kind in
> the array).

It will work for the statistics kind case, but does it work
for adding support for the multi-column statistics case?

I think that we can't mix

map<
  // the column index or null if the statistics refer to whole table or batch
  column: int32,
  map<key: int32, value: dense_union<...>>
>

and

map<
  // the column indexes or empty if the statistics refer to whole table or batch
  column: list<int32>,
  map<key: int32, value: dense_union<...>>
>

.


Thanks,
-- 
kou

In 
  "Re: [DISCUSS] Statistics through the C data interface" on Thu, 6 Jun 2024 
23:27:18 -0300,
  Felipe Oliveira Carvalho  wrote:

> I've been thinking about how to encode statistics on Arrow arrays and
> how to keep the set of statistics known by both producers and
> consumers (i.e. standardized).
> 
> The statistics array(s) could be a
> 
>   map<
>     // the column index or null if the statistics refer to whole table or batch
>     column: int32,
>     map<key: ArrowStatKind, value: dense_union<...depends on the stat keys...>>
>   >
> 
> The keys would be defined as part of the standard:
> 
> // Statistics values are identified by specified int32-valued keys
> // so that producers and consumers can agree on physical
> // encoding and semantics. Statistics can be about a column,
> // a record batch, or both.
> typedef ArrowStatKind int32_t;
> 
> #define ARROW_STAT_ANY 0
> // Exact number of nulls in a column. Value must be int32 or int64.
> #define ARROW_STAT_NULL_COUNT_EXACT 1
> // Approximate number of nulls in a column. Value must be float32 or float64.
> #define ARROW_STAT_NULL_COUNT_APPROX 2
> // The minimum and maximum values of a column.
> // Value must be the same type as the column.
> // Supported types are: ...
> #define ARROW_STAT_MIN_APPROX 3
> #define ARROW_STAT_MIN_NULLS_FIRST 4
> #define ARROW_STAT_MIN_NULLS_LAST 5
> #define ARROW_STAT_MAX_APPROX 6
> #define ARROW_STAT_MAX_NULLS_FIRST 7
> #define ARROW_STAT_MAX_NULLS_LAST 8
> #define ARROW_STAT_CARDINALITY_APPROX 9
> #define ARROW_STAT_COUNT_DISTINCT_APPROX 10
> 
> Every key is optional and consumers that don't know or don't care
> about the stats can skip them while scanning statistics arrays.
> 
> Applications would have their own domain classes for storing
> statistics (e.g. DuckDB's BaseStatistics [1]) and a way to pack and
> unpack into these arrays.
> 
> The exact types inside the dense_union would be chosen when encoding.
> The decoder would handle the types expected and/or supported for each
> given stat kind.
> 
> We wouldn't have to rely on versioning of the entire statistics
> objects. If we want a richer way to represent a maximum, we add
> another stat kind to the spec and keep producing both the old and the
> new representations for the maximum while consumers migrate to the new
> way. Version markers in two-sided protocols never work well long term:
> see Parquet files lying about the version of the encoder so the files
> can be read and web browsers lying on their User-Agent strings so
> websites don't break. It's better to allow probing for individual
> feature support (in this case, the presence of a specific stat kind in
> the array).
> 
> Multiple calls could be done to load statistics and they could come
> with more statistics each time.
> 
> --
> Felipe
> 
> [1] 
> https://github.com/duckdb/duckdb/blob/670cd341249e266de384e0341f200f4864b41b27/src/include/duckdb/storage/statistics/base_statistics.hpp#L38-L146
> 
> On Thu, Jun 6, 2024 at 10:07 PM Dewey Dunnington
>  wrote:
>>
>> Thank you for collecting all of our opinions on this! I also agree
>> that (4) is the best option.
>>
>> > Fields:
>> >
>> > | Name   | Type  | Comments |
>> > ||---|  |
>> > | column | utf8  | (2)  |
>>
>> The utf8 type would presume that column names are unique (although I
>> like it better than referring to columns by integer position).
>>
>> > If null, then the statistic applies to the entire table.
>>

Re: [DISCUSS] Statistics through the C data interface

2024-06-09 Thread Sutou Kouhei
Hi,

>> | Name   | Type  | Comments |
>> ||---|  |
>> | column | utf8  | (2)  |
> 
> The utf8 type would presume that column names are unique (although I
> like it better than referring to columns by integer position).

Ah, I forgot that. This proposal is based on ADBC's, and ADBC
can presume it because it holds for RDBMSs. But we can't
presume it with the Apache Arrow columnar format.

So we need to use the position.

Or we could use "FieldRef" as in the C++ implementation:
https://github.com/apache/arrow/blob/399408cb273c47f490f65cdad95bc184a652826c/cpp/src/arrow/type.h#L2038-L2055

But that is complex to use, so the position is better.

>> If null, then the statistic applies to the entire table.
> 
> Perhaps the NULL column value could also be used for the other
> statistics in addition to a row count if the array is not a struct
> array?

I didn't think of it. Thanks. It makes sense.


Thanks,
-- 
kou

In 
  "Re: [DISCUSS] Statistics through the C data interface" on Thu, 6 Jun 2024 
22:06:41 -0300,
  Dewey Dunnington  wrote:

> Thank you for collecting all of our opinions on this! I also agree
> that (4) is the best option.
> 
>> Fields:
>>
>> | Name   | Type  | Comments |
>> ||---|  |
>> | column | utf8  | (2)  |
> 
> The utf8 type would presume that column names are unique (although I
> like it better than referring to columns by integer position).
> 
>> If null, then the statistic applies to the entire table.
> 
> Perhaps the NULL column value could also be used for the other
> statistics in addition to a row count if the array is not a struct
> array?
> 
> 
> On Thu, Jun 6, 2024 at 6:42 AM Antoine Pitrou  wrote:
>>
>>
>> Hi Kou,
>>
>> Thanks for pushing for this!
>>
>> Le 06/06/2024 à 11:27, Sutou Kouhei a écrit :
>> > 4. Standardize Apache Arrow schema for statistics and
>> > transmit statistics via separated API call that uses the
>> > C data interface
>> [...]
>> >
>> > I think that 4. is the best approach in these candidates.
>>
>> I agree.
>>
>> > If we select 4., we need to standardize Apache Arrow schema
>> > for statistics. How about the following schema?
>> >
>> > 
>> > Metadata:
>> >
>> > | Name   | Value | Comments |
>> > ||---|- |
>> > | ARROW::statistics::version | 1.0.0 | (1)  |
>>
>> I'm not sure this is useful, but it doesn't hurt.
>>
>> Nit: this should be "ARROW:statistics:version" for consistency with
>> https://arrow.apache.org/docs/format/Columnar.html#extension-types
>>
>> > Fields:
>> >
>> > | Name   | Type  | Comments |
>> > ||---|  |
>> > | column | utf8  | (2)  |
>> > | key| utf8 not null | (3)  |
>>
>> 1. Should the key be something like `dictionary(int32, utf8)` to make
>> the representation more efficient where there are many columns?
>>
>> 2. Should the statistics perhaps be nested as a map type under each
>> column to avoid repeating `column`, or is that overkill?
>>
>> 3. Should there also be room for multi-column statistics (such as
>> cardinality of a given column pair), or is it too complex for now?
>>
>> Regards
>>
>> Antoine.


Re: [DISCUSS] Statistics through the C data interface

2024-06-09 Thread Sutou Kouhei
Hi,

We can use 4. for per-batch statistics because 4. uses a
separate API call. Users can design that separate API call
for per-batch statistics.
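
For illustration, a purely hypothetical producer-side entry point
(neither the name nor the signature below is part of the C data
interface or of this proposal):

  // The structs come from the Arrow C data interface (arrow/c/abi.h).
  struct ArrowSchema;
  struct ArrowArray;

  // Hypothetical: a producer could export something like this next to
  // its ArrowArrayStream and fill `schema_out`/`array_out` with a
  // statistics array that follows the standardized statistics schema,
  // possibly once per record batch.
  extern "C" int provider_get_statistics(struct ArrowSchema* schema_out,
                                         struct ArrowArray* array_out);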

Thanks,
-- 
kou

In 
  "Re: [DISCUSS] Statistics through the C data interface" on Thu, 6 Jun 2024 
13:14:08 +0200,
  Alessandro Molina  wrote:

> I brought it up on Github, but writing here too to avoid spawning too many
> threads.
> https://github.com/apache/arrow/issues/38837#issuecomment-2145343755
> 
> It's not something we have to address now, but it would be great if we
> could design a solution that can be extended in the future to add per-batch
> statistics in ArrowArrayStream.
> 
> While it's true that in most cases the producer code will be applying the
> filtering, in the case of C-Data we can't take that for granted. There
> might be cases where the consumer has no control over the filtering that
> the producer would apply and the producer might not be aware of the
> filtering that the consumer might want to do.
> 
> In those cases providing the statistics per-batch would allow the consumer
> to skip the batches it doesn't care about, thus giving the opportunity for
> a fast path.
> 
> 
> 
> 
> 
> On Thu, Jun 6, 2024 at 11:42 AM Antoine Pitrou  wrote:
> 
>>
>> Hi Kou,
>>
>> Thanks for pushing for this!
>>
>> Le 06/06/2024 à 11:27, Sutou Kouhei a écrit :
>> > 4. Standardize Apache Arrow schema for statistics and
>> > transmit statistics via separated API call that uses the
>> > C data interface
>> [...]
>> >
>> > I think that 4. is the best approach in these candidates.
>>
>> I agree.
>>
>> > If we select 4., we need to standardize Apache Arrow schema
>> > for statistics. How about the following schema?
>> >
>> > 
>> > Metadata:
>> >
>> > | Name   | Value | Comments |
>> > ||---|- |
>> > | ARROW::statistics::version | 1.0.0 | (1)  |
>>
>> I'm not sure this is useful, but it doesn't hurt.
>>
>> Nit: this should be "ARROW:statistics:version" for consistency with
>> https://arrow.apache.org/docs/format/Columnar.html#extension-types
>>
>> > Fields:
>> >
>> > | Name   | Type  | Comments |
>> > ||---|  |
>> > | column | utf8  | (2)  |
>> > | key| utf8 not null | (3)  |
>>
>> 1. Should the key be something like `dictionary(int32, utf8)` to make
>> the representation more efficient where there are many columns?
>>
>> 2. Should the statistics perhaps be nested as a map type under each
>> column to avoid repeating `column`, or is that overkill?
>>
>> 3. Should there also be room for multi-column statistics (such as
>> cardinality of a given column pair), or is it too complex for now?
>>
>> Regards
>>
>> Antoine.
>>


Re: [DISCUSS] Statistics through the C data interface

2024-06-09 Thread Sutou Kouhei
Hi,

>> Metadata:
>> | Name   | Value | Comments |
>> ||---|- |
>> | ARROW::statistics::version | 1.0.0 | (1)  |
> 
> I'm not sure this is useful, but it doesn't hurt.

The Apache Arrow columnar format uses semantic
versioning. So I think that other specifications should also
use semantic versioning. FYI: ADBC API standard also uses
semantic versioning.

https://arrow.apache.org/docs/format/ADBC.html#adbc-api-standard-1-0-0

> Nit: this should be "ARROW:statistics:version" for consistency with
> https://arrow.apache.org/docs/format/Columnar.html#extension-types

You're right. I should have used ":" not "::" here...

>> Fields:
>> | Name   | Type  | Comments |
>> ||---|  |
>> | column | utf8  | (2)  |
>> | key| utf8 not null | (3)  |
> 
> 1. Should the key be something like `dictionary(int32, utf8)` to make
> the representation more efficient where there are many columns?

Dictionary is more efficient. But we need to standardize not
only key but also ID -> key mapping. If we standardize ID ->
key mapping, we don't need to use a dictionary. We can just
use the ID like Felipe's approach does.

> 2. Should the statistics perhaps be nested as a map type under each
> column to avoid repeating `column`, or is that overkill?

Ah, I didn't think of it. A nested type may be a bit complex,
but we already use a union (a nested type) for the value. So
using a map here isn't a problem.

> 3. Should there also be room for multi-column statistics (such as
> cardinality of a given column pair), or is it too complex for now?

I didn't think of multi-column statistics either...
It seems that PostgreSQL supports multi-column statistics:
https://www.postgresql.org/docs/current/catalog-pg-statistic-ext.html

We can support multi-column statistics by using a list for the
"column" field. But we also need to add more fields to
VALUE_SCHEMA to store values of multi-column statistics.

If we support PostgreSQL's multi-column N-distinct counts
case, we need "map<list<int32>, uint64>" (see the sketch after
the example):

https://www.postgresql.org/docs/current/planner-stats.html#PLANNER-STATS-EXTENDED-N-DISTINCT-COUNTS

> k  | 1 2 5
> nd | {"1, 2": 33178, "1, 5": 33178, "2, 5": 27435, "1, 2, 5": 33178}


If we support PostgreSQL's multi-column most common value
lists case, we need a more complex type...

https://www.postgresql.org/docs/current/planner-stats.html#PLANNER-STATS-EXTENDED-MCV-LISTS

>  index | values | nulls | frequency | base_frequency
> ---++---+---+
>  0 | {Washington, DC}   | {f,f} |  0.003467 |2.7e-05
>  1 | {Apo, AE}  | {f,f} |  0.003067 |1.9e-05
>  2 | {Houston, TX}  | {f,f} |  0.002167 |   0.000133
>  3 | {El Paso, TX}  | {f,f} | 0.002 |   0.000113
>  4 | {New York, NY} | {f,f} |  0.001967 |   0.000114
>  5 | {Atlanta, GA}  | {f,f} |  0.001633 |3.3e-05
>  6 | {Sacramento, CA}   | {f,f} |  0.001433 |7.8e-05
>  7 | {Miami, FL}| {f,f} |0.0014 |  6e-05
>  8 | {Dallas, TX}   | {f,f} |  0.001367 |8.8e-05
>  9 | {Chicago, IL}  | {f,f} |  0.001333 |5.1e-05
>...
> (99 rows)


It may be complex to support full multi-column statistics
use cases. How about standardizing this without
multi-column statistics support for the first version? We
can add support for multi-column statistics later. We can
use feedback from users of the first version at that time.


Thanks,
-- 
kou

In <57595559-a561-4bd2-9efd-b67aa9a32...@python.org>
  "Re: [DISCUSS] Statistics through the C data interface" on Thu, 6 Jun 2024 
11:40:50 +0200,
  Antoine Pitrou  wrote:

> 
> Hi Kou,
> 
> Thanks for pushing for this!
> 
> Le 06/06/2024 à 11:27, Sutou Kouhei a écrit :
>> 4. Standardize Apache Arrow schema for statistics and
>> transmit statistics via separated API call that uses the
>> C data interface
> [...]
>> I think that 4. is the best approach in these candidates.
> 
> I agree.
> 
>> If we select 4., we need to standardize Apache Arrow schema
>> for statistics. How about the following schema?
>> 
>> Metadata:
>> | Name   | Value | Comments |
>> ||---|- |
>> | ARROW::statistics::version | 1.0.0 | (1)  |
> 
> I'm not sure this is useful, but it doesn't hurt.
> 
> Nit: this should be "ARROW:statistics:version" for consistency with
> https://arrow.apache.org/docs/format/Columnar.html#extension-types

Re: [DISCUSS] Statistics through the C data interface

2024-06-08 Thread Felipe Oliveira Carvalho
> I just used quantiles as an example of a statistic that's not in the current 
> proposed spec, but that some engines would like to expose.

All statistics are optional so we can always add more to the spec.

> In other words, a plain integer makes extensibility more difficult than a 
> string.

Only the standardized metrics would be identified by an integer;
ARROW_STAT_ANY plus a string identifier can be used for non-standard
metrics. This is very similar to pre-defined Arrow types plus extension
types identified by string.

Since the C Data Interface is used to connect decoupled systems,
having a standard on most metrics would maximize the chances of
correct production and consumption of the statistics. The opaqueness
of the integer keys forces the reading of abi.h which contains the
explanation of the semantics of each metric.

On Sat, Jun 8, 2024 at 5:03 AM Antoine Pitrou  wrote:
>
>
>
> Le 07/06/2024 à 18:30, Felipe Oliveira Carvalho a écrit :
> > On Fri, Jun 7, 2024 at 6:24 AM Antoine Pitrou  wrote:
> >>
> >>
> >> Le 07/06/2024 à 04:27, Felipe Oliveira Carvalho a écrit :
> >>> I've been thinking about how to encode statistics on Arrow arrays and
> >>> how to keep the set of statistics known by both producers and
> >>> consumers (i.e. standardized).
> >>>
> >>> The statistics array(s) could be a
> >>>
> >>> map<
> >>>   // the column index or null if the statistics refer to whole table or batch
> >>>   column: int32,
> >>>   map<key: ArrowStatKind, value: dense_union<...depends on the stat keys...>>
> >>> >
> >>>
> >>> The keys would be defined as part of the standard:
> >>>
> >>> // Statistics values are identified by specified int32-valued keys
> >>> // so that producers and consumers can agree on physical
> >>> // encoding and semantics. Statistics can be about a column,
> >>> // a record batch, or both.
> >>> typedef ArrowStatKind int32_t;
> >>
> >> One thing that a plain integer makes more difficult is representing
> >> non-standard statistics. For example some engine might want to expose
> >> elaborate quantile-based statistics even if it not officially defined
> >> here. With a `utf8` or `dictionary(int32, utf8)` field, that is quite
> >> easy with some prefixing to ensure uniqueness. With a `int32` field, the
> >> spec would have to mention a mechanism to ensure global uniqueness of
> >> vendor-specific statistics.
> >
> > This encoding scheme can cover quantiles as well. Instead of parsing
> > strings or even naively matching just prefixes and breaking as
> > providers evolve (as already happens on some C Data interface
> > consumers), the consumers would expect a list of values in the enum
> > for a key called ARROW_STAT_QUANTILES.
>
> Ok, there's a misunderstanding. I did not claim that quantiles were
> difficult to represent. I just used quantiles as an example of a
> statistic that's not in the current proposed spec, but that some engines
> would like to expose. In other words, a plain integer makes
> extensibility more difficult than a string.
>
> Regards
>
> Antoine.


Re: [DISCUSS] Statistics through the C data interface

2024-06-08 Thread Antoine Pitrou




Le 07/06/2024 à 18:30, Felipe Oliveira Carvalho a écrit :

On Fri, Jun 7, 2024 at 6:24 AM Antoine Pitrou  wrote:



Le 07/06/2024 à 04:27, Felipe Oliveira Carvalho a écrit :

I've been thinking about how to encode statistics on Arrow arrays and
how to keep the set of statistics known by both producers and
consumers (i.e. standardized).

The statistics array(s) could be a

map<
  // the column index or null if the statistics refer to whole table or batch
  column: int32,
  map<key: ArrowStatKind, value: dense_union<...depends on the stat keys...>>
>

The keys would be defined as part of the standard:

// Statistics values are identified by specified int32-valued keys
// so that producers and consumers can agree on physical
// encoding and semantics. Statistics can be about a column,
// a record batch, or both.
typedef ArrowStatKind int32_t;


One thing that a plain integer makes more difficult is representing
non-standard statistics. For example some engine might want to expose
elaborate quantile-based statistics even if it not officially defined
here. With a `utf8` or `dictionary(int32, utf8)` field, that is quite
easy with some prefixing to ensure uniqueness. With a `int32` field, the
spec would have to mention a mechanism to ensure global uniqueness of
vendor-specific statistics.


This encoding scheme can cover quantiles as well. Instead of parsing
strings or even naively matching just prefixes and breaking as
providers evolve (as already happens on some C Data interface
consumers), the consumers would expect a list of values in the enum
for a key called ARROW_STAT_QUANTILES.


Ok, there's a misunderstanding. I did not claim that quantiles were 
difficult to represent. I just used quantiles as an example of a 
statistic that's not in the current proposed spec, but that some engines 
would like to expose. In other words, a plain integer makes 
extensibility more difficult than a string.


Regards

Antoine.


Re: [DISCUSS] Statistics through the C data interface

2024-06-07 Thread Felipe Oliveira Carvalho
On Fri, Jun 7, 2024 at 6:24 AM Antoine Pitrou  wrote:
>
>
> Le 07/06/2024 à 04:27, Felipe Oliveira Carvalho a écrit :
> > I've been thinking about how to encode statistics on Arrow arrays and
> > how to keep the set of statistics known by both producers and
> > consumers (i.e. standardized).
> >
> > The statistics array(s) could be a
> >
> >map<
> >  // the column index or null if the statistics refer to whole table or batch
> >  column: int32,
> >  map<key: ArrowStatKind, value: dense_union<...depends on the stat keys...>>
> >>
> >
> > The keys would be defined as part of the standard:
> >
> > // Statistics values are identified by specified int32-valued keys
> > // so that producers and consumers can agree on physical
> > // encoding and semantics. Statistics can be about a column,
> > // a record batch, or both.
> > typedef ArrowStatKind int32_t;
>
> One thing that a plain integer makes more difficult is representing
> non-standard statistics. For example some engine might want to expose
> elaborate quantile-based statistics even if it not officially defined
> here. With a `utf8` or `dictionary(int32, utf8)` field, that is quite
> easy with some prefixing to ensure uniqueness. With a `int32` field, the
> spec would have to mention a mechanism to ensure global uniqueness of
> vendor-specific statistics.

This encoding scheme can cover quantiles as well. Instead of parsing
strings or even naively matching just prefixes and breaking as
providers evolve (as already happens on some C Data interface
consumers), the consumers would expect a list of values in the enum
for a key called ARROW_STAT_QUANTILES.

/// ... Represented as a list
#define ARROW_STAT_CUMULATIVE_QUANTILES ...
/// ...
#define ARROW_STAT_QUANTILES ...

--
Felipe

> > Version markers in two-sided protocols never work well long term:
> > see Parquet files lying about the version of the encoder so the files
> > can be read and web browsers lying on their User-Agent strings so
> > websites don't break. It's better to allow probing for individual
> > feature support (in this case, the presence of a specific stat kind in
> > the array).
>
> +1 on this.
>
> Regards
>
> Antoine.


Re: [DISCUSS] Statistics through the C data interface

2024-06-07 Thread Antoine Pitrou



Le 07/06/2024 à 04:27, Felipe Oliveira Carvalho a écrit :

I've been thinking about how to encode statistics on Arrow arrays and
how to keep the set of statistics known by both producers and
consumers (i.e. standardized).

The statistics array(s) could be a

   map<
 // the column index or null if the statistics refer to whole table or batch
 column: int32,
     map<key: ArrowStatKind, value: dense_union<...depends on the stat keys...>>
   >

The keys would be defined as part of the standard:

// Statistics values are identified by specified int32-valued keys
// so that producers and consumers can agree on physical
// encoding and semantics. Statistics can be about a column,
// a record batch, or both.
typedef ArrowStatKind int32_t;


One thing that a plain integer makes more difficult is representing 
non-standard statistics. For example, some engine might want to expose
elaborate quantile-based statistics even if they are not officially defined
here. With a `utf8` or `dictionary(int32, utf8)` field, that is quite
easy with some prefixing to ensure uniqueness. With an `int32` field, the
spec would have to mention a mechanism to ensure global uniqueness of 
vendor-specific statistics.



Version markers in two-sided protocols never work well long term:
see Parquet files lying about the version of the encoder so the files
can be read and web browsers lying on their User-Agent strings so
websites don't break. It's better to allow probing for individual
feature support (in this case, the presence of a specific stat kind in
the array).


+1 on this.

Regards

Antoine.


Re: [DISCUSS] Statistics through the C data interface

2024-06-06 Thread Felipe Oliveira Carvalho
I've been thinking about how to encode statistics on Arrow arrays and
how to keep the set of statistics known by both producers and
consumers (i.e. standardized).

The statistics array(s) could be a

  map<
    // the column index or null if the statistics refer to whole table or batch
    column: int32,
    map<key: ArrowStatKind, value: dense_union<...depends on the stat keys...>>
  >

The keys would be defined as part of the standard:

// Statistics values are identified by specified int32-valued keys
// so that producers and consumers can agree on physical
// encoding and semantics. Statistics can be about a column,
// a record batch, or both.
typedef ArrowStatKind int32_t;

#define ARROW_STAT_ANY 0
// Exact number of nulls in a column. Value must be int32 or int64.
#define ARROW_STAT_NULL_COUNT_EXACT 1
// Approximate number of nulls in a column. Value must be float32 or float64.
#define ARROW_STAT_NULL_COUNT_APPROX 2
// The minimum and maximum values of a column.
// Value must be the same type as the column.
// Supported types are: ...
#define ARROW_STAT_MIN_APPROX 3
#define ARROW_STAT_MIN_NULLS_FIRST 4
#define ARROW_STAT_MIN_NULLS_LAST 5
#define ARROW_STAT_MAX_APPROX 6
#define ARROW_STAT_MAX_NULLS_FIRST 7
#define ARROW_STAT_MAX_NULLS_LAST 8
#define ARROW_STAT_CARDINALITY_APPROX 9
#define ARROW_STAT_COUNT_DISTINCT_APPROX 10

Every key is optional and consumers that don't know or don't care
about the stats can skip them while scanning statistics arrays.

Applications would have their own domain classes for storing
statistics (e.g. DuckDB's BaseStatistics [1]) and a way to pack and
unpack into these arrays.

The exact types inside the dense_union would be chosen when encoding.
The decoder would handle the types expected and/or supported for each
given stat kind.

We wouldn't have to rely on versioning of the entire statistics
objects. If we want a richer way to represent a maximum, we add
another stat kind to the spec and keep producing both the old and the
new representations for the maximum while consumers migrate to the new
way. Version markers in two-sided protocols never work well long term:
see Parquet files lying about the version of the encoder so the files
can be read and web browsers lying on their User-Agent strings so
websites don't break. It's better to allow probing for individual
feature support (in this case, the presence of a specific stat kind in
the array).

Multiple calls could be done to load statistics and they could come
with more statistics each time.

--
Felipe

[1] 
https://github.com/duckdb/duckdb/blob/670cd341249e266de384e0341f200f4864b41b27/src/include/duckdb/storage/statistics/base_statistics.hpp#L38-L146

On Thu, Jun 6, 2024 at 10:07 PM Dewey Dunnington
 wrote:
>
> Thank you for collecting all of our opinions on this! I also agree
> that (4) is the best option.
>
> > Fields:
> >
> > | Name   | Type  | Comments |
> > ||---|  |
> > | column | utf8  | (2)  |
>
> The utf8 type would presume that column names are unique (although I
> like it better than referring to columns by integer position).
>
> > If null, then the statistic applies to the entire table.
>
> Perhaps the NULL column value could also be used for the other
> statistics in addition to a row count if the array is not a struct
> array?
>
>
> On Thu, Jun 6, 2024 at 6:42 AM Antoine Pitrou  wrote:
> >
> >
> > Hi Kou,
> >
> > Thanks for pushing for this!
> >
> > Le 06/06/2024 à 11:27, Sutou Kouhei a écrit :
> > > 4. Standardize Apache Arrow schema for statistics and
> > > transmit statistics via separated API call that uses the
> > > C data interface
> > [...]
> > >
> > > I think that 4. is the best approach in these candidates.
> >
> > I agree.
> >
> > > If we select 4., we need to standardize Apache Arrow schema
> > > for statistics. How about the following schema?
> > >
> > > 
> > > Metadata:
> > >
> > > | Name   | Value | Comments |
> > > ||---|- |
> > > | ARROW::statistics::version | 1.0.0 | (1)  |
> >
> > I'm not sure this is useful, but it doesn't hurt.
> >
> > Nit: this should be "ARROW:statistics:version" for consistency with
> > https://arrow.apache.org/docs/format/Columnar.html#extension-types
> >
> > > Fields:
> > >
> > > | Name   | Type  | Comments |
> > > ||---|  |
> > > | column | utf8  | (2)  |
> > > | key| utf8 not null | (3)  |
> >
> > 1. Should the key be something like `dictionary(int32, utf8)` to make
> > the representation more efficient where there are many columns?
> >
> > 2. Should the statistics perhaps be nested as a map type under each
> > column to avoid repeating `column`, or is that overkill?
> >
> > 3. Should there also be room for multi-column statistics (such as
> > cardinality of a given column pair), or is it too complex for now?
> >
> > Regards
> >
> > Antoine.


Re: [DISCUSS] Statistics through the C data interface

2024-06-06 Thread Dewey Dunnington
Thank you for collecting all of our opinions on this! I also agree
that (4) is the best option.

> Fields:
>
> | Name   | Type  | Comments |
> ||---|  |
> | column | utf8  | (2)  |

The utf8 type would presume that column names are unique (although I
like it better than referring to columns by integer position).

> If null, then the statistic applies to the entire table.

Perhaps the NULL column value could also be used for the other
statistics in addition to a row count if the array is not a struct
array?


On Thu, Jun 6, 2024 at 6:42 AM Antoine Pitrou  wrote:
>
>
> Hi Kou,
>
> Thanks for pushing for this!
>
> Le 06/06/2024 à 11:27, Sutou Kouhei a écrit :
> > 4. Standardize Apache Arrow schema for statistics and
> > transmit statistics via separated API call that uses the
> > C data interface
> [...]
> >
> > I think that 4. is the best approach in these candidates.
>
> I agree.
>
> > If we select 4., we need to standardize Apache Arrow schema
> > for statistics. How about the following schema?
> >
> > 
> > Metadata:
> >
> > | Name   | Value | Comments |
> > ||---|- |
> > | ARROW::statistics::version | 1.0.0 | (1)  |
>
> I'm not sure this is useful, but it doesn't hurt.
>
> Nit: this should be "ARROW:statistics:version" for consistency with
> https://arrow.apache.org/docs/format/Columnar.html#extension-types
>
> > Fields:
> >
> > | Name   | Type  | Comments |
> > ||---|  |
> > | column | utf8  | (2)  |
> > | key| utf8 not null | (3)  |
>
> 1. Should the key be something like `dictionary(int32, utf8)` to make
> the representation more efficient where there are many columns?
>
> 2. Should the statistics perhaps be nested as a map type under each
> column to avoid repeating `column`, or is that overkill?
>
> 3. Should there also be room for multi-column statistics (such as
> cardinality of a given column pair), or is it too complex for now?
>
> Regards
>
> Antoine.


Re: [DISCUSS] Statistics through the C data interface

2024-06-06 Thread Alessandro Molina
I brought it up on Github, but writing here too to avoid spawning too many
threads.
https://github.com/apache/arrow/issues/38837#issuecomment-2145343755

It's not something we have to address now, but it would be great if we
could design a solution that can be extended in the future to add per-batch
statistics in ArrowArrayStream.

While it's true that in most cases the producer code will be applying the
filtering, in the case of C-Data we can't take that for granted. There
might be cases where the consumer has no control over the filtering that
the producer would apply and the producer might not be aware of the
filtering that the consumer might want to do.

In those cases providing the statistics per-batch would allow the consumer
to skip the batches it doesn't care about, thus giving the opportunity for
a fast path.
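
A minimal sketch of that fast path, assuming per-batch min/max
statistics are available (names and types are illustrative only):

  #include <cstdint>

  // A consumer evaluating the predicate `x > threshold` can skip any
  // batch whose maximum cannot satisfy it, without reading the batch.
  bool BatchMightMatch(int64_t batch_max, int64_t threshold) {
    return batch_max > threshold;  // false => skip the whole batch
  }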





On Thu, Jun 6, 2024 at 11:42 AM Antoine Pitrou  wrote:

>
> Hi Kou,
>
> Thanks for pushing for this!
>
> Le 06/06/2024 à 11:27, Sutou Kouhei a écrit :
> > 4. Standardize Apache Arrow schema for statistics and
> > transmit statistics via separated API call that uses the
> > C data interface
> [...]
> >
> > I think that 4. is the best approach in these candidates.
>
> I agree.
>
> > If we select 4., we need to standardize Apache Arrow schema
> > for statistics. How about the following schema?
> >
> > 
> > Metadata:
> >
> > | Name   | Value | Comments |
> > ||---|- |
> > | ARROW::statistics::version | 1.0.0 | (1)  |
>
> I'm not sure this is useful, but it doesn't hurt.
>
> Nit: this should be "ARROW:statistics:version" for consistency with
> https://arrow.apache.org/docs/format/Columnar.html#extension-types
>
> > Fields:
> >
> > | Name   | Type  | Comments |
> > ||---|  |
> > | column | utf8  | (2)  |
> > | key| utf8 not null | (3)  |
>
> 1. Should the key be something like `dictionary(int32, utf8)` to make
> the representation more efficient where there are many columns?
>
> 2. Should the statistics perhaps be nested as a map type under each
> column to avoid repeating `column`, or is that overkill?
>
> 3. Should there also be room for multi-column statistics (such as
> cardinality of a given column pair), or is it too complex for now?
>
> Regards
>
> Antoine.
>


Re: [DISCUSS] Statistics through the C data interface

2024-06-06 Thread Antoine Pitrou



Hi Kou,

Thanks for pushing for this!

Le 06/06/2024 à 11:27, Sutou Kouhei a écrit :

4. Standardize Apache Arrow schema for statistics and
transmit statistics via separated API call that uses the
C data interface

[...]


I think that 4. is the best approach in these candidates.


I agree.


If we select 4., we need to standardize Apache Arrow schema
for statistics. How about the following schema?


Metadata:

| Name   | Value | Comments |
||---|- |
| ARROW::statistics::version | 1.0.0 | (1)  |


I'm not sure this is useful, but it doesn't hurt.

Nit: this should be "ARROW:statistics:version" for consistency with 
https://arrow.apache.org/docs/format/Columnar.html#extension-types



Fields:

| Name   | Type  | Comments |
||---|  |
| column | utf8  | (2)  |
| key| utf8 not null | (3)  |


1. Should the key be something like `dictionary(int32, utf8)` to make 
the representation more efficient where there are many columns?


2. Should the statistics perhaps be nested as a map type under each 
column to avoid repeating `column`, or is that overkill?


3. Should there also be room for multi-column statistics (such as 
cardinality of a given column pair), or is it too complex for now?


Regards

Antoine.


Re: [DISCUSS] Statistics through the C data interface

2024-06-06 Thread Sutou Kouhei
Hi,

Thanks for sharing your comments. Here is a summary so far:



Use cases:

* Optimize query plan: e.g. JOIN for DuckDB

Out of scope:

* Transmit statistics through anything other than the C data interface
  Examples:
  * Transmit statistics through Apache Arrow IPC file
  * Transmit statistics through Apache Arrow Flight

Candidate approaches:

1. Pass statistics (encoded as Apache Arrow data) via
   ArrowSchema metadata
   * This embeds the statistics address into metadata
   * It's for avoiding the use of the Apache Arrow IPC format
     with the C data interface
2. Embed statistics (encoded as Apache Arrow data) into
   ArrowSchema metadata
   * This adds statistics to metadata in Apache Arrow IPC
 format
3. Embed statistics (encoded as JSON) into ArrowArray
   metadata
4. Standardize Apache Arrow schema for statistics and
   transmit statistics via separated API call that uses the
   C data interface
5. Use ADBC



I think that 4. is the best approach in these candidates.

1. Embedding a statistics address is tricky.
2. Consumers need to parse Apache Arrow IPC format data.
   (The C data interface consumers may not have the
   feature.)
3. This will work but 4. is more generic.
5. ADBC is too large to use only for statistics.

What do you think about this?


If we select 4., we need to standardize Apache Arrow schema
for statistics. How about the following schema?


Metadata:

| Name                       | Value | Comments |
|----------------------------|-------|----------|
| ARROW::statistics::version | 1.0.0 | (1)      |

(1) This follows semantic versioning.

Fields:

| Name           | Type                  | Comments |
|----------------|-----------------------|----------|
| column         | utf8                  | (2)      |
| key            | utf8 not null         | (3)      |
| value          | VALUE_SCHEMA not null |          |
| is_approximate | bool not null         | (4)      |

(2) If null, then the statistic applies to the entire table.
It's for "row_count".
(3) We'll provide pre-defined keys such as "max", "min",
"byte_width" and "distinct_count" but users can also use
application specific keys.
(4) If true, then the value is approximate or best-effort.

VALUE_SCHEMA is a dense union with members:

| Name    | Type    |
|---------|---------|
| int64   | int64   |
| uint64  | uint64  |
| float64 | float64 |
| binary  | binary  |

If a column is an int32 column, it uses int64 for
"max"/"min". We don't provide all types here. Users should
use a compatible type (int64 for an int32 column) instead.
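
For illustration, a minimal Arrow C++ sketch of this schema (an
assumption of how it could be written down, not an official
definition):

  #include <arrow/api.h>
  #include <arrow/util/key_value_metadata.h>

  #include <memory>

  std::shared_ptr<arrow::Schema> MakeStatisticsSchema() {
    // VALUE_SCHEMA: the dense union described above.
    auto value_type = arrow::dense_union({
        arrow::field("int64", arrow::int64()),
        arrow::field("uint64", arrow::uint64()),
        arrow::field("float64", arrow::float64()),
        arrow::field("binary", arrow::binary()),
    });
    return arrow::schema(
        {arrow::field("column", arrow::utf8()),  // nullable: whole table/batch
         arrow::field("key", arrow::utf8(), /*nullable=*/false),
         arrow::field("value", value_type, /*nullable=*/false),
         arrow::field("is_approximate", arrow::boolean(), /*nullable=*/false)},
        arrow::key_value_metadata({"ARROW::statistics::version"}, {"1.0.0"}));
  }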



Thanks,
-- 
kou


In <20240522.113708.2023905028549001143@clear-code.com>
  "[DISCUSS] Statistics through the C data interface" on Wed, 22 May 2024 
11:37:08 +0900 (JST),
  Sutou Kouhei  wrote:

> Hi,
> 
> We're discussing how to provide statistics through the C
> data interface at:
> https://github.com/apache/arrow/issues/38837
> 
> If you're interested in this feature, could you share your
> comments?
> 
> 
> Motivation:
> 
> We can interchange Apache Arrow data by the C data interface
> in the same process. For example, we can pass Apache Arrow
> data read by Apache Arrow C++ (provider) to DuckDB
> (consumer) through the C data interface.
> 
> A provider may know Apache Arrow data statistics. For
> example, a provider can know statistics when it reads Apache
> Parquet data because Apache Parquet may provide statistics.
> 
> But a consumer can't know statistics that are known by a
> producer, because there isn't a standard way to provide
> statistics through the C data interface. If a consumer can
> know statistics, it can process Apache Arrow data faster
> based on statistics.
> 
> 
> Proposal:
> 
> https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
> 
> How about providing statistics as a metadata in ArrowSchema?
> 
> We reserve "ARROW" namespace for internal Apache Arrow use:
> 
> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
> 
>> The ARROW pattern is a reserved namespace for internal
>> Arrow use in the custom_metadata fields. For example,
>> ARROW:extension:name.
> 
> So we can use "ARROW:statistics" for the metadata key.
> 
> We can represent statistics as a ArrowArray like ADBC does.
> 
> Here is an example ArrowSchema that is for a record batch
> that has "int32 column1" and "string column2":
> 
> ArrowSchema {
>   .format = "+siu",
>   .metadata = {
> "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row 
> count */
>   },
>   .children = {
> ArrowSchema {
>   .name = "column1",
>   .format = "i",
>   .metadata = {
>  

Re: [DISCUSS] Statistics through the C data interface

2024-05-31 Thread Sutou Kouhei
Hi,

> why can't statistics be returned as arrow arrays encoded using the C data 
> interface?

This refers the following scenario, right?

1. Producer encodes statistics as Arrow arrays
2. Producer passes the encoded statistics using the C data interface
   (The encoded statistics aren't embedded into schema, data
   or something. They are standalone Arrow arrays.)
3. Consumer decides whether to request data from the
   producer based on the received encoded statistics

If so, I think that we should support this scenario.
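
As a rough sketch of this flow in Python/pyarrow (get_statistics() is
a hypothetical producer-side call, not an existing API, and the value
column is simplified to int64 instead of the proposed dense union):

  import pyarrow as pa

  # 1. Producer encodes statistics as a standalone Arrow batch.
  def get_statistics():
      return pa.RecordBatch.from_arrays(
          [
              pa.array([None], pa.utf8()),         # column: null = whole table
              pa.array(["row_count"], pa.utf8()),  # statistics key
              pa.array([29], pa.int64()),          # value (simplified)
              pa.array([False], pa.bool_()),       # is_approximate
          ],
          names=["column", "key", "value", "is_approximate"],
      )

  # 2. Producer passes the batch through the C data interface.
  # 3. Consumer inspects it before deciding whether to request the data.
  stats = get_statistics()
  row_count = stats.column(2)[0].as_py()
  if row_count < 1_000_000:
      pass  # e.g. pick a different join order before requesting the data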


My JSON-based proposal isn't a replacement for this scenario.

It's an additional proposal for the following use case:

> - Arrow IPC files may be mapped from shared memory


But I should not have proposed this in this thread because
this thread is about "statistics through the C data
interface". Sorry.

Let's discuss this use case in a separate thread. It may be
better to start the new thread after we complete this
one. The separate thread can build on the conclusion of
this thread.


Thanks,
-- 
kou

In 
  "Re: [DISCUSS] Statistics through the C data interface" on Fri, 31 May 2024 
18:43:48 +0100,
  Raphael Taylor-Davies  wrote:

> I'm likely missing something here, but why can't statistics be returned as 
> arrow arrays encoded using the C data interface? My understanding of the C 
> data interface is as a specification for exchanging arrow payloads, with it 
> left to higher level protocols, such as ADBC, to assign semantic meaning to 
> said payloads. It therefore seems odd to me to be shoehorning statistics 
> information into it in this way? 
> 
> On 31 May 2024 18:05:19 BST, "Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)" 
>  wrote:
>>Agreed, it doesn't seem like a good idea to require users of the
>>C data interface to also depend on the IPC format. JSON sounds
>>more reasonable in that case.
>>
>>Shoumyo
>>
>>From: dev@arrow.apache.org At: 05/29/24 02:02:23 UTC-4:00To:  
>>dev@arrow.apache.org
>>Subject: Re: [DISCUSS] Statistics through the C data interface
>>
>>>Hi,
>>>
>>>>> - If you need statistics in the schema then simply encode the 1-row batch
>>>>>   into an IPC buffer (using the streaming format) or maybe just an IPC
>>>>>   RecordBatch message since the schema is fixed and store those bytes in the
>>>>>   schema
>>>> 
>>>> This would avoid having to define a separate "schema" for
>>>> the JSON metadata
>>>
>>>Right. What I'm worried about with this approach is that
>>>this may not match with the C data interface.
>>>
>>>In the C data interface, we don't use the IPC format. If we
>>>want to transmit statistics with schema through the C data
>>>interface, we need to mix the IPC format and the C data
>>>interface. (This is why I used the address in my first
>>>proposal.)
>>>
>>>Note that we can use separated API to transmit statistics
>>>instead of embedding statistics into schema for this case.
>>>
>>>I thought using JSON is easier to use for both of the IPC
>>>format and the C data interface. Statistics data will not be
>>>large. So this will not affect performance.
>>>
>>>
>>>> If we do go down the JSON route, how about something like
>>>> this to avoid defining the keys for all possible statistics up
>>>> front:
>>>> 
>>>>   Schema {
>>>> custom_metadata: {
>>>>   "ARROW:statistics" => "[ { \"key\": \"row_count\", \"value\": 29, \"value_type\": \"uint64\", \"is_approximate\": false } ]"
>>>> }
>>>>   }
>>>> 
>>>> It's more verbose, but more closely mirrors the Arrow array
>>>> schema defined for statistics getter APIs. This could make it
>>>> easier to translate between the two.
>>>
>>>Thanks. I didn't think of it.
>>>It makes sense.
>>>
>>>
>>>Thanks,
>>>-- 
>>>kou
>>>
>>>In <665673b500015f5808ce0...@message.bloomberg.net>

Re: [DISCUSS] Statistics through the C data interface

2024-05-31 Thread Raphael Taylor-Davies
I'm likely missing something here, but why can't statistics be returned as 
arrow arrays encoded using the C data interface? My understanding of the C data 
interface is as a specification for exchanging arrow payloads, with it left to 
higher level protocols, such as ADBC, to assign semantic meaning to said 
payloads. It therefore seems odd to me to be shoehorning statistics information 
into it in this way? 

On 31 May 2024 18:05:19 BST, "Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)" 
 wrote:
>Agreed, it doesn't seem like a good idea to require users of the
>C data interface to also depend on the IPC format. JSON sounds
>more reasonable in that case.
>
>Shoumyo
>
>From: dev@arrow.apache.org At: 05/29/24 02:02:23 UTC-4:00To:  
>dev@arrow.apache.org
>Subject: Re: [DISCUSS] Statistics through the C data interface
>
>>Hi,
>>
>>>> - If you need statistics in the schema then simply encode the 1-row batch
>>>>   into an IPC buffer (using the streaming format) or maybe just an IPC
>>>>   RecordBatch message since the schema is fixed and store those bytes in the
>>>>   schema
>>> 
>>> This would avoid having to define a separate "schema" for
>>> the JSON metadata
>>
>>Right. What I'm worried about with this approach is that
>>this may not match with the C data interface.
>>
>>In the C data interface, we don't use the IPC format. If we
>>want to transmit statistics with schema through the C data
>>interface, we need to mix the IPC format and the C data
>>interface. (This is why I used the address in my first
>>proposal.)
>>
>>Note that we can use separated API to transmit statistics
>>instead of embedding statistics into schema for this case.
>>
>>I thought using JSON is easier to use for both of the IPC
>>format and the C data interface. Statistics data will not be
>>large. So this will not affect performance.
>>
>>
>>> If we do go down the JSON route, how about something like
>>> this to avoid defining the keys for all possible statistics up
>>> front:
>>> 
>>>   Schema {
>>> custom_metadata: {
>>>   "ARROW:statistics" => "[ { \"key\": \"row_count\", \"value\": 29, \"value_type\": \"uint64\", \"is_approximate\": false } ]"
>>> }
>>>   }
>>> 
>>> It's more verbose, but more closely mirrors the Arrow array
>>> schema defined for statistics getter APIs. This could make it
>>> easier to translate between the two.
>>
>>Thanks. I didn't think of it.
>>It makes sense.
>>
>>
>>Thanks,
>>-- 
>>kou
>>
>>In <665673b500015f5808ce0...@message.bloomberg.net>
>>  "Re: [DISCUSS] Statistics through the C data interface" on Wed, 29 May 2024 
>>00:15:49 -,
>>  "Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)"  
>>wrote:
>>
>>> Thanks for addressing the feedback! I didn't know that an
>>> Arrow IPC `Message` (not just Schema) could also contain
>>> `custom_metadata` -- thanks for pointing it out.
>>> 
>>>> Based on the list, how about standardizing both of the
>>>> followings for statistics?
>>>>
>>>> 1. Apache Arrow schema for statistics that is used by
>>>>separated statistics getter API
>>>> 2. "ARROW:statistics" metadata format that can be used in
>>>>Apache Arrow schema metadata
>>>>
>>>> Users can use 1. and/or 2. based on their use cases.
>>> 
>>> This sounds good to me. Using JSON to represent the metadata
>>> for #2 also sounds reasonable. I think elsewhere on this
>>> thread, Weston mentioned that we could alternatively use
>>> the schema defined for #1 and directly use that to encode
>>> the schema metadata as an Arrow IPC RecordBatch:
>>> 
>>>> This has been something that has always been desired for the Arrow IPC
>>>> format too.
>>>>
>>>> My preference would be (apologies if this has been mentioned before):
>>>>
>>>> - Agree on how statisti

Re: [DISCUSS] Statistics through the C data interface

2024-05-31 Thread Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)
Agreed, it doesn't seem like a good idea to require users of the
C data interface to also depend on the IPC format. JSON sounds
more reasonable in that case.

Shoumyo

From: dev@arrow.apache.org At: 05/29/24 02:02:23 UTC-4:00To:  
dev@arrow.apache.org
Subject: Re: [DISCUSS] Statistics through the C data interface

>Hi,
>
>>> - If you need statistics in the schema then simply encode the 1-row batch
>>>   into an IPC buffer (using the streaming format) or maybe just an IPC
>>>   RecordBatch message since the schema is fixed and store those bytes in the
>>>   schema
>> 
>> This would avoid having to define a separate "schema" for
>> the JSON metadata
>
>Right. What I'm worried about with this approach is that
>this may not match with the C data interface.
>
>In the C data interface, we don't use the IPC format. If we
>want to transmit statistics with schema through the C data
>interface, we need to mix the IPC format and the C data
>interface. (This is why I used the address in my first
>proposal.)
>
>Note that we can use separated API to transmit statistics
>instead of embedding statistics into schema for this case.
>
>I thought using JSON is easier to use for both of the IPC
>format and the C data interface. Statistics data will not be
>large. So this will not affect performance.
>
>
>> If we do go down the JSON route, how about something like
>> this to avoid defining the keys for all possible statistics up
>> front:
>> 
>>   Schema {
>> custom_metadata: {
>>   "ARROW:statistics" => "[ { \"key\": \"row_count\", \"value\": 29, \"value_type\": \"uint64\", \"is_approximate\": false } ]"
>> }
>>   }
>> 
>> It's more verbose, but more closely mirrors the Arrow array
>> schema defined for statistics getter APIs. This could make it
>> easier to translate between the two.
>
>Thanks. I didn't think of it.
>It makes sense.
>
>
>Thanks,
>-- 
>kou
>
>In <665673b500015f5808ce0...@message.bloomberg.net>
>  "Re: [DISCUSS] Statistics through the C data interface" on Wed, 29 May 2024 
>00:15:49 -,
>  "Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)"  
>wrote:
>
>> Thanks for addressing the feedback! I didn't know that an
>> Arrow IPC `Message` (not just Schema) could also contain
>> `custom_metadata` -- thanks for pointing it out.
>> 
>>> Based on the list, how about standardizing both of the
>>> followings for statistics?
>>>
>>> 1. Apache Arrow schema for statistics that is used by
>>>separated statistics getter API
>>> 2. "ARROW:statistics" metadata format that can be used in
>>>Apache Arrow schema metadata
>>>
>>> Users can use 1. and/or 2. based on their use cases.
>> 
>> This sounds good to me. Using JSON to represent the metadata
>> for #2 also sounds reasonable. I think elsewhere on this
>> thread, Weston mentioned that we could alternatively use
>> the schema defined for #1 and directly use that to encode
>> the schema metadata as an Arrow IPC RecordBatch:
>> 
>>> This has been something that has always been desired for the Arrow IPC
>>> format too.
>>>
>>> My preference would be (apologies if this has been mentioned before):
>>>
>>> - Agree on how statistics should be encoded into an array (this is not
>>>   hard, we just have to agree on the field order and the data type for
>>>   null_count)
>>> - If you need statistics in the schema then simply encode the 1-row batch
>>>   into an IPC buffer (using the streaming format) or maybe just an IPC
>>>   RecordBatch message since the schema is fixed and store those bytes in the
>>>   schema
>> 
>> This would avoid having to define a separate "schema" for
>> the JSON metadata, but might be more effort to work with in
>> certain contexts (e.g. a library that currently only needs the
>> C data interface would now also have to learn how to parse
>> Arrow IPC).
>> 
>> If we do go down the JSON route, how about something like
>> this to avoid defining the keys for all possible statistics up
>> front:
>> 
>>   Schema {
>> custom_metadata: {
>>   "ARROW:statistics" => "[ { \"key\": \"row_count\", \"value\": 29, \"value_type\": \"uint64\", \"is_approximate\"

Re: [DISCUSS] Statistics through the C data interface

2024-05-29 Thread Sutou Kouhei
Hi,

>> - If you need statistics in the schema then simply encode the 1-row batch
>>   into an IPC buffer (using the streaming format) or maybe just an IPC
>>   RecordBatch message since the schema is fixed and store those bytes in the
>>   schema
> 
> This would avoid having to define a separate "schema" for
> the JSON metadata

Right. What I'm worried about with this approach is that
this may not match with the C data interface.

In the C data interface, we don't use the IPC format. If we
want to transmit statistics with schema through the C data
interface, we need to mix the IPC format and the C data
interface. (This is why I used the address in my first
proposal.)

Note that we can use separated API to transmit statistics
instead of embedding statistics into schema for this case.

I thought using JSON is easier to use for both of the IPC
format and the C data interface. Statistics data will not be
large. So this will not affect performance.


> If we do go down the JSON route, how about something like
> this to avoid defining the keys for all possible statistics up
> front:
> 
>   Schema {
> custom_metadata: {
>   "ARROW:statistics" => "[ { \"key\": \"row_count\", \"value\": 29, 
> \"value_type\": \"uint64\", \"is_approximate\": false } ]"
> }
>   }
> 
> It's more verbose, but more closely mirrors the Arrow array
> schema defined for statistics getter APIs. This could make it
> easier to translate between the two.

Thanks. I didn't think of it.
It makes sense.


Thanks,
-- 
kou

In <665673b500015f5808ce0...@message.bloomberg.net>
  "Re: [DISCUSS] Statistics through the C data interface" on Wed, 29 May 2024 
00:15:49 -,
  "Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)"  
wrote:

> Thanks for addressing the feedback! I didn't know that an
> Arrow IPC `Message` (not just Schema) could also contain
> `custom_metadata` -- thanks for pointing it out.
> 
>> Based on the list, how about standardizing both of the
>> followings for statistics?
>>
>> 1. Apache Arrow schema for statistics that is used by
>>separated statistics getter API
>> 2. "ARROW:statistics" metadata format that can be used in
>>Apache Arrow schema metadata
>>
>> Users can use 1. and/or 2. based on their use cases.
> 
> This sounds good to me. Using JSON to represent the metadata
> for #2 also sounds reasonable. I think elsewhere on this
> thread, Weston mentioned that we could alternatively use
> the schema defined for #1 and directly use that to encode
> the schema metadata as an Arrow IPC RecordBatch:
> 
>> This has been something that has always been desired for the Arrow IPC
>> format too.
>>
>> My preference would be (apologies if this has been mentioned before):
>>
>> - Agree on how statistics should be encoded into an array (this is not
>>   hard, we just have to agree on the field order and the data type for
>>   null_count)
>> - If you need statistics in the schema then simply encode the 1-row batch
>>   into an IPC buffer (using the streaming format) or maybe just an IPC
>>   RecordBatch message since the schema is fixed and store those bytes in the
>>   schema
> 
> This would avoid having to define a separate "schema" for
> the JSON metadata, but might be more effort to work with in
> certain contexts (e.g. a library that currently only needs the
> C data interface would now also have to learn how to parse
> Arrow IPC).
> 
> If we do go down the JSON route, how about something like
> this to avoid defining the keys for all possible statistics up
> front:
> 
>   Schema {
> custom_metadata: {
>   "ARROW:statistics" => "[ { \"key\": \"row_count\", \"value\": 29, 
> \"value_type\": \"uint64\", \"is_approximate\": false } ]"
> }
>   }
> 
> It's more verbose, but more closely mirrors the Arrow array
> schema defined for statistics getter APIs. This could make it
> easier to translate between the two.
> 
> Thanks,
> Shoumyo
> 
> From: dev@arrow.apache.org At: 05/26/24 21:48:52 UTC-4:00To:  
> dev@arrow.apache.org
> Subject: Re: [DISCUSS] Statistics through the C data interface
> 
>>Hi,
>>
>>> To start, data might be sourced in various manners:
>>> 
>>> - Arrow IPC files may be mapped from shared memory
>>> - Arrow IPC streams may be received via some RPC framework (à la Flight)
>>> - The Arrow libraries may be used to read from file formats like Parquet or CSV

Re: [DISCUSS] Statistics through the C data interface

2024-05-28 Thread Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)
Thanks for addressing the feedback! I didn't know that an
Arrow IPC `Message` (not just Schema) could also contain
`custom_metadata` -- thanks for pointing it out.

> Based on the list, how about standardizing both of the
> followings for statistics?
>
> 1. Apache Arrow schema for statistics that is used by
>separated statistics getter API
> 2. "ARROW:statistics" metadata format that can be used in
>Apache Arrow schema metadata
>
> Users can use 1. and/or 2. based on their use cases.

This sounds good to me. Using JSON to represent the metadata
for #2 also sounds reasonable. I think elsewhere on this
thread, Weston mentioned that we could alternatively use
the schema defined for #1 and directly use that to encode
the schema metadata as an Arrow IPC RecordBatch:

> This has been something that has always been desired for the Arrow IPC
> format too.
>
> My preference would be (apologies if this has been mentioned before):
>
> - Agree on how statistics should be encoded into an array (this is not
>   hard, we just have to agree on the field order and the data type for
>   null_count)
> - If you need statistics in the schema then simply encode the 1-row batch
>   into an IPC buffer (using the streaming format) or maybe just an IPC
>   RecordBatch message since the schema is fixed and store those bytes in the
>   schema

This would avoid having to define a separate "schema" for
the JSON metadata, but might be more effort to work with in
certain contexts (e.g. a library that currently only needs the
C data interface would now also have to learn how to parse
Arrow IPC).

If we do go down the JSON route, how about something like
this to avoid defining the keys for all possible statistics up
front:

  Schema {
custom_metadata: {
  "ARROW:statistics" => "[ { \"key\": \"row_count\", \"value\": 29, 
\"value_type\": \"uint64\", \"is_approximate\": false } ]"
}
  }

It's more verbose, but more closely mirrors the Arrow array
schema defined for statistics getter APIs. This could make it
easier to translate between the two.

Thanks,
Shoumyo

From: dev@arrow.apache.org At: 05/26/24 21:48:52 UTC-4:00To:  
dev@arrow.apache.org
Subject: Re: [DISCUSS] Statistics through the C data interface

>Hi,
>
>> To start, data might be sourced in various manners:
>> 
>> - Arrow IPC files may be mapped from shared memory
>> - Arrow IPC streams may be received via some RPC framework (à la Flight)
>> - The Arrow libraries may be used to read from file formats like Parquet or CSV
>> - ADBC drivers may be used to read from databases
>
>Thanks for listing it.
>
>Regarding to the first case:
>
>Using schema metadata may be a reasonable approach because
>the Arrow data will be on the page cache. There is no
>significant read cost. We don't need to read statistics
>before the Arrow data is ready.
>
>But if the Arrow data will not be produced based on
>statistics of the Arrow data, separated statistics get API
>may be better.
>
>Regarding to the second case:
>
>Schema metadata is an approach for it but we can choose
>other approaches for this case. For example, Flight has
>FlightData::app_metadata[1] and Arrow IPC message has
>custom_metadata[2] as Dewey mentioned.
>
>[1] 
>https://github.com/apache/arrow/blob/1c9e393b73195840960dfb9eca8c0dc390be751a/format/Flight.proto#L512-L515
>[2] 
>https://github.com/apache/arrow/blob/1c9e393b73195840960dfb9eca8c0dc390be751a/format/Message.fbs#L154
>
>Regarding to the third case:
>
>Reader objects will provide statistics. For example,
>parquet::ColumnChunkMetaData::statistics()
>(parquet::ParquetFileReader::metadata()->RowGroup(X)->ColumnChunk(Y)->statistics())
>will provide statistics.
>
>Regarding to the forth case:
>
>We can use ADBC API.
>
>
>Based on the list, how about standardizing both of the
>followings for statistics?
>
>1. Apache Arrow schema for statistics that is used by
>   separated statistics getter API
>2. "ARROW:statistics" metadata format that can be used in
>   Apache Arrow schema metadata
>
>Users can use 1. and/or 2. based on their use cases.
>
>Regarding to 2.: How about the following?
>
>This uses Field::custom_metadata[3] and
>Schema::custom_metadata[4].
>
>[3] https://github.com/apache/arrow/blob/main/format/Schema.fbs#L528-L529
>[4] 
>https://github.com/apache/arrow/blob/1c9e393b73195840960dfb9eca8c0dc390be751a/format/Schema.fbs#L563-L564
>
>"ARROW:statistics" in Field::custom_metadata represents
>column-level statistics. It uses JSON like we did for
>"

Re: [DISCUSS] Statistics through the C data interface

2024-05-26 Thread Sutou Kouhei
Hi,

> It is usually fine but
> occasionally ends up with schema metadata that is lying (e.g., when
> unifying schemas from multiple files in a dataset, I believe pyarrow
> will sometimes assign metadata from one file to the entire dataset
> and/or propagate it through projections/filters).

Good point. I think that a process that unifies schemas
should remove (or merge if possible) statistics metadata. If
we standardize statistics, the process can do it. For
example, the process can always remove "ARROW:statistics"
metadata when we use "ARROW:statistics" for statistics.
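
For example, a minimal sketch in Python of what such a unification
step could do, assuming the statistics live under an
"ARROW:statistics" metadata key:

  import pyarrow as pa

  def drop_statistics_metadata(schema: pa.Schema) -> pa.Schema:
      # Drop table-level statistics (a real implementation could try to
      # merge them instead when that is possible).
      metadata = dict(schema.metadata or {})
      metadata.pop(b"ARROW:statistics", None)
      # Drop column-level statistics as well.
      fields = []
      for field in schema:
          field_metadata = dict(field.metadata or {})
          field_metadata.pop(b"ARROW:statistics", None)
          fields.append(field.with_metadata(field_metadata))
      return pa.schema(fields, metadata=metadata)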


Thanks,
-- 
kou

In 
  "Re: [DISCUSS] Statistics through the C data interface" on Thu, 23 May 2024 
15:14:49 -0300,
  Dewey Dunnington  wrote:

> Thanks Shoumyo for bringing this up!
> 
> Using a schema to transmit statistica/data dependent values is also
> something we do in GeoParquet (whose schema also finds its way into
> pyarrow and the C data interface when reading). It is usually fine but
> occasionally ends up with schema metadata that is lying (e.g., when
> unifying schemas from multiple files in a dataset, I believe pyarrow
> will sometimes assign metadata from one file to the entire dataset
> and/or propagate it through projections/filters).
> 
> I imagine statistics would be opt-in (i.e., a consumer would have to
> explicitly request them), in which case that consumer could possibly
> be required to remove them. With the custom format string that was
> proposed I think this is unlikely to happen; however, that a consumer
> might want to know statistics over IPC too is an excellent point.
> 
>> Unless there are other ways of producing stream-level application metadata 
>> outside of the schema/field metadata
> 
> Technically there is message-level metadata in the IPC flatbuffers,
> although I don't believe it is accessible from most IPC readers. That
> mechanism isn't available from an ArrowArrayStream and so it might not
> help with the specific case at hand.
> 
>> nowhere is it mentioned that metadata must be used to determine schema 
>> equivalence
> 
> I am only familiar with a few implementations, but at least Arrow C++
> and nanoarrow have options to ignore metadata and/or nullability
> and/or possibly field names (e.g., for a list type) depending on what
> type of type/schema equivalence is required.
> 
>> use cases where you want to know the schema *before* the data is produced.
> 
> I may be understanding it incorrectly, but I think it's generally
> possible to emit a schema with metadata before emitting record
> batches. I suppose you would have already started downloading the
> stream, though.
> 
>> I think what we are slowly converging on is the need for a spec to
>> describe the encoding of Arrow array statistics as Arrow arrays.
> 
> +1 (this will be helpful however we decide to transmit statistics)
> 
> On Thu, May 23, 2024 at 1:57 PM Antoine Pitrou  wrote:
>>
>>
>> Hi Shoumyo,
>>
>> The problem with communicating data statistics through schema metadata
>> is that it's not compatible with use cases where you want to know the
>> schema *before* the data is produced.
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On Thu, 23 May 2024 14:28:43 -
>> "Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)"
>>  wrote:
>> > This is a really exciting development, thank you for putting together this 
>> > proposal!
>> >
>> > It looks like this thread and the linked GitHub issue has lots of input 
>> > from folks who work with Arrow at a low level and have better familiarity 
>> > with the Arrow specifications than I do, so I'll refrain from commenting 
>> > on the technicalities of the proposal. I would, however, like to share my 
>> > perspective as an application developer that heavily uses Arrow at higher 
>> > levels for composing data systems.
>> >
>> > My main concern with the direction of this proposal is that it seems too 
>> > narrowly focused on what the integration with DuckDB will look like (how 
>> > the statistics can be fed into DuckDB). In many applications, executing 
>> > the query is often the "last mile", and it's important to consider where 
>> > the statistics will actually come from. To start, data might be sourced in 
>> > various manners:
>> >
>> > - Arrow IPC files may be mapped from shared memory
>> > - Arrow IPC streams may be received via some RPC framework (à la Flight)
>> > - The Arrow libraries may be used to read from file formats like Parquet 
>> > or 

Re: [DISCUSS] Statistics through the C data interface

2024-05-26 Thread Sutou Kouhei
Hi,

> To start, data might be sourced in various manners:
> 
> - Arrow IPC files may be mapped from shared memory
> - Arrow IPC streams may be received via some RPC framework (à la Flight)
> - The Arrow libraries may be used to read from file formats like Parquet or 
> CSV
> - ADBC drivers may be used to read from databases

Thanks for listing it.

Regarding the first case:

Using schema metadata may be a reasonable approach because
the Arrow data will be on the page cache. There is no
significant read cost. We don't need to read statistics
before the Arrow data is ready.

But if the Arrow data will not be produced based on
statistics of the Arrow data, a separate statistics getter
API may be better.

Regarding the second case:

Schema metadata is an approach for it but we can choose
other approaches for this case. For example, Flight has
FlightData::app_metadata[1] and Arrow IPC message has
custom_metadata[2] as Dewey mentioned.

[1] 
https://github.com/apache/arrow/blob/1c9e393b73195840960dfb9eca8c0dc390be751a/format/Flight.proto#L512-L515
[2] 
https://github.com/apache/arrow/blob/1c9e393b73195840960dfb9eca8c0dc390be751a/format/Message.fbs#L154

Regarding the third case:

Reader objects will provide statistics. For example,
parquet::ColumnChunkMetaData::statistics()
(parquet::ParquetFileReader::metadata()->RowGroup(X)->ColumnChunk(Y)->statistics())
will provide statistics.
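
For example, from Python the same statistics are reachable via
pyarrow.parquet (the file name and indexes are only placeholders):

  import pyarrow.parquet as pq

  metadata = pq.ParquetFile("data.parquet").metadata
  stats = metadata.row_group(0).column(0).statistics
  if stats is not None and stats.has_min_max:
      print(stats.min, stats.max, stats.null_count, stats.distinct_count)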

Regarding the fourth case:

We can use ADBC API.


Based on the list, how about standardizing both of the
followings for statistics?

1. Apache Arrow schema for statistics that is used by
   separated statistics getter API
2. "ARROW:statistics" metadata format that can be used in
   Apache Arrow schema metadata

Users can use 1. and/or 2. based on their use cases.

Regarding 2.: how about the following?

This uses Field::custom_metadata[3] and
Schema::custom_metadata[4].

[3] https://github.com/apache/arrow/blob/main/format/Schema.fbs#L528-L529
[4] 
https://github.com/apache/arrow/blob/1c9e393b73195840960dfb9eca8c0dc390be751a/format/Schema.fbs#L563-L564

"ARROW:statistics" in Field::custom_metadata represents
column-level statistics. It uses JSON like we did for
"ARROW:extension:metadata"[5]. Here is an example:

  Field {
custom_metadata: {
  "ARROW:statistics" => "{\"max\": 1, \"distinct_count\": 29}"
}
  }

(JSON may not be able to represent complex information but
is it needed for statistics?)

"ARROW:statistics" in Schema::custom_metadata represents
table-level statistics. It uses JSON like we did for
"ARROW:extension:metadata"[5]. Here is an example:

  Schema {
custom_metadata: {
  "ARROW:statistics" => "{\"row_count\": 29}"
}
  }

TODO: Define the JSON content details. For example, we need
to define keys such as "distinct_count" and "row_count".


[5] 
https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types
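
As a concrete sketch of 2. in Python (pyarrow is shown only for
illustration; the keys inside the JSON are still to be defined):

  import json
  import pyarrow as pa

  column1 = pa.field("column1", pa.int32(), metadata={
      "ARROW:statistics": json.dumps({"max": 1, "distinct_count": 29}),
  })
  schema = pa.schema([column1], metadata={
      "ARROW:statistics": json.dumps({"row_count": 29}),
  })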



Thanks,
-- 
kou

In <664f529b0002a8710c430...@message.bloomberg.net>
  "Re: [DISCUSS] Statistics through the C data interface" on Thu, 23 May 2024 
14:28:43 -,
  "Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)"  
wrote:

> This is a really exciting development, thank you for putting together this 
> proposal!
> 
> It looks like this thread and the linked GitHub issue has lots of input from 
> folks who work with Arrow at a low level and have better familiarity with the 
> Arrow specifications than I do, so I'll refrain from commenting on the 
> technicalities of the proposal. I would, however, like to share my 
> perspective as an application developer that heavily uses Arrow at higher 
> levels for composing data systems.
> 
> My main concern with the direction of this proposal is that it seems too 
> narrowly focused on what the integration with DuckDB will look like (how the 
> statistics can be fed into DuckDB). In many applications, executing the query 
> is often the "last mile", and it's important to consider where the statistics 
> will actually come from. To start, data might be sourced in various manners:
> 
> - Arrow IPC files may be mapped from shared memory
> - Arrow IPC streams may be received via some RPC framework (à la Flight)
> - The Arrow libraries may be used to read from file formats like Parquet or 
> CSV
> - ADBC drivers may be used to read from databases
> 
> Note that in at least the first two cases, the system _executing the query_ 
> will not be able to provide statistics simply because it is not actually the 
> data producer. As an example, if Process A writes an Arrow IPC file to shared 
> memory, and Process B wants to run a query on it -- how is Process B supposed 
> to get the statistics for query planning? There are a few approaches that I 

Re: [DISCUSS] Statistics through the C data interface

2024-05-26 Thread Sutou Kouhei
Hi,

> ADBC might be too big of a leap in complexity now, but "we just need C
> Data Interface + statistics" is unlikely to remain true for very long
> as projects grow in complexity.

Does this mean that we will need C Data Interface +
statistics + XXX + ... for query planning and so on?

Or does this mean that ADBC like statistics schema will not
be able to cover use cases such as query planning?

If it means the former, can we provide an extra mechanism at
that time?

If it means the latter, how about adding a version to the
statistics schema? For example, we can add
'"ARROW:statistics:version" => "1.0.0"' metadata to the
statistics schema. We can define statistics schema 2.0.0
when we find a use case that isn't covered by statistics
schema 1.0.0. It doesn't break existing code because
we can use both statistics schema 1.0.0 and 2.0.0 at the
same time.


Thanks,
-- 
kou

In 
  "Re: [DISCUSS] Statistics through the C data interface" on Thu, 23 May 2024 
11:09:07 -0300,
  Felipe Oliveira Carvalho  wrote:

> I want to +1 on what Dewey is saying here and some comments.
> 
> Sutou Kouhei wrote:
>> ADBC may be a bit larger to use only for transmitting statistics. ADBC has 
>> statistics related APIs but it has more other APIs.
> 
> It's impossible to keep the responsibility of communication protocols
> cleanly separated, but IMO, we should strive to keep the C Data
> Interface more of a Transport Protocol than an Application Protocol.
> 
> Statistics are application dependent and can complicate the
> implementation of importers/exporters which would hinder the adoption
> of the C Data Interface. Statistics also bring in security concerns
> that are application-specific. e.g. can an algorithm trust min/max
> stats and risk producing incorrect results if the statistics are
> incorrect? A question that can't really be answered at the C Data
> Interface level.
> 
> The need for more sophisticated statistics only grows with time, so
> there is no such thing as a "simple statistics schema".
> 
> Protocols that produce/consume statistics might want to use the C Data
> Interface as a primitive for passing Arrow arrays of statistics.
> 
> ADBC might be too big of a leap in complexity now, but "we just need C
> Data Interface + statistics" is unlikely to remain true for very long
> as projects grow in complexity.
> 
> --
> Felipe
> 
> On Thu, May 23, 2024 at 9:57 AM Dewey Dunnington
>  wrote:
>>
>> Thank you for the background! I understand that these statistics are
>> important for query planning; however, I am not sure that I follow why
>> we are constrained to the ArrowSchema to represent them. The examples
>> given seem to going through Python...would it be easier to request
>> statistics at a higher level of abstraction? There would already need
>> to be a separate mechanism to request an ArrowArrayStream with
>> statistics (unless the PyCapsule `requested_schema` argument would
>> suffice).
>>
>> > ADBC may be a bit larger to use only for transmitting
>> > statistics. ADBC has statistics related APIs but it has more
>> > other APIs.
>>
>> Some examples of producers given in the linked threads (Delta Lake,
>> Arrow Dataset) are well-suited to being wrapped by an ADBC driver. One
>> can implement an ADBC driver without defining all the methods (where
>> the producer could call AdbcConnectionGetStatistics(), although
>> AdbcStatementGetStatistics() might be more relevant here and doesn't
>> exist). One example listed (using an Arrow Table as a source) seems a
>> bit light to wrap in an ADBC driver; however, it would not take much
>> code to do so and the overhead of getting the reader via ADBC it is
>> something like 100 microseconds (tested via the ADBC R package's
>> "monkey driver" which wraps an existing stream as a statement). In any
>> case, the bulk of the code is building the statistics array.
>>
>> > How about the following schema for the
>> > statistics ArrowArray? It's based on ADBC.
>>
>> Whatever format for statistics is decided on, I imagine it should be
>> exactly the same as the ADBC standard? (Perhaps pushing changes
>> upstream if needed?).
>>
>> On Thu, May 23, 2024 at 3:21 AM Sutou Kouhei  wrote:
>> >
>> > Hi,
>> >
>> > > Why not simply pass the statistics ArrowArray separately in your
>> > > producer API of choice
>> >
>> > It seems that we should use the approach because all
>> > feedback said so. How about the following schema for the
>> > statistics ArrowArray? It's based on ADBC.
>> >
>> > | Fi

Re: [DISCUSS] Statistics through the C data interface

2024-05-24 Thread Weston Pace
> I think what we are slowly converging on is the need for a spec to
> describe the encoding of Arrow array statistics as Arrow arrays.

This has been something that has always been desired for the Arrow IPC
format too.

My preference would be (apologies if this has been mentioned before):

- Agree on how statistics should be encoded into an array (this is not
hard, we just have to agree on the field order and the data type for
null_count)
- If you need statistics in the schema then simply encode the 1-row batch
into an IPC buffer (using the streaming format) or maybe just an IPC
RecordBatch message since the schema is fixed and store those bytes in the
schema
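
A rough sketch of that second bullet with pyarrow (the metadata key
and the statistics fields are placeholders, not an agreed format):

  import pyarrow as pa

  # A 1-row batch of statistics in whatever schema we agree on.
  stats_batch = pa.RecordBatch.from_arrays(
      [
          pa.array([None], pa.utf8()),         # column: null = whole table
          pa.array(["row_count"], pa.utf8()),
          pa.array([29], pa.int64()),
      ],
      names=["column", "key", "value"],
  )

  # Encode it with the IPC streaming format and store the bytes in the
  # schema metadata of the actual data.
  sink = pa.BufferOutputStream()
  with pa.ipc.new_stream(sink, stats_batch.schema) as writer:
      writer.write_batch(stats_batch)
  data_schema = pa.schema(
      [pa.field("column1", pa.int32())],
      metadata={b"ARROW:statistics": sink.getvalue().to_pybytes()},
  )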



On Fri, May 24, 2024 at 1:20 AM Sutou Kouhei  wrote:

> Hi,
>
> Could you explain more about your idea? Does it propose that
> we add more callbacks to ArrowArrayStream such as
> ArrowArrayStream::get_statistics()? Or Does it propose that
> we define one more Arrow C XXX interface that wraps
> ArrowArrayStream like ArrowDeviceArray wraps ArrowArray?
>
> ArrowDeviceArray:
> https://arrow.apache.org/docs/format/CDeviceDataInterface.html
>
>
> Thanks,
> --
> kou
>
> In 
>   "Re: [DISCUSS] Statistics through the C data interface" on Thu, 23 May
> 2024 06:55:40 -0700,
>   Curt Hagenlocher  wrote:
>
> >>  would it be easier to request statistics at a higher level of
> > abstraction?
> >
> > What if there were a "single table provider" level of abstraction between
> > ADBC and ArrowArrayStream as a C API; something that can report
> statistics
> > and apply simple predicates?
> >
> > On Thu, May 23, 2024 at 5:57 AM Dewey Dunnington
> >  wrote:
> >
> >> Thank you for the background! I understand that these statistics are
> >> important for query planning; however, I am not sure that I follow why
> >> we are constrained to the ArrowSchema to represent them. The examples
> >> given seem to going through Python...would it be easier to request
> >> statistics at a higher level of abstraction? There would already need
> >> to be a separate mechanism to request an ArrowArrayStream with
> >> statistics (unless the PyCapsule `requested_schema` argument would
> >> suffice).
> >>
> >> > ADBC may be a bit larger to use only for transmitting
> >> > statistics. ADBC has statistics related APIs but it has more
> >> > other APIs.
> >>
> >> Some examples of producers given in the linked threads (Delta Lake,
> >> Arrow Dataset) are well-suited to being wrapped by an ADBC driver. One
> >> can implement an ADBC driver without defining all the methods (where
> >> the producer could call AdbcConnectionGetStatistics(), although
> >> AdbcStatementGetStatistics() might be more relevant here and doesn't
> >> exist). One example listed (using an Arrow Table as a source) seems a
> >> bit light to wrap in an ADBC driver; however, it would not take much
> >> code to do so and the overhead of getting the reader via ADBC it is
> >> something like 100 microseconds (tested via the ADBC R package's
> >> "monkey driver" which wraps an existing stream as a statement). In any
> >> case, the bulk of the code is building the statistics array.
> >>
> >> > How about the following schema for the
> >> > statistics ArrowArray? It's based on ADBC.
> >>
> >> Whatever format for statistics is decided on, I imagine it should be
> >> exactly the same as the ADBC standard? (Perhaps pushing changes
> >> upstream if needed?).
> >>
> >> On Thu, May 23, 2024 at 3:21 AM Sutou Kouhei 
> wrote:
> >> >
> >> > Hi,
> >> >
> >> > > Why not simply pass the statistics ArrowArray separately in your
> >> > > producer API of choice
> >> >
> >> > It seems that we should use the approach because all
> >> > feedback said so. How about the following schema for the
> >> > statistics ArrowArray? It's based on ADBC.
> >> >
> >> > | Field Name   | Field Type| Comments |
> >> > |--|---|  |
> >> > | column_name  | utf8  | (1)  |
> >> > | statistic_key| utf8 not null | (2)  |
> >> > | statistic_value  | VALUE_SCHEMA not null |  |
> >> > | statistic_is_approximate | bool not null | (3)  |
> >> >
> >> > 1. If null, then the statistic applies to the entire table.
> >> >   

Re: [DISCUSS] Statistics through the C data interface

2024-05-24 Thread Sutou Kouhei
Hi,

Could you explain more about your idea? Does it propose that
we add more callbacks to ArrowArrayStream such as
ArrowArrayStream::get_statistics()? Or does it propose that
we define one more Arrow C XXX interface that wraps
ArrowArrayStream like ArrowDeviceArray wraps ArrowArray?

ArrowDeviceArray:
https://arrow.apache.org/docs/format/CDeviceDataInterface.html


Thanks,
-- 
kou

In 
  "Re: [DISCUSS] Statistics through the C data interface" on Thu, 23 May 2024 
06:55:40 -0700,
  Curt Hagenlocher  wrote:

>>  would it be easier to request statistics at a higher level of
> abstraction?
> 
> What if there were a "single table provider" level of abstraction between
> ADBC and ArrowArrayStream as a C API; something that can report statistics
> and apply simple predicates?
> 
> On Thu, May 23, 2024 at 5:57 AM Dewey Dunnington
>  wrote:
> 
>> Thank you for the background! I understand that these statistics are
>> important for query planning; however, I am not sure that I follow why
>> we are constrained to the ArrowSchema to represent them. The examples
>> given seem to going through Python...would it be easier to request
>> statistics at a higher level of abstraction? There would already need
>> to be a separate mechanism to request an ArrowArrayStream with
>> statistics (unless the PyCapsule `requested_schema` argument would
>> suffice).
>>
>> > ADBC may be a bit larger to use only for transmitting
>> > statistics. ADBC has statistics related APIs but it has more
>> > other APIs.
>>
>> Some examples of producers given in the linked threads (Delta Lake,
>> Arrow Dataset) are well-suited to being wrapped by an ADBC driver. One
>> can implement an ADBC driver without defining all the methods (where
>> the producer could call AdbcConnectionGetStatistics(), although
>> AdbcStatementGetStatistics() might be more relevant here and doesn't
>> exist). One example listed (using an Arrow Table as a source) seems a
>> bit light to wrap in an ADBC driver; however, it would not take much
>> code to do so and the overhead of getting the reader via ADBC it is
>> something like 100 microseconds (tested via the ADBC R package's
>> "monkey driver" which wraps an existing stream as a statement). In any
>> case, the bulk of the code is building the statistics array.
>>
>> > How about the following schema for the
>> > statistics ArrowArray? It's based on ADBC.
>>
>> Whatever format for statistics is decided on, I imagine it should be
>> exactly the same as the ADBC standard? (Perhaps pushing changes
>> upstream if needed?).
>>
>> On Thu, May 23, 2024 at 3:21 AM Sutou Kouhei  wrote:
>> >
>> > Hi,
>> >
>> > > Why not simply pass the statistics ArrowArray separately in your
>> > > producer API of choice
>> >
>> > It seems that we should use the approach because all
>> > feedback said so. How about the following schema for the
>> > statistics ArrowArray? It's based on ADBC.
>> >
>> > | Field Name   | Field Type| Comments |
>> > |--|---|  |
>> > | column_name  | utf8  | (1)  |
>> > | statistic_key| utf8 not null | (2)  |
>> > | statistic_value  | VALUE_SCHEMA not null |  |
>> > | statistic_is_approximate | bool not null | (3)  |
>> >
>> > 1. If null, then the statistic applies to the entire table.
>> >It's for "row_count".
>> > 2. We'll provide pre-defined keys such as "max", "min",
>> >"byte_width" and "distinct_count" but users can also use
>> >application specific keys.
>> > 3. If true, then the value is approximate or best-effort.
>> >
>> > VALUE_SCHEMA is a dense union with members:
>> >
>> > | Field Name | Field Type |
>> > |||
>> > | int64  | int64  |
>> > | uint64 | uint64 |
>> > | float64| float64|
>> > | binary | binary |
>> >
>> > If a column is an int32 column, it uses int64 for
>> > "max"/"min". We don't provide all types here. Users should
>> > use a compatible type (int64 for a int32 column) instead.
>> >
>> >
>> > Thanks,
>> > --
>> > kou
>> >
>> > In 
>> >   "Re: [DISCUSS] Statistics through the C data interface" on Wed, 22 May
>> 2024 17

Re: [DISCUSS] Statistics through the C data interface

2024-05-24 Thread Sutou Kouhei
Hi,

>I am not sure that I follow why
> we are constrained to the ArrowSchema to represent them.

Ah, sorry. Using ArrowSchema isn't required. It's just one
idea. We can choose another approach, such as just defining
a schema for the statistics ArrowArray as I proposed.

> Whatever format for statistics is decided on, I imagine it should be
> exactly the same as the ADBC standard? (Perhaps pushing changes
> upstream if needed?).

I think that we can use a simpler one for this. For example,
ADBC uses a dictionary-encoding-like approach for statistics
keys. It requires an additional ID-to-name mapping for
application-specific statistics keys. We can just use names
for them.
See also the related discussion on the issue:
https://github.com/apache/arrow/issues/38837#issuecomment-2108895904


Thanks,
-- 
kou

In 
  "Re: [DISCUSS] Statistics through the C data interface" on Thu, 23 May 2024 
09:57:05 -0300,
  Dewey Dunnington  wrote:

> Thank you for the background! I understand that these statistics are
> important for query planning; however, I am not sure that I follow why
> we are constrained to the ArrowSchema to represent them. The examples
> given seem to going through Python...would it be easier to request
> statistics at a higher level of abstraction? There would already need
> to be a separate mechanism to request an ArrowArrayStream with
> statistics (unless the PyCapsule `requested_schema` argument would
> suffice).
> 
>> ADBC may be a bit larger to use only for transmitting
>> statistics. ADBC has statistics related APIs but it has more
>> other APIs.
> 
> Some examples of producers given in the linked threads (Delta Lake,
> Arrow Dataset) are well-suited to being wrapped by an ADBC driver. One
> can implement an ADBC driver without defining all the methods (where
> the producer could call AdbcConnectionGetStatistics(), although
> AdbcStatementGetStatistics() might be more relevant here and doesn't
> exist). One example listed (using an Arrow Table as a source) seems a
> bit light to wrap in an ADBC driver; however, it would not take much
> code to do so and the overhead of getting the reader via ADBC it is
> something like 100 microseconds (tested via the ADBC R package's
> "monkey driver" which wraps an existing stream as a statement). In any
> case, the bulk of the code is building the statistics array.
> 
>> How about the following schema for the
>> statistics ArrowArray? It's based on ADBC.
> 
> Whatever format for statistics is decided on, I imagine it should be
> exactly the same as the ADBC standard? (Perhaps pushing changes
> upstream if needed?).
> 
> On Thu, May 23, 2024 at 3:21 AM Sutou Kouhei  wrote:
>>
>> Hi,
>>
>> > Why not simply pass the statistics ArrowArray separately in your
>> > producer API of choice
>>
>> It seems that we should use the approach because all
>> feedback said so. How about the following schema for the
>> statistics ArrowArray? It's based on ADBC.
>>
>> | Field Name   | Field Type| Comments |
>> |--|---|  |
>> | column_name  | utf8  | (1)  |
>> | statistic_key| utf8 not null | (2)  |
>> | statistic_value  | VALUE_SCHEMA not null |  |
>> | statistic_is_approximate | bool not null | (3)  |
>>
>> 1. If null, then the statistic applies to the entire table.
>>It's for "row_count".
>> 2. We'll provide pre-defined keys such as "max", "min",
>>"byte_width" and "distinct_count" but users can also use
>>application specific keys.
>> 3. If true, then the value is approximate or best-effort.
>>
>> VALUE_SCHEMA is a dense union with members:
>>
>> | Field Name | Field Type |
>> ||----|
>> | int64      | int64  |
>> | uint64 | uint64 |
>> | float64| float64|
>> | binary | binary |
>>
>> If a column is an int32 column, it uses int64 for
>> "max"/"min". We don't provide all types here. Users should
>> use a compatible type (int64 for a int32 column) instead.
>>
>>
>> Thanks,
>> --
>> kou
>>
>> In 
>>   "Re: [DISCUSS] Statistics through the C data interface" on Wed, 22 May 
>> 2024 17:04:57 +0200,
>>   Antoine Pitrou  wrote:
>>
>> >
>> > Hi Kou,
>> >
>> > I agree that Dewey that this is overstretching the capabilities of the
>> > C Data Interface. In particular, stuffing a pointer as m

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Aldrin
For what it's worth, duckdb accesses arrow data via IPC in an extension then 
exports to the C data interface to call into code in its core.
Also, assumptions about when query optimization occurs relative to data access 
potentially break down in scenarios involving: views, distributed tables, 
substrait and decomposed query engines, and a few others.
 Sent from Proton Mail for iOS 
On Thu, May 23, 2024 at 13:28, Shoumyo Chakravorti (BLOOMBERG/ 120 PARK) 
schakravo...@bloomberg.net wrote:

Appreciate the additional context!

 use cases where you want to know the schema *before*
 the data is produced

I think my understanding aligns with Dewey's on this point.
I guess I'm struggling to imagine a scenario where a query
planner would want the schema but not the statistics. Because
by the time the query engine starts consuming data, the plan
should've already been optimized, which implies that the
statistics should've come earlier at some point (so having
it in the schema wouldn't hurt per se). But please correct
me if I misunderstood.

 It is usually fine but occasionally ends up with schema
 metadata that is lying

This is a totally valid point and I'm definitely aware
of it - there would be an onus on developers to make sure
that they're not plumbing around nonsensical metadata. And
to your point, making the production of statistics opt-in
would make this decision explicit.

I guess the other saving grace is that query optimization
should never affect the *correctness* of a query, only its
performance. However, I can appreciate that it would be
difficult to diagnose a query being slow just because of bad
metadata.

 Technically there is message-level metadata in the IPC
 flatbuffers... That mechanism isn't available from an
 ArrowArrayStream and so it might not help with the specific
 case at hand.

Gotcha. So it sounds like the schema and field metadata are
the only ones available at the "top" level in Arrow IPC
streams or files; glad to know we didn't miss something :)

As mentioned earlier, my understanding is that query
optimization happens in its entirety before the query engine
consumes any actual data. So I believe the schema- and
field-level metadata are the only ones relevant for the
use-case being considered anyway.

Taking a step back -- my thought process was that if there
is a case for transmitting statistics over Arrow IPC, then
it would be nice to have a consistent solution in the C data
interface as well. Using schema metadata just seemed like
one approach that would achieve this goal.

Best,
Shoumyo

From: dev@arrow.apache.org At: 05/23/24 14:16:32 UTC-4:00To:  
dev@arrow.apache.org
Subject: Re: [DISCUSS] Statistics through the C data interface

Thanks Shoumyo for bringing this up!

Using a schema to transmit statistica/data dependent values is also
something we do in GeoParquet (whose schema also finds its way into
pyarrow and the C data interface when reading). It is usually fine but
occasionally ends up with schema metadata that is lying (e.g., when
unifying schemas from multiple files in a dataset, I believe pyarrow
will sometimes assign metadata from one file to the entire dataset
and/or propagate it through projections/filters).

I imagine statistics would be opt-in (i.e., a consumer would have to
explicitly request them), in which case that consumer could possibly
be required to remove them. With the custom format string that was
proposed I think this is unlikely to happen; however, that a consumer
might want to know statistics over IPC too is an excellent point.

 Unless there are other ways of producing stream-level application metadata
outside of the schema/field metadata

Technically there is message-level metadata in the IPC flatbuffers,
although I don't believe it is accessible from most IPC readers. That
mechanism isn't available from an ArrowArrayStream and so it might not
help with the specific case at hand.

 nowhere is it mentioned that metadata must be used to determine schema
equivalence

I am only familiar with a few implementations, but at least Arrow C++
and nanoarrow have options to ignore metadata and/or nullability
and/or possibly field names (e.g., for a list type) depending on what
type of type/schema equivalence is required.

 use cases where you want to know the schema *before* the data is produced.

I may be understanding it incorrectly, but I think it's generally
possible to emit a schema with metadata before emitting record
batches. I suppose you would have already started downloading the
stream, though.

 I think what we are slowly converging on is the need for a spec to
 describe the encoding of Arrow array statistics as Arrow arrays.

+1 (this will be helpful however we decide to transmit statistics)

On Thu, May 23, 2024 at 1:57 PM Antoine Pitrou anto...@python.org wrote:


 Hi Shoumyo,

 The problem with communicating data statistics through schema metadata
 is that it's not compatible with use cases where you want to know the
 schema *before

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)
Appreciate the additional context!

> use cases where you want to know the schema *before*
> the data is produced

I think my understanding aligns with Dewey's on this point.
I guess I'm struggling to imagine a scenario where a query
planner would want the schema but not the statistics. Because
by the time the query engine starts consuming data, the plan
should've already been optimized, which implies that the
statistics should've come earlier at some point (so having
it in the schema wouldn't hurt per se). But please correct
me if I misunderstood.

> It is usually fine but occasionally ends up with schema
> metadata that is lying

This is a totally valid point and I'm definitely aware
of it - there would be an onus on developers to make sure
that they're not plumbing around nonsensical metadata. And
to your point, making the production of statistics opt-in
would make this decision explicit.

I guess the other saving grace is that query optimization
should never affect the *correctness* of a query, only its
performance. However, I can appreciate that it would be
difficult to diagnose a query being slow just because of bad
metadata.

> Technically there is message-level metadata in the IPC
> flatbuffers... That mechanism isn't available from an
> ArrowArrayStream and so it might not help with the specific
> case at hand.

Gotcha. So it sounds like the schema and field metadata are
the only ones available at the "top" level in Arrow IPC
streams or files; glad to know we didn't miss something :)

As mentioned earlier, my understanding is that query
optimization happens in its entirety before the query engine
consumes any actual data. So I believe the schema- and
field-level metadata are the only ones relevant for the
use-case being considered anyway.

Taking a step back -- my thought process was that if there
is a case for transmitting statistics over Arrow IPC, then
it would be nice to have a consistent solution in the C data
interface as well. Using schema metadata just seemed like
one approach that would achieve this goal.

Best,
Shoumyo

From: dev@arrow.apache.org At: 05/23/24 14:16:32 UTC-4:00To:  
dev@arrow.apache.org
Subject: Re: [DISCUSS] Statistics through the C data interface

Thanks Shoumyo for bringing this up!

Using a schema to transmit statistica/data dependent values is also
something we do in GeoParquet (whose schema also finds its way into
pyarrow and the C data interface when reading). It is usually fine but
occasionally ends up with schema metadata that is lying (e.g., when
unifying schemas from multiple files in a dataset, I believe pyarrow
will sometimes assign metadata from one file to the entire dataset
and/or propagate it through projections/filters).

I imagine statistics would be opt-in (i.e., a consumer would have to
explicitly request them), in which case that consumer could possibly
be required to remove them. With the custom format string that was
proposed I think this is unlikely to happen; however, that a consumer
might want to know statistics over IPC too is an excellent point.

> Unless there are other ways of producing stream-level application metadata 
outside of the schema/field metadata

Technically there is message-level metadata in the IPC flatbuffers,
although I don't believe it is accessible from most IPC readers. That
mechanism isn't available from an ArrowArrayStream and so it might not
help with the specific case at hand.

> nowhere is it mentioned that metadata must be used to determine schema 
equivalence

I am only familiar with a few implementations, but at least Arrow C++
and nanoarrow have options to ignore metadata and/or nullability
and/or possibly field names (e.g., for a list type) depending on what
type of type/schema equivalence is required.

> use cases where you want to know the schema *before* the data is produced.

I may be understanding it incorrectly, but I think it's generally
possible to emit a schema with metadata before emitting record
batches. I suppose you would have already started downloading the
stream, though.

> I think what we are slowly converging on is the need for a spec to
> describe the encoding of Arrow array statistics as Arrow arrays.

+1 (this will be helpful however we decide to transmit statistics)

On Thu, May 23, 2024 at 1:57 PM Antoine Pitrou  wrote:
>
>
> Hi Shoumyo,
>
> The problem with communicating data statistics through schema metadata
> is that it's not compatible with use cases where you want to know the
> schema *before* the data is produced.
>
> Regards
>
> Antoine.
>
>
> On Thu, 23 May 2024 14:28:43 -
> "Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)"
>  wrote:
> > This is a really exciting development, thank you for putting together this 
proposal!
> >
> > It looks like this thread and the linked GitHub issue have lots of input 
from folks who work with Arrow at a low level

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Dewey Dunnington
> > files). In some sense, the statistics actually *are* a property of the 
> > stream.
> >
> > In systems that I work on, we already use schema metadata to communicate 
> > information that is unrelated to the structure of the data. From my reading 
> > of the documentation [1], this sounds like a reasonable (and perhaps 
> > intended?) use of metadata, and nowhere is it mentioned that metadata must 
> > be used to determine schema equivalence. Unless there are other ways of 
> > producing stream-level application metadata outside of the schema/field 
> > metadata, the lack of purity was not a concern for me to begin with.
> >
> > I would appreciate an approach that communicates statistics via schema 
> > metadata, or at least in some in-band fashion that is consistent across the 
> > IPC and C data specifications. This would make it much easier to uniformly 
> > and transparently plumb statistics through applications, regardless of 
> > where they source Arrow data from. As developers are likely to create 
> > bespoke conventions for this anyways, it seems reasonable to standardize it 
> > as canonical metadata.
> >
> > I say this all as a happy user of DuckDB's Arrow scan functionality that is 
> > excited to see better query optimization capabilities. It's just that, in 
> > its current form, the changes in this proposal are not something I could 
> > foreseeably integrate with.
> >
> > Best,
> > Shoumyo
> >
> > [1]: 
> > https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
> >
> > From: dev@arrow.apache.org At: 05/23/24 10:10:51 UTC-4:00 To: dev@arrow.apache.org
> > Subject: Re: [DISCUSS] Statistics through the C data interface
> >
> > I want to +1 on what Dewey is saying here and some comments.
> >
> > Sutou Kouhei wrote:
> > > ADBC may be a bit larger to use only for transmitting statistics. ADBC has
> > statistics related APIs but it has more other APIs.
> >
> > It's impossible to keep the responsibility of communication protocols
> > cleanly separated, but IMO, we should strive to keep the C Data
> > Interface more of a Transport Protocol than an Application Protocol.
> >
> > Statistics are application dependent and can complicate the
> > implementation of importers/exporters which would hinder the adoption
> > of the C Data Interface. Statistics also bring in security concerns
> > that are application-specific. e.g. can an algorithm trust min/max
> > stats and risk producing incorrect results if the statistics are
> > incorrect? A question that can't really be answered at the C Data
> > Interface level.
> >
> > The need for more sophisticated statistics only grows with time, so
> > there is no such thing as a "simple statistics schema".
> >
> > Protocols that produce/consume statistics might want to use the C Data
> > Interface as a primitive for passing Arrow arrays of statistics.
> >
> > ADBC might be too big of a leap in complexity now, but "we just need C
> > Data Interface + statistics" is unlikely to remain true for very long
> > as projects grow in complexity.
> >
> > --
> > Felipe
> >
> > On Thu, May 23, 2024 at 9:57 AM Dewey Dunnington
> >  wrote:
> > >
> > > Thank you for the background! I understand that these statistics are
> > > important for query planning; however, I am not sure that I follow why
> > > we are constrained to the ArrowSchema to represent them. The examples
> > > given seem to be going through Python... would it be easier to request
> > > statistics at a higher level of abstraction? There would already need
> > > to be a separate mechanism to request an ArrowArrayStream with
> > > statistics (unless the PyCapsule `requested_schema` argument would
> > > suffice).
> > >
> > > > ADBC may be a bit larger to use only for transmitting
> > > > statistics. ADBC has statistics related APIs but it has more
> > > > other APIs.
> > >
> > > Some examples of producers given in the linked threads (Delta Lake,
> > > Arrow Dataset) are well-suited to being wrapped by an ADBC driver. One
> > > can implement an ADBC driver without defining all the methods (where
> > > the producer could call AdbcConnectionGetStatistics(), although
> > > AdbcStatementGetStatistics() might be more relevant here and doesn't
> > > exist). One example listed (using an Arrow Table as a source) seems a
> > > bit light to wrap in an ADBC driver; however, it would not take much

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Antoine Pitrou


Hi Shoumyo,

The problem with communicating data statistics through schema metadata
is that it's not compatible with use cases where you want to know the
schema *before* the data is produced.

Regards

Antoine.


On Thu, 23 May 2024 14:28:43 -
"Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)"
 wrote:
> This is a really exciting development, thank you for putting together this 
> proposal!
> 
> It looks like this thread and the linked GitHub issue have lots of input from 
> folks who work with Arrow at a low level and have better familiarity with the 
> Arrow specifications than I do, so I'll refrain from commenting on the 
> technicalities of the proposal. I would, however, like to share my 
> perspective as an application developer that heavily uses Arrow at higher 
> levels for composing data systems.
> 
> My main concern with the direction of this proposal is that it seems too 
> narrowly focused on what the integration with DuckDB will look like (how the 
> statistics can be fed into DuckDB). In many applications, executing the query 
> is often the "last mile", and it's important to consider where the statistics 
> will actually come from. To start, data might be sourced in various manners:
> 
> - Arrow IPC files may be mapped from shared memory
> - Arrow IPC streams may be received via some RPC framework (à la Flight)
> - The Arrow libraries may be used to read from file formats like Parquet or 
> CSV
> - ADBC drivers may be used to read from databases
> 
> Note that in at least the first two cases, the system _executing the query_ 
> will not be able to provide statistics simply because it is not actually the 
> data producer. As an example, if Process A writes an Arrow IPC file to shared 
> memory, and Process B wants to run a query on it -- how is Process B supposed 
> to get the statistics for query planning? There are a few approaches that I 
> anticipate application developers might consider:
> 
> 1. Design an out-of-band mechanism for Process B to fetch statistics from 
> Process A.
> 2. Design an encoding that is a superset of Arrow IPC and includes statistics 
> information, allowing statistics to be communicated in-band.
> 3. Use custom schema metadata to communicate statistics in-band.
> 
> Options 1 and 2 require considerably more effort than Option 3. Also, Option 
> 3 feels somewhat natural because it makes sense for the statistics to come 
> with the data (similar to how statistics are embedded in Parquet files). In 
> some sense, the statistics actually *are* a property of the stream.
> 
> In systems that I work on, we already use schema metadata to communicate 
> information that is unrelated to the structure of the data. From my reading 
> of the documentation [1], this sounds like a reasonable (and perhaps 
> intended?) use of metadata, and nowhere is it mentioned that metadata must be 
> used to determine schema equivalence. Unless there are other ways of 
> producing stream-level application metadata outside of the schema/field 
> metadata, the lack of purity was not a concern for me to begin with.
> 
> I would appreciate an approach that communicates statistics via schema 
> metadata, or at least in some in-band fashion that is consistent across the 
> IPC and C data specifications. This would make it much easier to uniformly 
> and transparently plumb statistics through applications, regardless of where 
> they source Arrow data from. As developers are likely to create bespoke 
> conventions for this anyways, it seems reasonable to standardize it as 
> canonical metadata.
> 
> I say this all as a happy user of DuckDB's Arrow scan functionality that is 
> excited to see better query optimization capabilities. It's just that, in its 
> current form, the changes in this proposal are not something I could 
> foreseeably integrate with.
> 
> Best,
> Shoumyo
> 
> [1]: 
> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
> 
> From: dev@arrow.apache.org At: 05/23/24 10:10:51 UTC-4:00 To: dev@arrow.apache.org
> Subject: Re: [DISCUSS] Statistics through the C data interface
> 
> I want to +1 on what Dewey is saying here and some comments.
> 
> Sutou Kouhei wrote:
> > ADBC may be a bit larger to use only for transmitting statistics. ADBC has  
> >  
> statistics related APIs but it has more other APIs.
> 
> It's impossible to keep the responsibility of communication protocols
> cleanly separated, but IMO, we should strive to keep the C Data
> Interface more of a Transport Protocol than an Application Protocol.
> 
> Statistics are application dependent and can complicate the
> implementation of importers/exporters which would hinder the adoption
> of the C Data Interface. Statist

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)
This is a really exciting development, thank you for putting together this 
proposal!

It looks like this thread and the linked GitHub issue have lots of input from 
folks who work with Arrow at a low level and have better familiarity with the 
Arrow specifications than I do, so I'll refrain from commenting on the 
technicalities of the proposal. I would, however, like to share my perspective 
as an application developer that heavily uses Arrow at higher levels for 
composing data systems.

My main concern with the direction of this proposal is that it seems too 
narrowly focused on what the integration with DuckDB will look like (how the 
statistics can be fed into DuckDB). In many applications, executing the query 
is often the "last mile", and it's important to consider where the statistics 
will actually come from. To start, data might be sourced in various manners:

- Arrow IPC files may be mapped from shared memory
- Arrow IPC streams may be received via some RPC framework (à la Flight)
- The Arrow libraries may be used to read from file formats like Parquet or CSV
- ADBC drivers may be used to read from databases

Note that in at least the first two cases, the system _executing the query_ 
will not be able to provide statistics simply because it is not actually the 
data producer. As an example, if Process A writes an Arrow IPC file to shared 
memory, and Process B wants to run a query on it -- how is Process B supposed 
to get the statistics for query planning? There are a few approaches that I 
anticipate application developers might consider:

1. Design an out-of-band mechanism for Process B to fetch statistics from 
Process A.
2. Design an encoding that is a superset of Arrow IPC and includes statistics 
information, allowing statistics to be communicated in-band.
3. Use custom schema metadata to communicate statistics in-band.

Options 1 and 2 require considerably more effort than Option 3. Also, Option 3 
feels somewhat natural because it makes sense for the statistics to come with 
the data (similar to how statistics are embedded in Parquet files). In some 
sense, the statistics actually *are* a property of the stream.

In systems that I work on, we already use schema metadata to communicate 
information that is unrelated to the structure of the data. From my reading of 
the documentation [1], this sounds like a reasonable (and perhaps intended?) 
use of metadata, and nowhere is it mentioned that metadata must be used to 
determine schema equivalence. Unless there are other ways of producing 
stream-level application metadata outside of the schema/field metadata, the 
lack of purity was not a concern for me to begin with.

I would appreciate an approach that communicates statistics via schema 
metadata, or at least in some in-band fashion that is consistent across the IPC 
and C data specifications. This would make it much easier to uniformly and 
transparently plumb statistics through applications, regardless of where they 
source Arrow data from. As developers are likely to create bespoke conventions 
for this anyways, it seems reasonable to standardize it as canonical metadata.

I say this all as a happy user of DuckDB's Arrow scan functionality that is 
excited to see better query optimization capabilities. It's just that, in its 
current form, the changes in this proposal are not something I could 
foreseeably integrate with.

Best,
Shoumyo

[1]: 
https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata

From: dev@arrow.apache.org At: 05/23/24 10:10:51 UTC-4:00 To: dev@arrow.apache.org
Subject: Re: [DISCUSS] Statistics through the C data interface

I want to +1 on what Dewey is saying here and some comments.

Sutou Kouhei wrote:
> ADBC may be a bit larger to use only for transmitting statistics. ADBC has 
statistics related APIs but it has more other APIs.

It's impossible to keep the responsibility of communication protocols
cleanly separated, but IMO, we should strive to keep the C Data
Interface more of a Transport Protocol than an Application Protocol.

Statistics are application dependent and can complicate the
implementation of importers/exporters which would hinder the adoption
of the C Data Interface. Statistics also bring in security concerns
that are application-specific. e.g. can an algorithm trust min/max
stats and risk producing incorrect results if the statistics are
incorrect? A question that can't really be answered at the C Data
Interface level.

The need for more sophisticated statistics only grows with time, so
there is no such thing as a "simple statistics schema".

Protocols that produce/consume statistics might want to use the C Data
Interface as a primitive for passing Arrow arrays of statistics.

ADBC might be too big of a leap in complexity now, but "we just need C
Data Interface + statistics" is unlikely to remain true for very long
as projects grow in complexity.

--
Felipe

On Thu, May 23, 2024 a

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Antoine Pitrou



Le 23/05/2024 à 16:09, Felipe Oliveira Carvalho a écrit :


Protocols that produce/consume statistics might want to use the C Data
Interface as a primitive for passing Arrow arrays of statistics.


This is also my opinion.

I think what we are slowly converging on is the need for a spec to 
describe the encoding of Arrow array statistics as Arrow arrays.


Regards

Antoine.


Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Felipe Oliveira Carvalho
I want to +1 what Dewey is saying here and add some comments.

Sutou Kouhei wrote:
> ADBC may be a bit larger to use only for transmitting statistics. ADBC has 
> statistics related APIs but it has more other APIs.

It's impossible to keep the responsibility of communication protocols
cleanly separated, but IMO, we should strive to keep the C Data
Interface more of a Transport Protocol than an Application Protocol.

Statistics are application dependent and can complicate the
implementation of importers/exporters which would hinder the adoption
of the C Data Interface. Statistics also bring in security concerns
that are application-specific. e.g. can an algorithm trust min/max
stats and risk producing incorrect results if the statistics are
incorrect? A question that can't really be answered at the C Data
Interface level.

The need for more sophisticated statistics only grows with time, so
there is no such thing as a "simple statistics schema".

Protocols that produce/consume statistics might want to use the C Data
Interface as a primitive for passing Arrow arrays of statistics.

ADBC might be too big of a leap in complexity now, but "we just need C
Data Interface + statistics" is unlikely to remain true for very long
as projects grow in complexity.

--
Felipe

On Thu, May 23, 2024 at 9:57 AM Dewey Dunnington
 wrote:
>
> Thank you for the background! I understand that these statistics are
> important for query planning; however, I am not sure that I follow why
> we are constrained to the ArrowSchema to represent them. The examples
> given seem to be going through Python... would it be easier to request
> statistics at a higher level of abstraction? There would already need
> to be a separate mechanism to request an ArrowArrayStream with
> statistics (unless the PyCapsule `requested_schema` argument would
> suffice).
>
> > ADBC may be a bit larger to use only for transmitting
> > statistics. ADBC has statistics related APIs but it has more
> > other APIs.
>
> Some examples of producers given in the linked threads (Delta Lake,
> Arrow Dataset) are well-suited to being wrapped by an ADBC driver. One
> can implement an ADBC driver without defining all the methods (where
> the producer could call AdbcConnectionGetStatistics(), although
> AdbcStatementGetStatistics() might be more relevant here and doesn't
> exist). One example listed (using an Arrow Table as a source) seems a
> bit light to wrap in an ADBC driver; however, it would not take much
> code to do so and the overhead of getting the reader via ADBC is
> something like 100 microseconds (tested via the ADBC R package's
> "monkey driver" which wraps an existing stream as a statement). In any
> case, the bulk of the code is building the statistics array.
>
> > How about the following schema for the
> > statistics ArrowArray? It's based on ADBC.
>
> Whatever format for statistics is decided on, I imagine it should be
> exactly the same as the ADBC standard? (Perhaps pushing changes
> upstream if needed?).
>
> On Thu, May 23, 2024 at 3:21 AM Sutou Kouhei  wrote:
> >
> > Hi,
> >
> > > Why not simply pass the statistics ArrowArray separately in your
> > > producer API of choice
> >
> > It seems that we should use the approach because all
> > feedback said so. How about the following schema for the
> > statistics ArrowArray? It's based on ADBC.
> >
> > | Field Name   | Field Type| Comments |
> > |--|---|  |
> > | column_name  | utf8  | (1)  |
> > | statistic_key| utf8 not null | (2)  |
> > | statistic_value  | VALUE_SCHEMA not null |  |
> > | statistic_is_approximate | bool not null | (3)  |
> >
> > 1. If null, then the statistic applies to the entire table.
> >It's for "row_count".
> > 2. We'll provide pre-defined keys such as "max", "min",
> >"byte_width" and "distinct_count" but users can also use
> >application specific keys.
> > 3. If true, then the value is approximate or best-effort.
> >
> > VALUE_SCHEMA is a dense union with members:
> >
> > | Field Name | Field Type |
> > ||----|
> > | int64      | int64  |
> > | uint64 | uint64 |
> > | float64| float64|
> > | binary | binary |
> >
> > If a column is an int32 column, it uses int64 for
> > "max"/"min". We don't provide all types here. Users should
> > use a compatible type (int64 for a int32 column) instead.
> >
> >
> > Thanks,

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Curt Hagenlocher
>  would it be easier to request statistics at a higher level of
abstraction?

What if there were a "single table provider" level of abstraction between
ADBC and ArrowArrayStream as a C API; something that can report statistics
and apply simple predicates?
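
To sketch the idea (purely hypothetical Python typing, not an existing
API; the method names are made up for illustration):

from typing import Any, Optional, Protocol

import pyarrow as pa


class SingleTableProvider(Protocol):
    """Hypothetical layer between ADBC and a bare ArrowArrayStream."""

    def get_schema(self) -> pa.Schema:
        """Schema of the table, available before any data is produced."""
        ...

    def get_statistics(self) -> pa.RecordBatch:
        """Statistics encoded as an Arrow record batch (schema to be defined)."""
        ...

    def scan(self, predicate: Optional[Any] = None) -> pa.RecordBatchReader:
        """Return a stream of record batches, optionally pruned by a simple predicate."""
        ...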

On Thu, May 23, 2024 at 5:57 AM Dewey Dunnington
 wrote:

> Thank you for the background! I understand that these statistics are
> important for query planning; however, I am not sure that I follow why
> we are constrained to the ArrowSchema to represent them. The examples
> given seem to be going through Python... would it be easier to request
> statistics at a higher level of abstraction? There would already need
> to be a separate mechanism to request an ArrowArrayStream with
> statistics (unless the PyCapsule `requested_schema` argument would
> suffice).
>
> > ADBC may be a bit larger to use only for transmitting
> > statistics. ADBC has statistics related APIs but it has more
> > other APIs.
>
> Some examples of producers given in the linked threads (Delta Lake,
> Arrow Dataset) are well-suited to being wrapped by an ADBC driver. One
> can implement an ADBC driver without defining all the methods (where
> the producer could call AdbcConnectionGetStatistics(), although
> AdbcStatementGetStatistics() might be more relevant here and doesn't
> exist). One example listed (using an Arrow Table as a source) seems a
> bit light to wrap in an ADBC driver; however, it would not take much
> code to do so and the overhead of getting the reader via ADBC is
> something like 100 microseconds (tested via the ADBC R package's
> "monkey driver" which wraps an existing stream as a statement). In any
> case, the bulk of the code is building the statistics array.
>
> > How about the following schema for the
> > statistics ArrowArray? It's based on ADBC.
>
> Whatever format for statistics is decided on, I imagine it should be
> exactly the same as the ADBC standard? (Perhaps pushing changes
> upstream if needed?).
>
> On Thu, May 23, 2024 at 3:21 AM Sutou Kouhei  wrote:
> >
> > Hi,
> >
> > > Why not simply pass the statistics ArrowArray separately in your
> > > producer API of choice
> >
> > It seems that we should use the approach because all
> > feedback said so. How about the following schema for the
> > statistics ArrowArray? It's based on ADBC.
> >
> > | Field Name   | Field Type| Comments |
> > |--|---|  |
> > | column_name  | utf8  | (1)  |
> > | statistic_key| utf8 not null | (2)  |
> > | statistic_value  | VALUE_SCHEMA not null |  |
> > | statistic_is_approximate | bool not null | (3)  |
> >
> > 1. If null, then the statistic applies to the entire table.
> >It's for "row_count".
> > 2. We'll provide pre-defined keys such as "max", "min",
> >"byte_width" and "distinct_count" but users can also use
> >application specific keys.
> > 3. If true, then the value is approximate or best-effort.
> >
> > VALUE_SCHEMA is a dense union with members:
> >
> > | Field Name | Field Type |
> > |||
> > | int64  | int64  |
> > | uint64 | uint64 |
> > | float64| float64|
> > | binary | binary |
> >
> > If a column is an int32 column, it uses int64 for
> > "max"/"min". We don't provide all types here. Users should
> > use a compatible type (int64 for a int32 column) instead.
> >
> >
> > Thanks,
> > --
> > kou
> >
> > In 
> >   "Re: [DISCUSS] Statistics through the C data interface" on Wed, 22 May
> 2024 17:04:57 +0200,
> >   Antoine Pitrou  wrote:
> >
> > >
> > > Hi Kou,
> > >
> > > I agree with Dewey that this is overstretching the capabilities of the
> > > C Data Interface. In particular, stuffing a pointer as metadata value
> > > and decreeing it immortal doesn't sound like a good design decision.
> > >
> > > Why not simply pass the statistics ArrowArray separately in your
> > > producer API of choice (Dewey mentioned ADBC but it is of course just
> > > a possible API among others)?
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > Le 22/05/2024 à 04:37, Sutou Kouhei a écrit :
> > >> Hi,
> > >> We're discussing how to provide statistics through the C
> > >> data i

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Dewey Dunnington
Thank you for the background! I understand that these statistics are
important for query planning; however, I am not sure that I follow why
we are constrained to the ArrowSchema to represent them. The examples
given seem to be going through Python... would it be easier to request
statistics at a higher level of abstraction? There would already need
to be a separate mechanism to request an ArrowArrayStream with
statistics (unless the PyCapsule `requested_schema` argument would
suffice).
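
For what it's worth, a hypothetical producer-side sketch of that last
idea (Python; the class, its get_statistics() method, and the convention
are all made up for illustration, building on pyarrow's PyCapsule
support):

import pyarrow as pa


class StatisticsAwareSource:
    # Hypothetical producer wrapping a pyarrow Table: the data stream is
    # exposed through the PyCapsule interface, while statistics are
    # requested through a separate call.
    def __init__(self, table, statistics):
        self._table = table
        self._statistics = statistics

    def __arrow_c_stream__(self, requested_schema=None):
        # Delegate to pyarrow; requested_schema is defined as a hint to
        # cast the data to a given schema, so using it to signal a request
        # for statistics would be a new convention.
        return self._table.__arrow_c_stream__(requested_schema)

    def get_statistics(self):
        # Statistics as an Arrow record batch, fetched out of band.
        return self._statistics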

> ADBC may be a bit larger to use only for transmitting
> statistics. ADBC has statistics related APIs but it has more
> other APIs.

Some examples of producers given in the linked threads (Delta Lake,
Arrow Dataset) are well-suited to being wrapped by an ADBC driver. One
can implement an ADBC driver without defining all the methods (where
the producer could call AdbcConnectionGetStatistics(), although
AdbcStatementGetStatistics() might be more relevant here and doesn't
exist). One example listed (using an Arrow Table as a source) seems a
bit light to wrap in an ADBC driver; however, it would not take much
code to do so and the overhead of getting the reader via ADBC is
something like 100 microseconds (tested via the ADBC R package's
"monkey driver" which wraps an existing stream as a statement). In any
case, the bulk of the code is building the statistics array.

> How about the following schema for the
> statistics ArrowArray? It's based on ADBC.

Whatever format for statistics is decided on, I imagine it should be
exactly the same as the ADBC standard? (Perhaps pushing changes
upstream if needed?).

On Thu, May 23, 2024 at 3:21 AM Sutou Kouhei  wrote:
>
> Hi,
>
> > Why not simply pass the statistics ArrowArray separately in your
> > producer API of choice
>
> It seems that we should use the approach because all
> feedback said so. How about the following schema for the
> statistics ArrowArray? It's based on ADBC.
>
> | Field Name   | Field Type| Comments |
> |--|---|  |
> | column_name  | utf8  | (1)  |
> | statistic_key| utf8 not null | (2)  |
> | statistic_value  | VALUE_SCHEMA not null |  |
> | statistic_is_approximate | bool not null | (3)  |
>
> 1. If null, then the statistic applies to the entire table.
>It's for "row_count".
> 2. We'll provide pre-defined keys such as "max", "min",
>"byte_width" and "distinct_count" but users can also use
>application specific keys.
> 3. If true, then the value is approximate or best-effort.
>
> VALUE_SCHEMA is a dense union with members:
>
> | Field Name | Field Type |
> |||
> | int64  | int64  |
> | uint64 | uint64 |
> | float64| float64|
> | binary | binary |
>
> If a column is an int32 column, it uses int64 for
> "max"/"min". We don't provide all types here. Users should
> use a compatible type (int64 for a int32 column) instead.
>
>
> Thanks,
> --
> kou
>
> In 
>   "Re: [DISCUSS] Statistics through the C data interface" on Wed, 22 May 2024 
> 17:04:57 +0200,
>   Antoine Pitrou  wrote:
>
> >
> > Hi Kou,
> >
> > I agree with Dewey that this is overstretching the capabilities of the
> > C Data Interface. In particular, stuffing a pointer as metadata value
> > and decreeing it immortal doesn't sound like a good design decision.
> >
> > Why not simply pass the statistics ArrowArray separately in your
> > producer API of choice (Dewey mentioned ADBC but it is of course just
> > a possible API among others)?
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 22/05/2024 à 04:37, Sutou Kouhei a écrit :
> >> Hi,
> >> We're discussing how to provide statistics through the C
> >> data interface at:
> >> https://github.com/apache/arrow/issues/38837
> >> If you're interested in this feature, could you share your
> >> comments?
> >> Motivation:
> >> We can interchange Apache Arrow data by the C data interface
> >> in the same process. For example, we can pass Apache Arrow
> >> data read by Apache Arrow C++ (provider) to DuckDB
> >> (consumer) through the C data interface.
> >> A provider may know Apache Arrow data statistics. For
> >> example, a provider can know statistics when it reads Apache
> >> Parquet data because Apache Parquet may provide statistics.
> >> But a consumer can't know statistics that are known by a
> >> producer. Because there isn't a standard way to provide

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Sutou Kouhei
Hi,

> Why not simply pass the statistics ArrowArray separately in your
> producer API of choice

It seems that we should take that approach because all the
feedback said so. How about the following schema for the
statistics ArrowArray? It's based on ADBC.

| Field Name   | Field Type| Comments |
|--|---|  |
| column_name  | utf8  | (1)  |
| statistic_key| utf8 not null | (2)  |
| statistic_value  | VALUE_SCHEMA not null |  |
| statistic_is_approximate | bool not null | (3)  |

1. If null, then the statistic applies to the entire table.
   It's for "row_count".
2. We'll provide pre-defined keys such as "max", "min",
   "byte_width" and "distinct_count" but users can also use
   application specific keys.
3. If true, then the value is approximate or best-effort.

VALUE_SCHEMA is a dense union with members:

| Field Name | Field Type |
|||
| int64  | int64  |
| uint64 | uint64 |
| float64| float64|
| binary | binary |

If a column is an int32 column, it uses int64 for
"max"/"min". We don't provide all types here. Users should
use a compatible type (int64 for a int32 column) instead.
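
For illustration only, here is a rough pyarrow sketch of this schema and
a tiny statistics batch (the construction details are an assumption for
the example, not part of the proposal):

import pyarrow as pa

# VALUE_SCHEMA: dense union of int64 / uint64 / float64 / binary.
value_type = pa.dense_union([
    pa.field("int64", pa.int64()),
    pa.field("uint64", pa.uint64()),
    pa.field("float64", pa.float64()),
    pa.field("binary", pa.binary()),
])

statistics_schema = pa.schema([
    pa.field("column_name", pa.utf8()),
    pa.field("statistic_key", pa.utf8(), nullable=False),
    pa.field("statistic_value", value_type, nullable=False),
    pa.field("statistic_is_approximate", pa.bool_(), nullable=False),
])

# Two example statistics: a table-level row count (column_name is null)
# and an exact "max" for an int32 column, widened to int64.
values = pa.UnionArray.from_dense(
    types=pa.array([0, 0], type=pa.int8()),           # both rows use the int64 member
    value_offsets=pa.array([0, 1], type=pa.int32()),
    children=[
        pa.array([29, 100], type=pa.int64()),
        pa.array([], type=pa.uint64()),
        pa.array([], type=pa.float64()),
        pa.array([], type=pa.binary()),
    ],
    field_names=["int64", "uint64", "float64", "binary"],
)

statistics = pa.RecordBatch.from_arrays(
    [
        pa.array([None, "column1"], type=pa.utf8()),
        pa.array(["row_count", "max"]),
        values,
        pa.array([False, False]),
    ],
    names=list(statistics_schema.names),
)

A consumer could then read such a batch with a schema known in advance,
independently of the data schema being transmitted.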


Thanks,
-- 
kou

In 
  "Re: [DISCUSS] Statistics through the C data interface" on Wed, 22 May 2024 
17:04:57 +0200,
  Antoine Pitrou  wrote:

> 
> Hi Kou,
> 
> I agree with Dewey that this is overstretching the capabilities of the
> C Data Interface. In particular, stuffing a pointer as metadata value
> and decreeing it immortal doesn't sound like a good design decision.
> 
> Why not simply pass the statistics ArrowArray separately in your
> producer API of choice (Dewey mentioned ADBC but it is of course just
> a possible API among others)?
> 
> Regards
> 
> Antoine.
> 
> 
> Le 22/05/2024 à 04:37, Sutou Kouhei a écrit :
>> Hi,
>> We're discussing how to provide statistics through the C
>> data interface at:
>> https://github.com/apache/arrow/issues/38837
>> If you're interested in this feature, could you share your
>> comments?
>> Motivation:
>> We can interchange Apache Arrow data by the C data interface
>> in the same process. For example, we can pass Apache Arrow
>> data read by Apache Arrow C++ (provider) to DuckDB
>> (consumer) through the C data interface.
>> A provider may know Apache Arrow data statistics. For
>> example, a provider can know statistics when it reads Apache
>> Parquet data because Apache Parquet may provide statistics.
>> But a consumer can't know statistics that are known by a
>> producer. Because there isn't a standard way to provide
>> statistics through the C data interface. If a consumer can
>> know statistics, it can process Apache Arrow data faster
>> based on statistics.
>> Proposal:
>> https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
>> How about providing statistics as a metadata in ArrowSchema?
>> We reserve "ARROW" namespace for internal Apache Arrow use:
>> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
>> 
>>> The ARROW pattern is a reserved namespace for internal
>>> Arrow use in the custom_metadata fields. For example,
>>> ARROW:extension:name.
>> So we can use "ARROW:statistics" for the metadata key.
>> We can represent statistics as a ArrowArray like ADBC does.
>> Here is an example ArrowSchema that is for a record batch
>> that has "int32 column1" and "string column2":
>> ArrowSchema {
>>.format = "+siu",
>>.metadata = {
>>  "ARROW:statistics" => ArrowArray*, /* table-level statistics such as
>>  row count */
>>},
>>.children = {
>>  ArrowSchema {
>>.name = "column1",
>>.format = "i",
>>.metadata = {
>>  "ARROW:statistics" => ArrowArray*, /* column-level statistics such 
>> as
>>  count distinct */
>>},
>>  },
>>  ArrowSchema {
>>.name = "column2",
>>.format = "u",
>>.metadata = {
>>  "ARROW:statistics" => ArrowArray*, /* column-level statistics such 
>> as
>>  count distinct */
>>},
>>  },
>>},
>> }
>> The metadata value (ArrowArray* part) of '"ARROW:statistics"
>> => ArrowArray*' is a base 10 string of the address of the
>> ArrowArray.

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Sutou Kouhei
Hi,

I agree that the proposed approach is a departure from the usual
use of ArrowSchema.

ADBC may be a bit larger to use only for transmitting
statistics. ADBC has statistics related APIs but it has more
other APIs.


>  It is also not the first time it has come up to encode
> data-dependent information in a schema (e.g., encoding scalar/record
> batch-ness), so perhaps there is a need for another type of array
> stream or descriptor struct?

This may be a candidate approach. Or we may want to add an
ArrowArrayStream::get_statistics() callback. But that breaks
the ABI...


How about only defining a schema for the statistics ArrowArray?
It wouldn't define how to obtain the statistics ArrowArray (as
the Arrow C data interface does for data), but standardizing a
schema for the statistics ArrowArray would still improve how
statistics are transmitted.


Thanks,
-- 
kou

In 
  "Re: [DISCUSS] Statistics through the C data interface" on Wed, 22 May 2024 
11:38:13 -0300,
  Dewey Dunnington  wrote:

> I am definitely in favor of adding (or adopting an existing)
> ABI-stable way to transmit statistics (the one that comes up most
> frequently for me is just the number of values that are about to show
> up in an ArrowArrayStream, since the producer often knows this and the
> consumer often would like to preallocate).
> 
> I am skeptical of using the existing C ArrowSchema ABI to do this. The
> ArrowSchema is exceptionally good at representing Arrow data types
> (which in the presence of dictionaries, nested types, and extensions,
> is difficult to do); however, using it to handle all aspects of a
> consumer request/producer response I think dilutes its ability to do
> this well.
> 
> If I'm understanding the proposal (and I may not be), the ArrowSchema
> will be used to encode data-dependent values, which means the same
> ArrowSchema is very tightly paired to a particular array stream (or
> array). This means that one could no longer (e.g.) consume an array
> stream and blindly assign each array in the stream the schema that was
> returned by get_schema(). This is not impossible to work around but it
> is a conceptual departure from the role the ArrowSchema has had in the
> past. Encoding pointers as strings in metadata is also a departure
> from what we have done previously.
> 
> It is possible to condense the boilerplate of an ADBC driver to about
> 10 lines of code [1]. Is there a reason we can't use ADBC (or an
> extension to that standard) to more precisely handle those types of
> requests/responses (and extensions to them that come up in the
> future)? It is also not the first time it has come up to encode
> data-dependent information in a schema (e.g., encoding scalar/record
> batch-ness), so perhaps there is a need for another type of array
> stream or descriptor struct?
> 
> [1] 
> https://github.com/apache/arrow-adbc/blob/a40cf88408d6cb776cedeaa4d1d0945675c156cc/c/driver/common/driver_test.cc#L56-L66
> 
> On Wed, May 22, 2024 at 8:15 AM Raphael Taylor-Davies
>  wrote:
>>
>> Hi,
>>
>> One potential challenge with encoding statistics in the schema metadata
>> is that some systems may consider this metadata as part of assessing
>> schema equivalence.
>>
>> However, I think the bigger question is what the intended use-case for
>> these statistics is? Often query engines want to collect statistics from
>> multiple containers in one go, as this allows for efficient vectorised
>> pruning across multiple files, row groups, etc... I therefore wonder if
>> the solution is simply to return separate arrays of min, max, etc...
>> potentially even grouped together into a single StructArray?
>>
>> This would have the benefit of not needing specification changes, whilst
>> being significantly more efficient than an approach centered on scalar
>> statistics. FWIW this is the approach taken by DataFusion for pruning
>> statistics [1], and in arrow-rs we represent scalars as arrays to avoid
>> needing to define a parallel serialization standard [2].
>>
>> Kind Regards,
>>
>> Raphael
>>
>> [1]:
>> https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/trait.PruningStatistics.html
>> [2]: https://github.com/apache/arrow-rs/pull/4393
>>
>> On 22/05/2024 03:37, Sutou Kouhei wrote:
>> > Hi,
>> >
>> > We're discussing how to provide statistics through the C
>> > data interface at:
>> > https://github.com/apache/arrow/issues/38837
>> >
>> > If you're interested in this feature, could you share your
>> > comments?
>> >
>> >
>> > Motivation:
>> >
>> > We can interchange Apache Arrow data by the C data interface
>> > in the same process. For

Re: [DISCUSS] Statistics through the C data interface

2024-05-22 Thread Sutou Kouhei
Hi,

> One potential challenge with encoding statistics in the schema
> metadata is that some systems may consider this metadata as part of
> assessing schema equivalence.

It's a good point. I didn't notice it. The proposed approach
makes schemas different because they contain the addresses of
the statistics ArrowArrays, not the contents of those
ArrowArrays.

> However, I think the bigger question is what the intended use-case for
> these statistics is?

Ah, sorry. I should have also explained it in the initial
e-mail. The intended use-case is for query planning. See the
DuckDB use case in the issue:

https://github.com/apache/arrow/issues/38837#issuecomment-2031914284

>  I
> therefore wonder if the solution is simply to return separate arrays
> of min, max, etc... potentially even grouped together into a single
> StructArray?

Yes. It's a candidate approach. I also considered a similar
approach based on ADBC. It's a bit more abstract than the
separate arrays/grouped StructArray because it can represent
not only pre-defined statistics (min, max, etc.) but also
application-specific statistics:
https://github.com/apache/arrow/issues/38837#issuecomment-2074371230

But we can't provide a cross-language API with this
approach. Apache Arrow C++ will provide
arrow::RecordBatch::statistics() or something but Apache
Arrow Rust will use a different API. So I proposed the C data
interface based approach.

We would define only a schema for the statistics record batch
with the ADBC-like approach or the grouped StructArray
approach. Is that sufficient for sharing statistics widely?


>FWIW this is the approach taken by DataFusion for
> pruning statistics [1],

Thanks for sharing this. It seems that DataFusion supports
only pre-defined statistics. Does DataFusion ever require
application-specific statistics? I'm not sure whether we
should support application-specific statistics or not.


Thanks,
-- 
kou


In 
  "Re: [DISCUSS] Statistics through the C data interface" on Wed, 22 May 2024 
12:14:40 +0100,
  Raphael Taylor-Davies  wrote:

> Hi,
> 
> One potential challenge with encoding statistics in the schema
> metadata is that some systems may consider this metadata as part of
> assessing schema equivalence.
> 
> However, I think the bigger question is what the intended use-case for
> these statistics is? Often query engines want to collect statistics
> from multiple containers in one go, as this allows for efficient
> vectorised pruning across multiple files, row groups, etc... I
> therefore wonder if the solution is simply to return separate arrays
> of min, max, etc... potentially even grouped together into a single
> StructArray?
> 
> This would have the benefit of not needing specification changes,
> whilst being significantly more efficient than an approach centered on
> scalar statistics. FWIW this is the approach taken by DataFusion for
> pruning statistics [1], and in arrow-rs we represent scalars as arrays
> to avoid needing to define a parallel serialization standard [2].
> 
> Kind Regards,
> 
> Raphael
> 
> [1]:
> https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/trait.PruningStatistics.html
> [2]: https://github.com/apache/arrow-rs/pull/4393
> 
> On 22/05/2024 03:37, Sutou Kouhei wrote:
>> Hi,
>>
>> We're discussing how to provide statistics through the C
>> data interface at:
>> https://github.com/apache/arrow/issues/38837
>>
>> If you're interested in this feature, could you share your
>> comments?
>>
>>
>> Motivation:
>>
>> We can interchange Apache Arrow data by the C data interface
>> in the same process. For example, we can pass Apache Arrow
>> data read by Apache Arrow C++ (provider) to DuckDB
>> (consumer) through the C data interface.
>>
>> A provider may know Apache Arrow data statistics. For
>> example, a provider can know statistics when it reads Apache
>> Parquet data because Apache Parquet may provide statistics.
>>
>> But a consumer can't know statistics that are known by a
>> producer. Because there isn't a standard way to provide
>> statistics through the C data interface. If a consumer can
>> know statistics, it can process Apache Arrow data faster
>> based on statistics.
>>
>>
>> Proposal:
>>
>> https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
>>
>> How about providing statistics as a metadata in ArrowSchema?
>>
>> We reserve "ARROW" namespace for internal Apache Arrow use:
>>
>> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
>>
>>> The ARROW pattern is a reserved namespace

Re: [DISCUSS] Statistics through the C data interface

2024-05-22 Thread Antoine Pitrou



Hi Kou,

I agree with Dewey that this is overstretching the capabilities of the C 
Data Interface. In particular, stuffing a pointer as metadata value and 
decreeing it immortal doesn't sound like a good design decision.


Why not simply pass the statistics ArrowArray separately in your 
producer API of choice (Dewey mentioned ADBC but it is of course just a 
possible API among others)?


Regards

Antoine.


Le 22/05/2024 à 04:37, Sutou Kouhei a écrit :

Hi,

We're discussing how to provide statistics through the C
data interface at:
https://github.com/apache/arrow/issues/38837

If you're interested in this feature, could you share your
comments?


Motivation:

We can interchange Apache Arrow data by the C data interface
in the same process. For example, we can pass Apache Arrow
data read by Apache Arrow C++ (provider) to DuckDB
(consumer) through the C data interface.

A provider may know Apache Arrow data statistics. For
example, a provider can know statistics when it reads Apache
Parquet data because Apache Parquet may provide statistics.

But a consumer can't know statistics that are known by a
producer. Because there isn't a standard way to provide
statistics through the C data interface. If a consumer can
know statistics, it can process Apache Arrow data faster
based on statistics.


Proposal:

https://github.com/apache/arrow/issues/38837#issuecomment-2123728784

How about providing statistics as a metadata in ArrowSchema?

We reserve "ARROW" namespace for internal Apache Arrow use:

https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata


The ARROW pattern is a reserved namespace for internal
Arrow use in the custom_metadata fields. For example,
ARROW:extension:name.


So we can use "ARROW:statistics" for the metadata key.

We can represent statistics as a ArrowArray like ADBC does.

Here is an example ArrowSchema that is for a record batch
that has "int32 column1" and "string column2":

ArrowSchema {
   .format = "+siu",
   .metadata = {
 "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row 
count */
   },
   .children = {
 ArrowSchema {
   .name = "column1",
   .format = "i",
   .metadata = {
 "ARROW:statistics" => ArrowArray*, /* column-level statistics such as 
count distinct */
   },
 },
 ArrowSchema {
   .name = "column2",
   .format = "u",
   .metadata = {
 "ARROW:statistics" => ArrowArray*, /* column-level statistics such as 
count distinct */
   },
 },
   },
}

The metadata value (ArrowArray* part) of '"ARROW:statistics"
=> ArrowArray*' is a base 10 string of the address of the
ArrowArray. Because we can use only string for metadata
value. You can't release the statistics ArrowArray*. (Its
release is a no-op function.) It follows
https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
semantics. (The base ArrowSchema owns statistics
ArrowArray*.)


ArrowArray* for statistics use the following schema:

| Field Name | Field Type   | Comments |
||--|  |
| key| string not null  | (1)  |
| value  | `VALUE_SCHEMA` not null  |  |
| is_approximate | bool not null| (2)  |

1. We'll provide pre-defined keys such as "max", "min",
"byte_width" and "distinct_count" but users can also use
application specific keys.

2. If true, then the value is approximate or best-effort.

VALUE_SCHEMA is a dense union with members:

| Field Name | Field Type   | Comments |
||--|  |
| int64  | int64|  |
| uint64 | uint64   |  |
| float64| float64  |  |
| value  | The same type of the ArrowSchema | (3)  |
|| that is belonged to. |  |

3. If the ArrowSchema's type is string, this type is also string.

TODO: Is "value" good name? If we refer it from the
top-level statistics schema, we need to use
"value.value". It's a bit strange...


What do you think about this proposal? Could you share your
comments?


Thanks,


Re: [DISCUSS] Statistics through the C data interface

2024-05-22 Thread Dewey Dunnington
I am definitely in favor of adding (or adopting an existing)
ABI-stable way to transmit statistics (the one that comes up most
frequently for me is just the number of values that are about to show
up in an ArrowArrayStream, since the producer often knows this and the
consumer often would like to preallocate).

I am skeptical of using the existing C ArrowSchema ABI to do this. The
ArrowSchema is exceptionally good at representing Arrow data types
(which in the presence of dictionaries, nested types, and extensions,
is difficult to do); however, using it to handle all aspects of a
consumer request/producer response I think dilutes its ability to do
this well.

If I'm understanding the proposal (and I may not be), the ArrowSchema
will be used to encode data-dependent values, which means the same
ArrowSchema is very tightly paired to a particular array stream (or
array). This means that one could no longer (e.g.) consume an array
stream and blindly assign each array in the stream the schema that was
returned by get_schema(). This is not impossible to work around but it
is a conceptual departure from the role the ArrowSchema has had in the
past. Encoding pointers as strings in metadata is also a departure
from what we have done previously.

It is possible to condense the boilerplate of an ADBC driver to about
10 lines of code [1]. Is there a reason we can't use ADBC (or an
extension to that standard) to more precisely handle those types of
requests/responses (and extensions to them that come up in the
future)? It is also not the first time it has come up to encode
data-dependent information in a schema (e.g., encoding scalar/record
batch-ness), so perhaps there is a need for another type of array
stream or descriptor struct?

[1] 
https://github.com/apache/arrow-adbc/blob/a40cf88408d6cb776cedeaa4d1d0945675c156cc/c/driver/common/driver_test.cc#L56-L66

On Wed, May 22, 2024 at 8:15 AM Raphael Taylor-Davies
 wrote:
>
> Hi,
>
> One potential challenge with encoding statistics in the schema metadata
> is that some systems may consider this metadata as part of assessing
> schema equivalence.
>
> However, I think the bigger question is what the intended use-case for
> these statistics is? Often query engines want to collect statistics from
> multiple containers in one go, as this allows for efficient vectorised
> pruning across multiple files, row groups, etc... I therefore wonder if
> the solution is simply to return separate arrays of min, max, etc...
> potentially even grouped together into a single StructArray?
>
> This would have the benefit of not needing specification changes, whilst
> being significantly more efficient than an approach centered on scalar
> statistics. FWIW this is the approach taken by DataFusion for pruning
> statistics [1], and in arrow-rs we represent scalars as arrays to avoid
> needing to define a parallel serialization standard [2].
>
> Kind Regards,
>
> Raphael
>
> [1]:
> https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/trait.PruningStatistics.html
> [2]: https://github.com/apache/arrow-rs/pull/4393
>
> On 22/05/2024 03:37, Sutou Kouhei wrote:
> > Hi,
> >
> > We're discussing how to provide statistics through the C
> > data interface at:
> > https://github.com/apache/arrow/issues/38837
> >
> > If you're interested in this feature, could you share your
> > comments?
> >
> >
> > Motivation:
> >
> > We can interchange Apache Arrow data by the C data interface
> > in the same process. For example, we can pass Apache Arrow
> > data read by Apache Arrow C++ (provider) to DuckDB
> > (consumer) through the C data interface.
> >
> > A provider may know Apache Arrow data statistics. For
> > example, a provider can know statistics when it reads Apache
> > Parquet data because Apache Parquet may provide statistics.
> >
> > But a consumer can't know statistics that are known by a
> > producer. Because there isn't a standard way to provide
> > statistics through the C data interface. If a consumer can
> > know statistics, it can process Apache Arrow data faster
> > based on statistics.
> >
> >
> > Proposal:
> >
> > https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
> >
> > How about providing statistics as a metadata in ArrowSchema?
> >
> > We reserve "ARROW" namespace for internal Apache Arrow use:
> >
> > https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
> >
> >> The ARROW pattern is a reserved namespace for internal
> >> Arrow use in the custom_metadata fields. For example,
> >> ARROW:extension:name.
> > So we can use "ARROW:statistics" for the metadata key.
> >
> > We can represent statistics as a ArrowArray like ADBC does.
> >
> > Here is an example ArrowSchema that is for a record batch
> > that has "int32 column1" and "string column2":
> >
> > ArrowSchema {
> >.format = "+siu",
> >.metadata = {
> >  "ARROW:statistics" => ArrowArray*, /* table-level statistics such as 
> > row count */
> >},
> >.children 

Re: [DISCUSS] Statistics through the C data interface

2024-05-22 Thread Raphael Taylor-Davies

Hi,

One potential challenge with encoding statistics in the schema metadata 
is that some systems may consider this metadata as part of assessing 
schema equivalence.


However, I think the bigger question is what the intended use-case for 
these statistics is? Often query engines want to collect statistics from 
multiple containers in one go, as this allows for efficient vectorised 
pruning across multiple files, row groups, etc... I therefore wonder if 
the solution is simply to return separate arrays of min, max, etc... 
potentially even grouped together into a single StructArray?


This would have the benefit of not needing specification changes, whilst 
being significantly more efficient than an approach centered on scalar 
statistics. FWIW this is the approach taken by DataFusion for pruning 
statistics [1], and in arrow-rs we represent scalars as arrays to avoid 
needing to define a parallel serialization standard [2].


Kind Regards,

Raphael

[1]: 
https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/trait.PruningStatistics.html

[2]: https://github.com/apache/arrow-rs/pull/4393
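
To illustrate that shape, a small pyarrow sketch (the field layout here
is just an example, not something specified by Arrow or DataFusion):

import pyarrow as pa

# One entry per container (e.g. file or row group) for a column "x".
mins = pa.array([0, 10, 25], type=pa.int64())
maxs = pa.array([9, 24, 42], type=pa.int64())
row_counts = pa.array([100, 80, 57], type=pa.int64())

x_stats = pa.StructArray.from_arrays(
    [mins, maxs, row_counts],
    names=["min", "max", "row_count"],
)

# A consumer can prune containers whose [min, max] range cannot satisfy
# a predicate such as x = 30; only the third container survives here.
print(x_stats)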

On 22/05/2024 03:37, Sutou Kouhei wrote:

Hi,

We're discussing how to provide statistics through the C
data interface at:
https://github.com/apache/arrow/issues/38837

If you're interested in this feature, could you share your
comments?


Motivation:

We can interchange Apache Arrow data by the C data interface
in the same process. For example, we can pass Apache Arrow
data read by Apache Arrow C++ (provider) to DuckDB
(consumer) through the C data interface.

A provider may know Apache Arrow data statistics. For
example, a provider can know statistics when it reads Apache
Parquet data because Apache Parquet may provide statistics.

But a consumer can't know statistics that are known by a
producer. Because there isn't a standard way to provide
statistics through the C data interface. If a consumer can
know statistics, it can process Apache Arrow data faster
based on statistics.


Proposal:

https://github.com/apache/arrow/issues/38837#issuecomment-2123728784

How about providing statistics as a metadata in ArrowSchema?

We reserve "ARROW" namespace for internal Apache Arrow use:

https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata


The ARROW pattern is a reserved namespace for internal
Arrow use in the custom_metadata fields. For example,
ARROW:extension:name.

So we can use "ARROW:statistics" for the metadata key.

We can represent statistics as a ArrowArray like ADBC does.

Here is an example ArrowSchema that is for a record batch
that has "int32 column1" and "string column2":

ArrowSchema {
   .format = "+siu",
   .metadata = {
 "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row 
count */
   },
   .children = {
 ArrowSchema {
   .name = "column1",
   .format = "i",
   .metadata = {
 "ARROW:statistics" => ArrowArray*, /* column-level statistics such as 
count distinct */
   },
 },
 ArrowSchema {
   .name = "column2",
   .format = "u",
   .metadata = {
 "ARROW:statistics" => ArrowArray*, /* column-level statistics such as 
count distinct */
   },
 },
   },
}

The metadata value (ArrowArray* part) of '"ARROW:statistics"
=> ArrowArray*' is a base 10 string of the address of the
ArrowArray. Because we can use only string for metadata
value. You can't release the statistics ArrowArray*. (Its
release is a no-op function.) It follows
https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
semantics. (The base ArrowSchema owns statistics
ArrowArray*.)


ArrowArray* for statistics use the following schema:

| Field Name | Field Type   | Comments |
||--|  |
| key| string not null  | (1)  |
| value  | `VALUE_SCHEMA` not null  |  |
| is_approximate | bool not null| (2)  |

1. We'll provide pre-defined keys such as "max", "min",
"byte_width" and "distinct_count" but users can also use
application specific keys.

2. If true, then the value is approximate or best-effort.

VALUE_SCHEMA is a dense union with members:

| Field Name | Field Type   | Comments |
||--|  |
| int64  | int64|  |
| uint64 | uint64   |  |
| float64| float64  |  |
| value  | The same type of the ArrowSchema | (3)  |
|| that is belonged to. |  |

3. If the ArrowSchema's type is string, this type is also string.

TODO: Is "value" good name? If we refer it from the
top-level statistics schema, we need to use
"value.value". It's a bit strange...


What do you think about this proposal? Could you share your

[DISCUSS] Statistics through the C data interface

2024-05-21 Thread Sutou Kouhei
Hi,

We're discussing how to provide statistics through the C
data interface at:
https://github.com/apache/arrow/issues/38837

If you're interested in this feature, could you share your
comments?


Motivation:

We can interchange Apache Arrow data by the C data interface
in the same process. For example, we can pass Apache Arrow
data read by Apache Arrow C++ (provider) to DuckDB
(consumer) through the C data interface.

A provider may know Apache Arrow data statistics. For
example, a provider can know statistics when it reads Apache
Parquet data because Apache Parquet may provide statistics.

But a consumer can't know statistics that are known by a
producer. Because there isn't a standard way to provide
statistics through the C data interface. If a consumer can
know statistics, it can process Apache Arrow data faster
based on statistics.


Proposal:

https://github.com/apache/arrow/issues/38837#issuecomment-2123728784

How about providing statistics as metadata in ArrowSchema?

We reserve "ARROW" namespace for internal Apache Arrow use:

https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata

> The ARROW pattern is a reserved namespace for internal
> Arrow use in the custom_metadata fields. For example,
> ARROW:extension:name.

So we can use "ARROW:statistics" for the metadata key.

We can represent statistics as an ArrowArray like ADBC does.

Here is an example ArrowSchema that is for a record batch
that has "int32 column1" and "string column2":

ArrowSchema {
  .format = "+siu",
  .metadata = {
"ARROW:statistics" => ArrowArray*, /* table-level statistics such as row 
count */
  },
  .children = {
ArrowSchema {
  .name = "column1",
  .format = "i",
  .metadata = {
"ARROW:statistics" => ArrowArray*, /* column-level statistics such as 
count distinct */
  },
},
ArrowSchema {
  .name = "column2",
  .format = "u",
  .metadata = {
"ARROW:statistics" => ArrowArray*, /* column-level statistics such as 
count distinct */
  },
},
  },
}

The metadata value (the ArrowArray* part) of '"ARROW:statistics"
=> ArrowArray*' is a base-10 string of the address of the
ArrowArray, because only strings can be used as metadata
values. You can't release the statistics ArrowArray*. (Its
release is a no-op function.) It follows
https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
semantics. (The base ArrowSchema owns statistics
ArrowArray*.)


The ArrowArray* for statistics uses the following schema:

| Field Name | Field Type   | Comments |
||--|  |
| key| string not null  | (1)  |
| value  | `VALUE_SCHEMA` not null  |  |
| is_approximate | bool not null| (2)  |

1. We'll provide pre-defined keys such as "max", "min",
   "byte_width" and "distinct_count" but users can also use
   application specific keys.

2. If true, then the value is approximate or best-effort.

VALUE_SCHEMA is a dense union with members:

| Field Name | Field Type   | Comments |
||--|  |
| int64  | int64|  |
| uint64 | uint64   |  |
| float64| float64  |  |
| value  | The same type of the ArrowSchema | (3)  |
|| that is belonged to. |  |

3. If the ArrowSchema's type is string, this type is also string.

   TODO: Is "value" a good name? If we refer to it from the
   top-level statistics schema, we need to use
   "value.value". It's a bit strange...


What do you think about this proposal? Could you share your
comments?


Thanks,
-- 
kou