Hmm, I struggle to understand why a `(int32, utf8)` tuple for statistic keys would be any simpler to implement than either `int32` *or* `utf8` *or* `dictionary(int32, utf8)`.

Let's keep in mind that we would like to keep things simple for consumers and producers of statistics.

We should also ask ourselves what the typical cardinality would be.

* Cardinality of columns can be very high (thousands or more?);
* Cardinality of statistic kinds/keys is bound to remain low (how many different kinds of statistics can we reasonably devise that would be useful to transmit over Arrow?).

Regards

Antoine.


On 01/07/2024 at 16:58, Felipe Oliveira Carvalho wrote:
Hi,

You can promise that well-known int32 statistic keys won't ever be higher
than a certain value (2^18) [1], like TCP/IP ports (well-known ports are in
[0, 2^10)), but for non-standard statistics from open-source products,
key=0 combined with a string label is the way to go; otherwise collisions
would be inevitable and everyone would hate us for having integer keys.

This is not a very weird proposal on my part, because integer keys
representing labels are common in most low-level standardized C APIs (e.g.
Linux syscalls, ioctls, OpenGL, Vulkan...). I expect higher-level APIs to
map all these keys to strings, but with integer keys we keep the "C Data
Interface" low-level and portable, as it should be.
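As a rough sketch of this idea (the function name, the reserved key value
0, and the 2^18 bound are illustrative assumptions, not part of any spec),
a (key, label) pair could be validated like this:

```python
# Hypothetical sketch: int32 statistic keys with a well-known
# range, plus key 0 reserved for non-standard statistics that
# are identified by a string label instead.

WELL_KNOWN_MAX = 2**18  # assumed upper bound for well-known keys
NON_STANDARD_KEY = 0    # sentinel key: the label carries the identifier

def validate_statistic_key(key, label=None):
    """Return True if (key, label) forms a valid statistic key pair."""
    if key == NON_STANDARD_KEY:
        # Non-standard statistics must carry a non-empty string
        # label, e.g. "vendor1.row_group_size".
        return isinstance(label, str) and len(label) > 0
    # Well-known keys live in [1, 2^18) and need no label.
    return 0 < key < WELL_KNOWN_MAX and label is None
```

With this shape, collisions between vendors are impossible by
construction: vendors never mint integer keys at all, only labels.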

--
Felipe

[1] 2^16 is too small. For instance, OpenGL constants can't be enums
because C limits enum to 2^16 and that is *not enough*.

On Thu, Jun 20, 2024 at 7:43 AM Sutou Kouhei <k...@clear-code.com> wrote:

Hi,

Here is an updated summary so far:

----
Use cases:

* Optimize query plan: e.g. JOIN for DuckDB

Out of scope:

* Transmit statistics through something other than the C data interface
   Examples:
   * Transmit statistics through Apache Arrow IPC file
   * Transmit statistics through Apache Arrow Flight
* Multi-column statistics
* Constraints information
* Indexes information

Discussing approach:

Standardize Apache Arrow schema for statistics and transmit
statistics via separated API call that uses the C data
interface.

This also works for per-batch statistics.

Candidate schema:

map<
   // The column index, or null if the statistics refer to the
   // whole table or batch.
   column: int32,
   // Statistics key is int32.
   // Different keys are assigned for exact value and
   // approximate value.
   map<int32, dense_union<...needed types based on stat kinds in the keys...>>
>


Discussions:

1. Can we use int32 for statistic keys?
    Should we use utf8 (or dictionary<int32, utf8>) for
    statistic keys?
2. How to support non-standard (vendor-specific)
    statistic keys?
----
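To make the candidate schema above concrete, here is a plain-Python sketch
(dicts stand in for the Arrow map type, and the key constants are purely
illustrative) of statistics keyed by column index, with None meaning the
statistic applies to the whole table or batch:

```python
# Hypothetical statistic key constants (illustrative only).
STAT_ROW_COUNT_EXACT = 1
STAT_DISTINCT_COUNT_APPROX = 2

# map<column: int32 (nullable), map<int32, value>>:
# a None column index means the statistic applies to the
# whole table or batch rather than to one column.
statistics = {
    None: {STAT_ROW_COUNT_EXACT: 1_000_000},
    0: {STAT_DISTINCT_COUNT_APPROX: 42_000},
}

def lookup(stats, column, key):
    """Return the statistic value, or None if it was not provided."""
    return stats.get(column, {}).get(key)
```

A consumer that doesn't recognize a key can simply skip it, which is what
keeps this representation cheap for both sides.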

Here is my idea:

1. We can use int32 for statistic keys.
2. We can reserve a specific range for non-standard
    statistic keys. Prerequisites of this:
    * There is no use case to merge some statistics for
      the same data.
    * We can't merge statistics for different data.

If the prerequisites aren't satisfied:

1. We should use utf8 (or dictionary<int32, utf8>) for
    statistic keys.
2. We can use reserved prefix such as "ARROW:"/"arrow." for
    standard statistic keys or use prefix such as
    "vendor1:"/"vendor1." for non-standard statistic keys.

Here is Felipe's idea:
https://lists.apache.org/thread/gr2nmlrwr7d5wkz3zgq6vy5q0ow8xof2

1. We can use int32 for statistic keys.
2. We can use the special statistic key + a string identifier
    for non-standard statistic keys.


What do you think about this?


Thanks,
--
kou

In <20240606.182727.1004633558059795207....@clear-code.com>
   "Re: [DISCUSS] Statistics through the C data interface" on Thu, 06 Jun 2024 18:27:27 +0900 (JST),
   Sutou Kouhei <k...@clear-code.com> wrote:
Hi,

Thanks for sharing your comments. Here is a summary so far:

----

Use cases:

* Optimize query plan: e.g. JOIN for DuckDB

Out of scope:

* Transmit statistics through something other than the C data interface
   Examples:
   * Transmit statistics through Apache Arrow IPC file
   * Transmit statistics through Apache Arrow Flight

Candidate approaches:

1. Pass statistics (encoded as an Apache Arrow data) via
    ArrowSchema metadata
    * This embeds statistics address into metadata
    * It's for avoiding using Apache Arrow IPC format with
      the C data interface
2. Embed statistics (encoded as an Apache Arrow data) into
    ArrowSchema metadata
    * This adds statistics to metadata in Apache Arrow IPC
      format
3. Embed statistics (encoded as JSON) into ArrowArray
    metadata
4. Standardize Apache Arrow schema for statistics and
    transmit statistics via separated API call that uses the
    C data interface
5. Use ADBC

----

I think that 4. is the best approach in these candidates.

1. Embedding the statistics address is tricky.
2. Consumers need to parse Apache Arrow IPC format data.
    (C data interface consumers may not have that feature.)
3. This will work, but 4. is more generic.
5. ADBC is too large to use only for statistics.

What do you think about this?


If we select 4., we need to standardize Apache Arrow schema
for statistics. How about the following schema?

----
Metadata:

| Name                       | Value | Comments |
|----------------------------|-------|--------- |
| ARROW:statistics:version | 1.0.0 | (1)      |

(1) This follows semantic versioning.

Fields:

| Name           | Type                  | Comments |
|----------------|-----------------------| -------- |
| column         | utf8                  | (2)      |
| key            | utf8 not null         | (3)      |
| value          | VALUE_SCHEMA not null |          |
| is_approximate | bool not null         | (4)      |

(2) If null, then the statistic applies to the entire table.
     This is used for table-level statistics such as "row_count".
(3) We'll provide pre-defined keys such as "max", "min",
     "byte_width" and "distinct_count" but users can also use
     application specific keys.
(4) If true, then the value is approximate or best-effort.

VALUE_SCHEMA is a dense union with members:

| Name    | Type    |
|---------|---------|
| int64   | int64   |
| uint64  | uint64  |
| float64 | float64 |
| binary  | binary  |

If a column is an int32 column, it uses int64 for
"max"/"min". We don't provide all types here. Users should
use a compatible type (e.g. int64 for an int32 column) instead.
----
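To illustrate the "compatible type" rule above, here is a sketch of how a
producer might pick a VALUE_SCHEMA union member for a column's "max"/"min"
values (the type names are Arrow format names; the mapping itself is an
assumption about how promotion could work, not part of the proposal):

```python
# Promote a column's storage type to one of the four
# VALUE_SCHEMA union members (int64, uint64, float64, binary).
PROMOTION = {
    # signed integers promote to int64
    "int8": "int64", "int16": "int64", "int32": "int64", "int64": "int64",
    # unsigned integers promote to uint64
    "uint8": "uint64", "uint16": "uint64",
    "uint32": "uint64", "uint64": "uint64",
    # floating point promotes to float64
    "float16": "float64", "float32": "float64", "float64": "float64",
    # variable-length data is carried as binary
    "utf8": "binary", "binary": "binary",
}

def union_member_for(column_type):
    """Return the VALUE_SCHEMA member used for this column type."""
    return PROMOTION[column_type]
```

Keeping the union to four members is what makes the schema stable: new
column types only require a new entry in a promotion table like this one,
not a change to the statistics schema itself.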


Thanks,
--
kou


In <20240522.113708.2023905028549001143....@clear-code.com>
   "[DISCUSS] Statistics through the C data interface" on Wed, 22 May 2024 11:37:08 +0900 (JST),
   Sutou Kouhei <k...@clear-code.com> wrote:

Hi,

We're discussing how to provide statistics through the C
data interface at:
https://github.com/apache/arrow/issues/38837

If you're interested in this feature, could you share your
comments?


Motivation:

We can interchange Apache Arrow data through the C data
interface within the same process. For example, we can pass
Apache Arrow data read by Apache Arrow C++ (provider) to
DuckDB (consumer) through the C data interface.

A provider may know statistics about the Apache Arrow data.
For example, a provider can know statistics when it reads
Apache Parquet data, because Apache Parquet may provide
statistics.

But a consumer can't know the statistics known by the
provider, because there isn't a standard way to provide
statistics through the C data interface. If a consumer knew
the statistics, it could process Apache Arrow data faster
based on them.


Proposal:

https://github.com/apache/arrow/issues/38837#issuecomment-2123728784

How about providing statistics as metadata in ArrowSchema?

We reserve "ARROW" namespace for internal Apache Arrow use:


https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata

The ARROW pattern is a reserved namespace for internal
Arrow use in the custom_metadata fields. For example,
ARROW:extension:name.

So we can use "ARROW:statistics" for the metadata key.

We can represent statistics as an ArrowArray like ADBC does.

Here is an example ArrowSchema that is for a record batch
that has "int32 column1" and "string column2":

ArrowSchema {
   .format = "+siu",
   .metadata = {
     "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row count */
   },
   .children = {
     ArrowSchema {
       .name = "column1",
       .format = "i",
       .metadata = {
         "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
       },
     },
     ArrowSchema {
       .name = "column2",
       .format = "u",
       .metadata = {
         "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
       },
     },
   },
}

The metadata value (the ArrowArray* part) of '"ARROW:statistics"
=> ArrowArray*' is a base-10 string of the address of the
ArrowArray, because only strings can be used as metadata
values. You can't release the statistics ArrowArray*. (Its
release is a no-op function.) It follows the

https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation

semantics. (The base ArrowSchema owns the statistics
ArrowArray*.)
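As a sketch of the address-as-string round trip (Python's ctypes is used
purely for illustration here; FakeArrowArray is a stand-in for the real
ArrowArray struct, and a real provider would do this in C):

```python
import ctypes

# Stand-in for the exported ArrowArray structure (illustrative;
# the real struct is defined by the C data interface).
class FakeArrowArray(ctypes.Structure):
    _fields_ = [("length", ctypes.c_int64)]

def encode_statistics_address(array):
    """Encode the array's address as a base-10 metadata string."""
    return str(ctypes.addressof(array))

def decode_statistics_address(value):
    """Recover the ArrowArray struct from the metadata string."""
    return FakeArrowArray.from_address(int(value))

# The provider keeps the statistics array alive for as long as
# the base ArrowSchema lives, since the schema owns it.
stats = FakeArrowArray(length=3)
metadata_value = encode_statistics_address(stats)
recovered = decode_statistics_address(metadata_value)
```

Note that nothing in the string tells the consumer how long the address
stays valid; that lifetime contract comes entirely from the member
allocation semantics linked above.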


The ArrowArray* for statistics uses the following schema:

| Field Name     | Field Type                       | Comments |
|----------------|----------------------------------| -------- |
| key            | string not null                  | (1)      |
| value          | `VALUE_SCHEMA` not null          |          |
| is_approximate | bool not null                    | (2)      |

1. We'll provide pre-defined keys such as "max", "min",
    "byte_width" and "distinct_count" but users can also use
    application specific keys.

2. If true, then the value is approximate or best-effort.

VALUE_SCHEMA is a dense union with members:

| Field Name | Field Type                                     | Comments |
|------------|------------------------------------------------|----------|
| int64      | int64                                          |          |
| uint64     | uint64                                         |          |
| float64    | float64                                        |          |
| value      | The same type as the ArrowSchema it belongs to | (3)      |

3. If the ArrowSchema's type is string, this type is also string.

    TODO: Is "value" a good name? If we refer to it from the
    top-level statistics schema, we need to use
    "value.value". That's a bit strange...


What do you think about this proposal? Could you share your
comments?


Thanks,
--
kou

