Hmmm, I struggle to understand why a `(int32, utf8)` tuple for statistic
keys would be any simpler to implement than either `int32` *or* `utf8`
*or* `dictionary(int32, utf8)`.
Let's keep in mind that we would like to keep things simple for
consumers and producers of statistics.
We should also ask ourselves what the typical cardinality would be.
* Cardinality of columns can be very high (thousands or more?);
* Cardinality of statistic kinds/keys is bound to remain low (how many
different kinds of statistics can we reasonably devise that would be
useful to transmit over Arrow?).
Regards
Antoine.
On 01/07/2024 at 16:58, Felipe Oliveira Carvalho wrote:
Hi,
You can promise that well-known int32 statistic keys won't ever be higher
than a certain value (2^18) [1], like TCP/IP ports (well-known ports are in
[0, 2^10)), but for non-standard statistics from open-source products the
key=0 combined with a string label is the way to go; otherwise collisions
would be inevitable and everyone would hate us for having integer keys.
This is not a very weird proposal on my part, because integer keys
representing labels are common in most low-level standardized C APIs (e.g.
Linux syscalls, ioctls, OpenGL, Vulkan...). I expect higher-level APIs to
map all these keys to strings, but with them we keep the "C Data Interface"
low-level and portable, as it should be.
--
Felipe
[1] 2^16 is too small. For instance, OpenGL constants can't be enums
because C limits enums to 2^16 and that is *not enough*.
On Thu, Jun 20, 2024 at 7:43 AM Sutou Kouhei <k...@clear-code.com> wrote:
Hi,
Here is an updated summary so far:
----
Use cases:
* Optimize query plan: e.g. JOIN for DuckDB
Out of scope:
* Transmit statistics through anything other than the C data interface
Examples:
* Transmit statistics through Apache Arrow IPC file
* Transmit statistics through Apache Arrow Flight
* Multi-column statistics
* Constraints information
* Indexes information
Discussing approach:
Standardize an Apache Arrow schema for statistics and transmit
statistics via a separate API call that uses the C data
interface.
This also works for per-batch statistics.
Candidate schema:
map<
  // The column index, or null if the statistics refer to the whole
  // table or batch.
  column: int32,
  // The statistic key is int32.
  // Different keys are assigned for exact values and
  // approximate values.
  map<int32, dense_union<...needed types based on stat kinds in the keys...>>
>
Discussions:
1. Can we use int32 for statistic keys?
Should we use utf8 (or dictionary<int32, utf8>) for
statistic keys?
2. How to support non-standard (vendor-specific)
statistic keys?
----
Here is my idea:
1. We can use int32 for statistic keys.
2. We can reserve a specific range for non-standard
statistic keys. Prerequisites for this:
* There is no use case for merging statistics for
the same data.
* We can't merge statistics for different data.
If the prerequisites aren't satisfied:
1. We should use utf8 (or dictionary<int32, utf8>) for
statistic keys.
2. We can use a reserved prefix such as "ARROW:"/"arrow." for
standard statistic keys, or a prefix such as
"vendor1:"/"vendor1." for non-standard statistic keys.
Here is Felipe's idea:
https://lists.apache.org/thread/gr2nmlrwr7d5wkz3zgq6vy5q0ow8xof2
1. We can use int32 for statistic keys.
2. We can use the special statistic key + a string identifier
for non-standard statistic keys.
What do you think about this?
Thanks,
--
kou
In <20240606.182727.1004633558059795207....@clear-code.com>
"Re: [DISCUSS] Statistics through the C data interface" on Thu, 06 Jun
2024 18:27:27 +0900 (JST),
Sutou Kouhei <k...@clear-code.com> wrote:
Hi,
Thanks for sharing your comments. Here is a summary so far:
----
Use cases:
* Optimize query plan: e.g. JOIN for DuckDB
Out of scope:
* Transmit statistics through anything other than the C data interface
Examples:
* Transmit statistics through Apache Arrow IPC file
* Transmit statistics through Apache Arrow Flight
Candidate approaches:
1. Pass statistics (encoded as Apache Arrow data) via
ArrowSchema metadata
* This embeds the statistics address into the metadata
* It avoids using the Apache Arrow IPC format with
the C data interface
2. Embed statistics (encoded as Apache Arrow data) into
ArrowSchema metadata
* This adds statistics to the metadata in the Apache
Arrow IPC format
3. Embed statistics (encoded as JSON) into ArrowArray
metadata
4. Standardize an Apache Arrow schema for statistics and
transmit statistics via a separate API call that uses
the C data interface
5. Use ADBC
5. Use ADBC
----
I think that 4. is the best of these candidates:
1. Embedding the statistics address is tricky.
2. Consumers need to parse Apache Arrow IPC format data.
(C data interface consumers may not have that feature.)
3. This would work, but 4. is more generic.
5. ADBC is too large to adopt only for statistics.
What do you think about this?
If we select 4., we need to standardize Apache Arrow schema
for statistics. How about the following schema?
----
Metadata:
| Name | Value | Comments |
|----------------------------|-------|--------- |
| ARROW::statistics::version | 1.0.0 | (1) |
(1) This follows semantic versioning.
Fields:
| Name | Type | Comments |
|----------------|-----------------------| -------- |
| column | utf8 | (2) |
| key | utf8 not null | (3) |
| value | VALUE_SCHEMA not null | |
| is_approximate | bool not null | (4) |
(2) If null, then the statistic applies to the entire table.
This is for statistics such as "row_count".
(3) We'll provide pre-defined keys such as "max", "min",
"byte_width" and "distinct_count" but users can also use
application specific keys.
(4) If true, then the value is approximate or best-effort.
VALUE_SCHEMA is a dense union with members:
| Name | Type |
|---------|---------|
| int64 | int64 |
| uint64 | uint64 |
| float64 | float64 |
| binary | binary |
If a column is an int32 column, it uses int64 for
"max"/"min". We don't provide all types here; users should
use a compatible type (int64 for an int32 column) instead.
----
Thanks,
--
kou
In <20240522.113708.2023905028549001143....@clear-code.com>
"[DISCUSS] Statistics through the C data interface" on Wed, 22 May
2024 11:37:08 +0900 (JST),
Sutou Kouhei <k...@clear-code.com> wrote:
Hi,
We're discussing how to provide statistics through the C
data interface at:
https://github.com/apache/arrow/issues/38837
If you're interested in this feature, could you share your
comments?
Motivation:
We can interchange Apache Arrow data by the C data interface
in the same process. For example, we can pass Apache Arrow
data read by Apache Arrow C++ (provider) to DuckDB
(consumer) through the C data interface.
A provider may know statistics about its Apache Arrow data. For
example, a provider can know statistics when it reads Apache
Parquet data, because Apache Parquet may provide statistics.
But a consumer can't access the statistics known by the
provider, because there is no standard way to provide
statistics through the C data interface. If a consumer knows
the statistics, it can process the Apache Arrow data faster
based on them.
Proposal:
https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
How about providing statistics as metadata in ArrowSchema?
We reserve "ARROW" namespace for internal Apache Arrow use:
https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
The ARROW pattern is a reserved namespace for internal
Arrow use in the custom_metadata fields. For example,
ARROW:extension:name.
So we can use "ARROW:statistics" for the metadata key.
We can represent statistics as an ArrowArray like ADBC does.
Here is an example ArrowSchema that is for a record batch
that has "int32 column1" and "string column2":
ArrowSchema {
  .format = "+siu",
  .metadata = {
    "ARROW:statistics" => ArrowArray*, /* table-level statistics
                                          such as row count */
  },
  .children = {
    ArrowSchema {
      .name = "column1",
      .format = "i",
      .metadata = {
        "ARROW:statistics" => ArrowArray*, /* column-level statistics
                                              such as count distinct */
      },
    },
    ArrowSchema {
      .name = "column2",
      .format = "u",
      .metadata = {
        "ARROW:statistics" => ArrowArray*, /* column-level statistics
                                              such as count distinct */
      },
    },
  },
}
The metadata value (the ArrowArray* part) of '"ARROW:statistics"
=> ArrowArray*' is a base-10 string of the address of the
ArrowArray, because only strings can be used as metadata
values. You must not release the statistics ArrowArray*. (Its
release is a no-op function.) It follows the
https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
semantics. (The base ArrowSchema owns the statistics
ArrowArray*.)
The ArrowArray* for statistics uses the following schema:
| Field Name | Field Type | Comments |
|----------------|----------------------------------| -------- |
| key | string not null | (1) |
| value | `VALUE_SCHEMA` not null | |
| is_approximate | bool not null | (2) |
1. We'll provide pre-defined keys such as "max", "min",
"byte_width" and "distinct_count" but users can also use
application specific keys.
2. If true, then the value is approximate or best-effort.
VALUE_SCHEMA is a dense union with members:
| Field Name | Field Type | Comments |
|------------|----------------------------------| -------- |
| int64 | int64 | |
| uint64 | uint64 | |
| float64 | float64 | |
| value      | The same type as the             | (3)      |
|            | ArrowSchema it belongs to.       |          |
3. If the ArrowSchema's type is string, this type is also string.
TODO: Is "value" a good name? If we refer to it from the
top-level statistics schema, we need to use
"value.value". It's a bit strange...
What do you think about this proposal? Could you share your
comments?
Thanks,
--
kou