Hi,
One potential challenge with encoding statistics in the schema metadata
is that some systems may consider this metadata as part of assessing
schema equivalence.
However, I think the bigger question is what the intended use-case for
these statistics is? Often query engines want to collect statistics from
multiple containers in one go, as this allows for efficient vectorised
pruning across multiple files, row groups, etc... I therefore wonder if
the solution is simply to return separate arrays of min, max, etc...
potentially even grouped together into a single StructArray?
This would have the benefit of not needing specification changes, whilst
being significantly more efficient than an approach centered on scalar
statistics. FWIW this is the approach taken by DataFusion for pruning
statistics [1], and in arrow-rs we represent scalars as arrays to avoid
needing to define a parallel serialization standard [2].
Kind Regards,
Raphael
[1]:
https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/trait.PruningStatistics.html
[2]: https://github.com/apache/arrow-rs/pull/4393
On 22/05/2024 03:37, Sutou Kouhei wrote:
Hi,
We're discussing how to provide statistics through the C
data interface at:
https://github.com/apache/arrow/issues/38837
If you're interested in this feature, could you share your
comments?
Motivation:
We can interchange Apache Arrow data by the C data interface
in the same process. For example, we can pass Apache Arrow
data read by Apache Arrow C++ (provider) to DuckDB
(consumer) through the C data interface.
A provider may know Apache Arrow data statistics. For
example, a provider can know statistics when it reads Apache
Parquet data because Apache Parquet may provide statistics.
But a consumer can't know statistics that are known by a
producer. Because there isn't a standard way to provide
statistics through the C data interface. If a consumer can
know statistics, it can process Apache Arrow data faster
based on statistics.
Proposal:
https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
How about providing statistics as a metadata in ArrowSchema?
We reserve "ARROW" namespace for internal Apache Arrow use:
https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
The ARROW pattern is a reserved namespace for internal
Arrow use in the custom_metadata fields. For example,
ARROW:extension:name.
So we can use "ARROW:statistics" for the metadata key.
We can represent statistics as a ArrowArray like ADBC does.
Here is an example ArrowSchema that is for a record batch
that has "int32 column1" and "string column2":
ArrowSchema {
.format = "+siu",
.metadata = {
"ARROW:statistics" => ArrowArray*, /* table-level statistics such as row
count */
},
.children = {
ArrowSchema {
.name = "column1",
.format = "i",
.metadata = {
"ARROW:statistics" => ArrowArray*, /* column-level statistics such as
count distinct */
},
},
ArrowSchema {
.name = "column2",
.format = "u",
.metadata = {
"ARROW:statistics" => ArrowArray*, /* column-level statistics such as
count distinct */
},
},
},
}
The metadata value (ArrowArray* part) of '"ARROW:statistics"
=> ArrowArray*' is a base 10 string of the address of the
ArrowArray. Because we can use only string for metadata
value. You can't release the statistics ArrowArray*. (Its
release is a no-op function.) It follows
https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
semantics. (The base ArrowSchema owns statistics
ArrowArray*.)
ArrowArray* for statistics use the following schema:
| Field Name | Field Type | Comments |
|----------------|----------------------------------| -------- |
| key | string not null | (1) |
| value | `VALUE_SCHEMA` not null | |
| is_approximate | bool not null | (2) |
1. We'll provide pre-defined keys such as "max", "min",
"byte_width" and "distinct_count" but users can also use
application specific keys.
2. If true, then the value is approximate or best-effort.
VALUE_SCHEMA is a dense union with members:
| Field Name | Field Type | Comments |
|------------|----------------------------------| -------- |
| int64 | int64 | |
| uint64 | uint64 | |
| float64 | float64 | |
| value | The same type of the ArrowSchema | (3) |
| | that is belonged to. | |
3. If the ArrowSchema's type is string, this type is also string.
TODO: Is "value" good name? If we refer it from the
top-level statistics schema, we need to use
"value.value". It's a bit strange...
What do you think about this proposal? Could you share your
comments?
Thanks,