felipecrv commented on issue #38837: URL: https://github.com/apache/arrow/issues/38837#issuecomment-2155478257
> Summary for the discussion on the mailing list so far: https://lists.apache.org/thread/6m9xrhfktnt0nnyss7qo333n0cl76ypc

Thank you for the summary @kou. I replied to that thread [1] with a statistics encoding scheme that avoids free-form strings. Strings lead to metadata bloat and to parsing bugs when consumers match only on prefixes of string identifiers (or use other "clever" key-matching ideas), which hinders format evolution.

The value of a standard for statistics comes from producers and consumers agreeing on the statistics available, so I think centralizing the definitions in the Arrow C Data Interface specification is better than letting the ecosystem work out the naming of statistics from different producers. I'm trying to avoid query processors (consumers of statistics) having to deal with quirks based on the system producing the statistics.

In the scheme I proposed, a single array would contain statistics for the table/batch and all the columns. That's more friendly to the memory allocator and allows all the statistics to be processed in a single loop. If the array grows too big, it can be chunked at any position, which allows statistics to be streamed if big values start being produced.

@kou proposed this schema for an `ArrowArray*` carrying statistics in [2]:

> | Field Name     | Field Type              | Comments |
> |----------------|-------------------------|----------|
> | key            | string not null         | (1)      |
> | value          | `VALUE_SCHEMA` not null |          |
> | is_approximate | bool not null           | (2)      |
>
> 1. We'll provide pre-defined keys such as max, min, byte_width and distinct_count but users can use application specific keys too.
> 2. If true, then the value is approximate or best-effort.
I will post what I tried to explain in the mailing list [1] in this format so the comparison is easier:

| Field Name | Field Type                            | Comments                       |
|------------|---------------------------------------|--------------------------------|
| subject    | int32                                 | Field/column index (1) (2) (3) |
| statistics | map<int32, dense_union<...>> NOT NULL | (4)                            |

1. 0-based.
2. When `subject` is `NULL`, the statistics refer to the whole schema (table or batch; which one is inferable from the context of the `ArrowArray*` of statistics).
3. The `subject` column isn't unique, to allow for the streaming of statistics: a field or NULL can come up again in the array with more statistics about the table (NULL) or the column.
4. This column stores maps from `ArrowStatKind` (int32) [3] to a `dense_union` that changes according to the kinds of statistics generated by a specific provider. The meaning of each key is standardized in the C Data Interface spec together with the list of types in the `dense_union` that consumers should expect.

Like @kou's proposal, it uses `dense_union` for the values, but it doesn't restrict the statistic to be the same type as the column (a column might have a certain type, but a statistic about it might be integer- or boolean-typed, like null counts and any predicate).

The `is_approximate` field was removed because that is a property of the statistic kind and affects the expected type of the value: the approximate value of an integer-typed column might be a floating-point number if it results from some estimation process.

[1] https://lists.apache.org/thread/gnjq46wn7dbkj2145dskr9tkgfg1ncdw
[2] https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
[3] Illustrative examples of statistic kinds

```c
#include <stdint.h>

// Statistics values are identified by specified int32-valued keys
// so that producers and consumers can agree on physical
// encoding and semantics. Statistics can be about a column,
// a record batch, or both.
typedef int32_t ArrowStatKind;

// Used for non-standard statistic kinds.
// Value must be a struct<name: utf8, value: dense_union<...>>
#define ARROW_STAT_ANY 0
// Exact number of nulls in a column. Value must be int32 or int64.
#define ARROW_STAT_NULL_COUNT_EXACT 1
// Approximate number of nulls in a column. Value must be float32 or float64.
#define ARROW_STAT_NULL_COUNT_APPROX 2
// The minimum and maximum values of a column.
// Value must be the same type as the column or a wider type.
// Supported types are: ...
#define ARROW_STAT_MIN_APPROX 3
#define ARROW_STAT_MIN_EXACT 4
#define ARROW_STAT_MAX_APPROX 5
#define ARROW_STAT_MAX_EXACT 6
#define ARROW_STAT_CARDINALITY_APPROX 7
#define ARROW_STAT_CARDINALITY_EXACT 8
#define ARROW_STAT_COUNT_DISTINCT_APPROX 9
#define ARROW_STAT_COUNT_DISTINCT_EXACT 10
// ... Represented as a
// list<
//   struct<quantile: float32 | float64,
//          sum: "same as column type or a type with wider precision">>
#define ARROW_STAT_CUMULATIVE_QUANTILES 11
```
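
The single-loop processing described above could look roughly like the following on the consumer side. This is only a sketch: `StatEntry`, `NullCountSummary`, `summarize_null_counts`, and the `SKETCH_*` constants are hypothetical names that are not part of any Arrow API. It assumes the `map<int32, dense_union<...>>` column has already been flattened into one entry per (subject, kind, value) triple, with `-1` standing in for a NULL subject, and that a later entry for the same column and kind replaces an earlier one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical key values for this sketch only; the real numbering
// would be fixed by the specification.
enum {
  SKETCH_STAT_NULL_COUNT_EXACT = 1,
  SKETCH_STAT_NULL_COUNT_APPROX = 2
};

// Stands in for a NULL subject (statistics about the whole table/batch).
#define SKETCH_SUBJECT_TABLE (-1)

// Hypothetical flattened view of one statistics entry. The kind decides
// which union member is valid, so no separate is_approximate flag is needed.
typedef struct {
  int32_t subject;  // 0-based column index, or SKETCH_SUBJECT_TABLE
  int32_t kind;     // integer key, compared exactly (no string matching)
  union {
    int64_t i64;    // used by the *_EXACT null count
    double f64;     // used by the *_APPROX null count
  } value;
} StatEntry;

typedef struct {
  int has_exact;
  int64_t exact;
  int has_approx;
  double approx;
} NullCountSummary;

// Walks the entries once and fills out[c] for every column c < num_columns.
// Subjects may repeat (streaming); a later entry overwrites an earlier one.
// Returns the number of entries that were skipped because the subject is
// not a known column or the kind is not handled here.
size_t summarize_null_counts(const StatEntry* entries, size_t n,
                             NullCountSummary* out, size_t num_columns) {
  assert(entries != NULL || n == 0);
  assert(out != NULL || num_columns == 0);
  for (size_t c = 0; c < num_columns; ++c) {
    out[c] = (NullCountSummary){0, 0, 0, 0.0};
  }
  size_t skipped = 0;
  for (size_t i = 0; i < n; ++i) {
    const StatEntry* e = &entries[i];
    if (e->subject < 0 || (size_t)e->subject >= num_columns) {
      ++skipped;  // table-level or out-of-range subject
      continue;
    }
    NullCountSummary* s = &out[e->subject];
    switch (e->kind) {
      case SKETCH_STAT_NULL_COUNT_EXACT:
        s->has_exact = 1;
        s->exact = e->value.i64;
        break;
      case SKETCH_STAT_NULL_COUNT_APPROX:
        s->has_approx = 1;
        s->approx = e->value.f64;
        break;
      default:
        ++skipped;  // unknown kinds are ignored, not guessed at
        break;
    }
  }
  return skipped;
}
```

Because the keys are integers, an unknown kind simply falls through to the `default` branch, so a consumer cannot accidentally match it against a known statistic the way a prefix comparison on strings could.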
