felipecrv commented on issue #38837:
URL: https://github.com/apache/arrow/issues/38837#issuecomment-2155478257

   > Summary for the discussion on the mailing list so far: 
https://lists.apache.org/thread/6m9xrhfktnt0nnyss7qo333n0cl76ypc
   
   Thank you for the summary @kou. I replied to that thread [1] with a 
statistics encoding scheme that avoids free-form strings. Strings bloat the 
metadata and invite parsing bugs when consumers match only on prefixes of 
string identifiers (or apply other "clever" key-matching heuristics), which 
hinders format evolution.
   
   The value of a standard for statistics comes from producers and consumers 
agreeing on the statistics available, so I think centralizing the definitions 
in the Arrow C Data Interface specification is better than letting the 
ecosystem work out the naming of statistics from different producers. I'm 
trying to avoid query processors (consumers of statistics) having to deal 
with quirks that depend on the system producing the statistics.
   
   In the scheme I proposed, a single array would contain statistics for the 
table/batch and all the columns. That's friendlier to the memory allocator 
and allows all the statistics to be processed in a single loop. If the 
array grows too big, it can be chunked at any position, allowing statistics 
to be streamed if large values start being produced.
   
   @kou proposed this schema for an `ArrowArray*` carrying statistics in [2]:
   
   > | Field Name     | Field Type                       | Comments |
   > |----------------|----------------------------------| -------- |
   > | key            | string not null                  | (1)      |
   > | value          | `VALUE_SCHEMA` not null          |          |
   > | is_approximate | bool not null                    | (2)      |
   > 
   > 1. We'll provide pre-defined keys such as max, min, byte_width and 
distinct_count but users can use application specific keys too.
   > 2. If true, then the value is approximate or best-effort.
   
   
   I will post what I tried to explain in the mailing list [1] in this format 
so the comparison is easier:
   
   | Field Name     | Field Type                             | Comments                       |
   |----------------|----------------------------------------|--------------------------------|
   | subject        | int32                                  | Field/column index (1) (2) (3) |
   | statistics     | map<int32, dense_union<...>> NOT NULL  | (4)                            |
   
   1. 0-based.
   2. When `subject` is `NULL`, the statistics refer to the whole schema (table 
or batch; which one is inferable from the context of the `ArrowArray*` of statistics).
   3. The `subject` column isn't required to be unique, to allow for the streaming 
of statistics: a field index (or `NULL`) can appear again later in the array with 
more statistics about that column (or the table).
   4. This column stores maps from `ArrowStatKind` (int32) [3] to a 
`dense_union` that changes according to the kinds of statistics generated by a 
specific provider. The meaning of each key is standardized in the C Data 
Interface spec together with the list of types in the dense_union that 
consumers should expect.
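   
   To make the `NULL`-subject convention concrete, here is a minimal sketch in 
plain C. The struct and function names are hypothetical and not part of the 
proposal; in the real encoding these would be Arrow arrays (a nullable int32 
column plus a `map<int32, dense_union<...>>` column), not C structs:
   
   ```c
   #include <assert.h>
   #include <stdbool.h>
   #include <stdint.h>
   
   typedef int32_t ArrowStatKind;
   
   // Hypothetical in-memory mirror of one row of the proposed statistics
   // array, used only to illustrate the semantics of `subject`.
   typedef struct {
     bool has_subject;    // false => statistic refers to the whole table/batch
     int32_t subject;     // 0-based column index, valid when has_subject is true
     ArrowStatKind kind;  // standardized int32 key (null count, min, max, ...)
     int64_t value_i64;   // stand-in for one branch of the dense_union value
   } StatEntry;
   
   // Returns true when the entry describes the given column rather than
   // the whole table. Consumers dispatch on integers, never on strings.
   bool stat_is_for_column(const StatEntry *e, int32_t column) {
     return e->has_subject && e->subject == column;
   }
   ```
   
   Because `subject` values may repeat, a consumer can fold entries for the same 
column arriving in different chunks into one accumulated statistics record.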
   
   Like @kou's proposal, it uses `dense_union` for the values. But it doesn't 
restrict a statistic to have the same type as its column (a column might have 
a certain type while a statistic about it is integer- or boolean-typed, like 
null counts or predicate results). The `is_approximate` field was removed 
because approximateness is a property of the statistic kind and affects the 
expected type of the value: the approximate value of a statistic on an 
integer-typed column might be a floating-point number if it results from some 
estimation process.
    
   
   [1] https://lists.apache.org/thread/gnjq46wn7dbkj2145dskr9tkgfg1ncdw
   [2] https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
   [3] Illustrative examples of statistic kinds
   
   ```c
   // Statistics values are identified by specified int32-valued keys
   // so that producers and consumers can agree on physical
   // encoding and semantics. Statistics can be about a column,
   // a record batch, or both.
   typedef int32_t ArrowStatKind;
   
   // Used for non-standard statistic kinds.
   // Value must be a struct<name: utf8, value: dense_union<...>>
   #define ARROW_STAT_ANY 0
   // Exact number of nulls in a column. Value must be int32 or int64.
   #define ARROW_STAT_NULL_COUNT_EXACT 1
   // Approximate number of nulls in a column. Value must be float32 or float64.
   #define ARROW_STAT_NULL_COUNT_APPROX 2
   // The minimum and maximum values of a column.
   // Value must be the same type of the column or a wider type.
   // Supported types are: ...
   #define ARROW_STAT_MIN_APPROX 3
   #define ARROW_STAT_MIN_EXACT 4
   #define ARROW_STAT_MAX_APPROX 5
   #define ARROW_STAT_MAX_EXACT 6
   #define ARROW_STAT_CARDINALITY_APPROX 7
   #define ARROW_STAT_CARDINALITY_EXACT 8
   #define ARROW_STAT_COUNT_DISTINCT_APPROX 9
   #define ARROW_STAT_COUNT_DISTINCT_EXACT 10
   // ... Represented as a
   // list<
   //   struct<quantile: float32 | float64,
   //          sum: "same as column type or a type with wider precision">>
   #define ARROW_STAT_CUMULATIVE_QUANTILES 11
   ```
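   
   As a sketch of the consumer side: with centralized int32 keys, dispatch is a 
plain `switch`, with no string parsing or prefix matching. The helper below is 
hypothetical (the kind macros are repeated so the snippet is self-contained):
   
   ```c
   #include <assert.h>
   #include <stdbool.h>
   #include <stdint.h>
   
   typedef int32_t ArrowStatKind;
   
   #define ARROW_STAT_NULL_COUNT_EXACT 1
   #define ARROW_STAT_NULL_COUNT_APPROX 2
   #define ARROW_STAT_MIN_APPROX 3
   #define ARROW_STAT_MIN_EXACT 4
   #define ARROW_STAT_MAX_APPROX 5
   #define ARROW_STAT_MAX_EXACT 6
   
   // A consumer classifying statistics by their standardized int32 key.
   // Whether a value is exact is implied by the kind itself, which is why
   // a separate is_approximate column is unnecessary in this scheme.
   bool stat_kind_is_exact(ArrowStatKind kind) {
     switch (kind) {
       case ARROW_STAT_NULL_COUNT_EXACT:
       case ARROW_STAT_MIN_EXACT:
       case ARROW_STAT_MAX_EXACT:
         return true;
       default:
         return false;
     }
   }
   ```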
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to