Hi,

>> Metadata:
>> | Name                       | Value | Comments |
>> |----------------------------|-------|--------- |
>> | ARROW::statistics::version | 1.0.0 | (1)      |
>
> I'm not sure this is useful, but it doesn't hurt.

The Apache Arrow columnar format uses semantic versioning,
so I think that other specifications should use semantic
versioning too.

FYI: The ADBC API standard also uses semantic versioning.
https://arrow.apache.org/docs/format/ADBC.html#adbc-api-standard-1-0-0

> Nit: this should be "ARROW:statistics:version" for consistency with
> https://arrow.apache.org/docs/format/Columnar.html#extension-types

You're right. I should have used ":" not "::" here...

>> Fields:
>> | Name           | Type                  | Comments |
>> |----------------|-----------------------| -------- |
>> | column         | utf8                  | (2)      |
>> | key            | utf8 not null         | (3)      |
>
> 1. Should the key be something like `dictionary(int32, utf8)` to make
>    the representation more efficient where there are many columns?

A dictionary is more efficient. But then we need to
standardize not only the key but also the ID -> key mapping.
If we standardize the ID -> key mapping, we don't need to
use a dictionary. We can just use the ID, like Felipe's
approach does.

> 2. Should the statistics perhaps be nested as a map type under each
>    column to avoid repeating `column`, or is that overkill?

Ah, I didn't think of it. A nested type may be a bit complex
but we already use a union (a nested type) for the value. So
using a map here isn't a problem.

> 3. Should there also be room for multi-column statistics (such as
>    cardinality of a given column pair), or is it too complex for now?

I didn't think of multi-column statistics either... It seems
that PostgreSQL supports multi-column statistics:
https://www.postgresql.org/docs/current/catalog-pg-statistic-ext.html

We can support multi-column statistics by using a list for
the "column" field. But we also need to add more fields to
VALUE_SCHEMA to store the value of a multi-column statistic.

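To make points 1 and 2 above a bit more concrete, here is a small plain-Python sketch (no Arrow library involved) of what dictionary-encoding the "key" values and nesting the statistics under each column would do to the flat (column, key, value) rows. The column names "a" and "b", the keys "min" and "max", and the helper names are all made up for illustration; they are not part of the proposal.

```python
# Flat layout: one row per (column, key) pair, as in the proposed fields.
# Column names, statistic keys and values are hypothetical.
flat_rows = [
    ("a", "min", 1),
    ("a", "max", 9),
    ("b", "min", 0),
    ("b", "max", 5),
]


def dictionary_encode(values):
    """Return (dictionary, indices) with dictionary[indices[i]] == values[i]."""
    dictionary = []   # unique values in first-seen order
    positions = {}    # value -> index into dictionary
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def nest_by_column(rows):
    """Group flat rows into {column: {key: value}} so `column` is not repeated."""
    nested = {}
    for column, key, value in rows:
        nested.setdefault(column, {})[key] = value
    return nested


# Point 1: each distinct key string is stored once; rows carry small integers.
key_dictionary, key_indices = dictionary_encode([key for _, key, _ in flat_rows])

# Point 2: statistics nested as a map under each column.
nested = nest_by_column(flat_rows)
```

Note that the dictionary here is built per batch, so the integer IDs only have meaning together with that batch's dictionary. That is the difference from standardizing the ID -> key mapping itself.
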
If we support PostgreSQL's multi-column N-distinct counts
case, we need "map<list<COLUMN_VALUE_TYPE>, uint64>":
https://www.postgresql.org/docs/current/planner-stats.html#PLANNER-STATS-EXTENDED-N-DISTINCT-COUNTS

> k  | 1 2 5
> nd | {"1, 2": 33178, "1, 5": 33178, "2, 5": 27435, "1, 2, 5": 33178}

If we support PostgreSQL's multi-column most common value
lists case, we need a more complex type...
https://www.postgresql.org/docs/current/planner-stats.html#PLANNER-STATS-EXTENDED-MCV-LISTS

>  index |         values         | nulls | frequency | base_frequency
> -------+------------------------+-------+-----------+----------------
>      0 | {Washington, DC}       | {f,f} |  0.003467 |        2.7e-05
>      1 | {Apo, AE}              | {f,f} |  0.003067 |        1.9e-05
>      2 | {Houston, TX}          | {f,f} |  0.002167 |       0.000133
>      3 | {El Paso, TX}          | {f,f} |     0.002 |       0.000113
>      4 | {New York, NY}         | {f,f} |  0.001967 |       0.000114
>      5 | {Atlanta, GA}          | {f,f} |  0.001633 |        3.3e-05
>      6 | {Sacramento, CA}       | {f,f} |  0.001433 |        7.8e-05
>      7 | {Miami, FL}            | {f,f} |    0.0014 |          6e-05
>      8 | {Dallas, TX}           | {f,f} |  0.001367 |        8.8e-05
>      9 | {Chicago, IL}          | {f,f} |  0.001333 |        5.1e-05
> ...
> (99 rows)

It may be complex to support the full range of multi-column
statistics use cases. How about standardizing this without
multi-column statistics support for the first version? We
can add support for multi-column statistics later, using
feedback from users of the first version.


Thanks,
-- 
kou

In <57595559-a561-4bd2-9efd-b67aa9a32...@python.org>
  "Re: [DISCUSS] Statistics through the C data interface" on Thu, 6 Jun 2024 11:40:50 +0200,
  Antoine Pitrou <anto...@python.org> wrote:

>
> Hi Kou,
>
> Thanks for pushing for this!
>
> On 06/06/2024 at 11:27, Sutou Kouhei wrote:
>> 4. Standardize Apache Arrow schema for statistics and
>>    transmit statistics via separated API call that uses the
>>    C data interface
> [...]
>> I think that 4. is the best approach in these candidates.
>
> I agree.
>
>> If we select 4., we need to standardize Apache Arrow schema
>> for statistics.
>> How about the following schema?
>> ----
>> Metadata:
>> | Name                       | Value | Comments |
>> |----------------------------|-------|--------- |
>> | ARROW::statistics::version | 1.0.0 | (1)      |
>
> I'm not sure this is useful, but it doesn't hurt.
>
> Nit: this should be "ARROW:statistics:version" for consistency with
> https://arrow.apache.org/docs/format/Columnar.html#extension-types
>
>> Fields:
>> | Name           | Type                  | Comments |
>> |----------------|-----------------------| -------- |
>> | column         | utf8                  | (2)      |
>> | key            | utf8 not null         | (3)      |
>
> 1. Should the key be something like `dictionary(int32, utf8)` to make
>    the representation more efficient where there are many columns?
>
> 2. Should the statistics perhaps be nested as a map type under each
>    column to avoid repeating `column`, or is that overkill?
>
> 3. Should there also be room for multi-column statistics (such as
>    cardinality of a given column pair), or is it too complex for now?
>
> Regards
>
> Antoine.