Re: [DISCUSS] Statistics through the C data interface

Sutou Kouhei Sat, 08 Jun 2024 23:33:28 -0700

Hi,

>> Metadata:
>> | Name                       | Value | Comments |
>> |----------------------------|-------|--------- |
>> | ARROW::statistics::version | 1.0.0 | (1)      |
> 
> I'm not sure this is useful, but it doesn't hurt.


The Apache Arrow columnar format uses semantic
versioning. So I think that other specifications should also
use semantic versioning. FYI: ADBC API standard also uses
semantic versioning.

https://arrow.apache.org/docs/format/ADBC.html#adbc-api-standard-1-0-0
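
A minimal sketch of how a consumer might use such a version
value, assuming it only checks the MAJOR part. The helper
names and the compatibility policy are hypothetical and not
part of any specification:

```python
# Hypothetical helpers; the metadata key uses the ":" spelling
# discussed in this thread, and the policy is only an assumption.
VERSION_KEY = "ARROW:statistics:version"


def parse_version(text):
    """Split a "MAJOR.MINOR.PATCH" string into a tuple of ints."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def can_read(metadata, supported_major):
    """Accept any version whose MAJOR part matches the consumer's."""
    major, _minor, _patch = parse_version(metadata[VERSION_KEY])
    return major == supported_major


metadata = {VERSION_KEY: "1.0.0"}
print(can_read(metadata, supported_major=1))
```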

> Nit: this should be "ARROW:statistics:version" for consistency with
> https://arrow.apache.org/docs/format/Columnar.html#extension-types

You're right. I should have used ":" not "::" here...

>> Fields:
>> | Name           | Type                  | Comments |
>> |----------------|-----------------------| -------- |
>> | column         | utf8                  | (2)      |
>> | key            | utf8 not null         | (3)      |
> 
> 1. Should the key be something like `dictionary(int32, utf8)` to make
> the representation more efficient where there are many columns?

Dictionary encoding is more efficient. But then we need to
standardize not only the keys but also the ID -> key mapping.
If we standardize the ID -> key mapping, we don't need a
dictionary. We can just use IDs, as Felipe's approach does.
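
To illustrate the point, here is a rough sketch in plain
Python (not Arrow code). The key names and the fixed ID table
are hypothetical; nothing here is standardized:

```python
def dictionary_encode(values):
    """Encode values as (dictionary, indices) in first-seen order."""
    dictionary = []
    positions = {}
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


# Hypothetical statistic keys emitted by two different producers.
producer_a = ["min", "max", "min", "max"]
producer_b = ["max", "min"]

dict_a, indices_a = dictionary_encode(producer_a)
dict_b, indices_b = dictionary_encode(producer_b)
# Index 0 means "min" for producer A but "max" for producer B, so
# the indices are meaningless without each producer's dictionary.

# Hypothetical fixed ID -> key table that a specification could define.
STANDARD_KEYS = {0: "min", 1: "max"}
STANDARD_IDS = {key: key_id for key_id, key in STANDARD_KEYS.items()}
ids_a = [STANDARD_IDS[key] for key in producer_a]
ids_b = [STANDARD_IDS[key] for key in producer_b]
# With a shared table, the same ID means the same key everywhere.
```

With per-producer dictionaries a consumer still has to compare
strings. With a shared ID table it can compare integers only.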

> 2. Should the statistics perhaps be nested as a map type under each
> column to avoid repeating `column`, or is that overkill?

Ah, I didn't think of it. A nested type may be a bit complex,
but we already use a union (a nested type) for the value. So
using a map here isn't a problem.
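
A rough sketch of the difference between the two layouts, using
plain Python dicts in place of an Arrow map type. The column
names, keys and values are made up for illustration:

```python
# Flat layout: one (column, key, value) row per statistic, so the
# column name is repeated for every statistic of that column.
flat_rows = [
    ("col_a", "min", 1),
    ("col_a", "max", 9),
    ("col_b", "null_count", 0),
]


def nest_by_column(rows):
    """Group flat rows into one key -> value map per column."""
    nested = {}
    for column, key, value in rows:
        nested.setdefault(column, {})[key] = value
    return nested


nested = nest_by_column(flat_rows)
print(nested)
```

In the nested form each column name appears once, and its
statistics live in the map stored under it.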

> 3. Should there also be room for multi-column statistics (such as
> cardinality of a given column pair), or is it too complex for now?

I didn't think of multi-column statistics either...
It seems that PostgreSQL supports multi-column statistics:
https://www.postgresql.org/docs/current/catalog-pg-statistic-ext.html

We can support multi-column statistics by using a list for
the "column" field. But we also need to add more fields to
VALUE_SCHEMA to store the value of a multi-column statistic.

If we support PostgreSQL's multi-column N-distinct counts
case, we need "map<list<COLUMN_VALUE_TYPE>, uint64>":

https://www.postgresql.org/docs/current/planner-stats.html#PLANNER-STATS-EXTENDED-N-DISTINCT-COUNTS

> k  | 1 2 5
> nd | {"1, 2": 33178, "1, 5": 33178, "2, 5": 27435, "1, 2, 5": 33178}
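
As a rough sketch of that map shape, the "nd" value above can
be read into a mapping from a column group to a count. This is
plain Python, not Arrow code; the function name is hypothetical
and the keys here are the attribute numbers that PostgreSQL
prints:

```python
import json

# The "nd" value copied from the PostgreSQL example quoted above.
nd_text = '{"1, 2": 33178, "1, 5": 33178, "2, 5": 27435, "1, 2, 5": 33178}'


def parse_ndistinct(text):
    """Map each column group (a tuple of ints) to its distinct count."""
    result = {}
    for group, count in json.loads(text).items():
        # int() ignores the space that follows each comma.
        columns = tuple(int(part) for part in group.split(","))
        result[columns] = count
    return result


ndistinct = parse_ndistinct(nd_text)
print(ndistinct[(2, 5)])
```

Each key is a list of columns and each value is an unsigned
count, which is why a map with list keys is needed.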


If we support PostgreSQL's multi-column most common value
lists case, we need a more complex type...

https://www.postgresql.org/docs/current/planner-stats.html#PLANNER-STATS-EXTENDED-MCV-LISTS

>  index |         values         | nulls | frequency | base_frequency
> -------+------------------------+-------+-----------+----------------
>      0 | {Washington, DC}       | {f,f} |  0.003467 |        2.7e-05
>      1 | {Apo, AE}              | {f,f} |  0.003067 |        1.9e-05
>      2 | {Houston, TX}          | {f,f} |  0.002167 |       0.000133
>      3 | {El Paso, TX}          | {f,f} |     0.002 |       0.000113
>      4 | {New York, NY}         | {f,f} |  0.001967 |       0.000114
>      5 | {Atlanta, GA}          | {f,f} |  0.001633 |        3.3e-05
>      6 | {Sacramento, CA}       | {f,f} |  0.001433 |        7.8e-05
>      7 | {Miami, FL}            | {f,f} |    0.0014 |          6e-05
>      8 | {Dallas, TX}           | {f,f} |  0.001367 |        8.8e-05
>      9 | {Chicago, IL}          | {f,f} |  0.001333 |        5.1e-05
>    ...
> (99 rows)


It may be complex to support the full range of multi-column
statistics use cases. How about standardizing the first
version without multi-column statistics support? We can add
support for multi-column statistics later, using feedback
from users of the first version.


Thanks,
-- 
kou

In <57595559-a561-4bd2-9efd-b67aa9a32...@python.org>
  "Re: [DISCUSS] Statistics through the C data interface" on Thu, 6 Jun 2024 11:40:50 +0200,
  Antoine Pitrou <anto...@python.org> wrote:

> 
> Hi Kou,
> 
> Thanks for pushing for this!
> 
> On 06/06/2024 at 11:27, Sutou Kouhei wrote:
>> 4. Standardize Apache Arrow schema for statistics and
>>     transmit statistics via separated API call that uses the
>>     C data interface
> [...]
>> I think that 4. is the best approach in these candidates.
> 
> I agree.
> 
>> If we select 4., we need to standardize Apache Arrow schema
>> for statistics. How about the following schema?
>> ----
>> Metadata:
>> | Name                       | Value | Comments |
>> |----------------------------|-------|--------- |
>> | ARROW::statistics::version | 1.0.0 | (1)      |
> 
> I'm not sure this is useful, but it doesn't hurt.
> 
> Nit: this should be "ARROW:statistics:version" for consistency with
> https://arrow.apache.org/docs/format/Columnar.html#extension-types
> 
>> Fields:
>> | Name           | Type                  | Comments |
>> |----------------|-----------------------| -------- |
>> | column         | utf8                  | (2)      |
>> | key            | utf8 not null         | (3)      |
> 
> 1. Should the key be something like `dictionary(int32, utf8)` to make
> the representation more efficient where there are many columns?
> 
> 2. Should the statistics perhaps be nested as a map type under each
> column to avoid repeating `column`, or is that overkill?
> 
> 3. Should there also be room for multi-column statistics (such as
> cardinality of a given column pair), or is it too complex for now?
> 
> Regards
> 
> Antoine.
