On Sun, Jun 9, 2024 at 7:53 PM Sutou Kouhei <k...@clear-code.com> wrote:
>
> Hi,
>
> In <d6e52b13-a822-4c2f-8e9e-6023c2dd8...@python.org>
>   "Re: [DISCUSS] Statistics through the C data interface" on Sun, 9 Jun 2024 22:11:54 +0200,
>   Antoine Pitrou <anto...@python.org> wrote:
>
> >>>> Fields:
> >>>> | Name           | Type                  | Comments |
> >>>> |----------------|-----------------------| -------- |
> >>>> | column         | utf8                  | (2)      |
> >>>> | key            | utf8 not null         | (3)      |
> >>>
> >>> 1. Should the key be something like `dictionary(int32, utf8)` to make
> >>> the representation more efficient where there are many columns?
> >> Dictionary is more efficient. But we need to standardize not
> >> only key but also ID -> key mapping.
> >
> > I don't get why we would need to standardize ID -> key mapping. The
> > key names would be significant, the dictionary mapping is just for
> > efficiency.
>
> Ah, space efficiency was only discussed here, right? I
> thought that computational efficiency is also discussed
> here. If we standardize ID -> key mapping, consumers don't
> need to compare key names.
>
> Example: We want to find "distinct_count" statistics.
>
> If we standardize ID -> key mapping (1 -> "distinct_count"),
> consumers can find "distinct_count" statistics by finding ID
> 1 entry.

A consumer should decode the statistics array into the structure of
its choice before using it:

    my_specific_stats_map.Clear();  // all stats are unknown
    for each item in statistics_array {
      switch (item.stat_kind) {
        case ARRAY_STAT_X:
          auto value_from_the_dense_union = ....
          my_specific_stats_map.SetStatX(value_from_the_dense_union);
          break;
        case ARRAY_STAT_Y:
          ...
          break;
        default:
          // just ignore statistics that a producer might add, but that this
          // consumer doesn't take advantage of.
          break;
      }
    }

Maps on the wire (network wire, or shared-memory "wire") are better
represented as lists that become a hash table (or simple structs) at
the receiving end.

We don't have to standardize the positions in the statistics array,
which allows a small statistics array when not many statistics are
available. It also allows the number of standardized statistics to grow
without creating overhead unless a producer opts in to producing that
statistic.

> If we don't standardize ID -> key mapping, consumers need to
> compare key name to find "distinct_count" statistics.
>
>
> Anyway, this (string comparison) will not be a large
> overhead because (1) statistics data will not be large data
> and (2) consumers can cache ID -> key mapping to avoid
> duplicated string comparisons. So standardizing ID -> key
> mapping isn't required.
>
>
> Thanks,
> --
> kou
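
To make the decode step concrete, here is a minimal C++17 sketch of both variants discussed in this thread: plain utf8 keys, and dictionary-encoded keys where the consumer resolves each dictionary entry once instead of comparing strings per row. Everything in it is an assumption for illustration, not Arrow API: `StatValue` (a `std::variant` standing in for the dense union), `StatEntry`, `DictEncodedStats`, `ColumnStats`, and the `"null_count"` key are hypothetical names; only `"distinct_count"` comes from the discussion above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-in for a dense-union statistic value.
using StatValue = std::variant<std::int64_t, double>;

// Hypothetical stand-in for one row of the statistics array (utf8 key).
struct StatEntry {
  std::string key;
  StatValue value;
};

// Consumer-side structure: only the statistics this consumer cares about.
struct ColumnStats {
  std::optional<std::int64_t> distinct_count;
  std::optional<std::int64_t> null_count;  // hypothetical second statistic
};

enum class StatKind { kUnknown, kDistinctCount, kNullCount };

// The only place where key names are compared as strings.
StatKind KindFromKey(const std::string& key) {
  if (key == "distinct_count") return StatKind::kDistinctCount;
  if (key == "null_count") return StatKind::kNullCount;
  return StatKind::kUnknown;
}

// Store one value into the consumer structure; unknown kinds and
// unexpected value types are ignored.
void Apply(ColumnStats& stats, StatKind kind, const StatValue& value) {
  const auto* as_int = std::get_if<std::int64_t>(&value);
  if (as_int == nullptr) return;
  switch (kind) {
    case StatKind::kDistinctCount: stats.distinct_count = *as_int; break;
    case StatKind::kNullCount: stats.null_count = *as_int; break;
    default: break;  // statistic this consumer does not use
  }
}

// Variant 1: plain keys, one string comparison pass per row.
ColumnStats DecodeStats(const std::vector<StatEntry>& entries) {
  ColumnStats stats;  // all stats start unknown
  for (const auto& e : entries) Apply(stats, KindFromKey(e.key), e.value);
  return stats;
}

// Hypothetical dictionary-encoded layout: the ID -> key mapping is chosen
// by the producer and is not standardized.
struct DictEncodedStats {
  std::vector<std::string> dictionary;  // ID -> key
  std::vector<std::int32_t> key_ids;    // one ID per row
  std::vector<StatValue> values;        // one value per row
};

// Variant 2: resolve each dictionary entry once, then switch on the
// cached kind for every row.
ColumnStats DecodeDictStats(const DictEncodedStats& in) {
  assert(in.key_ids.size() == in.values.size());
  std::vector<StatKind> kinds;
  kinds.reserve(in.dictionary.size());
  for (const auto& key : in.dictionary) kinds.push_back(KindFromKey(key));

  ColumnStats stats;
  for (std::size_t i = 0; i < in.key_ids.size(); ++i) {
    const std::int32_t id = in.key_ids[i];
    if (id < 0 || static_cast<std::size_t>(id) >= kinds.size()) continue;
    Apply(stats, kinds[static_cast<std::size_t>(id)], in.values[i]);
  }
  return stats;
}
```

In the second variant the number of string comparisons depends on the dictionary size rather than the number of rows, which is the caching point (2) in the quoted message, and it works for whatever ID -> key mapping the producer happens to choose.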