Re: [DISCUSS] Statistics through the C data interface

Felipe Oliveira Carvalho Fri, 14 Jun 2024 08:39:24 -0700

On Sun, Jun 9, 2024 at 7:53 PM Sutou Kouhei <k...@clear-code.com> wrote:
>
> Hi,
>
> In <d6e52b13-a822-4c2f-8e9e-6023c2dd8...@python.org>
>   "Re: [DISCUSS] Statistics through the C data interface" on Sun, 9 Jun 2024 22:11:54 +0200,
>   Antoine Pitrou <anto...@python.org> wrote:
>
> >>>> Fields:
> >>>> | Name           | Type                  | Comments |
> >>>> |----------------|-----------------------| -------- |
> >>>> | column         | utf8                  | (2)      |
> >>>> | key            | utf8 not null         | (3)      |
> >>>
> >>> 1. Should the key be something like `dictionary(int32, utf8)` to make
> >>> the representation more efficient where there are many columns?
> >> Dictionary is more efficient. But we need to standardize not
> >> only key but also ID -> key mapping.
> >
> > I don't get why we would need to standardize ID -> key mapping. The
> > key names would be significant, the dictionary mapping is just for
> > efficiency.
>
> Ah, space efficiency was only discussed here, right? I
> thought that computational efficiency is also discussed
> here. If we standardize ID -> key mapping, consumers don't
> need to compare key names.
>
> Example: We want to find "distinct_count" statistics.
>
> If we standardize ID -> key mapping (1 -> "distinct_count"),
> consumers can find "distinct_count" statistics by finding ID
> 1 entry.


A consumer should decode the statistics array into the structure of
their choice before using it.

my_specific_stats_map.Clear();  // all stats are unknown
for each item in statistics_array {
  switch (item.stat_kind) {
    case ARRAY_STAT_X:
      auto value_from_the_dense_union = ....
      my_specific_stats_map.SetStatX(value_from_the_dense_union);
      break;
    case ARRAY_STAT_Y:
      ...
      break;
    default:
      // Ignore statistics that a producer might add but that this
      // consumer doesn't take advantage of.
      break;
  }
}

Maps on the wire (network wire, or shared-memory "wire") are better
represented as lists that become a hash table (or a simple struct) at
the receiving end. We don't have to standardize the positions in the
statistics array, which keeps the array small when few statistics are
available. It also lets the number of standardized statistics grow
without adding overhead unless a producer opts in to producing that
statistic.
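
To make the "list on the wire, struct at the receiving end" idea concrete,
here is a minimal C++17 sketch. It assumes the rows of the statistics array
have already been extracted into plain (key, value) pairs; StatEntry,
MyStats, DecodeStatistics and Example are hypothetical names, not part of
any Arrow API, and "null_count" is an illustrative key that has not been
proposed in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <variant>
#include <vector>

// Hypothetical in-memory view of one row of the statistics array:
// a key plus a value taken from the dense union.
using StatValue = std::variant<int64_t, double>;
struct StatEntry {
  std::string key;
  StatValue value;
};

// Consumer-specific structure; an empty optional means "unknown".
struct MyStats {
  std::optional<int64_t> distinct_count;
  std::optional<int64_t> null_count;  // illustrative key, an assumption
};

// Walk the list once and keep only the statistics this consumer uses.
MyStats DecodeStatistics(const std::vector<StatEntry>& entries) {
  MyStats stats;  // all stats start as unknown
  for (const StatEntry& entry : entries) {
    const int64_t* as_int = std::get_if<int64_t>(&entry.value);
    if (entry.key == "distinct_count" && as_int != nullptr) {
      stats.distinct_count = *as_int;
    } else if (entry.key == "null_count" && as_int != nullptr) {
      stats.null_count = *as_int;
    }
    // Anything else is a statistic this consumer does not use
    // (or has an unexpected type); skip it.
  }
  return stats;
}

// Small usage example: one known statistic, one unknown to this consumer.
void Example() {
  std::vector<StatEntry> entries = {
      {"distinct_count", int64_t{42}},
      {"some_future_stat", 1.5},
  };
  MyStats stats = DecodeStatistics(entries);
  assert(stats.distinct_count.has_value() && *stats.distinct_count == 42);
  assert(!stats.null_count.has_value());
}
```

The string comparisons happen once per entry while decoding. After that the
consumer only touches typed struct fields, so the order of entries in the
array does not matter.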

> If we don't standardize ID -> key mapping, consumers need to
> compare key name to find "distinct_count" statistics.
>
>
> Anyway, this (string comparison) will not be a large
> overhead because (1) statistics data will not be large data
> and (2) consumers can cache ID -> key mapping to avoid
> duplicated string comparisons. So standardizing ID -> key
> mapping isn't required.
>
>
> Thanks,
> --
> kou
