kou commented on issue #38837: URL: https://github.com/apache/arrow/issues/38837#issuecomment-2110583274
> Since there's no available "side channel" here, string names probably make more sense

OK. Let's use strings for the statistic keys.

> > TODO: Should we embed VALUE_SCHEMA to the statistics schema something like the following?
>
> That would essentially be a sparse union?

Yes.

> Though I guess if you assume the caller knows the right type for a particular kind of statistic you can save a bit on the encoding, and presumably there aren't enough different statistics for the extra allocated space to matter (as compared to a dense union)

This idea is not about space efficiency. I thought it might be easier to use, because a union may be a bit complicated. But users need to know which column is used for each statistics key, as you mentioned (e.g. `distinct_count` uses `value_uint64` and `max_value` uses `value`). That may be harder to use for implementation-specific statistics. Let's use a union like ADBC does.

But we have a problem with the union approach:

> TODO: Is value good name?

Using `value.value` to refer to `VALUE_SCHEMA`'s value from the top-level record batch (`{key, value, is_approximate}`) may be a bit strange...

> Just to be clear, when we say
>
> ```
> "ARROW:statistics" => ArrowArray*,
> ```
>
> this means the address of the ArrowArray will be encoded (as a base 10 string?) in the metadata?

Yes. I should have mentioned it explicitly.
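
For illustration only, here is a minimal sketch of what "the address is encoded as a base 10 string in the metadata" could look like. It uses Python's `ctypes` and a hypothetical `FakeArrowArray` struct that is not the real `ArrowArray` layout; the helper names `encode_pointer` and `decode_pointer` are also assumptions made for this example, not part of any Arrow API.

```python
import ctypes


class FakeArrowArray(ctypes.Structure):
    """Hypothetical stand-in for illustration; NOT the real ArrowArray layout."""

    _fields_ = [("length", ctypes.c_int64), ("null_count", ctypes.c_int64)]


def encode_pointer(struct: ctypes.Structure) -> str:
    """Encode the struct's memory address as a base-10 string."""
    return str(ctypes.addressof(struct))


def decode_pointer(text: str) -> FakeArrowArray:
    """Parse the base-10 address and view the memory there as a struct."""
    return FakeArrowArray.from_address(int(text, 10))


# Producer side: the struct must stay alive for as long as any consumer
# may dereference the address stored in the metadata.
statistics = FakeArrowArray(length=3, null_count=0)
metadata = {"ARROW:statistics": encode_pointer(statistics)}

# Consumer side: recover a view of the same memory from the metadata string.
recovered = decode_pointer(metadata["ARROW:statistics"])
print(recovered.length, recovered.null_count)
```

Note that the metadata string carries only an address, so a scheme like this works only within a single process and only while the producer keeps the pointed-to struct alive.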