Re: [I] [Format] Passing column statistics through Arrow C data interface [arrow]

via GitHub Tue, 14 May 2024 08:52:29 -0700


kou commented on issue #38837:
URL: https://github.com/apache/arrow/issues/38837#issuecomment-2110583274


   > Since there's no available "side channel" here, string names probably make 
more sense
   
   OK. Let's use strings for statistic keys.
   
   > > TODO: Should we embed VALUE_SCHEMA to the statistics schema something 
like the following?
   > 
   > That would essentially be a sparse union?
   
   Yes.
   
   >  Though I guess if you assume the caller knows the right type for a 
particular kind of statistic you can save a bit on the encoding, and presumably 
there aren't enough different statistics for the extra allocated space to 
matter (as compared to a dense union)
   
   This idea is not about space efficiency. I thought it might be easier to 
use, because a union may be a bit complicated. But users need to know which 
column is used for each statistic key, as you mentioned (e.g. `distinct_count` 
uses `value_uint64` and `max_value` uses `value`). This may be harder to use 
for implementation-specific statistics.
   
   Let's use a union, like ADBC does.
   
   But the union approach has a problem:
   
   > TODO: Is value good name?
   
   Using `value.value` to refer to `VALUE_SCHEMA`'s value from the top-level 
record batch (`{key, value, is_approximate}`) may be a bit strange...
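
To make the naming concern concrete, here is a rough in-memory C analogue of a `{key, value, is_approximate}` record whose `value` is a tagged union. This is only a sketch, not the Arrow layout under discussion: the type names, the tag enum, the `as_uint64`/`as_float64` members, and `format_statistic` are all hypothetical, and the inner member is named `value` on purpose to show how `value.value` reads.

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical type tags; the real set of union members is not decided here. */
enum StatValueKind { STAT_VALUE_UINT64, STAT_VALUE_FLOAT64 };

/* Stand-in for VALUE_SCHEMA: a tag plus a union that holds the payload. */
struct StatValue {
    enum StatValueKind kind;
    union {
        uint64_t as_uint64;
        double as_float64;
    } value; /* deliberately named `value`, giving `stat.value.value` */
};

/* Stand-in for one row of the top-level {key, value, is_approximate} batch. */
struct Statistic {
    const char *key;
    struct StatValue value;
    bool is_approximate;
};

/* Formats one statistic by dispatching on the tag, so the reader does not
 * need to know in advance which member a given key uses.
 * Returns the snprintf result, or -1 for an unknown tag. */
static int format_statistic(const struct Statistic *stat, char *buf, size_t size) {
    const char *suffix = stat->is_approximate ? " (approximate)" : "";
    switch (stat->value.kind) {
    case STAT_VALUE_UINT64:
        return snprintf(buf, size, "%s=%" PRIu64 "%s", stat->key,
                        stat->value.value.as_uint64, suffix);
    case STAT_VALUE_FLOAT64:
        return snprintf(buf, size, "%s=%g%s", stat->key,
                        stat->value.value.as_float64, suffix);
    }
    return -1;
}
```

With the tag in place a consumer can handle implementation-specific keys without a per-key lookup table, but the access path `stat->value.value.as_uint64` shows why the inner name may deserve something other than `value`.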
   
   > Just to be clear, when we say
   > 
   > ```
   >     "ARROW:statistics" => ArrowArray*,
   > ```
   > 
   > this means the address of the ArrowArray will be encoded (as a base 10 
string?) in the metadata?
   
   Yes. I should have mentioned it explicitly.
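
A minimal sketch of what that encoding could look like in C, assuming the address is written as a base-10 string and read back in the same process. The helper names `encode_address` and `decode_address` are hypothetical, and a plain `void *` stands in for `struct ArrowArray *`.

```c
#include <errno.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: writes the address held in `ptr` as a base-10 string.
 * Returns 0 on success, -1 if the buffer is too small. */
static int encode_address(const void *ptr, char *buf, size_t size) {
    int n = snprintf(buf, size, "%" PRIuPTR, (uintptr_t)ptr);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Hypothetical helper: parses a base-10 string back into a pointer.
 * Returns NULL for malformed or out-of-range input. */
static void *decode_address(const char *text) {
    /* Reject NULL, empty strings, signs and leading whitespace up front. */
    if (text == NULL || text[0] < '0' || text[0] > '9') {
        return NULL;
    }
    char *end = NULL;
    errno = 0;
    uintmax_t value = strtoumax(text, &end, 10);
    if (errno != 0 || *end != '\0' || value > UINTPTR_MAX) {
        return NULL;
    }
    return (void *)(uintptr_t)value;
}
```

The decoded pointer is only meaningful inside the process that produced the string, and only while the pointed-to array is still alive, so metadata carrying such a value should not be serialized or sent elsewhere.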
   


