Re: [DISCUSS] Arrow array representation of statistics

Sutou Kouhei Sun, 22 Dec 2024 21:27:20 -0800

Hi,

There were no objections to how the TODOs should be handled,
so I've used the suggested approaches for them.
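
For reference, the first TODO quoted below relies on the fact that dense union values are located by type code, not by field name. Here is a minimal sketch of that lookup in plain Python. DenseUnionSketch and make_union are hypothetical names used only for illustration; this is a toy model of the layout, not the API of any Arrow implementation.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass
class DenseUnionSketch:
    """Toy model of a dense union (hypothetical, not an Arrow library class)."""

    fields: List[Tuple[str, List[Any]]]  # (field name, child values)
    type_codes: List[int]                # type code of each child, in child order
    type_ids: List[int]                  # per-slot type code
    offsets: List[int]                   # per-slot index into the selected child

    def value(self, i: int) -> Any:
        # Resolve the child by type code; the field name is never consulted.
        child_index = self.type_codes.index(self.type_ids[i])
        _name, values = self.fields[child_index]
        return values[self.offsets[i]]


def make_union(names: List[str]) -> DenseUnionSketch:
    # Same data every time; only the (assumed) field names differ.
    return DenseUnionSketch(
        fields=[(names[0], [100, 7]), (names[1], [2.5])],
        type_codes=[0, 1],
        type_ids=[0, 1, 0],
        offsets=[0, 0, 1],
    )


values_a = [make_union(["int64", "float64"]).value(i) for i in range(3)]
values_b = [make_union(["i", "f"]).value(i) for i in range(3)]
```

Both unions yield the same values even though their field names differ, which is why leaving the names unstandardized does not affect readers that go through type codes.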

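The second TODO quoted below mentions ceil(float_max_byte_width) as a way to get an int64 approximate max_byte_width. A minimal sketch of that conversion, in case a consumer wants an integer from the float64 value. The function name and the error handling are my assumptions for illustration, not part of the specification.

```python
import math

INT64_MAX = 2**63 - 1


def approximate_max_byte_width_as_int64(float_max_byte_width: float) -> int:
    """Round an approximate float64 max_byte_width up to an int64-range integer.

    Hypothetical helper: rounding up keeps the result a safe upper bound.
    """
    if not math.isfinite(float_max_byte_width) or float_max_byte_width < 0:
        raise ValueError("max_byte_width must be a finite, non-negative number")
    result = math.ceil(float_max_byte_width)
    if result > INT64_MAX:
        raise OverflowError("max_byte_width does not fit in int64")
    return result


rounded = approximate_max_byte_width_as_int64(12.3)
exact = approximate_max_byte_width_as_int64(8.0)
```

Rounding up rather than to nearest means the integer never understates the approximate width.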

I've added examples for arrays.

I'll start a vote for this again.

Thanks,
-- 
kou

In <20241218.170107.1224859230223769469....@clear-code.com>
  "Re: [DISCUSS] Arrow array representation of statistics" on Wed, 18 Dec 2024 17:01:07 +0900 (JST),
  Sutou Kouhei <k...@clear-code.com> wrote:

> Hi,
> 
> Thanks for sharing your opinions. There is no objection to
> removing the C data interface limitation.
> 
> I opened a new PR that focuses only on the statistics schema:
> * PR: https://github.com/apache/arrow/pull/45058
> * Preview:
>   http://crossbow.voltrondata.com/pr_docs/45058/format/StatisticsSchema.html
> 
> It still includes the original motivation (passing
> statistics through the C data/stream interface for query
> planning) but it isn't limited to the C data interface. It
> also includes "this is the canonical schema to represent
> statistics about an Arrow dataset as Arrow data" provided by
> David.
> 
> 
> There are some TODOs in the specification. If you have any
> opinions on them, please share them.
> 
> 1. Should we standardize the field names of the dense union
>    that is used for statistics values?
> 
>    See also:
>    * The related part in the PR:
>      https://github.com/apache/arrow/pull/45058/files#diff-516092add42944da6a0cd1a601014c5b89a41a61098adb57536e79c105211f4fR136-R150
>    * The discussion:
>      https://github.com/apache/arrow/pull/43553/files#r1871755575
> 
>    I think that we don't need to standardize them because we
>    can access dense union values via type codes, not
>    names. If there aren't any opinions, I will not
>    standardize these field names.
> 
> 2. Should we use int64 instead of float64 for approximate
>    max_byte_width?
> 
>    See also:
>    * The related part in the PR:
>      https://github.com/apache/arrow/pull/45058/files#diff-516092add42944da6a0cd1a601014c5b89a41a61098adb57536e79c105211f4fR199-R202
>    * The discussion:
>      https://github.com/apache/arrow/pull/43553/files#r1871759234
> 
>    We can use ceil(float_max_byte_width) for int64
>    approximate max_byte_width.
> 
>    If there aren't any opinions... I don't have a strong
>    opinion on this... Hmm. I will use float64.
> 
> 
> There are examples for record batches but not yet for
> arrays. I'll add examples for arrays later...
> 
> 
> Thanks,
> -- 
> kou
> 
> In <20241212.102712.700919417179383555....@clear-code.com>
>   "[DISCUSS] Arrow array representation of statistics" on Thu, 12 Dec 2024 10:27:12 +0900 (JST),
>   Sutou Kouhei <k...@clear-code.com> wrote:
> 
>> Hi,
>> 
>> I want to discuss the Arrow array representation of
>> statistics and the contexts in which it can be used.
>> 
>> Background:
>> 
>> We discussed how to pass statistics through the C data
>> interface:
>> 
>> * [DISCUSS] Statistics through the C data interface
>>   https://lists.apache.org/thread/z0jz2bnv61j7c6lbk7lympdrs49f69cx
>> * [VOTE] Statistics through the C data interface
>>   https://lists.apache.org/thread/rsw3wsyj68dksc98s5rpdp6dn8hfk0yd
>> * GH-38837: [Format] Add the specification to pass
>>   statistics through the Arrow C data interface
>>   https://github.com/apache/arrow/pull/43553
>> 
>> The latest proposal is that we standardize the schema for
>> an Arrow array that represents statistics. See the above PR
>> for details.
>> 
>> I think that the proposed approach is the best approach for
>> the C data interface. But I'm not sure whether the approach
>> is the best approach for other contexts such as IPC format,
>> Flight, ADBC and so on. So the latest proposal limits its
>> target to only the C data interface.
>> 
>> But there are comments asking whether we can standardize
>> this approach for all contexts, including the C data
>> interface. I want to discuss this in this thread.
>> 
>> Here are related comments so far:
>> 
>> * https://github.com/apache/arrow/pull/43553/files#r1871749972
>> * https://github.com/apache/arrow/pull/43553/files#r1704373291
>> * https://github.com/apache/arrow/pull/43553/files#r1871757604
>> 
>> 
>> Could you share your opinions?
>> 
>> 
>> If we can remove the C-data-interface-only limitation, I'll
>> open a new PR for it.
>> 
>> 
>> Thanks,
>> -- 
>> kou
