Hi,

There are no objections to how the TODOs should be handled. I've used the suggested approaches for them.
I've added examples for arrays. I'll start a vote for this again.

Thanks,
--
kou

In <20241218.170107.1224859230223769469....@clear-code.com>
  "Re: [DISCUSS] Arrow array representation of statistics" on Wed, 18 Dec 2024 17:01:07 +0900 (JST),
  Sutou Kouhei <k...@clear-code.com> wrote:

> Hi,
>
> Thanks for sharing your opinion. There is no objection to
> remove the C data interface limitation.
>
> I opened a new PR that focus on only statistics schema:
>
> * PR: https://github.com/apache/arrow/pull/45058
> * Preview:
>   http://crossbow.voltrondata.com/pr_docs/45058/format/StatisticsSchema.html
>
> It still includes the original motivation (passing
> statistics through the C data/stream interface for query
> plan) but it doesn't limit to the C data interface. It also
> includes "this is the canonical schema to represent
> statistics about an Arrow dataset as Arrow data" provided by
> David.
>
>
> There are some TODOs in the specification. If you have any
> opinions for them, please share them.
>
> 1. Should we standardize field names of dense union that is
>    used for statistics values?
>
>    See also:
>    * The related part in the PR:
>      https://github.com/apache/arrow/pull/45058/files#diff-516092add42944da6a0cd1a601014c5b89a41a61098adb57536e79c105211f4fR136-R150
>    * The discussion:
>      https://github.com/apache/arrow/pull/43553/files#r1871755575
>
>    I think that we don't need to standardize them because we
>    can access dense union values via type codes not
>    names. If there aren't any opinions, I will not
>    standardize these field names.
>
> 2. Should we use int64 not float64 for approximate
>    max_byte_width?
>
>    See also:
>    * The related part in the PR:
>      https://github.com/apache/arrow/pull/45058/files#diff-516092add42944da6a0cd1a601014c5b89a41a61098adb57536e79c105211f4fR199-R202
>    * The discussion:
>      https://github.com/apache/arrow/pull/43553/files#r1871759234
>
>    We can use ceil(float_max_byte_width) for int64
>    approximate max_byte_width.
>
>    If there aren't any opinions... I don't have strong
>    opinion for this... Hmm. I will use float64.
>
>
> There are examples for record batches but not for arrays for
> now. I'll add examples for arrays later...
>
>
> Thanks,
> --
> kou
>
> In <20241212.102712.700919417179383555....@clear-code.com>
>   "[DISCUSS] Arrow array representation of statistics" on Thu, 12 Dec 2024
>   10:27:12 +0900 (JST),
>   Sutou Kouhei <k...@clear-code.com> wrote:
>
>> Hi,
>>
>> I want to discuss Arrow array representation of statistics
>> and usable contexts of it.
>>
>> Background:
>>
>> We discussed how to pass statistics through the C data
>> interface:
>>
>> * [DISCUSS] Statistics through the C data interface
>>   https://lists.apache.org/thread/z0jz2bnv61j7c6lbk7lympdrs49f69cx
>> * [VOTE] Statistics through the C data interface
>>   https://lists.apache.org/thread/rsw3wsyj68dksc98s5rpdp6dn8hfk0yd
>> * GH-38837: [Format] Add the specification to pass
>>   statistics through the Arrow C data interface
>>   https://github.com/apache/arrow/pull/43553
>>
>> The latest proposal is that we standardize schema for Arrow
>> array that represents statistics. See the above PR for
>> details.
>>
>> I think that the proposed approach is the best approach for
>> the C data interface. But I'm not sure whether the approach
>> is the best approach for other contexts such as IPC format,
>> Flight, ADBC and so on. So the latest proposal limits its
>> target to only the C data interface.
>>
>> But there are comments that can we standardize this approach
>> for all contexts including the C data interface?
>> I want to discuss this in this thread.
>>
>> Here are related comments so far:
>>
>> * https://github.com/apache/arrow/pull/43553/files#r1871749972
>> * https://github.com/apache/arrow/pull/43553/files#r1704373291
>> * https://github.com/apache/arrow/pull/43553/files#r1871757604
>>
>>
>> Could you share your opinions?
>>
>>
>> If we can remove the C data interface only limitation, I'll
>> open a new PR for it.
>>
>>
>> Thanks,
>> --
>> kou
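
For readers following the two TODO items quoted above, here is a minimal sketch in Python of what they amount to. It is only an illustration under assumptions: the dense union is modeled with plain lists (type codes, per-slot type ids, per-slot offsets, child arrays) instead of a real Arrow library, and the helper names dense_union_value and to_int64_max_byte_width are hypothetical. The first helper reads a value using type codes only, never field names; the second applies the ceil() rounding mentioned for an int64 max_byte_width.

```python
import math

# Hypothetical, simplified model of a dense union array (not a real Arrow API).
# type_codes: the type code declared for each child, in child order.
# type_ids:   one type code per slot, selecting which child holds the value.
# offsets:    one index per slot, pointing into the selected child.
# children:   the child value arrays, in the same order as type_codes.


def dense_union_value(type_codes, type_ids, offsets, children, i):
    """Return slot i of the modeled dense union; field names are never consulted."""
    child_index = type_codes.index(type_ids[i])
    return children[child_index][offsets[i]]


def to_int64_max_byte_width(float_max_byte_width):
    """Round an approximate float width up to an integer, as in ceil(float_max_byte_width)."""
    return math.ceil(float_max_byte_width)


# Example: a union whose children are int64 values (type code 0)
# and float64 values (type code 1). The data is made up for illustration.
type_codes = [0, 1]
children = [[29, 5], [2.5]]
type_ids = [0, 1, 0]
offsets = [0, 0, 1]

values = [
    dense_union_value(type_codes, type_ids, offsets, children, i)
    for i in range(len(type_ids))
]
```

Because lookup goes through type_ids and offsets alone, the child field names could be anything without changing the result, which is the reason given above for not standardizing them.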