Agreed, it doesn't seem like a good idea to require users of the C data interface to also depend on the IPC format. JSON sounds more reasonable in that case.
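For concreteness, here is a minimal sketch of what the JSON route could look like from Python (pyarrow is just one possible producer; the key names are only the ones floated earlier in this thread and are still up for discussion):

    import json
    import pyarrow as pa

    # Producer: attach table-level statistics as a JSON string under the
    # reserved "ARROW:statistics" schema metadata key.
    statistics = [
        {"key": "row_count", "value": 29, "value_type": "uint64",
         "is_approximate": False},
    ]
    schema = pa.schema(
        [pa.field("column1", pa.int32()), pa.field("column2", pa.string())],
        metadata={"ARROW:statistics": json.dumps(statistics)},
    )

    # Consumer: schema metadata round-trips as bytes, so decode before use.
    decoded = json.loads(schema.metadata[b"ARROW:statistics"])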
Shoumyo

From: dev@arrow.apache.org At: 05/29/24 02:02:23 UTC-4:00 To: dev@arrow.apache.org
Subject: Re: [DISCUSS] Statistics through the C data interface

> Hi,
>
>>> - If you need statistics in the schema then simply encode the 1-row batch
>>>   into an IPC buffer (using the streaming format) or maybe just an IPC
>>>   RecordBatch message since the schema is fixed and store those bytes in the
>>>   schema
>>
>> This would avoid having to define a separate "schema" for
>> the JSON metadata
>
> Right. What I'm worried about with this approach is that
> it may not match the C data interface.
>
> In the C data interface, we don't use the IPC format. If we
> want to transmit statistics with the schema through the C data
> interface, we need to mix the IPC format and the C data
> interface. (This is why I used the address in my first
> proposal.)
>
> Note that we can use a separate API to transmit statistics
> instead of embedding statistics into the schema for this case.
>
> I think JSON is easier to use for both the IPC format and
> the C data interface. Statistics data will not be large, so
> this will not affect performance.
>
>> If we do go down the JSON route, how about something like
>> this to avoid defining the keys for all possible statistics up
>> front:
>>
>>   Schema {
>>     custom_metadata: {
>>       "ARROW:statistics" => "[ { \"key\": \"row_count\", \"value\": 29, \"value_type\": \"uint64\", \"is_approximate\": false } ]"
>>     }
>>   }
>>
>> It's more verbose, but more closely mirrors the Arrow array
>> schema defined for statistics getter APIs. This could make it
>> easier to translate between the two.
>
> Thanks. I didn't think of it.
> It makes sense.
>
> Thanks,
> --
> kou
>
> In <665673b500015f5808ce0...@message.bloomberg.net>
>   "Re: [DISCUSS] Statistics through the C data interface" on Wed, 29 May 2024 00:15:49 -0000,
>   "Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)" <schakravo...@bloomberg.net> wrote:
>
>> Thanks for addressing the feedback! I didn't know that an
>> Arrow IPC `Message` (not just Schema) could also contain
>> `custom_metadata` -- thanks for pointing it out.
>>
>>> Based on the list, how about standardizing both of the
>>> following for statistics?
>>>
>>> 1. An Apache Arrow schema for statistics that is used by a
>>>    separate statistics getter API
>>> 2. An "ARROW:statistics" metadata format that can be used in
>>>    Apache Arrow schema metadata
>>>
>>> Users can use 1. and/or 2. based on their use cases.
>>
>> This sounds good to me. Using JSON to represent the metadata
>> for #2 also sounds reasonable. I think elsewhere on this
>> thread, Weston mentioned that we could alternatively use
>> the schema defined for #1 and directly use that to encode
>> the schema metadata as an Arrow IPC RecordBatch:
>>
>>> This has been something that has always been desired for the Arrow IPC
>>> format too.
>>>
>>> My preference would be (apologies if this has been mentioned before):
>>>
>>> - Agree on how statistics should be encoded into an array (this is not
>>>   hard, we just have to agree on the field order and the data type for
>>>   null_count)
>>> - If you need statistics in the schema then simply encode the 1-row batch
>>>   into an IPC buffer (using the streaming format) or maybe just an IPC
>>>   RecordBatch message since the schema is fixed and store those bytes in the
>>>   schema
>>
>> This would avoid having to define a separate "schema" for
>> the JSON metadata, but might be more effort to work with in
>> certain contexts (e.g. a library that currently only needs the
>> C data interface would now also have to learn how to parse
>> Arrow IPC).
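>> For illustration, a rough pyarrow sketch of that alternative; the names
>> `statistics_batch` (a one-row batch in whatever layout we agree on) and
>> `data_schema` are placeholders:
>>
>>     import pyarrow as pa
>>     import pyarrow.ipc as ipc
>>
>>     # Encode the one-row statistics batch with the IPC streaming format...
>>     sink = pa.BufferOutputStream()
>>     with ipc.new_stream(sink, statistics_batch.schema) as writer:
>>         writer.write_batch(statistics_batch)
>>     ipc_bytes = sink.getvalue().to_pybytes()
>>
>>     # ...and store the raw bytes in the data schema's metadata.
>>     data_schema = data_schema.with_metadata({"ARROW:statistics": ipc_bytes})
>>
>>     # A consumer then has to parse IPC to get the statistics back:
>>     statistics_batch = ipc.open_stream(
>>         pa.BufferReader(data_schema.metadata[b"ARROW:statistics"])
>>     ).read_next_batch()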
>>
>> If we do go down the JSON route, how about something like
>> this to avoid defining the keys for all possible statistics up
>> front:
>>
>>   Schema {
>>     custom_metadata: {
>>       "ARROW:statistics" => "[ { \"key\": \"row_count\", \"value\": 29, \"value_type\": \"uint64\", \"is_approximate\": false } ]"
>>     }
>>   }
>>
>> It's more verbose, but more closely mirrors the Arrow array
>> schema defined for statistics getter APIs. This could make it
>> easier to translate between the two.
>>
>> Thanks,
>> Shoumyo
>>
>> From: dev@arrow.apache.org At: 05/26/24 21:48:52 UTC-4:00 To: dev@arrow.apache.org
>> Subject: Re: [DISCUSS] Statistics through the C data interface
>>
>>> Hi,
>>>
>>>> To start, data might be sourced in various manners:
>>>>
>>>> - Arrow IPC files may be mapped from shared memory
>>>> - Arrow IPC streams may be received via some RPC framework (à la Flight)
>>>> - The Arrow libraries may be used to read from file formats like Parquet or CSV
>>>> - ADBC drivers may be used to read from databases
>>>
>>> Thanks for listing them.
>>>
>>> Regarding the first case:
>>>
>>> Using schema metadata may be a reasonable approach because
>>> the Arrow data will be on the page cache. There is no
>>> significant read cost. We don't need to read statistics
>>> before the Arrow data is ready.
>>>
>>> But if the Arrow data will not be produced based on
>>> statistics of the Arrow data, a separate statistics getter API
>>> may be better.
>>>
>>> Regarding the second case:
>>>
>>> Schema metadata is one approach for it, but we can choose
>>> other approaches for this case. For example, Flight has
>>> FlightData::app_metadata [1] and an Arrow IPC message has
>>> custom_metadata [2], as Dewey mentioned.
>>>
>>> [1] https://github.com/apache/arrow/blob/1c9e393b73195840960dfb9eca8c0dc390be751a/format/Flight.proto#L512-L515
>>> [2] https://github.com/apache/arrow/blob/1c9e393b73195840960dfb9eca8c0dc390be751a/format/Message.fbs#L154
>>>
>>> Regarding the third case:
>>>
>>> Reader objects will provide statistics. For example,
>>> parquet::ColumnChunkMetaData::statistics()
>>> (parquet::ParquetFileReader::metadata()->RowGroup(X)->ColumnChunk(Y)->statistics())
>>> will provide statistics.
>>>
>>> Regarding the fourth case:
>>>
>>> We can use the ADBC API.
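>>>
>>> (As an illustration of the third case, the same information is already
>>> reachable from Python; roughly, with an arbitrary "data.parquet" file:)
>>>
>>>     import pyarrow.parquet as pq
>>>
>>>     # Row group / column chunk statistics come from the Parquet metadata,
>>>     # so a reader can surface them without scanning the data.
>>>     metadata = pq.ParquetFile("data.parquet").metadata
>>>     stats = metadata.row_group(0).column(0).statistics
>>>     if stats is not None and stats.has_min_max:
>>>         print(stats.min, stats.max, stats.null_count)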
>>>
>>> Based on the list, how about standardizing both of the
>>> following for statistics?
>>>
>>> 1. An Apache Arrow schema for statistics that is used by a
>>>    separate statistics getter API
>>> 2. An "ARROW:statistics" metadata format that can be used in
>>>    Apache Arrow schema metadata
>>>
>>> Users can use 1. and/or 2. based on their use cases.
>>>
>>> Regarding 2.: How about the following?
>>>
>>> This uses Field::custom_metadata [3] and
>>> Schema::custom_metadata [4].
>>>
>>> [3] https://github.com/apache/arrow/blob/main/format/Schema.fbs#L528-L529
>>> [4] https://github.com/apache/arrow/blob/1c9e393b73195840960dfb9eca8c0dc390be751a/format/Schema.fbs#L563-L564
>>>
>>> "ARROW:statistics" in Field::custom_metadata represents
>>> column-level statistics. It uses JSON like we did for
>>> "ARROW:extension:metadata" [5]. Here is an example:
>>>
>>>   Field {
>>>     custom_metadata: {
>>>       "ARROW:statistics" => "{\"max\": 1, \"distinct_count\": 29}"
>>>     }
>>>   }
>>>
>>> (JSON may not be able to represent complex information, but
>>> is that needed for statistics?)
>>>
>>> "ARROW:statistics" in Schema::custom_metadata represents
>>> table-level statistics. It uses JSON like we did for
>>> "ARROW:extension:metadata" [5]. Here is an example:
>>>
>>>   Schema {
>>>     custom_metadata: {
>>>       "ARROW:statistics" => "{\"row_count\": 29}"
>>>     }
>>>   }
>>>
>>> TODO: Define the JSON content details. For example, we need
>>> to define keys such as "distinct_count" and "row_count".
>>>
>>> [5] https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types
>>>
>>> Thanks,
>>> --
>>> kou
>>>
>>> In <664f529b0002a8710c430...@message.bloomberg.net>
>>>   "Re: [DISCUSS] Statistics through the C data interface" on Thu, 23 May 2024 14:28:43 -0000,
>>>   "Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)" <schakravo...@bloomberg.net> wrote:
>>>
>>>> This is a really exciting development, thank you for putting together this proposal!
>>>>
>>>> It looks like this thread and the linked GitHub issue have lots of input from
>>>> folks who work with Arrow at a low level and have better familiarity with the
>>>> Arrow specifications than I do, so I'll refrain from commenting on the
>>>> technicalities of the proposal. I would, however, like to share my perspective
>>>> as an application developer that heavily uses Arrow at higher levels for
>>>> composing data systems.
>>>>
>>>> My main concern with the direction of this proposal is that it seems too
>>>> narrowly focused on what the integration with DuckDB will look like (how the
>>>> statistics can be fed into DuckDB). In many applications, executing the query
>>>> is often the "last mile", and it's important to consider where the statistics
>>>> will actually come from. To start, data might be sourced in various manners:
>>>>
>>>> - Arrow IPC files may be mapped from shared memory
>>>> - Arrow IPC streams may be received via some RPC framework (à la Flight)
>>>> - The Arrow libraries may be used to read from file formats like Parquet or CSV
>>>> - ADBC drivers may be used to read from databases
>>>>
>>>> Note that in at least the first two cases, the system _executing the query_
>>>> will not be able to provide statistics simply because it is not actually the
>>>> data producer. As an example, if Process A writes an Arrow IPC file to shared
>>>> memory, and Process B wants to run a query on it -- how is Process B supposed
>>>> to get the statistics for query planning? There are a few approaches that I
>>>> anticipate application developers might consider:
>>>>
>>>> 1. Design an out-of-band mechanism for Process B to fetch statistics from Process A.
>>>> 2. Design an encoding that is a superset of Arrow IPC and includes statistics
>>>>    information, allowing statistics to be communicated in-band.
>>>> 3. Use custom schema metadata to communicate statistics in-band.
>>>>
>>>> Options 1 and 2 require considerably more effort than Option 3. Also, Option 3
>>>> feels somewhat natural because it makes sense for the statistics to come with
>>>> the data (similar to how statistics are embedded in Parquet files). In some
>>>> sense, the statistics actually *are* a property of the stream.
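>>>>
>>>> To make Option 3 concrete, a rough pyarrow sketch of the shared-memory
>>>> case; the file path and the metadata payload are purely illustrative:
>>>>
>>>>     import json
>>>>     import pyarrow as pa
>>>>     import pyarrow.ipc as ipc
>>>>
>>>>     # Process A: write an IPC file whose schema carries statistics in-band.
>>>>     schema = pa.schema(
>>>>         [pa.field("column1", pa.int32())],
>>>>         metadata={"ARROW:statistics": json.dumps({"row_count": 3})},
>>>>     )
>>>>     batch = pa.record_batch([pa.array([1, 2, 3], pa.int32())], schema=schema)
>>>>     with ipc.new_file("/dev/shm/data.arrow", schema) as writer:
>>>>         writer.write_batch(batch)
>>>>
>>>>     # Process B: map the file and read the statistics before planning the scan.
>>>>     with pa.memory_map("/dev/shm/data.arrow") as source:
>>>>         reader = ipc.open_file(source)
>>>>         stats = json.loads(reader.schema.metadata[b"ARROW:statistics"])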
>>>>
>>>> In systems that I work on, we already use schema metadata to communicate
>>>> information that is unrelated to the structure of the data. From my reading of
>>>> the documentation [1], this sounds like a reasonable (and perhaps intended?)
>>>> use of metadata, and nowhere is it mentioned that metadata must be used to
>>>> determine schema equivalence. Unless there are other ways of producing
>>>> stream-level application metadata outside of the schema/field metadata, the
>>>> lack of purity was not a concern for me to begin with.
>>>>
>>>> I would appreciate an approach that communicates statistics via schema
>>>> metadata, or at least in some in-band fashion that is consistent across the IPC
>>>> and C data specifications. This would make it much easier to uniformly and
>>>> transparently plumb statistics through applications, regardless of where they
>>>> source Arrow data from. As developers are likely to create bespoke conventions
>>>> for this anyways, it seems reasonable to standardize it as canonical metadata.
>>>>
>>>> I say this all as a happy user of DuckDB's Arrow scan functionality that is
>>>> excited to see better query optimization capabilities. It's just that, in its
>>>> current form, the changes in this proposal are not something I could
>>>> foreseeably integrate with.
>>>>
>>>> Best,
>>>> Shoumyo
>>>>
>>>> [1]: https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
>>>>
>>>> From: dev@arrow.apache.org At: 05/23/24 10:10:51 UTC-4:00 To: dev@arrow.apache.org
>>>> Subject: Re: [DISCUSS] Statistics through the C data interface
>>>>
>>>> I want to +1 on what Dewey is saying here and add some comments.
>>>>
>>>> Sutou Kouhei wrote:
>>>>> ADBC may be a bit larger to use only for transmitting statistics. ADBC has
>>>>> statistics related APIs but it has more other APIs.
>>>>
>>>> It's impossible to keep the responsibilities of communication protocols
>>>> cleanly separated, but IMO, we should strive to keep the C Data
>>>> Interface more of a Transport Protocol than an Application Protocol.
>>>>
>>>> Statistics are application dependent and can complicate the
>>>> implementation of importers/exporters, which would hinder the adoption
>>>> of the C Data Interface. Statistics also bring in security concerns
>>>> that are application-specific, e.g. can an algorithm trust min/max
>>>> stats and risk producing incorrect results if the statistics are
>>>> incorrect? That is a question that can't really be answered at the C Data
>>>> Interface level.
>>>>
>>>> The need for more sophisticated statistics only grows with time, so
>>>> there is no such thing as a "simple statistics schema".
>>>>
>>>> Protocols that produce/consume statistics might want to use the C Data
>>>> Interface as a primitive for passing Arrow arrays of statistics.
>>>>
>>>> ADBC might be too big of a leap in complexity now, but "we just need the C
>>>> Data Interface + statistics" is unlikely to remain true for very long
>>>> as projects grow in complexity.
>>>>
>>>> --
>>>> Felipe
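>>>>
>>>> (Passing statistics as "just another Arrow array" over the C Data
>>>> Interface is already straightforward from Python; for example, with the
>>>> PyCapsule protocol in recent pyarrow. A sketch, not a prescription, and
>>>> the statistics layout shown is made up:)
>>>>
>>>>     import pyarrow as pa
>>>>
>>>>     # A producer-defined statistics array, exported like any other array.
>>>>     statistics = pa.array([{"key": "row_count", "value": 29}])
>>>>     schema_capsule, array_capsule = statistics.__arrow_c_array__()
>>>>     # A consumer imports the capsules with its own Arrow implementation.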
>>>>
>>>> On Thu, May 23, 2024 at 9:57 AM Dewey Dunnington
>>>> <de...@voltrondata.com.invalid> wrote:
>>>>>
>>>>> Thank you for the background! I understand that these statistics are
>>>>> important for query planning; however, I am not sure that I follow why
>>>>> we are constrained to the ArrowSchema to represent them. The examples
>>>>> given seem to be going through Python... would it be easier to request
>>>>> statistics at a higher level of abstraction? There would already need
>>>>> to be a separate mechanism to request an ArrowArrayStream with
>>>>> statistics (unless the PyCapsule `requested_schema` argument would
>>>>> suffice).
>>>>>
>>>>>> ADBC may be a bit larger to use only for transmitting
>>>>>> statistics. ADBC has statistics related APIs but it has more
>>>>>> other APIs.
>>>>>
>>>>> Some examples of producers given in the linked threads (Delta Lake,
>>>>> Arrow Dataset) are well-suited to being wrapped by an ADBC driver. One
>>>>> can implement an ADBC driver without defining all the methods (where
>>>>> the producer could call AdbcConnectionGetStatistics(), although
>>>>> AdbcStatementGetStatistics() might be more relevant here and doesn't
>>>>> exist). One example listed (using an Arrow Table as a source) seems a
>>>>> bit light to wrap in an ADBC driver; however, it would not take much
>>>>> code to do so, and the overhead of getting the reader via ADBC is
>>>>> something like 100 microseconds (tested via the ADBC R package's
>>>>> "monkey driver", which wraps an existing stream as a statement). In any
>>>>> case, the bulk of the code is building the statistics array.
>>>>>
>>>>>> How about the following schema for the
>>>>>> statistics ArrowArray? It's based on ADBC.
>>>>>
>>>>> Whatever format for statistics is decided on, I imagine it should be
>>>>> exactly the same as the ADBC standard? (Perhaps pushing changes
>>>>> upstream if needed?)
>>>>>
>>>>> On Thu, May 23, 2024 at 3:21 AM Sutou Kouhei <k...@clear-code.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>> Why not simply pass the statistics ArrowArray separately in your
>>>>>>> producer API of choice
>>>>>>
>>>>>> It seems that we should use that approach because all
>>>>>> feedback said so. How about the following schema for the
>>>>>> statistics ArrowArray? It's based on ADBC.
>>>>>>
>>>>>> | Field Name               | Field Type            | Comments |
>>>>>> |--------------------------|-----------------------|----------|
>>>>>> | column_name              | utf8                  | (1)      |
>>>>>> | statistic_key            | utf8 not null         | (2)      |
>>>>>> | statistic_value          | VALUE_SCHEMA not null |          |
>>>>>> | statistic_is_approximate | bool not null         | (3)      |
>>>>>>
>>>>>> 1. If null, then the statistic applies to the entire table.
>>>>>>    It's for "row_count".
>>>>>> 2. We'll provide pre-defined keys such as "max", "min",
>>>>>>    "byte_width" and "distinct_count", but users can also use
>>>>>>    application-specific keys.
>>>>>> 3. If true, then the value is approximate or best-effort.
>>>>>>
>>>>>> VALUE_SCHEMA is a dense union with members:
>>>>>>
>>>>>> | Field Name | Field Type |
>>>>>> |------------|------------|
>>>>>> | int64      | int64      |
>>>>>> | uint64     | uint64     |
>>>>>> | float64    | float64    |
>>>>>> | binary     | binary     |
>>>>>>
>>>>>> If a column is an int32 column, it uses int64 for
>>>>>> "max"/"min". We don't provide all types here. Users should
>>>>>> use a compatible type (int64 for an int32 column) instead.
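>>>>>>
>>>>>> (In pyarrow terms, the proposed statistics schema would be roughly the
>>>>>> following; just a sketch to make the field types concrete:)
>>>>>>
>>>>>>     import pyarrow as pa
>>>>>>
>>>>>>     # Dense union holding each statistic value in one of a few wide types.
>>>>>>     value_type = pa.dense_union([
>>>>>>         pa.field("int64", pa.int64()),
>>>>>>         pa.field("uint64", pa.uint64()),
>>>>>>         pa.field("float64", pa.float64()),
>>>>>>         pa.field("binary", pa.binary()),
>>>>>>     ])
>>>>>>
>>>>>>     statistics_schema = pa.schema([
>>>>>>         pa.field("column_name", pa.utf8()),  # null means table-level
>>>>>>         pa.field("statistic_key", pa.utf8(), nullable=False),
>>>>>>         pa.field("statistic_value", value_type, nullable=False),
>>>>>>         pa.field("statistic_is_approximate", pa.bool_(), nullable=False),
>>>>>>     ])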
>>>>>>
>>>>>> Thanks,
>>>>>> --
>>>>>> kou
>>>>>>
>>>>>> In <a3ce5e96-176c-4226-9d74-6a458317a...@python.org>
>>>>>>   "Re: [DISCUSS] Statistics through the C data interface" on Wed, 22 May 2024 17:04:57 +0200,
>>>>>>   Antoine Pitrou <anto...@python.org> wrote:
>>>>>>
>>>>>>> Hi Kou,
>>>>>>>
>>>>>>> I agree with Dewey that this is overstretching the capabilities of the
>>>>>>> C Data Interface. In particular, stuffing a pointer as a metadata value
>>>>>>> and decreeing it immortal doesn't sound like a good design decision.
>>>>>>>
>>>>>>> Why not simply pass the statistics ArrowArray separately in your
>>>>>>> producer API of choice (Dewey mentioned ADBC but it is of course just
>>>>>>> a possible API among others)?
>>>>>>>
>>>>>>> Regards
>>>>>>>
>>>>>>> Antoine.
>>>>>>>
>>>>>>>
>>>>>>> On 22/05/2024 04:37, Sutou Kouhei wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We're discussing how to provide statistics through the C
>>>>>>>> data interface at:
>>>>>>>>
>>>>>>>>   https://github.com/apache/arrow/issues/38837
>>>>>>>>
>>>>>>>> If you're interested in this feature, could you share your
>>>>>>>> comments?
>>>>>>>>
>>>>>>>> Motivation:
>>>>>>>>
>>>>>>>> We can interchange Apache Arrow data through the C data interface
>>>>>>>> within the same process. For example, we can pass Apache Arrow
>>>>>>>> data read by Apache Arrow C++ (producer) to DuckDB
>>>>>>>> (consumer) through the C data interface.
>>>>>>>>
>>>>>>>> A producer may know Apache Arrow data statistics. For
>>>>>>>> example, a producer can know statistics when it reads Apache
>>>>>>>> Parquet data because Apache Parquet may provide statistics.
>>>>>>>>
>>>>>>>> But a consumer can't know statistics that are known by the
>>>>>>>> producer, because there isn't a standard way to provide
>>>>>>>> statistics through the C data interface. If a consumer could
>>>>>>>> know the statistics, it could process Apache Arrow data faster
>>>>>>>> based on them.
>>>>>>>>
>>>>>>>> Proposal:
>>>>>>>>
>>>>>>>>   https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
>>>>>>>>
>>>>>>>> How about providing statistics as metadata in ArrowSchema?
>>>>>>>>
>>>>>>>> We reserve the "ARROW" namespace for internal Apache Arrow use:
>>>>>>>>
>>>>>>>>   https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
>>>>>>>>
>>>>>>>>> The ARROW pattern is a reserved namespace for internal
>>>>>>>>> Arrow use in the custom_metadata fields. For example,
>>>>>>>>> ARROW:extension:name.
>>>>>>>>
>>>>>>>> So we can use "ARROW:statistics" for the metadata key.
>>>>>>>>
>>>>>>>> We can represent statistics as an ArrowArray like ADBC does.
>>>>>>>>
>>>>>>>> Here is an example ArrowSchema that is for a record batch
>>>>>>>> that has "int32 column1" and "string column2":
>>>>>>>>
>>>>>>>>   ArrowSchema {
>>>>>>>>     .format = "+siu",
>>>>>>>>     .metadata = {
>>>>>>>>       "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row count */
>>>>>>>>     },
>>>>>>>>     .children = {
>>>>>>>>       ArrowSchema {
>>>>>>>>         .name = "column1",
>>>>>>>>         .format = "i",
>>>>>>>>         .metadata = {
>>>>>>>>           "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
>>>>>>>>         },
>>>>>>>>       },
>>>>>>>>       ArrowSchema {
>>>>>>>>         .name = "column2",
>>>>>>>>         .format = "u",
>>>>>>>>         .metadata = {
>>>>>>>>           "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
>>>>>>>>         },
>>>>>>>>       },
>>>>>>>>     },
>>>>>>>>   }
>>>>>>>>
>>>>>>>> The metadata value (the ArrowArray* part) of '"ARROW:statistics"
>>>>>>>> => ArrowArray*' is a base-10 string of the address of the
>>>>>>>> ArrowArray, because we can only use strings for metadata
>>>>>>>> values. You can't release the statistics ArrowArray*. (Its
>>>>>>>> release is a no-op function.) It follows the
>>>>>>>> https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
>>>>>>>> semantics. (The base ArrowSchema owns the statistics
>>>>>>>> ArrowArray*.)
>>>>>>>>
>>>>>>>> The ArrowArray* for statistics uses the following schema:
>>>>>>>>
>>>>>>>> | Field Name     | Field Type              | Comments |
>>>>>>>> |----------------|-------------------------|----------|
>>>>>>>> | key            | string not null         | (1)      |
>>>>>>>> | value          | `VALUE_SCHEMA` not null |          |
>>>>>>>> | is_approximate | bool not null           | (2)      |
>>>>>>>>
>>>>>>>> 1. We'll provide pre-defined keys such as "max", "min",
>>>>>>>>    "byte_width" and "distinct_count", but users can also use
>>>>>>>>    application-specific keys.
>>>>>>>> 2. If true, then the value is approximate or best-effort.
>>>>>>>>
>>>>>>>> VALUE_SCHEMA is a dense union with members:
>>>>>>>>
>>>>>>>> | Field Name | Field Type                        | Comments |
>>>>>>>> |------------|-----------------------------------|----------|
>>>>>>>> | int64      | int64                             |          |
>>>>>>>> | uint64     | uint64                            |          |
>>>>>>>> | float64    | float64                           |          |
>>>>>>>> | value      | The same type as the ArrowSchema  | (3)      |
>>>>>>>> |            | that it belongs to.               |          |
>>>>>>>>
>>>>>>>> 3. If the ArrowSchema's type is string, this type is also string.
>>>>>>>>
>>>>>>>> TODO: Is "value" a good name? If we refer to it from the
>>>>>>>> top-level statistics schema, we need to use
>>>>>>>> "value.value". It's a bit strange...
>>>>>>>>
>>>>>>>> What do you think about this proposal? Could you share your
>>>>>>>> comments?
>>>>>>>>
>>>>>>>> Thanks,