> I think what we are slowly converging on is the need for a spec to
> describe the encoding of Arrow array statistics as Arrow arrays.
This has been something that has always been desired for the Arrow IPC
format too.

My preference would be (apologies if this has been mentioned before):

- Agree on how statistics should be encoded into an array (this is not
  hard; we just have to agree on the field order and the data type for
  null_count)
- If you need statistics in the schema, then simply encode the 1-row
  batch into an IPC buffer (using the streaming format), or maybe just
  an IPC RecordBatch message since the schema is fixed, and store those
  bytes in the schema

On Fri, May 24, 2024 at 1:20 AM Sutou Kouhei <k...@clear-code.com> wrote:

> Hi,
>
> Could you explain more about your idea? Does it propose that
> we add more callbacks to ArrowArrayStream such as
> ArrowArrayStream::get_statistics()? Or does it propose that
> we define one more Arrow C XXX interface that wraps
> ArrowArrayStream like ArrowDeviceArray wraps ArrowArray?
>
> ArrowDeviceArray:
> https://arrow.apache.org/docs/format/CDeviceDataInterface.html
>
>
> Thanks,
> --
> kou
>
> In <cao-cae6rnfohw+gr4hw9hu5lhwcq6+wbsvc0j_xuayxqwgk...@mail.gmail.com>
>   "Re: [DISCUSS] Statistics through the C data interface" on Thu, 23 May 2024 06:55:40 -0700,
>   Curt Hagenlocher <c...@hagenlocher.org> wrote:
>
> >> would it be easier to request statistics at a higher level of
> >> abstraction?
> >
> > What if there were a "single table provider" level of abstraction between
> > ADBC and ArrowArrayStream as a C API; something that can report statistics
> > and apply simple predicates?
> >
> > On Thu, May 23, 2024 at 5:57 AM Dewey Dunnington
> > <de...@voltrondata.com.invalid> wrote:
> >
> >> Thank you for the background! I understand that these statistics are
> >> important for query planning; however, I am not sure that I follow why
> >> we are constrained to the ArrowSchema to represent them. The examples
> >> given seem to be going through Python... would it be easier to request
> >> statistics at a higher level of abstraction? There would already need
> >> to be a separate mechanism to request an ArrowArrayStream with
> >> statistics (unless the PyCapsule `requested_schema` argument would
> >> suffice).
> >>
> >> > ADBC may be a bit large to use only for transmitting
> >> > statistics. ADBC has statistics-related APIs, but it has many
> >> > other APIs as well.
> >>
> >> Some examples of producers given in the linked threads (Delta Lake,
> >> Arrow Dataset) are well suited to being wrapped by an ADBC driver. One
> >> can implement an ADBC driver without defining all the methods (where
> >> the producer could call AdbcConnectionGetStatistics(), although
> >> AdbcStatementGetStatistics() might be more relevant here and doesn't
> >> exist). One example listed (using an Arrow Table as a source) seems a
> >> bit light to wrap in an ADBC driver; however, it would not take much
> >> code to do so, and the overhead of getting the reader via ADBC is
> >> something like 100 microseconds (tested via the ADBC R package's
> >> "monkey driver", which wraps an existing stream as a statement). In any
> >> case, the bulk of the code is building the statistics array.
> >>
> >> > How about the following schema for the
> >> > statistics ArrowArray? It's based on ADBC.
> >>
> >> Whatever format for statistics is decided on, I imagine it should be
> >> exactly the same as the ADBC standard? (Perhaps pushing changes
> >> upstream if needed?)
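To make my IPC suggestion above concrete: here is a minimal sketch of
"encode the 1-row batch and store those bytes in the schema" using Arrow
C++. The statistics fields, the example values, and the
"ARROW:statistics" key are all placeholders; the field order and the
data type for null_count are exactly what would still need to be agreed
on.

    // Sketch only: serialize a 1-row statistics batch as a single IPC
    // RecordBatch message and stash the bytes in schema metadata.
    #include <arrow/api.h>
    #include <arrow/ipc/api.h>

    arrow::Result<std::shared_ptr<arrow::Schema>> WithStatistics(
        std::shared_ptr<arrow::Schema> schema) {
      // Hypothetical fixed statistics schema; field order and the type
      // of null_count are what the spec would have to pin down.
      auto stats_schema = arrow::schema({
          arrow::field("null_count", arrow::int64()),
          arrow::field("min", arrow::int64()),
          arrow::field("max", arrow::int64()),
      });

      // One row of example values.
      ARROW_ASSIGN_OR_RAISE(
          auto null_count, arrow::MakeArrayFromScalar(arrow::Int64Scalar(0), 1));
      ARROW_ASSIGN_OR_RAISE(
          auto min, arrow::MakeArrayFromScalar(arrow::Int64Scalar(1), 1));
      ARROW_ASSIGN_OR_RAISE(
          auto max, arrow::MakeArrayFromScalar(arrow::Int64Scalar(42), 1));
      auto batch = arrow::RecordBatch::Make(stats_schema, /*num_rows=*/1,
                                            {null_count, min, max});

      // Since the statistics schema is fixed, a bare IPC RecordBatch
      // message is enough; SerializeRecordBatch produces exactly that.
      ARROW_ASSIGN_OR_RAISE(
          auto buffer, arrow::ipc::SerializeRecordBatch(
                           *batch, arrow::ipc::IpcWriteOptions::Defaults()));

      // Store the bytes under a metadata key ("ARROW:statistics" is the
      // key floated later in this thread; metadata values are
      // length-prefixed, so binary bytes are fine).
      return schema->WithMetadata(arrow::key_value_metadata(
          {"ARROW:statistics"}, {buffer->ToString()}));
    }

If the full streaming format is preferred instead, writing the batch
through arrow::ipc::MakeStreamWriter over an arrow::io::BufferOutputStream
would produce the schema-plus-batch bytes; a consumer that knows the fixed
statistics schema can decode either form.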
> >>
> >> On Thu, May 23, 2024 at 3:21 AM Sutou Kouhei <k...@clear-code.com> wrote:
> >>
> >> > Hi,
> >> >
> >> > > Why not simply pass the statistics ArrowArray separately in your
> >> > > producer API of choice
> >> >
> >> > It seems that we should use this approach, because all the
> >> > feedback says so. How about the following schema for the
> >> > statistics ArrowArray? It's based on ADBC.
> >> >
> >> > | Field Name               | Field Type            | Comments |
> >> > |--------------------------|-----------------------|----------|
> >> > | column_name              | utf8                  | (1)      |
> >> > | statistic_key            | utf8 not null         | (2)      |
> >> > | statistic_value          | VALUE_SCHEMA not null |          |
> >> > | statistic_is_approximate | bool not null         | (3)      |
> >> >
> >> > 1. If null, then the statistic applies to the entire table.
> >> >    It's for "row_count".
> >> > 2. We'll provide pre-defined keys such as "max", "min",
> >> >    "byte_width" and "distinct_count", but users can also use
> >> >    application-specific keys.
> >> > 3. If true, then the value is approximate or best-effort.
> >> >
> >> > VALUE_SCHEMA is a dense union with members:
> >> >
> >> > | Field Name | Field Type |
> >> > |------------|------------|
> >> > | int64      | int64      |
> >> > | uint64     | uint64     |
> >> > | float64    | float64    |
> >> > | binary     | binary     |
> >> >
> >> > If a column is an int32 column, it uses int64 for
> >> > "max"/"min". We don't provide all types here. Users should
> >> > use a compatible type (int64 for an int32 column) instead.
> >> >
> >> >
> >> > Thanks,
> >> > --
> >> > kou
> >> >
> >> > In <a3ce5e96-176c-4226-9d74-6a458317a...@python.org>
> >> >   "Re: [DISCUSS] Statistics through the C data interface" on Wed, 22 May 2024 17:04:57 +0200,
> >> >   Antoine Pitrou <anto...@python.org> wrote:
> >> >
> >> > >
> >> > > Hi Kou,
> >> > >
> >> > > I agree with Dewey that this is overstretching the capabilities of the
> >> > > C Data Interface. In particular, stuffing a pointer as a metadata value
> >> > > and decreeing it immortal doesn't sound like a good design decision.
> >> > >
> >> > > Why not simply pass the statistics ArrowArray separately in your
> >> > > producer API of choice (Dewey mentioned ADBC but it is of course just
> >> > > a possible API among others)?
> >> > >
> >> > > Regards
> >> > >
> >> > > Antoine.
> >> > >
> >> > >
> >> > > On 22/05/2024 at 04:37, Sutou Kouhei wrote:
> >> > >> Hi,
> >> > >>
> >> > >> We're discussing how to provide statistics through the C
> >> > >> data interface at:
> >> > >> https://github.com/apache/arrow/issues/38837
> >> > >>
> >> > >> If you're interested in this feature, could you share your
> >> > >> comments?
> >> > >>
> >> > >> Motivation:
> >> > >>
> >> > >> We can interchange Apache Arrow data through the C data interface
> >> > >> in the same process. For example, we can pass Apache Arrow
> >> > >> data read by Apache Arrow C++ (producer) to DuckDB
> >> > >> (consumer) through the C data interface.
> >> > >>
> >> > >> A producer may know Apache Arrow data statistics. For
> >> > >> example, a producer can know statistics when it reads Apache
> >> > >> Parquet data, because Apache Parquet may provide statistics.
> >> > >> But a consumer can't know statistics that are known by a
> >> > >> producer, because there isn't a standard way to provide
> >> > >> statistics through the C data interface. If a consumer can
> >> > >> know the statistics, it can process Apache Arrow data faster
> >> > >> based on them.
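As an aside on the schema proposed in Kou's 23 May message above: it maps
directly onto the Arrow C++ type factories. A declaration-only sketch
(nothing here is settled; the names simply follow the table):

    #include <arrow/api.h>

    // Sketch of the proposed statistics schema. VALUE_SCHEMA is the
    // dense union carried by the statistic_value column.
    std::shared_ptr<arrow::Schema> MakeStatisticsSchema() {
      auto value_schema = arrow::dense_union({
          arrow::field("int64", arrow::int64()),
          arrow::field("uint64", arrow::uint64()),
          arrow::field("float64", arrow::float64()),
          arrow::field("binary", arrow::binary()),
      });
      return arrow::schema({
          // A null column_name means the statistic applies to the whole
          // table (e.g. "row_count").
          arrow::field("column_name", arrow::utf8()),
          arrow::field("statistic_key", arrow::utf8(), /*nullable=*/false),
          arrow::field("statistic_value", value_schema, /*nullable=*/false),
          arrow::field("statistic_is_approximate", arrow::boolean(),
                       /*nullable=*/false),
      });
    }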
> >> > >> Proposal:
> >> > >> https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
> >> > >>
> >> > >> How about providing statistics as metadata in ArrowSchema?
> >> > >>
> >> > >> We reserve the "ARROW" namespace for internal Apache Arrow use:
> >> > >>
> >> > >> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
> >> > >>
> >> > >> > The ARROW pattern is a reserved namespace for internal
> >> > >> > Arrow use in the custom_metadata fields. For example,
> >> > >> > ARROW:extension:name.
> >> > >>
> >> > >> So we can use "ARROW:statistics" for the metadata key.
> >> > >>
> >> > >> We can represent statistics as an ArrowArray like ADBC does.
> >> > >>
> >> > >> Here is an example ArrowSchema for a record batch that has
> >> > >> "int32 column1" and "string column2":
> >> > >>
> >> > >> ArrowSchema {
> >> > >>   .format = "+s",
> >> > >>   .metadata = {
> >> > >>     "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row count */
> >> > >>   },
> >> > >>   .children = {
> >> > >>     ArrowSchema {
> >> > >>       .name = "column1",
> >> > >>       .format = "i",
> >> > >>       .metadata = {
> >> > >>         "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
> >> > >>       },
> >> > >>     },
> >> > >>     ArrowSchema {
> >> > >>       .name = "column2",
> >> > >>       .format = "u",
> >> > >>       .metadata = {
> >> > >>         "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
> >> > >>       },
> >> > >>     },
> >> > >>   },
> >> > >> }
> >> > >>
> >> > >> The metadata value (the ArrowArray* part) of '"ARROW:statistics"
> >> > >> => ArrowArray*' is a base 10 string of the address of the
> >> > >> ArrowArray, because we can use only strings for metadata
> >> > >> values. You can't release the statistics ArrowArray*. (Its
> >> > >> release is a no-op function.) It follows the
> >> > >>
> >> > >> https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
> >> > >>
> >> > >> semantics. (The base ArrowSchema owns the statistics
> >> > >> ArrowArray*.)
> >> > >>
> >> > >> The ArrowArray* for statistics uses the following schema:
> >> > >>
> >> > >> | Field Name     | Field Type              | Comments |
> >> > >> |----------------|-------------------------|----------|
> >> > >> | key            | string not null         | (1)      |
> >> > >> | value          | `VALUE_SCHEMA` not null |          |
> >> > >> | is_approximate | bool not null           | (2)      |
> >> > >>
> >> > >> 1. We'll provide pre-defined keys such as "max", "min",
> >> > >>    "byte_width" and "distinct_count", but users can also use
> >> > >>    application-specific keys.
> >> > >> 2. If true, then the value is approximate or best-effort.
> >> > >>
> >> > >> VALUE_SCHEMA is a dense union with members:
> >> > >>
> >> > >> | Field Name | Field Type                       | Comments |
> >> > >> |------------|----------------------------------|----------|
> >> > >> | int64      | int64                            |          |
> >> > >> | uint64     | uint64                           |          |
> >> > >> | float64    | float64                          |          |
> >> > >> | value      | The same type as the ArrowSchema | (3)      |
> >> > >> |            | this array belongs to.           |          |
> >> > >>
> >> > >> 3. If the ArrowSchema's type is string, this type is also string.
> >> > >>
> >> > >> TODO: Is "value" a good name? If we refer to it from the
> >> > >> top-level statistics schema, we need to use
> >> > >> "value.value". It's a bit strange...
> >> > >>
> >> > >> What do you think about this proposal? Could you share your
> >> > >> comments?
> >> > >>
> >> > >> Thanks,
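For concreteness, the address-as-string encoding described in that
proposal could be produced as below. This is a sketch only (the helper
names are mine), and Antoine's objection earlier in the thread is aimed
at exactly this mechanism:

    #include <cinttypes>
    #include <cstdint>
    #include <cstdio>
    #include <string>

    #include <arrow/c/abi.h>  // struct ArrowArray, Arrow C data interface

    // The proposal makes release a no-op: the base ArrowSchema owns the
    // statistics array, so the consumer must not free it. Clearing the
    // pointer marks the struct as released, per the C data interface
    // contract; the real cleanup is assumed to hang off the ArrowSchema's
    // own release callback.
    static void ReleaseStatisticsNoOp(struct ArrowArray* array) {
      array->release = nullptr;
    }

    // Metadata values can only be strings, so the proposal stores the
    // ArrowArray's address as a base 10 string under "ARROW:statistics".
    std::string StatisticsMetadataValue(struct ArrowArray* statistics) {
      statistics->release = ReleaseStatisticsNoOp;
      char buf[32];
      std::snprintf(buf, sizeof(buf), "%" PRIuPTR,
                    reinterpret_cast<uintptr_t>(statistics));
      return std::string(buf);
    }

A consumer would turn the string back into a pointer with something like
std::strtoull(value, nullptr, 10) cast to ArrowArray*, which is precisely
the "stuffing a pointer as a metadata value" being debated above.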