Re: [I] [Format] Passing column statistics through Arrow C data interface [arrow]

via GitHub Wed, 24 Apr 2024 01:31:36 -0700


kou commented on issue #38837:
URL: https://github.com/apache/arrow/issues/38837#issuecomment-2074371230


   I'm considering some approaches for this use case. This is not complete yet, 
but I'll share my ideas so far. Feedback is appreciated.
   
   ADBC uses the following schema to return statistics:
   
   
https://github.com/apache/arrow-adbc/blob/3f80831d12b6e5a78a4321f67e28d652951241cf/adbc.h#L1739-L1778
   
   It's designed for returning statistics of a database.
   
   We can simplify this schema because we only need to return the statistics 
of a record batch. For example:
   
   | Field Name               | Field Type                       | Comments |
   |--------------------------|----------------------------------|----------|
   | column_name              | utf8                             | (1)      |
   | statistic_key            | int16 not null                   | (2)      |
   | statistic_value          | `VALUE_SCHEMA` not null          |          |
   | statistic_is_approximate | bool not null                    | (3)      |
   
   1. If null, then the statistic applies to the entire table.
   2. A dictionary-encoded statistic name (although we do not use the Arrow
      dictionary type). Values in [0, 1024) are reserved for ADBC.  Other
      values are for implementation-specific statistics.  For the definitions
      of predefined statistic types, see
      [adbc-table-statistics](https://github.com/apache/arrow-adbc/blob/3f80831d12b6e5a78a4321f67e28d652951241cf/adbc.h#L524-L570).
      To get driver-specific statistic names, use
      `AdbcConnectionGetStatisticNames()`.
   3. If true, then the value is approximate or best-effort.
   
   `VALUE_SCHEMA` is a dense union with members:
   
   | Field Name               | Field Type                       |
   |--------------------------|----------------------------------|
   | int64                    | int64                            |
   | uint64                   | uint64                           |
   | float64                  | float64                          |
   | binary                   | binary                           |
   
   TODO: How to represent statistic key? Should we use ADBC style? (Assigning 
an ID for each statistic key and using it.)
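
One way to picture the ADBC-style option is a small registry of integer keys. The sketch below is illustrative only: the key names and ids in `StatisticKey` are hypothetical and are not the values defined in `adbc.h`, and `is_reserved()`, `key_name()`, and the sample rows are invented for this example. Only the reserved range [0, 1024) is taken from the ADBC comment quoted above.

```python
from enum import IntEnum

# From the ADBC comment above: keys in [0, 1024) are reserved.
RESERVED_END = 1024


class StatisticKey(IntEnum):
    """Hypothetical key ids for illustration; not the ids defined in adbc.h."""
    NULL_COUNT = 0
    DISTINCT_COUNT = 1
    MIN_VALUE = 2
    MAX_VALUE = 3


def is_reserved(key: int) -> bool:
    """True if the key falls in the reserved range."""
    return 0 <= key < RESERVED_END


def key_name(key: int, custom_names: dict) -> str:
    """Resolve a key to a name, falling back to implementation-specific names."""
    try:
        return StatisticKey(key).name
    except ValueError:
        return custom_names.get(key, f"unknown({key})")


# Rows follow the proposed schema:
# (column_name, statistic_key, (union_member, value), statistic_is_approximate)
rows = [
    ("id", StatisticKey.NULL_COUNT, ("int64", 0), False),
    ("id", StatisticKey.DISTINCT_COUNT, ("int64", 1000), True),
    ("price", StatisticKey.MAX_VALUE, ("float64", 99.5), False),
    ("price", 1024, ("binary", b"\x01\x02"), True),  # implementation-specific key
]

if __name__ == "__main__":
    custom = {1024: "histogram"}  # hypothetical implementation-specific name
    for column, key, (member, value), approx in rows:
        print(column, key_name(key, custom), member, value, approx)
```

The trade-off with integer ids is that producers and consumers must agree on the registry out of band, whereas string keys would be self-describing at the cost of larger batches.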
   
   If we represent statistics as a record batch, we can pass them through the 
Arrow C data interface. This may be a reasonable approach.
   
   If we use this approach, we need to do the following:
   * Define a schema as a specification
   * Add statistics-related APIs to Apache Arrow C++ and at least one other 
implementation, because a specification change needs at least two reference 
implementations
     * 
https://arrow.apache.org/docs/format/Changing.html#at-least-two-reference-implementations
     * This is not a format change, but it's better to follow the rule anyway
     * We can work on this for Apache Arrow C++ before we propose a 
specification because statistics will be useful for general purposes
   * Apache Arrow C++: Add support for importing statistics from Apache Parquet 
C++
   * ...
   
   TODO: Consider statistics-related APIs for Apache Arrow C++.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
