Re: [DISCUSS] Statistics through the C data interface

Antoine Pitrou Thu, 23 May 2024 09:57:48 -0700


Hi Shoumyo,


The problem with communicating data statistics through schema metadata
is that it's not compatible with use cases where you want to know the
schema *before* the data is produced.

Regards

Antoine.


On Thu, 23 May 2024 14:28:43 -0000
"Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)"
<[email protected]> wrote:
> This is a really exciting development, thank you for putting together this 
> proposal!
> 
> It looks like this thread and the linked GitHub issue has lots of input from 
> folks who work with Arrow at a low level and have better familiarity with the 
> Arrow specifications than I do, so I'll refrain from commenting on the 
> technicalities of the proposal. I would, however, like to share my 
> perspective as an application developer that heavily uses Arrow at higher 
> levels for composing data systems.
> 
> My main concern with the direction of this proposal is that it seems too 
> narrowly focused on what the integration with DuckDB will look like (how the 
> statistics can be fed into DuckDB). In many applications, executing the query 
> is often the "last mile", and it's important to consider where the statistics 
> will actually come from. To start, data might be sourced in various manners:
> 
> - Arrow IPC files may be mapped from shared memory
> - Arrow IPC streams may be received via some RPC framework (à la Flight)
> - The Arrow libraries may be used to read from file formats like Parquet or 
> CSV
> - ADBC drivers may be used to read from databases
> 
> Note that in at least the first two cases, the system _executing the query_ 
> will not be able to provide statistics simply because it is not actually the 
> data producer. As an example, if Process A writes an Arrow IPC file to shared 
> memory, and Process B wants to run a query on it -- how is Process B supposed 
> to get the statistics for query planning? There are a few approaches that I 
> anticipate application developers might consider:
> 
> 1. Design an out-of-band mechanism for Process B to fetch statistics from 
> Process A.
> 2. Design an encoding that is a superset of Arrow IPC and includes statistics 
> information, allowing statistics to be communicated in-band.
> 3. Use custom schema metadata to communicate statistics in-band.
> 
> Options 1 and 2 require considerably more effort than Option 3. Also, Option 
> 3 feels somewhat natural because it makes sense for the statistics to come 
> with the data (similar to how statistics are embedded in Parquet files). In 
> some sense, the statistics actually *are* a property of the stream.
> 
> In systems that I work on, we already use schema metadata to communicate 
> information that is unrelated to the structure of the data. From my reading 
> of the documentation [1], this sounds like a reasonable (and perhaps 
> intended?) use of metadata, and nowhere is it mentioned that metadata must be 
> used to determine schema equivalence. Unless there are other ways of 
> producing stream-level application metadata outside of the schema/field 
> metadata, the lack of purity was not a concern for me to begin with.
> 
> I would appreciate an approach that communicates statistics via schema 
> metadata, or at least in some in-band fashion that is consistent across the 
> IPC and C data specifications. This would make it much easier to uniformly 
> and transparently plumb statistics through applications, regardless of where 
> they source Arrow data from. As developers are likely to create bespoke 
> conventions for this anyways, it seems reasonable to standardize it as 
> canonical metadata.
> 
> I say this all as a happy user of DuckDB's Arrow scan functionality that is 
> excited to see better query optimization capabilities. It's just that, in its 
> current form, the changes in this proposal are not something I could 
> foreseeably integrate with.
> 
> Best,
> Shoumyo
> 
> [1]: 
> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
> 
> From: [email protected] At: 05/23/24 10:10:51 UTC-4:00To:  
> [email protected]
> Subject: Re: [DISCUSS] Statistics through the C data interface
> 
> I want to +1 on what Dewey is saying here and some comments.
> 
> Sutou Kouhei wrote:
> > ADBC may be a bit larger to use only for transmitting statistics. ADBC has  
> >  
> statistics related APIs but it has more other APIs.
> 
> It's impossible to keep the responsibility of communication protocols
> cleanly separated, but IMO, we should strive to keep the C Data
> Interface more of a Transport Protocol than an Application Protocol.
> 
> Statistics are application dependent and can complicate the
> implementation of importers/exporters which would hinder the adoption
> of the C Data Interface. Statistics also bring in security concerns
> that are application-specific. e.g. can an algorithm trust min/max
> stats and risk producing incorrect results if the statistics are
> incorrect? A question that can't really be answered at the C Data
> Interface level.
> 
> The need for more sophisticated statistics only grows with time, so
> there is no such thing as a "simple statistics schema".
> 
> Protocols that produce/consume statistics might want to use the C Data
> Interface as a primitive for passing Arrow arrays of statistics.
> 
> ADBC might be too big of a leap in complexity now, but "we just need C
> Data Interface + statistics" is unlikely to remain true for very long
> as projects grow in complexity.
> 
> --
> Felipe
> 
> On Thu, May 23, 2024 at 9:57 AM Dewey Dunnington
> <[email protected]> wrote:
> >
> > Thank you for the background! I understand that these statistics are
> > important for query planning; however, I am not sure that I follow why
> > we are constrained to the ArrowSchema to represent them. The examples
> > given seem to going through Python...would it be easier to request
> > statistics at a higher level of abstraction? There would already need
> > to be a separate mechanism to request an ArrowArrayStream with
> > statistics (unless the PyCapsule `requested_schema` argument would
> > suffice).
> >  
> > > ADBC may be a bit larger to use only for transmitting
> > > statistics. ADBC has statistics related APIs but it has more
> > > other APIs.  
> >
> > Some examples of producers given in the linked threads (Delta Lake,
> > Arrow Dataset) are well-suited to being wrapped by an ADBC driver. One
> > can implement an ADBC driver without defining all the methods (where
> > the producer could call AdbcConnectionGetStatistics(), although
> > AdbcStatementGetStatistics() might be more relevant here and doesn't
> > exist). One example listed (using an Arrow Table as a source) seems a
> > bit light to wrap in an ADBC driver; however, it would not take much
> > code to do so and the overhead of getting the reader via ADBC it is
> > something like 100 microseconds (tested via the ADBC R package's
> > "monkey driver" which wraps an existing stream as a statement). In any
> > case, the bulk of the code is building the statistics array.
> >  
> > > How about the following schema for the
> > > statistics ArrowArray? It's based on ADBC.  
> >
> > Whatever format for statistics is decided on, I imagine it should be
> > exactly the same as the ADBC standard? (Perhaps pushing changes
> > upstream if needed?).
> >
> > On Thu, May 23, 2024 at 3:21 AM Sutou Kouhei <[email protected]> wrote:  
> > >
> > > Hi,
> > >  
> > > > Why not simply pass the statistics ArrowArray separately in your
> > > > producer API of choice  
> > >
> > > It seems that we should use the approach because all
> > > feedback said so. How about the following schema for the
> > > statistics ArrowArray? It's based on ADBC.
> > >
> > > | Field Name               | Field Type            | Comments |
> > > |--------------------------|-----------------------| -------- |
> > > | column_name              | utf8                  | (1)      |
> > > | statistic_key            | utf8 not null         | (2)      |
> > > | statistic_value          | VALUE_SCHEMA not null |          |
> > > | statistic_is_approximate | bool not null         | (3)      |
> > >
> > > 1. If null, then the statistic applies to the entire table.
> > >    It's for "row_count".
> > > 2. We'll provide pre-defined keys such as "max", "min",
> > >    "byte_width" and "distinct_count" but users can also use
> > >    application specific keys.
> > > 3. If true, then the value is approximate or best-effort.
> > >
> > > VALUE_SCHEMA is a dense union with members:
> > >
> > > | Field Name | Field Type |
> > > |------------|------------|
> > > | int64      | int64      |
> > > | uint64     | uint64     |
> > > | float64    | float64    |
> > > | binary     | binary     |
> > >
> > > If a column is an int32 column, it uses int64 for
> > > "max"/"min". We don't provide all types here. Users should
> > > use a compatible type (int64 for a int32 column) instead.
> > >
> > >
> > > Thanks,
> > > --
> > > kou
> > >
> > > In <[email protected]>
> > >   "Re: [DISCUSS] Statistics through the C data interface" on Wed, 22 May  
> > >  
> 2024 17:04:57 +0200,
> > >   Antoine Pitrou <[email protected]> wrote:
> > >  
> > > >
> > > > Hi Kou,
> > > >
> > > > I agree that Dewey that this is overstretching the capabilities of the
> > > > C Data Interface. In particular, stuffing a pointer as metadata value
> > > > and decreeing it immortal doesn't sound like a good design decision.
> > > >
> > > > Why not simply pass the statistics ArrowArray separately in your
> > > > producer API of choice (Dewey mentioned ADBC but it is of course just
> > > > a possible API among others)?
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > Le 22/05/2024 à 04:37, Sutou Kouhei a écrit :  
> > > >> Hi,
> > > >> We're discussing how to provide statistics through the C
> > > >> data interface at:
> > > >> https://github.com/apache/arrow/issues/38837
> > > >> If you're interested in this feature, could you share your
> > > >> comments?
> > > >> Motivation:
> > > >> We can interchange Apache Arrow data by the C data interface
> > > >> in the same process. For example, we can pass Apache Arrow
> > > >> data read by Apache Arrow C++ (provider) to DuckDB
> > > >> (consumer) through the C data interface.
> > > >> A provider may know Apache Arrow data statistics. For
> > > >> example, a provider can know statistics when it reads Apache
> > > >> Parquet data because Apache Parquet may provide statistics.
> > > >> But a consumer can't know statistics that are known by a
> > > >> producer. Because there isn't a standard way to provide
> > > >> statistics through the C data interface. If a consumer can
> > > >> know statistics, it can process Apache Arrow data faster
> > > >> based on statistics.
> > > >> Proposal:
> > > >> https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
> > > >> How about providing statistics as a metadata in ArrowSchema?
> > > >> We reserve "ARROW" namespace for internal Apache Arrow use:
> > > >>   
> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
> > > >>  
> > > >>> The ARROW pattern is a reserved namespace for internal
> > > >>> Arrow use in the custom_metadata fields. For example,
> > > >>> ARROW:extension:name.  
> > > >> So we can use "ARROW:statistics" for the metadata key.
> > > >> We can represent statistics as a ArrowArray like ADBC does.
> > > >> Here is an example ArrowSchema that is for a record batch
> > > >> that has "int32 column1" and "string column2":
> > > >> ArrowSchema {
> > > >>    .format = "+siu",
> > > >>    .metadata = {
> > > >>      "ARROW:statistics" => ArrowArray*, /* table-level statistics such 
> > > >> as
> > > >>      row count */
> > > >>    },
> > > >>    .children = {
> > > >>      ArrowSchema {
> > > >>        .name = "column1",
> > > >>        .format = "i",
> > > >>        .metadata = {
> > > >>          "ARROW:statistics" => ArrowArray*, /* column-level statistics 
> > > >>   
> such as
> > > >>          count distinct */
> > > >>        },
> > > >>      },
> > > >>      ArrowSchema {
> > > >>        .name = "column2",
> > > >>        .format = "u",
> > > >>        .metadata = {
> > > >>          "ARROW:statistics" => ArrowArray*, /* column-level statistics 
> > > >>   
> such as
> > > >>          count distinct */
> > > >>        },
> > > >>      },
> > > >>    },
> > > >> }
> > > >> The metadata value (ArrowArray* part) of '"ARROW:statistics"  
> > > >> => ArrowArray*' is a base 10 string of the address of the  
> > > >> ArrowArray. Because we can use only string for metadata
> > > >> value. You can't release the statistics ArrowArray*. (Its
> > > >> release is a no-op function.) It follows
> > > >>   
> https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
> > > >> semantics. (The base ArrowSchema owns statistics
> > > >> ArrowArray*.)
> > > >> ArrowArray* for statistics use the following schema:
> > > >> | Field Name     | Field Type                       | Comments |
> > > >> |----------------|----------------------------------| -------- |
> > > >> | key            | string not null                  | (1)      |
> > > >> | value          | `VALUE_SCHEMA` not null          |          |
> > > >> | is_approximate | bool not null                    | (2)      |
> > > >> 1. We'll provide pre-defined keys such as "max", "min",
> > > >>     "byte_width" and "distinct_count" but users can also use
> > > >>     application specific keys.
> > > >> 2. If true, then the value is approximate or best-effort.
> > > >> VALUE_SCHEMA is a dense union with members:
> > > >> | Field Name | Field Type                       | Comments |
> > > >> |------------|----------------------------------| -------- |
> > > >> | int64      | int64                            |          |
> > > >> | uint64     | uint64                           |          |
> > > >> | float64    | float64                          |          |
> > > >> | value      | The same type of the ArrowSchema | (3)      |
> > > >> |            | that is belonged to.             |          |
> > > >> 3. If the ArrowSchema's type is string, this type is also string.
> > > >>     TODO: Is "value" good name? If we refer it from the
> > > >>     top-level statistics schema, we need to use
> > > >>     "value.value". It's a bit strange...
> > > >> What do you think about this proposal? Could you share your
> > > >> comments?
> > > >> Thanks,  
> 
>

Re: [DISCUSS] Statistics through the C data interface

Reply via email to