Hi,

> It is usually fine but
> occasionally ends up with schema metadata that is lying (e.g., when
> unifying schemas from multiple files in a dataset, I believe pyarrow
> will sometimes assign metadata from one file to the entire dataset
> and/or propagate it through projections/filters).

Good point. I think that a process that unifies schemas should remove
(or merge if possible) statistics metadata. If we standardize
statistics, the process can do it. For example, the process can always
remove "ARROW:statistics" metadata when we use "ARROW:statistics" for
statistics.
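
As a rough pyarrow sketch of what such a unification step could do
(the helper name is illustrative, not an existing API):

import pyarrow as pa

def drop_statistics_metadata(schema: pa.Schema) -> pa.Schema:
    # Remove the proposed "ARROW:statistics" key from each field's metadata...
    fields = []
    for field in schema:
        field_meta = dict(field.metadata or {})
        field_meta.pop(b"ARROW:statistics", None)
        fields.append(field.with_metadata(field_meta))
    # ...and from the schema-level metadata.
    schema_meta = dict(schema.metadata or {})
    schema_meta.pop(b"ARROW:statistics", None)
    return pa.schema(fields, metadata=schema_meta)

# A unifying process could then do something like:
# unified = pa.unify_schemas([drop_statistics_metadata(s) for s in per_file_schemas])
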
Thanks,
--
kou

In <cafb7qsdalrucu47+geuhkesff6dexqj-js0bhhtafc736vq...@mail.gmail.com>
  "Re: [DISCUSS] Statistics through the C data interface" on Thu, 23 May 2024 15:14:49 -0300,
  Dewey Dunnington <de...@voltrondata.com.INVALID> wrote:

> Thanks Shoumyo for bringing this up!
>
> Using a schema to transmit statistics / data-dependent values is also
> something we do in GeoParquet (whose schema also finds its way into
> pyarrow and the C data interface when reading). It is usually fine but
> occasionally ends up with schema metadata that is lying (e.g., when
> unifying schemas from multiple files in a dataset, I believe pyarrow
> will sometimes assign metadata from one file to the entire dataset
> and/or propagate it through projections/filters).
>
> I imagine statistics would be opt-in (i.e., a consumer would have to
> explicitly request them), in which case that consumer could possibly
> be required to remove them. With the custom format string that was
> proposed I think this is unlikely to happen; however, that a consumer
> might want to know statistics over IPC too is an excellent point.
>
>> Unless there are other ways of producing stream-level application metadata
>> outside of the schema/field metadata
>
> Technically there is message-level metadata in the IPC flatbuffers,
> although I don't believe it is accessible from most IPC readers. That
> mechanism isn't available from an ArrowArrayStream, so it might not
> help with the specific case at hand.
>
>> nowhere is it mentioned that metadata must be used to determine schema
>> equivalence
>
> I am only familiar with a few implementations, but at least Arrow C++
> and nanoarrow have options to ignore metadata and/or nullability
> and/or possibly field names (e.g., for a list type) depending on what
> kind of type/schema equivalence is required.
>
>> use cases where you want to know the schema *before* the data is produced.
>
> I may be understanding it incorrectly, but I think it's generally
> possible to emit a schema with metadata before emitting record
> batches. I suppose you would have already started downloading the
> stream, though.
>
>> I think what we are slowly converging on is the need for a spec to
>> describe the encoding of Arrow array statistics as Arrow arrays.
>
> +1 (this will be helpful however we decide to transmit statistics)
>
> On Thu, May 23, 2024 at 1:57 PM Antoine Pitrou <anto...@python.org> wrote:
>>
>> Hi Shoumyo,
>>
>> The problem with communicating data statistics through schema metadata
>> is that it's not compatible with use cases where you want to know the
>> schema *before* the data is produced.
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On Thu, 23 May 2024 14:28:43 -0000
>> "Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)"
>> <schakravo...@bloomberg.net> wrote:
>> > This is a really exciting development, thank you for putting together this
>> > proposal!
>> >
>> > It looks like this thread and the linked GitHub issue have lots of input
>> > from folks who work with Arrow at a low level and have better familiarity
>> > with the Arrow specifications than I do, so I'll refrain from commenting
>> > on the technicalities of the proposal.
>> > I would, however, like to share my perspective as an application
>> > developer who heavily uses Arrow at higher levels for composing data
>> > systems.
>> >
>> > My main concern with the direction of this proposal is that it seems too
>> > narrowly focused on what the integration with DuckDB will look like (how
>> > the statistics can be fed into DuckDB). In many applications, executing
>> > the query is often the "last mile", and it's important to consider where
>> > the statistics will actually come from. To start, data might be sourced
>> > in various ways:
>> >
>> > - Arrow IPC files may be mapped from shared memory
>> > - Arrow IPC streams may be received via some RPC framework (à la Flight)
>> > - The Arrow libraries may be used to read from file formats like Parquet
>> >   or CSV
>> > - ADBC drivers may be used to read from databases
>> >
>> > Note that in at least the first two cases, the system _executing the
>> > query_ will not be able to provide statistics simply because it is not
>> > actually the data producer. As an example, if Process A writes an Arrow
>> > IPC file to shared memory, and Process B wants to run a query on it -- how
>> > is Process B supposed to get the statistics for query planning? There are
>> > a few approaches that I anticipate application developers might consider:
>> >
>> > 1. Design an out-of-band mechanism for Process B to fetch statistics from
>> >    Process A.
>> > 2. Design an encoding that is a superset of Arrow IPC and includes
>> >    statistics information, allowing statistics to be communicated in-band.
>> > 3. Use custom schema metadata to communicate statistics in-band.
>> >
>> > Options 1 and 2 require considerably more effort than Option 3. Also,
>> > Option 3 feels somewhat natural because it makes sense for the statistics
>> > to come with the data (similar to how statistics are embedded in Parquet
>> > files). In some sense, the statistics actually *are* a property of the
>> > stream.
>> >
>> > In systems that I work on, we already use schema metadata to communicate
>> > information that is unrelated to the structure of the data. From my
>> > reading of the documentation [1], this sounds like a reasonable (and
>> > perhaps intended?) use of metadata, and nowhere is it mentioned that
>> > metadata must be used to determine schema equivalence. Unless there are
>> > other ways of producing stream-level application metadata outside of the
>> > schema/field metadata, the lack of purity was not a concern for me to
>> > begin with.
>> >
>> > I would appreciate an approach that communicates statistics via schema
>> > metadata, or at least in some in-band fashion that is consistent across
>> > the IPC and C data specifications. This would make it much easier to
>> > uniformly and transparently plumb statistics through applications,
>> > regardless of where they source Arrow data from. As developers are likely
>> > to create bespoke conventions for this anyways, it seems reasonable to
>> > standardize it as canonical metadata.
>> >
>> > I say this all as a happy user of DuckDB's Arrow scan functionality that
>> > is excited to see better query optimization capabilities. It's just that,
>> > in its current form, the changes in this proposal are not something I
>> > could foreseeably integrate with.
>> >
>> > Best,
>> > Shoumyo
>> >
>> > [1]:
>> > https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
>> >
>> > From: dev@arrow.apache.org At: 05/23/24 10:10:51 UTC-4:00
>> > To: dev@arrow.apache.org
>> > Subject: Re: [DISCUSS] Statistics through the C data interface
>> >
>> > I want to +1 on what Dewey is saying here and some comments.
>> >
>> > Sutou Kouhei wrote:
>> > > ADBC may be a bit larger to use only for transmitting statistics.
>> > > ADBC has statistics related APIs but it has more other APIs.
>> >
>> > It's impossible to keep the responsibility of communication protocols
>> > cleanly separated, but IMO, we should strive to keep the C Data
>> > Interface more of a Transport Protocol than an Application Protocol.
>> >
>> > Statistics are application-dependent and can complicate the
>> > implementation of importers/exporters, which would hinder the adoption
>> > of the C Data Interface. Statistics also bring in security concerns
>> > that are application-specific, e.g. can an algorithm trust min/max
>> > stats and risk producing incorrect results if the statistics are
>> > incorrect? That question can't really be answered at the C Data
>> > Interface level.
>> >
>> > The need for more sophisticated statistics only grows with time, so
>> > there is no such thing as a "simple statistics schema".
>> >
>> > Protocols that produce/consume statistics might want to use the C Data
>> > Interface as a primitive for passing Arrow arrays of statistics.
>> >
>> > ADBC might be too big of a leap in complexity now, but "we just need C
>> > Data Interface + statistics" is unlikely to remain true for very long
>> > as projects grow in complexity.
>> >
>> > --
>> > Felipe
>> >
>> > On Thu, May 23, 2024 at 9:57 AM Dewey Dunnington
>> > <de...@voltrondata.com.invalid> wrote:
>> > >
>> > > Thank you for the background! I understand that these statistics are
>> > > important for query planning; however, I am not sure that I follow why
>> > > we are constrained to the ArrowSchema to represent them. The examples
>> > > given seem to be going through Python... would it be easier to request
>> > > statistics at a higher level of abstraction? There would already need
>> > > to be a separate mechanism to request an ArrowArrayStream with
>> > > statistics (unless the PyCapsule `requested_schema` argument would
>> > > suffice).
>> > >
>> > > > ADBC may be a bit larger to use only for transmitting
>> > > > statistics. ADBC has statistics related APIs but it has more
>> > > > other APIs.
>> > >
>> > > Some examples of producers given in the linked threads (Delta Lake,
>> > > Arrow Dataset) are well-suited to being wrapped by an ADBC driver. One
>> > > can implement an ADBC driver without defining all the methods (where
>> > > the producer could call AdbcConnectionGetStatistics(), although
>> > > AdbcStatementGetStatistics() might be more relevant here and doesn't
>> > > exist). One example listed (using an Arrow Table as a source) seems a
>> > > bit light to wrap in an ADBC driver; however, it would not take much
>> > > code to do so, and the overhead of getting the reader via ADBC is
>> > > something like 100 microseconds (tested via the ADBC R package's
>> > > "monkey driver", which wraps an existing stream as a statement). In any
>> > > case, the bulk of the code is building the statistics array.
>> > >
>> > > > How about the following schema for the
>> > > > statistics ArrowArray? It's based on ADBC.
>> > >
>> > > Whatever format for statistics is decided on, I imagine it should be
>> > > exactly the same as the ADBC standard? (Perhaps pushing changes
>> > > upstream if needed?)
>> > >
>> > > On Thu, May 23, 2024 at 3:21 AM Sutou Kouhei <k...@clear-code.com> wrote:
>> > > >
>> > > > Hi,
>> > > >
>> > > > > Why not simply pass the statistics ArrowArray separately in your
>> > > > > producer API of choice
>> > > >
>> > > > It seems that we should use that approach because all
>> > > > feedback said so. How about the following schema for the
>> > > > statistics ArrowArray? It's based on ADBC.
>> > > >
>> > > > | Field Name               | Field Type            | Comments |
>> > > > |--------------------------|-----------------------|----------|
>> > > > | column_name              | utf8                  | (1)      |
>> > > > | statistic_key            | utf8 not null         | (2)      |
>> > > > | statistic_value          | VALUE_SCHEMA not null |          |
>> > > > | statistic_is_approximate | bool not null         | (3)      |
>> > > >
>> > > > 1. If null, then the statistic applies to the entire table.
>> > > >    It's for "row_count".
>> > > > 2. We'll provide pre-defined keys such as "max", "min",
>> > > >    "byte_width" and "distinct_count", but users can also use
>> > > >    application-specific keys.
>> > > > 3. If true, then the value is approximate or best-effort.
>> > > >
>> > > > VALUE_SCHEMA is a dense union with members:
>> > > >
>> > > > | Field Name | Field Type |
>> > > > |------------|------------|
>> > > > | int64      | int64      |
>> > > > | uint64     | uint64     |
>> > > > | float64    | float64    |
>> > > > | binary     | binary     |
>> > > >
>> > > > If a column is an int32 column, it uses int64 for
>> > > > "max"/"min". We don't provide all types here. Users should
>> > > > use a compatible type (int64 for an int32 column) instead.
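
For reference, a rough pyarrow sketch of an array using this proposed schema
(the statistics are modeled here as a record batch; the example values and
construction details are illustrative only, not an agreed-on API):

import pyarrow as pa

# VALUE_SCHEMA: dense union of the value types listed above.
value_type = pa.dense_union([
    pa.field("int64", pa.int64()),
    pa.field("uint64", pa.uint64()),
    pa.field("float64", pa.float64()),
    pa.field("binary", pa.binary()),
])

statistics_schema = pa.schema([
    pa.field("column_name", pa.utf8()),               # null => table-level statistic
    pa.field("statistic_key", pa.utf8(), nullable=False),
    pa.field("statistic_value", value_type, nullable=False),
    pa.field("statistic_is_approximate", pa.bool_(), nullable=False),
])

# Two example statistics: table-level "row_count" = 1000, and "max" = 42
# for an int32 column "a" (widened to int64 as described above).
statistic_values = pa.UnionArray.from_dense(
    pa.array([0, 0], type=pa.int8()),      # both rows use the int64 child
    pa.array([0, 1], type=pa.int32()),     # offsets into that child
    [
        pa.array([1000, 42], type=pa.int64()),
        pa.array([], type=pa.uint64()),
        pa.array([], type=pa.float64()),
        pa.array([], type=pa.binary()),
    ],
    ["int64", "uint64", "float64", "binary"],
)

statistics = pa.record_batch(
    [
        pa.array([None, "a"], type=pa.utf8()),
        pa.array(["row_count", "max"], type=pa.utf8()),
        statistic_values,
        pa.array([False, True], type=pa.bool_()),
    ],
    schema=statistics_schema,
)

An array built this way could be handed to the consumer through whichever
producer API is agreed on, independently of the C Data Interface schema itself.
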
>> > > >
>> > > > Thanks,
>> > > > --
>> > > > kou
>> > > >
>> > > > In <a3ce5e96-176c-4226-9d74-6a458317a...@python.org>
>> > > >   "Re: [DISCUSS] Statistics through the C data interface" on Wed, 22 May 2024 17:04:57 +0200,
>> > > >   Antoine Pitrou <anto...@python.org> wrote:
>> > > >
>> > > > >
>> > > > > Hi Kou,
>> > > > >
>> > > > > I agree with Dewey that this is overstretching the capabilities of the
>> > > > > C Data Interface. In particular, stuffing a pointer as a metadata value
>> > > > > and decreeing it immortal doesn't sound like a good design decision.
>> > > > >
>> > > > > Why not simply pass the statistics ArrowArray separately in your
>> > > > > producer API of choice (Dewey mentioned ADBC but it is of course just
>> > > > > a possible API among others)?
>> > > > >
>> > > > > Regards
>> > > > >
>> > > > > Antoine.
>> > > > >
>> > > > >
>> > > > > On 22/05/2024 04:37, Sutou Kouhei wrote:
>> > > > >> Hi,
>> > > > >>
>> > > > >> We're discussing how to provide statistics through the C
>> > > > >> data interface at:
>> > > > >> https://github.com/apache/arrow/issues/38837
>> > > > >>
>> > > > >> If you're interested in this feature, could you share your
>> > > > >> comments?
>> > > > >>
>> > > > >> Motivation:
>> > > > >>
>> > > > >> We can interchange Apache Arrow data by the C data interface
>> > > > >> in the same process. For example, we can pass Apache Arrow
>> > > > >> data read by Apache Arrow C++ (provider) to DuckDB
>> > > > >> (consumer) through the C data interface.
>> > > > >>
>> > > > >> A provider may know Apache Arrow data statistics. For
>> > > > >> example, a provider can know statistics when it reads Apache
>> > > > >> Parquet data because Apache Parquet may provide statistics.
>> > > > >> But a consumer can't know statistics that are known by a
>> > > > >> producer, because there isn't a standard way to provide
>> > > > >> statistics through the C data interface. If a consumer can
>> > > > >> know the statistics, it can process Apache Arrow data faster
>> > > > >> based on them.
>> > > > >>
>> > > > >> Proposal:
>> > > > >> https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
>> > > > >>
>> > > > >> How about providing statistics as metadata in ArrowSchema?
>> > > > >>
>> > > > >> We reserve the "ARROW" namespace for internal Apache Arrow use:
>> > > > >> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
>> > > > >>
>> > > > >>> The ARROW pattern is a reserved namespace for internal
>> > > > >>> Arrow use in the custom_metadata fields. For example,
>> > > > >>> ARROW:extension:name.
>> > > > >>
>> > > > >> So we can use "ARROW:statistics" for the metadata key.
>> > > > >>
>> > > > >> We can represent statistics as an ArrowArray like ADBC does.
>> > > > >>
>> > > > >> Here is an example ArrowSchema for a record batch that has
>> > > > >> "int32 column1" and "string column2":
>> > > > >>
>> > > > >> ArrowSchema {
>> > > > >>   .format = "+siu",
>> > > > >>   .metadata = {
>> > > > >>     "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row count */
>> > > > >>   },
>> > > > >>   .children = {
>> > > > >>     ArrowSchema {
>> > > > >>       .name = "column1",
>> > > > >>       .format = "i",
>> > > > >>       .metadata = {
>> > > > >>         "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
>> > > > >>       },
>> > > > >>     },
>> > > > >>     ArrowSchema {
>> > > > >>       .name = "column2",
>> > > > >>       .format = "u",
>> > > > >>       .metadata = {
>> > > > >>         "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
>> > > > >>       },
>> > > > >>     },
>> > > > >>   },
>> > > > >> }
>> > > > >>
>> > > > >> The metadata value (the ArrowArray* part) of '"ARROW:statistics"
>> > > > >> => ArrowArray*' is a base-10 string of the address of the
>> > > > >> ArrowArray, because we can only use strings for metadata
>> > > > >> values. You can't release the statistics ArrowArray*. (Its
>> > > > >> release is a no-op function.) It follows
>> > > > >> https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
>> > > > >> semantics. (The base ArrowSchema owns the statistics
>> > > > >> ArrowArray*.)
>> > > > >>
>> > > > >> The ArrowArray* for statistics uses the following schema:
>> > > > >>
>> > > > >> | Field Name     | Field Type              | Comments |
>> > > > >> |----------------|-------------------------|----------|
>> > > > >> | key            | string not null         | (1)      |
>> > > > >> | value          | `VALUE_SCHEMA` not null |          |
>> > > > >> | is_approximate | bool not null           | (2)      |
>> > > > >>
>> > > > >> 1. We'll provide pre-defined keys such as "max", "min",
>> > > > >>    "byte_width" and "distinct_count", but users can also use
>> > > > >>    application-specific keys.
>> > > > >> 2. If true, then the value is approximate or best-effort.
>> > > > >>
>> > > > >> VALUE_SCHEMA is a dense union with members:
>> > > > >>
>> > > > >> | Field Name | Field Type                       | Comments |
>> > > > >> |------------|----------------------------------|----------|
>> > > > >> | int64      | int64                            |          |
>> > > > >> | uint64     | uint64                           |          |
>> > > > >> | float64    | float64                          |          |
>> > > > >> | value      | The same type as the ArrowSchema | (3)      |
>> > > > >> |            | it belongs to.                   |          |
>> > > > >>
>> > > > >> 3. If the ArrowSchema's type is string, this type is also string.
>> > > > >>
>> > > > >> TODO: Is "value" a good name? If we refer to it from the
>> > > > >> top-level statistics schema, we need to use
>> > > > >> "value.value". It's a bit strange...
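
For illustration of the consumer side of this proposal, a rough Python sketch
of decoding such a metadata value, assuming pyarrow's low-level
`Array._import_from_c` hook and a `statistics_type` built from the schema
above (names and details are illustrative only):

import pyarrow as pa

def read_statistics_metadata(schema: pa.Schema, statistics_type: pa.DataType):
    # The metadata value is the ArrowArray's address as a base-10 string.
    raw = (schema.metadata or {}).get(b"ARROW:statistics")
    if raw is None:
        return None
    address = int(raw)
    # Import through the C Data Interface. Under this proposal the array's
    # release callback is a no-op, so importing it does not free memory
    # that is owned by the exporting ArrowSchema.
    return pa.Array._import_from_c(address, statistics_type)
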
>> > > > >>
>> > > > >> What do you think about this proposal? Could you share your
>> > > > >> comments?
>> > > > >>
>> > > > >> Thanks,