Hi,

> It is usually fine but
> occasionally ends up with schema metadata that is lying (e.g., when
> unifying schemas from multiple files in a dataset, I believe pyarrow
> will sometimes assign metadata from one file to the entire dataset
> and/or propagate it through projections/filters).

Good point. I think that a process that unifies schemas should remove
(or merge if possible) statistics metadata. If we standardize
statistics, the process can do it. For example, the process can always
remove "ARROW:statistics" metadata when we use "ARROW:statistics" for
statistics.
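
As a rough pyarrow sketch of what such a unification step could do
(the helper name is illustrative, not an existing API):

import pyarrow as pa

def drop_statistics_metadata(schema: pa.Schema) -> pa.Schema:
    # Remove the proposed "ARROW:statistics" key from each field's metadata...
    fields = []
    for field in schema:
        field_meta = dict(field.metadata or {})
        field_meta.pop(b"ARROW:statistics", None)
        fields.append(field.with_metadata(field_meta))
    # ...and from the schema-level metadata.
    schema_meta = dict(schema.metadata or {})
    schema_meta.pop(b"ARROW:statistics", None)
    return pa.schema(fields, metadata=schema_meta)

# A unifying process could then do something like:
# unified = pa.unify_schemas([drop_statistics_metadata(s) for s in per_file_schemas])
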
Thanks,
--
kou

In <cafb7qsdalrucu47+geuhkesff6dexqj-js0bhhtafc736vq...@mail.gmail.com>
  "Re: [DISCUSS] Statistics through the C data interface" on Thu, 23 May 2024 15:14:49 -0300,
  Dewey Dunnington <de...@voltrondata.com.INVALID> wrote:

> Thanks Shoumyo for bringing this up!
>
> Using a schema to transmit statistics / data-dependent values is also
> something we do in GeoParquet (whose schema also finds its way into
> pyarrow and the C data interface when reading). It is usually fine but
> occasionally ends up with schema metadata that is lying (e.g., when
> unifying schemas from multiple files in a dataset, I believe pyarrow
> will sometimes assign metadata from one file to the entire dataset
> and/or propagate it through projections/filters).
>
> I imagine statistics would be opt-in (i.e., a consumer would have to
> explicitly request them), in which case that consumer could possibly
> be required to remove them. With the custom format string that was
> proposed I think this is unlikely to happen; however, that a consumer
> might want to know statistics over IPC too is an excellent point.
>
>> Unless there are other ways of producing stream-level application metadata
>> outside of the schema/field metadata
>
> Technically there is message-level metadata in the IPC flatbuffers,
> although I don't believe it is accessible from most IPC readers. That
> mechanism isn't available from an ArrowArrayStream, so it might not
> help with the specific case at hand.
>
>> nowhere is it mentioned that metadata must be used to determine schema
>> equivalence
>
> I am only familiar with a few implementations, but at least Arrow C++
> and nanoarrow have options to ignore metadata and/or nullability
> and/or possibly field names (e.g., for a list type) depending on what
> kind of type/schema equivalence is required.
>
>> use cases where you want to know the schema *before* the data is produced.
>
> I may be understanding it incorrectly, but I think it's generally
> possible to emit a schema with metadata before emitting record
> batches. I suppose you would have already started downloading the
> stream, though.
>
>> I think what we are slowly converging on is the need for a spec to
>> describe the encoding of Arrow array statistics as Arrow arrays.
>
> +1 (this will be helpful however we decide to transmit statistics)
>
> On Thu, May 23, 2024 at 1:57 PM Antoine Pitrou <anto...@python.org> wrote:
>>
>> Hi Shoumyo,
>>
>> The problem with communicating data statistics through schema metadata
>> is that it's not compatible with use cases where you want to know the
>> schema *before* the data is produced.
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On Thu, 23 May 2024 14:28:43 -0000
>> "Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)"
>> <schakravo...@bloomberg.net> wrote:
>> > This is a really exciting development, thank you for putting together this
>> > proposal!
>> >
>> > It looks like this thread and the linked GitHub issue have lots of input
>> > from folks who work with Arrow at a low level and have better familiarity
>> > with the Arrow specifications than I do, so I'll refrain from commenting
>> > on the technicalities of the proposal.
>> > I would, however, like to share my perspective as an application
>> > developer who heavily uses Arrow at higher levels for composing data
>> > systems.
>> >
>> > My main concern with the direction of this proposal is that it seems too
>> > narrowly focused on what the integration with DuckDB will look like (how
>> > the statistics can be fed into DuckDB). In many applications, executing
>> > the query is often the "last mile", and it's important to consider where
>> > the statistics will actually come from. To start, data might be sourced
>> > in various ways:
>> >
>> > - Arrow IPC files may be mapped from shared memory
>> > - Arrow IPC streams may be received via some RPC framework (à la Flight)
>> > - The Arrow libraries may be used to read from file formats like Parquet
>> >   or CSV
>> > - ADBC drivers may be used to read from databases
>> >
>> > Note that in at least the first two cases, the system _executing the
>> > query_ will not be able to provide statistics simply because it is not
>> > actually the data producer. As an example, if Process A writes an Arrow
>> > IPC file to shared memory, and Process B wants to run a query on it -- how
>> > is Process B supposed to get the statistics for query planning? There are
>> > a few approaches that I anticipate application developers might consider:
>> >
>> > 1. Design an out-of-band mechanism for Process B to fetch statistics from
>> >    Process A.
>> > 2. Design an encoding that is a superset of Arrow IPC and includes
>> >    statistics information, allowing statistics to be communicated in-band.
>> > 3. Use custom schema metadata to communicate statistics in-band.
>> >
>> > Options 1 and 2 require considerably more effort than Option 3. Also,
>> > Option 3 feels somewhat natural because it makes sense for the statistics
>> > to come with the data (similar to how statistics are embedded in Parquet
>> > files). In some sense, the statistics actually *are* a property of the
>> > stream.
>> >
>> > In systems that I work on, we already use schema metadata to communicate
>> > information that is unrelated to the structure of the data. From my
>> > reading of the documentation [1], this sounds like a reasonable (and
>> > perhaps intended?) use of metadata, and nowhere is it mentioned that
>> > metadata must be used to determine schema equivalence. Unless there are
>> > other ways of producing stream-level application metadata outside of the
>> > schema/field metadata, the lack of purity was not a concern for me to
>> > begin with.
>> >
>> > I would appreciate an approach that communicates statistics via schema
>> > metadata, or at least in some in-band fashion that is consistent across
>> > the IPC and C data specifications. This would make it much easier to
>> > uniformly and transparently plumb statistics through applications,
>> > regardless of where they source Arrow data from. As developers are likely
>> > to create bespoke conventions for this anyways, it seems reasonable to
>> > standardize it as canonical metadata.
>> >
>> > I say this all as a happy user of DuckDB's Arrow scan functionality that
>> > is excited to see better query optimization capabilities. It's just that,
>> > in its current form, the changes in this proposal are not something I
>> > could foreseeably integrate with.
>> >
>> > Best,
>> > Shoumyo
>> >
>> > [1]:
>> > https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
>> >
>> > From: dev@arrow.apache.org At: 05/23/24 10:10:51 UTC-4:00
>> > To: dev@arrow.apache.org
>> > Subject: Re: [DISCUSS] Statistics through the C data interface
>> >
>> > I want to +1 on what Dewey is saying here and some comments.
>> >
>> > Sutou Kouhei wrote:
>> > > ADBC may be a bit larger to use only for transmitting statistics.
>> > > ADBC has statistics related APIs but it has more other APIs.
>> >
>> > It's impossible to keep the responsibility of communication protocols
>> > cleanly separated, but IMO, we should strive to keep the C Data
>> > Interface more of a Transport Protocol than an Application Protocol.
>> >
>> > Statistics are application-dependent and can complicate the
>> > implementation of importers/exporters, which would hinder the adoption
>> > of the C Data Interface. Statistics also bring in security concerns
>> > that are application-specific, e.g. can an algorithm trust min/max
>> > stats and risk producing incorrect results if the statistics are
>> > incorrect? That question can't really be answered at the C Data
>> > Interface level.
>> >
>> > The need for more sophisticated statistics only grows with time, so
>> > there is no such thing as a "simple statistics schema".
>> >
>> > Protocols that produce/consume statistics might want to use the C Data
>> > Interface as a primitive for passing Arrow arrays of statistics.
>> >
>> > ADBC might be too big of a leap in complexity now, but "we just need C
>> > Data Interface + statistics" is unlikely to remain true for very long
>> > as projects grow in complexity.
>> >
>> > --
>> > Felipe
>> >
>> > On Thu, May 23, 2024 at 9:57 AM Dewey Dunnington
>> > <de...@voltrondata.com.invalid> wrote:
>> > >
>> > > Thank you for the background! I understand that these statistics are
>> > > important for query planning; however, I am not sure that I follow why
>> > > we are constrained to the ArrowSchema to represent them. The examples
>> > > given seem to be going through Python... would it be easier to request
>> > > statistics at a higher level of abstraction? There would already need
>> > > to be a separate mechanism to request an ArrowArrayStream with
>> > > statistics (unless the PyCapsule `requested_schema` argument would
>> > > suffice).
>> > >
>> > > > ADBC may be a bit larger to use only for transmitting
>> > > > statistics. ADBC has statistics related APIs but it has more
>> > > > other APIs.
>> > >
>> > > Some examples of producers given in the linked threads (Delta Lake,
>> > > Arrow Dataset) are well-suited to being wrapped by an ADBC driver. One
>> > > can implement an ADBC driver without defining all the methods (where
>> > > the producer could call AdbcConnectionGetStatistics(), although
>> > > AdbcStatementGetStatistics() might be more relevant here and doesn't
>> > > exist). One example listed (using an Arrow Table as a source) seems a
>> > > bit light to wrap in an ADBC driver; however, it would not take much
>> > > code to do so, and the overhead of getting the reader via ADBC is
>> > > something like 100 microseconds (tested via the ADBC R package's
>> > > "monkey driver", which wraps an existing stream as a statement). In any
>> > > case, the bulk of the code is building the statistics array.
>> > >
>> > > > How about the following schema for the
>> > > > statistics ArrowArray? It's based on ADBC.
>> > >
>> > > Whatever format for statistics is decided on, I imagine it should be
>> > > exactly the same as the ADBC standard? (Perhaps pushing changes
>> > > upstream if needed?)
>> > >
>> > > On Thu, May 23, 2024 at 3:21 AM Sutou Kouhei <k...@clear-code.com> wrote:
>> > > >
>> > > > Hi,
>> > > >
>> > > > > Why not simply pass the statistics ArrowArray separately in your
>> > > > > producer API of choice
>> > > >
>> > > > It seems that we should use that approach because all
>> > > > feedback said so. How about the following schema for the
>> > > > statistics ArrowArray? It's based on ADBC.
>> > > >
>> > > > | Field Name               | Field Type            | Comments |
>> > > > |--------------------------|-----------------------|----------|
>> > > > | column_name              | utf8                  | (1)      |
>> > > > | statistic_key            | utf8 not null         | (2)      |
>> > > > | statistic_value          | VALUE_SCHEMA not null |          |
>> > > > | statistic_is_approximate | bool not null         | (3)      |
>> > > >
>> > > > 1. If null, then the statistic applies to the entire table.
>> > > >    It's for "row_count".
>> > > > 2. We'll provide pre-defined keys such as "max", "min",
>> > > >    "byte_width" and "distinct_count", but users can also use
>> > > >    application-specific keys.
>> > > > 3. If true, then the value is approximate or best-effort.
>> > > >
>> > > > VALUE_SCHEMA is a dense union with members:
>> > > >
>> > > > | Field Name | Field Type |
>> > > > |------------|------------|
>> > > > | int64      | int64      |
>> > > > | uint64     | uint64     |
>> > > > | float64    | float64    |
>> > > > | binary     | binary     |
>> > > >
>> > > > If a column is an int32 column, it uses int64 for
>> > > > "max"/"min". We don't provide all types here. Users should
>> > > > use a compatible type (int64 for an int32 column) instead.
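
For reference, a rough pyarrow sketch of an array using this proposed schema
(the statistics are modeled here as a record batch; the example values and
construction details are illustrative only, not an agreed-on API):

import pyarrow as pa

# VALUE_SCHEMA: dense union of the value types listed above.
value_type = pa.dense_union([
    pa.field("int64", pa.int64()),
    pa.field("uint64", pa.uint64()),
    pa.field("float64", pa.float64()),
    pa.field("binary", pa.binary()),
])

statistics_schema = pa.schema([
    pa.field("column_name", pa.utf8()),               # null => table-level statistic
    pa.field("statistic_key", pa.utf8(), nullable=False),
    pa.field("statistic_value", value_type, nullable=False),
    pa.field("statistic_is_approximate", pa.bool_(), nullable=False),
])

# Two example statistics: table-level "row_count" = 1000, and "max" = 42
# for an int32 column "a" (widened to int64 as described above).
statistic_values = pa.UnionArray.from_dense(
    pa.array([0, 0], type=pa.int8()),      # both rows use the int64 child
    pa.array([0, 1], type=pa.int32()),     # offsets into that child
    [
        pa.array([1000, 42], type=pa.int64()),
        pa.array([], type=pa.uint64()),
        pa.array([], type=pa.float64()),
        pa.array([], type=pa.binary()),
    ],
    ["int64", "uint64", "float64", "binary"],
)

statistics = pa.record_batch(
    [
        pa.array([None, "a"], type=pa.utf8()),
        pa.array(["row_count", "max"], type=pa.utf8()),
        statistic_values,
        pa.array([False, True], type=pa.bool_()),
    ],
    schema=statistics_schema,
)

An array built this way could be handed to the consumer through whichever
producer API is agreed on, independently of the C Data Interface schema itself.
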
>> > > >
>> > > > Thanks,
>> > > > --
>> > > > kou
>> > > >
>> > > > In <a3ce5e96-176c-4226-9d74-6a458317a...@python.org>
>> > > >   "Re: [DISCUSS] Statistics through the C data interface" on Wed, 22 May 2024 17:04:57 +0200,
>> > > >   Antoine Pitrou <anto...@python.org> wrote:
>> > > >
>> > > > >
>> > > > > Hi Kou,
>> > > > >
>> > > > > I agree with Dewey that this is overstretching the capabilities of the
>> > > > > C Data Interface. In particular, stuffing a pointer as a metadata value
>> > > > > and decreeing it immortal doesn't sound like a good design decision.
>> > > > >
>> > > > > Why not simply pass the statistics ArrowArray separately in your
>> > > > > producer API of choice (Dewey mentioned ADBC but it is of course just
>> > > > > a possible API among others)?
>> > > > >
>> > > > > Regards
>> > > > >
>> > > > > Antoine.
>> > > > >
>> > > > >
>> > > > > On 22/05/2024 04:37, Sutou Kouhei wrote:
>> > > > >> Hi,
>> > > > >>
>> > > > >> We're discussing how to provide statistics through the C
>> > > > >> data interface at:
>> > > > >> https://github.com/apache/arrow/issues/38837
>> > > > >>
>> > > > >> If you're interested in this feature, could you share your
>> > > > >> comments?
>> > > > >>
>> > > > >> Motivation:
>> > > > >>
>> > > > >> We can interchange Apache Arrow data by the C data interface
>> > > > >> in the same process. For example, we can pass Apache Arrow
>> > > > >> data read by Apache Arrow C++ (provider) to DuckDB
>> > > > >> (consumer) through the C data interface.
>> > > > >>
>> > > > >> A provider may know Apache Arrow data statistics. For
>> > > > >> example, a provider can know statistics when it reads Apache
>> > > > >> Parquet data because Apache Parquet may provide statistics.
>> > > > >> But a consumer can't know statistics that are known by a
>> > > > >> producer, because there isn't a standard way to provide
>> > > > >> statistics through the C data interface. If a consumer can
>> > > > >> know the statistics, it can process Apache Arrow data faster
>> > > > >> based on them.
>> > > > >>
>> > > > >> Proposal:
>> > > > >> https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
>> > > > >>
>> > > > >> How about providing statistics as metadata in ArrowSchema?
>> > > > >>
>> > > > >> We reserve the "ARROW" namespace for internal Apache Arrow use:
>> > > > >> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
>> > > > >>
>> > > > >>> The ARROW pattern is a reserved namespace for internal
>> > > > >>> Arrow use in the custom_metadata fields. For example,
>> > > > >>> ARROW:extension:name.
>> > > > >>
>> > > > >> So we can use "ARROW:statistics" for the metadata key.
>> > > > >>
>> > > > >> We can represent statistics as an ArrowArray like ADBC does.
>> > > > >>
>> > > > >> Here is an example ArrowSchema for a record batch that has
>> > > > >> "int32 column1" and "string column2":
>> > > > >>
>> > > > >> ArrowSchema {
>> > > > >>   .format = "+siu",
>> > > > >>   .metadata = {
>> > > > >>     "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row count */
>> > > > >>   },
>> > > > >>   .children = {
>> > > > >>     ArrowSchema {
>> > > > >>       .name = "column1",
>> > > > >>       .format = "i",
>> > > > >>       .metadata = {
>> > > > >>         "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
>> > > > >>       },
>> > > > >>     },
>> > > > >>     ArrowSchema {
>> > > > >>       .name = "column2",
>> > > > >>       .format = "u",
>> > > > >>       .metadata = {
>> > > > >>         "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
>> > > > >>       },
>> > > > >>     },
>> > > > >>   },
>> > > > >> }
>> > > > >>
>> > > > >> The metadata value (the ArrowArray* part) of '"ARROW:statistics"
>> > > > >> => ArrowArray*' is a base-10 string of the address of the
>> > > > >> ArrowArray, because we can only use strings for metadata
>> > > > >> values. You can't release the statistics ArrowArray*. (Its
>> > > > >> release is a no-op function.) It follows
>> > > > >> https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
>> > > > >> semantics. (The base ArrowSchema owns the statistics
>> > > > >> ArrowArray*.)
>> > > > >>
>> > > > >> The ArrowArray* for statistics uses the following schema:
>> > > > >>
>> > > > >> | Field Name     | Field Type              | Comments |
>> > > > >> |----------------|-------------------------|----------|
>> > > > >> | key            | string not null         | (1)      |
>> > > > >> | value          | `VALUE_SCHEMA` not null |          |
>> > > > >> | is_approximate | bool not null           | (2)      |
>> > > > >>
>> > > > >> 1. We'll provide pre-defined keys such as "max", "min",
>> > > > >>    "byte_width" and "distinct_count", but users can also use
>> > > > >>    application-specific keys.
>> > > > >> 2. If true, then the value is approximate or best-effort.
>> > > > >>
>> > > > >> VALUE_SCHEMA is a dense union with members:
>> > > > >>
>> > > > >> | Field Name | Field Type                       | Comments |
>> > > > >> |------------|----------------------------------|----------|
>> > > > >> | int64      | int64                            |          |
>> > > > >> | uint64     | uint64                           |          |
>> > > > >> | float64    | float64                          |          |
>> > > > >> | value      | The same type as the ArrowSchema | (3)      |
>> > > > >> |            | it belongs to.                   |          |
>> > > > >>
>> > > > >> 3. If the ArrowSchema's type is string, this type is also string.
>> > > > >>
>> > > > >> TODO: Is "value" a good name? If we refer to it from the
>> > > > >> top-level statistics schema, we need to use
>> > > > >> "value.value". It's a bit strange...
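
For illustration of the consumer side of this proposal, a rough Python sketch
of decoding such a metadata value, assuming pyarrow's low-level
`Array._import_from_c` hook and a `statistics_type` built from the schema
above (names and details are illustrative only):

import pyarrow as pa

def read_statistics_metadata(schema: pa.Schema, statistics_type: pa.DataType):
    # The metadata value is the ArrowArray's address as a base-10 string.
    raw = (schema.metadata or {}).get(b"ARROW:statistics")
    if raw is None:
        return None
    address = int(raw)
    # Import through the C Data Interface. Under this proposal the array's
    # release callback is a no-op, so importing it does not free memory
    # that is owned by the exporting ArrowSchema.
    return pa.Array._import_from_c(address, statistics_type)
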
>> > > > >>
>> > > > >> What do you think about this proposal? Could you share your
>> > > > >> comments?
>> > > > >>
>> > > > >> Thanks,