Hi,

>> - If you need statistics in the schema then simply encode the 1-row batch
>>   into an IPC buffer (using the streaming format) or maybe just an IPC
>>   RecordBatch message since the schema is fixed and store those bytes in the
>>   schema
>
> This would avoid having to define a separate "schema" for
> the JSON metadata
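For concreteness, here is a minimal sketch of that IPC-embedding idea, assuming
pyarrow; the statistics field layout and the "ARROW:statistics" key below are
only illustrative, not an agreed format:

    import pyarrow as pa

    # A 1-row batch holding the statistics themselves. The exact field layout
    # is what would still need to be agreed on.
    stats_batch = pa.record_batch(
        [pa.array([29], type=pa.int64())],
        names=["row_count"],
    )

    # Encode it with the IPC streaming format and store the raw bytes as a
    # schema metadata value.
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, stats_batch.schema) as writer:
        writer.write_batch(stats_batch)

    data_schema = pa.schema([pa.field("column1", pa.int32())])
    data_schema = data_schema.with_metadata(
        {"ARROW:statistics": sink.getvalue().to_pybytes()}
    )

    # A consumer reads the statistics back out of the metadata.
    stats = pa.ipc.open_stream(
        data_schema.metadata[b"ARROW:statistics"]
    ).read_all()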
Right. What I'm worried about with this approach is that it may not match the
C data interface. In the C data interface, we don't use the IPC format. If we
want to transmit statistics together with the schema through the C data
interface, we would need to mix the IPC format and the C data interface.
(This is why I used the address in my first proposal.) Note that we can use a
separate API to transmit statistics instead of embedding statistics into the
schema for this case.

I thought that JSON would be easier to use for both the IPC format and the
C data interface. Statistics data will not be large, so this will not affect
performance.

> If we do go down the JSON route, how about something like
> this to avoid defining the keys for all possible statistics up
> front:
>
> Schema {
>   custom_metadata: {
>     "ARROW:statistics" => "[ { \"key\": \"row_count\", \"value\": 29, \"value_type\": \"uint64\", \"is_approximate\": false } ]"
>   }
> }
>
> It's more verbose, but more closely mirrors the Arrow array
> schema defined for statistics getter APIs. This could make it
> easier to translate between the two.

Thanks. I hadn't thought of that. It makes sense.


Thanks,
--
kou

In <665673b500015f5808ce0...@message.bloomberg.net>
  "Re: [DISCUSS] Statistics through the C data interface" on Wed, 29 May 2024 00:15:49 -0000,
  "Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)" <schakravo...@bloomberg.net> wrote:

> Thanks for addressing the feedback! I didn't know that an
> Arrow IPC `Message` (not just Schema) could also contain
> `custom_metadata` -- thanks for pointing it out.
>
>> Based on the list, how about standardizing both of the
>> following for statistics?
>>
>> 1. Apache Arrow schema for statistics that is used by a
>>    separate statistics getter API
>> 2. "ARROW:statistics" metadata format that can be used in
>>    Apache Arrow schema metadata
>>
>> Users can use 1. and/or 2. based on their use cases.
>
> This sounds good to me. Using JSON to represent the metadata
> for #2 also sounds reasonable. I think elsewhere on this
> thread, Weston mentioned that we could alternatively use
> the schema defined for #1 and directly use that to encode
> the schema metadata as an Arrow IPC RecordBatch:
>
>> This has been something that has always been desired for the Arrow IPC
>> format too.
>>
>> My preference would be (apologies if this has been mentioned before):
>>
>> - Agree on how statistics should be encoded into an array (this is not
>>   hard, we just have to agree on the field order and the data type for
>>   null_count)
>> - If you need statistics in the schema then simply encode the 1-row batch
>>   into an IPC buffer (using the streaming format) or maybe just an IPC
>>   RecordBatch message since the schema is fixed and store those bytes in the
>>   schema
>
> This would avoid having to define a separate "schema" for
> the JSON metadata, but might be more effort to work with in
> certain contexts (e.g. a library that currently only needs the
> C data interface would now also have to learn how to parse
> Arrow IPC).
>
> If we do go down the JSON route, how about something like
> this to avoid defining the keys for all possible statistics up
> front:
>
> Schema {
>   custom_metadata: {
>     "ARROW:statistics" => "[ { \"key\": \"row_count\", \"value\": 29, \"value_type\": \"uint64\", \"is_approximate\": false } ]"
>   }
> }
>
> It's more verbose, but more closely mirrors the Arrow array
> schema defined for statistics getter APIs. This could make it
> easier to translate between the two.
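As a minimal sketch of how that JSON-array form could be produced and consumed,
assuming pyarrow (the key names follow the example above and are not yet
standardized):

    import json
    import pyarrow as pa

    stats = [
        {"key": "row_count", "value": 29,
         "value_type": "uint64", "is_approximate": False},
    ]

    # Producer: attach the JSON document as schema-level metadata.
    schema = pa.schema([pa.field("column1", pa.int32())])
    schema = schema.with_metadata({"ARROW:statistics": json.dumps(stats)})

    # Consumer: metadata keys and values are exposed as bytes.
    decoded = json.loads(schema.metadata[b"ARROW:statistics"])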
>
> Thanks,
> Shoumyo
>
> From: dev@arrow.apache.org At: 05/26/24 21:48:52 UTC-4:00 To: dev@arrow.apache.org
> Subject: Re: [DISCUSS] Statistics through the C data interface
>
> >> Hi,
> >>
> >>> To start, data might be sourced in various manners:
> >>>
> >>> - Arrow IPC files may be mapped from shared memory
> >>> - Arrow IPC streams may be received via some RPC framework (à la Flight)
> >>> - The Arrow libraries may be used to read from file formats like Parquet or CSV
> >>> - ADBC drivers may be used to read from databases
> >>
> >> Thanks for listing it.
> >>
> >> Regarding the first case:
> >>
> >> Using schema metadata may be a reasonable approach because the Arrow data
> >> will be on the page cache. There is no significant read cost. We don't need
> >> to read statistics before the Arrow data is ready.
> >>
> >> But if the Arrow data will not be produced based on statistics of the Arrow
> >> data, a separate statistics getter API may be better.
> >>
> >> Regarding the second case:
> >>
> >> Schema metadata is one approach for it, but we can choose other approaches
> >> for this case. For example, Flight has FlightData::app_metadata[1] and an
> >> Arrow IPC message has custom_metadata[2], as Dewey mentioned.
> >>
> >> [1] https://github.com/apache/arrow/blob/1c9e393b73195840960dfb9eca8c0dc390be751a/format/Flight.proto#L512-L515
> >> [2] https://github.com/apache/arrow/blob/1c9e393b73195840960dfb9eca8c0dc390be751a/format/Message.fbs#L154
> >>
> >> Regarding the third case:
> >>
> >> Reader objects will provide statistics. For example,
> >> parquet::ColumnChunkMetaData::statistics()
> >> (parquet::ParquetFileReader::metadata()->RowGroup(X)->ColumnChunk(Y)->statistics())
> >> will provide statistics.
> >>
> >> Regarding the fourth case:
> >>
> >> We can use the ADBC API.
> >>
> >>
> >> Based on the list, how about standardizing both of the
> >> following for statistics?
> >>
> >> 1. Apache Arrow schema for statistics that is used by a
> >>    separate statistics getter API
> >> 2. "ARROW:statistics" metadata format that can be used in
> >>    Apache Arrow schema metadata
> >>
> >> Users can use 1. and/or 2. based on their use cases.
> >>
> >> Regarding 2.: How about the following?
> >>
> >> This uses Field::custom_metadata[3] and Schema::custom_metadata[4].
> >>
> >> [3] https://github.com/apache/arrow/blob/main/format/Schema.fbs#L528-L529
> >> [4] https://github.com/apache/arrow/blob/1c9e393b73195840960dfb9eca8c0dc390be751a/format/Schema.fbs#L563-L564
> >>
> >> "ARROW:statistics" in Field::custom_metadata represents column-level
> >> statistics. It uses JSON like we did for "ARROW:extension:metadata"[5].
> >> Here is an example:
> >>
> >>   Field {
> >>     custom_metadata: {
> >>       "ARROW:statistics" => "{\"max\": 1, \"distinct_count\": 29}"
> >>     }
> >>   }
> >>
> >> (JSON may not be able to represent complex information, but is that needed
> >> for statistics?)
> >>
> >> "ARROW:statistics" in Schema::custom_metadata represents table-level
> >> statistics. It uses JSON like we did for "ARROW:extension:metadata"[5].
> >> Here is an example:
> >>
> >>   Schema {
> >>     custom_metadata: {
> >>       "ARROW:statistics" => "{\"row_count\": 29}"
> >>     }
> >>   }
> >>
> >> TODO: Define the JSON content details. For example, we need to define keys
> >> such as "distinct_count" and "row_count".
> >>
> >> [5] https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types
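As a minimal sketch of this field-level and schema-level JSON form, assuming
pyarrow (the keys "max", "distinct_count", and "row_count" are exactly the
still-to-be-defined keys mentioned in the TODO above):

    import json
    import pyarrow as pa

    # Producer: column-level statistics on the field, table-level on the schema.
    column1 = pa.field(
        "column1", pa.int32(),
        metadata={"ARROW:statistics": json.dumps({"max": 1, "distinct_count": 29})},
    )
    schema = pa.schema([column1]).with_metadata(
        {"ARROW:statistics": json.dumps({"row_count": 29})}
    )

    # Consumer: metadata is exposed as bytes -> bytes.
    table_stats = json.loads(schema.metadata[b"ARROW:statistics"])
    column_stats = json.loads(schema.field("column1").metadata[b"ARROW:statistics"])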
> >>
> >>
> >> Thanks,
> >> --
> >> kou
> >>
> >> In <664f529b0002a8710c430...@message.bloomberg.net>
> >>   "Re: [DISCUSS] Statistics through the C data interface" on Thu, 23 May 2024 14:28:43 -0000,
> >>   "Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)" <schakravo...@bloomberg.net> wrote:
> >>
> >>> This is a really exciting development, thank you for putting together this proposal!
> >>>
> >>> It looks like this thread and the linked GitHub issue have lots of input from
> >>> folks who work with Arrow at a low level and have better familiarity with the
> >>> Arrow specifications than I do, so I'll refrain from commenting on the
> >>> technicalities of the proposal. I would, however, like to share my perspective
> >>> as an application developer who heavily uses Arrow at higher levels for
> >>> composing data systems.
> >>>
> >>> My main concern with the direction of this proposal is that it seems too
> >>> narrowly focused on what the integration with DuckDB will look like (how the
> >>> statistics can be fed into DuckDB). In many applications, executing the query
> >>> is often the "last mile", and it's important to consider where the statistics
> >>> will actually come from. To start, data might be sourced in various manners:
> >>>
> >>> - Arrow IPC files may be mapped from shared memory
> >>> - Arrow IPC streams may be received via some RPC framework (à la Flight)
> >>> - The Arrow libraries may be used to read from file formats like Parquet or CSV
> >>> - ADBC drivers may be used to read from databases
> >>>
> >>> Note that in at least the first two cases, the system _executing the query_
> >>> will not be able to provide statistics simply because it is not actually the
> >>> data producer. As an example, if Process A writes an Arrow IPC file to shared
> >>> memory, and Process B wants to run a query on it -- how is Process B supposed
> >>> to get the statistics for query planning? There are a few approaches that I
> >>> anticipate application developers might consider:
> >>>
> >>> 1. Design an out-of-band mechanism for Process B to fetch statistics from
> >>>    Process A.
> >>> 2. Design an encoding that is a superset of Arrow IPC and includes statistics
> >>>    information, allowing statistics to be communicated in-band.
> >>> 3. Use custom schema metadata to communicate statistics in-band.
> >>>
> >>> Options 1 and 2 require considerably more effort than Option 3. Also, Option
> >>> 3 feels somewhat natural because it makes sense for the statistics to come
> >>> with the data (similar to how statistics are embedded in Parquet files). In
> >>> some sense, the statistics actually *are* a property of the stream.
> >>>
> >>> In systems that I work on, we already use schema metadata to communicate
> >>> information that is unrelated to the structure of the data. From my reading of
> >>> the documentation [1], this sounds like a reasonable (and perhaps intended?)
> >>> use of metadata, and nowhere is it mentioned that metadata must be used to
> >>> determine schema equivalence. Unless there are other ways of producing
> >>> stream-level application metadata outside of the schema/field metadata, the
> >>> lack of purity was not a concern for me to begin with.
> >>>
> >>> I would appreciate an approach that communicates statistics via schema
> >>> metadata, or at least in some in-band fashion that is consistent across the
> >>> IPC and C data specifications.
> >>> This would make it much easier to uniformly and
> >>> transparently plumb statistics through applications, regardless of where they
> >>> source Arrow data from. As developers are likely to create bespoke conventions
> >>> for this anyway, it seems reasonable to standardize it as canonical metadata.
> >>>
> >>> I say this all as a happy user of DuckDB's Arrow scan functionality that is
> >>> excited to see better query optimization capabilities. It's just that, in its
> >>> current form, the changes in this proposal are not something I could
> >>> foreseeably integrate with.
> >>>
> >>> Best,
> >>> Shoumyo
> >>>
> >>> [1]: https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
> >>>
> >>> From: dev@arrow.apache.org At: 05/23/24 10:10:51 UTC-4:00 To: dev@arrow.apache.org
> >>> Subject: Re: [DISCUSS] Statistics through the C data interface
> >>>
> >>> I want to +1 on what Dewey is saying here and add some comments.
> >>>
> >>> Sutou Kouhei wrote:
> >>>> ADBC may be a bit larger to use only for transmitting statistics. ADBC has
> >>>> statistics related APIs but it has many other APIs.
> >>>
> >>> It's impossible to keep the responsibility of communication protocols
> >>> cleanly separated, but IMO, we should strive to keep the C Data
> >>> Interface more of a Transport Protocol than an Application Protocol.
> >>>
> >>> Statistics are application dependent and can complicate the
> >>> implementation of importers/exporters, which would hinder the adoption
> >>> of the C Data Interface. Statistics also bring in security concerns
> >>> that are application-specific. e.g. can an algorithm trust min/max
> >>> stats and risk producing incorrect results if the statistics are
> >>> incorrect? A question that can't really be answered at the C Data
> >>> Interface level.
> >>>
> >>> The need for more sophisticated statistics only grows with time, so
> >>> there is no such thing as a "simple statistics schema".
> >>>
> >>> Protocols that produce/consume statistics might want to use the C Data
> >>> Interface as a primitive for passing Arrow arrays of statistics.
> >>>
> >>> ADBC might be too big of a leap in complexity now, but "we just need C
> >>> Data Interface + statistics" is unlikely to remain true for very long
> >>> as projects grow in complexity.
> >>>
> >>> --
> >>> Felipe
> >>>
> >>> On Thu, May 23, 2024 at 9:57 AM Dewey Dunnington
> >>> <de...@voltrondata.com.invalid> wrote:
> >>>>
> >>>> Thank you for the background! I understand that these statistics are
> >>>> important for query planning; however, I am not sure that I follow why
> >>>> we are constrained to the ArrowSchema to represent them. The examples
> >>>> given seem to be going through Python...would it be easier to request
> >>>> statistics at a higher level of abstraction? There would already need
> >>>> to be a separate mechanism to request an ArrowArrayStream with
> >>>> statistics (unless the PyCapsule `requested_schema` argument would
> >>>> suffice).
> >>>>
> >>>> > ADBC may be a bit larger to use only for transmitting
> >>>> > statistics. ADBC has statistics related APIs but it has many
> >>>> > other APIs.
> >>>>
> >>>> Some examples of producers given in the linked threads (Delta Lake,
> >>>> Arrow Dataset) are well-suited to being wrapped by an ADBC driver. One
> >>>> can implement an ADBC driver without defining all the methods (where
> >>>> the producer could call AdbcConnectionGetStatistics(), although
> >>>> AdbcStatementGetStatistics() might be more relevant here and doesn't
> >>>> exist).
> >>>> One example listed (using an Arrow Table as a source) seems a
> >>>> bit light to wrap in an ADBC driver; however, it would not take much
> >>>> code to do so, and the overhead of getting the reader via ADBC is
> >>>> something like 100 microseconds (tested via the ADBC R package's
> >>>> "monkey driver" which wraps an existing stream as a statement). In any
> >>>> case, the bulk of the code is building the statistics array.
> >>>>
> >>>> > How about the following schema for the
> >>>> > statistics ArrowArray? It's based on ADBC.
> >>>>
> >>>> Whatever format for statistics is decided on, I imagine it should be
> >>>> exactly the same as the ADBC standard? (Perhaps pushing changes
> >>>> upstream if needed?)
> >>>>
> >>>> On Thu, May 23, 2024 at 3:21 AM Sutou Kouhei <k...@clear-code.com> wrote:
> >>>> >
> >>>> > Hi,
> >>>> >
> >>>> > > Why not simply pass the statistics ArrowArray separately in your
> >>>> > > producer API of choice
> >>>> >
> >>>> > It seems that we should use that approach because all
> >>>> > feedback said so. How about the following schema for the
> >>>> > statistics ArrowArray? It's based on ADBC.
> >>>> >
> >>>> > | Field Name               | Field Type            | Comments |
> >>>> > |--------------------------|-----------------------|----------|
> >>>> > | column_name              | utf8                  | (1)      |
> >>>> > | statistic_key            | utf8 not null         | (2)      |
> >>>> > | statistic_value          | VALUE_SCHEMA not null |          |
> >>>> > | statistic_is_approximate | bool not null         | (3)      |
> >>>> >
> >>>> > 1. If null, then the statistic applies to the entire table.
> >>>> >    It's for "row_count".
> >>>> > 2. We'll provide pre-defined keys such as "max", "min",
> >>>> >    "byte_width" and "distinct_count", but users can also use
> >>>> >    application-specific keys.
> >>>> > 3. If true, then the value is approximate or best-effort.
> >>>> >
> >>>> > VALUE_SCHEMA is a dense union with members:
> >>>> >
> >>>> > | Field Name | Field Type |
> >>>> > |------------|------------|
> >>>> > | int64      | int64      |
> >>>> > | uint64     | uint64     |
> >>>> > | float64    | float64    |
> >>>> > | binary     | binary     |
> >>>> >
> >>>> > If a column is an int32 column, it uses int64 for
> >>>> > "max"/"min". We don't provide all types here. Users should
> >>>> > use a compatible type (int64 for an int32 column) instead.
> >>>> >
> >>>> >
> >>>> > Thanks,
> >>>> > --
> >>>> > kou
> >>>> >
> >>>> > In <a3ce5e96-176c-4226-9d74-6a458317a...@python.org>
> >>>> >   "Re: [DISCUSS] Statistics through the C data interface" on Wed, 22 May 2024 17:04:57 +0200,
> >>>> >   Antoine Pitrou <anto...@python.org> wrote:
> >>>> >
> >>>> > >
> >>>> > > Hi Kou,
> >>>> > >
> >>>> > > I agree with Dewey that this is overstretching the capabilities of the
> >>>> > > C Data Interface. In particular, stuffing a pointer as a metadata value
> >>>> > > and decreeing it immortal doesn't sound like a good design decision.
> >>>> > >
> >>>> > > Why not simply pass the statistics ArrowArray separately in your
> >>>> > > producer API of choice (Dewey mentioned ADBC but it is of course just
> >>>> > > a possible API among others)?
> >>>> > >
> >>>> > > Regards
> >>>> > >
> >>>> > > Antoine.
> >>>> > >
> >>>> > >
> >>>> > > On 22/05/2024 04:37, Sutou Kouhei wrote:
> >>>> > >> Hi,
> >>>> > >>
> >>>> > >> We're discussing how to provide statistics through the C
> >>>> > >> data interface at:
> >>>> > >> https://github.com/apache/arrow/issues/38837
> >>>> > >>
> >>>> > >> If you're interested in this feature, could you share your
> >>>> > >> comments?
> >>>> > >>
> >>>> > >> Motivation:
> >>>> > >>
> >>>> > >> We can interchange Apache Arrow data through the C data interface
> >>>> > >> within the same process. For example, we can pass Apache Arrow
> >>>> > >> data read by Apache Arrow C++ (producer) to DuckDB
> >>>> > >> (consumer) through the C data interface.
> >>>> > >>
> >>>> > >> A producer may know statistics for the Apache Arrow data. For
> >>>> > >> example, a producer can know statistics when it reads Apache
> >>>> > >> Parquet data because Apache Parquet may provide statistics.
> >>>> > >>
> >>>> > >> But a consumer can't know statistics that are known by a
> >>>> > >> producer, because there isn't a standard way to provide
> >>>> > >> statistics through the C data interface. If a consumer can
> >>>> > >> know statistics, it can process Apache Arrow data faster
> >>>> > >> based on those statistics.
> >>>> > >>
> >>>> > >> Proposal:
> >>>> > >> https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
> >>>> > >>
> >>>> > >> How about providing statistics as metadata in ArrowSchema?
> >>>> > >>
> >>>> > >> We reserve the "ARROW" namespace for internal Apache Arrow use:
> >>>> > >> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
> >>>> > >>
> >>>> > >>> The ARROW pattern is a reserved namespace for internal
> >>>> > >>> Arrow use in the custom_metadata fields. For example,
> >>>> > >>> ARROW:extension:name.
> >>>> > >>
> >>>> > >> So we can use "ARROW:statistics" for the metadata key.
> >>>> > >>
> >>>> > >> We can represent statistics as an ArrowArray like ADBC does.
> >>>> > >>
> >>>> > >> Here is an example ArrowSchema for a record batch
> >>>> > >> that has "int32 column1" and "string column2":
> >>>> > >>
> >>>> > >>   ArrowSchema {
> >>>> > >>     .format = "+s",
> >>>> > >>     .metadata = {
> >>>> > >>       "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row count */
> >>>> > >>     },
> >>>> > >>     .children = {
> >>>> > >>       ArrowSchema {
> >>>> > >>         .name = "column1",
> >>>> > >>         .format = "i",
> >>>> > >>         .metadata = {
> >>>> > >>           "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
> >>>> > >>         },
> >>>> > >>       },
> >>>> > >>       ArrowSchema {
> >>>> > >>         .name = "column2",
> >>>> > >>         .format = "u",
> >>>> > >>         .metadata = {
> >>>> > >>           "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
> >>>> > >>         },
> >>>> > >>       },
> >>>> > >>     },
> >>>> > >>   }
> >>>> > >>
> >>>> > >> The metadata value (the ArrowArray* part) of '"ARROW:statistics"
> >>>> > >> => ArrowArray*' is a base-10 string of the address of the
> >>>> > >> ArrowArray, because we can use only strings for metadata
> >>>> > >> values. You can't release the statistics ArrowArray*. (Its
> >>>> > >> release is a no-op function.) It follows
> >>>> > >> https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
> >>>> > >> semantics. (The base ArrowSchema owns the statistics
> >>>> > >> ArrowArray*.)
> >>>> > >>
> >>>> > >> The ArrowArray* for statistics uses the following schema:
> >>>> > >>
> >>>> > >> | Field Name     | Field Type              | Comments |
> >>>> > >> |----------------|-------------------------|----------|
> >>>> > >> | key            | string not null         | (1)      |
> >>>> > >> | value          | `VALUE_SCHEMA` not null |          |
> >>>> > >> | is_approximate | bool not null           | (2)      |
> >>>> > >>
> >>>> > >> 1. We'll provide pre-defined keys such as "max", "min",
> >>>> > >>    "byte_width" and "distinct_count", but users can also use
> >>>> > >>    application-specific keys.
> >>>> > >> 2. If true, then the value is approximate or best-effort.
> >>>> > >>
> >>>> > >> VALUE_SCHEMA is a dense union with members:
> >>>> > >>
> >>>> > >> | Field Name | Field Type                       | Comments |
> >>>> > >> |------------|----------------------------------|----------|
> >>>> > >> | int64      | int64                            |          |
> >>>> > >> | uint64     | uint64                           |          |
> >>>> > >> | float64    | float64                          |          |
> >>>> > >> | value      | The same type as the ArrowSchema | (3)      |
> >>>> > >> |            | it belongs to.                   |          |
> >>>> > >>
> >>>> > >> 3. If the ArrowSchema's type is string, this type is also string.
> >>>> > >>
> >>>> > >> TODO: Is "value" a good name? If we refer to it from the
> >>>> > >> top-level statistics schema, we need to use
> >>>> > >> "value.value". It's a bit strange...
> >>>> > >>
> >>>> > >> What do you think about this proposal? Could you share your
> >>>> > >> comments?
> >>>> > >>
> >>>> > >> Thanks,
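For reference, here is a minimal sketch of building a statistics array that
follows the ADBC-based layout proposed later in the thread (column_name /
statistic_key / statistic_value / statistic_is_approximate, with a dense-union
value), assuming pyarrow; this only illustrates the proposed layout and is not
an agreed format:

    import pyarrow as pa

    # Dense-union children, one per VALUE_SCHEMA member.
    int64_child = pa.array([29], type=pa.int64())        # row_count value
    uint64_child = pa.array([], type=pa.uint64())
    float64_child = pa.array([], type=pa.float64())
    binary_child = pa.array([], type=pa.binary())

    # One statistic; type id 0 selects the int64 child, offset 0 within it.
    value = pa.UnionArray.from_dense(
        pa.array([0], type=pa.int8()),
        pa.array([0], type=pa.int32()),
        [int64_child, uint64_child, float64_child, binary_child],
        ["int64", "uint64", "float64", "binary"],
    )

    statistics = pa.record_batch(
        [
            pa.array([None], type=pa.utf8()),   # column_name: null => whole table
            pa.array(["row_count"], type=pa.utf8()),
            value,
            pa.array([False], type=pa.bool_()),
        ],
        names=["column_name", "statistic_key", "statistic_value",
               "statistic_is_approximate"],
    )

Such an array could then be handed to a consumer through whatever separate
statistics getter API is agreed on, independently of the data's own
ArrowSchema.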