Agreed, it doesn't seem like a good idea to require users of the C data interface to also depend on the IPC format. JSON sounds more reasonable in that case.
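For concreteness, here is a minimal sketch of what the JSON route could look like from Python (pyarrow is just one possible producer; the key names are only the ones floated earlier in this thread and are still up for discussion):

    import json
    import pyarrow as pa

    # Producer: attach table-level statistics as a JSON string under the
    # reserved "ARROW:statistics" schema metadata key.
    statistics = [
        {"key": "row_count", "value": 29, "value_type": "uint64",
         "is_approximate": False},
    ]
    schema = pa.schema(
        [pa.field("column1", pa.int32()), pa.field("column2", pa.string())],
        metadata={"ARROW:statistics": json.dumps(statistics)},
    )

    # Consumer: schema metadata round-trips as bytes, so decode before use.
    decoded = json.loads(schema.metadata[b"ARROW:statistics"])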
Shoumyo

From: dev@arrow.apache.org At: 05/29/24 02:02:23 UTC-4:00 To: dev@arrow.apache.org
Subject: Re: [DISCUSS] Statistics through the C data interface

> Hi,
>
>>> - If you need statistics in the schema then simply encode the 1-row batch
>>>   into an IPC buffer (using the streaming format) or maybe just an IPC
>>>   RecordBatch message since the schema is fixed and store those bytes in the
>>>   schema
>>
>> This would avoid having to define a separate "schema" for
>> the JSON metadata
>
> Right. What I'm worried about with this approach is that
> it may not match the C data interface.
>
> In the C data interface, we don't use the IPC format. If we
> want to transmit statistics with the schema through the C data
> interface, we need to mix the IPC format and the C data
> interface. (This is why I used the address in my first
> proposal.)
>
> Note that we can use a separate API to transmit statistics
> instead of embedding statistics into the schema for this case.
>
> I think JSON is easier to use for both the IPC format and
> the C data interface. Statistics data will not be large, so
> this will not affect performance.
>
>> If we do go down the JSON route, how about something like
>> this to avoid defining the keys for all possible statistics up
>> front:
>>
>>   Schema {
>>     custom_metadata: {
>>       "ARROW:statistics" => "[ { \"key\": \"row_count\", \"value\": 29, \"value_type\": \"uint64\", \"is_approximate\": false } ]"
>>     }
>>   }
>>
>> It's more verbose, but more closely mirrors the Arrow array
>> schema defined for statistics getter APIs. This could make it
>> easier to translate between the two.
>
> Thanks. I didn't think of it.
> It makes sense.
>
> Thanks,
> --
> kou
>
> In <665673b500015f5808ce0...@message.bloomberg.net>
>   "Re: [DISCUSS] Statistics through the C data interface" on Wed, 29 May 2024 00:15:49 -0000,
>   "Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)" <schakravo...@bloomberg.net> wrote:
>
>> Thanks for addressing the feedback! I didn't know that an
>> Arrow IPC `Message` (not just Schema) could also contain
>> `custom_metadata` -- thanks for pointing it out.
>>
>>> Based on the list, how about standardizing both of the
>>> following for statistics?
>>>
>>> 1. An Apache Arrow schema for statistics that is used by a
>>>    separate statistics getter API
>>> 2. An "ARROW:statistics" metadata format that can be used in
>>>    Apache Arrow schema metadata
>>>
>>> Users can use 1. and/or 2. based on their use cases.
>>
>> This sounds good to me. Using JSON to represent the metadata
>> for #2 also sounds reasonable. I think elsewhere on this
>> thread, Weston mentioned that we could alternatively use
>> the schema defined for #1 and directly use that to encode
>> the schema metadata as an Arrow IPC RecordBatch:
>>
>>> This has been something that has always been desired for the Arrow IPC
>>> format too.
>>>
>>> My preference would be (apologies if this has been mentioned before):
>>>
>>> - Agree on how statistics should be encoded into an array (this is not
>>>   hard, we just have to agree on the field order and the data type for
>>>   null_count)
>>> - If you need statistics in the schema then simply encode the 1-row batch
>>>   into an IPC buffer (using the streaming format) or maybe just an IPC
>>>   RecordBatch message since the schema is fixed and store those bytes in the
>>>   schema
>>
>> This would avoid having to define a separate "schema" for
>> the JSON metadata, but might be more effort to work with in
>> certain contexts (e.g. a library that currently only needs the
>> C data interface would now also have to learn how to parse
>> Arrow IPC).
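>> For illustration, a rough pyarrow sketch of that alternative; the names
>> `statistics_batch` (a one-row batch in whatever layout we agree on) and
>> `data_schema` are placeholders:
>>
>>     import pyarrow as pa
>>     import pyarrow.ipc as ipc
>>
>>     # Encode the one-row statistics batch with the IPC streaming format...
>>     sink = pa.BufferOutputStream()
>>     with ipc.new_stream(sink, statistics_batch.schema) as writer:
>>         writer.write_batch(statistics_batch)
>>     ipc_bytes = sink.getvalue().to_pybytes()
>>
>>     # ...and store the raw bytes in the data schema's metadata.
>>     data_schema = data_schema.with_metadata({"ARROW:statistics": ipc_bytes})
>>
>>     # A consumer then has to parse IPC to get the statistics back:
>>     statistics_batch = ipc.open_stream(
>>         pa.BufferReader(data_schema.metadata[b"ARROW:statistics"])
>>     ).read_next_batch()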
>>
>> If we do go down the JSON route, how about something like
>> this to avoid defining the keys for all possible statistics up
>> front:
>>
>>   Schema {
>>     custom_metadata: {
>>       "ARROW:statistics" => "[ { \"key\": \"row_count\", \"value\": 29, \"value_type\": \"uint64\", \"is_approximate\": false } ]"
>>     }
>>   }
>>
>> It's more verbose, but more closely mirrors the Arrow array
>> schema defined for statistics getter APIs. This could make it
>> easier to translate between the two.
>>
>> Thanks,
>> Shoumyo
>>
>> From: dev@arrow.apache.org At: 05/26/24 21:48:52 UTC-4:00 To: dev@arrow.apache.org
>> Subject: Re: [DISCUSS] Statistics through the C data interface
>>
>>> Hi,
>>>
>>>> To start, data might be sourced in various manners:
>>>>
>>>> - Arrow IPC files may be mapped from shared memory
>>>> - Arrow IPC streams may be received via some RPC framework (à la Flight)
>>>> - The Arrow libraries may be used to read from file formats like Parquet or CSV
>>>> - ADBC drivers may be used to read from databases
>>>
>>> Thanks for listing them.
>>>
>>> Regarding the first case:
>>>
>>> Using schema metadata may be a reasonable approach because
>>> the Arrow data will be on the page cache. There is no
>>> significant read cost. We don't need to read statistics
>>> before the Arrow data is ready.
>>>
>>> But if the Arrow data will not be produced based on
>>> statistics of the Arrow data, a separate statistics getter API
>>> may be better.
>>>
>>> Regarding the second case:
>>>
>>> Schema metadata is one approach for it, but we can choose
>>> other approaches for this case. For example, Flight has
>>> FlightData::app_metadata [1] and an Arrow IPC message has
>>> custom_metadata [2], as Dewey mentioned.
>>>
>>> [1] https://github.com/apache/arrow/blob/1c9e393b73195840960dfb9eca8c0dc390be751a/format/Flight.proto#L512-L515
>>> [2] https://github.com/apache/arrow/blob/1c9e393b73195840960dfb9eca8c0dc390be751a/format/Message.fbs#L154
>>>
>>> Regarding the third case:
>>>
>>> Reader objects will provide statistics. For example,
>>> parquet::ColumnChunkMetaData::statistics()
>>> (parquet::ParquetFileReader::metadata()->RowGroup(X)->ColumnChunk(Y)->statistics())
>>> will provide statistics.
>>>
>>> Regarding the fourth case:
>>>
>>> We can use the ADBC API.
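>>>
>>> (As an illustration of the third case, the same information is already
>>> reachable from Python; roughly, with an arbitrary "data.parquet" file:)
>>>
>>>     import pyarrow.parquet as pq
>>>
>>>     # Row group / column chunk statistics come from the Parquet metadata,
>>>     # so a reader can surface them without scanning the data.
>>>     metadata = pq.ParquetFile("data.parquet").metadata
>>>     stats = metadata.row_group(0).column(0).statistics
>>>     if stats is not None and stats.has_min_max:
>>>         print(stats.min, stats.max, stats.null_count)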
>>>
>>> Based on the list, how about standardizing both of the
>>> following for statistics?
>>>
>>> 1. An Apache Arrow schema for statistics that is used by a
>>>    separate statistics getter API
>>> 2. An "ARROW:statistics" metadata format that can be used in
>>>    Apache Arrow schema metadata
>>>
>>> Users can use 1. and/or 2. based on their use cases.
>>>
>>> Regarding 2.: How about the following?
>>>
>>> This uses Field::custom_metadata [3] and
>>> Schema::custom_metadata [4].
>>>
>>> [3] https://github.com/apache/arrow/blob/main/format/Schema.fbs#L528-L529
>>> [4] https://github.com/apache/arrow/blob/1c9e393b73195840960dfb9eca8c0dc390be751a/format/Schema.fbs#L563-L564
>>>
>>> "ARROW:statistics" in Field::custom_metadata represents
>>> column-level statistics. It uses JSON like we did for
>>> "ARROW:extension:metadata" [5]. Here is an example:
>>>
>>>   Field {
>>>     custom_metadata: {
>>>       "ARROW:statistics" => "{\"max\": 1, \"distinct_count\": 29}"
>>>     }
>>>   }
>>>
>>> (JSON may not be able to represent complex information, but
>>> is that needed for statistics?)
>>>
>>> "ARROW:statistics" in Schema::custom_metadata represents
>>> table-level statistics. It uses JSON like we did for
>>> "ARROW:extension:metadata" [5]. Here is an example:
>>>
>>>   Schema {
>>>     custom_metadata: {
>>>       "ARROW:statistics" => "{\"row_count\": 29}"
>>>     }
>>>   }
>>>
>>> TODO: Define the JSON content details. For example, we need
>>> to define keys such as "distinct_count" and "row_count".
>>>
>>> [5] https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types
>>>
>>> Thanks,
>>> --
>>> kou
>>>
>>> In <664f529b0002a8710c430...@message.bloomberg.net>
>>>   "Re: [DISCUSS] Statistics through the C data interface" on Thu, 23 May 2024 14:28:43 -0000,
>>>   "Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)" <schakravo...@bloomberg.net> wrote:
>>>
>>>> This is a really exciting development, thank you for putting together this proposal!
>>>>
>>>> It looks like this thread and the linked GitHub issue have lots of input from
>>>> folks who work with Arrow at a low level and have better familiarity with the
>>>> Arrow specifications than I do, so I'll refrain from commenting on the
>>>> technicalities of the proposal. I would, however, like to share my perspective
>>>> as an application developer that heavily uses Arrow at higher levels for
>>>> composing data systems.
>>>>
>>>> My main concern with the direction of this proposal is that it seems too
>>>> narrowly focused on what the integration with DuckDB will look like (how the
>>>> statistics can be fed into DuckDB). In many applications, executing the query
>>>> is often the "last mile", and it's important to consider where the statistics
>>>> will actually come from. To start, data might be sourced in various manners:
>>>>
>>>> - Arrow IPC files may be mapped from shared memory
>>>> - Arrow IPC streams may be received via some RPC framework (à la Flight)
>>>> - The Arrow libraries may be used to read from file formats like Parquet or CSV
>>>> - ADBC drivers may be used to read from databases
>>>>
>>>> Note that in at least the first two cases, the system _executing the query_
>>>> will not be able to provide statistics simply because it is not actually the
>>>> data producer. As an example, if Process A writes an Arrow IPC file to shared
>>>> memory, and Process B wants to run a query on it -- how is Process B supposed
>>>> to get the statistics for query planning? There are a few approaches that I
>>>> anticipate application developers might consider:
>>>>
>>>> 1. Design an out-of-band mechanism for Process B to fetch statistics from Process A.
>>>> 2. Design an encoding that is a superset of Arrow IPC and includes statistics
>>>>    information, allowing statistics to be communicated in-band.
>>>> 3. Use custom schema metadata to communicate statistics in-band.
>>>>
>>>> Options 1 and 2 require considerably more effort than Option 3. Also, Option 3
>>>> feels somewhat natural because it makes sense for the statistics to come with
>>>> the data (similar to how statistics are embedded in Parquet files). In some
>>>> sense, the statistics actually *are* a property of the stream.
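>>>>
>>>> To make Option 3 concrete, a rough pyarrow sketch of the shared-memory
>>>> case; the file path and the metadata payload are purely illustrative:
>>>>
>>>>     import json
>>>>     import pyarrow as pa
>>>>     import pyarrow.ipc as ipc
>>>>
>>>>     # Process A: write an IPC file whose schema carries statistics in-band.
>>>>     schema = pa.schema(
>>>>         [pa.field("column1", pa.int32())],
>>>>         metadata={"ARROW:statistics": json.dumps({"row_count": 3})},
>>>>     )
>>>>     batch = pa.record_batch([pa.array([1, 2, 3], pa.int32())], schema=schema)
>>>>     with ipc.new_file("/dev/shm/data.arrow", schema) as writer:
>>>>         writer.write_batch(batch)
>>>>
>>>>     # Process B: map the file and read the statistics before planning the scan.
>>>>     with pa.memory_map("/dev/shm/data.arrow") as source:
>>>>         reader = ipc.open_file(source)
>>>>         stats = json.loads(reader.schema.metadata[b"ARROW:statistics"])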
>>>>
>>>> In systems that I work on, we already use schema metadata to communicate
>>>> information that is unrelated to the structure of the data. From my reading of
>>>> the documentation [1], this sounds like a reasonable (and perhaps intended?)
>>>> use of metadata, and nowhere is it mentioned that metadata must be used to
>>>> determine schema equivalence. Unless there are other ways of producing
>>>> stream-level application metadata outside of the schema/field metadata, the
>>>> lack of purity was not a concern for me to begin with.
>>>>
>>>> I would appreciate an approach that communicates statistics via schema
>>>> metadata, or at least in some in-band fashion that is consistent across the IPC
>>>> and C data specifications. This would make it much easier to uniformly and
>>>> transparently plumb statistics through applications, regardless of where they
>>>> source Arrow data from. As developers are likely to create bespoke conventions
>>>> for this anyways, it seems reasonable to standardize it as canonical metadata.
>>>>
>>>> I say this all as a happy user of DuckDB's Arrow scan functionality that is
>>>> excited to see better query optimization capabilities. It's just that, in its
>>>> current form, the changes in this proposal are not something I could
>>>> foreseeably integrate with.
>>>>
>>>> Best,
>>>> Shoumyo
>>>>
>>>> [1]: https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
>>>>
>>>> From: dev@arrow.apache.org At: 05/23/24 10:10:51 UTC-4:00 To: dev@arrow.apache.org
>>>> Subject: Re: [DISCUSS] Statistics through the C data interface
>>>>
>>>> I want to +1 on what Dewey is saying here and add some comments.
>>>>
>>>> Sutou Kouhei wrote:
>>>>> ADBC may be a bit larger to use only for transmitting statistics. ADBC has
>>>>> statistics related APIs but it has more other APIs.
>>>>
>>>> It's impossible to keep the responsibilities of communication protocols
>>>> cleanly separated, but IMO, we should strive to keep the C Data
>>>> Interface more of a Transport Protocol than an Application Protocol.
>>>>
>>>> Statistics are application dependent and can complicate the
>>>> implementation of importers/exporters, which would hinder the adoption
>>>> of the C Data Interface. Statistics also bring in security concerns
>>>> that are application-specific, e.g. can an algorithm trust min/max
>>>> stats and risk producing incorrect results if the statistics are
>>>> incorrect? That is a question that can't really be answered at the C Data
>>>> Interface level.
>>>>
>>>> The need for more sophisticated statistics only grows with time, so
>>>> there is no such thing as a "simple statistics schema".
>>>>
>>>> Protocols that produce/consume statistics might want to use the C Data
>>>> Interface as a primitive for passing Arrow arrays of statistics.
>>>>
>>>> ADBC might be too big of a leap in complexity now, but "we just need the C
>>>> Data Interface + statistics" is unlikely to remain true for very long
>>>> as projects grow in complexity.
>>>>
>>>> --
>>>> Felipe
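>>>>
>>>> (Passing statistics as "just another Arrow array" over the C Data
>>>> Interface is already straightforward from Python; for example, with the
>>>> PyCapsule protocol in recent pyarrow. A sketch, not a prescription, and
>>>> the statistics layout shown is made up:)
>>>>
>>>>     import pyarrow as pa
>>>>
>>>>     # A producer-defined statistics array, exported like any other array.
>>>>     statistics = pa.array([{"key": "row_count", "value": 29}])
>>>>     schema_capsule, array_capsule = statistics.__arrow_c_array__()
>>>>     # A consumer imports the capsules with its own Arrow implementation.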
>>>>
>>>> On Thu, May 23, 2024 at 9:57 AM Dewey Dunnington
>>>> <de...@voltrondata.com.invalid> wrote:
>>>>>
>>>>> Thank you for the background! I understand that these statistics are
>>>>> important for query planning; however, I am not sure that I follow why
>>>>> we are constrained to the ArrowSchema to represent them. The examples
>>>>> given seem to be going through Python... would it be easier to request
>>>>> statistics at a higher level of abstraction? There would already need
>>>>> to be a separate mechanism to request an ArrowArrayStream with
>>>>> statistics (unless the PyCapsule `requested_schema` argument would
>>>>> suffice).
>>>>>
>>>>>> ADBC may be a bit larger to use only for transmitting
>>>>>> statistics. ADBC has statistics related APIs but it has more
>>>>>> other APIs.
>>>>>
>>>>> Some examples of producers given in the linked threads (Delta Lake,
>>>>> Arrow Dataset) are well-suited to being wrapped by an ADBC driver. One
>>>>> can implement an ADBC driver without defining all the methods (where
>>>>> the producer could call AdbcConnectionGetStatistics(), although
>>>>> AdbcStatementGetStatistics() might be more relevant here and doesn't
>>>>> exist). One example listed (using an Arrow Table as a source) seems a
>>>>> bit light to wrap in an ADBC driver; however, it would not take much
>>>>> code to do so, and the overhead of getting the reader via ADBC is
>>>>> something like 100 microseconds (tested via the ADBC R package's
>>>>> "monkey driver", which wraps an existing stream as a statement). In any
>>>>> case, the bulk of the code is building the statistics array.
>>>>>
>>>>>> How about the following schema for the
>>>>>> statistics ArrowArray? It's based on ADBC.
>>>>>
>>>>> Whatever format for statistics is decided on, I imagine it should be
>>>>> exactly the same as the ADBC standard? (Perhaps pushing changes
>>>>> upstream if needed?)
>>>>>
>>>>> On Thu, May 23, 2024 at 3:21 AM Sutou Kouhei <k...@clear-code.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>> Why not simply pass the statistics ArrowArray separately in your
>>>>>>> producer API of choice
>>>>>>
>>>>>> It seems that we should use that approach because all
>>>>>> feedback said so. How about the following schema for the
>>>>>> statistics ArrowArray? It's based on ADBC.
>>>>>>
>>>>>> | Field Name               | Field Type            | Comments |
>>>>>> |--------------------------|-----------------------|----------|
>>>>>> | column_name              | utf8                  | (1)      |
>>>>>> | statistic_key            | utf8 not null         | (2)      |
>>>>>> | statistic_value          | VALUE_SCHEMA not null |          |
>>>>>> | statistic_is_approximate | bool not null         | (3)      |
>>>>>>
>>>>>> 1. If null, then the statistic applies to the entire table.
>>>>>>    It's for "row_count".
>>>>>> 2. We'll provide pre-defined keys such as "max", "min",
>>>>>>    "byte_width" and "distinct_count", but users can also use
>>>>>>    application-specific keys.
>>>>>> 3. If true, then the value is approximate or best-effort.
>>>>>>
>>>>>> VALUE_SCHEMA is a dense union with members:
>>>>>>
>>>>>> | Field Name | Field Type |
>>>>>> |------------|------------|
>>>>>> | int64      | int64      |
>>>>>> | uint64     | uint64     |
>>>>>> | float64    | float64    |
>>>>>> | binary     | binary     |
>>>>>>
>>>>>> If a column is an int32 column, it uses int64 for
>>>>>> "max"/"min". We don't provide all types here. Users should
>>>>>> use a compatible type (int64 for an int32 column) instead.
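>>>>>>
>>>>>> (In pyarrow terms, the proposed statistics schema would be roughly the
>>>>>> following; just a sketch to make the field types concrete:)
>>>>>>
>>>>>>     import pyarrow as pa
>>>>>>
>>>>>>     # Dense union holding each statistic value in one of a few wide types.
>>>>>>     value_type = pa.dense_union([
>>>>>>         pa.field("int64", pa.int64()),
>>>>>>         pa.field("uint64", pa.uint64()),
>>>>>>         pa.field("float64", pa.float64()),
>>>>>>         pa.field("binary", pa.binary()),
>>>>>>     ])
>>>>>>
>>>>>>     statistics_schema = pa.schema([
>>>>>>         pa.field("column_name", pa.utf8()),  # null means table-level
>>>>>>         pa.field("statistic_key", pa.utf8(), nullable=False),
>>>>>>         pa.field("statistic_value", value_type, nullable=False),
>>>>>>         pa.field("statistic_is_approximate", pa.bool_(), nullable=False),
>>>>>>     ])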
>>>>>>
>>>>>> Thanks,
>>>>>> --
>>>>>> kou
>>>>>>
>>>>>> In <a3ce5e96-176c-4226-9d74-6a458317a...@python.org>
>>>>>>   "Re: [DISCUSS] Statistics through the C data interface" on Wed, 22 May 2024 17:04:57 +0200,
>>>>>>   Antoine Pitrou <anto...@python.org> wrote:
>>>>>>
>>>>>>> Hi Kou,
>>>>>>>
>>>>>>> I agree with Dewey that this is overstretching the capabilities of the
>>>>>>> C Data Interface. In particular, stuffing a pointer as a metadata value
>>>>>>> and decreeing it immortal doesn't sound like a good design decision.
>>>>>>>
>>>>>>> Why not simply pass the statistics ArrowArray separately in your
>>>>>>> producer API of choice (Dewey mentioned ADBC but it is of course just
>>>>>>> a possible API among others)?
>>>>>>>
>>>>>>> Regards
>>>>>>>
>>>>>>> Antoine.
>>>>>>>
>>>>>>>
>>>>>>> On 22/05/2024 04:37, Sutou Kouhei wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We're discussing how to provide statistics through the C
>>>>>>>> data interface at:
>>>>>>>>
>>>>>>>>   https://github.com/apache/arrow/issues/38837
>>>>>>>>
>>>>>>>> If you're interested in this feature, could you share your
>>>>>>>> comments?
>>>>>>>>
>>>>>>>> Motivation:
>>>>>>>>
>>>>>>>> We can interchange Apache Arrow data through the C data interface
>>>>>>>> within the same process. For example, we can pass Apache Arrow
>>>>>>>> data read by Apache Arrow C++ (producer) to DuckDB
>>>>>>>> (consumer) through the C data interface.
>>>>>>>>
>>>>>>>> A producer may know Apache Arrow data statistics. For
>>>>>>>> example, a producer can know statistics when it reads Apache
>>>>>>>> Parquet data because Apache Parquet may provide statistics.
>>>>>>>>
>>>>>>>> But a consumer can't know statistics that are known by the
>>>>>>>> producer, because there isn't a standard way to provide
>>>>>>>> statistics through the C data interface. If a consumer could
>>>>>>>> know the statistics, it could process Apache Arrow data faster
>>>>>>>> based on them.
>>>>>>>>
>>>>>>>> Proposal:
>>>>>>>>
>>>>>>>>   https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
>>>>>>>>
>>>>>>>> How about providing statistics as metadata in ArrowSchema?
>>>>>>>>
>>>>>>>> We reserve the "ARROW" namespace for internal Apache Arrow use:
>>>>>>>>
>>>>>>>>   https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
>>>>>>>>
>>>>>>>>> The ARROW pattern is a reserved namespace for internal
>>>>>>>>> Arrow use in the custom_metadata fields. For example,
>>>>>>>>> ARROW:extension:name.
>>>>>>>>
>>>>>>>> So we can use "ARROW:statistics" for the metadata key.
>>>>>>>>
>>>>>>>> We can represent statistics as an ArrowArray like ADBC does.
>>>>>>>>
>>>>>>>> Here is an example ArrowSchema that is for a record batch
>>>>>>>> that has "int32 column1" and "string column2":
>>>>>>>>
>>>>>>>>   ArrowSchema {
>>>>>>>>     .format = "+siu",
>>>>>>>>     .metadata = {
>>>>>>>>       "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row count */
>>>>>>>>     },
>>>>>>>>     .children = {
>>>>>>>>       ArrowSchema {
>>>>>>>>         .name = "column1",
>>>>>>>>         .format = "i",
>>>>>>>>         .metadata = {
>>>>>>>>           "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
>>>>>>>>         },
>>>>>>>>       },
>>>>>>>>       ArrowSchema {
>>>>>>>>         .name = "column2",
>>>>>>>>         .format = "u",
>>>>>>>>         .metadata = {
>>>>>>>>           "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
>>>>>>>>         },
>>>>>>>>       },
>>>>>>>>     },
>>>>>>>>   }
>>>>>>>>
>>>>>>>> The metadata value (the ArrowArray* part) of '"ARROW:statistics"
>>>>>>>> => ArrowArray*' is a base-10 string of the address of the
>>>>>>>> ArrowArray, because we can only use strings for metadata
>>>>>>>> values. You can't release the statistics ArrowArray*. (Its
>>>>>>>> release is a no-op function.) It follows the
>>>>>>>> https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
>>>>>>>> semantics. (The base ArrowSchema owns the statistics
>>>>>>>> ArrowArray*.)
>>>>>>>>
>>>>>>>> The ArrowArray* for statistics uses the following schema:
>>>>>>>>
>>>>>>>> | Field Name     | Field Type              | Comments |
>>>>>>>> |----------------|-------------------------|----------|
>>>>>>>> | key            | string not null         | (1)      |
>>>>>>>> | value          | `VALUE_SCHEMA` not null |          |
>>>>>>>> | is_approximate | bool not null           | (2)      |
>>>>>>>>
>>>>>>>> 1. We'll provide pre-defined keys such as "max", "min",
>>>>>>>>    "byte_width" and "distinct_count", but users can also use
>>>>>>>>    application-specific keys.
>>>>>>>> 2. If true, then the value is approximate or best-effort.
>>>>>>>>
>>>>>>>> VALUE_SCHEMA is a dense union with members:
>>>>>>>>
>>>>>>>> | Field Name | Field Type                        | Comments |
>>>>>>>> |------------|-----------------------------------|----------|
>>>>>>>> | int64      | int64                             |          |
>>>>>>>> | uint64     | uint64                            |          |
>>>>>>>> | float64    | float64                           |          |
>>>>>>>> | value      | The same type as the ArrowSchema  | (3)      |
>>>>>>>> |            | that it belongs to.               |          |
>>>>>>>>
>>>>>>>> 3. If the ArrowSchema's type is string, this type is also string.
>>>>>>>>
>>>>>>>> TODO: Is "value" a good name? If we refer to it from the
>>>>>>>> top-level statistics schema, we need to use
>>>>>>>> "value.value". It's a bit strange...
>>>>>>>>
>>>>>>>> What do you think about this proposal? Could you share your
>>>>>>>> comments?
>>>>>>>>
>>>>>>>> Thanks,