drin commented on issue #38837: URL: https://github.com/apache/arrow/issues/38837#issuecomment-2126435173
Coming from the mailing list, I caught up on some of the comments above. I agree that a separate API call is nicer than packing statistics into the schema. Packing into the schema doesn't seem bad to me, but it certainly seems more limited.

An additional case to consider is that a separate API call provides flexibility in where the statistics come from, in case they come from a different source than the schema (for example, a table view or a distributed table).

I also interpret [Dewey's comment](https://lists.apache.org/thread/dsvc0zo7q5gk3f8smn1q82ton0wpk1rd) as: the schema should describe the "structure" of the data, whereas the statistics describe the "content" of the data. This aligns with Weston's point that the statistics are essentially "just another record batch." I agree with both of these.

Below are some relevant points I tried to consider, and I see no downside to using an additional API call:

* duckdb does eager binding. This only requires the schema, to determine the number of columns and their types ([arrow.cpp#L238-L254](https://github.com/duckdb/duckdb/blob/v0.10.3/src/function/table/arrow.cpp#L238-L254)).
* statistics are only considered at optimization time. The relative chronology is: binding -> logical plan in hand -> invoke optimizer.
* duckdb's optimization workflow calls `EstimateCardinality` on `LogicalOperator` ([join_order_optimizer.cpp#L68-L75](https://github.com/duckdb/duckdb/blob/v0.10.3/src/optimizer/join_order/join_order_optimizer.cpp#L68-L75)). Delegating through an API call to arrow is trivial, and there's no need for it to go through the schema.
* as of now, duckdb seems to set cardinality statistics as record batches are processed ([arrow.cpp#L389-L400](https://github.com/duckdb/duckdb/blob/v0.10.3/src/function/table/arrow.cpp#L389-L400)).
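To make the ordering in the comment above concrete, here is a minimal sketch of the "separate API call" shape: binding touches only the schema, and statistics are requested only when the optimizer asks for a cardinality estimate. Every name here (`HypotheticalSource`, `get_statistics`, `LogicalScan`, `bind`, `optimize`, the default of 1000 rows) is an assumption made up for illustration; none of it is duckdb's or Arrow's actual API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class HypotheticalSource:
    """Stand-in for an Arrow data producer (hypothetical, not a real Arrow API)."""
    schema: List[Tuple[str, str]]      # (column name, type name): the "structure"
    row_count: Optional[int] = None    # a statistic: the "content", kept out of the schema
    stats_calls: int = 0               # counts how often statistics were requested

    def get_schema(self) -> List[Tuple[str, str]]:
        return self.schema

    def get_statistics(self) -> Dict[str, Optional[int]]:
        # Separate call: it could be answered by a different source than the schema.
        self.stats_calls += 1
        return {"row_count": self.row_count}


@dataclass
class LogicalScan:
    """Hypothetical logical operator produced by binding."""
    source: HypotheticalSource
    column_names: List[str]
    column_types: List[str]

    def estimate_cardinality(self, default: int = 1000) -> int:
        # Delegate to the statistics call; fall back to a guess if unknown.
        row_count = self.source.get_statistics().get("row_count")
        return default if row_count is None else row_count


def bind(source: HypotheticalSource) -> LogicalScan:
    # Eager binding: only the schema is needed (column count and types).
    schema = source.get_schema()
    return LogicalScan(source, [n for n, _ in schema], [t for _, t in schema])


def optimize(plan: LogicalScan) -> int:
    # Optimization time: the first point at which statistics are consulted.
    return plan.estimate_cardinality()


source = HypotheticalSource(schema=[("id", "int64"), ("name", "utf8")], row_count=42)
plan = bind(source)
estimate = optimize(plan)
```

In this sketch the statistics call is never made during `bind`, which mirrors the chronology of binding first, then the logical plan, then the optimizer.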
