drin commented on issue #38837:
URL: https://github.com/apache/arrow/issues/38837#issuecomment-2126435173

   Coming from the mailing list, I caught up on some of the comments above. I 
agree that a separate API call is nicer than packing statistics into the 
schema. Packing into the schema doesn't seem bad to me, but it certainly seems 
more limited. A separate API call also adds flexibility in where the 
statistics come from, in case their source differs from the schema's (for 
example, with a table view or a distributed table).
   
   I also interpret [Dewey's 
comment](https://lists.apache.org/thread/dsvc0zo7q5gk3f8smn1q82ton0wpk1rd) as: 
the schema should describe the "structure" of the data, whereas the statistics 
describe the "content" of the data. This aligns with Weston's point that the 
statistics are essentially "just another record batch." I agree with both of 
these.
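
To make the structure/content split concrete, here is a minimal sketch in plain Python. It is only an illustration: the `Field` class, the `column`/`key`/`value` layout, and the `lookup` helper are all hypothetical and are not part of Arrow or of any proposal in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field: structure only."""
    name: str
    type: str


# Schema: describes the structure (column names and types), nothing else.
schema = [Field("id", "int64"), Field("price", "float64")]

# Statistics: describes the content, laid out column-wise like a record batch.
# The column names ("column", "key", "value") and the values are made up.
statistics = {
    "column": ["id", "id", "price"],
    "key": ["distinct_count", "null_count", "max"],
    "value": [1000, 0, 99.5],
}


def lookup(stats, column, key):
    """Return the statistic for (column, key), or None if it is absent."""
    for c, k, v in zip(stats["column"], stats["key"], stats["value"]):
        if c == column and k == key:
            return v
    return None
```

Because the statistics are just rows, a consumer can ignore them entirely, and a producer can omit any statistic it does not have without touching the schema.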
   
   Below are some relevant points I considered; I see no downside to using an 
additional API call:
   * duckdb does eager binding. This only requires the schema, to determine 
the number of columns and their types 
([arrow.cpp#L238-L254](https://github.com/duckdb/duckdb/blob/v0.10.3/src/function/table/arrow.cpp#L238-L254)).
   * statistics are only considered at optimization time. Relative chronology 
is: binding -> logical plan in hand -> invoke optimizer.
   * duckdb's optimization workflow calls `EstimateCardinality` on 
`LogicalOperator` 
([join_order_optimizer.cpp#L68-L75](https://github.com/duckdb/duckdb/blob/v0.10.3/src/optimizer/join_order/join_order_optimizer.cpp#L68-L75)).
 Delegating from there to arrow through an API call is trivial, and there's no 
need for the statistics to go through the schema.
   * as of now, duckdb seems to set cardinality statistics as record batches 
are processed 
([arrow.cpp#L389-L400](https://github.com/duckdb/duckdb/blob/v0.10.3/src/function/table/arrow.cpp#L389-L400)).
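
The chronology in the points above can be sketched roughly as follows. This is a toy model in Python, not duckdb or arrow code: `TableSource`, `get_schema`, `get_statistics`, `bind`, and `estimate_cardinality` are hypothetical names chosen for illustration, assuming a producer that exposes statistics through a call separate from the schema.

```python
class TableSource:
    """Hypothetical producer exposing schema and statistics separately."""

    def __init__(self, schema, row_count):
        self._schema = schema
        self._row_count = row_count
        self.stats_calls = 0  # counts how often statistics were requested

    def get_schema(self):
        # Structure only: column names and types.
        return self._schema

    def get_statistics(self):
        # Separate call; could be served from a different source than the schema.
        self.stats_calls += 1
        return {"row_count": self._row_count}


def bind(source):
    """Binding step: needs only the schema (column count and types)."""
    return {"source": source, "columns": source.get_schema()}


def estimate_cardinality(plan):
    """Optimization step: the only place statistics are requested."""
    return plan["source"].get_statistics()["row_count"]


source = TableSource([("id", "int64"), ("price", "float64")], row_count=1000)
plan = bind(source)                       # binding -> logical plan in hand
calls_after_bind = source.stats_calls     # still 0: binding never asked for stats
cardinality = estimate_cardinality(plan)  # optimizer asks for statistics here
```

In this sketch the statistics call happens exactly once, at optimization time, and the binding step never touches it.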
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

