Re: [I] [Format] Passing column statistics through Arrow C data interface [arrow]

via GitHub Thu, 23 May 2024 00:39:46 -0700


drin commented on issue #38837:
URL: https://github.com/apache/arrow/issues/38837#issuecomment-2126435173

Coming from the mailing list, I caught up on some of the comments above. I
agree that a separate API call is nicer than packing statistics into the
schema. Packing them into the schema doesn't seem bad to me, but it does seem
more limited. Another case to consider: a separate API call adds flexibility in
where the statistics come from, since they may come from a different source
than the schema (for example, with a table view or a distributed table).

I also interpret [Dewey's
comment](https://lists.apache.org/thread/dsvc0zo7q5gk3f8smn1q82ton0wpk1rd) as:
the schema should describe the "structure" of the data, whereas the statistics
describe the "content" of the data. This aligns with Weston's point that the
statistics are essentially "just another record batch." I agree with both of
these.
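
To make the structure/content split concrete, here is a minimal sketch in plain
Python. It uses stand-in types rather than real Arrow objects, and the
statistics layout (one row per column and statistic key) and the names `Field`,
`statistics_schema`, and `lookup` are assumptions for illustration, not a
proposed format:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Field:
    """Stand-in for a schema field: a name and a type, nothing else."""
    name: str
    type: str


# The data schema describes structure only: column names and types.
schema: List[Field] = [Field("id", "int64"), Field("price", "float64")]

# Statistics are ordinary tabular data with their own (hypothetical) schema:
# one row per (column, statistic key, value).
statistics_schema: List[Field] = [
    Field("column", "utf8"),
    Field("key", "utf8"),
    Field("value", "float64"),
]
statistics_rows: List[Tuple[str, str, float]] = [
    ("id", "null_count", 0.0),
    ("id", "distinct_count", 1000.0),
    ("price", "min", 0.5),
    ("price", "max", 99.9),
]


def lookup(rows: List[Tuple[str, str, float]], column: str, key: str) -> Optional[float]:
    """Return the value of one statistic for one column, or None if absent."""
    for row_column, row_key, value in rows:
        if row_column == column and row_key == key:
            return value
    return None
```

Because the statistics are just rows, a producer can add or omit statistics
without touching the data schema at all.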

Below are some relevant points I considered; I see no downside to using an
additional API call:
* duckdb does eager binding. This only requires the schema, to determine the
number of columns and their types
([arrow.cpp#L238-L254](https://github.com/duckdb/duckdb/blob/v0.10.3/src/function/table/arrow.cpp#L238-L254)).
* statistics are only considered at optimization time. The relative chronology
is: binding -> logical plan in hand -> invoke optimizer.
* duckdb's optimization workflow calls `EstimateCardinality` on
`LogicalOperator`
([join_order_optimizer.cpp#L68-L75](https://github.com/duckdb/duckdb/blob/v0.10.3/src/optimizer/join_order/join_order_optimizer.cpp#L68-L75)).
Delegating through an API call to arrow is trivial, and there's no need for it
to go through the schema.
* as of now, duckdb seems to set cardinality statistics as record batches are
processed
([arrow.cpp#L389-L400](https://github.com/duckdb/duckdb/blob/v0.10.3/src/function/table/arrow.cpp#L389-L400)).
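
The chronology in the points above can be sketched roughly as follows. This is
plain Python, not duckdb or Arrow code: `BoundScan`, `bind`,
`estimate_cardinality`, `CountingProvider`, and the `row_count` key are all
hypothetical names chosen for illustration. The point is only that binding
touches the schema, while the statistics call is deferred until the optimizer
asks for a cardinality estimate, and may be served by a different source than
the schema:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class BoundScan:
    """Result of binding: column names and types, plus a deferred statistics call."""
    column_names: List[str]
    column_types: List[str]
    get_statistics: Callable[[], Dict[str, int]]


def bind(schema: List[Tuple[str, str]],
         stats_provider: Callable[[], Dict[str, int]]) -> BoundScan:
    # Binding reads only the schema; the statistics provider is stored, not called.
    names = [name for name, _ in schema]
    types = [type_ for _, type_ in schema]
    return BoundScan(names, types, stats_provider)


def estimate_cardinality(scan: BoundScan, default: int = 1000) -> int:
    # Optimization time: this is the first point where statistics are requested.
    stats = scan.get_statistics()
    return stats.get("row_count", default)


class CountingProvider:
    """Hypothetical statistics source that records how often it is called."""

    def __init__(self, stats: Dict[str, int]) -> None:
        self.stats = stats
        self.calls = 0

    def __call__(self) -> Dict[str, int]:
        self.calls += 1
        return self.stats


provider = CountingProvider({"row_count": 42})
scan = bind([("id", "int64"), ("price", "float64")], provider)
calls_after_bind = provider.calls      # binding did not touch the statistics
estimate = estimate_cardinality(scan)  # the optimizer triggers the separate call
```

Since the provider is just a callable, it could equally be backed by a catalog
or a remote service rather than by whatever produced the schema.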

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
