kou commented on issue #38837: URL: https://github.com/apache/arrow/issues/38837#issuecomment-2126352420
Thanks for joining this discussion.

My understanding of "up front" is the same as what you explained.

https://github.com/apache/arrow/issues/38837#issuecomment-2123728784 uses `ArrowSchema`, not `ArrowArray`/`ArrowArrayStream`, so it doesn't contain both data and statistics. I think the following flow is used (`ArrowSchema` is used before we use `ArrowArrayStream`):

```text
left_stats = get_statistics(get_schema(left_table), left_table.join_id, filter)
right_stats = get_statistics(get_schema(right_table), right_table.join_id, filter)
if left_stats.approx_count_distinct < right_stats.approx_count_distinct:
  hashtable = make_hashtable(load_array(left_table.join_id))
  probe(hashtable, right_table)
else:
  hashtable = make_hashtable(load_array(right_table.join_id))
  probe(hashtable, left_table)
```

Anyway, it seems from the mailing list discussion that the `ArrowSchema`-based approach isn't good: https://lists.apache.org/thread/z0jz2bnv61j7c6lbk7lympdrs49f69cx

We'll provide a separate API that provides a statistics `ArrowArray`, like https://github.com/apache/arrow/issues/38837#issuecomment-2074371230 .
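For illustration, the flow above can be sketched as runnable Python. This is only a toy under stated assumptions: tables are plain dicts of lists, `get_statistics` counts distinct values exactly instead of reading precomputed statistics from a schema, the `filter` argument is omitted, and every helper name here (`Statistics`, `get_statistics`, `make_hashtable`, `probe`, `hash_join`) is hypothetical and not part of any Arrow API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Statistics:
    # Hypothetical container; not an Arrow type.
    approx_count_distinct: int


def get_statistics(table, column):
    # Stand-in for a statistics lookup that happens before data is loaded.
    # Assumption: we just count distinct values exactly, for illustration.
    return Statistics(approx_count_distinct=len(set(table[column])))


def make_hashtable(table, column):
    # Map each join key to the row indices where it occurs (build side).
    hashtable = defaultdict(list)
    for row, key in enumerate(table[column]):
        hashtable[key].append(row)
    return hashtable


def probe(hashtable, table, column):
    # Return (build_row, probe_row) pairs for every matching key.
    return [
        (build_row, probe_row)
        for probe_row, key in enumerate(table[column])
        for build_row in hashtable.get(key, [])
    ]


def hash_join(left, right, column):
    # Build the hash table on the side with fewer distinct join keys.
    left_stats = get_statistics(left, column)
    right_stats = get_statistics(right, column)
    if left_stats.approx_count_distinct < right_stats.approx_count_distinct:
        return "left", probe(make_hashtable(left, column), right, column)
    return "right", probe(make_hashtable(right, column), left, column)
```

For example, `hash_join({"join_id": [1, 1, 2]}, {"join_id": [1, 2, 3, 4]}, "join_id")` builds on the left table, because it has fewer distinct join keys. The point is that only the statistics are needed to pick the build side, so the actual data can be loaded afterwards.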
