Re: [I] [Format] Passing column statistics through Arrow C data interface [arrow]

via GitHub Wed, 22 May 2024 23:46:24 -0700


kou commented on issue #38837:
URL: https://github.com/apache/arrow/issues/38837#issuecomment-2126352420


   Thanks for joining this discussion.
   
   My understanding of "up front" is the same as what you explained.
   
   https://github.com/apache/arrow/issues/38837#issuecomment-2123728784 uses 
`ArrowSchema`, not `ArrowArray`/`ArrowArrayStream`, so it doesn't carry the 
data and the statistics together. I think the flow would look like the 
following. (`ArrowSchema` is consulted before `ArrowArrayStream` is used.)
   
   ```text
   left_stats = get_statistics(get_schema(left_table), left_table.join_id, filter)
   right_stats = get_statistics(get_schema(right_table), right_table.join_id, filter)
   if left_stats.approx_count_distinct < right_stats.approx_count_distinct:
     hashtable = make_hashtable(load_array(left_table.join_id))
     probe(hashtable, right_table)
   else:
     hashtable = make_hashtable(load_array(right_table.join_id))
     probe(hashtable, left_table)
   ```
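
The flow above can be sketched as runnable Python. This is only an
illustration under stated assumptions: `Table`, `get_statistics`,
`load_array`, `make_hashtable`, and `probe` are hypothetical stand-ins, not
Arrow APIs; the statistic is stored directly on the table rather than read
through an `ArrowSchema`; and the `filter` argument is omitted.

```python
from dataclasses import dataclass


@dataclass
class Table:
    """Hypothetical stand-in for a data source; not an Arrow type."""
    name: str
    join_id: list               # join key values, loaded only on demand
    approx_count_distinct: int  # statistic assumed available before loading


def get_statistics(table):
    # Stand-in for reading statistics up front, without touching the data.
    return table.approx_count_distinct


def load_array(table):
    # Stand-in for actually loading the join key column.
    return table.join_id


def make_hashtable(keys):
    # Map each key to the list of row indices where it appears.
    hashtable = {}
    for row, key in enumerate(keys):
        hashtable.setdefault(key, []).append(row)
    return hashtable


def probe(hashtable, keys):
    # Return (build_row, probe_row) pairs for every matching key.
    return [(build_row, probe_row)
            for probe_row, key in enumerate(keys)
            for build_row in hashtable.get(key, [])]


def hash_join(left, right):
    # Build the hash table on the side with fewer distinct join keys.
    if get_statistics(left) < get_statistics(right):
        build, probe_side = left, right
    else:
        build, probe_side = right, left
    hashtable = make_hashtable(load_array(build))
    return build.name, probe(hashtable, load_array(probe_side))


left = Table("left", [1, 2, 2, 3], approx_count_distinct=3)
right = Table("right", [2, 3, 4, 5, 6], approx_count_distinct=5)
build_side, pairs = hash_join(left, right)
```

The point of the sketch is that the build side is chosen from the statistics
alone, before any join key column is loaded.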
   
   Anyway, judging from the mailing list discussion, the `ArrowSchema`-based 
approach doesn't seem to be a good fit: 
https://lists.apache.org/thread/z0jz2bnv61j7c6lbk7lympdrs49f69cx
   We'll provide a separate API that returns the statistics as an `ArrowArray`, 
like https://github.com/apache/arrow/issues/38837#issuecomment-2074371230 .


