xudong963 commented on pull request #1831:
URL: https://github.com/apache/arrow-datafusion/pull/1831#issuecomment-1041021293


   > The number of rows might be more often available as statistic than the 
total size in bytes. Both in external metadata but currently also inside our 
own statistics.
   
   Yep, I agree.
   
   > Also it might be also good to have a look at the time it takes to 
construct the hash table. Having a smaller table on the left with 1M rows might 
be slower to construct than a bigger (in bytes) table of 1K rows (e.g. a table 
containing some larger documents such as JSONs, lots of columns, etc.). From 
that perspective, number of rows might be more useful as long as it fits in 
memory.
   
   Thanks, @Dandandan. That case makes sense to me; I hadn't thought of it 
before.
   
   > So I think we should look at the size in bytes if it is available and 
otherwise estimate the size based on the number of rows and data types involved 
(e.g. int32 -> 4 * number of rows)
   
   Nice idea. Specifically, if the statistics don't include the size in bytes, 
we'd need to get the data types of all columns in the table and multiply their 
total width by the number of rows. But that result is the byte size of the 
table itself, not of the hash table, isn't it?
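
   To make the estimation idea concrete, here is a minimal sketch (not 
DataFusion's actual API; the type enum, the assumed average width for 
variable-length types, and the function names are all hypothetical) of 
deriving a byte-size estimate from a row count and the column data types:

   ```rust
   // Hypothetical column data types for illustration only.
   #[derive(Debug)]
   enum DataType {
       Int32,
       Int64,
       Float64,
       Utf8,
   }

   // Per-value width in bytes. Fixed-width types are exact; for
   // variable-length types like Utf8 we have to assume an average width.
   fn byte_width(dt: &DataType) -> usize {
       match dt {
           DataType::Int32 => 4,
           DataType::Int64 | DataType::Float64 => 8,
           DataType::Utf8 => 16, // assumed average string length
       }
   }

   // Estimate: num_rows * sum of per-column widths (e.g. int32 -> 4 * rows).
   fn estimate_size_bytes(num_rows: usize, schema: &[DataType]) -> usize {
       num_rows * schema.iter().map(byte_width).sum::<usize>()
   }

   fn main() {
       let schema = [DataType::Int32, DataType::Int64];
       // 1000 rows * (4 + 8) bytes per row = 12000 bytes
       println!("{}", estimate_size_bytes(1000, &schema));
   }
   ```

   Note this estimates the in-memory size of the raw table, not of the hash 
table built from it, which carries extra overhead (buckets, hashes, offsets).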


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
