[ 
https://issues.apache.org/jira/browse/ARROW-10885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10885:
-----------------------------------
    Labels: pull-request-available  (was: )

> [Rust][DataFusion] Optimize join build vs probe based on statistics on row 
> number
> ---------------------------------------------------------------------------------
>
>                 Key: ARROW-10885
>                 URL: https://issues.apache.org/jira/browse/ARROW-10885
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust - DataFusion
>            Reporter: Daniël Heres
>            Assignee: Daniël Heres
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Based on number of rows in a datasource we can optimize which table should be 
> part of the build phase and which part of the probe phase in a hash join. We 
> should make the (approximately) smallest datasource. This can have a large 
> effect on performance if one of the two tables is much bigger than the other, 
> as we can skip building a large lookup table.
> Recently we are adding statistics to data sources in DataFusion, so this 
> seems something we can add relatively easily. We can approximate the number 
> of rows based on underlying statistics in datasources, but it should at least 
> work for simple cases first.
> When swapping the order a left join has to be changed to a right join and 
> vice versa, inner joins remain the same. Probably it is easier to start with 
> inner joins and then add left / right joins.
> Maybe we should also rename some internals to make clear that e.g. the left 
> part is part of the build and the right part of the probe.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to