[
https://issues.apache.org/jira/browse/ARROW-10885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated ARROW-10885:
-----------------------------------
Labels: pull-request-available (was: )
> [Rust][DataFusion] Optimize join build vs probe based on statistics on row
> number
> ---------------------------------------------------------------------------------
>
> Key: ARROW-10885
> URL: https://issues.apache.org/jira/browse/ARROW-10885
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Rust - DataFusion
> Reporter: Daniël Heres
> Assignee: Daniël Heres
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Based on number of rows in a datasource we can optimize which table should be
> part of the build phase and which part of the probe phase in a hash join. We
> should make the (approximately) smallest datasource. This can have a large
> effect on performance if one of the two tables is much bigger than the other,
> as we can skip building a large lookup table.
> Recently we are adding statistics to data sources in DataFusion, so this
> seems something we can add relatively easily. We can approximate the number
> of rows based on underlying statistics in datasources, but it should at least
> work for simple cases first.
> When swapping the order a left join has to be changed to a right join and
> vice versa, inner joins remain the same. Probably it is easier to start with
> inner joins and then add left / right joins.
> Maybe we should also rename some internals to make clear that e.g. the left
> part is part of the build and the right part of the probe.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)