alamb opened a new issue, #7957:
URL: https://github.com/apache/arrow-datafusion/issues/7957
### Is your feature request related to a problem or challenge?
While looking at TPCH query performance for #6782 I noticed several times
that `CoalesceBatchesExec` takes non trivial amounts of time (like 5% of the
overall query time)
Here is a specific examples
#### Create Data:
```shell
cd arrow-datafusion/benchmarks
./bench.sh data tpch10
```
#### Run query with datafusion-cli:
```shell
cd arrow-datafusion/benchmarks/data/tpch_sf10
datafusion-cli -c "select o_year, sum(case when nation = 'BRAZIL'
then volume else 0 end) / sum(volume) as mkt_share from ( select
extract(year from o_orderdate) as o_year, l_extendedprice * (1 -
l_discount) as volume, n2.n_name as nation from part, supplier,
lineitem, orders, customer, nation n1, nation n2, region
where p_partkey = l_partkey and s_suppkey = l_suppkey and l_orderkey
= o_orderkey and o_custkey = c_custkey and c_nationkey =
n1.n_nationkey and n1.n_regionkey = r_regionkey and r_name =
'AMERICA' and s_nationkey = n2.n_nationkey and o_orderdate between date
'1995-01-01' and date '1996-12-31' and p_type = 'ECONOMY ANODIZED STEEL' ) as
all_nations group by o_year order by o_year;"
```
```sql
select o_year,
sum(case when nation = 'BRAZIL' then volume else 0 end) /
sum(volume) as mkt_share
from (
select
extract(year from o_orderdate) as o_year,
l_extendedprice * (1 - l_discount) as volume,
n2.n_name as nation
from
part,
supplier,
lineitem,
orders,
customer,
nation n1,
nation n2,
region
where
p_partkey = l_partkey
and s_suppkey = l_suppkey
and l_orderkey = o_orderkey
and o_custkey = c_custkey
and c_nationkey = n1.n_nationkey
and n1.n_regionkey = r_regionkey
and r_name = 'AMERICA'
and s_nationkey = n2.n_nationkey
and o_orderdate between date '1995-01-01' and date '1996-12-31' and
p_type = 'ECONOMY ANODIZED STEEL' ) as all_nations
group by
o_year
order by
o_year;"
```
Here is the full `EXPLAIN ANLAYZE` output:
[explan-analyze-q8.txt](https://github.com/apache/arrow-datafusion/files/13191730/explan-analyze-q8.txt)
A small subset shows there is a single `CoalesceBatchesExec` that takes 3
seconds (`elapsed_compute=3.066514072s`):
```
CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=59986052,
elapsed_compute=3.066514072s]
|
RepartitionExec: partitioning=Hash([l_partkey@1], 16),
input_partitions=16, metrics=[fetch_time=16.321190026s,
repart_time=10.382230411s, send_time=4.650058274s]
|
ParquetExec: file_groups={16 groups:
[[Users/alamb/Software/arrow-datafusion/benchmarks/data/tpch_sf10
```
I profiled the query and confirmed that `CoalesceBatchExec` takes 5% of the
overall time, as shown in this screen shot

### Describe the solution you'd like
I think we can avoid this overhead by combining the behavior of
`CoalesceBatchesExec` into the operators that make small batches
(`FilterExec`, `JoinExec`, and `RepartitionExec`)
### Describe alternatives you've considered
_No response_
### Additional context
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]