XiangpengHao commented on issue #11628:
URL: https://github.com/apache/datafusion/issues/11628#issuecomment-2246432690
Got some time to think about this and want to share my thoughts here:
#### Implementation
The goal is to reduce string copying: specifically, to copy the string data only once and construct the view array only once.
I implemented the GC in `concat_batches` on my local branch; the code looks
like this:
```rust
for i in 0..field_num {
    let data_type = schema.field(i).data_type();
    match data_type {
        // For string views, rebuild the array through a builder so the
        // string data is compacted (GC'd) while concatenating.
        &arrow_schema::DataType::Utf8View => {
            let mut string_view_builder = StringViewBuilder::with_capacity(row_count)
                .with_block_size(1024 * 1024 * 2);
            for b in batches.iter() {
                let array = b.column(i).as_string_view();
                for v in array.iter() {
                    string_view_builder.append_option(v);
                }
            }
            let array = string_view_builder.finish();
            arrays.push(Arc::new(array) as ArrayRef);
        }
        // All other types go through the regular concat kernel.
        _ => {
            let array = arrow::compute::concat(
                &batches
                    .iter()
                    .map(|batch| batch.column(i).as_ref())
                    .collect::<Vec<_>>(),
            )?;
            arrays.push(array);
        }
    }
}
```
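For intuition, here is a minimal, self-contained sketch of what the `Utf8View` branch above is doing, using a toy view representation rather than arrow-rs types (`ViewArray` and `concat_with_gc` are illustrative names, not real APIs): the surviving string bytes are copied once into a fresh buffer, and the view list is built once.

```rust
// A toy model of the Utf8View idea, NOT arrow-rs types: each view records
// an (offset, len) into a shared byte buffer.
struct ViewArray {
    buffer: Vec<u8>,            // compacted string data
    views: Vec<(usize, usize)>, // one (offset, len) view per string
}

impl ViewArray {
    fn get(&self, i: usize) -> &str {
        let (off, len) = self.views[i];
        std::str::from_utf8(&self.buffer[off..off + len]).unwrap()
    }
}

/// "GC while concatenating": copy each string's bytes exactly once into a
/// single fresh buffer, and build the view list exactly once.
fn concat_with_gc(batches: &[Vec<&str>]) -> ViewArray {
    let mut buffer = Vec::new();
    let mut views = Vec::new();
    for batch in batches {
        for s in batch {
            let off = buffer.len();
            buffer.extend_from_slice(s.as_bytes()); // the single copy
            views.push((off, s.len()));
        }
    }
    ViewArray { buffer, views }
}

fn main() {
    let out = concat_with_gc(&[vec!["foo", "bar"], vec!["baz"]]);
    assert_eq!(out.get(2), "baz");
    println!("{} strings, {} bytes", out.views.len(), out.buffer.len());
}
```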
I benchmarked this implementation on ClickBench Q20:
```sql
SELECT COUNT(*) FROM hits WHERE "URL" LIKE '%google%';
```
Performance is about 20% slower.
#### Profiling
I checked the flamegraph and found that the new implementation spends
significantly more time on page faults.
<img width="792" alt="image"
src="https://github.com/user-attachments/assets/ac4cae52-d521-49fb-aedd-781edd636ea2">
Then I ran heaptrack (which requires disabling `mimalloc`) and found that the
peak RSS (resident set size) increased from 1.3 GB to 5.2 GB.
I believe the performance regression is due to **late GC**. Previously, we
called GC immediately after running the filter. Now, GC runs only after enough
values have accumulated in the buffer, which can keep the underlying buffers
alive for an excessively long time.
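To make the late-GC effect concrete, here is a hedged, self-contained sketch (plain `Rc<Vec<u8>>` standing in for shared Arrow buffers; `late_gc_peak` and `eager_gc_kept` are hypothetical names): when every source buffer is held until the final build, peak memory is the sum of all backing buffers, even though only a few bytes per batch survive the filter.

```rust
use std::rc::Rc;

/// Bytes resident at the peak if every source buffer is held until the
/// final build ("late GC", as in the benchmarked implementation).
fn late_gc_peak(sources: &[Rc<Vec<u8>>]) -> usize {
    // Clones of every batch's backing buffer are pinned simultaneously.
    let pending: Vec<Rc<Vec<u8>>> = sources.iter().cloned().collect();
    pending.iter().map(|b| b.len()).sum()
}

/// Bytes kept if each batch's survivors are copied out immediately and
/// the source buffer is released before the next batch ("eager GC").
fn eager_gc_kept(sources: Vec<Rc<Vec<u8>>>) -> usize {
    let mut survivors = Vec::new();
    for src in sources {
        survivors.push(src[..8].to_vec()); // only 8 bytes survive the filter
        // `src` is dropped here: its backing buffer is freed right away
    }
    survivors.iter().map(|v| v.len()).sum()
}

fn main() {
    // Each batch pins a 1 KiB buffer although only 8 bytes survive.
    let sources: Vec<Rc<Vec<u8>>> =
        (0u8..4).map(|i| Rc::new(vec![i; 1024])).collect();
    println!("late GC pins {} bytes", late_gc_peak(&sources));
    println!("eager GC keeps {} bytes", eager_gc_kept(sources));
}
```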
#### Solution?
I think the discussion around reducing copies can be divided into two
sub-questions:
1. What is the current overhead of StringView in `CoalesceBatchesExec`? The
current implementation does not copy string data. The overhead (extra steps)
comes from constructing the views three times (once in the filter step, once in
the coalesce GC, and once in `concat_batches`). The implementation above gets
rid of the second one, but at an improper time.
2. Should we refactor filter-then-coalesce into a single operator? That way,
we would not produce intermediate small batches, thus reducing copies. I
believe this is a bigger project and could solve the first problem along the
way.
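A rough sketch of the fused-operator idea in sub-question 2, under the assumption that survivors can be appended straight into one output builder (`filter_coalesce` is a hypothetical name, and a plain `Vec<String>` stands in for a `StringViewBuilder`): no small intermediate batches are materialized, and each batch's backing memory can be released as soon as it is scanned.

```rust
/// Hypothetical fused filter+coalesce: survivors are appended directly
/// into one output builder, so string bytes are copied exactly once and
/// no intermediate small batch pins its source buffer.
fn filter_coalesce<F>(batches: &[Vec<&str>], pred: F) -> Vec<String>
where
    F: Fn(&str) -> bool,
{
    let mut out = Vec::new(); // stands in for a single StringViewBuilder
    for batch in batches {
        for s in batch {
            if pred(s) {
                out.push((*s).to_string()); // the only copy of the bytes
            }
        }
        // the batch's backing memory can be released here, immediately
    }
    out
}

fn main() {
    let batches = vec![
        vec!["google.com", "example.org"],
        vec!["mail.google.com"],
    ];
    let hits = filter_coalesce(&batches, |s| s.contains("google"));
    assert_eq!(hits, vec!["google.com", "mail.google.com"]);
}
```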
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]