goldmedal commented on issue #11268:
URL: https://github.com/apache/datafusion/issues/11268#issuecomment-2211819196
> `make_map_batch` version is way faster, we should go with that one.
>
> Upd: I tried to remove clone in MapBuilder version, it still seems slower
than manually constructed one
Here's the revised version of the text with improved grammar:
Thanks for your effort. I think `make_map_batch` is faster because it
accepts two known-type arrays. We only need to rearrange the layout for the
`MapArray`, which requires some `Arc` cloning. It's a good way to implement the
DuckDB syntax:
```
SELECT MAP(['a', 'b', 'c'], [1, 2, 3]);
```
However, I found it hard to provide a function like `make_map(k1, v1, k2, v2
...)` that is both efficient and simple. I tried two solutions for it:
1. Aggregate the keys and values, then pass them to `make_map_batch`:
```rust
fn make_map_from(args: &[ColumnarValue]) -> Result<ColumnarValue> {
let mut key_buffer = VecDeque::new();
let mut value_buffer = VecDeque::new();
args.chunks_exact(2)
.enumerate()
.for_each(|(_, chunk)| {
let key = chunk[0].clone();
let value = chunk[1].clone();
key_buffer.push_back(key);
value_buffer.push_back(value);
});
let key: Vec<_> = key_buffer.into();
let value: Vec<_> = value_buffer.into();
let key = ColumnarValue::values_to_arrays(&key)?;
let value = ColumnarValue::values_to_arrays(&value)?;
make_map_batch(&[key[0].clone(), value[0].clone()])
}
```
The codebase is simple, but the performance is terrible. The bottleneck
is `ColumnarValue::values_to_arrays`. This function is convenient but slow.
Actually, this version is slower than the `MapBuilder` one. :(
2. MapBuilder:
It's the faster solution (compared to the above one) for `make_map(k1,
v1, k2, v2 ...)`, but the codebase would become complex and large. In my
testing code, I hardcoded the types of keys and values to get the value and
create the builder. In a real scenario, we would need to match all types and
builders to use MapBuilder. I'm not sure if it's a good approach.
```rust
fn make_map(args: &[ColumnarValue]) -> Result<ColumnarValue> {
let string_builder = StringBuilder::new();
let int_builder = Int32Builder::with_capacity(args.len() / 2);
let mut builder =
MapBuilder::new(None, string_builder, int_builder);
for (_, chunk) in args.chunks_exact(2).enumerate() {
let key = chunk[0].clone();
let value = chunk[1].clone();
match key {
ColumnarValue::Scalar(ScalarValue::Utf8(Some(key_scalar)))
=> {
builder.keys().append_value(key_scalar)
}
_ => {}
}
match value {
ColumnarValue::Scalar(ScalarValue::Int32(Some(value_scalar))) => {
builder.values().append_value(value_scalar)
}
_ => {}
}
}
Ok(ColumnarValue::Array(Arc::new(builder.finish())))
}
```
In conclusion, I will provide two functions:
- For performance purposes: `map(['a', 'b', 'c'], [1, 2, 3])`
- For user-friendly purposes: `make_map(k1, v1, k2, v2 ...)`
For the user-friendly one, I'll choose the first solution to keep the
codebase simpler. I found that `named_struct` also uses
`ColumnarValue::values_to_arrays` to do something similar.
https://github.com/apache/datafusion/blob/08c5345e932f1c5c948751e0d06b1fd99e174efa/datafusion/functions/src/core/named_struct.rs#L60
We could suggest that users who intend to get better performance or create a
large map use the `map()` function.
Some appendixs for the benchmark result:
```
// first run for warm-up
Generating 1 random key-value pairs
Time elapsed in make_map() is: 391.618µs
Time elapsed in make_map_from() is: 224.568µs
Time elapsed in make_map_batch() is: 18.173µs
Generating 2 random key-value pairs
Time elapsed in make_map() is: 19.848µs
Time elapsed in make_map_from() is: 27.726µs
Time elapsed in make_map_batch() is: 32.628µs
Generating 4 random key-value pairs
Time elapsed in make_map() is: 18.495µs
Time elapsed in make_map_from() is: 37.102µs
Time elapsed in make_map_batch() is: 16.399µs
Generating 8 random key-value pairs
Time elapsed in make_map() is: 20.785µs
Time elapsed in make_map_from() is: 48.596µs
Time elapsed in make_map_batch() is: 42.98µs
Generating 10 random key-value pairs
Time elapsed in make_map() is: 26.3µs
Time elapsed in make_map_from() is: 84.601µs
Time elapsed in make_map_batch() is: 15.55µs
// it's not so common for a map use case
Generating 50 random key-value pairs
Time elapsed in make_map() is: 35.751µs
Time elapsed in make_map_from() is: 332.336µs
Time elapsed in make_map_batch() is: 52.425µs
Generating 100 random key-value pairs
Time elapsed in make_map() is: 178.775µs
Time elapsed in make_map_from() is: 508.017µs
Time elapsed in make_map_batch() is: 184.452µs
Generating 1000 random key-value pairs
Time elapsed in make_map() is: 386.535µs
Time elapsed in make_map_from() is: 3.125679ms
Time elapsed in make_map_batch() is: 45.549µs
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]