Re: [I] Implement a scalar function for creating ScalarValue::Map [datafusion]

via GitHub Sat, 06 Jul 2024 09:59:18 -0700


goldmedal commented on issue #11268:
URL: https://github.com/apache/datafusion/issues/11268#issuecomment-2211819196


   > `make_map_batch` version is way faster, we should go with that one.
   > 
   > Upd: I tried to remove clone in MapBuilder version, it still seems slower 
than manually constructed one
   Here's the revised version of the text with improved grammar:
   
   Thanks for your effort. I think `make_map_batch` is faster because it 
accepts two known-type arrays. We only need to rearrange the layout for the 
`MapArray`, which requires some `Arc` cloning. It's a good way to implement the 
DuckDB syntax:
   
   ```
   SELECT MAP(['a', 'b', 'c'], [1, 2, 3]);
   ```
   
   However, I found it hard to provide a function like `make_map(k1, v1, k2, v2 
...)` that is both efficient and simple. I tried two solutions for it:
   
   1. Aggregate the keys and values, then pass them to `make_map_batch`:
       ```rust
       fn make_map_from(args: &[ColumnarValue]) -> Result<ColumnarValue> {
           let mut key_buffer = VecDeque::new();
           let mut value_buffer = VecDeque::new();
   
           args.chunks_exact(2)
               .enumerate()
               .for_each(|(_, chunk)| {
                   let key = chunk[0].clone();
                   let value = chunk[1].clone();
                   key_buffer.push_back(key);
                   value_buffer.push_back(value);
               });
   
           let key: Vec<_> = key_buffer.into();
           let value: Vec<_> = value_buffer.into();
   
           let key = ColumnarValue::values_to_arrays(&key)?;
           let value = ColumnarValue::values_to_arrays(&value)?;
   
           make_map_batch(&[key[0].clone(), value[0].clone()])
       }
       ```
       The codebase is simple, but the performance is terrible. The bottleneck 
is `ColumnarValue::values_to_arrays`. This function is convenient but slow. 
Actually, this version is slower than the `MapBuilder` one. :(
   
   2. MapBuilder:
       It's the faster solution (compared to the above one) for `make_map(k1, 
v1, k2, v2 ...)`, but the codebase would become complex and large. In my 
testing code, I hardcoded the types of keys and values to get the value and 
create the builder. In a real scenario, we would need to match all types and 
builders to use MapBuilder. I'm not sure if it's a good approach.
       ```rust
       fn make_map(args: &[ColumnarValue]) -> Result<ColumnarValue> {
           let string_builder = StringBuilder::new();
           let int_builder = Int32Builder::with_capacity(args.len() / 2);
           let mut builder =
               MapBuilder::new(None, string_builder, int_builder);
   
           for (_, chunk) in args.chunks_exact(2).enumerate() {
               let key = chunk[0].clone();
               let value = chunk[1].clone();
               match key {
                   ColumnarValue::Scalar(ScalarValue::Utf8(Some(key_scalar))) 
=> {
                       builder.keys().append_value(key_scalar)
                   }
                   _ => {}
               }
   
               match value {
                   
ColumnarValue::Scalar(ScalarValue::Int32(Some(value_scalar))) => {
                       builder.values().append_value(value_scalar)
                   }
                   _ => {}
               }
           }
   
           Ok(ColumnarValue::Array(Arc::new(builder.finish())))
       }
       ```
   
   In conclusion, I will provide two functions:
   - For performance purposes: `map(['a', 'b', 'c'], [1, 2, 3])`
   - For user-friendly purposes: `make_map(k1, v1, k2, v2 ...)`
   
   For the user-friendly one, I'll choose the first solution to keep the 
codebase simpler. I found that `named_struct` also uses 
`ColumnarValue::values_to_arrays` to do something similar.
   
https://github.com/apache/datafusion/blob/08c5345e932f1c5c948751e0d06b1fd99e174efa/datafusion/functions/src/core/named_struct.rs#L60
   
   We could suggest that users who intend to get better performance or create a 
large map use the `map()` function.
   
   Some appendixs for the benchmark result:
   ```
   // first run for warm-up
   Generating 1 random key-value pairs
   Time elapsed in make_map() is: 391.618µs
   Time elapsed in make_map_from() is: 224.568µs
   Time elapsed in make_map_batch() is: 18.173µs
   
   Generating 2 random key-value pairs
   Time elapsed in make_map() is: 19.848µs
   Time elapsed in make_map_from() is: 27.726µs
   Time elapsed in make_map_batch() is: 32.628µs
   
   Generating 4 random key-value pairs
   Time elapsed in make_map() is: 18.495µs
   Time elapsed in make_map_from() is: 37.102µs
   Time elapsed in make_map_batch() is: 16.399µs
   
   Generating 8 random key-value pairs
   Time elapsed in make_map() is: 20.785µs
   Time elapsed in make_map_from() is: 48.596µs
   Time elapsed in make_map_batch() is: 42.98µs
   
   Generating 10 random key-value pairs
   Time elapsed in make_map() is: 26.3µs
   Time elapsed in make_map_from() is: 84.601µs
   Time elapsed in make_map_batch() is: 15.55µs
   
   // it's not so common for a map use case
   Generating 50 random key-value pairs
   Time elapsed in make_map() is: 35.751µs
   Time elapsed in make_map_from() is: 332.336µs
   Time elapsed in make_map_batch() is: 52.425µs
   
   Generating 100 random key-value pairs
   Time elapsed in make_map() is: 178.775µs
   Time elapsed in make_map_from() is: 508.017µs
   Time elapsed in make_map_batch() is: 184.452µs
   
   Generating 1000 random key-value pairs
   Time elapsed in make_map() is: 386.535µs
   Time elapsed in make_map_from() is: 3.125679ms
   Time elapsed in make_map_batch() is: 45.549µs
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] Implement a scalar function for creating ScalarValue::Map [datafusion]

Reply via email to