js8544 opened a new issue, #13981: URL: https://github.com/apache/arrow/issues/13981
Hi folks, my team is trying out Arrow in our feature engineering pipeline. We really enjoyed coding with the C++ `arrow::compute` library thanks to its user-friendly interfaces and comprehensive set of functions. Before introducing it into our production code, I wrote a simple benchmark comparing the performance of compute functions against raw operations on `arrow::Array`. To our surprise, `arrow::compute` is 4 times slower than the corresponding raw operations.

The benchmark logic is as follows. I have two arrays, `keys` and `values`, representing a key-value map. I have another array, `query_keys`, holding the keys I want to look up in that map. I want to produce an array `result`, where `result[i]` is the value for `query_keys[i]` if the query key exists in the map and its value is greater than or equal to 100.

I implement this logic as follows:

1. Create a boolean array `filter` with `GreaterEqual(values, 100)`.
2. Filter both `keys` and `values` using `Filter(keys, filter)` and `Filter(values, filter)`, giving `filtered_keys` and `filtered_values`.
3. Call `IndexIn(query_keys, filtered_keys)` to look up `query_keys` in `filtered_keys` and get an array `index`.
4. Call `Take(filtered_values, index)` to get the result.

The benchmark code is attached below.

[benchmark.tar.gz](https://github.com/apache/arrow/files/9433267/benchmark.tar.gz)

Here is the benchmark result:

| ns/op | op/s | err% | total | Benchmarking simple feature |
|--------------------:|--------------------:|--------:|----------:|:----------------------------|
| 163,837.00 | 6,103.63 | 0.6% | 0.01 | `arrow_compute` |
| 31,675.56 | 31,570.09 | 0.5% | 0.01 | `arrow_raw` |

It shows that `arrow::compute` is 4 times slower than using raw operations on `arrow::Array`s. I understand that `arrow::compute` functions need to check the underlying types of the `arrow::Datum`s and `arrow::Array`s passed in and dispatch dynamically to the right kernel. But is such a large difference expected, or is my benchmark code wrong?

-- This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
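For readers without the attachment, the intended result of the four steps in the issue can be sketched in plain standard C++. This is only a reference for the semantics, not the attached benchmark code and not Arrow's API: the function name `LookupFiltered`, the `int64_t` element type, and the use of `std::optional` to stand in for a null entry are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not part of Arrow): mirrors the four steps described
// in the issue using standard containers. Element type int64_t is assumed.
std::vector<std::optional<int64_t>> LookupFiltered(
    const std::vector<int64_t>& keys,
    const std::vector<int64_t>& values,
    const std::vector<int64_t>& query_keys,
    int64_t threshold) {
  // Steps 1 and 2: keep only the (key, value) pairs with value >= threshold.
  std::vector<int64_t> filtered_keys;
  std::vector<int64_t> filtered_values;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    if (values[i] >= threshold) {
      filtered_keys.push_back(keys[i]);
      filtered_values.push_back(values[i]);
    }
  }

  // Step 3: map each surviving key to the position of its first occurrence.
  // emplace() does not overwrite, so the first occurrence wins.
  std::unordered_map<int64_t, std::size_t> position;
  for (std::size_t i = 0; i < filtered_keys.size(); ++i) {
    position.emplace(filtered_keys[i], i);
  }

  // Step 4: gather the value for each query key; a missing key gives an
  // empty optional, standing in for a null entry.
  std::vector<std::optional<int64_t>> result;
  result.reserve(query_keys.size());
  for (int64_t q : query_keys) {
    auto it = position.find(q);
    if (it == position.end()) {
      result.push_back(std::nullopt);
    } else {
      result.push_back(filtered_values[it->second]);
    }
  }
  return result;
}
```

With `keys = {1, 2, 3}`, `values = {50, 100, 150}`, `query_keys = {3, 2, 1, 4}` and a threshold of 100, this returns 150, 100, empty, empty: key 1 is dropped by the filter and key 4 is not in the map.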
