drin commented on issue #13981: URL: https://github.com/apache/arrow/issues/13981#issuecomment-1228554377
That amount of overhead is definitely unexpected. For reference, I am working on a compute function for hashing and I see 15-25% overhead ([benchmark results](https://docs.google.com/presentation/d/1cUU_F3jB6LsOLbClhl34YdQiodtbz7l76l3juHTsC5k/edit#slide=id.g13e9d117f47_0_63)). That said, some improvements are in the pipeline to address known overheads, such as [ARROW-16756](https://issues.apache.org/jira/browse/ARROW-16756). Also note that per-invocation overhead is amplified when arrays are small, since the fixed cost is paid once per chunk; with many chunks you will measure proportionally more overhead.

Generally speaking: what version of Arrow are you on, and what is the layout of the data? It sounds like you just have Arrays (contiguous), so some of the overheads I have seen may not be affecting you much.

Loosely related: you could consider using a [MapArray](https://arrow.apache.org/docs/cpp/api/array.html#_CPPv4N5arrow8MapArrayE) instead of 2 separate arrays. That should cut out a few steps, because the keys and items share a single top-level validity bitmap instead of each array tracking its own nulls (a rough sketch is at the end of this comment). Separately, this seems like a case where an array-valued version of [map_lookup](https://arrow.apache.org/docs/cpp/compute.html#cpp-compute-vector-structural-transforms) would be nice.

Also, for convenience, would you mind putting your code in a repo or a gist instead of a tarball? I haven't looked at it yet because of the extra steps involved.
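For what it's worth, here is a minimal sketch of the MapArray suggestion using `arrow::MapBuilder`. The key/value types (`utf8 -> int32`) and the builder setup are my own assumptions for illustration, not taken from your code:

```cpp
#include <iostream>
#include <memory>

#include <arrow/api.h>

// Build a MapArray in place of two parallel key/value arrays.
// Types (utf8 keys, int32 items) are assumed for illustration.
arrow::Result<std::shared_ptr<arrow::Array>> BuildMapArray() {
  auto* pool = arrow::default_memory_pool();
  auto key_builder  = std::make_shared<arrow::StringBuilder>(pool);
  auto item_builder = std::make_shared<arrow::Int32Builder>(pool);
  arrow::MapBuilder map_builder(pool, key_builder, item_builder);

  // First map slot: {"a": 1, "b": 2}
  ARROW_RETURN_NOT_OK(map_builder.Append());
  ARROW_RETURN_NOT_OK(key_builder->Append("a"));
  ARROW_RETURN_NOT_OK(item_builder->Append(1));
  ARROW_RETURN_NOT_OK(key_builder->Append("b"));
  ARROW_RETURN_NOT_OK(item_builder->Append(2));

  // Second slot is null: recorded once in the shared top-level validity
  // bitmap, rather than as separate nulls in two parallel arrays.
  ARROW_RETURN_NOT_OK(map_builder.AppendNull());

  std::shared_ptr<arrow::Array> map_array;
  ARROW_RETURN_NOT_OK(map_builder.Finish(&map_array));
  return map_array;
}

int main() {
  auto result = BuildMapArray();
  if (!result.ok()) {
    std::cerr << result.status().ToString() << std::endl;
    return 1;
  }
  std::cout << (*result)->ToString() << std::endl;
  return 0;
}
```

The point is that null bookkeeping happens once at the map level, which is where I'd expect the step reduction to come from.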
