js8544 commented on issue #13981:
URL: https://github.com/apache/arrow/issues/13981#issuecomment-1228632578

   > That amount of overhead is definitely unexpected. For reference, I am 
working on a compute function for hashing and I see 15-25% overhead ([benchmark 
results](https://docs.google.com/presentation/d/1cUU_F3jB6LsOLbClhl34YdQiodtbz7l76l3juHTsC5k/edit#slide=id.g13e9d117f47_0_63)).
 That being said, some improvements are in the pipeline, such as 
[ARROW-16756](https://issues.apache.org/jira/browse/ARROW-16756) and others, to 
address some overheads. I also think you'll measure more overhead if your 
arrays are small and overhead can be amplified with many invocations (many 
chunks).
   
   @drin Thank you for your reply! Unfortunately, increasing the array size 
doesn't help. The data I posted used a key/value size of 150 and a query size 
of 50. I increased them to 1500 and 500, and `arrow::compute` is still more 
than 4x slower:
   
   |               ns/op |                op/s |    err% |     total | Benchmarking simple feature |
   |--------------------:|--------------------:|--------:|----------:|:----------------------------|
   |          478,698.00 |            2,089.00 |    1.6% |      0.01 | `arrow_compute`             |
   |           93,844.36 |           10,655.94 |    0.3% |      0.01 | `arrow_raw`                 |
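
As a quick sanity check on the numbers in the table above, here is a minimal sketch that recomputes the slowdown and the op/s column from the ns/op figures. The variable names are illustrative only and are not part of the benchmark code; the two ns/op values are copied from the table.

```python
# ns/op figures copied from the benchmark table above.
ns_per_op = {"arrow_compute": 478_698.00, "arrow_raw": 93_844.36}

# Slowdown of the compute-function path relative to the raw-array path.
slowdown = ns_per_op["arrow_compute"] / ns_per_op["arrow_raw"]

# op/s is the reciprocal of ns/op, scaled from nanoseconds to seconds.
ops_per_sec = {name: 1e9 / ns for name, ns in ns_per_op.items()}

print(f"slowdown: {slowdown:.2f}x")
for name, ops in ops_per_sec.items():
    print(f"{name}: {ops:,.2f} op/s")
```

With these figures the ratio comes out at roughly 5.1x, so the larger arrays did not narrow the gap.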
   
   > Generally speaking, what version of arrow are you on and what is the 
layout of the data? It sounds like you just have Arrays (contiguous) so maybe 
some of the overheads I have seen are not affecting you much.
   >
   
   I am using Arrow 9.0.0 and, yes, I'm using only arrays.
   
   > Loosely related, you could consider using a 
[MapArray](https://arrow.apache.org/docs/cpp/api/array.html#_CPPv4N5arrow8MapArrayE)
 instead of 2 separate arrays. I think this should reduce steps by a bit 
because the validity map is shared.
   > 
   > Separately, this seems like a case where an array version of 
[map_lookup](https://arrow.apache.org/docs/cpp/compute.html#cpp-compute-vector-structural-transforms)
 would be nice.
   > 
   
   Thanks for the suggestion! Would you mind sharing some examples of using 
`MapArray` or `Map`? I considered using them before writing the benchmark, 
but as far as I can tell, there is no documentation or example code for 
`MapArray` or `Map` other than the API reference in the official docs. I also 
checked the source code but couldn't infer the idiomatic way to use them, so I 
stuck with arrays.
   
   > Also, for convenience, would you mind putting your code in a repo or a 
gist, instead of a tarball?
   >
   
   Here it is:  https://gist.github.com/js8544/8569c0e0bb810f1254904e4584def167
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
