js8544 commented on issue #38372: URL: https://github.com/apache/arrow/issues/38372#issuecomment-1776936362
I wrote a simple proof of comcept for `IndexIn` [1] with a slightly tweaked version of `absl::flat_hash_map`. (Tweak is [2], a couple of operations can be skipped because our `MemoTable` doesn't need to support deletion). And the existing `IndexIn` benchmarks (results below) show about 20% ~ 60% improvement depending on the `value_set` size, the larger the better. The benchmarks are run on my Mac M1 so the results may vary on a x86 machine. But I think the diff is compelling enough for us to proceed with a Swiss table implementation. [1] https://github.com/apache/arrow/compare/main...js8544:jinshang/swiss_proof_of_concept?diff=split#diff-d288c253c3a5959aa1b6840a649a8dafe42d14ac49d1837b19dd3a44c58f7fd9R93 [2] https://github.com/apache/arrow/commit/7d14d10d67826f4be7588b8b36d7aef7655dbcb3 ## Benchmarks ### Before <img width="1208" alt="image" src="https://user-images.githubusercontent.com/14072670/277618419-a320424a-7c79-4956-9d04-4acc199c1211.png"> ### After <img width="1204" alt="image" src="https://user-images.githubusercontent.com/14072670/277618495-0896179c-0df8-4c10-83e8-3dff95f0b339.png"> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
