Rich-T-kid commented on issue #10119: URL: https://github.com/apache/arrow-rs/issues/10119#issuecomment-4682922551
@JakeDern were you planning on making a new benchmark or updating the existing benchmarks? FWIW, I think it'd be worth isolating this into its own benchmark (writer and reader). This relates to what you mentioned:

> Dictionaries have a lot of special handling in IPC writer code, which we want to optimize.

Since there is so much other logic, it'd make sense to have benchmarks that focus on small sections, for example `_encode_dictionaries()`, `encode_dictionaries()`, and the `DictionaryTracker` struct.

I also think the dictionary-focused benchmarks could be structured to cover different patterns, such as the streaming behavior that the dictionary format was built around ([arrow-ipc docs](https://arrow.apache.org/docs/format/Columnar.html#dictionary-messages)). It would be nice to have benchmarks that check that the buffer space used to track dictionary mappings is reused instead of being repeatedly allocated and freed. From the docs:

> Alternatively, if isDelta is set to false, then the dictionary replaces the existing dictionary for the same ID.

Like I mentioned before, I haven't looked too closely at the dictionary path, but feel free to tag me in the benchmarks PR and I'll be happy to take a look!
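To make the replace-versus-delta rule and the buffer-reuse check concrete, here is a rough, std-only sketch. It is a toy model and not arrow-rs code: `ToyDictionaryState`, `apply`, `values`, and `capacity` are hypothetical names, and plain `i32` values stand in for real dictionary arrays. It only illustrates the kind of property a dictionary-focused benchmark or test might assert.

```rust
use std::collections::HashMap;

/// Toy stand-in for per-ID dictionary state.
/// Hypothetical type; this is NOT arrow-rs's `DictionaryTracker`.
#[derive(Default)]
struct ToyDictionaryState {
    // Dictionary ID -> current dictionary values.
    dictionaries: HashMap<i64, Vec<i32>>,
}

impl ToyDictionaryState {
    /// Apply one dictionary message.
    /// `is_delta == true` appends to the existing dictionary for `id`;
    /// `is_delta == false` replaces the existing dictionary for `id`.
    fn apply(&mut self, id: i64, values: &[i32], is_delta: bool) {
        let dict = self.dictionaries.entry(id).or_default();
        if !is_delta {
            // `clear` drops the old values but keeps the allocation,
            // so a replacement can reuse the same buffer.
            dict.clear();
        }
        dict.extend_from_slice(values);
    }

    /// Current values for `id`, if a dictionary has been seen.
    fn values(&self, id: i64) -> Option<&[i32]> {
        self.dictionaries.get(&id).map(Vec::as_slice)
    }

    /// Allocated capacity for `id` (0 if the ID is unknown).
    fn capacity(&self, id: i64) -> usize {
        self.dictionaries.get(&id).map_or(0, Vec::capacity)
    }
}
```

A reuse check could then record `capacity(id)` before a non-delta message, apply a replacement that is no larger than the current dictionary, and assert that the capacity is unchanged afterwards. A real benchmark would measure the actual writer and reader paths instead of this model.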
