JakeDern commented on issue #10119: URL: https://github.com/apache/arrow-rs/issues/10119#issuecomment-4683557693
> [@JakeDern](https://github.com/JakeDern) were you planning on making a new benchmark or updating the existing benchmarks? FWIW I think it'd be worth also isolating it into its own benchmark (writer & reader). This has to do with what you mentioned: > > > Dictionaries have a lot of special handling in IPC writer code, which we want to optimize. > > Since there is so much other logic, it'd make sense to have benchmarks that focus on small sections, for example `_encode_dictionaries()`, `encode_dictionaries()`, and the `DictionaryTracker` struct. I also think the dictionary-focused benchmarks could expand the structure of the benchmarks to cover different patterns such as the streaming behavior that the dictionary format was built around, [arrow-ipc docs](https://arrow.apache.org/docs/format/Columnar.html#dictionary-messages). It would be nice to have benchmarks that validate/check that the buffer space used to track dictionary mappings is reused instead of repeatedly allocated and destroyed. (from the docs) > > > Alternatively, if isDelta is set to false, then the dictionary replaces the existing dictionary for the same ID. > > like I mentioned before I haven't looked to closely at the dictionary path, but feel free > to tag me in the benchmarks PR & ill be happy to take a look! I think we're on the same page! I added new benchmarks for the writer side with different focuses: - StreamWriter dictionary messages - StreamWriter delta dictionary messages - FileWriter delta dictionary messages The delta and non-delta paths definitely exercise different paths and have different perf characteristics. I didn't go quite to the level of granularity of functions like `encode_dictionary` or the `DictionaryTracker` struct, but my opinion is as long as any perf improvements/regressions are detectable in the current set of benchmarks then the coverage is ok. The topic of making sure that dictionary mappings are re-used instead of repeatedly allocated and destroyed is sadly tricky, but something that I ensured the benchmarks. I think we can simultaneously (1) more reliably emit delta dictionaries and (2) make the emission of delta dictionaries more efficient, but IMO both are more of an API level problem that I tried to describe in #8134. On the subject of API issues for the writer - I think `IpcDataGenerator` should also be a future target which has public apis like `encode` which is tied to the old `EncodedData` pattern and likely shouldn't be used. There are probably other APIs worth deprecating in the module too: https://github.com/apache/arrow-rs/blob/826b808b2792235c27a22ba917a7abe93cfbb221/arrow-ipc/src/writer.rs#L546-L553. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
