JakeDern commented on issue #10119:
URL: https://github.com/apache/arrow-rs/issues/10119#issuecomment-4683557693

   > [@JakeDern](https://github.com/JakeDern) were you planning on making a new 
benchmark or updating the existing benchmarks? FWIW I think it'd be worth also 
isolating it into its own benchmark (writer & reader). This has to do with what 
you mentioned:
   > 
   > > Dictionaries have a lot of special handling in IPC writer code, which we 
want to optimize.
   > 
   > Since there is so much other logic, it'd make sense to have benchmarks 
that focus on small sections, for example `_encode_dictionaries()`, 
`encode_dictionaries()`, and the `DictionaryTracker` struct. I also think the 
dictionary-focused benchmarks could expand the structure of the benchmarks to 
cover different patterns such as the streaming behavior that the dictionary 
format was built around, [arrow-ipc 
docs](https://arrow.apache.org/docs/format/Columnar.html#dictionary-messages). 
It would be nice to have benchmarks that validate/check that the buffer space 
used to track dictionary mappings is reused instead of repeatedly allocated and 
destroyed. (from the docs)
   > 
   > > Alternatively, if isDelta is set to false, then the dictionary replaces 
the existing dictionary for the same ID.
   > 
   > like I mentioned before I haven't looked to closely at the dictionary 
path, but feel free > to tag me in the benchmarks PR & ill be happy to take a 
look!
   
   I think we're on the same page! I added new benchmarks for the writer side 
with different focuses:
   
   - StreamWriter dictionary messages
   - StreamWriter delta dictionary messages
   - FileWriter delta dictionary messages
   
   The delta and non-delta paths definitely exercise different paths and have 
different perf characteristics. I didn't go quite to the level of granularity 
of functions like `encode_dictionary` or the `DictionaryTracker` struct, but my 
opinion is as long as any perf improvements/regressions are detectable in the 
current set of benchmarks then the coverage is ok.
   
   The topic of making sure that dictionary mappings are re-used instead of 
repeatedly allocated and destroyed is sadly tricky, but something that I 
ensured the benchmarks. I think we can simultaneously (1) more reliably emit 
delta dictionaries and (2) make the emission of delta dictionaries more 
efficient, but IMO both are more of an API level problem that I tried to 
describe in #8134.
   
   On the subject of API issues for the writer - I think `IpcDataGenerator` 
should also be a future target which has public apis like `encode` which is 
tied to the old `EncodedData` pattern and likely shouldn't be used. There are 
probably other APIs worth deprecating in the module too: 
https://github.com/apache/arrow-rs/blob/826b808b2792235c27a22ba917a7abe93cfbb221/arrow-ipc/src/writer.rs#L546-L553.
 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to