alamb commented on issue #7200: URL: https://github.com/apache/arrow-datafusion/issues/7200#issuecomment-1679274520
@wiedld, @tustvold, @crepererum, @JayjeetAtGithub and I had a discussion and here are some notes:

The proposal is to look at the input *before* starting to do the merge or convert any rows, and to change how the row converter works for high cardinality dictionaries.

The assumption is that for low cardinality dictionaries (a small number of distinct values), using [`preserve_dictionaries`] is important for performance, but for high cardinality dictionaries (a large number of distinct values), using [`preserve_dictionaries`] not only consumes large amounts of memory, as described in this ticket, but will also be slower because the size of the interned keys will be substantial.

If we do not use [`preserve_dictionaries`], the `RowInterner` will no longer keep a mapping and thus the memory consumption will not grow.

Specifically, this would look like:

1. Based on some heuristic, if the dictionary is high cardinality, then use the normal string encoding (set [`preserve_dictionaries`] to false)
2. If the dictionary is low cardinality, then use the dictionary encoding (set [`preserve_dictionaries`] to true, the default)

Open questions:

1. What heuristic to use to determine high cardinality (the heuristic needs to be reasonably fast / memory efficient to compute)
2. Can we improve the performance of `preserve_dictionaries=false` conversion (andrew to file ticket)
3. How to verify this doesn't cause a performance regression

[`preserve_dictionaries`]: https://docs.rs/arrow-row/45.0.0/arrow_row/struct.SortField.html#method.preserve_dictionaries

Other options we discussed:

1. Try to update the state of `RowConverter` to prune out unused entries (not clear we could make this work)
2. Recreate the `RowConverter`
3. Use the "non dictionary encoding" mode (what is described above)
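
To make the two-step proposal above concrete, here is a minimal sketch of what such a decision could look like. The heuristic itself is still an open question in the notes, so everything here is an assumption for illustration: the function name `should_preserve_dictionaries`, the idea of comparing the dictionary length to the row count, and the `max_ratio` threshold are all hypothetical and were not agreed on in the discussion. The sketch uses only the standard library and does not call into `arrow-row`.

```rust
/// Hypothetical heuristic (NOT decided in the discussion): treat a dictionary
/// as low cardinality when it has few entries relative to the number of rows.
///
/// `dictionary_len` is the number of entries in the dictionary. This is only
/// an upper bound on the distinct values actually in use, but it is cheap to
/// obtain. `max_ratio` is an illustrative tuning knob, not a measured value.
///
/// Returns the flag one might pass on to `preserve_dictionaries`.
pub fn should_preserve_dictionaries(
    dictionary_len: usize,
    num_rows: usize,
    max_ratio: f64,
) -> bool {
    if num_rows == 0 {
        // Nothing to convert: keep the default behavior (true).
        return true;
    }
    // Low ratio means few distinct values per row, so keep the dictionary
    // encoding. High ratio means fall back to the normal string encoding.
    (dictionary_len as f64) / (num_rows as f64) <= max_ratio
}
```

For example, with a threshold of `0.1`, a batch of 1000 rows backed by a 10-entry dictionary would keep dictionaries, while one backed by a 900-entry dictionary would not. Whether a ratio like this is a good signal, and what threshold to use, would need to be checked against the performance questions listed above.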
