[GitHub] [arrow-datafusion] alamb commented on issue #7200: RowConverter keeps growing in size while merging streams on high-cardinality dictionary fields

via GitHub Tue, 15 Aug 2023 09:43:02 -0700


alamb commented on issue #7200:
URL: 
https://github.com/apache/arrow-datafusion/issues/7200#issuecomment-1679274520

@wiedld, @tustvold @crepererum @JayjeetAtGithub and I had a discussion
and here are some notes:

The proposal is to inspect the input *before* starting the merge or
converting any rows, and to change how the row converter handles high
cardinality dictionaries.

The assumption is that for low cardinality dictionaries (a small
number of distinct values), using [`preserve_dictionaries`] is
important for performance. For high cardinality dictionaries (a
large number of distinct values), using [`preserve_dictionaries`] not
only consumes large amounts of memory, as described in this ticket, but
will also be slower, because the interned keys become substantial in size.

If we do not use [`preserve_dictionaries`], the `RowInterner` will no
longer keep a mapping, and thus the memory consumption will not grow.
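
To illustrate why the mapping grows, here is a minimal sketch using a toy interner. `ToyInterner`, `encode_batch_interned` and `encode_batch_plain` are hypothetical names made up for this example and are not the arrow-row implementation (in particular, the toy ids are not order preserving). It only shows that retained state grows with every new distinct value, while a plain byte encoding retains nothing between batches.

```rust
use std::collections::HashMap;

/// Toy stand-in for a dictionary-preserving interner (hypothetical, NOT the
/// arrow-row implementation): every distinct value ever seen is retained.
#[derive(Default)]
struct ToyInterner {
    ids: HashMap<String, u32>,
}

impl ToyInterner {
    /// Returns a stable id for `value`, storing the value on first sight.
    fn intern(&mut self, value: &str) -> u32 {
        if let Some(&id) = self.ids.get(value) {
            return id;
        }
        let id = self.ids.len() as u32;
        self.ids.insert(value.to_string(), id);
        id
    }

    /// Number of retained values; never shrinks.
    fn len(&self) -> usize {
        self.ids.len()
    }
}

/// Encodes one batch through the interner and returns how many entries the
/// interner retains afterwards.
fn encode_batch_interned(interner: &mut ToyInterner, batch: &[String]) -> usize {
    for value in batch {
        interner.intern(value);
    }
    interner.len()
}

/// Encodes one batch by copying the value bytes into each row. No state
/// survives the call, so memory does not accumulate across batches.
fn encode_batch_plain(batch: &[String]) -> Vec<Vec<u8>> {
    batch.iter().map(|v| v.as_bytes().to_vec()).collect()
}
```

With high cardinality input, each new batch adds mostly unseen values to the interner, so its size tracks the total number of distinct values merged so far rather than the size of one batch.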

So specifically, this would look like:
1. Based on some heuristic, if the dictionary is high cardinality, then use
the normal string encoding (set [`preserve_dictionaries`] to false)
2. If the dictionary is low cardinality, then use the dictionary encoding
(set [`preserve_dictionaries`] to true, the default)
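
As a rough sketch of what such a heuristic could look like (the heuristic itself is still an open question): the function names, the 10% threshold and the 1024-row sample size are all assumptions made for illustration, not something we agreed on. The input is modelled as a plain slice of dictionary keys rather than a real Arrow array, and the boolean result would be passed to [`preserve_dictionaries`].

```rust
use std::collections::HashSet;

/// Hypothetical threshold: treat the dictionary as high cardinality when more
/// than 10% of the sampled rows reference distinct keys.
const MAX_DISTINCT_RATIO: f64 = 0.1;

/// Hypothetical sample size, chosen to keep the check cheap.
const SAMPLE_ROWS: usize = 1024;

/// Fraction of the sampled rows that reference a distinct dictionary key.
/// Only a prefix of the keys is inspected, so this is an estimate.
fn sampled_distinct_ratio(keys: &[u32]) -> f64 {
    let sample = &keys[..keys.len().min(SAMPLE_ROWS)];
    if sample.is_empty() {
        return 0.0;
    }
    let distinct: HashSet<u32> = sample.iter().copied().collect();
    distinct.len() as f64 / sample.len() as f64
}

/// Returns the value to use for `preserve_dictionaries`:
/// true for (estimated) low cardinality, false for high cardinality.
fn should_preserve_dictionaries(keys: &[u32]) -> bool {
    sampled_distinct_ratio(keys) <= MAX_DISTINCT_RATIO
}
```

Counting distinct keys rather than distinct values avoids touching the string data, at the cost of overcounting if the dictionary itself contains duplicate values.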

Open questions:
1. What heuristic to use to determine high cardinality (the heuristic needs
to be reasonably fast and memory efficient to compute)
2. Can we improve the performance of the `preserve_dictionaries=false`
conversion (Andrew to file a ticket)
3. How to verify this doesn't cause a performance regression

[`preserve_dictionaries`]:
https://docs.rs/arrow-row/45.0.0/arrow_row/struct.SortField.html#method.preserve_dictionaries

Other options we discussed:
1. Try to update the state of `RowConverter` to prune out unused entries (not
clear we could make this work)
2. Recreate the `RowConverter`
3. Use the "non dictionary encoding" mode (what is described above)

