alamb commented on issue #7200:
URL: https://github.com/apache/arrow-datafusion/issues/7200#issuecomment-1679274520

   @wiedld, @tustvold, @crepererum, @JayjeetAtGithub, and I had a discussion; here are some notes:
   
   The proposal is to inspect the input *before* starting the merge or converting any rows, and to change how the row converter works for high cardinality dictionaries.
   
   The assumption is that for low cardinality dictionaries (a small
   number of distinct values), using [`preserve_dictionaries`] is
   important for performance, but for high cardinality dictionaries (a
   large number of distinct values), using [`preserve_dictionaries`] not
   only consumes large amounts of memory, as described in this ticket, but
   is also slower because the interned keys become substantial in size.
   
   
   If we do not use [`preserve_dictionaries`], the `RowInterner` will no
   longer keep a mapping, and thus the memory consumption will not grow.
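
   To illustrate why the retained mapping matters, here is a minimal sketch using a toy interner. `ToyInterner` and `retained_after` are hypothetical names made up for this example; this is not the arrow-row implementation (for instance, the ids here are not order preserving). It only shows that the number of retained entries tracks the number of distinct values seen, not the number of rows.

```rust
use std::collections::HashMap;

/// Toy stand-in for an interner (hypothetical; NOT arrow-row's implementation).
/// It remembers every distinct value it has ever seen.
#[derive(Default)]
struct ToyInterner {
    ids: HashMap<Vec<u8>, u32>,
}

impl ToyInterner {
    /// Returns the id for `value`, assigning a new id the first time it is seen.
    fn intern(&mut self, value: &[u8]) -> u32 {
        let next = self.ids.len() as u32;
        *self.ids.entry(value.to_vec()).or_insert(next)
    }

    /// Number of mappings currently retained.
    fn len(&self) -> usize {
        self.ids.len()
    }
}

/// Interns `rows` values drawn from `distinct` different strings and reports
/// how many mappings the interner retains afterwards. `distinct` must be > 0.
fn retained_after(rows: usize, distinct: usize) -> usize {
    let mut interner = ToyInterner::default();
    for i in 0..rows {
        let value = format!("value-{}", i % distinct);
        interner.intern(value.as_bytes());
    }
    interner.len()
}
```

   With 10,000 rows, `retained_after(10_000, 4)` keeps 4 entries while `retained_after(10_000, 10_000)` keeps 10,000, which is the growth pattern this ticket is about.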
   
   So specifically, this would look like:
   1. Based on some heuristic, if the dictionary is high cardinality, use the normal string encoding (set [`preserve_dictionaries`] to false)
   2. If the dictionary is low cardinality, use the dictionary encoding (set [`preserve_dictionaries`] to true, the default)
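
   The two steps above could be sketched roughly as follows. The names `DictionaryEncoding`, `choose_encoding`, and the `max_distinct` threshold are assumptions for illustration; the actual heuristic is still an open question.

```rust
/// How a dictionary-typed sort column would be row-encoded.
#[derive(Debug, PartialEq, Clone, Copy)]
enum DictionaryEncoding {
    /// Keep the dictionary encoding (`preserve_dictionaries = true`, the default).
    Preserve,
    /// Use the normal value encoding (`preserve_dictionaries = false`).
    Plain,
}

impl DictionaryEncoding {
    /// The boolean that would be passed to `preserve_dictionaries`.
    fn preserve_dictionaries(self) -> bool {
        matches!(self, DictionaryEncoding::Preserve)
    }
}

/// Hypothetical heuristic: treat the column as high cardinality once the
/// number of distinct values exceeds `max_distinct` (an assumed tuning knob).
fn choose_encoding(distinct_values: usize, max_distinct: usize) -> DictionaryEncoding {
    if distinct_values > max_distinct {
        DictionaryEncoding::Plain
    } else {
        DictionaryEncoding::Preserve
    }
}
```

   The resulting boolean would then be passed to `SortField::preserve_dictionaries` when building the sort fields for the `RowConverter`.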
   
   Open questions:
   1. What heuristic to use to determine high cardinality (the heuristic needs to be reasonably fast and memory efficient to compute)
   2. Can we improve the performance of the `preserve_dictionaries=false` conversion? (Andrew to file a ticket)
   3. How to verify this doesn't cause a performance regression
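
   For open question 1, one possible shape for a cheap check is a bounded scan over the dictionary keys of an input batch that stops as soon as the distinct count crosses a threshold. This is only a sketch under assumptions: `looks_high_cardinality`, `sample_size`, and `max_distinct` are hypothetical, a prefix sample can be misleading if the input is clustered, and distinct keys only approximate distinct values if the dictionary itself contains duplicates.

```rust
use std::collections::HashSet;

/// Scans at most `sample_size` dictionary keys and reports whether more than
/// `max_distinct` distinct keys were seen. Work is bounded by `sample_size`
/// and memory by `max_distinct + 1` set entries, because the scan exits early.
fn looks_high_cardinality(keys: &[u32], sample_size: usize, max_distinct: usize) -> bool {
    // Cap the preallocation so a very large threshold cannot force a huge allocation.
    let capacity = max_distinct.saturating_add(1).min(sample_size).min(keys.len());
    let mut seen = HashSet::with_capacity(capacity);
    for &key in keys.iter().take(sample_size) {
        seen.insert(key);
        if seen.len() > max_distinct {
            return true;
        }
    }
    false
}
```

   A column whose keys cycle through 3 values would be reported as low cardinality with `max_distinct = 10`, while a column of all-unique keys would be reported as high cardinality after looking at only 11 keys.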
   
   
   [`preserve_dictionaries`]: https://docs.rs/arrow-row/45.0.0/arrow_row/struct.SortField.html#method.preserve_dictionaries
   
   
   Other options we discussed:
   1. Try to update the state of the `RowConverter` to prune unused entries (it is not clear we could make this work)
   2. Recreate the `RowConverter`
   3. Use the "non dictionary encoding" mode (what is described above)
   

