egolearner commented on issue #47151:
URL: https://github.com/apache/arrow/issues/47151#issuecomment-3168346283

   From my perspective, there are two options:
   
   Option 1: `InsertMemoValues` disallows duplicate input.
   Since `InsertMemoValues` is designed for scenarios with known dictionary
   values, it makes sense to allow only unique input values. For duplicate
   inputs, users should use `Append` or `AppendArray` instead.
   
https://github.com/apache/arrow/blob/80addfab90b65c9127b46cc5c0ff48af4db1afb3/cpp/src/arrow/array/builder_dict.h#L403-L410
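
A minimal sketch of the uniqueness check that Option 1 implies, written against plain standard-library containers rather than Arrow's builder internals. The helper name `FirstDuplicate` and the use of `std::string` values are assumptions for illustration only; they are not part of Arrow.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper, not part of Arrow: returns the position of the first
// value that repeats an earlier one, or -1 if every value is unique.
std::int64_t FirstDuplicate(const std::vector<std::string>& values) {
  std::unordered_set<std::string> seen;
  for (std::size_t i = 0; i < values.size(); ++i) {
    // insert().second is false when the value was already in the set.
    if (!seen.insert(values[i]).second) {
      return static_cast<std::int64_t>(i);
    }
  }
  return -1;
}
```

Under Option 1, a non-negative result would turn into an error returned to the caller, who would then be pointed at `Append` or `AppendArray`.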
   
   Option 2: `InsertMemoValues` performs deduplication, and `AppendIndices`
   maps user input indices to the actual dictionary indices.
   The `DictionaryBuilder` already handles deduplication for both `Append` and
   `AppendArray`. IMHO, it should also deduplicate inputs for
   `InsertMemoValues`, as storing duplicate values wastes both memory and
   storage.
   For each `InsertMemoValues` call, the mapping from input positions to
   deduplicated dictionary indices would be kept in memory, and later,
   `AppendIndices` would use it to translate the user-provided indices into
   the actual indices.
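
The remapping described in Option 2 can be sketched roughly as follows. This is a standalone illustration using standard-library containers, not Arrow's memo table; the class `DedupMemo` and its methods `InsertValues` and `MapIndices` are hypothetical names standing in for what `InsertMemoValues` and `AppendIndices` would do.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the builder's memo table; not Arrow's API.
class DedupMemo {
 public:
  // Stores each distinct value once and records, for every input position,
  // the dictionary index that the value resolved to.
  void InsertValues(const std::vector<std::string>& values) {
    for (const std::string& v : values) {
      auto it = index_of_.find(v);
      if (it == index_of_.end()) {
        const auto next = static_cast<std::int32_t>(dictionary_.size());
        it = index_of_.emplace(v, next).first;
        dictionary_.push_back(v);
      }
      remap_.push_back(it->second);
    }
  }

  // Translates indices that refer to positions in the user's (possibly
  // duplicated) input into indices into the deduplicated dictionary.
  // Throws std::out_of_range for an index outside the inserted input.
  std::vector<std::int32_t> MapIndices(
      const std::vector<std::int32_t>& user_indices) const {
    std::vector<std::int32_t> out;
    out.reserve(user_indices.size());
    for (std::int32_t i : user_indices) {
      out.push_back(remap_.at(static_cast<std::size_t>(i)));
    }
    return out;
  }

  const std::vector<std::string>& dictionary() const { return dictionary_; }

 private:
  std::unordered_map<std::string, std::int32_t> index_of_;  // value -> dictionary index
  std::vector<std::string> dictionary_;                     // distinct values, in first-seen order
  std::vector<std::int32_t> remap_;                         // input position -> dictionary index
};
```

The extra `remap_` vector is the cost of this option: it grows with the number of values the user passes in, not with the number of distinct values, which is part of why Option 1 is the simpler choice.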
   
   Personal preference: I lean toward Option 1.
   
   @kdkavanagh @raulcd 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
