egolearner commented on issue #47151: URL: https://github.com/apache/arrow/issues/47151#issuecomment-3168346283
From my perspective, there are two options:

Option 1: `InsertMemoValues` disallows duplicate input. Since `InsertMemoValues` is designed for scenarios with known dictionary values, it makes sense to only allow unique input values. For duplicate inputs, users should use `Append` or `AppendArray` instead.

https://github.com/apache/arrow/blob/80addfab90b65c9127b46cc5c0ff48af4db1afb3/cpp/src/arrow/array/builder_dict.h#L403-L410

Option 2: `InsertMemoValues` performs deduplication, and `AppendIndices` maps user input indices to the actual indices. The DictionaryBuilder already handles deduplication for both `Append` and `AppendArray`. IMHO, it should also deduplicate inputs for `InsertMemoValues`, as storing duplicate values is inefficient from both memory and storage perspectives. For each `InsertMemoValues` call, the deduplicated indices would be stored in memory, and later, `AppendIndices` would map the user-provided indices to the actual indices.

Personal preference: I lean toward Option 1.

@kdkavanagh @raulcd
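
To make the two options concrete, here is a minimal, self-contained sketch. It deliberately does not use Arrow: `AllUnique` and `ToyDictBuilder` are hypothetical names invented for illustration, and the method names only mirror `InsertMemoValues` / `AppendIndices` without matching their real signatures. It also assumes that user indices refer to positions in the concatenation of all values passed to `InsertMemoValues` so far.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Option 1 (sketch): the kind of check a strict InsertMemoValues could run
// before accepting its input. Returns false if any value appears twice.
bool AllUnique(const std::vector<std::string>& values) {
  std::unordered_set<std::string> seen;
  for (const auto& v : values) {
    if (!seen.insert(v).second) return false;  // insert failed: duplicate
  }
  return true;
}

// Option 2 (sketch): hypothetical toy builder, NOT arrow::DictionaryBuilder.
class ToyDictBuilder {
 public:
  // Insert values, skipping duplicates, and record for every user-visible
  // position which deduplicated dictionary slot it ended up in.
  void InsertMemoValues(const std::vector<std::string>& values) {
    for (const auto& v : values) {
      auto it = memo_.find(v);
      int64_t actual;
      if (it == memo_.end()) {
        actual = static_cast<int64_t>(dictionary_.size());
        memo_.emplace(v, actual);
        dictionary_.push_back(v);
      } else {
        actual = it->second;  // value already known: reuse its slot
      }
      user_to_actual_.push_back(actual);
    }
  }

  // Translate user-provided indices into actual dictionary indices.
  // Validates everything first so a failed call leaves the builder unchanged.
  bool AppendIndices(const std::vector<int64_t>& user_indices) {
    const auto limit = static_cast<int64_t>(user_to_actual_.size());
    for (int64_t i : user_indices) {
      if (i < 0 || i >= limit) return false;
    }
    for (int64_t i : user_indices) {
      indices_.push_back(user_to_actual_[static_cast<std::size_t>(i)]);
    }
    return true;
  }

  const std::vector<std::string>& dictionary() const { return dictionary_; }
  const std::vector<int64_t>& indices() const { return indices_; }

 private:
  std::unordered_map<std::string, int64_t> memo_;  // value -> actual index
  std::vector<std::string> dictionary_;            // deduplicated values
  std::vector<int64_t> user_to_actual_;            // user position -> actual index
  std::vector<int64_t> indices_;                   // remapped output indices
};
```

With this toy, inserting `{"a", "b", "a", "c"}` keeps three dictionary entries, and appending user indices `{0, 1, 2, 3}` stores `{0, 1, 0, 2}`. The extra `user_to_actual_` vector is the per-call mapping cost that Option 2 pays and Option 1 avoids, which is part of why Option 1 is simpler.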