jayzhan211 commented on issue #6981: URL: https://github.com/apache/arrow-datafusion/issues/6981#issuecomment-1773968815
> performing deduplication after would require to pattern match the internal type of the array

We may not need to pattern match the internal type of the array. Type coercion should already have been done in https://github.com/apache/arrow-datafusion/blob/9fde5c4282fd9f0e3332fb40998bf1562c17fcda/datafusion/optimizer/src/analyzer/type_coercion.rs#L582-L601, so after concatenation all the arrays share the same data type; we can just add the values to a HashSet and construct the result back from it.

> performing deduplication upon creation would require modifying the MutableArrayData

I think we can maintain a HashSet for each row: convert the arrays to primitives (`Vec<i32>`) or scalars (`Vec<ScalarValue>`), extend all the values of the same row into the same hash set, then construct the final array back from the HashSets. There is no need for `concat_internal` or `arrow::compute::concat`.

Introducing an `arrow::compute::deduplication` kernel that extends concatenation to use a HashSet internally might also be a good idea, but I am not sure how we could maintain a HashSet-like structure internally with `MutableArrayData`.
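
As a rough illustration of the per-row HashSet idea, here is a minimal sketch that unions two `Int32` list arrays row by row. It assumes a fixed `Int32` element type (the real implementation would work over coerced types or `ScalarValue`s), ignores list-level nulls, and does not preserve element order; the helper name `union_int32_lists` is hypothetical.

```rust
use std::collections::HashSet;

use arrow::array::{Array, Int32Array, ListArray};
use arrow::datatypes::Int32Type;

/// Hypothetical helper: per-row union of two Int32 list arrays,
/// deduplicated with one HashSet per row.
fn union_int32_lists(a: &ListArray, b: &ListArray) -> ListArray {
    assert_eq!(a.len(), b.len());
    let mut out: Vec<Option<Vec<Option<i32>>>> = Vec::with_capacity(a.len());

    for row in 0..a.len() {
        // Extend all values of the same row into the same hash set.
        let mut set: HashSet<i32> = HashSet::new();
        for list in [a, b] {
            let values = list.value(row);
            let values = values.as_any().downcast_ref::<Int32Array>().unwrap();
            // Inner nulls are dropped here for simplicity.
            for v in values.iter().flatten() {
                set.insert(v);
            }
        }
        out.push(Some(set.into_iter().map(Some).collect()));
    }

    // Construct the final list array back from the per-row sets.
    ListArray::from_iter_primitive::<Int32Type, _, _>(out)
}
```

The same shape would apply with `Vec<ScalarValue>` and `ScalarValue::iter_to_array` for arbitrary element types, at the cost of per-value boxing.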
