jayzhan211 commented on issue #6981:
URL: 
https://github.com/apache/arrow-datafusion/issues/6981#issuecomment-1773968815

   > performing deduplication after would require to pattern match the internal type of the array
   
   We may not need to pattern match on the internal type of the array. Type 
coercion should already have been done in 
https://github.com/apache/arrow-datafusion/blob/9fde5c4282fd9f0e3332fb40998bf1562c17fcda/datafusion/optimizer/src/analyzer/type_coercion.rs#L582-L601
   Therefore the arrays all have the same data type, so after concatenating 
them we can just add the values to a HashSet and construct the array back from it.
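
   A minimal sketch of that idea, using plain `Vec<T>` as a stand-in for Arrow arrays (an assumption, to keep the example standard-library only). The function name `dedup_after_concat` is hypothetical. Because every input already has the same element type, one generic code path is enough. The sketch also keeps first-seen order by pairing the HashSet with an output `Vec`, which is a design choice not stated above; nulls are not modelled.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Hypothetical helper: walk the inputs in concatenation order and keep
/// only the first occurrence of each value.
/// `Vec<T>` stands in for an Arrow array whose type was already coerced.
fn dedup_after_concat<T: Eq + Hash + Clone>(arrays: &[Vec<T>]) -> Vec<T> {
    let mut seen: HashSet<T> = HashSet::new();
    let mut out = Vec::new();
    // Flattening the slice of inputs visits values in concatenation order.
    for value in arrays.iter().flatten() {
        // `insert` returns true only the first time a value is seen.
        if seen.insert(value.clone()) {
            out.push(value.clone());
        }
    }
    out
}
```

   Calling it with `&[vec![1, 2, 2], vec![2, 3, 1]]` yields `[1, 2, 3]`.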
   
   > performing deduplication upon creation would require modifying the MutableArrayData
   
   I think we can maintain a HashSet for each row, convert each array to 
primitives (`Vec<i32>`) or scalars (`Vec<ScalarValue>`), extend the row's 
HashSet with all the values on that row, and then construct the final array 
back from the HashSets. There is no need for `concat_internal` or 
`arrow::compute::concat`. 
   
   Introducing an `arrow::compute::deduplication` that extends concatenation 
to use a HashSet internally might also be a good idea, but I am not sure how 
we could keep something HashSet-like internally with `MutableArrayData`.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
