edmondop commented on issue #6981: URL: https://github.com/apache/arrow-datafusion/issues/6981#issuecomment-1773837359
@jayzhan211 I looked deeper into the code, and it seems that:

- performing deduplication after creation would require pattern matching on the internal type of the array
- performing deduplication upon creation would require modifying `MutableArrayData`

The latter is here: https://github.com/apache/arrow-rs/blob/03d0505fc864c09e6dcd208d3cdddeecefb90345/arrow-select/src/concat.rs#L111 and changing it would require a separate release of arrow-rs that extends concatenation to use a `HashSet` internally.

On the other hand, I can't find any sign of deduplication in the current arrow-datafusion. I created a draft PR here: https://github.com/apache/arrow-datafusion/pull/7897/files, but I am stuck at the moment.
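To illustrate the first option (deduplicating after creation), here is a minimal sketch in plain Rust. It does not use the arrow-rs API: `Column`, `dedup_values`, and `dedup_column` are hypothetical stand-ins, and a real version would presumably match on the Arrow data type and downcast to the concrete array type instead.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Hypothetical stand-in for a dynamically typed array (NOT an arrow-rs type).
#[derive(Debug, PartialEq)]
enum Column {
    Int64(Vec<Option<i64>>),
    Utf8(Vec<Option<String>>),
}

/// Keep the first occurrence of each value, preserving the original order.
fn dedup_values<T: Eq + Hash + Clone>(values: &[T]) -> Vec<T> {
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for v in values {
        // `insert` returns true only the first time a value is seen.
        if seen.insert(v) {
            out.push(v.clone());
        }
    }
    out
}

/// Deduplicate after creation: every supported element type needs its own match arm.
fn dedup_column(column: &Column) -> Column {
    match column {
        Column::Int64(values) => Column::Int64(dedup_values(values)),
        Column::Utf8(values) => Column::Utf8(dedup_values(values)),
    }
}

fn main() {
    let ints = Column::Int64(vec![Some(1), Some(2), Some(1), None, None]);
    println!("{:?}", dedup_column(&ints));
}
```

The per-type `match` is the cost mentioned above: each new element type needs another arm, whereas deduplicating during concatenation would keep the `HashSet` in one place.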
