Dandandan commented on issue #998: URL: https://github.com/apache/arrow-datafusion/issues/998#issuecomment-922975719
`Union distinct` effectively means building a structure that removes duplicate rows. This can be done in one of two ways:

* Sort the rows and deduplicate based on equal consecutive rows (might be fast if the data is already sorted)
* Hash all columns of each row as the key and store it in a hash table

My proposal to implement `Union` is to translate `[a] union [b]` into the equivalent `select distinct * from ([a] union all [b])`. This currently uses the hash-aggregate-based implementation for deduplication. Later it could switch to whichever implementation is more efficient for the particular query.
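
To make the two deduplication strategies and the proposed rewrite concrete, here is a minimal sketch in plain Python. It is not DataFusion code: the function names (`union_all`, `distinct_sorted`, `distinct_hashed`, `union_distinct`) are hypothetical, rows are modelled as tuples, and the sort-based variant assumes all rows are mutually comparable (no nulls, consistent column types).

```python
from typing import Hashable, Iterable, List, Tuple

# A row is modelled as a tuple of hashable column values (an assumption
# made for this sketch; a real engine works on columnar batches).
Row = Tuple[Hashable, ...]


def union_all(a: Iterable[Row], b: Iterable[Row]) -> List[Row]:
    """Concatenate both inputs and keep duplicates, like `union all`."""
    return list(a) + list(b)


def distinct_sorted(rows: Iterable[Row]) -> List[Row]:
    """Sort-based deduplication: sort, then drop equal consecutive rows."""
    out: List[Row] = []
    for row in sorted(rows):
        if not out or out[-1] != row:
            out.append(row)
    return out


def distinct_hashed(rows: Iterable[Row]) -> List[Row]:
    """Hash-based deduplication: keep a row only the first time it is seen."""
    seen = set()
    out: List[Row] = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            out.append(row)
    return out


def union_distinct(a: Iterable[Row], b: Iterable[Row]) -> List[Row]:
    """`[a] union [b]` rewritten as distinct over `[a] union all [b]`."""
    return distinct_hashed(union_all(a, b))
```

Both variants return the same set of rows. The hash-based one preserves first-seen order and needs no ordering on the values, while the sort-based one returns rows in sorted order and can skip the sort when the input is already ordered.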
