Dandandan commented on issue #998:
URL: 
https://github.com/apache/arrow-datafusion/issues/998#issuecomment-922975719


   `Union distinct` effectively means building a structure that removes duplicate rows. There are two ways to do this:
   
   * Sort the rows and drop consecutive rows that are equal (might be fast if the data is already sorted).
   * Hash the values of all columns of each row and store them in a hash table.
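
As a rough illustration of the two strategies, here is a minimal Python sketch. It is not DataFusion code: the helper names `dedup_sorted` and `dedup_hashed` are hypothetical, and rows are assumed to be plain tuples that are both comparable and hashable.

```python
from itertools import groupby


def dedup_sorted(rows):
    """Sort the rows, then keep one row per run of equal consecutive rows."""
    # groupby only merges neighbours, so the input has to be sorted first.
    return [row for row, _ in groupby(sorted(rows))]


def dedup_hashed(rows):
    """Hash each full row (all columns) and keep the first occurrence."""
    seen = set()   # stands in for the hash table
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


rows = [(1, "a"), (2, "b"), (1, "a"), (3, "c"), (2, "b")]
print(dedup_sorted(rows))
print(dedup_hashed(rows))
```

If the input is already sorted, the sort-based variant reduces to a single pass over neighbouring rows. The hash-based variant needs no ordering but keeps every distinct row in memory.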
   
   My proposal to implement `Union` is to translate `[a] union [b]` into the equivalent `select distinct * from ([a] union all [b])`. The `select distinct` currently uses the hash-aggregate-based implementation for deduplication. Later it could switch to whichever implementation is more efficient for the particular query.
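
The equivalence behind this rewrite can be checked on a small example. The sketch below uses Python's built-in `sqlite3` module rather than DataFusion, purely to show that the two query shapes return the same set of rows; the tables `a` and `b` and their contents are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (x INTEGER, y TEXT);
    CREATE TABLE b (x INTEGER, y TEXT);
    INSERT INTO a VALUES (1, 'p'), (1, 'p'), (2, 'q');
    INSERT INTO b VALUES (2, 'q'), (3, 'r');
    """
)

# Original form: [a] union [b]
union_rows = conn.execute(
    "SELECT * FROM a UNION SELECT * FROM b"
).fetchall()

# Rewritten form: select distinct * from ([a] union all [b])
rewritten_rows = conn.execute(
    "SELECT DISTINCT * FROM (SELECT * FROM a UNION ALL SELECT * FROM b)"
).fetchall()

# Row order is not guaranteed by either query, so compare sorted results.
print(sorted(union_rows) == sorted(rewritten_rows))
```

Both queries remove the duplicate within `a` as well as the row shared between `a` and `b`, which is what makes the rewrite safe.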


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
