[PR] Deduplicating recursive CTE implementation [datafusion]

via GitHub Thu, 23 Oct 2025 13:00:15 -0700


Tpt opened a new pull request, #18254:
URL: https://github.com/apache/datafusion/pull/18254


   Relies on the aggregate `GroupValues` abstraction to build a hash table of 
the emitted rows, which is then used to deduplicate them.
   
   A hash table wrapper written specifically for deduplication might be more 
efficient, but this implementation should give a fair baseline.
   
   ## Which issue does this PR close?
   
   - Closes #18140.
   
   ## Rationale for this change
   
   Implements deduplicating recursive CTEs (i.e. `UNION` rather than 
`UNION ALL` inside `WITH RECURSIVE`) using a hash table. I reuse the one from 
aggregates to avoid rebuilding a full wrapper with per-type specializations. 
Each time the static or the recursive term of the CTE returns a batch, the 
hash table is used to remove rows that have already been seen. The remaining 
rows are emitted and kept in memory for the next recursion step.
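   
   For illustration only, the deduplication loop described above can be 
sketched roughly as follows. This is not the code in this PR: it assumes rows 
are plain hashable values and uses `std::collections::HashSet` in place of 
`GroupValues`, and the names `dedup_batch` and `recursive_union` are 
hypothetical.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Returns the rows of `batch` that were not seen before, and records them in
/// `seen`. Duplicates inside the batch itself are dropped as well, because
/// `HashSet::insert` returns `false` for a value that is already present.
fn dedup_batch<R: Hash + Eq + Clone>(seen: &mut HashSet<R>, batch: Vec<R>) -> Vec<R> {
    batch
        .into_iter()
        .filter(|row| seen.insert(row.clone()))
        .collect()
}

/// Evaluates `static_term UNION recursive_term` until no new row is produced.
/// `recursive_term` receives only the rows that were new in the previous step.
fn recursive_union<R, F>(static_term: Vec<R>, mut recursive_term: F) -> Vec<R>
where
    R: Hash + Eq + Clone,
    F: FnMut(&[R]) -> Vec<R>,
{
    let mut seen = HashSet::new();
    let mut output = Vec::new();

    // Rows from the static term are deduplicated too.
    let mut working = dedup_batch(&mut seen, static_term);

    while !working.is_empty() {
        // Emit the new rows, then feed them to the recursive term.
        output.extend(working.iter().cloned());
        let next = recursive_term(&working);
        // Only rows never seen before survive into the next step.
        working = dedup_batch(&mut seen, next);
    }
    output
}
```

   With a recursive term that follows the edges of a cyclic graph, this loop 
stops once every reachable node has been emitted, whereas the `UNION ALL` 
variant would keep revisiting the cycle.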
   
   ## What changes are included in this PR?
   
   
   ## Are these changes tested?
   
   Yes, some sqllogictests have been added.
   
   ## Are there any user-facing changes?
   
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

