HeartSaVioR commented on PR #39931:
URL: https://github.com/apache/spark/pull/39931#issuecomment-1441343786

   > One thing that is not clear is how 'join + dedup()' is not allwowd, but 
'join + groupBy,count' is allowed.
   
   It's not because the latter operator is dedup. It's because dedup operator 
gets two event time columns from an input DataFrame, and current late record 
filtering/eviction logic only handles one event time column per input 
DataFrame. It has been picking the first occurrence (simply assuming there is 
only one event time column which is actually incorrect), but we should have 
considered all event time columns. e.g. et1 00:10 et2 00:30 W=00:20 -> the row 
should not be evicted as further row can be matched with this row (et2 is still 
earlier than W), but we evict it.
   
   We may be able to address this in (near) future, but I'd like to let it be 
out of scope for this effort.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to