HeartSaVioR commented on PR #39931: URL: https://github.com/apache/spark/pull/39931#issuecomment-1441343786
> One thing that is not clear is how 'join + dedup()' is not allowed, but 'join + groupBy,count' is allowed.

It's not because the latter operator is dedup. It's because the dedup operator gets two event time columns from an input DataFrame, and the current late record filtering/eviction logic only handles one event time column per input DataFrame. It has been picking the first occurrence (simply assuming there is only one event time column, which is actually incorrect), but we should have considered all event time columns.

e.g. et1 = 00:10, et2 = 00:30, W = 00:20 -> the row should not be evicted, as a further row can still be matched with this row (et2 is still later than W), but we evict it.

We may be able to address this in the (near) future, but I'd like to leave it out of scope for this effort.
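
To make the et1/et2 example concrete, here is a minimal Python sketch of the two eviction decisions. It is not Spark's actual implementation: the function names, the strict `<` comparison against the watermark, and the "evict only when every event time column is behind the watermark" rule are all assumptions made for illustration.

```python
from datetime import timedelta


def hhmm(value):
    """Parse an 'HH:MM' string into a timedelta (illustrative helper)."""
    hours, minutes = value.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes))


def evict_first_column_only(event_times, watermark):
    # Models the behavior described above: only the first event time
    # column is compared against the watermark.
    return event_times[0] < watermark


def evict_all_columns(event_times, watermark):
    # Hypothetical alternative: evict only when every event time column
    # has fallen behind the watermark.
    return all(et < watermark for et in event_times)


row = (hhmm("00:10"), hhmm("00:30"))  # (et1, et2)
watermark = hhmm("00:20")             # W

print(evict_first_column_only(row, watermark))  # True: evicted on et1 alone
print(evict_all_columns(row, watermark))        # False: et2 is still ahead of W
```

With the values from the example, the first-column check evicts the row while the all-columns check keeps it, which is the mismatch described above.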
