HeartSaVioR edited a comment on pull request #35521: URL: https://github.com/apache/spark/pull/35521#issuecomment-1039951780
> TTL may be a solution here. Just watermark seems more commonly used in Structured Streaming operators, do we have any stateful operators with TTL? Or we need to introduce a state TTL mechanism for this?

This depends on whether we want to bring the functionality globally, or make it specific to `dropDuplicates`. If we assume only `dropDuplicates`, I can roughly sketch the high-level idea (DISCLAIMER: not guaranteed to work).

Applying TTL with event time could be done by updating the event time of the state row to the maximum event time among the duplicated rows seen so far, plus the specified TTL. (Yes, this is very similar to session windows, except that we don't merge windows.)

Since we have to keep the API the same between batch and streaming, it may need to be a config instead of a parameter of the API. I don't like that, but keeping the API identical between batch and streaming is the top-level concern.

I'm not pushing aggressively for my idea. I'm definitely open to a better idea that makes sense semantically.
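To make the event-time TTL idea concrete, here is a rough toy model in plain Python. It is not Spark code and not a proposed implementation: `TtlDeduplicator`, `process_batch`, and the `ttl` field are hypothetical names, and evicting a key once its expiry is less than or equal to the watermark is an assumption made only for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, Iterable, List, Tuple

Row = Tuple[Hashable, int]  # (dedup key, event time); hypothetical row shape


@dataclass
class TtlDeduplicator:
    """Toy model of deduplication state with an event-time TTL (not Spark code)."""

    ttl: int  # hypothetical TTL, in the same unit as the event time
    # dedup key -> maximum event time seen for that key so far
    max_event_time: Dict[Hashable, int] = field(default_factory=dict)

    def process_batch(self, rows: Iterable[Row], watermark: int) -> List[Row]:
        # Evict state rows whose expiry (max event time + TTL) has been
        # reached by the watermark. The "<=" boundary is an assumption.
        expired = [
            key
            for key, seen in self.max_event_time.items()
            if seen + self.ttl <= watermark
        ]
        for key in expired:
            del self.max_event_time[key]

        emitted: List[Row] = []
        for key, event_time in rows:
            if key in self.max_event_time:
                # Duplicate: emit nothing, but push the expiry forward by
                # remembering the largest event time among the duplicates.
                self.max_event_time[key] = max(self.max_event_time[key], event_time)
            else:
                # First occurrence (or first after eviction): emit and track it.
                self.max_event_time[key] = event_time
                emitted.append((key, event_time))
        return emitted
```

With `ttl=10`, a key that keeps receiving duplicates stays in state because each duplicate can raise its maximum event time, while a key that goes quiet is dropped once the watermark reaches its last event time plus 10. After eviction the same key is emitted again, which is the semantic difference from watermark-only deduplication.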
