GitHub user jose-torres commented on the issue: https://github.com/apache/spark/pull/21617

Well, "clear" is relative. Since we're trying to provide functionality in the DataFrame API, it's perfectly alright for the RDD graph to end up looking a bit weird. It seems feasible to do something like:

* Have a stream reader RDD write its side output to some special shuffle partition (or set of partitions?) which the main query knows not to read.
* Have a stream writer RDD with two heterogeneous sets of partitions: one to write the main query to the sink, and another to apply the specified action to the side output. (See the sketch after this comment.)

I agree that watermarks should be applied immediately after the data reader - other streaming systems generally require this, and Spark does not seem to be getting any benefit from having a more general watermark concept. I haven't had time to push for this change, but I think it's known that the current Spark watermark model is flawed - I'd certainly support fixing it.
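
To make the second bullet concrete, here is a minimal, hypothetical sketch of what a writer RDD with two heterogeneous partition groups could look like. None of this is existing Spark code or the author's actual proposal: the class name `SideOutputWriterRDD` and the `sinkWrite`/`sideAction` callbacks are illustrative assumptions; only the `RDD`, `Partition`, and `NarrowDependency` APIs it builds on are real.

```scala
import org.apache.spark.{Dependency, NarrowDependency, Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical sketch: one RDD whose partitions fall into two groups. Partitions
// [0, mainCount) write the main query's rows to the sink; the remaining partitions
// apply a user-specified action to the side output.
class SideOutputWriterRDD[T](
    mainOutput: RDD[T],                // rows destined for the sink
    sideOutput: RDD[T],                // rows routed to the "special" side partitions
    sinkWrite: Iterator[T] => Unit,    // writes one partition of the main query
    sideAction: Iterator[T] => Unit)   // user-specified action on the side output
  extends RDD[Unit](mainOutput.sparkContext, Nil) {

  private def mainCount: Int = mainOutput.getNumPartitions

  // Each parent contributes its own slice of this RDD's partition space.
  override def getDependencies: Seq[Dependency[_]] = Seq(
    new NarrowDependency(mainOutput) {
      override def getParents(pid: Int): Seq[Int] =
        if (pid < mainCount) Seq(pid) else Nil
    },
    new NarrowDependency(sideOutput) {
      override def getParents(pid: Int): Seq[Int] =
        if (pid >= mainCount) Seq(pid - mainCount) else Nil
    })

  override def getPartitions: Array[Partition] =
    (0 until mainCount + sideOutput.getNumPartitions).map { i =>
      new Partition { override def index: Int = i }
    }.toArray

  override def compute(split: Partition, context: TaskContext): Iterator[Unit] = {
    if (split.index < mainCount) {
      sinkWrite(mainOutput.iterator(mainOutput.partitions(split.index), context))
    } else {
      val sideIdx = split.index - mainCount
      sideAction(sideOutput.iterator(sideOutput.partitions(sideIdx), context))
    }
    Iterator.empty
  }
}
```

Running an action (e.g. `count()`) on such an RDD would execute both the sink writes and the side-output action in a single job, which is the point of keeping the two partition groups in one RDD even though the resulting graph looks unusual.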