Github user jose-torres commented on the issue:

    https://github.com/apache/spark/pull/21617
  
    Well, "clear" is relative. Since we're trying to provide functionality in 
the Dataframe API, it's perfectly alright for the RDD graph to end up looking a 
bit weird. It seems feasible to do something like:
    
    * Have a stream reader RDD write side output to some special shuffle 
partition (set of partitions?) which the main query knows not to read.
    * Have a stream writer RDD with two heterogeneous sets of partitions: one 
that writes the main query to the sink, and another that applies the specified 
action to the side output (rough sketch after this list).
    
    I agree that watermarks should be applied immediately after the data reader: 
other streaming systems generally require this, and Spark doesn't seem to gain 
any benefit from having a more general watermark concept. I haven't had time to 
push for this change, but I think it's well known that the current Spark 
watermark model is flawed, and I'd certainly support fixing it.
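
    For reference, this is roughly what declaring the watermark immediately on 
the reader's output looks like with today's `withWatermark` API (the rate 
source and the interval values are just placeholders):

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.window

    // Sketch of declaring the watermark directly on the reader's output,
    // before any other operator. The rate source and the 10-second delay
    // are illustrative choices, not part of the proposal.
    object WatermarkAtSource {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("watermark-at-source")
          .getOrCreate()
        import spark.implicits._

        val events = spark.readStream
          .format("rate")                 // built-in test source: (timestamp, value)
          .option("rowsPerSecond", "10")
          .load()
          .withWatermark("timestamp", "10 seconds")  // declared right at the source

        // Everything downstream sees an already-watermarked stream.
        val counts = events
          .groupBy(window($"timestamp", "1 minute"))
          .count()

        counts.writeStream
          .outputMode("update")
          .format("console")
          .start()
          .awaitTermination()
      }
    }
    ```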

