GitHub user nikunjb opened a pull request:

    https://github.com/apache/spark/pull/22423

    [SPARK-25302][STREAMING] Checkpoint the reducedStream in ReducedWindowedDStream so as to cut the lineage completely to the parent
    
      ## What changes were proposed in this pull request?
    
      DStream.reduceByKeyAndWindow() with an inverse reduce function eventually
      creates a ReducedWindowedDStream but does not checkpoint it.
      When combined with the issue described in SPARK-25303,
      it results in the problem described in this post:

      http://apache-spark-user-list.1001560.n3.nabble.com/DStream-reduceByKeyAndWindow-not-using-checkpointed-data-for-inverse-reducing-old-data-td33332.html

      This change checkpoints the internal reducedStream to cut the lineage.
      A separate patch is being created for the issue in SPARK-25303.
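
      For context, a minimal usage sketch of the API that takes this code path
      (the app name, socket source, batch/window durations, and checkpoint
      directory below are illustrative placeholders, not taken from this patch):

          import org.apache.spark.SparkConf
          import org.apache.spark.streaming.{Seconds, StreamingContext}

          val conf = new SparkConf().setAppName("WindowedCounts").setMaster("local[2]")
          val ssc = new StreamingContext(conf, Seconds(1))
          // Checkpointing must be enabled to use the inverse-reduce variant.
          ssc.checkpoint("/tmp/spark-checkpoints")

          // Hypothetical source; any DStream of key/value pairs behaves the same.
          val words = ssc.socketTextStream("localhost", 9999)
            .flatMap(_.split(" "))
            .map(w => (w, 1L))

          // Supplying an inverse reduce function makes Spark build a
          // ReducedWindowedDStream internally; its intermediate reducedStream
          // is what this patch checkpoints.
          val windowedCounts = words.reduceByKeyAndWindow(
            (a: Long, b: Long) => a + b,   // reduce new data entering the window
            (a: Long, b: Long) => a - b,   // inverse-reduce data leaving the window
            Seconds(30),                   // window duration
            Seconds(10))                   // slide duration

          windowedCounts.print()
          ssc.start()
          ssc.awaitTermination()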
    
      ## How was this patch tested?
    
      Ran the existing unit tests. A couple of existing unit test classes were
      written assuming they would never be serialized. Because the new
      checkpointing causes serialization to occur, the SparkContext in those
      classes had to be declared transient, as is already done in other tests.
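
      To illustrate the kind of test change described above (the class and
      method names here are hypothetical, not the actual test code):

          import org.apache.spark.{SparkConf, SparkContext}

          // Hypothetical serializable test fixture: marking the SparkContext
          // transient keeps it out of any object graph that the new
          // checkpointing path ends up serializing.
          class ExampleStreamingFixture extends Serializable {
            @transient private var sc: SparkContext = _

            def setUp(): Unit = {
              val conf = new SparkConf().setMaster("local[2]").setAppName("ExampleStreamingFixture")
              sc = new SparkContext(conf)
            }

            def tearDown(): Unit = {
              if (sc != null) {
                sc.stop()
                sc = null
              }
            }
          }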
    
      ## What improvement does this patch make?
    
      Cuts the lineage to the parent DStream. When combined with the fix
      in SPARK-25303, it unpersists the intermediate RDDs and input DStreams that
      were being remembered for too long, resulting in much lower memory usage
      on the executors. This was observed from the differences in the DAG chart
      and from the data on the Storage tab of the driver UI.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nikunjb/spark spark-25302

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22423.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22423
    
----
commit 36929ad2424861e1a66bb8c867b240e1aec8febc
Author: Nikunj Bansal <nikunj.bansal@...>
Date:   2018-09-14T22:37:36Z

    [SPARK-25302][STREAMING] Checkpoint the reducedStream in ReducedWindowedDStream so as to cut the lineage completely to the parent
    
----

