GitHub user nikunjb opened a pull request:
https://github.com/apache/spark/pull/22424
[SPARK-25303][STREAMING] For checkpointed DStreams, remove the parentâ¦
â¦RemeberDuration - this will be recalculated to a minimum value
## How was this patch tested?
## What changes were proposed in this pull request?
When a DStream gets checkpointed, there is no need to remember
parentRDDs (unless indicated by other DStreams from that parent).
This change sets the parentRememberDuration to null for checkpointed
DStreams. Please note that this does get recalculated during
validation
to a minimum value in the parent as expected for any DStream as
usual.
Before this change, even after the fix for SPARK-25302 that cut the
lineage to the parent, the parentDStreams and all the ones before
that
were being remembered for long durations. This was unnecessary and
resulted in input RDDs being persisted for much longer than needed.
Please see post below for original issue and a reply to it about
resolution
http://apache-spark-user-list.1001560.n3.nabble.com/DStream-reduceByKeyAndWindow-not-using-checkpointed-data-for-inverse-reducing-old-data-td33332.html
A separate patch is being created for the issue in SPARK-25302
## How was this patch tested?
Running the existing Unit tests.
## What improvement does this patch make?
When combined with the fix in SPARK-25302,
unpersits the intermediate RDDs and Input DStreams that
were being remembered for too long, resulting in much lower memory
usage
on the Executors. Observed from the DAG chart differences and
data from the Storage tab on the Driver UI.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/nikunjb/spark spark-25303
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22424.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22424
----
commit 578c4c6beec4f9e26d9f0fed4f0083336ac498a7
Author: Nikunj Bansal <nikunj.bansal@...>
Date: 2018-09-14T22:34:47Z
[SPARK-25303][STREAMING] For checkpointed DStreams, remove the
parentRemeberDuration - this will be recalculated to a minimum value
## What changes were proposed in this pull request?
When a DStream gets checkpointed, there is no need to remember
parentRDDs (unless indicated by other DStreams from that parent).
This change sets the parentRememberDuration to null for checkpointed
DStreams. Please note that this does get recalculated during
validation
to a minimum value in the parent as expected for any DStream as
usual.
Before this change, even after the fix for SPARK-25302 that cut the
lineage to the parent, the parentDStreams and all the ones before
that
were being remembered for long durations. This was unnecessary and
resulted in input RDDs being persisted for much longer than needed.
Please see post below for original issue and a reply to it about
resolution:
http://apache-spark-user-list.1001560.n3.nabble.com/DStream-reduceByKeyAndWindow-not-using-checkpointed-data-for-inverse-reducing-old-data-td33332.html
A separate patch is being created for the issue in SPARK-25302
## How was this patch tested?
Running the existing Unit tests.
## What improvement does this patch make?
When combined with the fix in SPARK-25302,
unpersits the intermediate RDDs and Input DStreams that
were being remembered for too long, resulting in much lower memory
usage
on the Executors. Observed from the DAG chart differences and
data from the Storage tab on the Driver UI.
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]