dongjoon-hyun commented on pull request #30876:
URL: https://github.com/apache/spark/pull/30876#issuecomment-751060200


   Thank you for your advice. I added my replies.
   
   > If the idea is we enable this for master, and evaluate the impact over the 
next 6 months and revisit at the end, I am fine with that
   
   Yes, that's the idea.
   
   > but an evaluation would need to be done before this goes out or perhaps 
identify the subset of conditions where it makes sense to enable it by default.
   
   Sure, I'll make sure to identify the desirable and problematic conditions and try to document them officially in the Apache Spark 3.2.0 timeframe.
   
   BTW, regarding the following comments (@mridulm and @HyukjinKwon),
   > One option would be to do this by default for k8s as @HyukjinKwon suggested
   > for other resource managers, this does not necessarily apply.
   
   The following questions come to mind.
   1. The SSDs of YARN/Mesos cluster nodes sometimes wear out due to Spark's heavy access pattern, so SSD disk failures are bound to happen. I'm aware of this SSD disk issue and, AFAIK, LinkedIn's external shuffle service also tried to address those kinds of disk issues.
   2. A YARN/Mesos cluster node can also go offline due to a failure or regular maintenance.
   3. YARN/Mesos clusters without an external shuffle service exist; in particular, Mesos clusters frequently run without one.
   4. Even in YARN/Mesos clusters with an external shuffle service, the service itself is not invincible. The YARN service can sometimes crash, and Apache Spark has configurations to mitigate that.
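   As an illustration of point (4), the kind of mitigating configurations I have in mind are the shuffle fetch/registration retry settings (a sketch only; the values are illustrative, not recommendations):
   ```
   # Retry shuffle block fetches when the shuffle service is temporarily
   # unavailable (e.g., a NodeManager restart).
   spark.shuffle.io.maxRetries             8
   spark.shuffle.io.retryWait              10s
   # Retry executor registration with the external shuffle service.
   spark.shuffle.registration.maxAttempts  5
   ```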
   
   This is a kind of self-healing feature, so I believe it can help with (1) ~ (4).
   
   The following is unclear to me.
   
   > It is quite common for an application to have references to a persisted 
RDD even after its use - with the loss of the RDD blocks having little to no 
functional impact.
   > This is similar to loss of blocks for an unreplicated persisted RDD - we 
do not proactively recompute the lost blocks; but do so on demand.
   
   Where is `a persisted RDD` located in this context? Is it in external storage like S3 or HDFS?
   
   In this PR, although the Apache Spark RDD maintains its lineage, the number of replicas and proactive recovery are a matter of `performance` trade-off, not functional impact. The loss of data partially re-triggers the previous stage and its ancestor stages, and I'm observing that these retries can hurt performance severely.
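   To make the trade-off concrete: a persisted RDD's resilience can already be tuned via replication and proactive replenishment, at a performance cost. A sketch of the relevant knobs (values illustrative):
   ```
   # Re-replicate cached RDD blocks proactively when an executor holding a
   # replica is lost, instead of waiting for on-demand recomputation.
   spark.storage.replication.proactive   true
   # Replication count itself is chosen per-RDD at persist time, e.g.
   # rdd.persist(StorageLevel.MEMORY_AND_DISK_2) keeps 2 replicas.
   ```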

