dongjoon-hyun edited a comment on pull request #30876:
URL: https://github.com/apache/spark/pull/30876#issuecomment-751942186


   Thank you, @mridulm. I really appreciate your replies. Let me address your points one by one.
   
   1. For this question, I answered at the beginning [here](https://github.com/apache/spark/pull/30876#discussion_r547031257) that this is a kind of self-healing feature. (A small configuration sketch follows the quote below.)
   > Making it default will impact all applications which have replication > 1: 
given this PR is proposing to make it the default, I would like to know if 
there was any motivating reason to make this change ?
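
   As a hedged sketch of what flipping the default means for an individual application
   (the config key `spark.storage.replication.proactive` is the real setting touched by
   this PR; everything else here is only an illustrative example), any application can
   still override the default explicitly:

   ```scala
   // Sketch only: per-application override of the proposed default.
   import org.apache.spark.SparkConf

   val conf = new SparkConf()
     .setAppName("opt-out-example") // hypothetical application name
     // Explicitly disable proactive block replication for this application,
     // regardless of what the cluster-wide or built-in default becomes.
     .set("spark.storage.replication.proactive", "false")
   ```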
   
   2. For the following question, I asked for your evidence first because I'm not aware of any. :)
   > If the cost of proactive replication is close to zero now (my experiments 
were from a while back), ofcourse the discussion is moot - did we have any 
results for this ?
   
   3. For the following question, it seems that you assume the current Spark behavior is the best. I don't think this question justifies treating data loss on the Spark side as acceptable.
   > What is the ongoing cost when application holds RDD references, but they 
are not in active use for rest of the application (not all references can be 
cleared by gc) - resulting in replication of blocks for an RDD which is 
legitimately not going to be used again ?
   
   4. For the following, yes, but `exacerbates` doesn't seem like the right term here, because we should make Spark smarter at handling those cases, as I already replied [here](https://github.com/apache/spark/pull/30876#discussion_r547421217). (A sketch of the dynamic allocation settings involved follows the quote below.)
   > Note that the above is orthogonal to DRA evicting an executor via storage 
timeout configuration. That just exacerbates the problem : since a larger 
number of executors could be lost.
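
   For context on the executor-loss path mentioned above, here is a rough sketch of the
   dynamic allocation knobs involved (the config keys are existing Spark settings; the
   timeout value is only an example):

   ```scala
   // Sketch only: with dynamic allocation, executors holding cached blocks can be
   // removed once the cached-executor idle timeout expires, which is one way block
   // replicas get lost; proactive replication then tries to restore them on the
   // surviving executors.
   import org.apache.spark.SparkConf

   val conf = new SparkConf()
     .set("spark.dynamicAllocation.enabled", "true")
     // Cached executors are kept indefinitely by default; a finite timeout lets
     // dynamic allocation remove them even while they hold cached blocks.
     .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "600s")
     .set("spark.storage.replication.proactive", "true")
   ```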
   
   5. For the following, I didn't make this PR for that specific use case. I made this PR to improve this feature across various environments in the Apache Spark 3.2.0 timeframe, as stated [here](https://github.com/apache/spark/pull/30876#issuecomment-749953223).
   > Specifically for this usecase, we dont need to make it a spark default 
right ? ...
   
   6. For the following, I replied [here](https://github.com/apache/spark/pull/30876#issuecomment-751060200) that a YARN environment can also suffer from disk loss or executor loss, because you have insisted from the beginning that YARN doesn't need this feature. I'm still not sure that a YARN environment is that invincible.
   > But this feels sufficiently narrow enough not to require a global default, 
right ? It feels more like a deployment/application default and not a platform 
level default ?
   
   7. For `replication == 1`, please see the `spark.storage.replication.proactive` code. It only tries to replicate when at least one live copy of the data still exists, so replication doesn't occur in that case. (A small illustration follows the quote below.)
   > Shuffle ? Replicated RDD where replication == 1 ?
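
   To illustrate point 7 (my own sketch, not code from this PR), only blocks persisted
   with a replicated storage level are candidates for proactive re-replication:

   ```scala
   // Assumes an existing SparkContext `sc` (e.g. in spark-shell) and
   // spark.storage.replication.proactive=true.
   import org.apache.spark.storage.StorageLevel

   val single  = sc.parallelize(1 to 1000).persist(StorageLevel.MEMORY_ONLY)   // replication == 1
   val doubled = sc.parallelize(1 to 1000).persist(StorageLevel.MEMORY_ONLY_2) // replication == 2
   single.count(); doubled.count()

   // If an executor holding a replica of `doubled` blocks is lost, a surviving holder
   // is asked to re-replicate them. Blocks of `single` only ever request one copy, so
   // no proactive replication is attempted for them.
   ```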
   
   8. I'm trying to utilize all features of Apache Spark, and I'm open to that, too. We are developing this, and Spark is not a bible written in stone.
   > Perhaps better tuning for (c) might help more holistically ?
   
   
   I know that this is the holiday season and I'm really grateful for your opinions. If you don't mind, can we have a Zoom meeting when you are available, @mridulm? I think we have different ideas about open source development and about the scope of this work. I want to make progress in this area in Apache Spark 3.2.0 by completing a document, a better implementation, or anything more. Please let me know if you can have a Zoom meeting. Thanks!

