mridulm commented on a change in pull request #30876:
URL: https://github.com/apache/spark/pull/30876#discussion_r548191318



##########
File path: core/src/main/scala/org/apache/spark/internal/config/package.scala
##########
@@ -384,7 +384,7 @@ package object config {
         "get the replication level of the block to the initial number")
       .version("2.2.0")
       .booleanConf
-      .createWithDefault(false)
+      .createWithDefault(true)

Review comment:
       In the past, I found this to be noisy in the cases where replication 
was enabled - but that was a while back, and I would like to better understand 
what the 'cost' of enabling this by default on master would be for nontrivial 
use cases: disabled by default means only developers who specifically test for 
it pay the price, not everyone.
   It is quite common for an application to keep references to a persisted RDD 
even after it is no longer used - so the loss of that RDD's blocks has little 
to no functional impact.
   This is similar to the loss of blocks for an unreplicated persisted RDD - we 
do not proactively recompute the lost blocks, but do so on demand.
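
   As a concrete illustration of that on-demand behaviour, here is a minimal 
sketch (assuming `sc` is an existing `SparkContext`; the dataset and names are 
illustrative):

```scala
import org.apache.spark.storage.StorageLevel

// Unreplicated persisted RDD: blocks are materialized by the first action.
val cached = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_ONLY)
cached.count()

// If an executor is later lost, its cached blocks are gone, but Spark does
// not recompute them eagerly; the missing partitions are recomputed only
// when the RDD is touched again by an action such as this one.
cached.count()
```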
   
   If the idea is to enable this on master, evaluate the impact over the next 
6 months, and revisit at the end, I am fine with that: but the evaluation would 
need to be done before this goes out - otherwise anyone using replicated 
storage will also be hit with the impact of proactive replication, and will 
need to disable it for their applications.
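
   If this does ship enabled, the per-application escape hatch would be the 
usual config override; a minimal sketch, assuming the flag being flipped in 
this hunk is `spark.storage.replication.proactive` (the key itself is not 
visible in the truncated diff):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

// Per-application opt-out of proactive re-replication, assuming the config
// key under discussion is spark.storage.replication.proactive.
val conf = new SparkConf()
  .setAppName("replication-heavy-app")
  .set("spark.storage.replication.proactive", "false")

val sc = new SparkContext(conf)
```

   The same override can of course be passed via `--conf` at submit time 
instead of being hard-coded in the application.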



