mridulm commented on a change in pull request #30876:
URL: https://github.com/apache/spark/pull/30876#discussion_r548191318
##########
File path: core/src/main/scala/org/apache/spark/internal/config/package.scala
##########
@@ -384,7 +384,7 @@ package object config {
"get the replication level of the block to the initial number")
.version("2.2.0")
.booleanConf
- .createWithDefault(false)
+ .createWithDefault(true)
Review comment:
In the past, I found this to be noisy in cases where replication was
enabled - but that was a while back, and I would like to better understand
the 'cost' of enabling this for non-trivial use cases on master:
keeping it disabled by default means only developers who specifically opt in
pay the price, not everyone.
It is quite common for an application to keep references to a persisted RDD
even after it is no longer used - and the loss of that RDD's blocks has little
to no functional impact.
This is similar to the loss of blocks for an unreplicated persisted RDD - we do
not proactively recompute the lost blocks, but do so on demand.
If the idea is to enable this on master, evaluate the impact over the next
6 months, and revisit at the end, I am fine with that: but the evaluation
would need to happen before this goes out - otherwise anyone using replicated
storage will also be hit with the cost of proactive replication, and will need
to disable it for their applications.
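To be concrete about the opt-out - assuming the config key for this entry is
`spark.storage.replication.proactive`, which is what the surrounding definition
suggests - affected applications would have to set something like:

```scala
import org.apache.spark.SparkConf

// Hypothetical opt-out if the default flips to true: explicitly turn proactive
// replication back off for an application that replicates cached blocks but
// does not want eager re-replication on executor loss.
val conf = new SparkConf()
  .set("spark.storage.replication.proactive", "false")
```

or equivalently pass `--conf spark.storage.replication.proactive=false` to
spark-submit.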