Github user yhuai commented on a diff in the pull request:
https://github.com/apache/spark/pull/8687#discussion_r39190845
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala
---
@@ -979,8 +976,9 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
hadoopConf.set("mapred.output.compression.type",
CompressionType.BLOCK.toString)
}
- // Use configured output committer if already set
- if (conf.getOutputCommitter == null) {
+ // Use configured output committer if already set and speculation is not enabled.
+ val speculationEnabled = self.conf.getBoolean("spark.speculation", false)
+ if (speculationEnabled || conf.getOutputCommitter == null) {
--- End diff --
OK, I see. I now think it will be hard to always change the output
committer when speculation is enabled, and hard to build a mechanism that
automatically rewrites the output committer setting.
How about this: if `mapred.output.committer.class` is set and speculation is
enabled, we check whether the class name contains `DirectOutputCommitter`. If
so, we log a warning saying that `DirectOutputCommitter` may cause data loss
when speculation is enabled (a case can be found in
https://issues.apache.org/jira/browse/SPARK-10063) and ask users to set a
different output committer.
I think a warning message is better than silently changing the setting,
which may break some use cases.
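
A minimal sketch of the check proposed above, not the actual patch. It assumes the committer class name is read from `mapred.output.committer.class`, and it uses a plain `Map` in place of the Spark and Hadoop configuration objects so it runs on its own. The names `DirectCommitterCheck`, `warningFor`, and `com.example.DirectOutputCommitter` are hypothetical.

```scala
// Sketch only: object, method, and example class names are hypothetical.
object DirectCommitterCheck {

  /** Returns a warning message when speculation is enabled and the configured
    * committer class name contains "DirectOutputCommitter"; otherwise None.
    */
  def warningFor(speculationEnabled: Boolean, committerClass: String): Option[String] =
    if (speculationEnabled && committerClass.contains("DirectOutputCommitter")) {
      Some(
        s"$committerClass may cause data loss when spark.speculation is enabled " +
          "(see SPARK-10063). Please configure a different output committer.")
    } else {
      None
    }

  def main(args: Array[String]): Unit = {
    // Stand-in for the real configuration; the keys mirror the ones discussed above.
    val settings = Map(
      "spark.speculation" -> "true",
      "mapred.output.committer.class" -> "com.example.DirectOutputCommitter")

    val speculationEnabled = settings.get("spark.speculation").exists(_.toBoolean)
    val committerClass = settings.getOrElse("mapred.output.committer.class", "")

    // Only warn; the user's committer setting is left untouched.
    warningFor(speculationEnabled, committerClass).foreach(msg => Console.err.println(msg))
  }
}
```

In `saveAsHadoopDataset` the two inputs would presumably come from `self.conf` and `hadoopConf`, and the message would go through `logWarning` instead of standard error.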