Guys, I fixed this by adding j...@apache.org to the mailing list; no more moderation is required.
Cheers,
Chris

-----Original Message-----
From: "ASF GitHub Bot (JIRA)" <j...@apache.org>
Reply-To: "dev@spark.apache.org" <dev@spark.apache.org>
Date: Saturday, March 29, 2014 10:14 AM
To: "dev@spark.apache.org" <dev@spark.apache.org>
Subject: [jira] [Commented] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates

> [ https://issues.apache.org/jira/browse/SPARK-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954345#comment-13954345 ]
>
> ASF GitHub Bot commented on SPARK-732:
> --------------------------------------
>
> Github user AmplabJenkins commented on the pull request:
>
>     https://github.com/apache/spark/pull/228#issuecomment-39002053
>
>     Merged build finished. Build is starting -or- tests failed to complete.
>
>
>> Recomputation of RDDs may result in duplicated accumulator updates
>> -------------------------------------------------------------------
>>
>>                 Key: SPARK-732
>>                 URL: https://issues.apache.org/jira/browse/SPARK-732
>>             Project: Apache Spark
>>          Issue Type: Bug
>>    Affects Versions: 0.6.2, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.8.0, 0.8.1, 0.8.2, 0.9.0
>>            Reporter: Josh Rosen
>>            Assignee: Nan Zhu
>>             Fix For: 1.0.0
>>
>>
>> Currently, Spark doesn't guard against duplicated updates to the same accumulator due to recomputations of an RDD. For example:
>> {code}
>> val acc = sc.accumulator(0)
>> val mapped = data.map { x => acc += 1; f(x) }
>> mapped.count()
>> // acc should equal data.count() here
>> mapped.foreach { ... }
>> // Now acc = 2 * data.count() because the map() was recomputed.
>> {code}
>> I think that this behavior is incorrect, especially because it allows the addition or removal of a cache() call to affect the outcome of a computation.
>> There's an old TODO to fix this duplicate-update issue in the [DAGScheduler code|https://github.com/mesos/spark/blob/ec5e553b418be43aa3f0ccc24e0d5ca9d63504b2/core/src/main/scala/spark/scheduler/DAGScheduler.scala#L494].
>> I haven't tested whether recomputation due to blocks being dropped from the cache can trigger duplicate accumulator updates.
>> Hypothetically, someone could be relying on the current behavior to implement performance counters that track the actual number of computations performed (including recomputations). To be safe, we could add an explicit warning in the release notes documenting the change in behavior when we fix this.
>> Ignoring duplicate updates shouldn't be too hard, but there are a few subtleties. Currently, we allow accumulators to be used in multiple transformations, so we'd need to detect duplicate updates at the per-transformation level. I haven't dug too deeply into the scheduler internals, but we might also run into problems where pipelining causes what is logically one set of accumulator updates to show up in two different tasks (e.g. rdd.map(accum += x; ...) and rdd.map(accum += x; ...).count() may apply what is logically the same accumulator update from two different contexts, complicating the detection of duplicate updates).
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.2#6252)
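For anyone following along, here is a minimal, self-contained sketch of the behavior Josh describes, runnable against a local master. The app name and the use of parallelize(1 to 100) are mine for illustration; only the accumulator-in-a-map pattern comes from the issue.

{code}
import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorDoubleCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("accum-recompute-demo")
    val sc = new SparkContext(conf)

    val acc = sc.accumulator(0)
    // The accumulator update runs once per evaluation of this map stage.
    val mapped = sc.parallelize(1 to 100).map { x => acc += 1; x * 2 }

    mapped.count()
    println(acc.value)  // 100, as expected after one evaluation

    mapped.count()      // no cache(), so the map stage is recomputed
    println(acc.value)  // 200: every update was applied twice

    sc.stop()
  }
}
{code}

Inserting mapped.cache() before the first action keeps the final value at 100, which is exactly the cache()-sensitivity the issue calls out as troubling.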
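On the deduplication subtlety at the end of the issue: one way to picture per-task deduplication on the driver is to key each update by an (accumulator, stage, partition) triple and drop repeats. This is purely illustrative bookkeeping under my own names (DedupSketch, applyUpdate); it is not Spark's scheduler code or API.

{code}
import scala.collection.mutable

object DedupSketch {
  // Remember which (accumulatorId, stageId, partitionId) updates have
  // already been merged; a repeat of the same key means a recomputed task.
  private val applied = mutable.Set.empty[(Long, Int, Int)]

  def applyUpdate(accId: Long, stageId: Int, partitionId: Int)(merge: => Unit): Unit = {
    if (applied.add((accId, stageId, partitionId))) {
      merge  // first completion of this task: apply its update
    }
    // else: a duplicate from a recomputation, silently dropped
  }
}
{code}

Note that keying by stage would not catch the pipelining case Josh flags, where the same logical map is pipelined into two different stages, so what is logically one update legitimately arrives from two different tasks.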