[
https://issues.apache.org/jira/browse/SPARK-603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736126#comment-14736126
]
koert kuipers commented on SPARK-603:
-------------------------------------
we use counters a lot in scalding (mostly to verify record counts at different
stages, for certain criteria).
i do not think it is easy at all to recreate counters with accumulators. in
fact with the current behavior of accumulators (they do not account for task
failure, leading to double counting) i think it's nearly impossible to implement
counters.
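
To make the double-counting concern concrete, here is a minimal plain-Scala sketch. It does not use Spark and is not how Spark's accumulators are implemented; it only simulates a task being attempted twice. All names (RetryDoubleCount, NaiveCounter, PerTaskCounter) are hypothetical. The naive counter merges every attempt's increments, while the per-task variant keeps one value per task id so a retry overwrites instead of adding.

```scala
// Plain-Scala simulation of task retries; no Spark dependency.
// All names here are hypothetical and chosen only for illustration.
object RetryDoubleCount {
  // Naive counter: increments from every attempt are added to the total.
  final class NaiveCounter {
    private var total = 0L
    def add(n: Long): Unit = total += n
    def value: Long = total
  }

  // Retry-aware counter: one slot per task id, a later attempt overwrites.
  final class PerTaskCounter {
    private val byTask = scala.collection.mutable.Map.empty[Int, Long]
    def report(taskId: Int, n: Long): Unit = byTask(taskId) = n
    def value: Long = byTask.values.sum
  }

  // The "criterion" being counted: negative records.
  def countBad(records: Seq[Int]): Long = records.count(_ < 0).toLong

  // Returns (naive total, per-task total).
  def run(): (Long, Long) = {
    val partitions = Seq(Seq(1, -1, 2), Seq(-3, -4, 5))
    // Task 0 runs once; task 1 runs twice, as if its first attempt
    // failed after its increments had already been recorded.
    val attempts = Seq(0, 1, 1)
    val naive = new NaiveCounter
    val perTask = new PerTaskCounter
    for (taskId <- attempts) {
      val n = countBad(partitions(taskId))
      naive.add(n)
      perTask.report(taskId, n)
    }
    (naive.value, perTask.value)
  }

  def main(args: Array[String]): Unit = {
    val (naive, perTask) = run()
    println(s"naive=$naive perTask=$perTask")
  }
}
```

The data contains three bad records. The naive total comes out higher because the retried task is counted twice; the per-task total matches the real count. A counter API would need something like the second behavior to be trustworthy.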
> add simple Counter API
> ----------------------
>
> Key: SPARK-603
> URL: https://issues.apache.org/jira/browse/SPARK-603
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Reporter: Imran Rashid
> Priority: Minor
>
> Users need a very simple way to create counters in their jobs. Accumulators
> provide a way to do this, but are a little clunky, for two reasons:
> 1) the setup is a nuisance
> 2) w/ delayed evaluation, you don't know when it will actually run, so it's
> hard to look at the values
> Consider this code:
> {code}
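> import org.apache.spark.SparkContext
> import org.apache.spark.rdd.RDD
>
> def filterBogus(rdd: RDD[MyCustomClass], sc: SparkContext) = {
>   val filterCount = sc.accumulator(0)
>   // filter is lazy: the accumulator is only updated once an action runs
>   val filtered = rdd.filter { r =>
>     if (isOK(r)) true else { filterCount += 1; false }
>   }
>   println("removed " + filterCount.value + " records")
>   filtered
> }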
> {code}
> The println will always say 0 records were filtered, because it's printed
> before anything has actually run. I could print out the value later on, but
> note that it would destroy the modularity of the method -- kinda ugly to
> return the accumulator just so that it can get printed later on. (and of
> course, the caller in turn might not know when the filter is going to get
> applied, and would have to pass the accumulator up even further ...)
> I'd like to have Counters which just automatically get printed out whenever a
> stage has been run, and also with some API to get them back. I realize this
> is tricky b/c a stage can get re-computed, so maybe you should only increment
> the counters once.
> Maybe a more general way to do this is to provide some callback for whenever
> an RDD is computed -- by default, you would just print the counters, but the
> user could replace w/ a custom handler.
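
The lazy-evaluation problem and the callback idea from the description above can be sketched without Spark, using a plain Scala Iterator as a stand-in for an RDD. This is only an illustration under that assumption: Notifying, filterBogus's report parameter, and demo are hypothetical names, and nothing here is an existing Spark API. The count is reported by a handler that fires once the data has actually been consumed, so the method does not need to return the counter.

```scala
// Plain-Scala sketch; an Iterator stands in for a lazily evaluated RDD.
// All names are hypothetical and not part of any Spark API.
object LazyCounterSketch {
  // Wraps an iterator and runs onDone exactly once, when it is exhausted.
  final class Notifying[A](underlying: Iterator[A], onDone: () => Unit)
      extends Iterator[A] {
    private var fired = false
    def hasNext: Boolean = {
      val more = underlying.hasNext
      if (!more && !fired) { fired = true; onDone() }
      more
    }
    def next(): A = underlying.next()
  }

  // Drops negative records; the removed count is handed to `report`
  // only after the filtered data has been fully consumed.
  def filterBogus(records: Iterator[Int], report: Long => Unit): Iterator[Int] = {
    var removed = 0L
    val filtered = records.filter { r =>
      if (r >= 0) true else { removed += 1; false }
    }
    new Notifying(filtered, () => report(removed))
  }

  // Returns (value seen before evaluation, kept records, reported count).
  def demo(): (Long, List[Int], Long) = {
    var reported = -1L                 // -1 means "handler not called yet"
    val out = filterBogus(Iterator(1, -1, 2, -3), n => reported = n)
    val before = reported              // nothing has run at this point
    val kept = out.toList              // forces evaluation, fires the handler
    (before, kept, reported)
  }

  def main(args: Array[String]): Unit = {
    val (before, kept, after) = demo()
    println(s"before=$before kept=$kept after=$after")
  }
}
```

The default handler could simply print the count, and a caller could pass a different one. This sketch ignores recomputation; a real version would also have to decide what happens when the same data is evaluated more than once.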