[ https://issues.apache.org/jira/browse/SPARK-603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14345204#comment-14345204 ]

Imran Rashid commented on SPARK-603:
------------------------------------

Hi [~srowen]

I don't think anyone is actively working on this, and probably no one will for a 
while -- I suppose that means it should be closed for now.

I disagree that it's easy to do this with accumulators.  It's certainly possible, 
but it makes it quite complicated to do something that is a very common use case 
and should be dead simple.  (Or at least, it's harder than most people realize to 
use accumulators to do this *correctly*.)   I guess it will be confusing to 
have counters & accumulators in the API, but it might only serve to highlight 
some of the intricacies of the accumulator API which aren't obvious (and can't 
be fixed w/out breaking changes).
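For example, here's roughly what it takes to get a trustworthy count out of the 
snippet in the description today (just a sketch, reusing the isOK / filterBogus 
names from below): you have to cache and force an action before reading the 
value, and even then retried tasks can inflate the count:

{code}
def filterBogus(rdd: RDD[MyCustomClass], sc: SparkContext): RDD[MyCustomClass] = {
  val filterCount = sc.accumulator(0)
  val filtered = rdd.filter { r =>
    if (isOK(r)) true else { filterCount += 1; false }
  }
  // cache + force an action so the accumulator has actually been updated
  // before we read it; without this the println below always reports 0
  filtered.cache()
  filtered.count()
  println("removed " + filterCount.value + " records")
  filtered
}
{code}

Running an extra count() (and caching so the caller's later actions don't 
re-increment the accumulator) just to read a counter is exactly the kind of 
ceremony that makes this harder than it should be.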

> add simple Counter API
> ----------------------
>
>                 Key: SPARK-603
>                 URL: https://issues.apache.org/jira/browse/SPARK-603
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: Imran Rashid
>            Priority: Minor
>
> Users need a very simple way to create counters in their jobs.  Accumulators 
> provide a way to do this, but are a little clunky, for two reasons:
> 1) the setup is a nuisance
> 2) w/ delayed evaluation, you don't know when it will actually run, so it's 
> hard to look at the values
> consider this code:
> {code}
> def filterBogus(rdd:RDD[MyCustomClass], sc: SparkContext) = {
>   val filterCount = sc.accumulator(0)
>   val filtered = rdd.filter{r =>
>     if (isOK(r)) true else {filterCount += 1; false}
>   }
>   println("removed " + filterCount.value + " records)
>   filtered
> }
> {code}
> The println will always say 0 records were filtered, because it's printed 
> before anything has actually run.  I could print out the value later on, but 
> note that it would destroy the modularity of the method -- kinda ugly to 
> return the accumulator just so that it can get printed later on.  (and of 
> course, the caller in turn might not know when the filter is going to get 
> applied, and would have to pass the accumulator up even further ...)
> I'd like to have Counters which just automatically get printed out whenever a 
> stage has been run, and also with some api to get them back.  I realize this 
> is tricky b/c a stage can get re-computed, so maybe you should only increment 
> the counters once.
> Maybe a more general way to do this is to provide some callback for whenever 
> an RDD is computed -- by default, you would just print the counters, but the 
> user could replace it w/ a custom handler.
