[ 
https://issues.apache.org/jira/browse/SPARK-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214038#comment-14214038
 ] 

Imran Rashid commented on SPARK-664:
------------------------------------

Hi [~aash],

thanks for taking another look at this -- sorry I have been quiet for a little 
while.  I didn't know about SPARK-2380; obviously this issue was created long 
before that.  Honestly, I'm not a big fan of SPARK-2380 -- it seems to really 
limit what we can do with accumulators.  We could use them to expose a 
completely different model of computation.

Let me give an example use case.  Accumulators are in principle general enough 
that they let you compute lots of different things in one pass.  E.g., by using 
accumulators, you could:

* create a bloom filter of records that meet some criteria
* assign records to different buckets, and count how many are in each bucket, 
even up to 100K buckets (e.g., by having an accumulator of {{Array[Long]}})
* use hyperloglog to count how many distinct ids you have
* filter down to only those records with some parsing error, for a closer look 
(just by using plain old {{rdd.filter()}})
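
To make that concrete, here is a toy sketch in plain Scala (no Spark 
dependency; {{BucketCounts}}, the bucketing rule, and the "error" predicate are 
all made up for illustration) of the add/merge contract an accumulator follows, 
computing bucket counts and a filter in a single pass over the data:

```scala
// Illustrative only: BucketCounts mimics an accumulator's add/merge contract.
case class BucketCounts(counts: Array[Long]) {
  def add(bucket: Int): BucketCounts = { counts(bucket) += 1; this }
  // merge is what the driver (or, per this issue, the executor) would call
  // to combine partial results.
  def merge(other: BucketCounts): BucketCounts = {
    for (i <- counts.indices) counts(i) += other.counts(i)
    this
  }
}

val numBuckets = 4
val records = (1 to 1000).toList

// One pass: bucket assignment + counting, alongside a plain filter.
val acc = records.foldLeft(BucketCounts(Array.fill(numBuckets)(0L))) {
  (a, r) => a.add(r % numBuckets)
}
val suspicious = records.filter(_ % 97 == 0)  // stand-in for "parse errors"
```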

You could do all that in one pass, if the first 3 were done w/ accumulators.  
When I started using spark, I actually wrote a bunch of code to do exactly that 
kind of thing.  But it performed really poorly -- after some profiling & 
investigating how accumulators work, I saw why.  Those big accumulators I was 
creating just put a lot of work on the driver.  Accumulators provide the right 
API to do that kind of thing, but the implementation would have to change.

I definitely agree that if the results get merged on the executor before 
getting sent to the driver, it increases the latency of the *per-task* 
results, but does that matter?  I would prefer that we have something that 
supports the more general computation model; the important thing is only 
the latency of the *overall* result.  It feels like we're moving toward 
accumulators being treated just like counters (but with an awkward api).
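
For what it's worth, here is a toy sketch of the tradeoff this issue proposes 
(plain Scala, with made-up update values and a made-up 3-tasks-per-executor 
split): local merging means the driver receives and merges one pre-combined 
update per executor rather than one per task, at the same overall result:

```scala
// Six per-task accumulator updates (illustrative partial counts).
val taskUpdates: Seq[Long] = Seq(1L, 2L, 3L, 4L, 5L, 6L)

// Current behavior: the driver receives and merges all six updates itself.
val driverMergesAll = taskUpdates.sum

// Proposed behavior: each executor (here, 3 tasks per executor) merges its
// tasks' updates locally, so the driver merges one update per executor.
val perExecutor = taskUpdates.grouped(3).map(_.sum).toSeq
val driverMergesFew = perExecutor.sum

// Same overall result, but the driver receives 2 updates instead of 6
// and does a third of the merge work.
```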

> Accumulator updates should get locally merged before sent to the driver
> -----------------------------------------------------------------------
>
>                 Key: SPARK-664
>                 URL: https://issues.apache.org/jira/browse/SPARK-664
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Imran Rashid
>            Priority: Minor
>
> Whenever a task finishes, the accumulator updates from that task are 
> immediately sent back to the driver.  When the accumulator updates are big, 
> this is inefficient because (a) a lot more data has to be sent to the driver 
> and (b) the driver has to do all the work of merging the updates together.
> Probably doesn't matter for small accumulators / low number of tasks, but if 
> both are big, this could be a big bottleneck.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
