Accumulators / Accumulables : thread-local, task-local, executor-local ?

Guillaume Pitel Thu, 18 Jun 2015 04:37:48 -0700

Hi,

I'm trying to figure out the smartest way to implement a globalcount-min-sketch on accumulators. For now, we are doing that with RDDs.It works well, but with one sketch per partition, merging takes too long.

As you probably know, a count-min sketch is a big mutable array of arrayof ints. To distribute it, all sketches must have the same size. Sinceit can be big, and since merging is not free, I would like to minimizethe number of sketches and maximize reuse and conccurent use of thesketches. Ideally, I would like to just have one sketch per worker.

I think accumulables might be the right structures for that, but itseems that they are not shared between executors, or even between tasks.

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Accumulators.scala(289)

/**

* This thread-local map holds per-task copies of accumulators; it isused to collect the set

* of accumulator updates to send back to the driver when taskscomplete. After tasks complete,


        * this map is cleared by `Accumulators.clear()` (see Executor.scala).

        */

private val localAccums = new ThreadLocal[Map[Long, Accumulable[_,_]]]() {


        override protected def initialValue() = Map[Long, Accumulable[_, _]]()

        }

The localAccums stores an accumulator for each task (it's thread local,so I assume each task have a unique thread on executors)

If I understand correctly, each time a task starts, an accumulator isinitialized locally, updated, then sent back to the driver for merging ?


So I guess, accumulators may not be the way to go, finally.

Any advice ?
Guillaume
--
eXenSa

        
*Guillaume PITEL, Président*
+33(0)626 222 431

eXenSa S.A.S. <http://www.exensa.com/>
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)184 163 677 / Fax +33(0)972 283 705

Accumulators / Accumulables : thread-local, task-local, executor-local ?

Reply via email to