I wrote a quick benchmark program appended below; here are the results of
running it on
my laptop:
ram@ram-laptop:threads: time java Volatile 1
nThreads = 1
MAX_VALUE reached, exiting
real 0m13.834s
user 0m13.829s
sys 0m0.024s
ram@ram-laptop:threads: time java Volatile 2
nThreads = 2
MAX_VALUE reached, exiting
MAX_VALUE reached, exiting
real 1m5.072s
user 2m10.186s
sys 0m0.032s
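Back-of-the-envelope, those timings work out to roughly 6.4 ns per volatile increment with one thread and about 30 ns once two threads share the counter (the loop performs roughly Integer.MAX_VALUE increments in total, lost updates aside):

```java
// Per-increment cost derived from the timings above: ~2.15e9 increments
// total, taking 13.834 s wall-clock with one thread and 65.072 s with two
// threads sharing the volatile counter.
public class PerIncrementCost {
    public static void main(String[] argv) {
        double increments = Integer.MAX_VALUE;       // ~2.15e9
        double oneThreadNs = 13.834e9 / increments;  // ~6.4 ns/increment
        double twoThreadNs = 65.072e9 / increments;  // ~30 ns/increment
        System.out.format("1 thread:  %.1f ns/increment%n", oneThreadNs);
        System.out.format("2 threads: %.1f ns/increment%n", twoThreadNs);
    }
}
```

So contention between the two threads makes each increment roughly 5x more expensive, on top of the cost of the volatile write itself.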
------------------------------------------------------
// test performance impact of 2 threads sharing a volatile int
public class Volatile {
    static volatile int count;

    public static class MyThread extends Thread {
        @Override
        public void run() {
            for ( ; ; ) {
                int c = count;
                // stop at MAX_VALUE - 1 so neither thread overflows the
                // counter when both increment concurrently
                if (Integer.MAX_VALUE == c || Integer.MAX_VALUE == c + 1) {
                    System.out.println("MAX_VALUE reached, exiting");
                    return;
                }
                // note: ++count on a volatile is not atomic; lost updates
                // are acceptable for this benchmark
                ++count;
            }
        } // run
    } // MyThread

    public static void main( String[] argv ) throws Exception {
        final int nThreads = Integer.parseInt(argv[0]);
        System.out.format("nThreads = %d%n", nThreads);
        Thread[] threads = new Thread[nThreads];
        for (int i = 0; i < nThreads; ++i) {
            threads[i] = new MyThread();
            threads[i].start();
        }
        for (Thread t : threads) t.join();
    } // main
} // Volatile
--------------------------------------------------------
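For comparison, here's a single-threaded sketch along the same lines that contrasts a plain int, a volatile int, and an AtomicInteger (the class name and the smaller iteration count are illustrative, and the JIT may partly optimize the plain-int loop away, so treat any numbers as ballpark):

```java
// Sketch: single-threaded increment cost of plain vs volatile vs atomic.
// Illustrative only; not the program that produced the timings above.
import java.util.concurrent.atomic.AtomicInteger;

public class CounterCompare {
    static final int N = 10_000_000;  // much smaller bound than Integer.MAX_VALUE
    static int plainCount;
    static volatile int volatileCount;
    static final AtomicInteger atomicCount = new AtomicInteger();

    static long millis(Runnable r) {
        long start = System.nanoTime();
        r.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] argv) {
        long tPlain = millis(() -> { for (int i = 0; i < N; ++i) ++plainCount; });
        long tVolatile = millis(() -> { for (int i = 0; i < N; ++i) ++volatileCount; });
        long tAtomic = millis(() -> { for (int i = 0; i < N; ++i) atomicCount.incrementAndGet(); });
        System.out.format("plain: %d ms, volatile: %d ms, atomic: %d ms%n",
                          tPlain, tVolatile, tAtomic);
    }
}
```

I'd expect the volatile and atomic loops to come out roughly an order of magnitude slower than the plain loop, in line with the 0.9 s vs 17.7 s numbers Vlad reported below.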
Ram
On Mon, Sep 28, 2015 at 11:07 AM, Timothy Farkas <[email protected]>
wrote:
> Hi Vlad,
>
> Could you share your benchmarking applications? I'd like to test a change
> I made to the CircularBuffer:
>
>
> https://github.com/ilooner/Netlet/blob/condVarBuffer/src/main/java/com/datatorrent/netlet/util/CircularBuffer.java
>
> Thanks,
> Tim
>
> On Mon, Sep 28, 2015 at 9:56 AM, Pramod Immaneni <[email protected]>
> wrote:
>
> > Vlad, what was your mode of interaction/ordering between the two
> > threads for the 3rd test?
> >
> > On Mon, Sep 28, 2015 at 10:51 AM, Vlad Rozov <[email protected]>
> > wrote:
> >
> > > I created a simple test to check how quickly Java can count to
> > > Integer.MAX_VALUE. The result that I see is consistent with
> > > CONTAINER_LOCAL behavior:
> > >
> > > counting long in a single thread: 0.9 sec
> > > counting volatile long in a single thread: 17.7 sec
> > > counting volatile long shared between two threads: 186.3 sec
> > >
> > > I suggest that we look into
> > > https://qconsf.com/sf2012/dl/qcon-sanfran-2012/slides/MartinThompson_LockFreeAlgorithmsForUltimatePerformanceMOVEDTOBALLROOMA.pdf
> > > or a similar algorithm.
> > >
> > > Thank you,
> > >
> > > Vlad
> > >
> > >
> > >
> > > On 9/28/15 08:19, Vlad Rozov wrote:
> > >
> > >> Ram,
> > >>
> > >> The stream between operators in the CONTAINER_LOCAL case is
> > >> InlineStream. InlineStream extends DefaultReservoir, which extends
> > >> CircularBuffer. CircularBuffer does not use synchronized methods or
> > >> locks; it uses volatile. I guess that using volatile causes CPU cache
> > >> invalidation, and along with memory locality (in the thread local
> > >> case the tuple is always local to both threads, while in the
> > >> container local case the second operator thread may see data
> > >> significantly later after the first thread produced it) these two
> > >> factors negatively impact CONTAINER_LOCAL performance. It is still
> > >> quite surprising that the impact is so significant.
> > >>
> > >> Thank you,
> > >>
> > >> Vlad
> > >>
> > >> On 9/27/15 16:45, Munagala Ramanath wrote:
> > >>
> > >>> Vlad,
> > >>>
> > >>> That's a fascinating and counter-intuitive result. I wonder if some
> > >>> internal synchronization is happening (maybe the stream between them
> > >>> is a shared data structure that is lock protected) to slow down the
> > >>> 2 threads in the CONTAINER_LOCAL case. If they are both going as
> > >>> fast as possible, it is likely that they will be frequently blocked
> > >>> by the lock. If that is indeed the case, some sort of lock striping
> > >>> or a near-lockless protocol for stream access should tilt the
> > >>> balance in favor of CONTAINER_LOCAL.
> > >>>
> > >>> In the thread-local case of course there is no need for such locking.
> > >>>
> > >>> Ram
> > >>>
> > >>> On Sun, Sep 27, 2015 at 12:17 PM, Vlad Rozov <
> > >>> [email protected]> wrote:
> > >>>
> > >>> Changed subject to reflect shift of discussion.
> > >>>
> > >>> After I recompiled netlet and hardcoded 0 wait time in the
> > >>> CircularBuffer.put() method, I still see the same difference even
> > >>> when I increased operator memory to 10 GB and set "-D
> > >>> dt.application.*.operator.*.attr.SPIN_MILLIS=0 -D
> > >>> dt.application.*.operator.*.attr.QUEUE_CAPACITY=1024000". CPU %
> > >>> is close to 100% both for thread and container local locality
> > >>> settings. Note that in thread local two operators share 100% CPU,
> > >>> while in container local each gets its own 100% load. It sounds
> > >>> like container local will outperform thread local only when the
> > >>> number of emitted tuples is (relatively) low, for example when it
> > >>> is CPU costly to produce tuples (hash computations,
> > >>> compression/decompression, aggregations, filtering with complex
> > >>> expressions). In cases where an operator may emit 5 or more million
> > >>> tuples per second, thread local may outperform container local
> > >>> even when both operators are CPU intensive.
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> Thank you,
> > >>>
> > >>> Vlad
> > >>>
> > >>> On 9/26/15 22:52, Timothy Farkas wrote:
> > >>>
> > >>>> Hi Vlad,
> > >>>>
> > >>>> I just took a look at the CircularBuffer. Why are threads polling
> > >>>> the state of the buffer before doing operations? Couldn't polling
> > >>>> be avoided entirely by using something like Condition variables to
> > >>>> signal when the buffer is ready for an operation to be performed?
> > >>>>
> > >>>> Tim
> > >>>>
> > >>>> On Sat, Sep 26, 2015 at 10:42 PM, Vlad Rozov <
> > >>>> [email protected]> wrote:
> > >>>>
> > >>>>> After looking at a few stack traces, I think that in the benchmark
> > >>>>> application the operators compete for the circular buffer that
> > >>>>> passes slices from the emitter output to the consumer input, and
> > >>>>> the sleeps that avoid busy wait are too long for the benchmark
> > >>>>> operators. I don't see a stack similar to the one below every time
> > >>>>> I take a thread dump, but still often enough to suspect that sleep
> > >>>>> is the root cause. I'll recompile with a smaller sleep time and see
> > >>>>> how this affects performance.
> > >>>>>
> > >>>>> ----
> > >>>>> "1/wordGenerator:RandomWordInputModule" prio=10 tid=0x00007f78c8b8c000
> > >>>>> nid=0x780f waiting on condition [0x00007f78abb17000]
> > >>>>>    java.lang.Thread.State: TIMED_WAITING (sleeping)
> > >>>>>         at java.lang.Thread.sleep(Native Method)
> > >>>>>         at com.datatorrent.netlet.util.CircularBuffer.put(CircularBuffer.java:182)
> > >>>>>         at com.datatorrent.stram.stream.InlineStream.put(InlineStream.java:79)
> > >>>>>         at com.datatorrent.stram.stream.MuxStream.put(MuxStream.java:117)
> > >>>>>         at com.datatorrent.api.DefaultOutputPort.emit(DefaultOutputPort.java:48)
> > >>>>>         at com.datatorrent.benchmark.RandomWordInputModule.emitTuples(RandomWordInputModule.java:108)
> > >>>>>         at com.datatorrent.stram.engine.InputNode.run(InputNode.java:115)
> > >>>>>         at com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1377)
> > >>>>>
> > >>>>> "2/counter:WordCountOperator" prio=10 tid=0x00007f78c8c98800 nid=0x780d
> > >>>>> waiting on condition [0x00007f78abc18000]
> > >>>>>    java.lang.Thread.State: TIMED_WAITING (sleeping)
> > >>>>>         at java.lang.Thread.sleep(Native Method)
> > >>>>>         at com.datatorrent.stram.engine.GenericNode.run(GenericNode.java:519)
> > >>>>>         at com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1377)
> > >>>>>
> > >>>>> ----
> > >>>>>
> > >>>>>
> > >>>>> On 9/26/15 20:59, Amol Kekre wrote:
> > >>>>>
> > >>>>>> A good read -
> > >>>>>> http://preshing.com/20111118/locks-arent-slow-lock-contention-is/
> > >>>>>>
> > >>>>>> Though it does not explain the order-of-magnitude difference.
> > >>>>>>
> > >>>>>> Amol
> > >>>>>>
> > >>>>>>
> > >>>>>> On Sat, Sep 26, 2015 at 4:25 PM, Vlad Rozov <
> > >>>>>> [email protected]> wrote:
> > >>>>>>
> > >>>>>>> In the benchmark test THREAD_LOCAL outperforms CONTAINER_LOCAL by
> > >>>>>>> an order of magnitude and both operators compete for CPU. I'll
> > >>>>>>> take a closer look at why.
> > >>>>>>>
> > >>>>>>> Thank you,
> > >>>>>>>
> > >>>>>>> Vlad
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On 9/26/15 14:52, Thomas Weise wrote:
> > >>>>>>>
> > >>>>>>>> THREAD_LOCAL - operators share a thread
> > >>>>>>>> CONTAINER_LOCAL - each operator has its own thread
> > >>>>>>>>
> > >>>>>>>> So as long as the operators utilize the CPU sufficiently
> > >>>>>>>> (compete), the latter will perform better.
> > >>>>>>>>
> > >>>>>>>> There will be cases where a single thread can accommodate
> > >>>>>>>> multiple operators. For example, a socket reader (mostly
> > >>>>>>>> waiting for IO) and a decompress operator (CPU hungry) can
> > >>>>>>>> share a thread.
> > >>>>>>>>
> > >>>>>>>> But to get back to the original question, stream locality does
> > >>>>>>>> not generally reduce the total memory requirement. If you add
> > >>>>>>>> multiple operators into one container, that container will also
> > >>>>>>>> require more memory, and that's how the container size is
> > >>>>>>>> calculated in the physical plan. You may get some extra mileage
> > >>>>>>>> when multiple operators share the same heap, but the need to
> > >>>>>>>> identify the memory requirement per operator does not go away.
> > >>>>>>>>
> > >>>>>>>> Thomas
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>> On Sat, Sep 26, 2015 at 12:41 PM, Munagala Ramanath <
> > >>>>>>> [email protected]> wrote:
> > >>>>>>>>
> > >>>>>>>> Would CONTAINER_LOCAL achieve the same thing and perform a
> > >>>>>>>> little better on a multi-core box?
> > >>>>>>>>
> > >>>>>>>> Ram
> > >>>>>>>>>
> > >>>>>>>> On Sat, Sep 26, 2015 at 12:18 PM, Sandeep Deshmukh <
> > >>>>>>>> [email protected]> wrote:
> > >>>>>>>>>
> > >>>>>>>>> Yes, with this approach only two containers are required: one
> > >>>>>>>>> for stram and another for all operators. You can easily fit
> > >>>>>>>>> around 10 operators in less than 1GB.
> > >>>>>>>>>
> > >>>>>>>>> On 27 Sep 2015 00:32, "Timothy Farkas" <[email protected]>
> > >>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> Hi Ram,
> > >>>>>>>>>>
> > >>>>>>>>>> You could make all the operators thread local. This cuts down
> > >>>>>>>>>> on the overhead of separate containers and maximizes the
> > >>>>>>>>>> memory available to each operator.
> > >>>>>>>>>>
> > >>>>>>>>>> Tim
> > >>>>>>>>>>>
> > >>>>>>>>>> On Sat, Sep 26, 2015 at 10:07 AM, Munagala Ramanath <
> > >>>>>>>>>> [email protected]> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Hi,
> > >>>>>>>>>>>
> > >>>>>>>>>>> I was running into memory issues when deploying my app on the
> > >>>>>>>>>>> sandbox, where all the operators were stuck forever in the
> > >>>>>>>>>>> PENDING state because they were being continually aborted and
> > >>>>>>>>>>> restarted due to the limited memory on the sandbox. After some
> > >>>>>>>>>>> experimentation, I found that the following config values seem
> > >>>>>>>>>>> to work:
> > >>>>>>>>>>> ------------------------------------------
> > >>>>>>>>>>> <property>
> > >>>>>>>>>>>   <name>dt.attr.MASTER_MEMORY_MB</name>
> > >>>>>>>>>>>   <value>500</value>
> > >>>>>>>>>>> </property>
> > >>>>>>>>>>> <property>
> > >>>>>>>>>>>   <name>dt.application.*.operator.*.attr.MEMORY_MB</name>
> > >>>>>>>>>>>   <value>200</value>
> > >>>>>>>>>>> </property>
> > >>>>>>>>>>> <property>
> > >>>>>>>>>>>   <name>dt.application.TopNWordsWithQueries.operator.fileWordCount.attr.MEMORY_MB</name>
> > >>>>>>>>>>>   <value>512</value>
> > >>>>>>>>>>> </property>
> > >>>>>>>>>>> ------------------------------------------------
> > >>>>>>>>>>> Are these reasonable values? Is there a more systematic way of
> > >>>>>>>>>>> coming up with these values than trial-and-error? Most of my
> > >>>>>>>>>>> operators -- with the exception of fileWordCount -- need very
> > >>>>>>>>>>> little memory; is there a way to cut all values down to the
> > >>>>>>>>>>> bare minimum and maximize available memory for this one
> > >>>>>>>>>>> operator?
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Ram
> > >>>>>>>>>>>>