I wrote a quick benchmark program appended below; here are the results of
running it on
my laptop:
ram@ram-laptop:threads: time java Volatile 1
nThreads = 1
MAX_VALUE reached, exiting
real 0m13.834s
user 0m13.829s
sys 0m0.024s
ram@ram-laptop:threads: time java Volatile 2
nThreads = 2
MAX_VALUE reached, exiting
MAX_VALUE reached, exiting
real 1m5.072s
user 2m10.186s
sys 0m0.032s
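Back-of-the-envelope, those timings work out to roughly 6.4 ns per volatile increment with one thread and about 30 ns once two threads share the counter (the loop performs roughly Integer.MAX_VALUE increments in total, lost updates aside):

```java
// Per-increment cost derived from the timings above: ~2.15e9 increments
// total, taking 13.834 s wall-clock with one thread and 65.072 s with two
// threads sharing the volatile counter.
public class PerIncrementCost {
    public static void main(String[] argv) {
        double increments = Integer.MAX_VALUE;       // ~2.15e9
        double oneThreadNs = 13.834e9 / increments;  // ~6.4 ns/increment
        double twoThreadNs = 65.072e9 / increments;  // ~30 ns/increment
        System.out.format("1 thread:  %.1f ns/increment%n", oneThreadNs);
        System.out.format("2 threads: %.1f ns/increment%n", twoThreadNs);
    }
}
```

So contention between the two threads makes each increment roughly 5x more expensive, on top of the cost of the volatile write itself.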
------------------------------------------------------
// test performance impact of 2 threads sharing a volatile int
public class Volatile {
    static volatile int count;

    public static class MyThread extends Thread {
        @Override
        public void run() {
            for ( ; ; ) {
                int c = count;
                // stop at MAX_VALUE - 1 so neither thread overflows the
                // counter when both increment concurrently
                if (Integer.MAX_VALUE == c || Integer.MAX_VALUE == c + 1) {
                    System.out.println("MAX_VALUE reached, exiting");
                    return;
                }
                // note: ++count on a volatile is not atomic; lost updates
                // are acceptable for this benchmark
                ++count;
            }
        } // run
    } // MyThread

    public static void main( String[] argv ) throws Exception {
        final int nThreads = Integer.parseInt(argv[0]);
        System.out.format("nThreads = %d%n", nThreads);
        Thread[] threads = new Thread[nThreads];
        for (int i = 0; i < nThreads; ++i) {
            threads[i] = new MyThread();
            threads[i].start();
        }
        for (Thread t : threads) t.join();
    } // main
} // Volatile
--------------------------------------------------------
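For comparison, here's a single-threaded sketch along the same lines that contrasts a plain int, a volatile int, and an AtomicInteger (the class name and the smaller iteration count are illustrative, and the JIT may partly optimize the plain-int loop away, so treat any numbers as ballpark):

```java
// Sketch: single-threaded increment cost of plain vs volatile vs atomic.
// Illustrative only; not the program that produced the timings above.
import java.util.concurrent.atomic.AtomicInteger;

public class CounterCompare {
    static final int N = 10_000_000;  // much smaller bound than Integer.MAX_VALUE
    static int plainCount;
    static volatile int volatileCount;
    static final AtomicInteger atomicCount = new AtomicInteger();

    static long millis(Runnable r) {
        long start = System.nanoTime();
        r.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] argv) {
        long tPlain = millis(() -> { for (int i = 0; i < N; ++i) ++plainCount; });
        long tVolatile = millis(() -> { for (int i = 0; i < N; ++i) ++volatileCount; });
        long tAtomic = millis(() -> { for (int i = 0; i < N; ++i) atomicCount.incrementAndGet(); });
        System.out.format("plain: %d ms, volatile: %d ms, atomic: %d ms%n",
                          tPlain, tVolatile, tAtomic);
    }
}
```

I'd expect the volatile and atomic loops to come out roughly an order of magnitude slower than the plain loop, in line with the 0.9 s vs 17.7 s numbers Vlad reported below.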
Ram
On Mon, Sep 28, 2015 at 11:07 AM, Timothy Farkas <[email protected]>
wrote:
> Hi Vlad,
>
> Could you share your benchmarking applications? I'd like to test a change
> I made to the CircularBuffer:
>
>
> https://github.com/ilooner/Netlet/blob/condVarBuffer/src/main/java/com/datatorrent/netlet/util/CircularBuffer.java
>
> Thanks,
> Tim
>
> On Mon, Sep 28, 2015 at 9:56 AM, Pramod Immaneni <[email protected]>
> wrote:
>
> > Vlad, what was your mode of interaction/ordering between the two
> > threads for the 3rd test?
> >
> > On Mon, Sep 28, 2015 at 10:51 AM, Vlad Rozov <[email protected]>
> > wrote:
> >
> > > I created a simple test to check how quickly Java can count to
> > > Integer.MAX_VALUE. The result that I see is consistent with
> > > CONTAINER_LOCAL behavior:
> > >
> > > counting long in a single thread: 0.9 sec
> > > counting volatile long in a single thread: 17.7 sec
> > > counting volatile long shared between two threads: 186.3 sec
> > >
> > > I suggest that we look into
> > > https://qconsf.com/sf2012/dl/qcon-sanfran-2012/slides/MartinThompson_LockFreeAlgorithmsForUltimatePerformanceMOVEDTOBALLROOMA.pdf
> > > or a similar algorithm.
> > >
> > > Thank you,
> > >
> > > Vlad
> > >
> > >
> > >
> > > On 9/28/15 08:19, Vlad Rozov wrote:
> > >
> > >> Ram,
> > >>
> > >> The stream between operators in the CONTAINER_LOCAL case is
> > >> InlineStream. InlineStream extends DefaultReservoir, which extends
> > >> CircularBuffer. CircularBuffer does not use synchronized methods or
> > >> locks; it uses volatile. I guess that using volatile causes CPU cache
> > >> invalidation, and along with memory locality (in the thread local
> > >> case the tuple is always local to both threads, while in the
> > >> container local case the second operator thread may see data
> > >> significantly later after the first thread produced it) these two
> > >> factors negatively impact CONTAINER_LOCAL performance. It is still
> > >> quite surprising that the impact is so significant.
> > >>
> > >> Thank you,
> > >>
> > >> Vlad
> > >>
> > >> On 9/27/15 16:45, Munagala Ramanath wrote:
> > >>
> > >>> Vlad,
> > >>>
> > >>> That's a fascinating and counter-intuitive result. I wonder if some
> > >>> internal synchronization is happening (maybe the stream between them
> > >>> is a shared data structure that is lock protected) to slow down the
> > >>> 2 threads in the CONTAINER_LOCAL case. If they are both going as
> > >>> fast as possible, it is likely that they will be frequently blocked
> > >>> by the lock. If that is indeed the case, some sort of lock striping
> > >>> or a near-lockless protocol for stream access should tilt the
> > >>> balance in favor of CONTAINER_LOCAL.
> > >>>
> > >>> In the thread-local case of course there is no need for such locking.
> > >>>
> > >>> Ram
> > >>>
> > >>> On Sun, Sep 27, 2015 at 12:17 PM, Vlad Rozov <
> > >>> [email protected]> wrote:
> > >>>
> > >>> Changed subject to reflect shift of discussion.
> > >>>
> > >>> After I recompiled netlet and hardcoded 0 wait time in the
> > >>> CircularBuffer.put() method, I still see the same difference even
> > >>> when I increased operator memory to 10 GB and set "-D
> > >>> dt.application.*.operator.*.attr.SPIN_MILLIS=0 -D
> > >>> dt.application.*.operator.*.attr.QUEUE_CAPACITY=1024000". CPU %
> > >>> is close to 100% both for thread and container local locality
> > >>> settings. Note that in thread local two operators share 100% CPU,
> > >>> while in container local each gets its own 100% load. It sounds
> > >>> like container local will outperform thread local only when the
> > >>> number of emitted tuples is (relatively) low, for example when it
> > >>> is CPU costly to produce tuples (hash computations,
> > >>> compression/decompression, aggregations, filtering with complex
> > >>> expressions). In cases where an operator may emit 5 or more million
> > >>> tuples per second, thread local may outperform container local
> > >>> even when both operators are CPU intensive.
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> Thank you,
> > >>>
> > >>> Vlad
> > >>>
> > >>> On 9/26/15 22:52, Timothy Farkas wrote:
> > >>>
> > >>>> Hi Vlad,
> > >>>>
> > >>>> I just took a look at the CircularBuffer. Why are threads polling
> > >>>> the state of the buffer before doing operations? Couldn't polling
> > >>>> be avoided entirely by using something like Condition variables to
> > >>>> signal when the buffer is ready for an operation to be performed?
> > >>>>
> > >>>> Tim
> > >>>>
> > >>>> On Sat, Sep 26, 2015 at 10:42 PM, Vlad Rozov <
> > >>>> [email protected]> wrote:
> > >>>>
> > >>>>> After looking at a few stack traces, I think that in the benchmark
> > >>>>> application the operators compete for the circular buffer that
> > >>>>> passes slices from the emitter output to the consumer input, and
> > >>>>> the sleeps that avoid busy wait are too long for the benchmark
> > >>>>> operators. I don't see a stack similar to the one below every time
> > >>>>> I take a thread dump, but still often enough to suspect that sleep
> > >>>>> is the root cause. I'll recompile with a smaller sleep time and see
> > >>>>> how this affects performance.
> > >>>>>
> > >>>>> ----
> > >>>>> "1/wordGenerator:RandomWordInputModule" prio=10 tid=0x00007f78c8b8c000
> > >>>>> nid=0x780f waiting on condition [0x00007f78abb17000]
> > >>>>>    java.lang.Thread.State: TIMED_WAITING (sleeping)
> > >>>>>         at java.lang.Thread.sleep(Native Method)
> > >>>>>         at com.datatorrent.netlet.util.CircularBuffer.put(CircularBuffer.java:182)
> > >>>>>         at com.datatorrent.stram.stream.InlineStream.put(InlineStream.java:79)
> > >>>>>         at com.datatorrent.stram.stream.MuxStream.put(MuxStream.java:117)
> > >>>>>         at com.datatorrent.api.DefaultOutputPort.emit(DefaultOutputPort.java:48)
> > >>>>>         at com.datatorrent.benchmark.RandomWordInputModule.emitTuples(RandomWordInputModule.java:108)
> > >>>>>         at com.datatorrent.stram.engine.InputNode.run(InputNode.java:115)
> > >>>>>         at com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1377)
> > >>>>>
> > >>>>> "2/counter:WordCountOperator" prio=10 tid=0x00007f78c8c98800 nid=0x780d
> > >>>>> waiting on condition [0x00007f78abc18000]
> > >>>>>    java.lang.Thread.State: TIMED_WAITING (sleeping)
> > >>>>>         at java.lang.Thread.sleep(Native Method)
> > >>>>>         at com.datatorrent.stram.engine.GenericNode.run(GenericNode.java:519)
> > >>>>>         at com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1377)
> > >>>>>
> > >>>>> ----
> > >>>>>
> > >>>>>
> > >>>>> On 9/26/15 20:59, Amol Kekre wrote:
> > >>>>>
> > >>>>>> A good read -
> > >>>>>> http://preshing.com/20111118/locks-arent-slow-lock-contention-is/
> > >>>>>>
> > >>>>>> Though it does not explain the order-of-magnitude difference.
> > >>>>>>
> > >>>>>> Amol
> > >>>>>>
> > >>>>>>
> > >>>>>> On Sat, Sep 26, 2015 at 4:25 PM, Vlad Rozov <
> > >>>>>> [email protected]> wrote:
> > >>>>>>
> > >>>>>>> In the benchmark test THREAD_LOCAL outperforms CONTAINER_LOCAL by
> > >>>>>>> an order of magnitude and both operators compete for CPU. I'll
> > >>>>>>> take a closer look at why.
> > >>>>>>>
> > >>>>>>> Thank you,
> > >>>>>>>
> > >>>>>>> Vlad
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On 9/26/15 14:52, Thomas Weise wrote:
> > >>>>>>>
> > >>>>>>>> THREAD_LOCAL - operators share a thread
> > >>>>>>>> CONTAINER_LOCAL - each operator has its own thread
> > >>>>>>>>
> > >>>>>>>> So as long as the operators utilize the CPU sufficiently
> > >>>>>>>> (compete), the latter will perform better.
> > >>>>>>>>
> > >>>>>>>> There will be cases where a single thread can accommodate
> > >>>>>>>> multiple operators. For example, a socket reader (mostly
> > >>>>>>>> waiting for IO) and a decompress operator (CPU hungry) can
> > >>>>>>>> share a thread.
> > >>>>>>>>
> > >>>>>>>> But to get back to the original question, stream locality does
> > >>>>>>>> not generally reduce the total memory requirement. If you add
> > >>>>>>>> multiple operators into one container, that container will also
> > >>>>>>>> require more memory, and that's how the container size is
> > >>>>>>>> calculated in the physical plan. You may get some extra mileage
> > >>>>>>>> when multiple operators share the same heap, but the need to
> > >>>>>>>> identify the memory requirement per operator does not go away.
> > >>>>>>>>
> > >>>>>>>> Thomas
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>> On Sat, Sep 26, 2015 at 12:41 PM, Munagala Ramanath <
> > >>>>>>> [email protected]> wrote:
> > >>>>>>>>
> > >>>>>>>> Would CONTAINER_LOCAL achieve the same thing and perform a
> > >>>>>>>> little better on a multi-core box?
> > >>>>>>>>
> > >>>>>>>> Ram
> > >>>>>>>>>
> > >>>>>>>> On Sat, Sep 26, 2015 at 12:18 PM, Sandeep Deshmukh <
> > >>>>>>>> [email protected]> wrote:
> > >>>>>>>>>
> > >>>>>>>>> Yes, with this approach only two containers are required: one
> > >>>>>>>>> for stram and another for all operators. You can easily fit
> > >>>>>>>>> around 10 operators in less than 1GB.
> > >>>>>>>>>
> > >>>>>>>>> On 27 Sep 2015 00:32, "Timothy Farkas" <[email protected]>
> > >>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> Hi Ram,
> > >>>>>>>>>>
> > >>>>>>>>>> You could make all the operators thread local. This cuts down
> > >>>>>>>>>> on the overhead of separate containers and maximizes the
> > >>>>>>>>>> memory available to each operator.
> > >>>>>>>>>>
> > >>>>>>>>>> Tim
> > >>>>>>>>>>>
> > >>>>>>>>>> On Sat, Sep 26, 2015 at 10:07 AM, Munagala Ramanath <
> > >>>>>>>>>> [email protected]> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Hi,
> > >>>>>>>>>>>
> > >>>>>>>>>>> I was running into memory issues when deploying my app on the
> > >>>>>>>>>>> sandbox, where all the operators were stuck forever in the
> > >>>>>>>>>>> PENDING state because they were being continually aborted and
> > >>>>>>>>>>> restarted due to the limited memory on the sandbox. After some
> > >>>>>>>>>>> experimentation, I found that the following config values seem
> > >>>>>>>>>>> to work:
> > >>>>>>>>>>> ------------------------------------------
> > >>>>>>>>>>> <property>
> > >>>>>>>>>>>   <name>dt.attr.MASTER_MEMORY_MB</name>
> > >>>>>>>>>>>   <value>500</value>
> > >>>>>>>>>>> </property>
> > >>>>>>>>>>> <property>
> > >>>>>>>>>>>   <name>dt.application.*.operator.*.attr.MEMORY_MB</name>
> > >>>>>>>>>>>   <value>200</value>
> > >>>>>>>>>>> </property>
> > >>>>>>>>>>> <property>
> > >>>>>>>>>>>   <name>dt.application.TopNWordsWithQueries.operator.fileWordCount.attr.MEMORY_MB</name>
> > >>>>>>>>>>>   <value>512</value>
> > >>>>>>>>>>> </property>
> > >>>>>>>>>>> ------------------------------------------------
> > >>>>>>>>>>> Are these reasonable values? Is there a more systematic way of
> > >>>>>>>>>>> coming up with these values than trial-and-error? Most of my
> > >>>>>>>>>>> operators -- with the exception of fileWordCount -- need very
> > >>>>>>>>>>> little memory; is there a way to cut all values down to the
> > >>>>>>>>>>> bare minimum and maximize available memory for this one
> > >>>>>>>>>>> operator?
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Ram
> > >>>>>>>>>>>>