Both threads increment a shared static volatile long in a loop until it reaches Integer.MAX_VALUE.
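Roughly like this (a simplified sketch of the test, not the exact code I ran; the single-thread variants use one thread and, for the first test, drop the volatile modifier):

public class CountingBenchmark
{
  // shared counter; remove "volatile" to reproduce the plain single-thread case
  private static volatile long counter = 0;

  public static void main(String[] args) throws InterruptedException
  {
    Runnable incrementer = new Runnable()
    {
      @Override
      public void run()
      {
        // counter++ on a volatile is not an atomic increment, which is fine
        // here: the test only measures how fast the threads can count
        while (counter < Integer.MAX_VALUE) {
          counter++;
        }
      }
    };

    long start = System.currentTimeMillis();
    Thread t1 = new Thread(incrementer, "incrementer-1");
    Thread t2 = new Thread(incrementer, "incrementer-2");
    t1.start();
    t2.start();
    t1.join();
    t2.join();
    System.out.println("elapsed: " + (System.currentTimeMillis() - start) / 1000.0 + " sec");
  }
}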

Thank you,

Vlad

On 9/28/15 10:56, Pramod Immaneni wrote:
Vlad, what was your mode of interaction/ordering between the two threads for
the 3rd test?

On Mon, Sep 28, 2015 at 10:51 AM, Vlad Rozov <[email protected]>
wrote:

I created a simple test to check how quickly Java can count to
Integer.MAX_VALUE. The results that I see are consistent with the
CONTAINER_LOCAL behavior:

counting long in a single thread: 0.9 sec
counting volatile long in a single thread: 17.7 sec
counting volatile long shared between two threads: 186.3 sec

I suggest that we look into
https://qconsf.com/sf2012/dl/qcon-sanfran-2012/slides/MartinThompson_LockFreeAlgorithmsForUltimatePerformanceMOVEDTOBALLROOMA.pdf
or a similar algorithm.
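For reference, the core idea from those slides applied to a single-producer/single-consumer buffer looks roughly like this (an illustrative sketch with made-up names, not the netlet CircularBuffer): each index has a single writer, the opposite index is cached so the volatile read happens only when the cached value is exhausted, and lazySet avoids a full fence on publication.

import java.util.concurrent.atomic.AtomicLong;

public class SpscRingBuffer<T>
{
  private final Object[] buffer;
  private final int mask;

  private final AtomicLong tail = new AtomicLong(); // written by the producer only
  private final AtomicLong head = new AtomicLong(); // written by the consumer only

  private long cachedHead; // producer's cached view of head
  private long cachedTail; // consumer's cached view of tail

  public SpscRingBuffer(int capacity)
  {
    if (Integer.bitCount(capacity) != 1) {
      throw new IllegalArgumentException("capacity must be a power of two");
    }
    buffer = new Object[capacity];
    mask = capacity - 1;
  }

  /** Producer thread only. Returns false when the buffer is full. */
  public boolean offer(T e)
  {
    long t = tail.get();
    if (t - cachedHead >= buffer.length) {
      cachedHead = head.get();            // refresh the volatile read only when needed
      if (t - cachedHead >= buffer.length) {
        return false;
      }
    }
    buffer[(int)t & mask] = e;
    tail.lazySet(t + 1);                  // ordered write, cheaper than a full volatile write
    return true;
  }

  /** Consumer thread only. Returns null when the buffer is empty. */
  @SuppressWarnings("unchecked")
  public T poll()
  {
    long h = head.get();
    if (h >= cachedTail) {
      cachedTail = tail.get();            // refresh the volatile read only when needed
      if (h >= cachedTail) {
        return null;
      }
    }
    T e = (T)buffer[(int)h & mask];
    buffer[(int)h & mask] = null;
    head.lazySet(h + 1);
    return e;
  }
}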

Thank you,

Vlad



On 9/28/15 08:19, Vlad Rozov wrote:

Ram,

The stream between operators in the CONTAINER_LOCAL case is InlineStream.
InlineStream extends DefaultReservoir, which extends CircularBuffer.
CircularBuffer does not use synchronized methods or locks; it uses volatile
fields. I guess that two factors negatively impact CONTAINER_LOCAL
performance: using volatile causes CPU cache invalidation, and memory
locality suffers (in the thread-local case a tuple is always local to both
operators, while in the container-local case the second operator's thread may
see the data significantly later than when the first thread produced it). It
is still quite surprising that the impact is so significant.
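To illustrate the pattern (a simplified sketch of the volatile head/tail approach, not the actual netlet code): every put and every poll reads the other thread's volatile counter and writes its own, so the cache lines holding them keep bouncing between the two cores.

// Simplified sketch of a volatile-based circular buffer: no locks, but both
// threads touch the shared volatile counters on every single tuple.
class SimpleCircularBuffer<T>
{
  private final Object[] data = new Object[64 * 1024];
  private volatile long head; // advanced by the consumer thread
  private volatile long tail; // advanced by the producer thread

  boolean put(T tuple)
  {
    if (tail - head >= data.length) {   // volatile read of the consumer's counter
      return false;                     // full; the caller spins/sleeps and retries
    }
    data[(int)(tail % data.length)] = tuple;
    tail++;                             // volatile write observed by the consumer
    return true;
  }

  @SuppressWarnings("unchecked")
  T poll()
  {
    if (head >= tail) {                 // volatile read of the producer's counter
      return null;                      // empty
    }
    T tuple = (T)data[(int)(head % data.length)];
    head++;                             // volatile write observed by the producer
    return tuple;
  }
}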

Thank you,

Vlad

On 9/27/15 16:45, Munagala Ramanath wrote:

Vlad,

That's a fascinating and counter-intuitive result. I wonder if some
internal synchronization is happening (maybe the stream between them is a
shared, lock-protected data structure) that slows down the two threads in the
CONTAINER_LOCAL case. If they are both going as fast as possible, it is
likely that they will frequently be blocked by the lock. If that is indeed
the case, some sort of lock striping or a near-lockless protocol for stream
access should tilt the balance in favor of CONTAINER_LOCAL.

In the thread-local case of course there is no need for such locking.

Ram

On Sun, Sep 27, 2015 at 12:17 PM, Vlad Rozov <[email protected]> wrote:

     Changed subject to reflect shift of discussion.

     After I recompiled netlet and hardcoded 0 wait time in the
     CircularBuffer.put() method, I still see the same difference even when I
     increased operator memory to 10 GB and set "-D
     dt.application.*.operator.*.attr.SPIN_MILLIS=0 -D
     dt.application.*.operator.*.attr.QUEUE_CAPACITY=1024000". CPU % is close
     to 100% both for the thread and container local locality settings. Note
     that in thread local the two operators share 100% of a CPU, while in
     container local each gets its own 100% load. It seems that container
     local will outperform thread local only when the number of emitted
     tuples is (relatively) low, for example when it is CPU costly to produce
     tuples (hash computations, compression/decompression, aggregations,
     filtering with complex expressions). In cases where an operator may emit
     5 million or more tuples per second, thread local may outperform
     container local even when both operators are CPU intensive.




     Thank you,

     Vlad

     On 9/26/15 22:52, Timothy Farkas wrote:

     Hi Vlad,

     I just took a look at the CircularBuffer. Why are threads polling the
     state of the buffer before doing operations? Couldn't polling be avoided
     entirely by using something like Condition variables to signal when the
     buffer is ready for an operation to be performed?
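     Something along these lines (just a sketch of the idea, not a drop-in
     replacement for CircularBuffer):

import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of a bounded buffer that blocks on Condition variables instead of
// sleeping and re-polling; threads are woken exactly when space or data appears.
class BlockingCircularBuffer<T>
{
  private final Queue<T> queue = new ArrayDeque<>();
  private final int capacity;
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition notFull = lock.newCondition();
  private final Condition notEmpty = lock.newCondition();

  BlockingCircularBuffer(int capacity)
  {
    this.capacity = capacity;
  }

  void put(T tuple) throws InterruptedException
  {
    lock.lock();
    try {
      while (queue.size() == capacity) {
        notFull.await();               // wait until the consumer signals free space
      }
      queue.add(tuple);
      notEmpty.signal();               // wake a waiting consumer, if any
    } finally {
      lock.unlock();
    }
  }

  T take() throws InterruptedException
  {
    lock.lock();
    try {
      while (queue.isEmpty()) {
        notEmpty.await();              // wait until the producer signals new data
      }
      T tuple = queue.poll();
      notFull.signal();                // wake a waiting producer, if any
      return tuple;
    } finally {
      lock.unlock();
    }
  }
}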

     Tim

     On Sat, Sep 26, 2015 at 10:42 PM, Vlad Rozov <[email protected]> wrote:

     After looking at a few stack traces, I think that in the benchmark
     application the operators compete for the circular buffer that passes
     slices from the emitter output to the consumer input, and the sleeps
     that avoid busy wait are too long for the benchmark operators. I don't
     see the stack similar to the one below every time I take a thread dump,
     but still often enough to suspect that sleep is the root cause. I'll
     recompile with a smaller sleep time and see how this affects performance.

     ----
     "1/wordGenerator:RandomWordInputModule" prio=10 tid=0x00007f78c8b8c000 nid=0x780f waiting on condition [0x00007f78abb17000]
         java.lang.Thread.State: TIMED_WAITING (sleeping)
          at java.lang.Thread.sleep(Native Method)
          at com.datatorrent.netlet.util.CircularBuffer.put(CircularBuffer.java:182)
          at com.datatorrent.stram.stream.InlineStream.put(InlineStream.java:79)
          at com.datatorrent.stram.stream.MuxStream.put(MuxStream.java:117)
          at com.datatorrent.api.DefaultOutputPort.emit(DefaultOutputPort.java:48)
          at com.datatorrent.benchmark.RandomWordInputModule.emitTuples(RandomWordInputModule.java:108)
          at com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1377)

     "2/counter:WordCountOperator" prio=10 tid=0x00007f78c8c98800 nid=0x780d waiting on condition [0x00007f78abc18000]
         java.lang.Thread.State: TIMED_WAITING (sleeping)
          at java.lang.Thread.sleep(Native Method)
          at com.datatorrent.stram.engine.GenericNode.run(GenericNode.java:519)
          at com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1377)

     ----


     On 9/26/15 20:59, Amol Kekre wrote:

     A good read -
     http://preshing.com/20111118/locks-arent-slow-lock-contention-is/

     Though it does not explain the order-of-magnitude difference.

     Amol


     On Sat, Sep 26, 2015 at 4:25 PM, Vlad Rozov <[email protected]> wrote:

     In the benchmark test THREAD_LOCAL outperforms CONTAINER_LOCAL by an
     order of magnitude and both operators compete for CPU. I'll take a
     closer look at why.

     Thank you,

     Vlad


     On 9/26/15 14:52, Thomas Weise wrote:

     THREAD_LOCAL - operators share a thread

     CONTAINER_LOCAL - each operator has its own thread

     So as long as the operators utilize the CPU sufficiently (compete), the
     latter will perform better.

     There will be cases where a single thread can accommodate multiple
     operators. For example, a socket reader (mostly waiting for IO) and a
     decompressor (CPU hungry) can share a thread.
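     The locality is chosen per stream when the DAG is assembled, roughly
     like this (SocketReader and Decompressor are just placeholder operator
     classes for the example):

import com.datatorrent.api.DAG;
import com.datatorrent.api.DAG.Locality;
import com.datatorrent.api.StreamingApplication;
import org.apache.hadoop.conf.Configuration;

public class LocalityExample implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    // hypothetical operators: an IO-bound reader and a CPU-bound decompressor
    SocketReader reader = dag.addOperator("reader", new SocketReader());
    Decompressor decompress = dag.addOperator("decompress", new Decompressor());

    // locality is declared on the stream that connects the two operators;
    // THREAD_LOCAL shares one thread, CONTAINER_LOCAL gives each operator
    // its own thread within the same container
    dag.addStream("rawData", reader.output, decompress.input)
       .setLocality(Locality.THREAD_LOCAL);
  }
}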

     But to get back to the original question, stream locality generally does
     not reduce the total memory requirement. If you add multiple operators
     into one container, that container will also require more memory, and
     that's how the container size is calculated in the physical plan. You
     may get some extra mileage when multiple operators share the same heap,
     but the need to identify the memory requirement per operator does not go
     away.

     Thomas


     On Sat, Sep 26, 2015 at 12:41 PM, Munagala Ramanath <[email protected]> wrote:

     Would CONTAINER_LOCAL achieve the same thing and perform a little better
     on a multi-core box?

     Ram

     On Sat, Sep 26, 2015 at 12:18 PM, Sandeep Deshmukh <[email protected]> wrote:

     Yes, with this approach only two containers are required: one for stram
     and another for all operators. You can easily fit around 10 operators in
     less than 1GB.
     On 27 Sep 2015 00:32, "Timothy Farkas" <[email protected]> wrote:

     Hi Ram,

     You could make all the operators thread local. This cuts down on the
     overhead of separate containers and maximizes the memory available to
     each operator.

     Tim

     On Sat, Sep 26, 2015 at 10:07 AM, Munagala Ramanath <[email protected]> wrote:

         Hi,
     I was running into memory issues when deploying my app on the sandbox,
     where all the operators were stuck forever in the PENDING state because
     they were being continually aborted and restarted due to the limited
     memory on the sandbox. After some experimentation, I found that the
     following config values seem to work:
     ------------------------------------------
     <property>
       <name>dt.attr.MASTER_MEMORY_MB</name>
       <value>500</value>
     </property>
     <property>
       <name>dt.application.*.operator.*.attr.MEMORY_MB</name>
       <value>200</value>
     </property>
     <property>
       <name>dt.application.TopNWordsWithQueries.operator.fileWordCount.attr.MEMORY_MB</name>
       <value>512</value>
     </property>

     ------------------------------------------------
     Are these reasonable values? Is there a more systematic way of coming up
     with these values than trial-and-error? Most of my operators -- with the
     exception of fileWordCount -- need very little memory; is there a way to
     cut all values down to the bare minimum and maximize the available
     memory for this one operator?

     Thanks.
     Ram





