I created a simple test to check how quickly Java can count to Integer.MAX_VALUE. The results I see are consistent with the CONTAINER_LOCAL behavior:

counting long in a single thread: 0.9 sec
counting volatile long in a single thread: 17.7 sec
counting volatile long shared between two threads: 186.3 sec
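
The test is essentially the following (a minimal sketch of what I ran; class and variable names are illustrative):

----
// Minimal sketch of the counting test (illustrative, not the exact code).
public class VolatileCountTest {
  static long plain;            // variant 1: plain field, single thread
  static volatile long shared;  // variants 2 and 3: volatile field

  public static void main(String[] args) throws InterruptedException {
    long t0 = System.nanoTime();
    for (int i = 0; i < Integer.MAX_VALUE; i++) {
      plain++;
    }
    System.out.printf("long, single thread: %.1f sec%n", (System.nanoTime() - t0) / 1e9);

    t0 = System.nanoTime();
    for (int i = 0; i < Integer.MAX_VALUE; i++) {
      shared++;
    }
    System.out.printf("volatile long, single thread: %.1f sec%n", (System.nanoTime() - t0) / 1e9);

    // variant 3: two threads increment the same volatile field, half each;
    // the point is the cache-line traffic, not an exact final count
    shared = 0;
    t0 = System.nanoTime();
    Runnable half = () -> {
      for (int i = 0; i < Integer.MAX_VALUE / 2; i++) {
        shared++;
      }
    };
    Thread a = new Thread(half);
    Thread b = new Thread(half);
    a.start(); b.start();
    a.join(); b.join();
    System.out.printf("volatile long, two threads: %.1f sec%n", (System.nanoTime() - t0) / 1e9);
  }
}
----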

I suggest that we look into https://qconsf.com/sf2012/dl/qcon-sanfran-2012/slides/MartinThompson_LockFreeAlgorithmsForUltimatePerformanceMOVEDTOBALLROOMA.pdf or a similar algorithm.
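
If it helps, the flavor of those techniques is roughly this (a sketch of the general idea only, not taken from the slides; the class name is made up):

----
import java.util.concurrent.atomic.AtomicLong;

// Disruptor-style sequence counter: padding keeps the hot counter on its
// own cache line so the producer's writes do not invalidate the line the
// consumer's fields live on (false sharing), and lazySet is an ordered
// store that avoids the full fence a volatile write pays for.
public class PaddedSequence extends AtomicLong {
  protected long p1, p2, p3, p4, p5, p6, p7; // cache-line padding

  public void publish(long value) {
    lazySet(value); // visible to the reader, but much cheaper than a volatile write
  }
}
----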

Thank you,

Vlad


On 9/28/15 08:19, Vlad Rozov wrote:
Ram,

In the CONTAINER_LOCAL case, the stream between operators is an InlineStream. InlineStream extends DefaultReservoir, which extends CircularBuffer. CircularBuffer does not use synchronized methods or locks; it uses volatile fields. I guess that the volatile usage causes CPU cache invalidation, and, along with memory locality (in the thread-local case a tuple is always local to both operators' threads, while in the container-local case the second operator's thread may see the data significantly later than when the first thread produced it), these two factors negatively impact CONTAINER_LOCAL performance. It is still quite surprising that the impact is so significant.
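
The pattern is roughly this (a simplified sketch of the idea, not the actual netlet code):

----
// Simplified single-producer/single-consumer ring buffer using volatile
// indices, in the style of CircularBuffer (illustrative sketch only).
public class SpscBuffer<T> {
  private final Object[] buffer;
  private final int mask;
  private volatile long head; // next slot to read; written only by the consumer
  private volatile long tail; // next slot to write; written only by the producer

  public SpscBuffer(int capacity) { // capacity must be a power of two
    buffer = new Object[capacity];
    mask = capacity - 1;
  }

  public boolean offer(T e) {
    if (tail - head == buffer.length) {
      return false; // full; the caller spins or sleeps and retries
    }
    buffer[(int) (tail & mask)] = e;
    tail++; // single-writer volatile store publishes the element
    return true;
  }

  @SuppressWarnings("unchecked")
  public T poll() {
    if (head == tail) {
      return null; // empty
    }
    T e = (T) buffer[(int) (head & mask)];
    head++; // single-writer volatile store frees the slot
    return e;
  }
}

// Every tail++/head++ is a volatile store that forces cache-coherency
// traffic between the two cores each time an index advances.
----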

Thank you,

Vlad

On 9/27/15 16:45, Munagala Ramanath wrote:
Vlad,

That's a fascinating and counter-intuitive result. I wonder if some internal synchronization is happening (maybe the stream between them is a shared data structure that is lock-protected) that slows down the two threads in the CONTAINER_LOCAL case. If they are both going as fast as possible, it is likely that they will frequently block on the lock. If that is indeed the case, some sort of lock striping or a near-lockless protocol for stream access should tilt the balance in favor of CONTAINER_LOCAL.

In the thread-local case, of course, there is no need for such locking.

Ram

On Sun, Sep 27, 2015 at 12:17 PM, Vlad Rozov <[email protected]> wrote:

    Changed subject to reflect the shift in discussion.

    After I recompiled netlet and hardcoded a 0 wait time in the
    CircularBuffer.put() method, I still see the same difference, even
    after I increased operator memory to 10 GB and set "-D
    dt.application.*.operator.*.attr.SPIN_MILLIS=0 -D
    dt.application.*.operator.*.attr.QUEUE_CAPACITY=1024000". CPU usage
    is close to 100% for both the thread-local and container-local
    settings. Note that in the thread-local case the two operators share
    one CPU at 100%, while in the container-local case each gets its own
    100% load. It seems that container-local will outperform thread-local
    only when the number of emitted tuples is (relatively) low, for
    example when it is CPU-costly to produce tuples (hash computations,
    compression/decompression, aggregations, filtering with complex
    expressions). In cases where an operator may emit 5 million or more
    tuples per second, thread-local may outperform container-local even
    when both operators are CPU intensive.
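
    (For reference, the same attributes in dt-site.xml form, equivalent
    to the -D flags above:)

    ----
    <property>
      <name>dt.application.*.operator.*.attr.SPIN_MILLIS</name>
      <value>0</value>
    </property>
    <property>
      <name>dt.application.*.operator.*.attr.QUEUE_CAPACITY</name>
      <value>1024000</value>
    </property>
    ----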

    Thank you,

    Vlad

    On 9/26/15 22:52, Timothy Farkas wrote:
    Hi Vlad,

    I just took a look at the CircularBuffer. Why are threads polling the state
    of the buffer before doing operations? Couldn't polling be avoided entirely
    by using something like Condition variables to signal when the buffer is
    ready for an operation to be performed?
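
    Something like this, for example (just a sketch, not a drop-in
    replacement; it is essentially what java.util.concurrent.ArrayBlockingQueue
    does):

    ----
    import java.util.concurrent.locks.Condition;
    import java.util.concurrent.locks.ReentrantLock;

    // Bounded buffer that blocks on Conditions instead of sleep-polling.
    public class BlockingRing<T> {
      private final Object[] items;
      private int putIdx, takeIdx, count;
      private final ReentrantLock lock = new ReentrantLock();
      private final Condition notFull = lock.newCondition();
      private final Condition notEmpty = lock.newCondition();

      public BlockingRing(int capacity) {
        items = new Object[capacity];
      }

      public void put(T e) throws InterruptedException {
        lock.lock();
        try {
          while (count == items.length) {
            notFull.await(); // block until a slot frees up; no sleep loop
          }
          items[putIdx] = e;
          putIdx = (putIdx + 1) % items.length;
          count++;
          notEmpty.signal();
        } finally {
          lock.unlock();
        }
      }

      @SuppressWarnings("unchecked")
      public T take() throws InterruptedException {
        lock.lock();
        try {
          while (count == 0) {
            notEmpty.await(); // block until an element arrives
          }
          T e = (T) items[takeIdx];
          takeIdx = (takeIdx + 1) % items.length;
          count--;
          notFull.signal();
          return e;
        } finally {
          lock.unlock();
        }
      }
    }
    ----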

    Tim

    On Sat, Sep 26, 2015 at 10:42 PM, Vlad Rozov <[email protected]> wrote:

    After looking at a few stack traces, I think that in the benchmark
    application the operators compete for the circular buffer that passes
    slices from the emitter output to the consumer input, and the sleeps
    that avoid busy waiting are too long for the benchmark operators. I
    don't see a stack similar to the one below every time I take a thread
    dump, but still often enough to suspect that the sleep is the root
    cause. I'll recompile with a smaller sleep time and see how that
    affects performance.
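
    The pattern the dump points at is roughly this (a sketch of the idea,
    not the actual netlet source; spinMillis is illustrative):

    ----
    // Suspected shape of CircularBuffer.put(): spin until there is room,
    // sleeping between attempts. For operators emitting millions of
    // tuples per second, even a short sleep per full buffer is costly.
    public void put(T e) throws InterruptedException {
      while (!offer(e)) {         // offer() fails while the buffer is full
        Thread.sleep(spinMillis); // the sleep visible in the stack below
      }
    }
    ----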

    ----
    "1/wordGenerator:RandomWordInputModule" prio=10 tid=0x00007f78c8b8c000 nid=0x780f waiting on condition [0x00007f78abb17000]
       java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at com.datatorrent.netlet.util.CircularBuffer.put(CircularBuffer.java:182)
        at com.datatorrent.stram.stream.InlineStream.put(InlineStream.java:79)
        at com.datatorrent.stram.stream.MuxStream.put(MuxStream.java:117)
        at com.datatorrent.api.DefaultOutputPort.emit(DefaultOutputPort.java:48)
        at com.datatorrent.benchmark.RandomWordInputModule.emitTuples(RandomWordInputModule.java:108)
        at com.datatorrent.stram.engine.InputNode.run(InputNode.java:115)
        at com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1377)

    "2/counter:WordCountOperator" prio=10 tid=0x00007f78c8c98800 nid=0x780d waiting on condition [0x00007f78abc18000]
       java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at com.datatorrent.stram.engine.GenericNode.run(GenericNode.java:519)
        at com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1377)
    ----


    On 9/26/15 20:59, Amol Kekre wrote:

    A good read -
    http://preshing.com/20111118/locks-arent-slow-lock-contention-is/

    Though it does not explain an order-of-magnitude difference.

    Amol


    On Sat, Sep 26, 2015 at 4:25 PM, Vlad Rozov <[email protected]> wrote:

    In the benchmark test THREAD_LOCAL outperforms CONTAINER_LOCAL by an
    order of magnitude and both operators compete for CPU. I'll take a
    closer look at why.

    Thank you,

    Vlad


    On 9/26/15 14:52, Thomas Weise wrote:

    THREAD_LOCAL - operators share a thread
    CONTAINER_LOCAL - each operator has its own thread

    So as long as the operators utilize the CPU sufficiently (compete),
    the latter will perform better.

    There will be cases where a single thread can accommodate multiple
    operators. For example, a socket reader (mostly waiting for IO) and a
    decompressor (CPU hungry) can share a thread.
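
    For reference, the locality is chosen per stream when composing the
    DAG, along these lines (a sketch; the operator classes and port names
    here are hypothetical):

    ----
    public void populateDAG(DAG dag, Configuration conf) {
      SocketReader reader = dag.addOperator("reader", new SocketReader());
      Decompressor unzip = dag.addOperator("unzip", new Decompressor());
      WordCounter counter = dag.addOperator("counter", new WordCounter());

      dag.addStream("raw", reader.output, unzip.input)
         .setLocality(DAG.Locality.THREAD_LOCAL);     // IO-bound + CPU-bound share a thread
      dag.addStream("words", unzip.output, counter.input)
         .setLocality(DAG.Locality.CONTAINER_LOCAL);  // separate threads, same JVM
    }
    ----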

    But to get back to the original question, stream locality generally
    does not reduce the total memory requirement. If you add multiple
    operators into one container, that container will also require more
    memory, and that's how the container size is calculated in the
    physical plan. You may get some extra mileage when multiple operators
    share the same heap, but the need to identify the memory requirement
    per operator does not go away.

    Thomas


    On Sat, Sep 26, 2015 at 12:41 PM, Munagala Ramanath <[email protected]> wrote:

    Would CONTAINER_LOCAL achieve the same thing and perform a little
    better on a multi-core box?

    Ram

    On Sat, Sep 26, 2015 at 12:18 PM, Sandeep Deshmukh <[email protected]> wrote:

    Yes, with this approach only two containers are required: one for
    stram and another for all operators. You can easily fit around 10
    operators in less than 1 GB.

    On 27 Sep 2015 00:32, "Timothy Farkas" <[email protected]> wrote:

    Hi Ram,

    You could make all the operators thread-local. This cuts down on the
    overhead of separate containers and maximizes the memory available to
    each operator.

    Tim

    On Sat, Sep 26, 2015 at 10:07 AM, Munagala Ramanath <[email protected]> wrote:

    Hi,

    I was running into memory issues when deploying my app on the
    sandbox, where all the operators were stuck forever in the PENDING
    state: they were being continually aborted and restarted because of
    the limited memory on the sandbox. After some experimentation, I
    found that the following config values seem to work:
    ------------------------------------------
    <property>
      <name>dt.attr.MASTER_MEMORY_MB</name>
      <value>500</value>
    </property>
    <property>
      <name>dt.application.*.operator.*.attr.MEMORY_MB</name>
      <value>200</value>
    </property>
    <property>
      <name>dt.application.TopNWordsWithQueries.operator.fileWordCount.attr.MEMORY_MB</name>
      <value>512</value>
    </property>
    ------------------------------------------------
    Are these reasonable values? Is there a more systematic way of coming
    up with these values than trial-and-error? Most of my operators --
    with the exception of fileWordCount -- need very little memory; is
    there a way to cut all values down to the bare minimum and maximize
    available memory for this one operator?

    Thanks.

    Ram
