Blog and presentation on algorithms behind JCTools:
http://psy-lob-saw.blogspot.com/p/lock-free-queues.html
https://vimeo.com/100197431
Thank you,
Vlad
On 9/29/15 21:14, Vlad Rozov wrote:
I guess yes, it should show improvement every time there is consumer/producer contention on a resource from two different threads, so we should see improvements in the buffer server as well. The current prototype does not support containers on different nodes.
Thank you,
Vlad
On 9/29/15 20:47, Pramod Immaneni wrote:
Would it show any improvement in the case where the containers are on different nodes?
On Tue, Sep 29, 2015 at 7:17 PM, Vlad Rozov <[email protected]> wrote:
By changing QUEUE_CAPACITY to 1200000 I can get around 62 million tuples per second for the case when wordGenerator emits the same tuple and 34 million when it generates a new tuple each time.
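For reference, a minimal sketch of setting that attribute programmatically (method and attribute names as in the Apache Apex DAG API of that time; the dag and counter.input references are placeholders):
----
import com.datatorrent.api.Context.PortContext;

// Raise the per-port queue capacity; the programmatic equivalent of the
// dt.application.*.operator.*.attr.QUEUE_CAPACITY property used elsewhere
// in this thread. QUEUE_CAPACITY is a port-level attribute.
dag.setInputPortAttribute(counter.input, PortContext.QUEUE_CAPACITY, 1200000);
----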
Thank you,
Vlad
On 9/29/15 17:08, Vlad Rozov wrote:
3 million tuples per second for container local and 55 million for thread local.
Thank you,
Vlad
On 9/29/15 16:57, Chetan Narsude wrote:
Vlad, what was the number without this fix?
--
Chetan
On Tue, Sep 29, 2015 at 4:48 PM, Vlad Rozov <[email protected]> wrote:
I did a quick prototype that uses the http://jctools.github.io/JCTools SPSC bounded queue instead of CircularBuffer. For container local I now see 13 million tuples per second.
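For anyone who wants to reproduce the handoff in isolation, here is a minimal, self-contained sketch of the SPSC pattern (assumes the org.jctools:jctools-core artifact on the classpath; it is an illustration of the pattern, not the actual prototype):
----
import org.jctools.queues.SpscArrayQueue;

public class SpscDemo {
  public static void main(String[] args) throws InterruptedException {
    // Bounded single-producer/single-consumer queue; offer() and poll()
    // are wait-free as long as exactly one thread sits on each end.
    final SpscArrayQueue<Integer> queue = new SpscArrayQueue<>(1024);
    final int count = 10_000_000;

    Thread producer = new Thread(() -> {
      for (int i = 0; i < count; i++) {
        while (!queue.offer(i)) {
          Thread.yield(); // queue full: back off briefly, no millisecond sleeps
        }
      }
    });
    Thread consumer = new Thread(() -> {
      for (int i = 0; i < count; i++) {
        while (queue.poll() == null) {
          Thread.yield(); // queue empty: back off briefly
        }
      }
    });

    long start = System.nanoTime();
    producer.start();
    consumer.start();
    producer.join();
    consumer.join();
    double sec = (System.nanoTime() - start) / 1e9;
    System.out.printf("%.1f million tuples/sec%n", count / sec / 1e6);
  }
}
----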
Thank you,
Vlad
On 9/28/15 12:58, Chetan Narsude wrote:
Let me shed some light on THREAD_LOCAL and CONTAINER_LOCAL.
THREAD_LOCAL at the core is nothing but a function call. When an operator does emit(tuple), it gets translated into the downstream port's process(tuple) call, which is immediately invoked in the same thread. So obviously the performance is going to be a lot faster. The only thing that happens in between is setting up the stack and invoking the function.
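As a conceptual sketch only (the names below are illustrative, not the actual com.datatorrent classes), the THREAD_LOCAL path boils down to this:
----
// Conceptual sketch: with THREAD_LOCAL, emit() resolves to a plain
// method call on the downstream port's sink.
interface Sink<T> {
  void process(T tuple);
}

class OutputPort<T> {
  private final Sink<T> downstream;

  OutputPort(Sink<T> downstream) {
    this.downstream = downstream;
  }

  void emit(T tuple) {
    downstream.process(tuple); // same thread, same stack: just a call
  }
}
----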
With CONTAINER_LOCAL there is a producer thread and a consumer thread involved. The producer produces (emit(tuple)) and the consumer consumes (process(tuple)). This scheme is optimal when the rate at which the producer produces is equal to the rate at which the consumer consumes. Often that's not the case, so we have a bounded memory buffer in between (the implementation is CircularBuffer). Now, in addition to the things that THREAD_LOCAL does, the CONTAINER_LOCAL pattern requires managing the circular buffer *and* thread context switches. The most expensive part of a thread context switch is the memory synchronization. As you all have pointed out how expensive it is to use volatile, I need not get into details of how expensive memory synchronization can get.
Long story short: no matter which pattern you use, when you use more than one thread there are certain memory synchronization penalties which are unavoidable and slow things down considerably. In 2012 I benchmarked atomic, volatile, and synchronized, and in that benchmark (I think there are unit tests for it) I found volatile to be the least expensive at that time. Synchronized was not too far behind (it's very efficient when the contention is likely to be amongst a single-digit number of threads). Not sure how those benchmarks would look today, but you get the idea.
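For the curious, a rough sketch of the three variants that were compared (illustrative only; for trustworthy numbers today use a harness such as JMH rather than a hand-rolled loop):
----
import java.util.concurrent.atomic.AtomicLong;

// The three flavors of a shared counter compared in the 2012 benchmark.
class Counters {
  long plain;                                 // no synchronization at all
  volatile long vol;                          // ordered write per increment
  long guarded;                               // protected by the monitor
  final AtomicLong atomic = new AtomicLong(); // CAS-based increment

  void incPlain() { plain++; }
  void incVolatile() { vol++; }               // ordered, but not atomic
  synchronized void incSync() { guarded++; }
  void incAtomic() { atomic.incrementAndGet(); }
}
----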
In a data-intensive app, most of the time is spent in IO and there is a lot of CPU idling at individual operators, so you will not see the difference when you change CONTAINER_LOCAL to THREAD_LOCAL, yet you will see some memory optimization as you are taking away the intermediate memory-based buffer *and* the delayed garbage collection of the objects held by this buffer.
Recommendation: do not bother with these micro-optimizations unless you notice a problem. Use THREAD_LOCAL for processing low-throughput/infrequent streams. Use CONTAINER_LOCAL to avoid serialization/deserialization of objects. Leave the rest to the platform. I expect that as it matures it will make most of these decisions automatically.
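For completeness, a minimal sketch of how a stream's locality is pinned when building the DAG (method and enum names as in the Apache Apex API; the operator and port names are placeholders):
----
import com.datatorrent.api.DAG;

// Pin the stream between the two operators to one locality level;
// leaving setLocality() out lets the platform decide.
dag.addStream("words", wordGenerator.output, counter.input)
   .setLocality(DAG.Locality.CONTAINER_LOCAL); // or DAG.Locality.THREAD_LOCAL
----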
HTH.
--
Chetan
On Mon, Sep 28, 2015 at 11:44 AM, Vlad Rozov <[email protected]> wrote:
Hi Tim,
I use the benchmark application that is part of the Apache Malhar project. Please let me know if you need help with compiling or running the application.
Thank you,
Vlad
On 9/28/15 11:09, Timothy Farkas wrote:
Also sharing a diff:
https://github.com/DataTorrent/Netlet/compare/master...ilooner:condVarBuffer
Thanks,
Tim
On Mon, Sep 28, 2015 at 10:07 AM, Timothy Farkas <[email protected]> wrote:
Hi Vlad,
Could you share your benchmarking applications? I'd like to test a change I made to the CircularBuffer:
https://github.com/ilooner/Netlet/blob/condVarBuffer/src/main/java/com/datatorrent/netlet/util/CircularBuffer.java
Thanks,
Tim
On Mon, Sep 28, 2015 at 9:56 AM, Pramod Immaneni <[email protected]> wrote:
Vlad, what was your mode of interaction/ordering between the two threads for the 3rd test?
On Mon, Sep 28, 2015 at 10:51 AM, Vlad Rozov <[email protected]> wrote:
I created a simple test to check how quickly Java can count to Integer.MAX_VALUE. The result that I see is consistent with CONTAINER_LOCAL behavior:
counting long in a single thread: 0.9 sec
counting volatile long in a single thread: 17.7 sec
counting volatile long shared between two threads: 186.3 sec
I suggest that we look into
https://qconsf.com/sf2012/dl/qcon-sanfran-2012/slides/MartinThompson_LockFreeAlgorithmsForUltimatePerformanceMOVEDTOBALLROOMA.pdf
or a similar algorithm.
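A minimal sketch of the kind of test described above (an illustrative reconstruction, not the exact code used):
----
public class CountTest {
  static long plain;
  static volatile long shared;

  public static void main(String[] args) throws InterruptedException {
    long start = System.nanoTime();
    for (int i = 0; i < Integer.MAX_VALUE; i++) {
      plain++;                       // case 1: plain long, single thread
    }
    report("plain long", start);

    start = System.nanoTime();
    for (int i = 0; i < Integer.MAX_VALUE; i++) {
      shared++;                      // case 2: volatile long, single thread
    }
    report("volatile long", start);

    // Case 3: two threads hammering the same volatile long. The increments
    // race (volatile ++ is not atomic), but what the test measures is the
    // cross-core cache-line traffic that every ordered write forces.
    start = System.nanoTime();
    Runnable half = () -> {
      for (int i = 0; i < Integer.MAX_VALUE / 2; i++) {
        shared++;
      }
    };
    Thread t1 = new Thread(half);
    Thread t2 = new Thread(half);
    t1.start();
    t2.start();
    t1.join();
    t2.join();
    report("shared volatile long", start);

    // Keep the counters observable so the JIT cannot drop the loops.
    System.out.println("counters: " + plain + ", " + shared);
  }

  static void report(String label, long start) {
    System.out.printf("%s: %.1f sec%n", label, (System.nanoTime() - start) / 1e9);
  }
}
----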
Thank you,
Vlad
On 9/28/15 08:19, Vlad Rozov wrote:
Ram,
The stream between operators in case of CONTAINER_LOCAL is InlineStream. InlineStream extends DefaultReservoir, which extends CircularBuffer. CircularBuffer does not use synchronized methods or locks; it uses volatile. I guess that using volatile causes CPU cache invalidation, and along with memory locality (in the thread local case a tuple is always local to both threads, while in the container local case the second operator thread may see data significantly later than when the first thread produced it) these two factors negatively impact CONTAINER_LOCAL performance. It is still quite surprising that the impact is so significant.
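To make the mechanics concrete, here is a minimal sketch of a volatile-index SPSC ring buffer in the spirit of CircularBuffer (illustrative only; the real class in com.datatorrent.netlet.util differs in detail):
----
// Classic Lamport queue: exactly one producer thread calls offer(),
// exactly one consumer thread calls poll(). Each increment of a
// volatile index forces cross-core memory traffic -- the cost at issue.
public class TinyRingBuffer<T> {
  private final Object[] buffer;
  private final int mask;
  private volatile long head; // next slot to read; written by consumer only
  private volatile long tail; // next slot to write; written by producer only

  public TinyRingBuffer(int capacityPowerOfTwo) {
    buffer = new Object[capacityPowerOfTwo];
    mask = capacityPowerOfTwo - 1;
  }

  public boolean offer(T e) {            // producer thread only
    if (tail - head == buffer.length) {
      return false;                      // full
    }
    buffer[(int) (tail & mask)] = e;
    tail++;                              // volatile write publishes the slot
    return true;
  }

  @SuppressWarnings("unchecked")
  public T poll() {                      // consumer thread only
    if (head == tail) {
      return null;                       // empty
    }
    int index = (int) (head & mask);
    T e = (T) buffer[index];
    buffer[index] = null;                // allow GC of the consumed tuple
    head++;                              // volatile write frees the slot
    return e;
  }
}
----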
Thank you,
Vlad
On 9/27/15 16:45, Munagala Ramanath wrote:
Vlad,
That's a fascinating and counter-intuitive result. I wonder if some internal synchronization is happening (maybe the stream between them is a shared data structure that is lock protected) to slow down the 2 threads in the CONTAINER_LOCAL case. If they are both going as fast as possible, it is likely that they will be frequently blocked by the lock. If that is indeed the case, some sort of lock striping or a near-lockless protocol for stream access should tilt the balance in favor of CONTAINER_LOCAL. In the thread-local case of course there is no need for such locking.
Ram
On Sun, Sep 27, 2015 at 12:17 PM, Vlad Rozov <[email protected]> wrote:
Changed subject to reflect shift of discussion.
After I recompiled netlet and hardcoded 0 wait time in the CircularBuffer.put() method, I still see the same difference even when I increased operator memory to 10 GB and set "-Ddt.application.*.operator.*.attr.SPIN_MILLIS=0 -Ddt.application.*.operator.*.attr.QUEUE_CAPACITY=1024000". CPU % is close to 100% both for thread and container local locality settings. Note that in thread local the two operators share 100% CPU, while in container local each gets its own 100% load. It sounds like container local will outperform thread local only when the number of emitted tuples is (relatively) low, for example when it is CPU costly to produce tuples (hash computations, compression/decompression, aggregations, filtering with complex expressions). In cases where an operator may emit 5 or more million tuples per second, thread local may outperform container local even when both operators are CPU intensive.
Thank you,
Vlad
On 9/26/15 22:52, Timothy Farkas wrote:
Hi Vlad,
I just took a look at the CircularBuffer. Why are threads polling the state of the buffer before doing operations? Couldn't polling be avoided entirely by using something like condition variables to signal when the buffer is ready for an operation to be performed?
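Something like the classic guarded bounded buffer, sketched below (the textbook pattern, not the Netlet code):
----
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Bounded buffer where threads block on condition variables instead of
// spinning or sleeping for a fixed interval.
public class CondVarBuffer<T> {
  private final Queue<T> q = new ArrayDeque<>();
  private final int capacity;
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition notFull = lock.newCondition();
  private final Condition notEmpty = lock.newCondition();

  public CondVarBuffer(int capacity) {
    this.capacity = capacity;
  }

  public void put(T e) throws InterruptedException {
    lock.lock();
    try {
      while (q.size() == capacity) {
        notFull.await();          // block until the consumer frees a slot
      }
      q.add(e);
      notEmpty.signal();
    } finally {
      lock.unlock();
    }
  }

  public T take() throws InterruptedException {
    lock.lock();
    try {
      while (q.isEmpty()) {
        notEmpty.await();         // block until the producer adds an element
      }
      T e = q.remove();
      notFull.signal();
      return e;
    } finally {
      lock.unlock();
    }
  }
}
----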
Tim
On Sat, Sep 26, 2015 at 10:42 PM, Vlad Rozov <[email protected]> wrote:
After looking at a few stack traces, I think that in the benchmark application the operators compete for the circular buffer that passes slices from the emitter output to the consumer input, and the sleeps that avoid busy wait are too long for the benchmark operators. I don't see the stack similar to the one below every time I take a thread dump, but still often enough to suspect that the sleep is the root cause. I'll recompile with a smaller sleep time and see how this affects performance.
----
"1/wordGenerator:RandomWordInputModule" prio=10
tid=0x00007f78c8b8c000
nid=0x780f waiting on condition
[0x00007f78abb17000]
java.lang.Thread.State: TIMED_WAITING
(sleeping)
at java.lang.Thread.sleep(Native Method)
at
com.datatorrent.netlet.util.CircularBuffer.put(CircularBuffer.java:182)
at
com.datatorrent.stram.stream.InlineStream.put(InlineStream.java:79)
at
com.datatorrent.stram.stream.MuxStream.put(MuxStream.java:117)
at
com.datatorrent.api.DefaultOutputPort.emit(DefaultOutputPort.java:48)
at
com.datatorrent.benchmark.RandomWordInputModule.emitTuples(RandomWordInputModule.java:108)
at
com.datatorrent.stram.engine.InputNode.run(InputNode.java:115)
at
com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1377)
"2/counter:WordCountOperator" prio=10
tid=0x00007f78c8c98800
nid=0x780d
waiting on condition [0x00007f78abc18000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at
com.datatorrent.stram.engine.GenericNode.run(GenericNode.java:519)
at
com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1377)
----
On 9/26/15 20:59, Amol Kekre wrote:
A good read -
http://preshing.com/20111118/locks-arent-slow-lock-contention-is/
Though it does not explain the order of magnitude difference.
Amol
On Sat, Sep 26, 2015 at 4:25 PM, Vlad Rozov <[email protected]> wrote:
In the benchmark test THREAD_LOCAL outperforms CONTAINER_LOCAL by an order of magnitude and both operators compete for CPU. I'll take a closer look at why.
Thank you,
Vlad
On 9/26/15 14:52, Thomas Weise wrote:
THREAD_LOCAL - operators share a thread.
CONTAINER_LOCAL - each operator has its own thread.
So as long as operators utilize the CPU sufficiently (compete), the latter will perform better.
There will be cases where a single thread can accommodate multiple operators. For example, a socket reader (mostly waiting for IO) and a decompress operator (CPU hungry) can share a thread.
But to get back to the original question, stream locality does generally not reduce the total memory requirement. If you add multiple operators into one container, that container will also require more memory, and that's how the container size is calculated in the physical plan. You may get some extra mileage when multiple operators share the same heap, but the need to identify the memory requirement per operator does not go away.
Thomas
On Sat, Sep 26, 2015 at 12:41 PM, Munagala Ramanath <[email protected]> wrote:
Would CONTAINER_LOCAL achieve the same thing and perform a little better on a multi-core box?
Ram
On Sat, Sep 26, 2015 at 12:18 PM, Sandeep Deshmukh <[email protected]> wrote:
Yes, with this approach only two containers are required: one for stram and another for all operators. You can easily fit around 10 operators in less than 1GB.
On 27 Sep 2015 00:32, "Timothy Farkas" <[email protected]> wrote:
Hi Ram,
You could make all the operators thread local. This cuts down on the overhead of separate containers and maximizes the memory available to each operator.
Tim
On Sat, Sep 26, 2015 at 10:07 AM, Munagala Ramanath <[email protected]> wrote:
Hi,
I was running into memory issues when deploying my app on the sandbox, where all the operators were stuck forever in the PENDING state because they were being continually aborted and restarted due to the limited memory on the sandbox. After some experimentation, I found that the following config values seem to work:
------------------------------------------
https://datatorrent.slack.com/archives/engineering/p1443263607000010

<property>
  <name>dt.attr.MASTER_MEMORY_MB</name>
  <value>500</value>
</property>
<property>
  <name>dt.application.*.operator.*.attr.MEMORY_MB</name>
  <value>200</value>
</property>
<property>
  <name>dt.application.TopNWordsWithQueries.operator.fileWordCount.attr.MEMORY_MB</name>
  <value>512</value>
</property>
------------------------------------------------
Are these reasonable values? Is there a more systematic way of coming up with these values than trial-and-error? Most of my operators -- with the exception of fileWordCount -- need very little memory; is there a way to cut all values down to the bare minimum and maximize available memory for this one operator?
Thanks.
Ram