for the 3rd test.

On Mon, Sep 28, 2015 at 10:51 AM, Vlad Rozov <[email protected]> wrote:
I created a simple test to check how quickly Java can count to
Integer.MAX_VALUE. The results that I see are consistent with the
CONTAINER_LOCAL behavior:

counting long in a single thread: 0.9 sec
counting volatile long in a single thread: 17.7 sec
counting volatile long shared between two threads: 186.3 sec
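For reference, a counter benchmark along these lines can be reconstructed
as in the sketch below (my reconstruction, not Vlad's actual test code;
the two-thread case hands the counter back and forth by parity, so only
one thread writes at any given moment):

----
public class VolatileCountBenchmark
{
  private static long plain;
  private static volatile long shared;

  public static void main(String[] args) throws InterruptedException
  {
    // Case 1: plain long, single thread; the JIT can keep it in a register.
    long start = System.currentTimeMillis();
    for (plain = 0; plain < Integer.MAX_VALUE; plain++) { }
    System.out.println("long, one thread: " + (System.currentTimeMillis() - start) + " ms");

    // Case 2: volatile long, single thread; every increment is a real memory write.
    start = System.currentTimeMillis();
    for (shared = 0; shared < Integer.MAX_VALUE; shared++) { }
    System.out.println("volatile long, one thread: " + (System.currentTimeMillis() - start) + " ms");

    // Case 3: volatile long shared between two threads; the cache line
    // holding 'shared' ping-pongs between cores on every increment.
    shared = 0;
    start = System.currentTimeMillis();
    Thread odd = new Thread(() -> {
      while (shared < Integer.MAX_VALUE) {
        if ((shared & 1L) == 1L) {
          shared++; // this thread advances odd -> even
        }
      }
    });
    odd.start();
    while (shared < Integer.MAX_VALUE) {
      if ((shared & 1L) == 0L) {
        shared++; // main thread advances even -> odd
      }
    }
    odd.join();
    System.out.println("volatile long, two threads: " + (System.currentTimeMillis() - start) + " ms");
  }
}
----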
I suggest that we look into
https://qconsf.com/sf2012/dl/qcon-sanfran-2012/slides/MartinThompson_LockFreeAlgorithmsForUltimatePerformanceMOVEDTOBALLROOMA.pdf
or a similar algorithm.
Thank you,
Vlad
On 9/28/15 08:19, Vlad Rozov wrote:
Ram,
The stream between operators in the CONTAINER_LOCAL case is InlineStream.
InlineStream extends DefaultReservoir, which extends CircularBuffer.
CircularBuffer does not use synchronized methods or locks; it uses
volatile. I guess that using volatile causes CPU cache invalidation, and
along with memory locality (in the thread-local case a tuple is always
local to both threads, while in the container-local case the second
operator's thread may see data significantly later than when the first
thread produced it) these two factors negatively impact CONTAINER_LOCAL
performance. It is still quite surprising that the impact is so
significant.
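To make the volatile-based design concrete, the heart of a
single-producer/single-consumer circular buffer of this kind looks
roughly like the sketch below (simplified for illustration, not the
actual netlet code). Both threads touch the volatile head/tail fields
on every operation, which is where the cache coherence traffic comes
from:

----
// Simplified SPSC ring buffer in the style discussed above: no locks,
// coordination happens purely through the volatile head/tail fields.
public class SpscRingBuffer<T>
{
  private final Object[] buffer;
  private final int mask;
  private volatile long head; // written by consumer, read by producer
  private volatile long tail; // written by producer, read by consumer

  public SpscRingBuffer(int capacity)
  {
    // capacity must be a power of two for the mask trick to work
    buffer = new Object[capacity];
    mask = capacity - 1;
  }

  public boolean offer(T e)
  {
    long t = tail;
    if (t - head == buffer.length) {
      return false; // full; the caller spins or sleeps, as CircularBuffer.put() does
    }
    buffer[(int)(t & mask)] = e;
    tail = t + 1; // volatile write publishes the element to the consumer
    return true;
  }

  @SuppressWarnings("unchecked")
  public T poll()
  {
    long h = head;
    if (h == tail) {
      return null; // empty
    }
    T e = (T)buffer[(int)(h & mask)];
    buffer[(int)(h & mask)] = null;
    head = h + 1; // volatile write frees the slot for the producer
    return e;
  }
}
----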
Thank you,
Vlad
On 9/27/15 16:45, Munagala Ramanath wrote:
Vlad,
That's a fascinating and counter-intuitive result. I wonder if some
internal synchronization is happening (maybe the stream between them is
a shared data structure that is lock protected) to slow down the 2
threads in the CONTAINER_LOCAL case. If they are both going as fast as
possible, it is likely that they will be frequently blocked by the lock.
If that is indeed the case, some sort of lock striping or a near-lockless
protocol for stream access should tilt the balance in favor of
CONTAINER_LOCAL. In the thread-local case, of course, there is no need
for such locking.
Ram
On Sun, Sep 27, 2015 at 12:17 PM, Vlad Rozov <[email protected]> wrote:
Changed subject to reflect shift of discussion.
After I recompiled netlet and hardcoded 0 wait time in the
CircularBuffer.put() method, I still see the same difference even when
I increased operator memory to 10 GB and set "-D
dt.application.*.operator.*.attr.SPIN_MILLIS=0 -D
dt.application.*.operator.*.attr.QUEUE_CAPACITY=1024000". CPU % is
close to 100% both for thread and container local locality settings.
Note that in thread local the two operators share 100% CPU, while in
container local each gets its own 100% load. It sounds like container
local will outperform thread local only when the number of emitted
tuples is (relatively) low, for example when it is CPU costly to
produce tuples (hash computations, compression/decompression,
aggregations, filtering with complex expressions). In cases where an
operator may emit 5 or more million tuples per second, thread local may
outperform container local even when both operators are CPU intensive.
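(The same attributes can also be set in the configuration file; an
equivalent XML fragment, following the property style quoted later in
this thread:)

----
<property>
  <name>dt.application.*.operator.*.attr.SPIN_MILLIS</name>
  <value>0</value>
</property>
<property>
  <name>dt.application.*.operator.*.attr.QUEUE_CAPACITY</name>
  <value>1024000</value>
</property>
----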
Thank you,
Vlad
On 9/26/15 22:52, Timothy Farkas wrote:
Hi Vlad,
I just took a look at the CircularBuffer. Why are threads polling the
state of the buffer before doing operations? Couldn't polling be
avoided entirely by using something like Condition variables to signal
when the buffer is ready for an operation to be performed?
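(For illustration, a Condition-based put/take would look roughly like
the sketch below. This is a hypothetical alternative, essentially what
java.util.concurrent.ArrayBlockingQueue does internally, not existing
netlet code. The trade-off is that the lock itself can become the
contended resource at high tuple rates.)

----
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical Condition-based buffer: producers block on notFull,
// consumers block on notEmpty, and nobody sleep-polls.
public class BlockingRing<T>
{
  private final Object[] buffer;
  private int head, tail, count;
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition notFull = lock.newCondition();
  private final Condition notEmpty = lock.newCondition();

  public BlockingRing(int capacity)
  {
    buffer = new Object[capacity];
  }

  public void put(T e) throws InterruptedException
  {
    lock.lock();
    try {
      while (count == buffer.length) {
        notFull.await();      // park until a consumer frees a slot
      }
      buffer[tail] = e;
      tail = (tail + 1) % buffer.length;
      count++;
      notEmpty.signal();      // wake one waiting consumer
    } finally {
      lock.unlock();
    }
  }

  @SuppressWarnings("unchecked")
  public T take() throws InterruptedException
  {
    lock.lock();
    try {
      while (count == 0) {
        notEmpty.await();     // park until a producer adds an element
      }
      T e = (T)buffer[head];
      buffer[head] = null;
      head = (head + 1) % buffer.length;
      count--;
      notFull.signal();       // wake one waiting producer
      return e;
    } finally {
      lock.unlock();
    }
  }
}
----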
Tim
On Sat, Sep 26, 2015 at 10:42 PM, Vlad Rozov<
[email protected]> <mailto:[email protected]>
wrote:
After looking at a few stack traces I think that in the benchmark
application the operators compete for the circular buffer that passes
slices from the emitter output to the consumer input, and the sleeps
that avoid busy wait are too long for the benchmark operators. I don't
see a stack similar to the one below every time I take a thread dump,
but still often enough to suspect that sleep is the root cause. I'll
recompile with a smaller sleep time and see how this affects
performance.
----
"1/wordGenerator:RandomWordInputModule" prio=10
tid=0x00007f78c8b8c000
nid=0x780f waiting on condition [0x00007f78abb17000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at
com.datatorrent.netlet.util.CircularBuffer.put(CircularBuffer.java:182)
at
com.datatorrent.stram.stream.InlineStream.put(InlineStream.java:79)
at
com.datatorrent.stram.stream.MuxStream.put(MuxStream.java:117)
at
com.datatorrent.api.DefaultOutputPort.emit(DefaultOutputPort.java:48)
at
com.datatorrent.benchmark.RandomWordInputModule.emitTuples(RandomWordInputModule.java:108)
at
com.datatorrent.stram.engine.InputNode.run(InputNode.java:115)
at
com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1377)
"2/counter:WordCountOperator" prio=10 tid=0x00007f78c8c98800
nid=0x780d
waiting on condition [0x00007f78abc18000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at
com.datatorrent.stram.engine.GenericNode.run(GenericNode.java:519)
at
com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1377)
----
On 9/26/15 20:59, Amol Kekre wrote:
A good read -
http://preshing.com/20111118/locks-arent-slow-lock-contention-is/
Though it does not explain the order-of-magnitude difference.
Amol
On Sat, Sep 26, 2015 at 4:25 PM, Vlad Rozov <[email protected]> wrote:
In the benchmark test THREAD_LOCAL outperforms CONTAINER_LOCAL by an
order of magnitude and both operators compete for CPU. I'll take a
closer look at why.
Thank you,
Vlad
On 9/26/15 14:52, Thomas Weise wrote:
THREAD_LOCAL - operators share a thread
CONTAINER_LOCAL - each operator has its own thread

So as long as operators utilize the CPU sufficiently (compete), the
latter will perform better.
There will be cases where a single thread can accommodate multiple
operators. For example, a socket reader (mostly waiting for IO) and a
decompressor (CPU hungry) can share a thread.
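(For context, locality is chosen per stream when composing the DAG. A
sketch follows, with made-up operator classes; the setLocality() call
on the stream is the relevant part:)

----
import org.apache.hadoop.conf.Configuration;
import com.datatorrent.api.DAG;
import com.datatorrent.api.DAG.Locality;
import com.datatorrent.api.StreamingApplication;

// Sketch only: SocketReader and Decompressor are hypothetical operators.
public class LocalityDemo implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    SocketReader reader = dag.addOperator("reader", new SocketReader());
    Decompressor decomp = dag.addOperator("decomp", new Decompressor());

    // The reader is mostly blocked on IO and the decompressor is CPU
    // hungry, so sharing one thread costs little here.
    dag.addStream("raw", reader.output, decomp.input)
       .setLocality(Locality.THREAD_LOCAL);
  }
}
----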
But to get back to the original question, stream locality generally
does not reduce the total memory requirement. If you add multiple
operators into one container, that container will also require more
memory, and that's how the container size is calculated in the physical
plan. You may get some extra mileage when multiple operators share the
same heap, but the need to identify the memory requirement per operator
does not go away.
On Sat, Sep 26, 2015 at 12:41 PM, Munagala Ramanath <[email protected]> wrote:
Would CONTAINER_LOCAL achieve the same thing and perform a little
better on a multi-core box?
Ram
On Sat, Sep 26, 2015 at 12:18 PM, Sandeep Deshmukh <[email protected]> wrote:
Yes, with this approach only two containers are required: one for stram
and another for all operators. You can easily fit around 10 operators
in less than 1 GB.
On 27 Sep 2015 00:32, "Timothy Farkas" <[email protected]> wrote:
Hi Ram,
You could make all the operators thread local. This cuts down on the
overhead of separate containers and maximizes the memory available to
each operator.
Tim
On Sat, Sep 26, 2015 at 10:07 AM, Munagala Ramanath <[email protected]> wrote:
Hi,
I was running into memory issues when deploying my app on the sandbox:
all the operators were stuck forever in the PENDING state because they
were being continually aborted and restarted due to the limited memory
on the sandbox. After some experimentation, I found that the following
config values seem to work:
------------------------------------------
<property>
  <name>dt.attr.MASTER_MEMORY_MB</name>
  <value>500</value>
</property>
<property>
  <name>dt.application.*.operator.*.attr.MEMORY_MB</name>
  <value>200</value>
</property>
<property>
  <name>dt.application.TopNWordsWithQueries.operator.fileWordCount.attr.MEMORY_MB</name>
  <value>512</value>
</property>
------------------------------------------------
Are these reasonable values? Is there a more systematic way of coming
up with these values than trial-and-error? Most of my operators -- with
the exception of fileWordCount -- need very little memory; is there a
way to cut all values down to the bare minimum and maximize available
memory for this one operator?
Thanks.
Ram