Vlad, what was your mode of interaction/ordering between the two threads for the 3rd test?

On Mon, Sep 28, 2015 at 10:51 AM, Vlad Rozov <[email protected]> wrote:

> I created a simple test to check how quickly Java can count to
> Integer.MAX_VALUE. The results I see are consistent with the
> CONTAINER_LOCAL behavior:
>
> counting long in a single thread: 0.9 sec
> counting volatile long in a single thread: 17.7 sec
> counting volatile long shared between two threads: 186.3 sec
>
> I suggest that we look into
> https://qconsf.com/sf2012/dl/qcon-sanfran-2012/slides/MartinThompson_LockFreeAlgorithmsForUltimatePerformanceMOVEDTOBALLROOMA.pdf
> or a similar algorithm.
>
> Thank you,
>
> Vlad
>
> On 9/28/15 08:19, Vlad Rozov wrote:
>
>> Ram,
>>
>> The stream between operators in the CONTAINER_LOCAL case is InlineStream.
>> InlineStream extends DefaultReservoir, which extends CircularBuffer.
>> CircularBuffer does not use synchronized methods or locks; it uses
>> volatile fields. I suspect that the volatile writes cause CPU cache
>> invalidation, and together with memory locality (in the thread-local case
>> a tuple is always local to both threads, while in the container-local
>> case the second operator thread may see the data significantly later than
>> when the first thread produced it), these two factors negatively impact
>> CONTAINER_LOCAL performance. It is still quite surprising that the impact
>> is so significant.
>>
>> Thank you,
>>
>> Vlad
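The test source is not included in the thread. Below is a minimal sketch of
what such a benchmark might look like; everything in it is an assumption
(class name, timing, and in particular the split of the third case into one
incrementing thread and one reading thread, since the actual interaction
between the two threads is exactly what the question at the top asks about).
The point it illustrates is that a volatile counter forces a memory write per
increment, and sharing that counter across threads adds cache-coherence
traffic on top of that.

----
public class VolatileCountBenchmark
{
  static long plainCounter;
  static volatile long volatileCounter;

  public static void main(String[] args) throws InterruptedException
  {
    final long limit = Integer.MAX_VALUE;

    // 1. plain long, single thread (the JIT may partly optimize this loop)
    long start = System.currentTimeMillis();
    while (plainCounter < limit) {
      plainCounter++;
    }
    System.out.println("plain long:           " + (System.currentTimeMillis() - start) + " ms");

    // 2. volatile long, single thread: every increment is a real load/store
    start = System.currentTimeMillis();
    while (volatileCounter < limit) {
      volatileCounter++;
    }
    System.out.println("volatile long:        " + (System.currentTimeMillis() - start) + " ms");

    // 3. volatile long shared between two threads: one thread increments,
    // the other busy-reads, so the cache line holding the counter bounces
    // between the two cores on every write
    volatileCounter = 0;
    Thread reader = new Thread(new Runnable()
    {
      @Override
      public void run()
      {
        while (volatileCounter < limit) {
          // busy-wait on the shared counter
        }
      }
    });
    start = System.currentTimeMillis();
    reader.start();
    while (volatileCounter < limit) {
      volatileCounter++;
    }
    reader.join();
    System.out.println("shared volatile long: " + (System.currentTimeMillis() - start) + " ms");
  }
}
----

The jump from the second to the third case is what the Martin Thompson slides
linked above address: once two cores touch the same cache line, every write
has to invalidate the other core's copy.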
>> On 9/27/15 16:45, Munagala Ramanath wrote:
>>
>>> Vlad,
>>>
>>> That's a fascinating and counter-intuitive result. I wonder if some
>>> internal synchronization is happening (maybe the stream between them is
>>> a shared, lock-protected data structure) that slows down the two threads
>>> in the CONTAINER_LOCAL case. If they are both going as fast as possible,
>>> it is likely that they will be frequently blocked by the lock. If that
>>> is indeed the case, some sort of lock striping or a near-lockless
>>> protocol for stream access should tilt the balance in favor of
>>> CONTAINER_LOCAL.
>>>
>>> In the thread-local case, of course, there is no need for such locking.
>>>
>>> Ram
>>>
>>> On Sun, Sep 27, 2015 at 12:17 PM, Vlad Rozov <[email protected]> wrote:
>>>
>>>> Changed subject to reflect the shift of discussion.
>>>>
>>>> After I recompiled netlet and hardcoded a 0 wait time in the
>>>> CircularBuffer.put() method, I still see the same difference, even
>>>> when I increased operator memory to 10 GB and set
>>>> "-D dt.application.*.operator.*.attr.SPIN_MILLIS=0
>>>> -D dt.application.*.operator.*.attr.QUEUE_CAPACITY=1024000". CPU %
>>>> is close to 100% for both thread-local and container-local locality
>>>> settings. Note that in thread-local the two operators share 100% CPU,
>>>> while in container-local each gets its own 100% load. It sounds like
>>>> container-local will outperform thread-local only when the number of
>>>> emitted tuples is (relatively) low, for example when it is CPU-costly
>>>> to produce tuples (hash computations, compression/decompression,
>>>> aggregations, filtering with complex expressions). In cases where an
>>>> operator may emit 5 or more million tuples per second, thread-local may
>>>> outperform container-local even when both operators are CPU intensive.
>>>>
>>>> Thank you,
>>>>
>>>> Vlad
>>>>
>>>> On 9/26/15 22:52, Timothy Farkas wrote:
>>>>
>>>>> Hi Vlad,
>>>>>
>>>>> I just took a look at the CircularBuffer. Why are threads polling the
>>>>> state of the buffer before doing operations? Couldn't polling be
>>>>> avoided entirely by using something like Condition variables to signal
>>>>> when the buffer is ready for an operation to be performed?
>>>>>
>>>>> Tim
>>>>>
>>>>> On Sat, Sep 26, 2015 at 10:42 PM, Vlad Rozov <[email protected]> wrote:
>>>>>
>>>>>> After looking at a few stack traces I think that in the benchmark
>>>>>> application the operators compete for the circular buffer that passes
>>>>>> slices from the emitter output to the consumer input, and the sleeps
>>>>>> that avoid busy waiting are too long for the benchmark operators. I
>>>>>> don't see a stack similar to the one below every time I take a thread
>>>>>> dump, but still often enough to suspect that the sleep is the root
>>>>>> cause. I'll recompile with a smaller sleep time and see how this
>>>>>> affects performance.
>>>>>>
>>>>>> ----
>>>>>> "1/wordGenerator:RandomWordInputModule" prio=10 tid=0x00007f78c8b8c000
>>>>>> nid=0x780f waiting on condition [0x00007f78abb17000]
>>>>>>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>>>>>>         at java.lang.Thread.sleep(Native Method)
>>>>>>         at com.datatorrent.netlet.util.CircularBuffer.put(CircularBuffer.java:182)
>>>>>>         at com.datatorrent.stram.stream.InlineStream.put(InlineStream.java:79)
>>>>>>         at com.datatorrent.stram.stream.MuxStream.put(MuxStream.java:117)
>>>>>>         at com.datatorrent.api.DefaultOutputPort.emit(DefaultOutputPort.java:48)
>>>>>>         at com.datatorrent.benchmark.RandomWordInputModule.emitTuples(RandomWordInputModule.java:108)
>>>>>>         at com.datatorrent.stram.engine.InputNode.run(InputNode.java:115)
>>>>>>         at com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1377)
>>>>>>
>>>>>> "2/counter:WordCountOperator" prio=10 tid=0x00007f78c8c98800 nid=0x780d
>>>>>> waiting on condition [0x00007f78abc18000]
>>>>>>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>>>>>>         at java.lang.Thread.sleep(Native Method)
>>>>>>         at com.datatorrent.stram.engine.GenericNode.run(GenericNode.java:519)
>>>>>>         at com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1377)
>>>>>> ----
>>>>>>
>>>>>> On 9/26/15 20:59, Amol Kekre wrote:
>>>>>>
>>>>>>> A good read -
>>>>>>> http://preshing.com/20111118/locks-arent-slow-lock-contention-is/
>>>>>>>
>>>>>>> Though it does not explain an order-of-magnitude difference.
>>>>>>>
>>>>>>> Amol
>>>>>>>
>>>>>>> On Sat, Sep 26, 2015 at 4:25 PM, Vlad Rozov <[email protected]> wrote:
>>>>>>>
>>>>>>>> In the benchmark test THREAD_LOCAL outperforms CONTAINER_LOCAL by an
>>>>>>>> order of magnitude, even though both operators compete for CPU. I'll
>>>>>>>> take a closer look at why.
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>>
>>>>>>>> Vlad
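To make the structure under discussion concrete: per Vlad's description
earlier in the thread, the stream between container-local operators is backed
by a bounded single-producer/single-consumer queue that tracks its read and
write positions with volatile fields and falls back to Thread.sleep() when the
buffer is full -- the TIMED_WAITING frames in the stack trace above. The
following is a much-simplified sketch of that pattern, not the actual
com.datatorrent.netlet.util.CircularBuffer; the names and the sleep interval
are made up.

----
import java.util.concurrent.TimeUnit;

// One producer thread calls put(), one consumer thread calls poll().
// head/tail are volatile so each thread sees the other's progress without
// locks; a full buffer is handled by sleeping, which is what shows up as
// TIMED_WAITING in the stack trace.
public class SimpleSpscBuffer<T>
{
  private final Object[] buffer;
  private final int capacity;
  private volatile long head; // next slot to read, advanced only by the consumer
  private volatile long tail; // next slot to write, advanced only by the producer

  public SimpleSpscBuffer(int capacity)
  {
    this.capacity = capacity;
    this.buffer = new Object[capacity];
  }

  /** Producer side: sleep while the buffer is full, then publish the item. */
  public void put(T item) throws InterruptedException
  {
    while (tail - head >= capacity) {
      TimeUnit.MILLISECONDS.sleep(10); // the kind of wait Vlad hardcoded to 0 above
    }
    buffer[(int)(tail % capacity)] = item;
    tail++; // volatile write publishes the item to the consumer
  }

  /** Consumer side: returns null if the buffer is currently empty. */
  @SuppressWarnings("unchecked")
  public T poll()
  {
    if (head >= tail) {
      return null;
    }
    int slot = (int)(head % capacity);
    T item = (T)buffer[slot];
    buffer[slot] = null; // let the slot contents be garbage collected
    head++;              // volatile write frees the slot for the producer
    return item;
  }
}
----

Tim's suggestion amounts to replacing the sleep with a
java.util.concurrent.locks.Condition (or simply an ArrayBlockingQueue) so the
producer blocks until the consumer signals free space; that removes the fixed
sleep latency but reintroduces lock hand-off on every wakeup, which is the
cost the lock-free designs in the Martin Thompson slides try to avoid.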
>>>>>>>> On 9/26/15 14:52, Thomas Weise wrote:
>>>>>>>>
>>>>>>>>> THREAD_LOCAL - operators share a thread
>>>>>>>>> CONTAINER_LOCAL - each operator has its own thread
>>>>>>>>>
>>>>>>>>> So as long as the operators utilize the CPU sufficiently (compete),
>>>>>>>>> the latter will perform better.
>>>>>>>>>
>>>>>>>>> There will be cases where a single thread can accommodate multiple
>>>>>>>>> operators. For example, a socket reader (mostly waiting for IO) and
>>>>>>>>> a decompress operator (CPU-hungry) can share a thread.
>>>>>>>>>
>>>>>>>>> But to get back to the original question, stream locality generally
>>>>>>>>> does not reduce the total memory requirement. If you add multiple
>>>>>>>>> operators into one container, that container will also require more
>>>>>>>>> memory, and that's how the container size is calculated in the
>>>>>>>>> physical plan. You may get some extra mileage when multiple
>>>>>>>>> operators share the same heap, but the need to identify the memory
>>>>>>>>> requirement per operator does not go away.
>>>>>>>>>
>>>>>>>>> Thomas
>>>>>>>>>
>>>>>>>>> On Sat, Sep 26, 2015 at 12:41 PM, Munagala Ramanath <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Would CONTAINER_LOCAL achieve the same thing and perform a little
>>>>>>>>>> better on a multi-core box?
>>>>>>>>>>
>>>>>>>>>> Ram
>>>>>>>>>>
>>>>>>>>>> On Sat, Sep 26, 2015 at 12:18 PM, Sandeep Deshmukh <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes, with this approach only two containers are required: one for
>>>>>>>>>>> stram and another for all the operators. You can easily fit around
>>>>>>>>>>> 10 operators in less than 1 GB.
>>>>>>>>>>>
>>>>>>>>>>> On 27 Sep 2015 00:32, "Timothy Farkas" <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Ram,
>>>>>>>>>>>>
>>>>>>>>>>>> You could make all the operators thread-local. This cuts down on
>>>>>>>>>>>> the overhead of separate containers and maximizes the memory
>>>>>>>>>>>> available to each operator.
>>>>>>>>>>>>
>>>>>>>>>>>> Tim
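Purely as an illustration of the locality settings being compared in this
thread: locality and per-operator memory can also be set in the application
code rather than in XML. Below is a rough sketch against the DataTorrent API
of that era, reusing the two benchmark operators named in the stack trace
above; the port field names (output, input), the WordCountOperator package,
and the exact attribute setter are assumptions and may differ by version.

----
import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.Context.OperatorContext;
import com.datatorrent.api.DAG;
import com.datatorrent.api.DAG.Locality;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.benchmark.RandomWordInputModule;
import com.datatorrent.benchmark.WordCountOperator;

public class LocalityBenchmarkApp implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    RandomWordInputModule words = dag.addOperator("wordGenerator", new RandomWordInputModule());
    WordCountOperator counter = dag.addOperator("counter", new WordCountOperator());

    // THREAD_LOCAL: both operators run in one thread, no queue between them.
    // CONTAINER_LOCAL: one process, one thread per operator, tuples handed
    // over through the in-memory circular buffer discussed above.
    dag.addStream("words", words.output, counter.input)
        .setLocality(Locality.THREAD_LOCAL);

    // Per-operator memory, the programmatic form of the MEMORY_MB properties
    // in Ram's original message quoted below.
    dag.setAttribute(counter, OperatorContext.MEMORY_MB, 512);
  }
}
----

Ram's MASTER_MEMORY_MB / MEMORY_MB properties quoted below achieve the same
per-operator sizing through configuration, without recompiling the application.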
>>>>>>>>>>>> On Sat, Sep 26, 2015 at 10:07 AM, Munagala Ramanath <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I was running into memory issues when deploying my app on the
>>>>>>>>>>>>> sandbox: all the operators were stuck forever in the PENDING
>>>>>>>>>>>>> state because they were being continually aborted and restarted
>>>>>>>>>>>>> due to the limited memory on the sandbox. After some
>>>>>>>>>>>>> experimentation, I found that the following config values seem
>>>>>>>>>>>>> to work:
>>>>>>>>>>>>> <https://datatorrent.slack.com/archives/engineering/p1443263607000010>
>>>>>>>>>>>>>
>>>>>>>>>>>>> ------------------------------------------
>>>>>>>>>>>>> <property>
>>>>>>>>>>>>>   <name>dt.attr.MASTER_MEMORY_MB</name>
>>>>>>>>>>>>>   <value>500</value>
>>>>>>>>>>>>> </property>
>>>>>>>>>>>>> <property>
>>>>>>>>>>>>>   <name>dt.application.*.operator.*.attr.MEMORY_MB</name>
>>>>>>>>>>>>>   <value>200</value>
>>>>>>>>>>>>> </property>
>>>>>>>>>>>>> <property>
>>>>>>>>>>>>>   <name>dt.application.TopNWordsWithQueries.operator.fileWordCount.attr.MEMORY_MB</name>
>>>>>>>>>>>>>   <value>512</value>
>>>>>>>>>>>>> </property>
>>>>>>>>>>>>> ------------------------------------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> Are these reasonable values? Is there a more systematic way of
>>>>>>>>>>>>> coming up with these values than trial-and-error? Most of my
>>>>>>>>>>>>> operators -- with the exception of fileWordCount -- need very
>>>>>>>>>>>>> little memory; is there a way to cut all values down to the bare
>>>>>>>>>>>>> minimum and maximize the available memory for this one operator?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ram
