Also sharing a diff: https://github.com/DataTorrent/Netlet/compare/master...ilooner:condVarBuffer

Thanks,
Tim
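For readers following along, the sketch below shows roughly what signaling with condition variables instead of sleep-based polling can look like. It is illustrative only: the class and method names are invented here, and this is not necessarily what the linked branch actually changes.

----
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Minimal sketch of a blocking circular buffer that uses Condition
// variables to signal readiness instead of sleeping in a poll loop.
// Hypothetical names; not the actual Netlet CircularBuffer.
public class BlockingCircularBuffer<T> {
  private final Object[] buffer;
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition notFull = lock.newCondition();
  private final Condition notEmpty = lock.newCondition();
  private int head, tail, count;

  public BlockingCircularBuffer(int capacity) {
    buffer = new Object[capacity];
  }

  public void put(T item) throws InterruptedException {
    lock.lock();
    try {
      while (count == buffer.length) {
        notFull.await();          // block until the consumer frees a slot
      }
      buffer[tail] = item;
      tail = (tail + 1) % buffer.length;
      count++;
      notEmpty.signal();          // wake a waiting consumer
    } finally {
      lock.unlock();
    }
  }

  @SuppressWarnings("unchecked")
  public T take() throws InterruptedException {
    lock.lock();
    try {
      while (count == 0) {
        notEmpty.await();         // block until the producer adds an item
      }
      T item = (T) buffer[head];
      buffer[head] = null;
      head = (head + 1) % buffer.length;
      count--;
      notFull.signal();           // wake a waiting producer
      return item;
    } finally {
      lock.unlock();
    }
  }
}
----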
On Mon, Sep 28, 2015 at 10:07 AM, Timothy Farkas <[email protected]> wrote:

Hi Vlad,

Could you share your benchmarking applications? I'd like to test a change I made to the CircularBuffer:

https://github.com/ilooner/Netlet/blob/condVarBuffer/src/main/java/com/datatorrent/netlet/util/CircularBuffer.java

Thanks,
Tim

On Mon, Sep 28, 2015 at 9:56 AM, Pramod Immaneni <[email protected]> wrote:

Vlad, what was your mode of interaction/ordering between the two threads for the 3rd test?

On Mon, Sep 28, 2015 at 10:51 AM, Vlad Rozov <[email protected]> wrote:

I created a simple test to check how quickly Java can count to Integer.MAX_VALUE. The result that I see is consistent with the CONTAINER_LOCAL behavior:

counting long in a single thread: 0.9 sec
counting volatile long in a single thread: 17.7 sec
counting volatile long shared between two threads: 186.3 sec

I suggest that we look into https://qconsf.com/sf2012/dl/qcon-sanfran-2012/slides/MartinThompson_LockFreeAlgorithmsForUltimatePerformanceMOVEDTOBALLROOMA.pdf or a similar algorithm.

Thank you,

Vlad
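For reference, a minimal sketch of this kind of counting test. The thread does not show how Vlad's test splits the work between the two threads, so the half-and-half split below is an assumption, and a serious measurement would use a harness such as JMH to keep the JIT from optimizing the loops away.

----
// Rough sketch of the counting test described above: how long it takes
// to increment a counter Integer.MAX_VALUE times, as a plain long, as a
// volatile long, and as a volatile long shared between two threads.
public class CountingBenchmark {
  static long plainCounter;
  static volatile long volatileCounter;

  public static void main(String[] args) throws InterruptedException {
    long start = System.nanoTime();
    for (int i = 0; i < Integer.MAX_VALUE; i++) {
      plainCounter++;
    }
    report("plain long, single thread", start);

    start = System.nanoTime();
    for (int i = 0; i < Integer.MAX_VALUE; i++) {
      volatileCounter++;
    }
    report("volatile long, single thread", start);

    // Two threads each do half the increments on the shared volatile.
    // Note: volatileCounter++ is not atomic; the point here is only to
    // expose the cost of cross-core cache-line traffic, not correctness.
    volatileCounter = 0;
    Runnable half = () -> {
      for (int i = 0; i < Integer.MAX_VALUE / 2; i++) {
        volatileCounter++;
      }
    };
    start = System.nanoTime();
    Thread t1 = new Thread(half);
    Thread t2 = new Thread(half);
    t1.start();
    t2.start();
    t1.join();
    t2.join();
    report("volatile long, two threads", start);
  }

  static void report(String label, long startNanos) {
    System.out.printf("%s: %.1f sec%n", label, (System.nanoTime() - startNanos) / 1e9);
  }
}
----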
On 9/28/15 08:19, Vlad Rozov wrote:

Ram,

The stream between operators in the CONTAINER_LOCAL case is InlineStream. InlineStream extends DefaultReservoir, which extends CircularBuffer. CircularBuffer does not use synchronized methods or locks; it uses volatile. I guess that using volatile causes CPU cache invalidation, and along with memory locality (in the thread local case a tuple is always local to both threads, while in the container local case the second operator's thread may see data significantly later than when the first thread produced it) these two factors negatively impact CONTAINER_LOCAL performance. It is still quite surprising that the impact is so significant.

Thank you,

Vlad
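To give a flavor of the lock-free designs in the slides Vlad references above: in a single-producer/single-consumer ring buffer, each side can keep a cached copy of the other side's index so the volatile counters are touched far less often, which is exactly the cross-core cache traffic discussed here. The sketch below assumes one producer thread and one consumer thread; names are invented, and a production version would also pad the fields to avoid false sharing.

----
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a lock-free single-producer/single-consumer ring buffer.
// Capacity must be a power of two so the index math can use a mask.
public class SpscRingBuffer<T> {
  private final Object[] buffer;
  private final int mask;
  private final AtomicLong head = new AtomicLong(); // consumer index
  private final AtomicLong tail = new AtomicLong(); // producer index
  private long cachedHead; // producer's cached view of head
  private long cachedTail; // consumer's cached view of tail

  public SpscRingBuffer(int capacityPowerOfTwo) {
    buffer = new Object[capacityPowerOfTwo];
    mask = capacityPowerOfTwo - 1;
  }

  // Producer thread only. Returns false when the buffer is full.
  public boolean offer(T item) {
    long t = tail.get();
    if (t - cachedHead >= buffer.length) {
      cachedHead = head.get();            // refresh from the volatile index
      if (t - cachedHead >= buffer.length) {
        return false;                     // really full
      }
    }
    buffer[(int) t & mask] = item;
    tail.lazySet(t + 1);                  // ordered store, cheaper than a volatile write
    return true;
  }

  // Consumer thread only. Returns null when the buffer is empty.
  @SuppressWarnings("unchecked")
  public T poll() {
    long h = head.get();
    if (h >= cachedTail) {
      cachedTail = tail.get();            // refresh from the volatile index
      if (h >= cachedTail) {
        return null;                      // really empty
      }
    }
    T item = (T) buffer[(int) h & mask];
    buffer[(int) h & mask] = null;
    head.lazySet(h + 1);
    return item;
  }
}
----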
On 9/27/15 16:45, Munagala Ramanath wrote:

Vlad,

That's a fascinating and counter-intuitive result. I wonder if some internal synchronization is happening (maybe the stream between them is a shared data structure that is lock protected) to slow down the 2 threads in the CONTAINER_LOCAL case. If they are both going as fast as possible, it is likely that they will be frequently blocked by the lock. If that is indeed the case, some sort of lock striping or a near-lockless protocol for stream access should tilt the balance in favor of CONTAINER_LOCAL.

In the thread-local case, of course, there is no need for such locking.

Ram

On Sun, Sep 27, 2015 at 12:17 PM, Vlad Rozov <[email protected]> wrote:

Changed subject to reflect the shift of discussion.

After I recompiled netlet and hardcoded a 0 wait time in the CircularBuffer.put() method, I still see the same difference, even when I increased operator memory to 10 GB and set "-D dt.application.*.operator.*.attr.SPIN_MILLIS=0 -D dt.application.*.operator.*.attr.QUEUE_CAPACITY=1024000". CPU % is close to 100% for both the thread local and container local locality settings. Note that in thread local the two operators share 100% of one CPU, while in container local each gets its own 100% load. It sounds like container local will outperform thread local only when the number of emitted tuples is (relatively) low, for example when it is CPU costly to produce tuples (hash computations, compression/decompression, aggregations, filtering with complex expressions). In cases where an operator may emit 5 million or more tuples per second, thread local may outperform container local even when both operators are CPU intensive.

Thank you,

Vlad

On 9/26/15 22:52, Timothy Farkas wrote:

Hi Vlad,

I just took a look at the CircularBuffer. Why are threads polling the state of the buffer before doing operations? Couldn't polling be avoided entirely by using something like Condition variables to signal when the buffer is ready for an operation to be performed?

Tim

On Sat, Sep 26, 2015 at 10:42 PM, Vlad Rozov <[email protected]> wrote:

After looking at a few stack traces, I think that in the benchmark application the operators compete for the circular buffer that passes slices from the emitter output to the consumer input, and the sleeps that avoid busy waiting are too long for the benchmark operators. I don't see a stack similar to the one below every time I take a thread dump, but still often enough to suspect that the sleep is the root cause. I'll recompile with a smaller sleep time and see how that affects performance.

----
"1/wordGenerator:RandomWordInputModule" prio=10 tid=0x00007f78c8b8c000 nid=0x780f waiting on condition [0x00007f78abb17000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at com.datatorrent.netlet.util.CircularBuffer.put(CircularBuffer.java:182)
        at com.datatorrent.stram.stream.InlineStream.put(InlineStream.java:79)
        at com.datatorrent.stram.stream.MuxStream.put(MuxStream.java:117)
        at com.datatorrent.api.DefaultOutputPort.emit(DefaultOutputPort.java:48)
        at com.datatorrent.benchmark.RandomWordInputModule.emitTuples(RandomWordInputModule.java:108)
        at com.datatorrent.stram.engine.InputNode.run(InputNode.java:115)
        at com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1377)

"2/counter:WordCountOperator" prio=10 tid=0x00007f78c8c98800 nid=0x780d waiting on condition [0x00007f78abc18000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at com.datatorrent.stram.engine.GenericNode.run(GenericNode.java:519)
        at com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1377)
----

On 9/26/15 20:59, Amol Kekre wrote:

A good read - http://preshing.com/20111118/locks-arent-slow-lock-contention-is/

Though it does not explain an order-of-magnitude difference.

Amol

On Sat, Sep 26, 2015 at 4:25 PM, Vlad Rozov <[email protected]> wrote:

In the benchmark test THREAD_LOCAL outperforms CONTAINER_LOCAL by an order of magnitude, and both operators compete for CPU. I'll take a closer look at why.

Thank you,

Vlad

On 9/26/15 14:52, Thomas Weise wrote:

THREAD_LOCAL - operators share a thread
CONTAINER_LOCAL - each operator has its own thread

So as long as the operators utilize the CPU sufficiently (compete), the latter will perform better.

There will be cases where a single thread can accommodate multiple operators. For example, a socket reader (mostly waiting for IO) and a decompress (CPU hungry) can share a thread.

But to get back to the original question, stream locality generally does not reduce the total memory requirement. If you add multiple operators into one container, that container will also require more memory, and that's how the container size is calculated in the physical plan. You may get some extra mileage when multiple operators share the same heap, but the need to identify the memory requirement per operator does not go away.

Thomas
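To make the two localities concrete, here is a minimal sketch of how a stream's locality is selected when wiring a DAG. The stand-in operators are invented for illustration (they are not the benchmark operators from the thread), and package names may vary by release.

----
import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.DAG;
import com.datatorrent.api.DAG.Locality;
import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.InputOperator;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.common.util.BaseOperator;

public class LocalitySketch implements StreamingApplication
{
  // Trivial stand-in input operator.
  public static class WordSource extends BaseOperator implements InputOperator
  {
    public final transient DefaultOutputPort<String> output = new DefaultOutputPort<>();

    @Override
    public void emitTuples()
    {
      output.emit("word");
    }
  }

  // Trivial stand-in downstream operator.
  public static class WordSink extends BaseOperator
  {
    public final transient DefaultInputPort<String> input = new DefaultInputPort<String>()
    {
      @Override
      public void process(String tuple)
      {
        // consume the tuple
      }
    };
  }

  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    WordSource source = dag.addOperator("wordGenerator", new WordSource());
    WordSink sink = dag.addOperator("counter", new WordSink());
    // THREAD_LOCAL: both operators share one thread.
    // CONTAINER_LOCAL: same JVM, separate threads joined by the
    // in-memory stream (InlineStream over CircularBuffer).
    dag.addStream("words", source.output, sink.input).setLocality(Locality.THREAD_LOCAL);
  }
}
----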
On Sat, Sep 26, 2015 at 12:41 PM, Munagala Ramanath <[email protected]> wrote:

Would CONTAINER_LOCAL achieve the same thing and perform a little better on a multi-core box?

Ram

On Sat, Sep 26, 2015 at 12:18 PM, Sandeep Deshmukh <[email protected]> wrote:

Yes, with this approach only two containers are required: one for the stram and another for all the operators. You can easily fit around 10 operators in less than 1GB.

On 27 Sep 2015 00:32, "Timothy Farkas" <[email protected]> wrote:

Hi Ram,

You could make all the operators thread local. This cuts down on the overhead of separate containers and maximizes the memory available to each operator.

Tim

On Sat, Sep 26, 2015 at 10:07 AM, Munagala Ramanath <[email protected]> wrote:

Hi,

I was running into memory issues when deploying my app on the sandbox: all the operators were stuck forever in the PENDING state because they were being continually aborted and restarted due to the limited memory on the sandbox. After some experimentation, I found that the following config values seem to work:

------------------------------------------
https://datatorrent.slack.com/archives/engineering/p1443263607000010

<property>
  <name>dt.attr.MASTER_MEMORY_MB</name>
  <value>500</value>
</property>
<property>
  <name>dt.application.*.operator.*.attr.MEMORY_MB</name>
  <value>200</value>
</property>
<property>
  <name>dt.application.TopNWordsWithQueries.operator.fileWordCount.attr.MEMORY_MB</name>
  <value>512</value>
</property>
------------------------------------------------

Are these reasonable values? Is there a more systematic way of coming up with these values than trial-and-error? Most of my operators -- with the exception of fileWordCount -- need very little memory; is there a way to cut all values down to the bare minimum and maximize the available memory for this one operator?

Thanks.

Ram
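One possible pattern for that last question, sketched below with illustrative values in the same property style as above: set a bare-minimum wildcard default for every operator, then override only the memory-hungry one. The minimum workable default depends on the operators and has to be found empirically.

----
<!-- Low wildcard default for all operators in all applications
     (value is illustrative; too low and containers will be killed). -->
<property>
  <name>dt.application.*.operator.*.attr.MEMORY_MB</name>
  <value>128</value>
</property>
<!-- Targeted override for the one operator that needs more. -->
<property>
  <name>dt.application.TopNWordsWithQueries.operator.fileWordCount.attr.MEMORY_MB</name>
  <value>512</value>
</property>
----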
