Would it show any improvement in the case where the containers are on different nodes?
On Tue, Sep 29, 2015 at 7:17 PM, Vlad Rozov <[email protected]> wrote:

By changing QUEUE_CAPACITY to 1200000 I can get around 62 million tuples per second for the case where wordGenerator emits the same tuple, and 34 million when it generates a new tuple each time.

Thank you,
Vlad

On 9/29/15 17:08, Vlad Rozov wrote:

3 million for container local and 55 million for thread local.

Thank you,
Vlad

On 9/29/15 16:57, Chetan Narsude wrote:

Vlad, what was the number without this fix?

-- Chetan

On Tue, Sep 29, 2015 at 4:48 PM, Vlad Rozov <[email protected]> wrote:

I did a quick prototype that uses the JCTools SPSC bounded queue (http://jctools.github.io/JCTools) instead of CircularBuffer. For container local I now see 13 million tuples per second.

Thank you,
Vlad

On 9/28/15 12:58, Chetan Narsude wrote:

Let me shed some light on THREAD_LOCAL and CONTAINER_LOCAL.

THREAD_LOCAL at the core is nothing but a function call. When an operator does emit(tuple), the call is translated into the downstream port's process(tuple), which is invoked immediately in the same thread. So obviously the performance is going to be a lot faster; the only work happening in between is setting up the stack and invoking the function.

With CONTAINER_LOCAL there is a producer thread and a consumer thread: the producer produces (emit(tuple)) and the consumer consumes (process(tuple)). This scheme is most efficient when the rate at which the producer produces equals the rate at which the consumer consumes. Often that is not the case, so we put a bounded memory buffer in between (the implementation is CircularBuffer). In addition to everything THREAD_LOCAL does, the CONTAINER_LOCAL pattern has to manage the circular buffer *and* pay for thread context switches. The most expensive part of a context switch is memory synchronization. As you all have pointed out how expensive it is to use volatile, I need not get into details of how expensive memory synchronization can get.

Long story short: no matter which pattern you use, once more than one thread is involved there are memory synchronization penalties that are unavoidable and slow things down considerably. In 2012 I benchmarked atomic, volatile, and synchronized (I think there are unit tests for it) and found volatile to be the least expensive at that time. Synchronized was not far behind (it is very efficient when the contention is among a single-digit number of threads). Not sure how those benchmarks would look today, but you get the idea.

In a data-intensive app most of the time is spent in IO and there is a lot of CPU idling at individual operators, so you will not see the difference when you change CONTAINER_LOCAL to THREAD_LOCAL; you will, however, see some memory savings, since you take away the intermediate memory buffer *and* the delayed garbage collection of the objects held by that buffer.

Recommendation: do not bother with these micro-optimizations unless you notice a problem. Use THREAD_LOCAL for processing low-throughput/infrequent streams. Use CONTAINER_LOCAL to avoid serialization/deserialization of objects. Leave the rest to the platform; I expect that as it matures it will make most of these decisions automatically.

HTH.

-- Chetan
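In code, the difference Chetan describes comes down to roughly the following. This is a minimal sketch, not the Apex implementation: Sink, InlineEmitter, and QueuedEmitter are made-up names, and java.util.concurrent.ArrayBlockingQueue stands in for Netlet's CircularBuffer.

----
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class LocalityHandoffSketch
{
  interface Sink { void process(Object tuple); }

  // THREAD_LOCAL: emit(tuple) is just a method call into the downstream port,
  // executed immediately on the same thread; no synchronization is involved.
  static final class InlineEmitter
  {
    private final Sink downstream;
    InlineEmitter(Sink downstream) { this.downstream = downstream; }
    void emit(Object tuple) { downstream.process(tuple); }
  }

  // CONTAINER_LOCAL: the producer thread hands tuples to the consumer thread
  // through a bounded buffer; every hand-off pays for memory synchronization.
  static final class QueuedEmitter
  {
    private final BlockingQueue<Object> buffer = new ArrayBlockingQueue<>(1024);

    void emit(Object tuple) throws InterruptedException
    {
      buffer.put(tuple); // called from the producer thread; blocks when full
    }

    void pump(Sink downstream) throws InterruptedException
    {
      while (true) {
        downstream.process(buffer.take()); // runs on the consumer thread
      }
    }
  }
}
----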
On Mon, Sep 28, 2015 at 11:44 AM, Vlad Rozov <[email protected]> wrote:

Hi Tim,

I use the benchmark application that is part of the Apache Malhar project. Please let me know if you need help with compiling or running the application.

Thank you,
Vlad

On 9/28/15 11:09, Timothy Farkas wrote:

Also sharing a diff:

https://github.com/DataTorrent/Netlet/compare/master...ilooner:condVarBuffer

Thanks,
Tim

On Mon, Sep 28, 2015 at 10:07 AM, Timothy Farkas <[email protected]> wrote:

Hi Vlad,

Could you share your benchmarking applications? I'd like to test a change I made to the CircularBuffer:

https://github.com/ilooner/Netlet/blob/condVarBuffer/src/main/java/com/datatorrent/netlet/util/CircularBuffer.java

Thanks,
Tim

On Mon, Sep 28, 2015 at 9:56 AM, Pramod Immaneni <[email protected]> wrote:

Vlad, what was your mode of interaction/ordering between the two threads for the 3rd test?

On Mon, Sep 28, 2015 at 10:51 AM, Vlad Rozov <[email protected]> wrote:

I created a simple test to check how quickly Java can count to Integer.MAX_VALUE. The result that I see is consistent with the CONTAINER_LOCAL behavior:

counting long in a single thread: 0.9 sec
counting volatile long in a single thread: 17.7 sec
counting volatile long shared between two threads: 186.3 sec

I suggest that we look into
https://qconsf.com/sf2012/dl/qcon-sanfran-2012/slides/MartinThompson_LockFreeAlgorithmsForUltimatePerformanceMOVEDTOBALLROOMA.pdf
or a similar algorithm.

Thank you,
Vlad
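A counting test along those lines looks roughly like the sketch below (a reconstruction, since Vlad's actual code is not in the thread). Note that incrementing a shared volatile from two threads is not atomic and loses counts; that does not matter here, because only the cost of the write traffic is being measured.

----
public class CountingBenchmark
{
  static long plainCount;
  static volatile long sharedCount;

  static void time(String label, Runnable body)
  {
    long start = System.nanoTime();
    body.run();
    System.out.printf("%s: %.1f sec%n", label, (System.nanoTime() - start) / 1e9);
  }

  public static void main(String[] args) throws InterruptedException
  {
    time("counting long in a single thread", () -> {
      for (int i = 0; i < Integer.MAX_VALUE; i++) { plainCount++; }
    });

    time("counting volatile long in a single thread", () -> {
      for (int i = 0; i < Integer.MAX_VALUE; i++) { sharedCount++; }
    });

    // Two threads hammering the same volatile: every write invalidates the
    // other core's cached copy of the line.
    sharedCount = 0;
    Thread a = new Thread(() -> { for (int i = 0; i < Integer.MAX_VALUE / 2; i++) { sharedCount++; } });
    Thread b = new Thread(() -> { for (int i = 0; i < Integer.MAX_VALUE / 2; i++) { sharedCount++; } });
    long start = System.nanoTime();
    a.start(); b.start();
    a.join(); b.join();
    System.out.printf("counting volatile long shared between two threads: %.1f sec%n",
        (System.nanoTime() - start) / 1e9);
  }
}
----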
On 9/28/15 08:19, Vlad Rozov wrote:

Ram,

The stream between operators in the CONTAINER_LOCAL case is InlineStream. InlineStream extends DefaultReservoir, which extends CircularBuffer. CircularBuffer does not use synchronized methods or locks; it uses volatile. I guess that using volatile causes CPU cache invalidation, and along with memory locality (in the thread local case a tuple is always local to both threads, while in the container local case the second operator's thread may see the data significantly later than the first thread produced it) these two factors negatively impact CONTAINER_LOCAL performance. It is still quite surprising that the impact is so significant.

Thank you,
Vlad

On 9/27/15 16:45, Munagala Ramanath wrote:

Vlad,

That's a fascinating and counter-intuitive result. I wonder if some internal synchronization is happening (maybe the stream between them is a shared data structure that is lock protected) to slow down the two threads in the CONTAINER_LOCAL case. If they are both going as fast as possible, it is likely that they will be frequently blocked by the lock. If that is indeed the case, some sort of lock striping or a near-lockless protocol for stream access should tilt the balance in favor of CONTAINER_LOCAL.

In the thread-local case, of course, there is no need for such locking.

Ram

On Sun, Sep 27, 2015 at 12:17 PM, Vlad Rozov <[email protected]> wrote:

Changed subject to reflect the shift of discussion.

After I recompiled netlet with a hardcoded 0 wait time in the CircularBuffer.put() method, I still see the same difference, even after increasing operator memory to 10 GB and setting "-Ddt.application.*.operator.*.attr.SPIN_MILLIS=0 -Ddt.application.*.operator.*.attr.QUEUE_CAPACITY=1024000". CPU % is close to 100% for both thread and container local locality settings. Note that in thread local the two operators share 100% CPU, while in container local each gets its own 100% load. It sounds like container local will outperform thread local only when the number of emitted tuples is (relatively) low, for example when it is CPU-costly to produce tuples (hash computations, compression/decompression, aggregations, filtering with complex expressions). In cases where an operator may emit 5 or more million tuples per second, thread local may outperform container local even when both operators are CPU intensive.

Thank you,
Vlad

On 9/26/15 22:52, Timothy Farkas wrote:

Hi Vlad,

I just took a look at the CircularBuffer. Why are threads polling the state of the buffer before doing operations? Couldn't polling be avoided entirely by using something like condition variables to signal when the buffer is ready for an operation to be performed?

Tim
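What Tim suggests is the classic bounded buffer built on java.util.concurrent.locks. A sketch under that assumption (not his actual diff, which is linked further down the thread):

----
import java.util.ArrayDeque;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class SignallingBuffer<T>
{
  private final ArrayDeque<T> queue = new ArrayDeque<>();
  private final int capacity;
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition notFull = lock.newCondition();
  private final Condition notEmpty = lock.newCondition();

  public SignallingBuffer(int capacity) { this.capacity = capacity; }

  public void put(T tuple) throws InterruptedException
  {
    lock.lock();
    try {
      while (queue.size() == capacity) {
        notFull.await(); // park until the consumer frees a slot; no sleep polling
      }
      queue.add(tuple);
      notEmpty.signal(); // wake a waiting consumer
    } finally {
      lock.unlock();
    }
  }

  public T take() throws InterruptedException
  {
    lock.lock();
    try {
      while (queue.isEmpty()) {
        notEmpty.await();
      }
      T tuple = queue.poll();
      notFull.signal(); // wake a waiting producer
      return tuple;
    } finally {
      lock.unlock();
    }
  }
}
----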
On Sat, Sep 26, 2015 at 10:42 PM, Vlad Rozov <[email protected]> wrote:

After looking at a few stack traces, I think that in the benchmark application the operators compete for the circular buffer that passes slices from the emitter output to the consumer input, and the sleeps that avoid busy wait are too long for the benchmark operators. I don't see a stack similar to the one below every time I take a thread dump, but still often enough to suspect that the sleep is the root cause. I'll recompile with a smaller sleep time and see how this affects performance.

----
"1/wordGenerator:RandomWordInputModule" prio=10 tid=0x00007f78c8b8c000 nid=0x780f waiting on condition [0x00007f78abb17000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at com.datatorrent.netlet.util.CircularBuffer.put(CircularBuffer.java:182)
        at com.datatorrent.stram.stream.InlineStream.put(InlineStream.java:79)
        at com.datatorrent.stram.stream.MuxStream.put(MuxStream.java:117)
        at com.datatorrent.api.DefaultOutputPort.emit(DefaultOutputPort.java:48)
        at com.datatorrent.benchmark.RandomWordInputModule.emitTuples(RandomWordInputModule.java:108)
        at com.datatorrent.stram.engine.InputNode.run(InputNode.java:115)
        at com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1377)

"2/counter:WordCountOperator" prio=10 tid=0x00007f78c8c98800 nid=0x780d waiting on condition [0x00007f78abc18000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at com.datatorrent.stram.engine.GenericNode.run(GenericNode.java:519)
        at com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1377)
----
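The trace shows the producer parked in Thread.sleep() inside CircularBuffer.put(). The pattern it points at looks roughly like the sketch below: a sleep-polling single-producer/single-consumer ring with volatile indices. This is a sketch of the pattern, not the actual Netlet source.

----
public class SleepingBuffer<T>
{
  private final Object[] slots;
  private volatile long head, tail; // volatile indices, no locks
  private final long sleepMillis;

  public SleepingBuffer(int capacity, long sleepMillis)
  {
    this.slots = new Object[capacity];
    this.sleepMillis = sleepMillis;
  }

  public void put(T tuple) throws InterruptedException
  {
    while (tail - head == slots.length) {
      Thread.sleep(sleepMillis); // the TIMED_WAITING state seen in the thread dump:
                                 // cheap on CPU, but costly at millions of tuples/sec
    }
    slots[(int)(tail % slots.length)] = tuple;
    tail++; // single producer, so a plain increment of the volatile is safe
  }

  @SuppressWarnings("unchecked")
  public T poll()
  {
    if (head == tail) {
      return null; // empty
    }
    T tuple = (T)slots[(int)(head % slots.length)];
    head++; // single consumer
    return tuple;
  }
}
----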
On 9/26/15 20:59, Amol Kekre wrote:

A good read: http://preshing.com/20111118/locks-arent-slow-lock-contention-is/

Though it does not explain an order of magnitude difference.

Amol

On Sat, Sep 26, 2015 at 4:25 PM, Vlad Rozov <[email protected]> wrote:

In the benchmark test THREAD_LOCAL outperforms CONTAINER_LOCAL by an order of magnitude, and both operators compete for CPU. I'll take a closer look at why.

Thank you,
Vlad

On 9/26/15 14:52, Thomas Weise wrote:

THREAD_LOCAL - operators share a thread.
CONTAINER_LOCAL - each operator has its own thread.

So as long as the operators utilize the CPU sufficiently (compete), the latter will perform better.

There will be cases where a single thread can accommodate multiple operators. For example, a socket reader (mostly waiting for IO) and a decompress (CPU hungry) can share a thread.

But to get back to the original question: stream locality does generally not reduce the total memory requirement. If you add multiple operators into one container, that container will also require more memory, and that's how the container size is calculated in the physical plan. You may get some extra mileage when multiple operators share the same heap, but the need to identify the memory requirement per operator does not go away.

Thomas
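For reference, the locality being compared throughout this thread is chosen per stream when the DAG is assembled. A sketch of a minimal application (WordGenerator and WordCounter are made-up stand-ins for the benchmark operators, and the BaseOperator import path may differ across Apex versions):

----
import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.DAG;
import com.datatorrent.api.DAG.Locality;
import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.InputOperator;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.common.util.BaseOperator;

public class LocalityBenchmarkApp implements StreamingApplication
{
  // Minimal stand-ins for the benchmark operators discussed in the thread.
  public static class WordGenerator extends BaseOperator implements InputOperator
  {
    public final transient DefaultOutputPort<String> output = new DefaultOutputPort<>();

    @Override
    public void emitTuples()
    {
      output.emit("word");
    }
  }

  public static class WordCounter extends BaseOperator
  {
    public long count;
    public final transient DefaultInputPort<String> input = new DefaultInputPort<String>()
    {
      @Override
      public void process(String tuple)
      {
        count++;
      }
    };
  }

  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    WordGenerator generator = dag.addOperator("wordGenerator", new WordGenerator());
    WordCounter counter = dag.addOperator("counter", new WordCounter());

    // THREAD_LOCAL shares one thread; CONTAINER_LOCAL gives each operator
    // its own thread inside one JVM; no locality means separate containers.
    dag.addStream("words", generator.output, counter.input)
       .setLocality(Locality.THREAD_LOCAL);
  }
}
----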
On Sat, Sep 26, 2015 at 12:41 PM, Munagala Ramanath <[email protected]> wrote:

Would CONTAINER_LOCAL achieve the same thing and perform a little better on a multi-core box?

Ram

On Sat, Sep 26, 2015 at 12:18 PM, Sandeep Deshmukh <[email protected]> wrote:

Yes, with this approach only two containers are required: one for stram and another for all the operators. You can easily fit around 10 operators in less than 1 GB.

On 27 Sep 2015 00:32, "Timothy Farkas" <[email protected]> wrote:

Hi Ram,

You could make all the operators thread local. This cuts down on the overhead of separate containers and maximizes the memory available to each operator.

Tim

On Sat, Sep 26, 2015 at 10:07 AM, Munagala Ramanath <[email protected]> wrote:

Hi,

I was running into memory issues when deploying my app on the sandbox: all the operators were stuck forever in the PENDING state because they were being continually aborted and restarted due to the limited memory on the sandbox. After some experimentation, I found that the following config values seem to work (see https://datatorrent.slack.com/archives/engineering/p1443263607000010):
------------------------------------------
<property>
  <name>dt.attr.MASTER_MEMORY_MB</name>
  <value>500</value>
</property>
<property>
  <name>dt.application.*.operator.*.attr.MEMORY_MB</name>
  <value>200</value>
</property>
<property>
  <name>dt.application.TopNWordsWithQueries.operator.fileWordCount.attr.MEMORY_MB</name>
  <value>512</value>
</property>
------------------------------------------

Are these reasonable values? Is there a more systematic way of coming up with these values than trial-and-error? Most of my operators -- with the exception of fileWordCount -- need very little memory; is there a way to cut all values down to the bare minimum and maximize the available memory for this one operator?

Thanks.

Ram
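The XML above already follows the workable pattern: a low wildcard default for every operator plus an override for the one memory-hungry operator. The same override can also be expressed in code when the DAG is assembled. A fragment, assuming the DAG.setAttribute(Operator, Attribute, value) overload and reusing Ram's operator name (FileWordCount is a hypothetical class here):

----
// Inside populateDAG(DAG dag, Configuration conf), after adding the operator:
FileWordCount fileWordCount = dag.addOperator("fileWordCount", new FileWordCount());

// Equivalent of the fileWordCount XML property above; the value is in MB.
dag.setAttribute(fileWordCount, Context.OperatorContext.MEMORY_MB, 512);
----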
