With MPMC I get up to 20M for container local. Let's sync up on doing this across nodes.

Thanks
On Wed, Sep 30, 2015 at 8:31 AM, Vlad Rozov <[email protected]> wrote:

Blog and presentation on the algorithms behind JCTools:

http://psy-lob-saw.blogspot.com/p/lock-free-queues.html
https://vimeo.com/100197431

Thank you, Vlad

On 9/29/15 21:14, Vlad Rozov wrote:

I guess yes; it should show an improvement every time there is consumer/producer contention on a resource from two different threads, so we should see improvements in the buffer server as well. The current prototype does not support containers on different nodes.

Thank you, Vlad

On 9/29/15 20:47, Pramod Immaneni wrote:

Would it show any improvement in the case where the containers are on different nodes?

On Tue, Sep 29, 2015 at 7:17 PM, Vlad Rozov <[email protected]> wrote:

By changing QUEUE_CAPACITY to 1200000 I can get around 62 million tuples for the case when wordGenerator emits the same tuple, and 34 million when it generates a new tuple each time.

Thank you, Vlad

On 9/29/15 17:08, Vlad Rozov wrote:

3 million for container local and 55 million for thread local.

Thank you, Vlad

On 9/29/15 16:57, Chetan Narsude wrote:

Vlad, what was the number without this fix?

--
Chetan

On Tue, Sep 29, 2015 at 4:48 PM, Vlad Rozov <[email protected]> wrote:

I did a quick prototype that uses the http://jctools.github.io/JCTools SPSC bounded queue instead of CircularBuffer. For container local I now see 13 million tuples per second.

Thank you, Vlad
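A minimal sketch of what swapping a JCTools bounded queue in for CircularBuffer can look like, assuming JCTools is on the classpath. The wrapper class and its method names are assumptions for illustration, not the prototype's actual integration code; SpscArrayQueue and MpmcArrayQueue are real JCTools classes, the latter being the multi-producer variant mentioned at the top of this thread.

----
import java.util.Queue;

import org.jctools.queues.MpmcArrayQueue;
import org.jctools.queues.SpscArrayQueue;

// Hypothetical wrapper standing in for the reservoir integration.
public class JcToolsReservoirSketch {

    private final Queue<Object> queue;

    JcToolsReservoirSketch(boolean multiProducer, int capacity) {
        // SPSC is only valid with exactly one producer thread and one
        // consumer thread (the inline-stream case). The MPMC variant
        // relaxes that constraint at some cost per operation.
        this.queue = multiProducer
            ? new MpmcArrayQueue<>(capacity)
            : new SpscArrayQueue<>(capacity);
    }

    void put(Object tuple) {
        // offer() returns false when the bounded queue is full; a real
        // integration would spin, back off, or block here.
        while (!queue.offer(tuple)) {
            Thread.yield();
        }
    }

    Object sweep() {
        return queue.poll(); // null when the queue is empty
    }
}
----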
On 9/28/15 12:58, Chetan Narsude wrote:

Let me shed some light on THREAD_LOCAL and CONTAINER_LOCAL.

THREAD_LOCAL at the core is nothing but a function call. When an operator does emit(tuple), it is translated into the downstream port's process(tuple) call, which is invoked immediately in the same thread. So obviously the performance is going to be a lot faster. The only thing happening in between is setting up the stack and invoking the function.

With CONTAINER_LOCAL there is a producer thread and a consumer thread involved. The producer produces (emit(tuple)) and the consumer consumes (process(tuple)). This scheme is most efficient when the rate at which the producer produces equals the rate at which the consumer consumes. Often that's not the case, so we have a bounded memory buffer in between (the implementation is CircularBuffer). Now, in addition to the things that THREAD_LOCAL does, the CONTAINER_LOCAL pattern requires managing the circular buffer *and* thread context switches. The most expensive part of a thread context switch is the memory synchronization. As you have all pointed out how expensive it is to use volatile, I need not get into details of how expensive memory synchronization can get.

Long story short: no matter which pattern you use, when you use more than one thread there are certain memory synchronization penalties which are unavoidable and slow things down considerably. In 2012 I benchmarked atomic, volatile, and synchronized, and for that benchmark (I think there are unit tests for it) I found volatile to be the least expensive at the time. Synchronized was not too far behind (it is very efficient when the contention is likely to be among a single-digit number of threads). Not sure how those benchmarks would look today, but you get the idea.

In a data-intensive app, most of the time is spent in IO and there is a lot of CPU idling at individual operators, so you will not see the difference when you change CONTAINER_LOCAL to THREAD_LOCAL; yet you will see some memory savings, since you are taking away the intermediate memory-based buffer *and* the delayed garbage collection of the objects held by that buffer.

Recommendation: do not bother with these micro-optimizations unless you notice a problem. Use THREAD_LOCAL for processing low-throughput/infrequent streams. Use CONTAINER_LOCAL to avoid serialization/deserialization of objects. Leave the rest to the platform; I expect that as it matures it will make most of these decisions automatically.

HTH.

--
Chetan

On Mon, Sep 28, 2015 at 11:44 AM, Vlad Rozov <[email protected]> wrote:

Hi Tim,

I use the benchmark application that is part of the Apache Malhar project. Please let me know if you need help with compiling or running the application.

Thank you, Vlad

On 9/28/15 11:09, Timothy Farkas wrote:

Also sharing a diff:

https://github.com/DataTorrent/Netlet/compare/master...ilooner:condVarBuffer

Thanks,
Tim

On Mon, Sep 28, 2015 at 10:07 AM, Timothy Farkas <[email protected]> wrote:

Hi Vlad,

Could you share your benchmarking applications? I'd like to test a change I made to the CircularBuffer:

https://github.com/ilooner/Netlet/blob/condVarBuffer/src/main/java/com/datatorrent/netlet/util/CircularBuffer.java

Thanks,
Tim

On Mon, Sep 28, 2015 at 9:56 AM, Pramod Immaneni <[email protected]> wrote:

Vlad, what was your mode of interaction/ordering between the two threads for the 3rd test?

On Mon, Sep 28, 2015 at 10:51 AM, Vlad Rozov <[email protected]> wrote:

I created a simple test to check how quickly Java can count to Integer.MAX_VALUE. The result that I see is consistent with the CONTAINER_LOCAL behavior:

  counting long in a single thread: 0.9 sec
  counting volatile long in a single thread: 17.7 sec
  counting volatile long shared between two threads: 186.3 sec

I suggest that we look into
https://qconsf.com/sf2012/dl/qcon-sanfran-2012/slides/MartinThompson_LockFreeAlgorithmsForUltimatePerformanceMOVEDTOBALLROOMA.pdf
or a similar algorithm.

Thank you, Vlad
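The actual test code is not shown in the thread; the following is a sketch of the kind of counting test described above, with made-up class and field names. A harness such as JMH would give more trustworthy numbers than raw loops like these, which are exposed to JIT warm-up effects.

----
// Reconstruction for illustration only, not Vlad's code.
public class VolatileCountBenchmark {
    static long plain;
    static volatile long shared;

    public static void main(String[] args) throws InterruptedException {
        long start = System.nanoTime();
        for (int i = 0; i < Integer.MAX_VALUE; i++) {
            plain++; // plain field, single thread
        }
        report("long, single thread", start);

        start = System.nanoTime();
        for (int i = 0; i < Integer.MAX_VALUE; i++) {
            shared++; // volatile field: a fenced store on every increment
        }
        report("volatile long, single thread", start);

        // Two threads bouncing the same cache line. Note that volatile
        // increments are not atomic, so this measures contention cost,
        // not a correct shared counter.
        shared = 0;
        Runnable half = () -> {
            for (int i = 0; i < Integer.MAX_VALUE / 2; i++) {
                shared++;
            }
        };
        Thread t1 = new Thread(half);
        Thread t2 = new Thread(half);
        start = System.nanoTime();
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        report("volatile long, two threads", start);

        System.out.println(plain + shared); // keep the counters observable
    }

    static void report(String label, long start) {
        System.out.printf("%s: %.1f sec%n", label,
            (System.nanoTime() - start) / 1e9);
    }
}
----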
On 9/28/15 08:19, Vlad Rozov wrote:

Ram,

The stream between operators in the CONTAINER_LOCAL case is InlineStream. InlineStream extends DefaultReservoir, which extends CircularBuffer. CircularBuffer does not use synchronized methods or locks; it uses volatile. I guess that using volatile causes CPU cache invalidation, and along with memory locality (in the thread local case a tuple is always local to both threads, while in the container local case the second operator's thread may see data significantly later than the first thread produced it), these two factors negatively impact CONTAINER_LOCAL performance. It is still quite surprising that the impact is so significant.

Thank you, Vlad

On 9/27/15 16:45, Munagala Ramanath wrote:

Vlad,

That's a fascinating and counter-intuitive result. I wonder if some internal synchronization is happening (maybe the stream between them is a shared data structure that is lock protected) that slows down the two threads in the CONTAINER_LOCAL case. If they are both going as fast as possible, it is likely that they will be frequently blocked by the lock. If that is indeed the case, some sort of lock striping or a near-lockless protocol for stream access should tilt the balance in favor of CONTAINER_LOCAL.

In the thread-local case, of course, there is no need for such locking.

Ram
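One reason the lock-free designs referenced above can beat a volatile-based ring buffer: they exploit the single-writer principle, so the producer can publish its index with a cheaper ordered write (lazySet) instead of paying a full volatile store on every element. A sketch of that idea follows; it is illustrative only, not JCTools source.

----
import java.util.concurrent.atomic.AtomicLong;

// A single-writer publication index: only the producer thread ever
// writes `head`, so it can use lazySet, an ordered store that skips
// the full StoreLoad fence a volatile write would pay each time.
public class SingleWriterIndex {
    private final AtomicLong head = new AtomicLong();

    // Called only from the one producer thread.
    void publish() {
        head.lazySet(head.get() + 1); // plain read is safe: single writer
    }

    // Called from the consumer thread; it observes the update shortly
    // after the ordered store, just without an immediate flush.
    long available() {
        return head.get();
    }
}
----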
On Sun, Sep 27, 2015 at 12:17 PM, Vlad Rozov <[email protected]> wrote:

Changed subject to reflect the shift of discussion.

After I recompiled netlet and hardcoded a 0 wait time in the CircularBuffer.put() method, I still see the same difference, even when I increased operator memory to 10 GB and set "-Ddt.application.*.operator.*.attr.SPIN_MILLIS=0 -Ddt.application.*.operator.*.attr.QUEUE_CAPACITY=1024000". CPU % is close to 100% for both the thread local and container local settings. Note that in thread local the two operators share 100% CPU, while in container local each gets its own 100% load. It sounds like container local will outperform thread local only when the number of emitted tuples is (relatively) low, for example when it is CPU-costly to produce tuples (hash computations, compression/decompression, aggregations, filtering with complex expressions). In cases where an operator may emit 5 or more million tuples per second, thread local may outperform container local even when both operators are CPU intensive.

Thank you, Vlad

On 9/26/15 22:52, Timothy Farkas wrote:

Hi Vlad,

I just took a look at the CircularBuffer. Why are threads polling the state of the buffer before doing operations? Couldn't polling be avoided entirely by using something like condition variables to signal when the buffer is ready for an operation to be performed?

Tim
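For contrast with CircularBuffer's sleep-and-retry polling, this is what the condition-variable pattern Tim describes typically looks like. It is only a sketch of the pattern, not the actual change in Tim's branch linked earlier in the thread.

----
import java.util.ArrayDeque;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Bounded buffer where full/empty threads park on conditions instead
// of sleeping and re-polling.
public class BlockingBufferSketch<T> {
    private final ArrayDeque<T> buffer = new ArrayDeque<>();
    private final int capacity;
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition notFull = lock.newCondition();
    private final Condition notEmpty = lock.newCondition();

    public BlockingBufferSketch(int capacity) {
        this.capacity = capacity;
    }

    public void put(T tuple) throws InterruptedException {
        lock.lock();
        try {
            while (buffer.size() == capacity) {
                notFull.await(); // park until a consumer frees a slot
            }
            buffer.addLast(tuple);
            notEmpty.signal();
        } finally {
            lock.unlock();
        }
    }

    public T take() throws InterruptedException {
        lock.lock();
        try {
            while (buffer.isEmpty()) {
                notEmpty.await(); // park until a producer adds a tuple
            }
            T tuple = buffer.removeFirst();
            notFull.signal();
            return tuple;
        } finally {
            lock.unlock();
        }
    }
}
----

Whether parking and signalling beats a brief spin depends on arrival rate; at millions of tuples per second the wake-up latency itself can dominate, which is part of what the lock-free designs discussed earlier in the thread avoid.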
On Sat, Sep 26, 2015 at 10:42 PM, Vlad Rozov <[email protected]> wrote:

After looking at a few stack traces, I think that in the benchmark application the operators compete for the circular buffer that passes slices from the emitter output to the consumer input, and the sleeps that avoid busy waiting are too long for the benchmark operators. I don't see a stack similar to the one below every time I take a thread dump, but still often enough to suspect that the sleep is the root cause. I'll recompile with a smaller sleep time and see how this affects performance.

----
"1/wordGenerator:RandomWordInputModule" prio=10 tid=0x00007f78c8b8c000 nid=0x780f waiting on condition [0x00007f78abb17000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at com.datatorrent.netlet.util.CircularBuffer.put(CircularBuffer.java:182)
        at com.datatorrent.stram.stream.InlineStream.put(InlineStream.java:79)
        at com.datatorrent.stram.stream.MuxStream.put(MuxStream.java:117)
        at com.datatorrent.api.DefaultOutputPort.emit(DefaultOutputPort.java:48)
        at com.datatorrent.benchmark.RandomWordInputModule.emitTuples(RandomWordInputModule.java:108)
        at com.datatorrent.stram.engine.InputNode.run(InputNode.java:115)
        at com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1377)

"2/counter:WordCountOperator" prio=10 tid=0x00007f78c8c98800 nid=0x780d waiting on condition [0x00007f78abc18000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at com.datatorrent.stram.engine.GenericNode.run(GenericNode.java:519)
        at com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1377)
----

Thank you, Vlad

On 9/26/15 20:59, Amol Kekre wrote:

A good read:
http://preshing.com/20111118/locks-arent-slow-lock-contention-is/

Though it does not explain an order-of-magnitude difference.

Amol

On Sat, Sep 26, 2015 at 4:25 PM, Vlad Rozov <[email protected]> wrote:

In the benchmark test THREAD_LOCAL outperforms CONTAINER_LOCAL by an order of magnitude, and both operators compete for CPU. I'll take a closer look at why.

Thank you, Vlad
On 9/26/15 14:52, Thomas Weise wrote:

THREAD_LOCAL - operators share a thread
CONTAINER_LOCAL - each operator has its own thread

So as long as the operators utilize the CPU sufficiently (compete), the latter will perform better.

There will be cases where a single thread can accommodate multiple operators. For example, a socket reader (mostly waiting for IO) and a decompress operator (CPU hungry) can share a thread.

But to get back to the original question: stream locality generally does not reduce the total memory requirement. If you add multiple operators into one container, that container will also require more memory, and that's how the container size is calculated in the physical plan. You may get some extra mileage when multiple operators share the same heap, but the need to identify the memory requirement per operator does not go away.

Thomas
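For context, the localities under discussion are chosen per stream when the DAG is wired up. A minimal sketch follows, in which WordGenerator and WordCounter are hypothetical stand-ins (with an `output` and an `input` port) rather than the actual benchmark classes.

----
import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.DAG;
import com.datatorrent.api.DAG.Locality;
import com.datatorrent.api.StreamingApplication;

// WordGenerator and WordCounter are assumed operator classes, shown
// only to illustrate where the locality attribute is set.
public class LocalityDemo implements StreamingApplication {
    @Override
    public void populateDAG(DAG dag, Configuration conf) {
        WordGenerator generator = dag.addOperator("wordGenerator", new WordGenerator());
        WordCounter counter = dag.addOperator("counter", new WordCounter());

        dag.addStream("words", generator.output, counter.input)
           .setLocality(Locality.THREAD_LOCAL);   // operators share one thread
        // Locality.CONTAINER_LOCAL instead: one thread per operator in the
        // same JVM, with the CircularBuffer discussed above sitting between.
    }
}
----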
On Sat, Sep 26, 2015 at 12:41 PM, Munagala Ramanath <[email protected]> wrote:

Would CONTAINER_LOCAL achieve the same thing and perform a little better on a multi-core box?

Ram

On Sat, Sep 26, 2015 at 12:18 PM, Sandeep Deshmukh <[email protected]> wrote:

Yes, with this approach only two containers are required: one for stram and another for all operators. You can easily fit around 10 operators in less than 1 GB.

On 27 Sep 2015 00:32, Timothy Farkas <[email protected]> wrote:

Hi Ram,

You could make all the operators thread local. This cuts down on the overhead of separate containers and maximizes the memory available to each operator.

Tim

On Sat, Sep 26, 2015 at 10:07 AM, Munagala Ramanath <[email protected]> wrote:

Hi,

I was running into memory issues when deploying my app on the sandbox, where all the operators were stuck forever in the PENDING state because they were being continually aborted and restarted due to the limited memory on the sandbox.
After some experimentation, I found that the following config values seem to work:

<https://datatorrent.slack.com/archives/engineering/p1443263607000010>

----
<property>
  <name>dt.attr.MASTER_MEMORY_MB</name>
  <value>500</value>
</property>
<property>
  <name>dt.application.*.operator.*.attr.MEMORY_MB</name>
  <value>200</value>
</property>
<property>
  <name>dt.application.TopNWordsWithQueries.operator.fileWordCount.attr.MEMORY_MB</name>
  <value>512</value>
</property>
----

Are these reasonable values? Is there a more systematic way of coming up with these values than trial-and-error? Most of my operators -- with the exception of fileWordCount -- need very little memory; is there a way to cut all values down to the bare minimum and maximize the available memory for this one operator?

Thanks.

Ram
