Also sharing a diff: https://github.com/DataTorrent/Netlet/compare/master...ilooner:condVarBuffer

Thanks,
Tim
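For readers following along, the sketch below shows roughly what signaling with condition variables instead of sleep-based polling can look like. It is illustrative only: the class and method names are invented here, and this is not necessarily what the linked branch actually changes.

----
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Minimal sketch of a blocking circular buffer that uses Condition
// variables to signal readiness instead of sleeping in a poll loop.
// Hypothetical names; not the actual Netlet CircularBuffer.
public class BlockingCircularBuffer<T> {
  private final Object[] buffer;
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition notFull = lock.newCondition();
  private final Condition notEmpty = lock.newCondition();
  private int head, tail, count;

  public BlockingCircularBuffer(int capacity) {
    buffer = new Object[capacity];
  }

  public void put(T item) throws InterruptedException {
    lock.lock();
    try {
      while (count == buffer.length) {
        notFull.await();          // block until the consumer frees a slot
      }
      buffer[tail] = item;
      tail = (tail + 1) % buffer.length;
      count++;
      notEmpty.signal();          // wake a waiting consumer
    } finally {
      lock.unlock();
    }
  }

  @SuppressWarnings("unchecked")
  public T take() throws InterruptedException {
    lock.lock();
    try {
      while (count == 0) {
        notEmpty.await();         // block until the producer adds an item
      }
      T item = (T) buffer[head];
      buffer[head] = null;
      head = (head + 1) % buffer.length;
      count--;
      notFull.signal();           // wake a waiting producer
      return item;
    } finally {
      lock.unlock();
    }
  }
}
----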
On Mon, Sep 28, 2015 at 10:07 AM, Timothy Farkas <[email protected]> wrote:

Hi Vlad,

Could you share your benchmarking applications? I'd like to test a change I made to the CircularBuffer:

https://github.com/ilooner/Netlet/blob/condVarBuffer/src/main/java/com/datatorrent/netlet/util/CircularBuffer.java

Thanks,
Tim

On Mon, Sep 28, 2015 at 9:56 AM, Pramod Immaneni <[email protected]> wrote:

Vlad, what was your mode of interaction/ordering between the two threads for the 3rd test?

On Mon, Sep 28, 2015 at 10:51 AM, Vlad Rozov <[email protected]> wrote:

I created a simple test to check how quickly Java can count to Integer.MAX_VALUE. The result that I see is consistent with the CONTAINER_LOCAL behavior:

counting long in a single thread: 0.9 sec
counting volatile long in a single thread: 17.7 sec
counting volatile long shared between two threads: 186.3 sec

I suggest that we look into https://qconsf.com/sf2012/dl/qcon-sanfran-2012/slides/MartinThompson_LockFreeAlgorithmsForUltimatePerformanceMOVEDTOBALLROOMA.pdf or a similar algorithm.

Thank you,

Vlad
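For reference, a minimal sketch of this kind of counting test. The thread does not show how Vlad's test splits the work between the two threads, so the half-and-half split below is an assumption, and a serious measurement would use a harness such as JMH to keep the JIT from optimizing the loops away.

----
// Rough sketch of the counting test described above: how long it takes
// to increment a counter Integer.MAX_VALUE times, as a plain long, as a
// volatile long, and as a volatile long shared between two threads.
public class CountingBenchmark {
  static long plainCounter;
  static volatile long volatileCounter;

  public static void main(String[] args) throws InterruptedException {
    long start = System.nanoTime();
    for (int i = 0; i < Integer.MAX_VALUE; i++) {
      plainCounter++;
    }
    report("plain long, single thread", start);

    start = System.nanoTime();
    for (int i = 0; i < Integer.MAX_VALUE; i++) {
      volatileCounter++;
    }
    report("volatile long, single thread", start);

    // Two threads each do half the increments on the shared volatile.
    // Note: volatileCounter++ is not atomic; the point here is only to
    // expose the cost of cross-core cache-line traffic, not correctness.
    volatileCounter = 0;
    Runnable half = () -> {
      for (int i = 0; i < Integer.MAX_VALUE / 2; i++) {
        volatileCounter++;
      }
    };
    start = System.nanoTime();
    Thread t1 = new Thread(half);
    Thread t2 = new Thread(half);
    t1.start();
    t2.start();
    t1.join();
    t2.join();
    report("volatile long, two threads", start);
  }

  static void report(String label, long startNanos) {
    System.out.printf("%s: %.1f sec%n", label, (System.nanoTime() - startNanos) / 1e9);
  }
}
----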
On 9/28/15 08:19, Vlad Rozov wrote:

Ram,

The stream between operators in the CONTAINER_LOCAL case is InlineStream. InlineStream extends DefaultReservoir, which extends CircularBuffer. CircularBuffer does not use synchronized methods or locks; it uses volatile. I guess that using volatile causes CPU cache invalidation, and along with memory locality (in the thread local case a tuple is always local to both threads, while in the container local case the second operator's thread may see data significantly later than when the first thread produced it) these two factors negatively impact CONTAINER_LOCAL performance. It is still quite surprising that the impact is so significant.

Thank you,

Vlad
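To give a flavor of the lock-free designs in the slides Vlad references above: in a single-producer/single-consumer ring buffer, each side can keep a cached copy of the other side's index so the volatile counters are touched far less often, which is exactly the cross-core cache traffic discussed here. The sketch below assumes one producer thread and one consumer thread; names are invented, and a production version would also pad the fields to avoid false sharing.

----
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a lock-free single-producer/single-consumer ring buffer.
// Capacity must be a power of two so the index math can use a mask.
public class SpscRingBuffer<T> {
  private final Object[] buffer;
  private final int mask;
  private final AtomicLong head = new AtomicLong(); // consumer index
  private final AtomicLong tail = new AtomicLong(); // producer index
  private long cachedHead; // producer's cached view of head
  private long cachedTail; // consumer's cached view of tail

  public SpscRingBuffer(int capacityPowerOfTwo) {
    buffer = new Object[capacityPowerOfTwo];
    mask = capacityPowerOfTwo - 1;
  }

  // Producer thread only. Returns false when the buffer is full.
  public boolean offer(T item) {
    long t = tail.get();
    if (t - cachedHead >= buffer.length) {
      cachedHead = head.get();            // refresh from the volatile index
      if (t - cachedHead >= buffer.length) {
        return false;                     // really full
      }
    }
    buffer[(int) t & mask] = item;
    tail.lazySet(t + 1);                  // ordered store, cheaper than a volatile write
    return true;
  }

  // Consumer thread only. Returns null when the buffer is empty.
  @SuppressWarnings("unchecked")
  public T poll() {
    long h = head.get();
    if (h >= cachedTail) {
      cachedTail = tail.get();            // refresh from the volatile index
      if (h >= cachedTail) {
        return null;                      // really empty
      }
    }
    T item = (T) buffer[(int) h & mask];
    buffer[(int) h & mask] = null;
    head.lazySet(h + 1);
    return item;
  }
}
----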
On 9/27/15 16:45, Munagala Ramanath wrote:

Vlad,

That's a fascinating and counter-intuitive result. I wonder if some internal synchronization is happening (maybe the stream between them is a shared data structure that is lock protected) to slow down the 2 threads in the CONTAINER_LOCAL case. If they are both going as fast as possible, it is likely that they will be frequently blocked by the lock. If that is indeed the case, some sort of lock striping or a near-lockless protocol for stream access should tilt the balance in favor of CONTAINER_LOCAL.

In the thread-local case, of course, there is no need for such locking.

Ram

On Sun, Sep 27, 2015 at 12:17 PM, Vlad Rozov <[email protected]> wrote:

Changed subject to reflect the shift of discussion.

After I recompiled netlet and hardcoded a 0 wait time in the CircularBuffer.put() method, I still see the same difference, even when I increased operator memory to 10 GB and set "-D dt.application.*.operator.*.attr.SPIN_MILLIS=0 -D dt.application.*.operator.*.attr.QUEUE_CAPACITY=1024000". CPU % is close to 100% for both the thread local and container local locality settings. Note that in thread local the two operators share 100% of one CPU, while in container local each gets its own 100% load. It sounds like container local will outperform thread local only when the number of emitted tuples is (relatively) low, for example when it is CPU costly to produce tuples (hash computations, compression/decompression, aggregations, filtering with complex expressions). In cases where an operator may emit 5 million or more tuples per second, thread local may outperform container local even when both operators are CPU intensive.

Thank you,

Vlad

On 9/26/15 22:52, Timothy Farkas wrote:

Hi Vlad,

I just took a look at the CircularBuffer. Why are threads polling the state of the buffer before doing operations? Couldn't polling be avoided entirely by using something like Condition variables to signal when the buffer is ready for an operation to be performed?

Tim

On Sat, Sep 26, 2015 at 10:42 PM, Vlad Rozov <[email protected]> wrote:

After looking at a few stack traces, I think that in the benchmark application the operators compete for the circular buffer that passes slices from the emitter output to the consumer input, and the sleeps that avoid busy waiting are too long for the benchmark operators. I don't see a stack similar to the one below every time I take a thread dump, but still often enough to suspect that the sleep is the root cause. I'll recompile with a smaller sleep time and see how that affects performance.

----
"1/wordGenerator:RandomWordInputModule" prio=10 tid=0x00007f78c8b8c000 nid=0x780f waiting on condition [0x00007f78abb17000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at com.datatorrent.netlet.util.CircularBuffer.put(CircularBuffer.java:182)
        at com.datatorrent.stram.stream.InlineStream.put(InlineStream.java:79)
        at com.datatorrent.stram.stream.MuxStream.put(MuxStream.java:117)
        at com.datatorrent.api.DefaultOutputPort.emit(DefaultOutputPort.java:48)
        at com.datatorrent.benchmark.RandomWordInputModule.emitTuples(RandomWordInputModule.java:108)
        at com.datatorrent.stram.engine.InputNode.run(InputNode.java:115)
        at com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1377)

"2/counter:WordCountOperator" prio=10 tid=0x00007f78c8c98800 nid=0x780d waiting on condition [0x00007f78abc18000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at com.datatorrent.stram.engine.GenericNode.run(GenericNode.java:519)
        at com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1377)
----

On 9/26/15 20:59, Amol Kekre wrote:

A good read - http://preshing.com/20111118/locks-arent-slow-lock-contention-is/

Though it does not explain an order-of-magnitude difference.

Amol

On Sat, Sep 26, 2015 at 4:25 PM, Vlad Rozov <[email protected]> wrote:

In the benchmark test THREAD_LOCAL outperforms CONTAINER_LOCAL by an order of magnitude, and both operators compete for CPU. I'll take a closer look at why.

Thank you,

Vlad

On 9/26/15 14:52, Thomas Weise wrote:

THREAD_LOCAL - operators share a thread
CONTAINER_LOCAL - each operator has its own thread

So as long as the operators utilize the CPU sufficiently (compete), the latter will perform better.

There will be cases where a single thread can accommodate multiple operators. For example, a socket reader (mostly waiting for IO) and a decompress (CPU hungry) can share a thread.

But to get back to the original question, stream locality generally does not reduce the total memory requirement. If you add multiple operators into one container, that container will also require more memory, and that's how the container size is calculated in the physical plan. You may get some extra mileage when multiple operators share the same heap, but the need to identify the memory requirement per operator does not go away.

Thomas
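To make the two localities concrete, here is a minimal sketch of how a stream's locality is selected when wiring a DAG. The stand-in operators are invented for illustration (they are not the benchmark operators from the thread), and package names may vary by release.

----
import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.DAG;
import com.datatorrent.api.DAG.Locality;
import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.InputOperator;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.common.util.BaseOperator;

public class LocalitySketch implements StreamingApplication
{
  // Trivial stand-in input operator.
  public static class WordSource extends BaseOperator implements InputOperator
  {
    public final transient DefaultOutputPort<String> output = new DefaultOutputPort<>();

    @Override
    public void emitTuples()
    {
      output.emit("word");
    }
  }

  // Trivial stand-in downstream operator.
  public static class WordSink extends BaseOperator
  {
    public final transient DefaultInputPort<String> input = new DefaultInputPort<String>()
    {
      @Override
      public void process(String tuple)
      {
        // consume the tuple
      }
    };
  }

  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    WordSource source = dag.addOperator("wordGenerator", new WordSource());
    WordSink sink = dag.addOperator("counter", new WordSink());
    // THREAD_LOCAL: both operators share one thread.
    // CONTAINER_LOCAL: same JVM, separate threads joined by the
    // in-memory stream (InlineStream over CircularBuffer).
    dag.addStream("words", source.output, sink.input).setLocality(Locality.THREAD_LOCAL);
  }
}
----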
On Sat, Sep 26, 2015 at 12:41 PM, Munagala Ramanath <[email protected]> wrote:

Would CONTAINER_LOCAL achieve the same thing and perform a little better on a multi-core box?

Ram

On Sat, Sep 26, 2015 at 12:18 PM, Sandeep Deshmukh <[email protected]> wrote:

Yes, with this approach only two containers are required: one for the stram and another for all the operators. You can easily fit around 10 operators in less than 1GB.

On 27 Sep 2015 00:32, "Timothy Farkas" <[email protected]> wrote:

Hi Ram,

You could make all the operators thread local. This cuts down on the overhead of separate containers and maximizes the memory available to each operator.

Tim

On Sat, Sep 26, 2015 at 10:07 AM, Munagala Ramanath <[email protected]> wrote:

Hi,

I was running into memory issues when deploying my app on the sandbox: all the operators were stuck forever in the PENDING state because they were being continually aborted and restarted due to the limited memory on the sandbox. After some experimentation, I found that the following config values seem to work:

------------------------------------------
https://datatorrent.slack.com/archives/engineering/p1443263607000010

<property>
  <name>dt.attr.MASTER_MEMORY_MB</name>
  <value>500</value>
</property>
<property>
  <name>dt.application.*.operator.*.attr.MEMORY_MB</name>
  <value>200</value>
</property>
<property>
  <name>dt.application.TopNWordsWithQueries.operator.fileWordCount.attr.MEMORY_MB</name>
  <value>512</value>
</property>
------------------------------------------------

Are these reasonable values? Is there a more systematic way of coming up with these values than trial-and-error? Most of my operators -- with the exception of fileWordCount -- need very little memory; is there a way to cut all values down to the bare minimum and maximize the available memory for this one operator?

Thanks.

Ram
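One possible pattern for that last question, sketched below with illustrative values in the same property style as above: set a bare-minimum wildcard default for every operator, then override only the memory-hungry one. The minimum workable default depends on the operators and has to be found empirically.

----
<!-- Low wildcard default for all operators in all applications
     (value is illustrative; too low and containers will be killed). -->
<property>
  <name>dt.application.*.operator.*.attr.MEMORY_MB</name>
  <value>128</value>
</property>
<!-- Targeted override for the one operator that needs more. -->
<property>
  <name>dt.application.TopNWordsWithQueries.operator.fileWordCount.attr.MEMORY_MB</name>
  <value>512</value>
</property>
----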
