-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60376/#review178907
-----------------------------------------------------------

What is your main optimization objective: reducing page load time, or reducing steady Observer CPU load?

I have observed that when running many tasks per node (say ~30-100), the metric collection threads can starve the UI of almost all CPU time (due to the Python GIL). In these cases, it would actually be better to always collect fresh metrics on demand and eliminate the regular collection instead. This would result in slower UI rendering but should yield more consistent latency.


- Stephan Erb


On June 22, 2017, 10:18 p.m., Reza Motamedi wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60376/
> -----------------------------------------------------------
>
> (Updated June 22, 2017, 10:18 p.m.)
>
>
> Review request for Aurora, David McLaughlin, Joshua Cohen, and Santhosh Kumar
> Shanmugham.
>
>
> Repository: aurora
>
>
> Description
> -------
>
> # Observer task page to load consumption info from history
>
> The resource consumption of Thermos Processes is periodically calculated by
> TaskResourceMonitor threads (one thread per Thermos task). This information
> is used to display a (semi) fresh state of the tasks running on a host in the
> Observer host page, aka landing page. An aggregate history of the
> consumption is kept at the task level, although TaskResourceMonitor first
> needs to collect the resources at the Process level and then aggregate them.
>
> On the other hand, when an Observer _task page_ is visited, the resource
> consumption of the Thermos Processes within that task is calculated again and
> displayed without being aggregated. This can become very slow since the time
> to complete the resource calculation is affected by the load on the host.
>
> With this patch we take advantage of the periodic work and serve the resource
> information requested by the Observer task page from the already collected
> resource consumption data.
>
>
> Diffs
> -----
>
>   src/main/python/apache/thermos/monitoring/resource.py 434666696e600a0e6c19edd986c86575539976f2
>   src/test/python/apache/thermos/monitoring/test_resource.py d794a998f1d9fc52ba260cd31ac444aee7f8ed28
>
>
> Diff: https://reviews.apache.org/r/60376/diff/1/
>
>
> Testing
> -------
>
> I stress tested this patch on a host that had a slow Observer page.
> Interestingly, I did not need to do much to make the Observer slow. There are
> a few points to be made clear first.
> - We at Twitter limit the resources allocated to the Observer using
> `systemd`. The Observer is allowed to use only 20% of a CPU core. The
> attached screenshots are from such a setup.
> - With 20% of a CPU core assigned to the Observer, starting only 8 `task`s,
> each with 3 `process`es, is enough to make the Observer slow: 11 seconds to
> load the `task page`.
>
>
> File Attachments
> ----------------
>
> without the patch -- Screen Shot 2017-06-22 at 1.11.12 PM.png
>   https://reviews.apache.org/media/uploaded/files/2017/06/22/03968028-a2f5-4a99-ba57-b7a41c471436__without_the_patch_--_Screen_Shot_2017-06-22_at_1.11.12_PM.png
> with the patch -- Screen Shot 2017-06-22 at 1.07.41 PM.png
>   https://reviews.apache.org/media/uploaded/files/2017/06/22/5962c018-27d3-4463-a277-f6ad48b7f2d7__with_the_patch_--_Screen_Shot_2017-06-22_at_1.07.41_PM.png
>
>
> Thanks,
>
> Reza Motamedi
>
>
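
For illustration only, the idea in the quoted description (answer the task page from samples the periodic monitor has already collected, rather than recomputing them per request) could be sketched roughly as below. This is not the code from the diff: `ProcessSampleHistory`, `aggregate`, `task_page_view`, the `max_age_secs` staleness limit, and the fallback to a fresh on-demand collection are all hypothetical names and assumptions made for the sketch, and the real `TaskResourceMonitor` interface may look quite different.

```python
import threading
import time
from collections import deque


class ProcessSampleHistory(object):
    """Hypothetical bounded, thread-safe history of per-process samples."""

    def __init__(self, maxlen=10):
        self._lock = threading.Lock()
        self._samples = deque(maxlen=maxlen)  # oldest entries drop off automatically

    def add(self, timestamp, per_process):
        # Called by the periodic collection thread after each sampling round.
        with self._lock:
            self._samples.append((timestamp, dict(per_process)))

    def latest(self):
        # Called by the UI thread; returns (timestamp, samples) or None.
        with self._lock:
            return self._samples[-1] if self._samples else None


def aggregate(per_process):
    """Sum per-process figures into a task-level figure (host page view)."""
    return {
        "cpu": sum(s["cpu"] for s in per_process.values()),
        "ram": sum(s["ram"] for s in per_process.values()),
    }


def task_page_view(history, collect_fresh, max_age_secs=30.0, now=time.time):
    """Return (per-process samples, source) for the task page.

    Assumption: a recent enough sample from the history is served as-is, and
    the slow on-demand collection only runs when nothing recent is available.
    """
    entry = history.latest()
    if entry is not None:
        timestamp, per_process = entry
        if now() - timestamp <= max_age_secs:
            return per_process, "history"
    return collect_fresh(), "fresh"
```

In this sketch the trade-off raised in the review comment shows up as a single knob: setting `max_age_secs` to 0 and never calling `add` corresponds to always collecting fresh metrics on demand, while a larger value leans on the periodic collection threads.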
