> On June 26, 2017, 5:38 p.m., Stephan Erb wrote:
> > What is your main optimization objective? Reducing page load time or 
> > reducing steady-state Observer CPU load? 
> > 
> > I have observed that when running many tasks per node (say ~30-100), the 
> > metric collection threads can essentially starve the UI of almost all CPU 
> > time (due to the Python GIL). In these cases, it would actually be better 
> > to just compute fresh metrics on demand and eliminate the regular 
> > collection instead. This would result in slower UI rendering but should 
> > yield more consistent latency.
> 
> Reza Motamedi wrote:
>     I observed the same problem. My objective was to reduce page load time, 
> and what worked best was to reuse the already collected resource consumption 
> data. This lets us keep all the information that we currently provide. 
>     
>     I did a more or less thorough profiling of what consumes the most CPU 
> and takes the longest, and found that looking up the children of a pid is 
> very CPU intensive. See the psutil implementation here: 
> [Process.children](https://pythonhosted.org/psutil/_modules/psutil.html#Process.children).
>  Constantly running this in the background does not seem to help :).
>     
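>     For context, a throwaway micro-benchmark along these lines (not Observer 
> code; the pid is arbitrary) shows why: `children()` re-reads the whole 
> process table on every call, so its cost grows with the number of processes 
> on the host.
>     
> ```python
> import time
> 
> import psutil
> 
> # Throwaway benchmark: time repeated Process.children() calls.
> proc = psutil.Process()  # stand-in for a Thermos runner pid
> start = time.time()
> for _ in range(100):
>     proc.children(recursive=True)
> elapsed_ms = (time.time() - start) * 1000 / 100
> print('avg Process.children() call: %.1f ms' % elapsed_ms)
> ```
>     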
>     I agree that the background thread that computes the resource consumption 
> of all processes isn't very useful, and perhaps it would be better to collect 
> all consumption data as users visit pages. However, we need to remember that 
> the thread is actually performing some collections that could easily become 
> slow to compute, for instance running `du` on `n` sandboxes. Also, users 
> could easily flood the UI by constantly refreshing the page and triggering 
> repeated work.
>     
>     An alternative solution would be to keep the disk collection inside an 
> always-running thread and collect CPU and memory as users visit the page. 
> This would only change how we render the Thermos host (landing) page.

I am not sure, though, how that would perform in practice when the `du` runs 
are backlogged.
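
To make that alternative concrete, here is a minimal sketch (the names are 
hypothetical, not the existing TaskResourceMonitor/Observer API): keep the 
`du`-style walk on its own thread and sample CPU/mem with psutil only at 
page-render time.

```python
import threading
import time

import psutil


class SandboxDiskSampler(threading.Thread):
    """Hypothetical sketch: keep the slow disk walk off the request path."""

    def __init__(self, sandbox_path, interval_secs=60):
        super(SandboxDiskSampler, self).__init__()
        self.daemon = True
        self._sandbox_path = sandbox_path
        self._interval_secs = interval_secs
        self.latest_bytes = 0

    def run(self):
        while True:
            self.latest_bytes = self._walk_sandbox(self._sandbox_path)
            time.sleep(self._interval_secs)

    def _walk_sandbox(self, path):
        return 0  # placeholder for the real du/sandbox walk


def sample_task_processes(pids):
    """Collect CPU/mem for a task's processes only when the page is requested."""
    samples = []
    for pid in pids:
        try:
            proc = psutil.Process(pid)
            # Note: cpu_percent(interval=None) returns 0.0 on the first call
            # for a given Process instance.
            samples.append((pid, proc.cpu_percent(interval=None),
                            proc.memory_info().rss))
        except psutil.NoSuchProcess:
            continue  # process exited between listing and sampling
    return samples
```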


- Reza


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60376/#review178907
-----------------------------------------------------------


On June 22, 2017, 8:18 p.m., Reza Motamedi wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60376/
> -----------------------------------------------------------
> 
> (Updated June 22, 2017, 8:18 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Joshua Cohen, and Santhosh Kumar 
> Shanmugham.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> # Observer task page to load consumption info from history
> 
> Resource consumption of Thermos Processes is periodically calculated by 
> TaskResourceMonitor threads (one thread per Thermos task). This information 
> is used to display a (semi-)fresh state of the tasks running on a host on 
> the Observer host page, a.k.a. the landing page. An aggregate history of the 
> consumption is kept at the task level, although the TaskResourceMonitor 
> first collects resources at the Process level and then aggregates them.
> 
> On the other hand, when an Observer _task page_ is visited, the resource 
> consumption of the Thermos Processes within that task is calculated again 
> and displayed without being aggregated. This can become very slow, since the 
> time to complete the resource calculation depends on the load on the host.
> 
> By applying this patch, we take advantage of that periodic work and serve 
> the resource information requested on the Observer task page from the 
> already collected consumption data.
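> 
> In other words (an illustrative sketch with hypothetical names, not the 
> actual diff; the real change is in 
> src/main/python/apache/thermos/monitoring/resource.py): the monitor keeps 
> aggregating per-Process samples on its periodic pass, and the task page 
> simply reads the latest snapshot instead of re-walking the process tree 
> with psutil.
> 
> ```python
> def task_page_resources(task_monitor):
>     # task_monitor (hypothetical name) already aggregates per-Process samples
>     # on its periodic pass; return that snapshot instead of re-sampling.
>     timestamp, per_process_usage = task_monitor.latest_sample()
>     return {'collected_at': timestamp, 'processes': per_process_usage}
> ```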
> 
> 
> Diffs
> -----
> 
>   src/main/python/apache/thermos/monitoring/resource.py 
> 434666696e600a0e6c19edd986c86575539976f2 
>   src/test/python/apache/thermos/monitoring/test_resource.py 
> d794a998f1d9fc52ba260cd31ac444aee7f8ed28 
> 
> 
> Diff: https://reviews.apache.org/r/60376/diff/1/
> 
> 
> Testing
> -------
> 
> I stress tested this patch on a host that had a slow Observer page. 
> Interestingly, I did not need to do much to make the Observer slow. A few 
> points should be made clear first:
> - At Twitter we limit the resources allocated to the Observer using 
> `systemd`; the Observer is allowed to use only 20% of a CPU core. The 
> attached screenshots are from such a setup.
> - With 20% of a CPU core assigned to the Observer, starting only 8 tasks, 
> each with 3 processes, is enough to make the Observer slow: 11 seconds to 
> load the task page.
> 
> 
> File Attachments
> ----------------
> 
> without the patch -- Screen Shot 2017-06-22 at 1.11.12 PM.png
>   
> https://reviews.apache.org/media/uploaded/files/2017/06/22/03968028-a2f5-4a99-ba57-b7a41c471436__without_the_patch_--_Screen_Shot_2017-06-22_at_1.11.12_PM.png
> with the patch -- Screen Shot 2017-06-22 at 1.07.41 PM.png
>   
> https://reviews.apache.org/media/uploaded/files/2017/06/22/5962c018-27d3-4463-a277-f6ad48b7f2d7__with_the_patch_--_Screen_Shot_2017-06-22_at_1.07.41_PM.png
> 
> 
> Thanks,
> 
> Reza Motamedi
> 
>
