Hey Rajith!

Great breakdown of the problem. My thoughts/comments are below:


-----Original Message-----

From: Rajith Siriwardana <rajithsiriward...@gmail.com>
Date: Thursday, June 27, 2013 10:27 AM
To: OODT dev <dev@oodt.apache.org>, jpluser
<chris.a.mattm...@jpl.nasa.gov>, jpluser <chris.mattm...@gmail.com>
Subject: Re: [Ganglia plugin] Next steps

>Hi all,
>The attached/linked diagram [1] shows how the GangliaResourceMonitorFactory
>will be integrated with AssignmentMonitor to calculate load.
>AssignmentMonitor keeps each node's load in a static HashMap
>(<nodeId, load>), so I guess the loadMap should be updated at a regular
>interval (e.g., every 1 minute) by parsing the Ganglia XML, right?

Bingo, correct.
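
Something like the sketch below is what I have in mind for that refresh
loop. Note that GangliaXmlParser.parseLoads() is a hypothetical helper
standing in for whatever fetches gmetad's XML dump and reduces it to a
<nodeId, load> map:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class GangliaLoadRefresher {

  // Shared with AssignmentMonitor; a ConcurrentHashMap avoids races
  // between the scheduler thread and readers.
  private final Map<String, Integer> loadMap =
      new ConcurrentHashMap<String, Integer>();

  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  public void start() {
    // Re-parse the Ganglia XML once a minute and refresh each node's load.
    scheduler.scheduleAtFixedRate(new Runnable() {
      public void run() {
        Map<String, Integer> latest = GangliaXmlParser.parseLoads();
        loadMap.putAll(latest);
      }
    }, 0, 1, TimeUnit.MINUTES);
  }
}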

> 
>
>The load we need is not a traditional value; it is a value that says how
>many of these jobs can fit on a machine. So, as I understand it, the load
>calculation should take the most relevant metrics into account, apply
>weights to their values, and then normalize the load value into the range
>0 to 1.

Yep, or more likely, I would normalize the load into a value between
0 and node.getCapacity(),
where that value is read from the nodes.xml file.
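
As a minimal sketch (maxRaw here is an assumed upper bound on the raw
combined metric value, not something Ganglia reports directly):

/**
 * Scale a raw weighted load in [0, maxRaw] onto [0, capacity], where
 * capacity comes from node.getCapacity() as configured in nodes.xml.
 */
public static double normalize(double rawLoad, double maxRaw, int capacity) {
  // Clamp so a momentary spike can never report more than capacity.
  return Math.min(capacity, (rawLoad / maxRaw) * capacity);
}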

> 
>I guess the following are the most relevant of the default Ganglia
>metrics for the calculation:
>
>load_one = one minute load average
>load_five = five minutes load average
>load_fifteen = fifteen minutes load average
>
>mem_free = amount of available memory
>swap_free = amount of available swap memory

+1

>
>The following are the models I currently have in mind:
>(I). Weight the 1-min, 5-min, and 15-min load numbers and normalize the
>value.

+1

>(II). Add the mem_free and swap_free metrics to the calculation in
>model (I).

+1

>
>
>According to [3], more weight should go to either the 5- or 15-minute
>average.
>#1. But how can I rationalize the weights I give?

Use node.getCapacity() and allow the user to provide that rationalization,
e.g., allow them to easily tinker (via configuration) with the different
weights on the metrics, while at the same time ensuring that those
weights, when multiplied with the metric values, keep the result between
0 and node.getCapacity().
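
A rough sketch of what I mean, assuming each metric value has already
been normalized into [0, 1] (e.g., a load average divided by cpu_num);
the metric names and weights here are illustrative defaults, not
prescribed values:

import java.util.HashMap;
import java.util.Map;

public class WeightedLoadCalculator {

  // User-tunable via configuration; these defaults are illustrative only.
  private final Map<String, Double> weights = new HashMap<String, Double>();

  public WeightedLoadCalculator() {
    weights.put("load_one", 3.0);
    weights.put("load_five", 10.0);
    weights.put("load_fifteen", 1.0);
  }

  /**
   * Weighted average of the metric values, scaled onto [0, capacity].
   * Dividing by the sum of the weights keeps the combined value in
   * [0, 1] before scaling, so the result stays within capacity.
   */
  public double calculateLoad(Map<String, Double> metrics, int capacity) {
    double weightedSum = 0.0;
    double weightSum = 0.0;
    for (Map.Entry<String, Double> w : weights.entrySet()) {
      Double value = metrics.get(w.getKey());
      if (value != null) {
        weightedSum += w.getValue() * value;
        weightSum += w.getValue();
      }
    }
    return weightSum == 0.0 ? 0.0
        : Math.min(capacity, (weightedSum / weightSum) * capacity);
  }
}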

> 
>#2. Furthermore, what is the capacity of a node? Since we are talking
>about normalization, what is the role of this capacity, and how does it
>affect the calculation? (When assigning load to a particular node, it
>calculates something like "if (loadValue <= (loadCap - curLoad))", where
>loadCap = node.getCapacity() and curLoad = loadMap.get(node.getNodeId()).intValue().)

Allow the user to set capacity() in nodes.xml, and then read it from there
(as a start).

> 
>
>Other considerations
>#3. what should be the value if the node is offline?

Capacity should probably be set to 0 at that point. IOW, if it's offline,
ignore the user's pre-profiled
capacity, and then say it can't hold any jobs.

>We can tell whether a particular node is offline from the TN and TMAX
>values: in gmetad, a host is considered offline and is ignored if
>TN > 4 * TMAX [2].
>(TN: the number of seconds since the metric was last updated; TMAX: the
>maximum time in seconds between gmetric calls)

+1
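
In code that check might look like the sketch below, with tn and tmax
being the TN and TMAX attributes parsed from the HOST element in the
Ganglia XML:

/** Per [2]: a host is considered offline if TN > 4 * TMAX. */
public static boolean isOffline(long tn, long tmax) {
  return tn > 4 * tmax;
}

/** Offline nodes get an effective capacity of 0, so they take no jobs. */
public static int effectiveCapacity(long tn, long tmax, int configuredCapacity) {
  return isOffline(tn, tmax) ? 0 : configuredCapacity;
}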

Great work! Please proceed.

Cheers,
Chris

>
>
>The default Ganglia metrics are listed here; your thoughts are welcome:
>disk_free = Disk Space Available
>machine_type = System architecture
>bytes_out = Number of bytes out per second
>gexec = gexec available
>proc_total = Total number of processes
>cpu_nice = Percentage of CPU utilization that occurred while executing at
>the user level with nice priority
>pkts_in = Packets in per second
>cpu_speed = CPU Speed in terms of MHz
>boottime = The last time that the system was started
>cpu_wio = Percentage of time that the CPU or CPUs were idle during which
>the system had an outstanding disk I/O request
>os_name = Operating system name
>load_one = One minute load average
>os_release = Operating system release date
>disk_total = Total available disk space
>cpu_user = Percentage of CPU utilization that occurred while executing at
>the user level
>cpu_idle = Percentage of time that the CPU or CPUs were idle and the
>system did not have an outstanding disk I/O request
>swap_free = Amount of available swap memory
>mem_cached = Amount of cached memory
>pkts_out = Packets out per second
>load_five = Five minute load average
>cpu_num = Total number of CPUs
>load_fifteen  = Fifteen minute load average
>mem_free = Amount of available memory
>cpu_system = Percentage of CPU utilization that occurred while executing
>at the system level
>proc_run = Total number of running processes
>mem_total = Total amount of memory displayed in KBs
>cpu_aidle = Percent of time since boot idle CPU
>bytes_in  = Number of bytes in per second
>mem_buffers  = Amount of buffered memory
>mem_shared = Amount of shared memory
>swap_total = Total amount of swap space displayed in KBs
>part_max_used = Maximum percent used for all partitions
>
>
>[1] 
>https://issues.apache.org/jira/secure/attachment/12589911/diagram1.png
>[2] http://entropy.gforge.inria.fr/ganglia.html
>[3] 
>http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages
>
>
>Cheers,
>Rajith



------------------------
Chris Mattmann
chris.mattm...@gmail.com




>
>
>
>On Fri, Jun 21, 2013 at 7:22 PM, Rajith Siriwardana
><rajithsiriward...@gmail.com> wrote:
>
>
>Moving the conversation to dev.
>
>
>Cheers,
>Rajith
>
>On Thu, Jun 20, 2013 at 11:10 AM, Chris Mattmann
><chris.mattm...@gmail.com> wrote:
>
>
>
>Hi Rajith,
>
>RE: #1 yep that's the next step.
>
>RE: #2, I would create a pluggable function/class that allows
>different "Besting" algorithms to be plugged in. One simple one
>would be AverageLoad (the average of the 3 load values); another,
>FiveMinuteLoad; another, OneMinLoad; etc. I would also imagine
>allowing an ArbitraryMetricWeightedAvgLoad that takes in maybe a
>List<String> specifying the metric names, and also maybe a
>HashMap<String, Double> that maps each metric name to the weight to
>apply in the weighted average, e.g., maybe
>{{"1minload", "3.0"}, {"5minload", "10.0"}, {"15minload", "1.0"}}
>
>indicating that the final load should be calculated as:
>
>3*[val of 1minLoad] + 10*[val of 5minLoad] + 1*[val of 15minLoad]
>-----------------------------------------------------------------
>                          3
>Or something like the above
>
>for #3 (use casting and maybe Math.max)?
>
>for #4, see above.
>
>Also this should all probably go on dev@oodt.apache.org so can
>you move the conversation there?
>
>Cheers,
>Chris
>
>------------------------
>Chris Mattmann
>chris.mattm...@gmail.com
>
>
>
>
>-----Original Message-----
>From: Rajith Siriwardana <rajithsiriward...@gmail.com>
>Date: Wednesday, June 19, 2013 11:32 AM
>To: jpluser <chris.a.mattm...@jpl.nasa.gov>, jpluser
><chris.mattm...@gmail.com>
>Subject: [Ganglia plugin] Next steps
>
>>Hi Chris,
>>My next steps would be
>>
>>Adding the capability to create a GangliaAssignmentMonitor from the
>>GangliaAssignmentMonitorFactory and integrate it with AssignmentMonitor.
>>
>>In that case I have a few questions:
>>
>>1. GangliaAssignmentMonitor should download and parse the XML when
>>the AssignmentMonitor requests a node's current load, right? And
>>this should update the loadMap in AssignmentMonitor?
>>
>>2. About the current load: should it be the 15-min, 5-min, or 1-min
>>value, or should it be an average load? (Since the requirement is the
>>current load, I guess this should be a weighted average of these three
>>load values.)
>>
>>3. Ganglia provides the load values as percentages, but loadMap uses
>>Integer; how should the mapping happen?
>>
>>4. I couldn't find anywhere that requires any metric other than the load
>>of a resource node.
>>
>>
>>
>>Thank you,
>>Rajith
>>

