On Wed, 4 Dec 2013 11:39:29 AM Jeff Squyres wrote:

> On Dec 3, 2013, at 7:54 PM,
> Christopher Samuel <sam...@unimelb.edu.au> wrote:
>
> > Would it make any sense to expose system/environmental/thermal
> > information to the application via MPI_T ?
> 
> Hmm.  Interesting idea.

Phew. :-)

> Is the best way to grab such stuff via IPMI?

I don't think so.  That means either giving the process permission to access 
/dev/ipmi* or talking over the network to the adapter, neither of which is 
likely to be desirable (or even possible; our iDataplex IMMs are not 
accessible from the compute nodes).

However, with the coretemp kernel module loaded you at least get access to 
CPU temperature information on Intel systems:

/sys/bus/platform/devices/coretemp.${A}/temp${B}_input

which contains the temperature in 1000ths of a degree Celsius (so 52000 means 
52 degrees C) and is world readable.  You also get access to the various 
thermal trip points and alarms.

The ${B} value is 1 for the CPU package (SandyBridge or later only), then 
sequentially for the physical cores.  ${A} is 0 for the first socket, then 
max($B of $A)+1 for the next socket, and so on.
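
As a rough illustration of walking that layout, here is a minimal Python 
sketch.  It assumes the temp${B}_input files hold millidegrees Celsius (the 
sample values such as 52000 for about 52 degrees C suggest this), and the 
function name read_coretemp and its root parameter are made up for this 
example; they are not part of any existing tool or of Open MPI.

```python
import glob
import os
import re

# Default location, taken from the path quoted above.
SYSFS_ROOT = "/sys/bus/platform/devices"

# Pulls ${A} and ${B} out of .../coretemp.${A}/temp${B}_input
_NAME_RE = re.compile(r"coretemp\.(\d+)[/\\]temp(\d+)_input$")


def read_coretemp(root=SYSFS_ROOT):
    """Return {(A, B): degrees Celsius} for each coretemp.A/tempB_input file.

    Hypothetical helper for illustration only. Assumes each file contains
    a single integer in millidegrees Celsius.
    """
    readings = {}
    pattern = os.path.join(root, "coretemp.*", "temp*_input")
    for path in glob.glob(pattern):
        match = _NAME_RE.search(path)
        if match is None:
            continue
        try:
            with open(path) as f:
                millidegrees = int(f.read().strip())
        except (OSError, ValueError):
            # Sensor disappeared or returned something unexpected: skip it.
            continue
        key = (int(match.group(1)), int(match.group(2)))
        readings[key] = millidegrees / 1000.0
    return readings


if __name__ == "__main__":
    for (a, b), celsius in sorted(read_coretemp().items()):
        print(f"coretemp.{a} temp{b}: {celsius:.1f} C")
```

If the module is not loaded the glob simply matches nothing and the function 
returns an empty dictionary, so a caller can treat "no sensors" as a normal 
case rather than an error.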

So on the test login node of our 2010-era Nehalem iDataplex you get a file per 
CPU core but nothing for the socket, viz:

[root@merri-test ~]# ls /sys/bus/platform/devices/coretemp.*/*input*
/sys/bus/platform/devices/coretemp.0/temp2_input
/sys/bus/platform/devices/coretemp.0/temp3_input
/sys/bus/platform/devices/coretemp.0/temp4_input
/sys/bus/platform/devices/coretemp.0/temp5_input
/sys/bus/platform/devices/coretemp.4/temp2_input
/sys/bus/platform/devices/coretemp.4/temp3_input
/sys/bus/platform/devices/coretemp.4/temp4_input
/sys/bus/platform/devices/coretemp.4/temp5_input

[root@merri-test ~]# cat /sys/bus/platform/devices/coretemp.*/*input*
52000
52000
52000
53000
59000
55000
58000
56000

On the test login node of our SandyBridge iDataplex, delivered mid-year, we 
get the package as well:

[root@barcoo-test ~]# ls /sys/bus/platform/devices/coretemp.*/*input*
/sys/bus/platform/devices/coretemp.0/temp1_input
/sys/bus/platform/devices/coretemp.0/temp2_input
/sys/bus/platform/devices/coretemp.0/temp3_input
/sys/bus/platform/devices/coretemp.0/temp4_input
/sys/bus/platform/devices/coretemp.0/temp5_input
/sys/bus/platform/devices/coretemp.0/temp6_input
/sys/bus/platform/devices/coretemp.0/temp7_input
/sys/bus/platform/devices/coretemp.6/temp1_input
/sys/bus/platform/devices/coretemp.6/temp2_input
/sys/bus/platform/devices/coretemp.6/temp3_input
/sys/bus/platform/devices/coretemp.6/temp4_input
/sys/bus/platform/devices/coretemp.6/temp5_input
/sys/bus/platform/devices/coretemp.6/temp6_input
/sys/bus/platform/devices/coretemp.6/temp7_input

[root@barcoo-test ~]# cat /sys/bus/platform/devices/coretemp.*/*input*
44000
43000
44000
42000
43000
38000
44000
37000
33000
37000
32000
34000
36000
33000

There's more information in $KERNEL_SOURCE/Documentation/hwmon/coretemp.

Both those systems are running RHEL6, so it should be fairly well supported 
*if* the sysadmin has loaded the modules.

> That might well be do-able, since there's no performance penalty for reading
> such values until you actually read the values (i.e., we don't actively
> monitor these values in OMPI's overall progression engine; they're only
> read when the application invokes an MPI_T read function).

Indeed, these *shouldn't* hang trying to read them. ;-)
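
To make the "no cost until you read" point concrete, here is a small Python 
sketch of an on-demand reader.  It is not how Open MPI or MPI_T implements 
anything; the class name LazyTemperature and its reads counter are invented 
for this example, and it again assumes the file holds millidegrees Celsius.

```python
class LazyTemperature:
    """Hypothetical wrapper: stores a sensor path, touches it only in read()."""

    def __init__(self, path):
        self.path = path  # nothing is opened or polled here
        self.reads = 0    # how many times read() has been called

    def read(self):
        """Open the file now and return degrees Celsius, or None on failure."""
        self.reads += 1
        try:
            with open(self.path) as f:
                return int(f.read().strip()) / 1000.0
        except (OSError, ValueError):
            return None
```

Creating the object costs nothing beyond storing a string; the sysfs file is 
only opened when the application explicitly asks for a value.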

cheers,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
