On Wed, 4 Dec 2013 11:39:29 AM Jeff Squyres wrote:
> On Dec 3, 2013, at 7:54 PM, Christopher Samuel <sam...@unimelb.edu.au> wrote:
>
> > Would it make any sense to expose system/environmental/thermal
> > information to the application via MPI_T ?
>
> Hmm.  Interesting idea.

Phew. :-)

> Is the best way to grab such stuff via IPMI?

I don't think so. That would mean either giving the process permission
to access /dev/ipmi* or talking over the network to the adapter, neither
of which is likely to be desirable (or even possible; our iDataplex IMMs
are not accessible from the compute nodes).

However, using the coretemp kernel module means you get access to at
least the CPU temperatures on Intel systems:

/sys/bus/platform/devices/coretemp.${A}/temp${B}_input

These files contain the temperature in millidegrees Celsius (1000ths of
a degree) and are world readable. You also get access to the various
thermal trip points and alarms.

The ${B} value is 1 for the CPU package (SandyBridge or later only),
then counts up sequentially for the physical cores.

${A} is 0 for the first socket, then it jumps for the next socket: in
the listings below the second socket is coretemp.4 on the Nehalem node
(4 cores per socket) and coretemp.6 on the SandyBridge node (6 cores
per socket).

So on the test login node of our 2010-era Nehalem iDataplex you get a
file per CPU core but nothing for the socket, viz:

[root@merri-test ~]# ls /sys/bus/platform/devices/coretemp.*/*input*
/sys/bus/platform/devices/coretemp.0/temp2_input
/sys/bus/platform/devices/coretemp.0/temp3_input
/sys/bus/platform/devices/coretemp.0/temp4_input
/sys/bus/platform/devices/coretemp.0/temp5_input
/sys/bus/platform/devices/coretemp.4/temp2_input
/sys/bus/platform/devices/coretemp.4/temp3_input
/sys/bus/platform/devices/coretemp.4/temp4_input
/sys/bus/platform/devices/coretemp.4/temp5_input

[root@merri-test ~]# cat /sys/bus/platform/devices/coretemp.*/*input*
52000
52000
52000
53000
59000
55000
58000
56000

On the test login node of our SandyBridge iDataplex, delivered mid-year,
we get the package as well:

[root@barcoo-test ~]# ls /sys/bus/platform/devices/coretemp.*/*input*
/sys/bus/platform/devices/coretemp.0/temp1_input
/sys/bus/platform/devices/coretemp.0/temp2_input
/sys/bus/platform/devices/coretemp.0/temp3_input
/sys/bus/platform/devices/coretemp.0/temp4_input
/sys/bus/platform/devices/coretemp.0/temp5_input
/sys/bus/platform/devices/coretemp.0/temp6_input
/sys/bus/platform/devices/coretemp.0/temp7_input
/sys/bus/platform/devices/coretemp.6/temp1_input
/sys/bus/platform/devices/coretemp.6/temp2_input
/sys/bus/platform/devices/coretemp.6/temp3_input
/sys/bus/platform/devices/coretemp.6/temp4_input
/sys/bus/platform/devices/coretemp.6/temp5_input
/sys/bus/platform/devices/coretemp.6/temp6_input
/sys/bus/platform/devices/coretemp.6/temp7_input

[root@barcoo-test ~]# cat /sys/bus/platform/devices/coretemp.*/*input*
44000
43000
44000
42000
43000
38000
44000
37000
33000
37000
32000
34000
36000
33000

There's more information in $KERNEL_SOURCE/Documentation/hwmon/coretemp.

Both those systems are running RHEL6, so it should be fairly well
supported *if* the sysadmin has loaded the module.

> That might well be do-able, since there's no performance penalty for reading
> such values until you actually read the values (i.e., we don't actively
> monitor these values in OMPI's overall progression engine; they're only
> read when the application invokes an MPI_T read function).

Indeed, reading these *shouldn't* hang. ;-)

cheers,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
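
P.S. To illustrate how little is needed to read the coretemp files
described above, here is a minimal Python sketch. It is not how Open MPI
or MPI_T would do it; the function name read_coretemp and its base
parameter are made up for this example, and it assumes the sysfs layout
shown in the listings (other kernels may place the files elsewhere) and
that the values are integers in millidegrees Celsius.

```python
import glob
import os
import re


def read_coretemp(base="/sys/bus/platform/devices"):
    """Return {(device_index, sensor_index): degrees_celsius}.

    Hypothetical helper: assumes files named
    <base>/coretemp.<A>/temp<B>_input holding millidegrees Celsius.
    """
    readings = {}
    pattern = os.path.join(base, "coretemp.*", "temp*_input")
    for path in glob.glob(pattern):
        match = re.search(r"coretemp\.(\d+)/temp(\d+)_input$", path)
        if match is None:
            continue
        try:
            with open(path) as handle:
                millidegrees = int(handle.read().strip())
        except (OSError, ValueError):
            # Skip sensors that vanish or return something unparsable.
            continue
        key = (int(match.group(1)), int(match.group(2)))
        readings[key] = millidegrees / 1000.0
    return readings


if __name__ == "__main__":
    for (device, sensor), celsius in sorted(read_coretemp().items()):
        print(f"coretemp.{device} temp{sensor}: {celsius:.1f} C")
```

If the coretemp module is not loaded the glob matches nothing and the
function simply returns an empty dictionary, which mirrors the "only if
the sysadmin has loaded the module" caveat above.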