[hwloc-devel] v1.11.0

2015-06-13 Thread Ralph Castain
Hi folks

I’ve been working on updating the OMPI hwloc code to the 1.11 version. I 
reported the config issue via Jeff, so I updated to the latest nightly 
tarball of 1.11 to pick up that change. I’m now able to configure, but hit one 
last required change to make it build:

diff --git a/opal/mca/hwloc/hwloc1110/hwloc/src/topology.c b/opal/mca/hwloc/hwloc1110/hwloc/src/topology.c
index 8d129d0..01be274 100644
--- a/opal/mca/hwloc/hwloc1110/hwloc/src/topology.c
+++ b/opal/mca/hwloc/hwloc1110/hwloc/src/topology.c
@@ -2599,7 +2599,7 @@ next_noncpubackend:
   && strcmp(topology->backends->component->name, "xml")) {
 char *value;
 /* add a hwlocVersion */
-hwloc_obj_add_info(topology->levels[0][0], "hwlocVersion", VERSION);
+hwloc_obj_add_info(topology->levels[0][0], "hwlocVersion", HWLOC_VERSION);
 /* add a ProcessName */
 value = hwloc_progname(topology);
 if (value) {


I’m not sure if this is a prefixing issue when embedded, or a more general 
problem. Any thoughts?
Ralph
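
One plausible reading of the failure, not confirmed in this thread, is a macro namespace clash: an embedding project usually owns the unprefixed VERSION macro, so an embedded library is safer with its own prefixed name. The sketch below is self-contained and only illustrates that idea. The fallback value "1.11.0" and the host value are placeholders chosen for the example, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the embedding project's config header, which (by
 * assumption) defines an unprefixed VERSION for the host project. */
#define VERSION "host-project-0.0"

/* Assumption: a real build gets HWLOC_VERSION from the library's own
 * generated headers. The fallback only lets this sketch compile alone. */
#ifndef HWLOC_VERSION
#define HWLOC_VERSION "1.11.0"
#endif

/* Value an embedded copy would want to record as "hwlocVersion". */
const char *embedded_hwloc_version(void)
{
    const char *v = HWLOC_VERSION;   /* prefixed, cannot collide */
    assert(v != NULL);
    return v;
}

/* What the unprefixed macro resolves to in the same translation unit. */
const char *host_project_version(void)
{
    return VERSION;
}
```

With both macros visible in one translation unit, only the prefixed one is guaranteed to describe the embedded library.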



Re: [hwloc-devel] hwloc failures

2014-11-19 Thread Ralph Castain
FWIW: I just downloaded and built 1.10.0 without problems on Mac Yosemite using 
GCC. I have the Darwin ports libxml2 installed - version 2.9.2.


> On Nov 19, 2014, at 1:28 PM, Brice Goglin  wrote:
> 
> Which version of libxml2 do you have?
> 
> Brice
> 
> 
> 
> 
> On 19/11/2014 at 22:26, Balaji, Pavan wrote:
>> I’m seeing the following failure with hwloc on the mac (yosemite):
>> 
>>  CC   topology-xml-libxml.lo
>> ../../../../../../../../../mpich/src/pm/hydra/tools/topo/hwloc/hwloc/src/topology-xml-libxml.c:17:27:
>>  fatal error: libxml/parser.h: No such file or directory
>> #include <libxml/parser.h>
>> 
>> This is GNU compilers and the latest hwloc release.  I have libxml2 
>> installed.
>> 
>> Do I need to install a different package?  Why is configure not able to 
>> detect it?  What files can I send to help diagnose this?
>> 
>>  — Pavan
>> 
>> --
>> Pavan Balaji
>> http://www.mcs.anl.gov/~balaji
>> 
>> ___
>> hwloc-devel mailing list
>> hwloc-de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/hwloc-devel/2014/11/4296.php
> 
> ___
> hwloc-devel mailing list
> hwloc-de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/hwloc-devel/2014/11/4297.php



Re: [hwloc-devel] Using hwloc to detect Hard Disks

2014-09-23 Thread Ralph Castain
True - but we intend to collect the inventory as root anyway. :-)

On Sep 23, 2014, at 1:50 PM, Christopher Samuel <sam...@unimelb.edu.au> wrote:

> On 24/09/14 00:57, Ralph Castain wrote:
> 
>> Memory info is available from lshw, though it is GPL code:
> 
> FWIW on this laptop (Intel Haswell) lshw only reports DIMM info when run
> as root, which I suspect points to it accessing DMI information
> via /dev/mem.
> 
> Using strace supports this:
> 
> 3405  open("/dev/mem", O_RDONLY)= -1 EACCES (Permission denied)
> 
> FWIW dmidecode does the same.
> 
> samuel@haswell:~$ dmidecode
> # dmidecode 2.12
> /dev/mem: Permission denied
> 
> All the best,
> Chris
> -- 
> Christopher Samuel        Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/  http://twitter.com/vlsci
> 
> ___
> hwloc-devel mailing list
> hwloc-de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/hwloc-devel/2014/09/4238.php
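
The strace line quoted in this message shows open() on /dev/mem failing with EACCES for a non-root user. A collector could probe for that condition up front and report it instead of silently returning no DIMM data. This is a minimal sketch using only POSIX calls; the function name probe_readable is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Try to open `path` read-only.
 * Returns 0 on success, otherwise the errno value from open()
 * (for example EACCES when the caller lacks permission). */
int probe_readable(const char *path)
{
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return errno;
    close(fd);
    return 0;
}
```

Calling probe_readable("/dev/mem") as an unprivileged user would be expected to return EACCES on a system like the one in the strace output, though the exact error can differ with kernel configuration.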



Re: [hwloc-devel] Using hwloc to detect Hard Disks

2014-09-23 Thread Ralph Castain
Memory info is available from lshw, though it is GPL code:

 *-bank:0
  description: DIMM Synchronous 1333 MHz (0.8 ns)
  product: M393B1K70DH0-YH9
  vendor: 0x80CE
  physical id: 0
  serial: 0x85B5FED3
  slot: DIMM_A1
  size: 8GiB
  width: 64 bits
  clock: 1333MHz (0.8ns)

Not sure how they are getting it, but I can have someone look at the code to 
see where the info is being obtained.
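
For a quick inventory, the "key: value" lines shown above can be picked apart without knowing how lshw obtains them. The sketch below assumes the output keeps exactly that layout (leading whitespace, key, colon, value), which is an assumption about lshw's text format rather than a documented guarantee. The function name lshw_field is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* If `line` has the form "  <key>: <value>", copy <value> into `out`
 * (truncated to fit) and return 1. Return 0 if the key does not match. */
int lshw_field(const char *line, const char *key, char *out, size_t outlen)
{
    size_t klen, vlen;

    assert(line && key && out && outlen > 0);
    while (isspace((unsigned char)*line))
        line++;                              /* skip indentation */
    klen = strlen(key);
    if (strncmp(line, key, klen) != 0 || line[klen] != ':')
        return 0;
    line += klen + 1;
    while (isspace((unsigned char)*line))
        line++;                              /* skip space after ':' */
    vlen = strcspn(line, "\r\n");            /* value runs to end of line */
    if (vlen >= outlen)
        vlen = outlen - 1;
    memcpy(out, line, vlen);
    out[vlen] = '\0';
    return 1;
}
```

Applied to the listing above, the key "slot" would yield "DIMM_A1" and the key "size" would yield "8GiB".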


On Sep 22, 2014, at 8:54 PM, Ralph Castain <r...@open-mpi.org> wrote:

> 
> On Sep 22, 2014, at 4:58 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
> wrote:
> 
>> On Sep 22, 2014, at 6:55 PM, Brice Goglin <brice.gog...@inria.fr> wrote:
>> 
>>>> HWLOC already provides similar info for processors and mother boards, so 
>>>> it seemed a natural extension of current capabilities to provide it for 
>>>> other system elements.
>>> 
>>> Disk vendor/model is easy to add from sysfs on Linux. I don't know where
>>> to find the serial number. Spindle speed may require more than just
>>> sysfs. Do you have more info on how to get these attributes?
>>> 
>>> For memory, we currently have a single memory object for all DIMMs of a
>>> single NUMA node. Adding multiple objects may not be useful, but adding
>>> many serials to a single NUMA object may be ugly.
>>> There is some information about physical memory in
>>> /sys/devices/system/node/node0/memory* but it doesn't correspond to
>>> DIMMs (I have 135 of them on my laptop for only 2 SODIMMs). dmidecode
>>> gets DIMM info somehow.
>> 
>> Back in Nehalem days, it wasn't possible to map Linux kernel "physical" 
>> memory back to individual DIMMs (because the BIOS could/would introduce 
>> another layer of kernel<-->DIMM mapping that the kernel might not be aware 
>> of).
>> 
>> Has that changed?
> 
> I don't think so, no - at least, I'm not sure you can map a specific DIMM to 
> a specific address within a NUMA region. However, we can at least add the 
> DIMMs to the root-object attributes. In addition, you can certainly map a 
> DIMM to a specific DIMM socket, and I believe that means you can map it to a 
> given NUMA region even if you can't say *where* it is within that region. 
> Have to verify that.
> 
> 
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to: 
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> ___
>> hwloc-devel mailing list
>> hwloc-de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/hwloc-devel/2014/09/4229.php



Re: [hwloc-devel] Using hwloc to detect Hard Disks

2014-09-23 Thread Ralph Castain

On Sep 22, 2014, at 4:58 PM, Jeff Squyres (jsquyres)  wrote:

> On Sep 22, 2014, at 6:55 PM, Brice Goglin  wrote:
> 
>>> HWLOC already provides similar info for processors and mother boards, so it 
>>> seemed a natural extension of current capabilities to provide it for other 
>>> system elements.
>> 
>> Disk vendor/model is easy to add from sysfs on Linux. I don't know where
>> to find the serial number. Spindle speed may require more than just
>> sysfs. Do you have more info on how to get these attributes?
>> 
>> For memory, we currently have a single memory object for all DIMMs of a
>> single NUMA node. Adding multiple objects may not be useful, but adding
>> many serials to a single NUMA object may be ugly.
>> There is some information about physical memory in
>> /sys/devices/system/node/node0/memory* but it doesn't correspond to
>> DIMMs (I have 135 of them on my laptop for only 2 SODIMMs). dmidecode
>> gets DIMM info somehow.
> 
> Back in Nehalem days, it wasn't possible to map Linux kernel "physical" 
> memory back to individual DIMMs (because the BIOS could/would introduce 
> another layer of kernel<-->DIMM mapping that the kernel might not be aware 
> of).
> 
> Has that changed?

I don't think so, no - at least, I'm not sure you can map a specific DIMM to a 
specific address within a NUMA region. However, we can at least add the DIMMs 
to the root-object attributes. In addition, you can certainly map a DIMM to a 
specific DIMM socket, and I believe that means you can map it to a given NUMA 
region even if you can't say *where* it is within that region. Have to verify 
that.
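
On the disk side, Brice's remark that vendor and model are easy to get from sysfs can be sketched as a one-line file read. The helper below is generic and only reads the first line of a file and trims trailing blanks. The paths mentioned in the note after the code are an assumption about the sysfs layout for SCSI-class disks, and the device name is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Read the first line of `path` into `out`, stripping the trailing
 * newline and padding spaces. Returns 0 on success, -1 on any error. */
int read_attr_file(const char *path, char *out, size_t outlen)
{
    FILE *f;
    size_t n;

    assert(path && out && outlen > 0);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(out, (int)outlen, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    n = strlen(out);
    while (n > 0 && (out[n - 1] == '\n' || out[n - 1] == ' '))
        out[--n] = '\0';
    return 0;
}
```

A caller might try read_attr_file("/sys/block/sda/device/vendor", ...) and the matching "model" file, where "sda" stands for whichever block device is being inspected. Serial number and spindle speed are not covered here, in line with Brice's note that they may need more than sysfs.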


> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> hwloc-devel mailing list
> hwloc-de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/hwloc-devel/2014/09/4229.php



Re: [hwloc-devel] Interesting warning

2014-09-12 Thread Ralph Castain
Yep, that worked!

On Sep 12, 2014, at 1:30 AM, Samuel Thibault <samuel.thiba...@inria.fr> wrote:

> Hello,
> 
> Ralph Castain wrote on Wed 10 Sep 2014 17:41:17 -0700:
>> Just got this from Clang 3.4.2 on Linux x86-64:
>> 
>> In file included from topology-x86.c:23:
>> /home/common/openmpi/svn-trunk/opal/mca/hwloc/hwloc191/hwloc/include/private/
>> cpuid-x86.h:67:3: warning: extension used [-Wlanguage-extension-token]
>>  asm(
>>  ^
>> 1 warning generated.
>> 
>> 
>> Guess it doesn't like that assembler in there
> 
> Could you try the attached patch?
> 
> Samuel
> 



[hwloc-devel] Interesting warning

2014-09-10 Thread Ralph Castain
Just got this from Clang 3.4.2 on Linux x86-64:

In file included from topology-x86.c:23:
/home/common/openmpi/svn-trunk/opal/mca/hwloc/hwloc191/hwloc/include/private/cpuid-x86.h:67:3:
 warning: extension used [-Wlanguage-extension-token]
  asm(
  ^
1 warning generated.


Guess it doesn't like that assembler in there
Ralph
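
The patch that resolved this is not included in the archive, so the following is only one common way to avoid this class of warning, not necessarily what was applied. Clang reports the plain asm keyword as a language extension in pedantic builds, while the reserved spelling __asm__ is normally accepted without that diagnostic. The sketch restricts itself to x86-64 and falls back to zeros elsewhere. Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* 1 if this build can execute the cpuid instruction, else 0. */
int cpuid_available(void)
{
#if defined(__x86_64__)
    return 1;
#else
    return 0;
#endif
}

/* Query one cpuid leaf (subleaf 0). Uses the __asm__ spelling rather
 * than the bare asm keyword. On other architectures all outputs are 0. */
void cpuid_leaf(unsigned leaf, unsigned *eax, unsigned *ebx,
                unsigned *ecx, unsigned *edx)
{
    assert(eax && ebx && ecx && edx);
#if defined(__x86_64__)
    __asm__ __volatile__("cpuid"
                         : "=a"(*eax), "=b"(*ebx), "=c"(*ecx), "=d"(*edx)
                         : "a"(leaf), "c"(0u));
#else
    (void)leaf;
    *eax = *ebx = *ecx = *edx = 0;
#endif
}
```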



Re: [hwloc-devel] GIT: hwloc branch master updated. 0e6fe307c10d47efee3fb95c50aee9c0f01bc8ec

2014-03-29 Thread Ralph Castain
Jeff just left today for a 1-week vacation. However, this came up on the OMPI 
mailing list - it turns out that some Linux distros automatically set LS_COLORS 
in your environment, via their default dot files, when running old versions of 
csh/tcsh, and it can cause problems with the script. So just ensuring it isn't set 
solves the problem.
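
The same workaround can be applied from a C launcher rather than from the Makefile. The sketch below mirrors the `env LS_COLORS=` form used in the commit quoted below by setting the variable to an empty string before running a command through the shell. It changes the environment of the calling process as well, which is a simplification. The function name is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Run `command` via the shell with LS_COLORS set to the empty string.
 * Returns the command's exit status, or -1 if it could not be run
 * or did not exit normally. */
int run_with_empty_ls_colors(const char *command)
{
    int status;

    assert(command != NULL);
    if (setenv("LS_COLORS", "", 1) != 0)
        return -1;
    status = system(command);
    if (status == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```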


On Mar 29, 2014, at 7:59 AM, Brice Goglin  wrote:

> Jeff,
> Where does this LS_COLORS variable come from? Who is setting it?
> Brice
> 
> 
> 
> On 27/03/2014 at 11:45, MPI Team wrote:
>> The branch, master has been updated
>>   via  0e6fe307c10d47efee3fb95c50aee9c0f01bc8ec (commit)
>>  from  00f85033d269e2c312370bb24043f92a92dff7e3 (commit)
>> 
>> Those revisions listed above that are new to this repository have
>> not appeared on any other notification email; so we list those
>> revisions in full, below.
>> 
>> - Log -
>> https://github.com/open-mpi/hwloc/commit/0e6fe307c10d47efee3fb95c50aee9c0f01bc8ec
>> 
>> commit 0e6fe307c10d47efee3fb95c50aee9c0f01bc8ec
>> Author: Jeff Squyres 
>> Date:   Thu Mar 27 06:28:45 2014 -0400
>> 
>>BUILD: fix "make dist" failure on some linux distro with old csh/tcsh
>> 
>>On some linux distro (sles11sp2) csh fails to parse $LS_COLORS and
>>borks with error: Unknown colorls variable `mh'.
>> 
>>The workaround is to unset LS_COLORS before calling to csh script.
>> ---
>> Makefile.am | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>> 
>> diff --git a/Makefile.am b/Makefile.am
>> index ca9c00c..34d0aa2 100644
>> --- a/Makefile.am
>> +++ b/Makefile.am
>> @@ -1,6 +1,6 @@
>> # Copyright © 2009-2014 Inria.  All rights reserved.
>> # Copyright © 2009  Université Bordeaux 1
>> -# Copyright © 2009-2010 Cisco Systems, Inc.  All rights reserved.
>> +# Copyright © 2009-2014 Cisco Systems, Inc.  All rights reserved.
>> # See COPYING in top-level directory.
>> 
>> # Note that the -I directory must *exactly* match what was specified
>> @@ -48,7 +48,7 @@ endif
>> 
>> if HWLOC_BUILD_STANDALONE
>> dist-hook:
>> -csh "$(top_srcdir)/config/distscript.csh" "$(top_srcdir)" "$(distdir)" 
>> "$(HWLOC_VERSION)"
>> +env LS_COLORS= csh "$(top_srcdir)/config/distscript.csh" 
>> "$(top_srcdir)" "$(distdir)" "$(HWLOC_VERSION)"
>> endif HWLOC_BUILD_STANDALONE
>> 
>> #
>> 
>> ---
>> 
>> Summary of changes:
>> Makefile.am | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>> 
>> 
>> hooks/post-receive
> 



Re: [hwloc-devel] Attribute request

2014-01-29 Thread Ralph Castain
I'd prefer your first option - it's easy enough to check the info objects for 
existence of a particular attribute.
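
To make the trade-off concrete, here is a small sketch of both lookups. It uses a stand-in struct info_pair rather than hwloc's own types, so everything named here is hypothetical; in hwloc itself the by-name lookup is what hwloc_obj_get_info_by_name() is for, as far as I recall. The first function covers option 1 (one attribute per field, check for existence). The second shows the extra parsing that option 2 would need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for a name/value info entry (not the hwloc type). */
struct info_pair { const char *name; const char *value; };

/* Option 1: return the value for `name`, or NULL if the attribute
 * does not exist on this object. */
const char *info_lookup(const struct info_pair *infos, size_t count,
                        const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < count; i++)
        if (strcmp(infos[i].name, name) == 0)
            return infos[i].value;
    return NULL;
}

/* Option 2: pull one number out of "family=6;model=45".
 * Accepts decimal or 0x-prefixed values. Returns 0 on success, -1 if
 * the key is missing or its value is not a number. */
int packed_subfield(const char *packed, const char *key, long *out)
{
    size_t klen;

    assert(packed && key && out);
    klen = strlen(key);
    while (*packed) {
        size_t len = strcspn(packed, ";");   /* length of this segment */
        if (len > klen && strncmp(packed, key, klen) == 0
            && packed[klen] == '=') {
            const char *start = packed + klen + 1;
            char *end;
            long v = strtol(start, &end, 0);
            if (end != start && (*end == ';' || *end == '\0')) {
                *out = v;
                return 0;
            }
            return -1;
        }
        packed += len;
        if (*packed == ';')
            packed++;
    }
    return -1;
}
```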

On Jan 29, 2014, at 1:12 AM, Brice Goglin <brice.gog...@inria.fr> wrote:

> Assuming people will confirm that ARM information isn't so simple, I wonder 
> where it's better to put architecture specific fields. With the proposed 
> solution, Intel and ARM would be different:
> Architecture=x86_64
> CPUVendor=GenuineIntel
> CPUModel=Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
> CPUModelNumber=45
> CPUFamilyNumber=6
> and
> Architecture=armv7l
> CPUVendor=cardhu
> CPUModel=ARMv7 Processor rev 9 (v7l)
> CPUImplementer=0x41
> CPUArchitecture=7
> CPUVariant=0x2
> CPUPart=0xc09
> CPURevision=9 
> 
> We could also merge those arch-specific into a single generic one:
> Architecture=x86_64
> CPUVendor=GenuineIntel
> CPUModel=Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
> CPUModelNumber=family=6;model=45
> and
> Architecture=armv7l
> CPUVendor=cardhu
> CPUModel=ARMv7 Processor rev 9 (v7l)
> CPUModelNumber=implementer=0x41;architecture=7;variant=0x2;part=0xc09;revision=9
> 
> The drawback is that you'd have to parse CPUModelNumber to extract family and 
> model.
> 
> I am not sure which one is best.
> 
> Brice
> 
> 
> 
> 
> 
> On 28/01/2014 at 00:09, Brice Goglin wrote:
>> Hello,
>> I have some code that seems to work. Here's what it reports below. Does that 
>> look ok to you?
>> I had to modify quite a lot of things to make the parsing of /proc/cpuinfo 
>> more robust (the code is basically arch-specific now), so I am not sure 
>> we'll be able to backport this to OMPI.
>> Brice
>> 
>> 
>> * Sandy-Bridge Xeon E5 (Stampede)
>> CPUVendor=GenuineIntel
>> CPUModel=Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
>> CPUModelNumber=45
>> CPUFamilyNumber=6
>> * Old Nehalem-EX
>> CPUVendor=GenuineIntel
>> CPUModel=Intel(R) Xeon(R) CPU   E7540  @ 2.00GHz
>> CPUModelNumber=46
>> CPUFamilyNumber=6
>> * Itanium
>> CPUVendor=GenuineIntel
>> CPUModel=Dual-Core Intel(R) Itanium(R) Processor 9140N
>> CPUModelNumber=1
>> CPUFamilyNumber=32
>> * AMD
>> CPUVendor=AuthenticAMD
>> CPUModel=Dual Core AMD Opteron(tm) Processor 865
>> CPUModelNumber=33
>> CPUFamilyNumber=15
>> * MIC (Stampede)
>> CPUVendor=GenuineIntel
>> CPUModel=0b/01
>> CPUModelNumber=1
>> CPUFamilyNumber=11
>> 
>> 
>> 
>> 
>> On 23/01/2014 at 19:50, Ralph Castain wrote:
>>> That would be perfect! Thanks
>>> 
>>> On Jan 23, 2014, at 10:27 AM, Brice Goglin <brice.gog...@inria.fr> wrote:
>>> 
>>>> Should be easy on Linux, sure.
>>>> The model name is already known as CPUModel in hwloc.
>>>> We should likely add CPUVendor (would be GenuineIntel or AuthenticAMD), 
>>>> CPUFamily (or CPUFamilyNumber if there's a name for these families?) and 
>>>> CPUModelNumber ?
>>>> 
>>>> Brice
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On 23/01/2014 at 19:09, Ralph Castain wrote:
>>>>> Hi folks
>>>>> 
>>>>> Looking at the current topology info, I see you capture the model name 
>>>>> for the socket, but not a couple of other key things Intel could use:
>>>>> 
>>>>> cpu family  : 6
>>>>> model   : 44
>>>>> model name  : Intel(R) Xeon(R) CPU   E5645  @ 2.40GHz
>>>>> 
>>>>> 
>>>>> Both the cpu family and model are important to us - any issue with adding 
>>>>> them to the "infos" array?
>>>>> 
>>>>> Ralph
>>>>> 
>>>>> 
>>>>> 



Re: [hwloc-devel] Attribute request

2014-01-27 Thread Ralph Castain
Hello,
I have some code that seems to work. Here's what it reports below. Does that 
look ok to you?
I had to modify quite a lot of things to make the parsing of /proc/cpuinfo more 
robust (the code is basically arch-specific now), so I am not sure we'll be 
able to backport this to OMPI.
Brice


* Sandy-Bridge Xeon E5 (Stampede)
CPUVendor=GenuineIntel
CPUModel=Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
CPUModelNumber=45
CPUFamilyNumber=6
* Old Nehalem-EX
CPUVendor=GenuineIntel
CPUModel=Intel(R) Xeon(R) CPU   E7540  @ 2.00GHz
CPUModelNumber=46
CPUFamilyNumber=6
* Itanium
CPUVendor=GenuineIntel
CPUModel=Dual-Core Intel(R) Itanium(R) Processor 9140N
CPUModelNumber=1
CPUFamilyNumber=32
* AMD
CPUVendor=AuthenticAMD
CPUModel=Dual Core AMD Opteron(tm) Processor 865
CPUModelNumber=33
CPUFamilyNumber=15
* MIC (Stampede)
CPUVendor=GenuineIntel
CPUModel=0b/01
CPUModelNumber=1
CPUFamilyNumber=11




On 23/01/2014 at 19:50, Ralph Castain wrote:
> That would be perfect! Thanks
> 
> On Jan 23, 2014, at 10:27 AM, Brice Goglin <brice.gog...@inria.fr> wrote:
> 
>> Should be easy on Linux, sure.
>> The model name is already known as CPUModel in hwloc.
>> We should likely add CPUVendor (would be GenuineIntel or AuthenticAMD), 
>> CPUFamily (or CPUFamilyNumber if there's a name for these families?) and 
>> CPUModelNumber ?
>> 
>> Brice
>> 
>> 
>> 
>> 
>> On 23/01/2014 at 19:09, Ralph Castain wrote:
>>> Hi folks
>>> 
>>> Looking at the current topology info, I see you capture the model name for 
>>> the socket, but not a couple of other key things Intel could use:
>>> 
>>> cpu family  : 6
>>> model   : 44
>>> model name  : Intel(R) Xeon(R) CPU   E5645  @ 2.40GHz
>>> 
>>> 
>>> Both the cpu family and model are important to us - any issue with adding 
>>> them to the "infos" array?
>>> 
>>> Ralph
>>> 
>>> 
>>> 

Re: [hwloc-devel] Attribute request

2014-01-23 Thread Ralph Castain
That would be perfect! Thanks

On Jan 23, 2014, at 10:27 AM, Brice Goglin <brice.gog...@inria.fr> wrote:

> Should be easy on Linux, sure.
> The model name is already known as CPUModel in hwloc.
> We should likely add CPUVendor (would be GenuineIntel or AuthenticAMD), 
> CPUFamily (or CPUFamilyNumber if there's a name for these families?) and 
> CPUModelNumber ?
> 
> Brice
> 
> 
> 
> 
> On 23/01/2014 at 19:09, Ralph Castain wrote:
>> Hi folks
>> 
>> Looking at the current topology info, I see you capture the model name for 
>> the socket, but not a couple of other key things Intel could use:
>> 
>> cpu family  : 6
>> model   : 44
>> model name  : Intel(R) Xeon(R) CPU   E5645  @ 2.40GHz
>> 
>> 
>> Both the cpu family and model are important to us - any issue with adding 
>> them to the "infos" array?
>> 
>> Ralph
>> 
>> 
>> 



[hwloc-devel] Attribute request

2014-01-23 Thread Ralph Castain
Hi folks

Looking at the current topology info, I see you capture the model name for the 
socket, but not a couple of other key things Intel could use:

cpu family  : 6
model   : 44
model name  : Intel(R) Xeon(R) CPU   E5645  @ 2.40GHz


Both the cpu family and model are important to us - any issue with adding them 
to the "infos" array?

Ralph
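
For reference, pulling those two numbers out of /proc/cpuinfo-style text is straightforward as long as "model" is not confused with "model name". This is a minimal sketch, not hwloc's parser. The function name cpuinfo_number is hypothetical, and the only format assumption is "key, optional blanks or tabs, colon, number" as in the excerpt above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Find the first line of the form "<key> : <number>" in `text`.
 * The key must be followed only by blanks/tabs and then ':', so
 * "model" does not match a "model name" line.
 * Returns 0 and stores the number in *out, or -1 if not found. */
int cpuinfo_number(const char *text, const char *key, long *out)
{
    size_t klen;

    assert(text && key && out);
    klen = strlen(key);
    while (*text) {
        size_t linelen = strcspn(text, "\n");
        if (linelen > klen && strncmp(text, key, klen) == 0) {
            const char *p = text + klen;
            while (*p == ' ' || *p == '\t')
                p++;
            if (*p == ':') {
                char *end;
                long v = strtol(p + 1, &end, 10);
                if (end != p + 1) {
                    *out = v;
                    return 0;
                }
            }
        }
        text += linelen;
        if (*text == '\n')
            text++;
    }
    return -1;
}
```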



[hwloc-devel] Support for new architecture

2013-11-08 Thread Ralph Castain
Hi folks

We are seeing a new architecture appearing in the very near future, and I'm not 
sure how hwloc will handle it. Consider the following case:

* I have a rack that contains multiple "hosts"

* each host consists of a box/shelf with common support infrastructure in it - 
it has some kind of controller in it, and might have some networking support, 
maybe a pool of memory that can be allocated across the occupants.

* in the host, I have one or more "boards". Each board again has a controller 
in it with some common infrastructure to support its local sockets - might 
include some networking that would look like NICs (though not necessarily on a 
PCIe interface), a board-level memory pool, etc.

* each socket contains one or more die. Each die runs its own instance of an OS 
- probably a lightweight kernel - that can vary between dies (e.g., might have 
a tweaked configuration), and has its own associated memory that will 
physically reside outside the socket. You can think of each die as constituting 
a "shared memory locus" - i.e., processes running on that die can share memory 
between them as it would sit under the same OS instance.

* each die has some number of cores/hwthreads/caches etc.

Note that the sockets are not sitting in some PCIe bus - they appear to be 
directly connected to the overall network just like a "node" would appear 
today. However, there is a definite need for higher layers (RMs and MPIs) to 
understand this overall hierarchy and the "distances" between the individual 
elements.

Any thoughts on how we can support this?
Ralph
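
To make the question concrete, here is a rough sketch of the hierarchy as a plain tree, with a hop count through the lowest common ancestor standing in for "distance". None of this is hwloc API: the level names, struct and functions are hypothetical, and a hop count is only a placeholder for whatever metric the hardware really exposes. The die level is treated as the boundary of one OS instance, as described above.

```c
#include <assert.h>
#include <stddef.h>

enum level { LVL_RACK, LVL_HOST, LVL_BOARD, LVL_SOCKET, LVL_DIE, LVL_CORE };

struct node {
    enum level level;
    const struct node *parent;   /* NULL for the root (the rack) */
};

/* Number of edges between `n` and the root. */
unsigned node_depth(const struct node *n)
{
    unsigned d = 0;

    assert(n != NULL);
    while (n->parent) {
        n = n->parent;
        d++;
    }
    return d;
}

/* Edges on the path a -> lowest common ancestor -> b.
 * Both nodes must belong to the same tree. */
unsigned hop_distance(const struct node *a, const struct node *b)
{
    unsigned da = node_depth(a), db = node_depth(b), hops = 0;

    while (da > db) { a = a->parent; da--; hops++; }
    while (db > da) { b = b->parent; db--; hops++; }
    while (a != b) {
        assert(a->parent && b->parent);
        a = a->parent;
        b = b->parent;
        hops += 2;
    }
    return hops;
}

/* 1 if both nodes sit under the same die, i.e. the same OS instance
 * and shared-memory locus in the description above. */
int same_shared_memory_locus(const struct node *a, const struct node *b)
{
    while (a && a->level != LVL_DIE) a = a->parent;
    while (b && b->level != LVL_DIE) b = b->parent;
    return a != NULL && a == b;
}
```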



[hwloc-devel] Strange difference

2013-10-12 Thread Ralph Castain
Yo guys

I was doing some work that involved traversing the hwloc topo tree, and 
encountered the following odd discrepancy.

hwloc_topology_get_depth  => returns "unsigned"

hwloc_get_type_depth  => returns "int"

Why the difference? Makes it hard sometimes to avoid the "comparison between 
unsigned and signed" warnings when using these functions.
Ralph
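
One way to sidestep the warning in caller code is to handle the negative case first and only then compare as unsigned. The signed return type presumably exists so that hwloc_get_type_depth can report sentinel values such as HWLOC_TYPE_DEPTH_UNKNOWN, which I believe are negative; treat that as an assumption. The helper below is a self-contained sketch and its name is hypothetical.

```c
#include <assert.h>
#include <limits.h>

/* Returns 1 if `type_depth` (signed, possibly a negative sentinel)
 * names a valid level in a topology with `topo_depth` levels. */
int depth_in_range(int type_depth, unsigned topo_depth)
{
    assert(topo_depth <= (unsigned)INT_MAX);
    if (type_depth < 0)          /* sentinel: unknown or ambiguous depth */
        return 0;
    /* Safe: type_depth is non-negative, so the cast preserves its value. */
    return (unsigned)type_depth < topo_depth;
}
```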



Re: [hwloc-devel] xml file load incompatibilities

2013-09-21 Thread Ralph Castain
Okay, I found it - was a sequencing problem in OMPI itself (we "set" the new 
topology too late in the setup sequence). Sorry for false alarm.

Thanks for the help!
Ralph

On Sep 20, 2013, at 11:36 PM, Brice Goglin <brice.gog...@inria.fr> wrote:

> Strange, the backtrace below looks totally crazy, I don't see how debug checks 
> could still pass in that case.
> Any chance you could run that thing under valgrind?
> 
> Brice
> 
> 
> 
> On 21/09/2013 at 01:09, Ralph Castain wrote:
>> Hmmm...nope, not a peep (no extra output at all). Just segfaulted like 
>> before.
>> 
>> On Sep 20, 2013, at 4:06 PM, Brice Goglin <brice.gog...@inria.fr> wrote:
>> 
>>> Try adding HWLOC_DEBUG_CHECK=1 in your environment, it will enable many 
>>> assertions at the end of hwloc_topology_load()
>>> 
>>> Brice
>>> 
>>> 
>>> 
>>> On 21/09/2013 at 01:03, Ralph Castain wrote:
>>>> I didn't try loading it with lstopo - just tried the OMPI trunk. It loads 
>>>> okay, but segfaults when you try to find an object by depth
>>>> 
>>>> #0  0x0001005fe5dc in opal_hwloc172_hwloc_get_obj_by_depth 
>>>> (topology=Cannot access memory at address 0xfff7
>>>> ) at traversal.c:623
>>>> #1  0x000100b6dfaa in opal_hwloc172_hwloc_get_root_obj 
>>>> (topology=Cannot access memory at address 0xfff7
>>>> ) at rmaps_rr_mappers.c:747
>>>> #2  0x000100b6e139 in orte_rmaps_rr_byslot (jdata=Cannot access memory 
>>>> at address 0xff77
>>>> ) at rmaps_rr_mappers.c:774
>>>> #3  0x000100b6d6da in orte_rmaps_rr_map (jdata=Cannot access memory at 
>>>> address 0xff17
>>>> ) at rmaps_rr.c:211
>>>> #4  0x000100353098 in orte_rmaps_base_map_job (fd=Cannot access memory 
>>>> at address 0xfe7b
>>>> ) at base/rmaps_base_map_job.c:320
>>>> #5  0x0001005ce28c in event_process_active_single_queue (base=Cannot 
>>>> access memory at address 0xffe7
>>>> ) at event.c:1367
>>>> #6  0x0001005ce500 in event_process_active (base=Cannot access memory 
>>>> at address 0xffe7
>>>> ) at event.c:1437
>>>> #7  0x0001005ceb71 in opal_libevent2021_event_base_loop (base=Cannot 
>>>> access memory at address 0xffb7
>>>> ) at event.c:1645
>>>> #8  0x0001002c5158 in orterun (argc=Cannot access memory at address 
>>>> 0xfd1b
>>>> ) at orterun.c:3039
>>>> #9  0x0001002c32a4 in main (argc=Cannot access memory at address 
>>>> 0xfffb
>>>> ) at main.c:14
>>>> 
>>>> Looks to me like memory may be getting hosed
>>>> 
>>>> 
>>>> On Sep 20, 2013, at 2:59 PM, Brice Goglin <brice.gog...@inria.fr> wrote:
>>>> 
>>>>> I can't see any segfault. Where does the segfault occur for you? In OMPI 
>>>>> only (or lstopo too)? When loading or when using the topology?
>>>>> 
>>>>> I tried lstopo on that file with and without HWLOC_NO_LIBXML_IMPORT=1 (in 
>>>>> case the bug is in one of XML backends), looks ok.
>>>>> 
>>>>> Brice
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On 20/09/2013 at 23:53, Ralph Castain wrote:
>>>>>> Here are the two files I tried - not from the same machine. The foo.xml 
>>>>>> works, the topo.xml segfaults
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> One of our users reported it from their machine, but I don't have their 
>>>>>> topo file.
>>>>>> 
>>>>>> On Sep 20, 2013, at 2:41 PM, Brice Goglin <brice.gog...@inria.fr> wrote:
>>>>>> 
>>>>>>> Hello,
>>>>>>> I don't see any reason for such an incompatibility. But there are
>>>>>>> many combinations, we can't test everything.
>>>>>>> I can't reproduce that on my machines. Can you send the XML output of
>>>>>>> both versions on one of your machines?
>>>>>>> Brice
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On 20/09/2013 at 23:32, Ralph Castain wrote:
>

Re: [hwloc-devel] xml file load incompatibilities

2013-09-20 Thread Ralph Castain
Hmmm...nope, not a peep (no extra output at all). Just segfaulted like before.

On Sep 20, 2013, at 4:06 PM, Brice Goglin <brice.gog...@inria.fr> wrote:

> Try adding HWLOC_DEBUG_CHECK=1 in your environment, it will enable many 
> assertions at the end of hwloc_topology_load()
> 
> Brice
> 
> 
> 
> On 21/09/2013 at 01:03, Ralph Castain wrote:
>> I didn't try loading it with lstopo - just tried the OMPI trunk. It loads 
>> okay, but segfaults when you try to find an object by depth
>> 
>> #0  0x0001005fe5dc in opal_hwloc172_hwloc_get_obj_by_depth 
>> (topology=Cannot access memory at address 0xfff7
>> ) at traversal.c:623
>> #1  0x000100b6dfaa in opal_hwloc172_hwloc_get_root_obj (topology=Cannot 
>> access memory at address 0xfff7
>> ) at rmaps_rr_mappers.c:747
>> #2  0x000100b6e139 in orte_rmaps_rr_byslot (jdata=Cannot access memory 
>> at address 0xff77
>> ) at rmaps_rr_mappers.c:774
>> #3  0x000100b6d6da in orte_rmaps_rr_map (jdata=Cannot access memory at 
>> address 0xff17
>> ) at rmaps_rr.c:211
>> #4  0x000100353098 in orte_rmaps_base_map_job (fd=Cannot access memory 
>> at address 0xfe7b
>> ) at base/rmaps_base_map_job.c:320
>> #5  0x0001005ce28c in event_process_active_single_queue (base=Cannot 
>> access memory at address 0xffe7
>> ) at event.c:1367
>> #6  0x0001005ce500 in event_process_active (base=Cannot access memory at 
>> address 0xffe7
>> ) at event.c:1437
>> #7  0x0001005ceb71 in opal_libevent2021_event_base_loop (base=Cannot 
>> access memory at address 0xffb7
>> ) at event.c:1645
>> #8  0x0001002c5158 in orterun (argc=Cannot access memory at address 
>> 0xfd1b
>> ) at orterun.c:3039
>> #9  0x0001002c32a4 in main (argc=Cannot access memory at address 
>> 0xfffb
>> ) at main.c:14
>> 
>> Looks to me like memory may be getting hosed
>> 
>> 
>> On Sep 20, 2013, at 2:59 PM, Brice Goglin <brice.gog...@inria.fr> wrote:
>> 
>>> I can't see any segfault. Where does the segfault occur for you? In OMPI 
>>> only (or lstopo too)? When loading or when using the topology?
>>> 
>>> I tried lstopo on that file with and without HWLOC_NO_LIBXML_IMPORT=1 (in 
>>> case the bug is in one of XML backends), looks ok.
>>> 
>>> Brice
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On 20/09/2013 at 23:53, Ralph Castain wrote:
>>>> Here are the two files I tried - not from the same machine. The foo.xml 
>>>> works, the topo.xml segfaults
>>>> 
>>>> 
>>>> 
>>>> 
>>>> One of our users reported it from their machine, but I don't have their 
>>>> topo file.
>>>> 
>>>> On Sep 20, 2013, at 2:41 PM, Brice Goglin <brice.gog...@inria.fr> wrote:
>>>> 
>>>>> Hello,
>>>>> I don't see any reason for such an incompatibility. But there are
>>>>> many combinations, we can't test everything.
>>>>> I can't reproduce that on my machines. Can you send the XML output of
>>>>> both versions on one of your machines?
>>>>> Brice
>>>>> 
>>>>> 
>>>>> 
>>>>> On 20/09/2013 at 23:32, Ralph Castain wrote:
>>>>>> Hi folks
>>>>>> 
>>>>>> I've run across a rather strange behavior. We have two branches in OMPI 
>>>>>> - the devel trunk (using hwloc v1.7.2) and our feature release series 
>>>>>> (using hwloc 1.5.2). I have found the following:
>>>>>> 
>>>>>> * the feature series can correctly load an xml file generated by lstopo 
>>>>>> of versions 1.5 or greater
>>>>>> 
>>>>>> * the devel series can correctly load an xml file generated by lstopo of 
>>>>>> versions 1.7 or greater, but not files generated by prior versions. In 
>>>>>> the latter case, I segfault as soon as I try to use the loaded topology.
>>>>>> 
>>>>>> Any ideas why the discrepancy? Can I at least detect the version used to 
>>>>>> create a file when loading it so I can error out instead of segfaulting?
>>>>>> 
>>>>>> Ralph
>>>>>> 



Re: [hwloc-devel] xml file load incompatibilities

2013-09-20 Thread Ralph Castain
I didn't try loading it with lstopo - just tried the OMPI trunk. It loads okay, 
but segfaults when you try to find an object by depth

#0  0x0001005fe5dc in opal_hwloc172_hwloc_get_obj_by_depth (topology=Cannot 
access memory at address 0xfff7
) at traversal.c:623
#1  0x000100b6dfaa in opal_hwloc172_hwloc_get_root_obj (topology=Cannot 
access memory at address 0xfff7
) at rmaps_rr_mappers.c:747
#2  0x000100b6e139 in orte_rmaps_rr_byslot (jdata=Cannot access memory at 
address 0xff77
) at rmaps_rr_mappers.c:774
#3  0x000100b6d6da in orte_rmaps_rr_map (jdata=Cannot access memory at 
address 0xff17
) at rmaps_rr.c:211
#4  0x000100353098 in orte_rmaps_base_map_job (fd=Cannot access memory at 
address 0xfe7b
) at base/rmaps_base_map_job.c:320
#5  0x0001005ce28c in event_process_active_single_queue (base=Cannot access 
memory at address 0xffe7
) at event.c:1367
#6  0x0001005ce500 in event_process_active (base=Cannot access memory at 
address 0xffe7
) at event.c:1437
#7  0x0001005ceb71 in opal_libevent2021_event_base_loop (base=Cannot access 
memory at address 0xffb7
) at event.c:1645
#8  0x0001002c5158 in orterun (argc=Cannot access memory at address 
0xfd1b
) at orterun.c:3039
#9  0x0001002c32a4 in main (argc=Cannot access memory at address 
0xfffb
) at main.c:14

Looks to me like memory may be getting hosed


On Sep 20, 2013, at 2:59 PM, Brice Goglin <brice.gog...@inria.fr> wrote:

> I can't see any segfault. Where does the segfault occur for you? In OMPI 
> only (or lstopo too)? When loading or when using the topology?
> 
> I tried lstopo on that file with and without HWLOC_NO_LIBXML_IMPORT=1 (in 
> case the bug is in one of XML backends), looks ok.
> 
> Brice
> 
> 
> 
> 
> 
> Le 20/09/2013 23:53, Ralph Castain a écrit :
>> Here are the two files I tried - not from the same machine. The foo.xml 
>> works, the topo.xml segfaults
>> 
>> 
>> 
>> 
>> 
>> One of our users reported it from their machine, but I don't have their topo 
>> file.
>> 
>> On Sep 20, 2013, at 2:41 PM, Brice Goglin <brice.gog...@inria.fr> wrote:
>> 
>>> Hello,
>>> I don't see any reason for such an incompatibility. But there are
>>> many combinations, we can't test everything.
>>> I can't reproduce that on my machines. Can you send the XML output of
>>> both versions on one of your machines?
>>> Brice
>>> 
>>> 
>>> 
>>> Le 20/09/2013 23:32, Ralph Castain a écrit :
>>>> Hi folks
>>>> 
>>>> I've run across a rather strange behavior. We have two branches in OMPI - 
>>>> the devel trunk (using hwloc v1.7.2) and our feature release series (using 
>>>> hwloc 1.5.2). I have found the following:
>>>> 
>>>> * the feature series can correctly load an xml file generated by lstopo of 
>>>> versions 1.5 or greater
>>>> 
>>>> * the devel series can correctly load an xml file generated by lstopo of 
>>>> versions 1.7 or greater, but not files generated by prior versions. In the 
>>>> latter case, I segfault as soon as I try to use the loaded topology.
>>>> 
>>>> Any ideas why the discrepancy? Can I at least detect the version used to 
>>>> create a file when loading it so I can error out instead of segfaulting?
>>>> 
>>>> Ralph
>>>> 



Re: [hwloc-devel] xml file load incompatibilities

2013-09-20 Thread Ralph Castain
Here are the two files I tried - not from the same machine. The foo.xml works, 
the topo.xml segfaults




topo.xml
Description: XML document


foo.xml
Description: XML document


One of our users reported it from their machine, but I don't have their topo 
file.

On Sep 20, 2013, at 2:41 PM, Brice Goglin <brice.gog...@inria.fr> wrote:

> Hello,
> I don't see any reason for such an incompatibility. But there are
> many combinations, we can't test everything.
> I can't reproduce that on my machines. Can you send the XML output of
> both versions on one of your machines?
> Brice
> 
> 
> 
> Le 20/09/2013 23:32, Ralph Castain a écrit :
>> Hi folks
>> 
>> I've run across a rather strange behavior. We have two branches in OMPI - 
>> the devel trunk (using hwloc v1.7.2) and our feature release series (using 
>> hwloc 1.5.2). I have found the following:
>> 
>> * the feature series can correctly load an xml file generated by lstopo of 
>> versions 1.5 or greater
>> 
>> * the devel series can correctly load an xml file generated by lstopo of 
>> versions 1.7 or greater, but not files generated by prior versions. In the 
>> latter case, I segfault as soon as I try to use the loaded topology.
>> 
>> Any ideas why the discrepancy? Can I at least detect the version used to 
>> create a file when loading it so I can error out instead of segfaulting?
>> 
>> Ralph
>> 



[hwloc-devel] xml file load incompatibilities

2013-09-20 Thread Ralph Castain
Hi folks

I've run across a rather strange behavior. We have two branches in OMPI - the 
devel trunk (using hwloc v1.7.2) and our feature release series (using hwloc 
1.5.2). I have found the following:

* the feature series can correctly load an xml file generated by lstopo of 
versions 1.5 or greater

* the devel series can correctly load an xml file generated by lstopo of 
versions 1.7 or greater, but not files generated by prior versions. In the 
latter case, I segfault as soon as I try to use the loaded topology.

Any ideas why the discrepancy? Can I at least detect the version used to create 
a file when loading it so I can error out instead of segfaulting?

Ralph
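
On the version-detection question above, one possible approach is to scan the
XML buffer before handing it to hwloc. This is only a sketch, assuming the file
records the exporting release as an info attribute named hwlocVersion on its
root object; files that lack it simply yield no match. The helper name
find_hwloc_version is hypothetical and not part of the hwloc API, and the scan
assumes name="..." comes before value="..." inside the element.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not an hwloc function): copy the value of the first
 * <info name="hwlocVersion" value="..."/> element found in xml into out.
 * Returns 0 on success, -1 if the attribute is absent or out is too small. */
static int find_hwloc_version(const char *xml, char *out, size_t outlen)
{
    static const char key[] = "name=\"hwlocVersion\"";
    static const char val[] = "value=\"";
    const char *p, *close, *end;
    size_t len;

    assert(xml != NULL && out != NULL);

    p = strstr(xml, key);
    if (p == NULL)
        return -1;                 /* no version recorded in this file */
    p += sizeof(key) - 1;

    close = strchr(p, '>');        /* end of the element that holds the key */
    if (close == NULL)
        return -1;

    p = strstr(p, val);
    if (p == NULL || p > close)
        return -1;                 /* value= is not in the same element */
    p += sizeof(val) - 1;

    end = strchr(p, '"');
    if (end == NULL || end > close)
        return -1;

    len = (size_t)(end - p);
    if (len >= outlen)
        return -1;                 /* caller's buffer is too small */
    memcpy(out, p, len);
    out[len] = '\0';
    return 0;
}
```

A caller could compare the returned string against the oldest release it is
willing to load and report an error instead of using the topology.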



Re: [hwloc-devel] v1.7

2013-01-07 Thread Ralph Castain

On Jan 7, 2013, at 6:05 AM, Samuel Thibault  wrote:

> Hello,
> 
> Brice Goglin, le Mon 31 Dec 2012 10:05:41 +0100, a écrit :
>> + The HWLOC_COMPONENTS may now start with '^' to only define a list of
>>   components to exclude.
> 
> I find it neither intuitive nor generic enough, and I wonder how
> that didn't affect Open-MPI, which AIUI uses this convention.
> 
> It means that
> 
> HWLOC_COMPONENTS=^cuda,opencl
> 
> disables cuda *and* opencl,

FWIW: that is the OMPI convention


> while intuition would have told me that it
> disables cuda but enables opencl.
> 
> Also, one would for instance want to be able to do this:
> 
> HWLOC_COMPONENTS=x86,^cuda,^opencl,nvml
> 
> To be able to enable x86 before the default linux, but disable cuda and
> opencl, but enable nvml, as well as all the other usual plugins (without
> having to know the list, which is important for future extensions).
> 
> Samuel
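
The leading-'^' rule discussed above (one '^' at the start turns the whole
comma-separated list into an exclusion list) can be sketched as follows. This
is an illustration of that rule only, not hwloc's or Open MPI's actual parser;
the function name component_excluded is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if name is excluded by spec under the
 * "leading '^' negates the whole list" rule, 0 otherwise. */
static int component_excluded(const char *spec, const char *name)
{
    size_t namelen;
    const char *p;

    assert(name != NULL);

    if (spec == NULL || spec[0] != '^')
        return 0;                       /* not an exclusion list */

    namelen = strlen(name);
    p = spec + 1;                       /* skip the single leading '^' */
    while (*p != '\0') {
        size_t toklen = strcspn(p, ",");    /* length of the current item */
        if (toklen == namelen && strncmp(p, name, namelen) == 0)
            return 1;
        p += toklen;
        if (*p == ',')
            p++;
    }
    return 0;
}
```

With "^cuda,opencl" both cuda and opencl are excluded. With Samuel's proposed
"x86,^cuda,^opencl,nvml" this rule excludes nothing, because the list does not
start with '^'; per-item negation would need a different parser.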




Re: [hwloc-devel] Cgroup resource limits

2012-11-05 Thread Ralph Castain

On Nov 4, 2012, at 7:28 PM, Christopher Samuel <sam...@unimelb.edu.au> wrote:

> On 03/11/12 09:05, Ralph Castain wrote:
> 
>> System resource managers don't usually provide this capability, so
>> we will do it at the ORTE level.
> 
> I would argue that the resource managers *should* be doing it

No argument from me - I would love for them to provide me with an easy API that 
mpirun can use to specify the requirements for a given application.

> - however,
> I will also argue that the resource managers should be doing it via
> hwloc (so I'm afraid it's not an out for you folks :-) ).

Agreed, though I leave that to the individual RMs to decide.

> 
> It's also worth remembering that the memcg code has an appalling
> reputation with the kernel developers in terms of performance overhead,
> for instance at the recent Kernel Summit numbers were reported showing a
> substantial impact for just having the code present, but not used.
> 
> Following that a patch set was sent out trying to avoid that impact if
> it's not in use which doesn't help here but does give a measure of the
> performance hit:
> 
> http://lwn.net/Articles/517562/
> 
> # So as one can see, the difference between base and nomemcg in terms
> # of both system time and elapsed time is quite drastic, and consistent
> # with the figures shown by Mel Gorman in the Kernel summit. This is a
> # ~7 % drop in performance, just by having memcg enabled. memcg
> # functions appear heavily in the profiles, even if all tasks lives in
> # the root memcg.
> 

Yick! However, I would expect the community to reduce that impact over time. If 
systems don't want that capability, then they can and should disable it. On the 
other hand, if they do want it, then we want to support it.


> cheers,
> Chris
> -- 
> Christopher Samuel    Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/  http://twitter.com/vlsci
> 




Re: [hwloc-devel] Cgroup resource limits

2012-11-02 Thread Ralph Castain
Hi Brice

I think Linux cgroups makes the most sense in terms of a mechanism for doing 
this. We don't already do it, but it is something our customers want to see in 
the platform - so we have to provide it.

The basic use-case is for an application to specify a max memory requirement, 
thus allowing us to subdivide the node when allocating resources. In that case, 
we need to ensure that the application remains within that memory limit so we 
don't start swapping. This is a typical "big data" requirement, and the apps 
know how to handle the situation where they run up against the limit (e.g., 
what to do when malloc returns NULL).

System resource managers don't usually provide this capability, so we will do 
it at the ORTE level. We already use hwloc there for resource discovery and 
process placement, so it seems natural to include the ability to specify 
limits. Since ORTE also does the process launching, it could do the final 
cgroup definition and pass it to Linux.

We envision an API that basically is modeled after the cgroup structure. What 
we would want hwloc to do is the final step - we pass in the resource 
constraints, including bind and memory policy specs, and hwloc does the "magic" 
to tell Linux what needs to be done.

Make sense?
Ralph
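
To make the "final step" concrete: a sketch, assuming the cgroup v1 memory
controller, where a group's limit is set by writing a byte count to the
memory.limit_in_bytes file in that group's directory. set_memory_limit is a
hypothetical helper name, not an hwloc or ORTE function, and the group
directory is supplied by the caller rather than assumed.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: write bytes to <cgroup_dir>/memory.limit_in_bytes.
 * Returns 0 on success, -1 on any failure. */
static int set_memory_limit(const char *cgroup_dir, unsigned long long bytes)
{
    char path[4096];
    FILE *f;
    int n;

    assert(cgroup_dir != NULL);

    n = snprintf(path, sizeof(path), "%s/memory.limit_in_bytes", cgroup_dir);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;                      /* path did not fit */

    f = fopen(path, "w");
    if (f == NULL)
        return -1;                      /* no such group, or no permission */

    if (fprintf(f, "%llu\n", bytes) < 0) {
        fclose(f);
        return -1;
    }
    return fclose(f) == 0 ? 0 : -1;     /* the write may only fail at close */
}
```

A launcher would call this on the job's group directory and then add the
child's PID to that group's tasks file before exec.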


On Nov 2, 2012, at 2:18 PM, Brice Goglin <brice.gog...@inria.fr> wrote:

> Hello Ralph,
> 
> I am not very familiar with these features. What system mechanism do you
> currently use for this? Linux cgroups? Any concrete example of what you
> would like to do?
> 
> Brice
> 
> 
> 
> Le 02/11/2012 22:12, Ralph Castain a écrit :
>> Hi folks
>> 
>> We (Greenplum) have a need to support resource limits (e.g., memory and cpu 
>> usage) on processes running under Open MPI's RTE. OMPI uses hwloc for 
>> processor and memory affinity, so this seems a likely place to add the 
>> required support. Jeff tells me that it doesn't yet exist in hwloc - I'm 
>> wondering if you would welcome and/or be willing to consider contributions 
>> from our engineers towards adding this capability?
>> 
>> Obviously, we'd need to discuss how and where to do the extension. Just 
>> wanted to first see if this is an option, or if we should do it directly in 
>> OMPI.
>> Ralph
>> 
>> 
> 




[hwloc-devel] Cgroup resource limits

2012-11-02 Thread Ralph Castain
Hi folks

We (Greenplum) have a need to support resource limits (e.g., memory and cpu 
usage) on processes running under Open MPI's RTE. OMPI uses hwloc for processor 
and memory affinity, so this seems a likely place to add the required support. 
Jeff tells me that it doesn't yet exist in hwloc - I'm wondering if you would 
welcome and/or be willing to consider contributions from our engineers towards 
adding this capability?

Obviously, we'd need to discuss how and where to do the extension. Just wanted 
to first see if this is an option, or if we should do it directly in OMPI.
Ralph




Re: [hwloc-devel] lstopo-nox strikes back

2012-04-25 Thread Ralph Castain
I don't have a strong opinion, but the historical "standard practice" for 
Linux/Unix has always been to default to cmd line, non-graphical interfaces. 
Graphical output was optional. Of course, that stemmed from the days before 
everyone had a graphical display, but it is still generally followed.


On Apr 25, 2012, at 3:38 AM, Brice Goglin wrote:

> Hello,
> 
> We recently got some complaints from redhat/centos users that wanted to 
> install hwloc on their cluster but couldn't because it brought so many X 
> libraries that they don't care about.
> 
> Debian solves this by having two hwloc packages: the main hwloc one, and 
> hwloc-nox where cairo is disabled. You just install one of them, packages are 
> marked as conflicting with each other.
> 
> I asked Jirka, our fellow RPM hwloc packager. He feels that RPM distros don't 
> work that way. They usually have a core 'foo' package without X, and 
> something such as 'foo-gui' with the X-enabled binary. So you'd have lstopo 
> and lstopo-gui installed at the same time.
> 
> I don't have any preference but RPM is much more widely used than deb in HPC, 
> so we must consider the issue, either in hwloc or in RPM packaging. And we 
> need a solution that is consistent across distros (we don't want users to get 
> lost because Debian/Ubuntu lstopo is graphical while RPM lstopo is not and 
> lstopo-gui is).
> 
> It's not hard to build two lstopo binaries in the same hwloc (quick patch 
> attached). But we'd need to decide their names (lstopo/lstopo-nox, 
> lstopo/lstopo-nogui, lstopo-gui/lstopo), and find a good way to make the 
> existing packages deal with them.
> 
> How do people feel about this? Is it ok to choose between hwloc and hwloc-nox 
> packages on Debian/Ubuntu? Does somebody want to *always* have a lstopo-nox 
> installed? Should the default lstopo be graphical/cairo or not?
> 
> Brice
> 




Re: [hwloc-devel] hwloc + OMPI issue

2011-10-12 Thread Ralph Castain
Thanks! I'll add the latter to our code.

Ralph

On Oct 12, 2011, at 3:11 PM, Brice Goglin wrote:

> Le 12/10/2011 22:56, Jeff Squyres a écrit :
>> One of the OMPI devs found a problem when I upgraded the OMPI SVN trunk to 
>> the hwloc 1.2.2ompi version last week that I think I am just now beginning 
>> to understand.
>> 
>> Brief reminder of our strategy:
>> 
>> - on each compute node, OMPI launches a local "orted" helper daemon
>> - this orted fork/exec's the local MPI processes
>> 
>> To avoid the penalty of each MPI process invoking hwloc discovery 
>> more-or-less simultaneously upon startup (which, as we've seen on this list 
>> before, can be painful when core counts are large), we have the orted do the 
>> hwloc discovery, serialize this into XML, and send it to each of its local 
>> processes.  The local processes receive this XML and then load it into hwloc 
>> and run from there.
>> 
>> However, it looks like the resulting loaded-from-XML topology->is_thissystem 
>> is set to 0, and therefore functions like hwloc_get_cpubind() actually get 
>> wired up to dontget_thisproc_cpubind() (instead of the proper Linux backend, 
>> for example).
>> 
>> How do we avoid this?  We need working hwloc functions after loading up an 
>> XML topology string.
> 
> export HWLOC_THISSYSTEM=1
> or
> hwloc_topology_set_flags(topology, HWLOC_TOPOLOGY_FLAG_IS_THISSYSTEM) between
> init() and load()
> 
> Brice
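
The first option Brice gives can be applied by the parent daemon before it
fork/execs its children, so that they inherit the variable. A minimal sketch in
plain C; the helper name export_thissystem is hypothetical. The second option
is restated only as a comment, assuming the hwloc 1.x signature that takes the
topology handle as first argument, and is not compiled here.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: set HWLOC_THISSYSTEM=1 in the current environment so
 * that processes created afterwards by fork/exec inherit it.
 * Returns 0 on success, -1 on failure.
 *
 * Alternative inside each child (needs hwloc.h, shown for reference only):
 *   hwloc_topology_init(&topo);
 *   hwloc_topology_set_flags(topo, HWLOC_TOPOLOGY_FLAG_IS_THISSYSTEM);
 *   hwloc_topology_set_xmlbuffer(topo, xml, xmllen);
 *   hwloc_topology_load(topo);
 */
static int export_thissystem(void)
{
    if (setenv("HWLOC_THISSYSTEM", "1", 1) != 0)
        return -1;
    assert(getenv("HWLOC_THISSYSTEM") != NULL);   /* postcondition */
    return 0;
}
```

Either way, the setting only makes sense when the XML really describes the
machine the process is running on, as it does in the orted case above.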
> 




Re: [hwloc-devel] 1.3 -- wait!

2011-10-11 Thread Ralph Castain

On Oct 11, 2011, at 7:34 AM, Brice Goglin wrote:

> Le 11/10/2011 15:04, Jeff Squyres a écrit :
>> Looks like Ralph's size/linesize patch didn't make it to v1.3:
>> 
>> Index: src/traversal.c
>> ===
>> --- src/traversal.c  (revision 3883)
>> +++ src/traversal.c  (working copy)
>> @@ -478,7 +478,7 @@
>>  *assoc = '\0';
>>   else
>>  snprintf(assoc, sizeof(assoc), "%sways=%d", separator, 
>> obj->attr->cache.associativity);
>> -  res = hwloc_snprintf(tmp, tmplen, "%s%lu%s%sline=%u%s",
>> +  res = hwloc_snprintf(tmp, tmplen, "%ssize=%lu%s%slinesize=%u%s",
>> prefix,
>> (unsigned long) 
>> hwloc_memory_size_printf_value(obj->attr->cache.size, verbose),
>> hwloc_memory_size_printf_unit(obj->attr->cache.size, 
>> verbose),
>> 
>> 
>> Can this go in before 1.3 is released?
>> 
> 
> I didn't think it was that important. I can backport it for sure.

Thanks!

> Do you
> want it in v1.2-ompi too?

Not necessary for v1.2-ompi - I already inserted it in our local copy, and I 
imagine we'll update to 1.3 shortly.

> 
> Brice
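
For reference, the effect of the patched format string can be sketched in
isolation. The function name format_cache_attrs and the argument order after
the unit (separator, line size, associativity string) are assumptions, since
the diff context above is cut off after the unit argument; plain snprintf
stands in for hwloc_snprintf.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the patched call: labels both numbers so the
 * output reads "size=..." and "linesize=..." instead of a bare number and
 * "line=...". Returns what snprintf returns. */
static int format_cache_attrs(char *buf, size_t buflen, const char *prefix,
                              unsigned long size, const char *unit,
                              const char *separator, unsigned linesize,
                              const char *assoc)
{
    assert(buf != NULL && buflen > 0);
    return snprintf(buf, buflen, "%ssize=%lu%s%slinesize=%u%s",
                    prefix, size, unit, separator, linesize, assoc);
}
```

Called with an empty prefix, 32, "KB", a single-space separator, 64 and
" ways=8", this yields "size=32KB linesize=64 ways=8".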
> 




Re: [hwloc-devel] Something lighter-weight than XML?

2011-09-24 Thread Ralph Castain
Thanks!

On Sep 24, 2011, at 2:18 PM, Brice Goglin wrote:

> I fixed one parsing bug in commit 3660 on the v1.2-ompi branch. Things
> should work better now.
> 
> Parsing XML distance matrices was broken when the XML file came from the
> no-libxml exporter. That's why you had problems on your dual-amd machine
> (those have distance matrices) and not on your mac (single processor, no
> distances, no bug).
> 
> The v1.2 branch doesn't report parsing failure well, so it just crashed.
> Trunk exits with an error instead of crashing.
> 
> Brice
> 
> 
> 
> 
> Le 24/09/2011 20:37, Ralph Castain a écrit :
>> Yep, it fails. Runs on my Mac, but not under Linux.
>> 
>> Program terminated with signal 11, Segmentation fault.
>> #0  0x2acdbedd in hwloc_bitmap_snprintf () from 
>> /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
>> (gdb) where
>> #0  0x2acdbedd in hwloc_bitmap_snprintf () from 
>> /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
>> #1  0x2acdc060 in hwloc_bitmap_asprintf () from 
>> /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
>> #2  0x2acd9b34 in hwloc__xml_export_object () from 
>> /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
>> #3  0x2acda325 in hwloc___nolibxml_prepare_export () from 
>> /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
>> #4  0x2acda39c in hwloc__nolibxml_prepare_export () from 
>> /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
>> #5  0x2acda4be in hwloc_topology_export_xmlbuffer () from 
>> /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
>> #6  0x004009b8 in main () at xmlbuffer.c:31
>> 
>> On Sep 24, 2011, at 9:45 AM, Brice Goglin wrote:
>> 
>>> Indeed, this object contains invalid pointers.
>>> 
>>> Can you try to run tests/xmlbuffer.c from hwloc's tree? It does
>>> export+import+export+compare on the same machine. It would be good to
>>> know if it fails on one of the machines you're using here.
>>> 
>>> https://svn.open-mpi.org/trac/hwloc/browser/branches/v1.2-ompi/tests/xmlbuffer.c?rev=3837=txt
>>> 
>>> thanks
>>> Brice
>>> 
>>> 
>>> 
>>> Le 24/09/2011 17:07, Ralph Castain a écrit :
>>>> FWIW: I tried just printing out the contents of that root object 
>>>> immediately after importing the xml, and it clearly has a problem:
>>>> 
>>>> (gdb) print *obj
>>>> $2 = {type = OPAL_HWLOC122_hwloc_OBJ_SYSTEM, os_index = 0, name = 0x101 
>>>> , memory = {
>>>>   total_memory = 46912502995240, local_memory = 46912502995240, 
>>>> page_types_len = 0, page_types = 0x0}, attr = 0x2, 
>>>> depth = 6900112, logical_index = 0, os_level = 6571424, next_cousin = 0x0, 
>>>> prev_cousin = 0x, parent = 0x0, 
>>>> sibling_rank = 0, next_sibling = 0x0, prev_sibling = 0x0, arity = 145, 
>>>> children = 0x2b139738, 
>>>> first_child = 0x2b139738, last_child = 0x0, userdata = 0x0, cpuset = 
>>>> 0x0, complete_cpuset = 0x0, 
>>>> online_cpuset = 0x644700, allowed_cpuset = 0x691970, nodeset = 0x6919e0, 
>>>> complete_nodeset = 0x644c90, 
>>>> allowed_nodeset = 0x644cb0, distances = 0x6948b0, distances_count = 
>>>> 690, infos = 0x0, infos_count = 0}
>>>> 
>>>> 
>>>> On Sep 24, 2011, at 9:02 AM, Ralph Castain wrote:
>>>> 
>>>>> Here's the trace:
>>>>> 
>>>>> #0  0x2ae61737 in hwloc__xml_export_object 
>>>>> (output=0x7fffd890, topology=0x695f10, obj=0x2b139b28)
>>>>>  at topology-xml.c:1094
>>>>> #1  0x2ae61b69 in hwloc___nolibxml_prepare_export 
>>>>> (topology=0x695f10, 
>>>>>  xmlbuffer=0x698a70 ">>>> encoding=\"UTF-8\"?>\n>>>> \"hwloc.dtd\">\n\n  >>>> os_level=\"-1424778408\" os_index=\"10922\" cpuset=\"0xf...f\" 
>>>>> complete_cpuset=\"0xf...f\" onl"..., 
>>>>>  buflen=16384) at topology-xml.c:1193
>>>>> #2  0x2ae61be0 in hwloc__nolibxml_prepare_export 
>>>>> (topology=0x695f10, bufferp=0x7fffd988, buflenp=0x7fffd97c)
>>>>>  at topology-xml.c:1207
>>>>> #3  0x2ae61d02 in opal_hwloc122_hwloc_topology_export_xmlbuffer 
>>>>> (topology=0x695f10, xmlbuffer=0x7fffd988, 
>>>>>  buflen=0x7fffd97c) at topology-xml.c:1281
>>>>> #4  0x2ae

Re: [hwloc-devel] Something lighter-weight than XML?

2011-09-24 Thread Ralph Castain
This is 1.2-ompi, running on Linux 2.6.18-274.el5 on x86_64

$ uname -a
Linux xxx 2.6.18-274.el5 #1 SMP Fri Jul 8 17:36:59 EDT 2011 x86_64 x86_64 
x86_64 GNU/Linux


On Sep 24, 2011, at 12:43 PM, Brice Goglin wrote:

> What platform and distribution do you have?
> 
> Brice
> 
> 
> 
> Le 24/09/2011 20:37, Ralph Castain a écrit :
>> Yep, it fails. Runs on my Mac, but not under Linux.
>> 
>> Program terminated with signal 11, Segmentation fault.
>> #0  0x2acdbedd in hwloc_bitmap_snprintf () from 
>> /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
>> (gdb) where
>> #0  0x2acdbedd in hwloc_bitmap_snprintf () from 
>> /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
>> #1  0x2acdc060 in hwloc_bitmap_asprintf () from 
>> /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
>> #2  0x2acd9b34 in hwloc__xml_export_object () from 
>> /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
>> #3  0x2acda325 in hwloc___nolibxml_prepare_export () from 
>> /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
>> #4  0x2acda39c in hwloc__nolibxml_prepare_export () from 
>> /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
>> #5  0x2acda4be in hwloc_topology_export_xmlbuffer () from 
>> /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
>> #6  0x004009b8 in main () at xmlbuffer.c:31
>> 
>> On Sep 24, 2011, at 9:45 AM, Brice Goglin wrote:
>> 
>>> Indeed, this object contains invalid pointers.
>>> 
>>> Can you try to run tests/xmlbuffer.c from hwloc's tree? It does
>>> export+import+export+compare on the same machine. It would be good to
>>> know if it fails on one of the machines you're using here.
>>> 
>>> https://svn.open-mpi.org/trac/hwloc/browser/branches/v1.2-ompi/tests/xmlbuffer.c?rev=3837=txt
>>> 
>>> thanks
>>> Brice
>>> 
>>> 
>>> 
>>> Le 24/09/2011 17:07, Ralph Castain a écrit :
>>>> FWIW: I tried just printing out the contents of that root object 
>>>> immediately after importing the xml, and it clearly has a problem:
>>>> 
>>>> (gdb) print *obj
>>>> $2 = {type = OPAL_HWLOC122_hwloc_OBJ_SYSTEM, os_index = 0, name = 0x101 
>>>> , memory = {
>>>>   total_memory = 46912502995240, local_memory = 46912502995240, 
>>>> page_types_len = 0, page_types = 0x0}, attr = 0x2, 
>>>> depth = 6900112, logical_index = 0, os_level = 6571424, next_cousin = 0x0, 
>>>> prev_cousin = 0x, parent = 0x0, 
>>>> sibling_rank = 0, next_sibling = 0x0, prev_sibling = 0x0, arity = 145, 
>>>> children = 0x2b139738, 
>>>> first_child = 0x2b139738, last_child = 0x0, userdata = 0x0, cpuset = 
>>>> 0x0, complete_cpuset = 0x0, 
>>>> online_cpuset = 0x644700, allowed_cpuset = 0x691970, nodeset = 0x6919e0, 
>>>> complete_nodeset = 0x644c90, 
>>>> allowed_nodeset = 0x644cb0, distances = 0x6948b0, distances_count = 
>>>> 690, infos = 0x0, infos_count = 0}
>>>> 
>>>> 
>>>> On Sep 24, 2011, at 9:02 AM, Ralph Castain wrote:
>>>> 
>>>>> Here's the trace:
>>>>> 
>>>>> #0  0x2ae61737 in hwloc__xml_export_object 
>>>>> (output=0x7fffd890, topology=0x695f10, obj=0x2b139b28)
>>>>>  at topology-xml.c:1094
>>>>> #1  0x2ae61b69 in hwloc___nolibxml_prepare_export 
>>>>> (topology=0x695f10, 
>>>>>  xmlbuffer=0x698a70 ">>>> encoding=\"UTF-8\"?>\n>>>> \"hwloc.dtd\">\n\n  >>>> os_level=\"-1424778408\" os_index=\"10922\" cpuset=\"0xf...f\" 
>>>>> complete_cpuset=\"0xf...f\" onl"..., 
>>>>>  buflen=16384) at topology-xml.c:1193
>>>>> #2  0x2ae61be0 in hwloc__nolibxml_prepare_export 
>>>>> (topology=0x695f10, bufferp=0x7fffd988, buflenp=0x7fffd97c)
>>>>>  at topology-xml.c:1207
>>>>> #3  0x2ae61d02 in opal_hwloc122_hwloc_topology_export_xmlbuffer 
>>>>> (topology=0x695f10, xmlbuffer=0x7fffd988, 
>>>>>  buflen=0x7fffd97c) at topology-xml.c:1281
>>>>> #4  0x2ae529f4 in opal_hwloc_compare (topo1=0x695f10, 
>>>>> topo2=0x6915c0, type=22 '\026') at base/hwloc_base_dt.c:183
>>>>> #5  0x2adf348c in opal_dss_compare (value1=0x695f10, 
>>>>> value2=0x6915c0, type=22 '\026') at dss/dss_compare.c:39
>>>>> #6 

Re: [hwloc-devel] Something lighter-weight than XML?

2011-09-24 Thread Ralph Castain
Yep, it fails. Runs on my Mac, but not under Linux.

Program terminated with signal 11, Segmentation fault.
#0  0x2acdbedd in hwloc_bitmap_snprintf () from 
/nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
(gdb) where
#0  0x2acdbedd in hwloc_bitmap_snprintf () from 
/nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
#1  0x2acdc060 in hwloc_bitmap_asprintf () from 
/nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
#2  0x2acd9b34 in hwloc__xml_export_object () from 
/nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
#3  0x2acda325 in hwloc___nolibxml_prepare_export () from 
/nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
#4  0x2acda39c in hwloc__nolibxml_prepare_export () from 
/nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
#5  0x2acda4be in hwloc_topology_export_xmlbuffer () from 
/nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
#6  0x004009b8 in main () at xmlbuffer.c:31

On Sep 24, 2011, at 9:45 AM, Brice Goglin wrote:

> Indeed, this object contains invalid pointers.
> 
> Can you try to run tests/xmlbuffer.c from hwloc's tree? It does
> export+import+export+compare on the same machine. It would be good to
> know if it fails on one of the machines you're using here.
> 
> https://svn.open-mpi.org/trac/hwloc/browser/branches/v1.2-ompi/tests/xmlbuffer.c?rev=3837=txt
> 
> thanks
> Brice
> 
> 
> 
> Le 24/09/2011 17:07, Ralph Castain a écrit :
>> FWIW: I tried just printing out the contents of that root object immediately 
>> after importing the xml, and it clearly has a problem:
>> 
>> (gdb) print *obj
>> $2 = {type = OPAL_HWLOC122_hwloc_OBJ_SYSTEM, os_index = 0, name = 0x101 
>> , memory = {
>>total_memory = 46912502995240, local_memory = 46912502995240, 
>> page_types_len = 0, page_types = 0x0}, attr = 0x2, 
>>  depth = 6900112, logical_index = 0, os_level = 6571424, next_cousin = 0x0, 
>> prev_cousin = 0x, parent = 0x0, 
>>  sibling_rank = 0, next_sibling = 0x0, prev_sibling = 0x0, arity = 145, 
>> children = 0x2b139738, 
>>  first_child = 0x2b139738, last_child = 0x0, userdata = 0x0, cpuset = 
>> 0x0, complete_cpuset = 0x0, 
>>  online_cpuset = 0x644700, allowed_cpuset = 0x691970, nodeset = 0x6919e0, 
>> complete_nodeset = 0x644c90, 
>>  allowed_nodeset = 0x644cb0, distances = 0x6948b0, distances_count = 
>> 690, infos = 0x0, infos_count = 0}
>> 
>> 
>> On Sep 24, 2011, at 9:02 AM, Ralph Castain wrote:
>> 
>>> Here's the trace:
>>> 
>>> #0  0x2ae61737 in hwloc__xml_export_object (output=0x7fffd890, 
>>> topology=0x695f10, obj=0x2b139b28)
>>>   at topology-xml.c:1094
>>> #1  0x2ae61b69 in hwloc___nolibxml_prepare_export 
>>> (topology=0x695f10, 
>>>   xmlbuffer=0x698a70 "\n>> topology SYSTEM \"hwloc.dtd\">\n\n  >> os_level=\"-1424778408\" os_index=\"10922\" cpuset=\"0xf...f\" 
>>> complete_cpuset=\"0xf...f\" onl"..., 
>>>   buflen=16384) at topology-xml.c:1193
>>> #2  0x2ae61be0 in hwloc__nolibxml_prepare_export 
>>> (topology=0x695f10, bufferp=0x7fffd988, buflenp=0x7fffd97c)
>>>   at topology-xml.c:1207
>>> #3  0x2ae61d02 in opal_hwloc122_hwloc_topology_export_xmlbuffer 
>>> (topology=0x695f10, xmlbuffer=0x7fffd988, 
>>>   buflen=0x7fffd97c) at topology-xml.c:1281
>>> #4  0x2ae529f4 in opal_hwloc_compare (topo1=0x695f10, 
>>> topo2=0x6915c0, type=22 '\026') at base/hwloc_base_dt.c:183
>>> #5  0x2adf348c in opal_dss_compare (value1=0x695f10, 
>>> value2=0x6915c0, type=22 '\026') at dss/dss_compare.c:39
>>> #6  0x2ad9b5f7 in process_orted_launch_report (fd=-1, event=1, 
>>> data=0x6444d0) at base/plm_base_launch_support.c:564
>>> #7  0x2ae3881f in event_process_active_single_queue (base=0x60dd60, 
>>> activeq=0x6111e0) at event.c:1329
>>> #8  0x2ae38c71 in event_process_active (base=0x60dd60) at 
>>> event.c:1396
>>> #9  0x2ae3902b in opal_libevent2012_event_base_loop (base=0x60dd60, 
>>> flags=1) at event.c:1598
>>> #10 0x2adf080d in opal_progress () at runtime/opal_progress.c:189
>>> #11 0x2ad9bbfa in orte_plm_base_daemon_callback (num_daemons=2) at 
>>> base/plm_base_launch_support.c:666
>>> #12 0x2ada49e1 in plm_slurm_launch_job (jdata=0x67a500) at 
>>> plm_slurm_module.c:404
>>> #13 0x00403822 in orterun (argc=4, argv=0x7fffe1d8) at 
>>> orterun.c:817
>>> #14 0x00402aa3 in main (argc=4, argv=0x7fffe1d8) at main.c:13
>>>

Re: [hwloc-devel] Something lighter-weight than XML?

2011-09-24 Thread Ralph Castain
FWIW: I tried just printing out the contents of that root object immediately 
after importing the xml, and it clearly has a problem:

(gdb) print *obj
$2 = {type = OPAL_HWLOC122_hwloc_OBJ_SYSTEM, os_index = 0, name = 0x101 
, memory = {
total_memory = 46912502995240, local_memory = 46912502995240, 
page_types_len = 0, page_types = 0x0}, attr = 0x2, 
  depth = 6900112, logical_index = 0, os_level = 6571424, next_cousin = 0x0, 
prev_cousin = 0x, parent = 0x0, 
  sibling_rank = 0, next_sibling = 0x0, prev_sibling = 0x0, arity = 145, 
children = 0x2b139738, 
  first_child = 0x2b139738, last_child = 0x0, userdata = 0x0, cpuset = 0x0, 
complete_cpuset = 0x0, 
  online_cpuset = 0x644700, allowed_cpuset = 0x691970, nodeset = 0x6919e0, 
complete_nodeset = 0x644c90, 
  allowed_nodeset = 0x644cb0, distances = 0x6948b0, distances_count = 690, 
infos = 0x0, infos_count = 0}


On Sep 24, 2011, at 9:02 AM, Ralph Castain wrote:

> Here's the trace:
> 
> #0  0x2ae61737 in hwloc__xml_export_object (output=0x7fffd890, 
> topology=0x695f10, obj=0x2b139b28)
>at topology-xml.c:1094
> #1  0x2ae61b69 in hwloc___nolibxml_prepare_export (topology=0x695f10, 
>xmlbuffer=0x698a70 "\n topology SYSTEM \"hwloc.dtd\">\n\n   os_level=\"-1424778408\" os_index=\"10922\" cpuset=\"0xf...f\" 
> complete_cpuset=\"0xf...f\" onl"..., 
>buflen=16384) at topology-xml.c:1193
> #2  0x2ae61be0 in hwloc__nolibxml_prepare_export (topology=0x695f10, 
> bufferp=0x7fffd988, buflenp=0x7fffd97c)
>at topology-xml.c:1207
> #3  0x2ae61d02 in opal_hwloc122_hwloc_topology_export_xmlbuffer 
> (topology=0x695f10, xmlbuffer=0x7fffd988, 
>buflen=0x7fffd97c) at topology-xml.c:1281
> #4  0x2ae529f4 in opal_hwloc_compare (topo1=0x695f10, topo2=0x6915c0, 
> type=22 '\026') at base/hwloc_base_dt.c:183
> #5  0x2adf348c in opal_dss_compare (value1=0x695f10, value2=0x6915c0, 
> type=22 '\026') at dss/dss_compare.c:39
> #6  0x2ad9b5f7 in process_orted_launch_report (fd=-1, event=1, 
> data=0x6444d0) at base/plm_base_launch_support.c:564
> #7  0x2ae3881f in event_process_active_single_queue (base=0x60dd60, 
> activeq=0x6111e0) at event.c:1329
> #8  0x2ae38c71 in event_process_active (base=0x60dd60) at event.c:1396
> #9  0x2ae3902b in opal_libevent2012_event_base_loop (base=0x60dd60, 
> flags=1) at event.c:1598
> #10 0x2adf080d in opal_progress () at runtime/opal_progress.c:189
> #11 0x2ad9bbfa in orte_plm_base_daemon_callback (num_daemons=2) at 
> base/plm_base_launch_support.c:666
> #12 0x2ada49e1 in plm_slurm_launch_job (jdata=0x67a500) at 
> plm_slurm_module.c:404
> #13 0x00403822 in orterun (argc=4, argv=0x7fffe1d8) at 
> orterun.c:817
> #14 0x00402aa3 in main (argc=4, argv=0x7fffe1d8) at main.c:13
> 
> And the error report
> 
> Program received signal SIGSEGV, Segmentation fault.
> 0x2ae61737 in hwloc__xml_export_object (output=0x7fffd890, 
> topology=0x695f10, obj=0x2b139b28)
>at topology-xml.c:1094
> 1094  sprintf(tmp, "%llu", (unsigned long long) 
> obj->memory.page_types[i].count);
> (gdb) print obj
> $1 = (opal_hwloc122_hwloc_obj_t) 0x2b139b28
> (gdb) print *obj
> $2 = {type = 2870188824, os_index = 10922, name = 0x2b139b18 
> "\b\233\023\253\252*", memory = {total_memory = 6579376, 
>local_memory = 6579376, page_types_len = 2870188856, page_types = 
> 0x2b139b38}, attr = 0x2b139b48, 
>  depth = 2870188872, logical_index = 10922, os_level = -1424778408, 
> next_cousin = 0x2b139b58, 
>  prev_cousin = 0x2b139b68, parent = 0x2b139b68, sibling_rank = 
> 2870188920, next_sibling = 0x2b139b78, 
>  prev_sibling = 0x2b139b88, arity = 2870188936, children = 
> 0x2b139b98, first_child = 0x2b139b98, 
>  last_child = 0x2b139ba8, userdata = 0x2b139ba8, cpuset = 
> 0x2b139bb8, complete_cpuset = 0x2b139bb8, 
>  online_cpuset = 0x2b139bc8, allowed_cpuset = 0x2b139bc8, nodeset = 
> 0x2b139bd8, 
>  complete_nodeset = 0x2b139bd8, allowed_nodeset = 0x2b139be8, 
> distances = 0x2b139be8, 
>  distances_count = 2870189048, infos = 0x2b139bf8, infos_count = 
> 2870189064}
> (gdb) print obj->memory
> $3 = {total_memory = 6579376, local_memory = 6579376, page_types_len = 
> 2870188856, page_types = 0x2b139b38}
> (gdb) print obj->memory.page_types
> $4 = (struct opal_hwloc122_hwloc_obj_memory_page_type_s *) 0x2b139b38
> (gdb) print i
> $5 = 1612
> (gdb) print obj->memory.page_types[1600]
> $6 = {size = 0, count = 0}
> (gdb) print obj->memory.page_types[1612]
> Ca

Re: [hwloc-devel] Something lighter-weight than XML?

2011-09-24 Thread Ralph Castain
Here's the trace:

#0  0x2ae61737 in hwloc__xml_export_object (output=0x7fffd890, 
topology=0x695f10, obj=0x2b139b28)
at topology-xml.c:1094
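#1  0x2ae61b69 in hwloc___nolibxml_prepare_export (topology=0x695f10, 
xmlbuffer=0x698a70 "\n\n\n  [...]
[...]
#3  0x2ae61d02 in opal_hwloc122_hwloc_topology_export_xmlbuffer 
(topology=0x695f10, xmlbuffer=0x7fffd988, 
buflen=0x7fffd97c) at topology-xml.c:1281
#4  0x2ae529f4 in opal_hwloc_compare (topo1=0x695f10, topo2=0x6915c0, 
type=22 '\026') at base/hwloc_base_dt.c:183
#5  0x2adf348c in opal_dss_compare (value1=0x695f10, value2=0x6915c0, 
type=22 '\026') at dss/dss_compare.c:39
#6  0x2ad9b5f7 in process_orted_launch_report (fd=-1, event=1, 
data=0x6444d0) at base/plm_base_launch_support.c:564
#7  0x2ae3881f in event_process_active_single_queue (base=0x60dd60, 
activeq=0x6111e0) at event.c:1329
#8  0x2ae38c71 in event_process_active (base=0x60dd60) at event.c:1396
#9  0x2ae3902b in opal_libevent2012_event_base_loop (base=0x60dd60, 
flags=1) at event.c:1598
#10 0x2adf080d in opal_progress () at runtime/opal_progress.c:189
#11 0x2ad9bbfa in orte_plm_base_daemon_callback (num_daemons=2) at 
base/plm_base_launch_support.c:666
#12 0x2ada49e1 in plm_slurm_launch_job (jdata=0x67a500) at 
plm_slurm_module.c:404
#13 0x00403822 in orterun (argc=4, argv=0x7fffe1d8) at 
orterun.c:817
#14 0x00402aa3 in main (argc=4, argv=0x7fffe1d8) at main.c:13

And the error report

Program received signal SIGSEGV, Segmentation fault.
0x2ae61737 in hwloc__xml_export_object (output=0x7fffd890, 
topology=0x695f10, obj=0x2b139b28)
at topology-xml.c:1094
1094  sprintf(tmp, "%llu", (unsigned long long) 
obj->memory.page_types[i].count);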
(gdb) print obj
$1 = (opal_hwloc122_hwloc_obj_t) 0x2b139b28
(gdb) print *obj
$2 = {type = 2870188824, os_index = 10922, name = 0x2b139b18 
"\b\233\023\253\252*", memory = {total_memory = 6579376, 
local_memory = 6579376, page_types_len = 2870188856, page_types = 
0x2b139b38}, attr = 0x2b139b48, 
  depth = 2870188872, logical_index = 10922, os_level = -1424778408, 
next_cousin = 0x2b139b58, 
  prev_cousin = 0x2b139b68, parent = 0x2b139b68, sibling_rank = 
2870188920, next_sibling = 0x2b139b78, 
  prev_sibling = 0x2b139b88, arity = 2870188936, children = 0x2b139b98, 
first_child = 0x2b139b98, 
  last_child = 0x2b139ba8, userdata = 0x2b139ba8, cpuset = 
0x2b139bb8, complete_cpuset = 0x2b139bb8, 
  online_cpuset = 0x2b139bc8, allowed_cpuset = 0x2b139bc8, nodeset = 
0x2b139bd8, 
  complete_nodeset = 0x2b139bd8, allowed_nodeset = 0x2b139be8, 
distances = 0x2b139be8, 
  distances_count = 2870189048, infos = 0x2b139bf8, infos_count = 
2870189064}
(gdb) print obj->memory
$3 = {total_memory = 6579376, local_memory = 6579376, page_types_len = 
2870188856, page_types = 0x2b139b38}
(gdb) print obj->memory.page_types
$4 = (struct opal_hwloc122_hwloc_obj_memory_page_type_s *) 0x2b139b38
(gdb) print i
$5 = 1612
(gdb) print obj->memory.page_types[1600]
$6 = {size = 0, count = 0}
(gdb) print obj->memory.page_types[1612]
Cannot access memory at address 0x2b13fff8
(gdb) print obj->memory.page_types[1611]
$7 = {size = 0, count = 0}
(gdb) 


The whole obj looks like trash to me. I looked a little more - the object 
referenced is the root object:

1193  hwloc__xml_export_object (, topology, 
hwloc_get_root_obj(topology));

I'm continuing to look in case I'm doing something stupid, but the code is 
pretty linear here - unpack, import, export for compare.
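
For what it's worth, the gdb output above looks consistent with the export loop trusting a garbage page_types_len (2870188856) and walking page_types until it runs off mapped memory at i=1612. Here is a minimal sketch of that failure mode with a defensive check added. All names (toy_memory, toy_page_type, toy_export_page_types, TOY_MAX_PAGE_TYPES) are made up for illustration and the limit of 64 is an arbitrary assumption; this is not the hwloc code.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical stand-ins for the structures seen in the gdb dump. */
struct toy_page_type {
    unsigned long long size;
    unsigned long long count;
};

struct toy_memory {
    unsigned page_types_len;
    struct toy_page_type *page_types;
};

/* Arbitrary sanity limit, assumed for this sketch only. */
#define TOY_MAX_PAGE_TYPES 64u

/*
 * Write "size:count;" pairs into out (at most outlen bytes).
 * Returns the number of entries written, or -1 if the length looks
 * like garbage, the array pointer is missing, or out is too small.
 */
int toy_export_page_types(const struct toy_memory *m, char *out, size_t outlen)
{
    size_t used = 0;
    unsigned i;

    if (m == NULL || out == NULL || outlen == 0)
        return -1;
    out[0] = '\0';
    if (m->page_types_len > TOY_MAX_PAGE_TYPES)
        return -1;              /* e.g. 2870188856 from the trace */
    if (m->page_types_len > 0 && m->page_types == NULL)
        return -1;

    for (i = 0; i < m->page_types_len; i++) {
        /* Bounded write, unlike the unbounded sprintf in the trace. */
        int n = snprintf(out + used, outlen - used, "%llu:%llu;",
                         m->page_types[i].size, m->page_types[i].count);
        if (n < 0 || (size_t)n >= outlen - used)
            return -1;          /* output buffer too small */
        used += (size_t)n;
    }
    return (int)m->page_types_len;
}
```

A check like this would only turn the segfault into an error return. It would not explain why the imported root object is trash in the first place.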


On Sep 24, 2011, at 8:59 AM, Jeff Squyres wrote:

> Here's some feedback from Ralph -- any idea what's going wrong here?
> 
> -
> 
> 1. I export a topology into xml using
> 
>   hwloc_topology_export_xmlbuffer(t, , );
> 
> I then pack and send the string.
> 
> 2. I unpack the string on the other end and import it into a topology
>   hwloc_topology_init();
>   if (0 != (rc = hwloc_topology_set_xmlbuffer(t, xmlbuffer, 
> strlen(xmlbuffer)))) {
>   hwloc_topology_destroy(t);
>   goto cleanup;
>   }
>   hwloc_topology_load(t);
> 
> 3. I then need to compare two topologies, so I export the topology I received 
> into another xml string
>   hwloc_topology_export_xmlbuffer(t1, , );
> 
> It is this export that fails, which implies to me that somehow the import 
> didn't work right. Note that this code worked fine with libxml2, so this is a 
> regression.
> 
> 
> On Sep 22, 2011, at 9:39 AM, Jeff Squyres wrote:
> 
>> Yes, I can get some testing of the ompi branch pretty quickly.  I can bring 
>> in a new copy of this later today and see what we can see.
>> 
>> Many thanks!
>> 
>> 
>> On Sep 19, 2011, at 9:05 AM, Brice Goglin wrote:
>> 
>>> I pushed the new minimalistic XML import/export implementation without
>>> libxml2 to the nolibxml branch. If libxml2 is available, it's still used
>>> by default. --disable-libxml2 or some env variables can be used to
>>> force the minimalistic implementation if needed. The minimalistic
>>> implementation is only guaranteed to import XML files that were
>>> generated by hwloc (even if libxml was enabled there).
>>> 
>>> I also backported most of this to the new v1.2-ompi branch (required to
>>> backport some other XML cleanups from trunk). This branch will now serve
>>> as a base for Open MPI's embedded hwloc. The idea is to have a complete
>>> v1.2 + nolibxml somewhere so that we can at least run make check (Open
>>> MPI does not embed enough to run hwloc's make check).
>>> 
>>> How do we proceed now? Can we have the OMPI guys test the new code soon?
>>> Should I wait for their feedback before merging the nolibxml branch into
>>> the trunk? I'd like to merge this in v1.3 too (and basically release rc2
>>> as the actual first feature-complete RC), so getting feedback early
>>> might be appreciated.
>>> 
>>> Brice
>>> 
>>> ___
>>> hwloc-devel mailing list
>>> hwloc-de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
>> 
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 
> 
> 

Re: [hwloc-devel] Some practical hwloc API feedback

2011-09-24 Thread Ralph Castain

On Sep 24, 2011, at 5:52 AM, Jeff Squyres wrote:

> On Sep 24, 2011, at 7:46 AM, Jeff Squyres wrote:
> 
>>> The funky thing here is that the parent/child links between the first
>>> socket and its core go across level 2 because nothing matches there. In
>>> the first socket, you have Socket(depth1)->Core(depth3) while in the
>>> second socket you have Socket(depth1)->Cache(depth2)->Core(depth3)
> 
> Oh crap; this scenario is explicitly listed in the figure on page 21 of the 
> letter PDF.  :-\
> 
> So... disregard my comments here.

You might want to point that section out in hwloc.h as this is somewhat 
non-intuitive.

> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com




Re: [hwloc-devel] Some practical hwloc API feedback

2011-09-23 Thread Ralph Castain

On Sep 22, 2011, at 3:05 PM, Brice Goglin wrote:

> Le 22/09/2011 22:42, Ralph Castain a écrit :
>> I guess I didn't get that from your documentation. Since caches sit
>> between socket and core, they appear to affect the depth of the core
>> in a given socket. Thus, if there are different numbers of caches in
>> the different sockets on a node, then the core/pu level would change
>> across the sockets.
> 
> No, the level always contains all elements of the same type (+depth for
> caches), even if they are not at the same "distance" to the root (not
> "depth").
> 
> Let's say you have two single-core sockets. One with no cache. One with
> a L1.
> What happens is:
> * first level/depth is socket, contains two sockets, covers all cores.
> * level 2 is the L1 cache, single element, *does not cover all cores*
> * level 3 is core, two elements.
> 
> The funky thing here is that the parent/child links between the first
> socket and its core go across level 2 because nothing matches there. In
> the first socket, you have Socket(depth1)->Core(depth3) while in the
> second socket you have Socket(depth1)->Cache(depth2)->Core(depth3)
> 
> So what we call "depth" in hwloc, is not the number of parent/child
> links between you and the root, it's really the number of levels between
> you and the root, even if you don't have any parent in some of these levels.
> 
> Looks like we need to clarify this :)
> 

Indeed - having the above example in hwloc.h would help. I think the key thing 
here is that the depth for a given type is being set across the entire node, 
and not by the local structure - i.e., the depth of the core in your example is 
determined by looking across the node at the max depth of any core in its local 
structure. Those of us coming from the chip world will find that confusing, as 
we look at things one socket at a time, but we can adapt.
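
To make sure I have it straight, here is a toy model of the two-socket example. It follows my reading above (the depth of a type is the max over the node), which matches this example but may not be how hwloc actually builds its levels. The types and functions (toy_obj, toy_links_to_root, toy_level_depth) are hypothetical and are not part of the hwloc API.

```c
#include <stddef.h>

/* Toy model only: these types are hypothetical, not hwloc's. */
enum toy_type { TOY_MACHINE, TOY_SOCKET, TOY_CACHE, TOY_CORE, TOY_NTYPES };

struct toy_obj {
    enum toy_type type;
    struct toy_obj *parent;
};

/* Number of parent/child links between obj and the root. */
int toy_links_to_root(const struct toy_obj *obj)
{
    int links = 0;
    while (obj->parent != NULL) {
        obj = obj->parent;
        links++;
    }
    return links;
}

/*
 * Node-wide depth of a type: the largest number of links to the root
 * seen for any object of that type. Every object of the type then
 * reports this depth, whatever its own local chain looks like.
 * Returns -1 if no object of that type exists.
 */
int toy_level_depth(struct toy_obj *const *objs, size_t nobjs,
                    enum toy_type type)
{
    int depth = -1;
    size_t i;

    for (i = 0; i < nobjs; i++) {
        if (objs[i]->type == type) {
            int links = toy_links_to_root(objs[i]);
            if (links > depth)
                depth = links;
        }
    }
    return depth;
}
```

With a machine root, one socket holding a core directly and one socket holding a cache and then a core, the cacheless core is 2 links from the root but still sits in the level at depth 3, because the other core is 3 links down.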

All that said, if I put my dictionary away and can get the code to work, 
hopefully we won't have to parse thru it again. :-)

Thanks!



Re: [hwloc-devel] Patch for hwloc_obj_attr_snprintf

2011-09-22 Thread Ralph Castain
Resending - had to join list!


Oh - I should have noted that I took the labels directly from the xml output. 
So cachesize and cacheline are what you have in the xml output.

I don't care if they match, though - just pointing it out.
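
Either way, the change boils down to putting a label in front of each value. A rough sketch using the size=/linesize= labels from Brice's variant quoted below; toy_cache_attr_snprintf and its parameters are hypothetical and this is not the actual hwloc_obj_attr_snprintf() code.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Format cache attributes with explicit labels.
 * Hypothetical helper for illustration; it is not the hwloc function,
 * and the label names are just the ones proposed in this thread.
 * Returns what snprintf returns: the length the full string needs.
 */
int toy_cache_attr_snprintf(char *buf, size_t buflen,
                            unsigned long long size_bytes,
                            unsigned linesize,
                            const char *separator)
{
    return snprintf(buf, buflen, "size=%lluKB%slinesize=%u",
                    size_bytes / 1024ULL, separator, linesize);
}
```

Called with a 4096KB cache, a 64-byte line and " " as the separator, this fills buf with "size=4096KB linesize=64".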


On Sep 22, 2011, at 1:59 PM, Brice Goglin wrote:

> Hello Ralph,
> 
> Indeed, adding something before the cache size might be good.
> 
> But if I was picky, I would say "size=32kB linesize=64". The word "Cache" is 
> already written above (in the object type), why would we duplicate it in 
> "Cachesize" and "Cacheline" ?
> 
> Right now, lstopo shows:
> L3Cache L#3 (4096KB line=64)
> With your patch, it would say:
> L3Cache L#3 (Cachesize=4096KB Cacheline=64)
> With my variant, it would say:
> L3Cache L#3 (size=4096KB linesize=64)
> 
> Brice
> 
> 
> 
> 
> Le 22/09/2011 21:27, Jeff Squyres a écrit :
>> 
>> Ralph noticed the following when working on integrating hwloc deeply into 
>> OMPI, and suggests the attached patch.  Does it look good to you guys?
>> 
>> -
>> 
>> Something isn't right with hwloc_obj_attr_snprintf() when the object is a 
>> cache. I get this when printing the topology of my Mac:
>> 
>>  Detected Resources: Type: Machine Number of child objects: 1
>>  Name=NULL
>>  total=3145728KB
>>  Backend=Darwin
>>  OSName=Darwin
>>  OSRelease=10.8.0
>>  OSVersion="Darwin Kernel Version 10.8.0: Tue Jun  7 16:33:36 
>> PDT 2011; root:xnu-1504.15.3~1/RELEASE_I386"
>>  Architecture=i386
>>  Cpuset:  0x0003
>>  Online:  0x0003
>>  Allowed: 0x0003
>>  Type: NUMANode Number of child objects: 1
>>  Name=NULL
>>  local=3145728KB
>>  total=3145728KB
>>  Cpuset:  0x0003
>>  Online:  0x0003
>>  Allowed: 0x0003
>>  Type: Socket Number of child objects: 1
>>  Name=NULL
>>  
>>  Cpuset:  0x0003
>>  Online:  0x0003
>>  Allowed: 0x0003
>>  Type: L2Cache Number of child objects: 2
>>  Name=NULL
>>  4096KB
>>  line=64
>>  Cpuset:  0x0003
>>  Online:  0x0003
>>  Allowed: 0x0003
>>  Type: L1Cache Number of child objects: 1
>>  Name=NULL
>>  32KB
>>  line=64
>>  Cpuset:  0x0001
>>  Online:  0x0001
>>  Allowed: 0x0001
>>  Type: Core Number of child 
>> objects: 1
>>  Name=NULL
>>  
>>  Cpuset:  0x0001
>>  Online:  0x0001
>>  Allowed: 0x0001
>>  Type: PU Number of 
>> child objects: 0
>>  Name=NULL
>>  
>>  Cpuset:  
>> 0x0001
>>  Online:  
>> 0x0001
>>  Allowed: 
>> 0x0001
>>  Type: L1Cache Number of child objects: 1
>>  Name=NULL
>>  32KB
>>  line=64
>>  Cpuset:  0x0002
>>  Online:  0x0002
>>  Allowed: 0x0002
>>  Type: Core Number of child 
>> objects: 1
>>  Name=NULL
>>  
>>  Cpuset:  0x0002
>>  Online:  0x0002
>>  Allowed: 0x0002
>>