[hwloc-devel] v1.11.0
Hi folks

I’ve been working on updating the OMPI hwloc code to the 1.11 version. I reported the config issue via Jeff, so I updated to the latest nightly tarball of 1.11 to pick up that change. I’m now able to configure, but hit one last required change to make it build:

diff --git a/opal/mca/hwloc/hwloc1110/hwloc/src/topology.c b/opal/mca/hwloc/hwloc1110/hwloc/src/topology.c
index 8d129d0..01be274 100644
--- a/opal/mca/hwloc/hwloc1110/hwloc/src/topology.c
+++ b/opal/mca/hwloc/hwloc1110/hwloc/src/topology.c
@@ -2599,7 +2599,7 @@ next_noncpubackend:
       && strcmp(topology->backends->component->name, "xml")) {
     char *value;
     /* add a hwlocVersion */
-    hwloc_obj_add_info(topology->levels[0][0], "hwlocVersion", VERSION);
+    hwloc_obj_add_info(topology->levels[0][0], "hwlocVersion", HWLOC_VERSION);
     /* add a ProcessName */
     value = hwloc_progname(topology);
     if (value) {

I’m not sure if this is a prefixing issue when embedded, or a more general problem. Any thoughts?

Ralph
Re: [hwloc-devel] hwloc failures
FWIW: I just downloaded and built 1.10.0 without problem on Mac Yosemite using GCC. I have the Darwin ports libxml2 installed - version 2.9.2.

> On Nov 19, 2014, at 1:28 PM, Brice Goglin wrote:
>
> Which version of libxml2 do you have?
>
> Brice
>
> On 19/11/2014 22:26, Balaji, Pavan wrote:
>> I’m seeing the following failure with hwloc on the mac (yosemite):
>>
>> CC topology-xml-libxml.lo
>> ../../../../../../../../../mpich/src/pm/hydra/tools/topo/hwloc/hwloc/src/topology-xml-libxml.c:17:27:
>> fatal error: libxml/parser.h: No such file or directory
>> #include <libxml/parser.h>
>>
>> This is GNU compilers and the latest hwloc release. I have libxml2
>> installed.
>>
>> Do I need to install a different package? Why is configure not able to
>> detect it? What files can I send to help diagnose this?
>>
>> -- Pavan
>>
>> --
>> Pavan Balaji
>> http://www.mcs.anl.gov/~balaji
>>
>> ___
>> hwloc-devel mailing list
>> hwloc-de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/hwloc-devel/2014/11/4296.php
>
> ___
> hwloc-devel mailing list
> hwloc-de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
> Link to this post:
> http://www.open-mpi.org/community/lists/hwloc-devel/2014/11/4297.php
Re: [hwloc-devel] Using hwloc to detect Hard Disks
True - but we intend to collect the inventory as root anyway. :-)

On Sep 23, 2014, at 1:50 PM, Christopher Samuel <sam...@unimelb.edu.au> wrote:

> On 24/09/14 00:57, Ralph Castain wrote:
>
>> Memory info is available from lshw, though they are a GPL code:
>
> FWIW on this laptop (Intel Haswell) lshw only report DIMM info when run
> as root, which I suspect would point them to accessing DMI information
> via /dev/mem.
>
> Using strace supports this:
>
> 3405 open("/dev/mem", O_RDONLY) = -1 EACCES (Permission denied)
>
> FWIW dmidecode does the same.
>
> samuel@haswell:~$ dmidecode
> # dmidecode 2.12
> /dev/mem: Permission denied
>
> All the best,
> Chris
> --
> Christopher Samuel    Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/ http://twitter.com/vlsci
>
> ___
> hwloc-devel mailing list
> hwloc-de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
> Link to this post:
> http://www.open-mpi.org/community/lists/hwloc-devel/2014/09/4238.php
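The EACCES result in the strace output quoted above can be tested for up front, before any attempt to read DMI data. What follows is only a minimal sketch in C, assuming nothing beyond POSIX open(); probe_readable and example_usage are hypothetical names for illustration and are not part of hwloc, lshw, or dmidecode.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: try to open `path` read-only.
 * Returns 0 if the open succeeds, otherwise the errno value
 * reported by open() (for example EACCES for an unprivileged
 * user on /dev/mem, as in the strace line quoted above). */
int probe_readable(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return errno;
    close(fd);
    return 0;
}

/* Usage sketch: a path that does not exist yields ENOENT, and the
 * current directory can be opened read-only. */
void example_usage(void)
{
    assert(probe_readable("/nonexistent/hwloc-probe-example") == ENOENT);
    assert(probe_readable(".") == 0);
}
```

A caller collecting inventory could call probe_readable("/dev/mem") and skip the DIMM fields when the result is EACCES, rather than failing the whole collection.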
Re: [hwloc-devel] Using hwloc to detect Hard Disks
Memory info is available from lshw, though they are a GPL code:

  *-bank:0
       description: DIMM Synchronous 1333 MHz (0.8 ns)
       product: M393B1K70DH0-YH9
       vendor: 0x80CE
       physical id: 0
       serial: 0x85B5FED3
       slot: DIMM_A1
       size: 8GiB
       width: 64 bits
       clock: 1333MHz (0.8ns)

Not sure how they are getting it, but I can have someone look at the code to see where the info is being obtained.

On Sep 22, 2014, at 8:54 PM, Ralph Castain <r...@open-mpi.org> wrote:

> On Sep 22, 2014, at 4:58 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>
>> On Sep 22, 2014, at 6:55 PM, Brice Goglin <brice.gog...@inria.fr> wrote:
>>
>>>> HWLOC already provides similar info for processors and mother boards, so
>>>> it seemed a natural extension of current capabilities to provide it for
>>>> other system elements.
>>>
>>> Disk vendor/model is easy to add from sysfs on Linux. I don't know where
>>> to find the serial number. Spindle speed may require more than just
>>> sysfs. Do you have more info on how to get these attributes?
>>>
>>> For memory, we currently have a single memory object for all DIMMs of a
>>> single NUMA node. Adding multiple objects may not be useful, but adding
>>> many serials to a single NUMA object may be ugly.
>>> There is some information about physical memory in
>>> /sys/devices/system/node/node0/memory* but it doesn't correspond to
>>> DIMMs (I have 135 of them on my laptop for only 2 SODIMMs). dmidecode
>>> gets DIMM info somehow.
>>
>> Back in Nehalem days, it wasn't possible to map Linux kernel "physical"
>> memory back to individual DIMMs (because the BIOS could/would introduce
>> another layer of kernel<-->DIMM mapping that the kernel might not be aware
>> of).
>>
>> Has that changed?
>
> I don't think so, no - at least, I'm not sure you can map a specific DIMM to
> a specific address within a NUMA region. However, we can at least add the
> DIMMs to the root-object attributes. In addition, you can certainly map a
> DIMM to a specific DIMM socket, and I believe that means you can map it to a
> given NUMA region even if you can't say *where* it is within that region.
> Have to verify that.
>
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>> ___
>> hwloc-devel mailing list
>> hwloc-de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/hwloc-devel/2014/09/4229.php
Re: [hwloc-devel] Using hwloc to detect Hard Disks
On Sep 22, 2014, at 4:58 PM, Jeff Squyres (jsquyres) wrote:

> On Sep 22, 2014, at 6:55 PM, Brice Goglin wrote:
>
>>> HWLOC already provides similar info for processors and mother boards, so it
>>> seemed a natural extension of current capabilities to provide it for other
>>> system elements.
>>
>> Disk vendor/model is easy to add from sysfs on Linux. I don't know where
>> to find the serial number. Spindle speed may require more than just
>> sysfs. Do you have more info on how to get these attributes?
>>
>> For memory, we currently have a single memory object for all DIMMs of a
>> single NUMA node. Adding multiple objects may not be useful, but adding
>> many serials to a single NUMA object may be ugly.
>> There is some information about physical memory in
>> /sys/devices/system/node/node0/memory* but it doesn't correspond to
>> DIMMs (I have 135 of them on my laptop for only 2 SODIMMs). dmidecode
>> gets DIMM info somehow.
>
> Back in Nehalem days, it wasn't possible to map Linux kernel "physical"
> memory back to individual DIMMs (because the BIOS could/would introduce
> another layer of kernel<-->DIMM mapping that the kernel might not be aware
> of).
>
> Has that changed?

I don't think so, no - at least, I'm not sure you can map a specific DIMM to a specific address within a NUMA region. However, we can at least add the DIMMs to the root-object attributes. In addition, you can certainly map a DIMM to a specific DIMM socket, and I believe that means you can map it to a given NUMA region even if you can't say *where* it is within that region. Have to verify that.

>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [hwloc-devel] Interesting warning
Yep, that worked!

On Sep 12, 2014, at 1:30 AM, Samuel Thibault <samuel.thiba...@inria.fr> wrote:

> Hello,
>
> Ralph Castain wrote on Wed 10 Sep 2014 17:41:17 -0700:
>> Just got this from Clang 3.4.2 on Linux x86-64:
>>
>> In file included from topology-x86.c:23:
>> /home/common/openmpi/svn-trunk/opal/mca/hwloc/hwloc191/hwloc/include/private/cpuid-x86.h:67:3: warning: extension used [-Wlanguage-extension-token]
>>   asm(
>>   ^
>> 1 warning generated.
>>
>>
>> Guess it doesn't like that assembler in there
>
> Could you try the attached patch?
>
> Samuel
[hwloc-devel] Interesting warning
Just got this from Clang 3.4.2 on Linux x86-64:

In file included from topology-x86.c:23:
/home/common/openmpi/svn-trunk/opal/mca/hwloc/hwloc191/hwloc/include/private/cpuid-x86.h:67:3: warning: extension used [-Wlanguage-extension-token]
  asm(
  ^
1 warning generated.

Guess it doesn't like that assembler in there

Ralph
Re: [hwloc-devel] GIT: hwloc branch master updated. 0e6fe307c10d47efee3fb95c50aee9c0f01bc8ec
Jeff just left today for a 1-week vacation. However, this came up on the OMPI mailing list - turns out that some Linux distros automatically set LS_COLORS in your environment when running old versions of csh/tcsh via their default dot files, and it can cause problems with the script. So just ensuring it isn't set solves the problem.

On Mar 29, 2014, at 7:59 AM, Brice Goglin wrote:

> Jeff,
> Where does this LS_COLORS variable come from? Who is setting it?
> Brice
>
> On 27/03/2014 11:45, MPI Team wrote:
>> The branch, master has been updated
>>        via  0e6fe307c10d47efee3fb95c50aee9c0f01bc8ec (commit)
>>       from  00f85033d269e2c312370bb24043f92a92dff7e3 (commit)
>>
>> Those revisions listed above that are new to this repository have
>> not appeared on any other notification email; so we list those
>> revisions in full, below.
>>
>> - Log -
>> https://github.com/open-mpi/hwloc/commit/0e6fe307c10d47efee3fb95c50aee9c0f01bc8ec
>>
>> commit 0e6fe307c10d47efee3fb95c50aee9c0f01bc8ec
>> Author: Jeff Squyres
>> Date:   Thu Mar 27 06:28:45 2014 -0400
>>
>>     BUILD: fix "make dist" failure on some linux distro with old csh/tcsh
>>
>>     On some linux distro (sles11sp2) csh fails to parse $LS_COLORS and
>>     borks with error: Unknown colorls variable `mh'.
>>
>>     The workaround is to unset LS_COLORS before calling to csh script.
>> ---
>>  Makefile.am | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/Makefile.am b/Makefile.am
>> index ca9c00c..34d0aa2 100644
>> --- a/Makefile.am
>> +++ b/Makefile.am
>> @@ -1,6 +1,6 @@
>>  # Copyright © 2009-2014 Inria. All rights reserved.
>>  # Copyright © 2009 Université Bordeaux 1
>> -# Copyright © 2009-2010 Cisco Systems, Inc. All rights reserved.
>> +# Copyright © 2009-2014 Cisco Systems, Inc. All rights reserved.
>>  # See COPYING in top-level directory.
>>
>>  # Note that the -I directory must *exactly* match what was specified
>> @@ -48,7 +48,7 @@ endif
>>
>>  if HWLOC_BUILD_STANDALONE
>>  dist-hook:
>> -	csh "$(top_srcdir)/config/distscript.csh" "$(top_srcdir)" "$(distdir)" "$(HWLOC_VERSION)"
>> +	env LS_COLORS= csh "$(top_srcdir)/config/distscript.csh" "$(top_srcdir)" "$(distdir)" "$(HWLOC_VERSION)"
>>  endif HWLOC_BUILD_STANDALONE
>>
>>  #
>>
>> ---
>>
>> Summary of changes:
>>  Makefile.am | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>>
>> hooks/post-receive
>
> ___
> hwloc-devel mailing list
> hwloc-de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
Re: [hwloc-devel] Attribute request
I'd prefer your first option - it's easy enough to check the info objects for existence of a particular attribute.

On Jan 29, 2014, at 1:12 AM, Brice Goglin <brice.gog...@inria.fr> wrote:

> Assuming people will confirm that ARM information isn't so simple, I wonder
> where it's better to put architecture specific fields. With the proposed
> solution, Intel and ARM would be different:
>   Architecture=x86_64
>   CPUVendor=GenuineIntel
>   CPUModel=Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
>   CPUModelNumber=45
>   CPUFamilyNumber=6
> and
>   Architecture=armv7l
>   CPUVendor=cardhu
>   CPUModel=ARMv7 Processor rev 9 (v7l)
>   CPUImplementer=0x41
>   CPUArchitecture=7
>   CPUVariant=0x2
>   CPUPart=0xc09
>   CPURevision=9
>
> We could also merge those arch-specific into a single generic one:
>   Architecture=x86_64
>   CPUVendor=GenuineIntel
>   CPUModel=Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
>   CPUModelNumber=family=6;model=45
> and
>   Architecture=armv7l
>   CPUVendor=cardhu
>   CPUModel=ARMv7 Processor rev 9 (v7l)
>   CPUModelNumber=implementer=0x41;architecture=7;variant=0x2;part=0xc09;revision=9
>
> The drawback is that you'd have to parse CPUModelNumber to extract family and
> model.
>
> I am not sure which one is best.
>
> Brice
>
> On 28/01/2014 00:09, Brice Goglin wrote:
>> Hello,
>> I have some code that seems to work. Here's what it reports below. Does that
>> look ok to you?
>> I had to modify quite a lot of things to make the parsing of /proc/cpuinfo
>> more robust (the code is basically arch-specific now), so I am not sure
>> we'll be able to backport this to OMPI.
>> Brice
>>
>> * Sandy-Bridge Xeon E5 (Stampede)
>>   CPUVendor=GenuineIntel
>>   CPUModel=Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
>>   CPUModelNumber=45
>>   CPUFamilyNumber=6
>> * Old Nehalem-EX
>>   CPUVendor=GenuineIntel
>>   CPUModel=Intel(R) Xeon(R) CPU E7540 @ 2.00GHz
>>   CPUModelNumber=46
>>   CPUFamilyNumber=6
>> * Itanium
>>   CPUVendor=GenuineIntel
>>   CPUModel=Dual-Core Intel(R) Itanium(R) Processor 9140N
>>   CPUModelNumber=1
>>   CPUFamilyNumber=32
>> * AMD
>>   CPUVendor=AuthenticAMD
>>   CPUModel=Dual Core AMD Opteron(tm) Processor 865
>>   CPUModelNumber=33
>>   CPUFamilyNumber=15
>> * MIC (Stampede)
>>   CPUVendor=GenuineIntel
>>   CPUModel=0b/01
>>   CPUModelNumber=1
>>   CPUFamilyNumber=11
>>
>> On 23/01/2014 19:50, Ralph Castain wrote:
>>> That would be perfect! Thanks
>>>
>>> On Jan 23, 2014, at 10:27 AM, Brice Goglin <brice.gog...@inria.fr> wrote:
>>>
>>>> Should be easy on Linux, sure.
>>>> The model name is already known as CPUModel in hwloc.
>>>> We should likely add CPUVendor (would be GenuineIntel or AuthenticAMD),
>>>> CPUFamily (or CPUFamilyNumber if there's a name for these families?) and
>>>> CPUModelNumber ?
>>>>
>>>> Brice
>>>>
>>>> On 23/01/2014 19:09, Ralph Castain wrote:
>>>>> Hi folks
>>>>>
>>>>> Looking at the current topology info, I see you capture the model name
>>>>> for the socket, but not a couple of other key things Intel could use:
>>>>>
>>>>> cpu family : 6
>>>>> model : 44
>>>>> model name : Intel(R) Xeon(R) CPU E5645 @ 2.40GHz
>>>>>
>>>>> Both the cpu family and model are important to us - any issue with adding
>>>>> them to the "infos" array?
>>>>>
>>>>> Ralph
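To make the drawback of the merged form concrete, here is a minimal sketch in C of what a consumer would have to do to pull one field out of a string such as "family=6;model=45". It assumes the merged value really is a ';'-separated list of key=value pairs with numeric values, as in the proposal quoted above; merged_info_get and example_usage are hypothetical names and are not hwloc functions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: look up `key` in a "k1=v1;k2=v2" string and
 * parse its value as an integer. Base 0 lets strtol accept both
 * decimal ("45") and hexadecimal ("0xc09") values.
 * Returns 0 and stores the value in *out on success; returns -1 if
 * the key is absent or its value is not a complete number. */
int merged_info_get(const char *merged, const char *key, long *out)
{
    size_t keylen = strlen(key);
    const char *p = merged;

    while (p && *p) {
        const char *end = strchr(p, ';');
        size_t fieldlen = end ? (size_t)(end - p) : strlen(p);

        /* The field must be exactly "<key>=<something>". */
        if (fieldlen > keylen && strncmp(p, key, keylen) == 0
            && p[keylen] == '=') {
            char *stop;
            long v = strtol(p + keylen + 1, &stop, 0);
            /* Reject empty values and trailing garbage inside the field. */
            if (stop == p + keylen + 1 || stop != p + fieldlen)
                return -1;
            *out = v;
            return 0;
        }
        p = end ? end + 1 : NULL;
    }
    return -1;
}

/* Usage sketch with the x86 example from the message above. */
void example_usage(void)
{
    long v = 0;
    assert(merged_info_get("family=6;model=45", "model", &v) == 0 && v == 45);
    assert(merged_info_get("family=6;model=45", "stepping", &v) == -1);
}
```

With separate attributes (the first option), the same lookup is a plain name match on the info array and no parsing of the value is needed, which is the simplicity the reply above is arguing for.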
Re: [hwloc-devel] Attribute request
Hello,
I have some code that seems to work. Here's what it reports below. Does that look ok to you?
I had to modify quite a lot of things to make the parsing of /proc/cpuinfo more robust (the code is basically arch-specific now), so I am not sure we'll be able to backport this to OMPI.
Brice

* Sandy-Bridge Xeon E5 (Stampede)
  CPUVendor=GenuineIntel
  CPUModel=Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
  CPUModelNumber=45
  CPUFamilyNumber=6
* Old Nehalem-EX
  CPUVendor=GenuineIntel
  CPUModel=Intel(R) Xeon(R) CPU E7540 @ 2.00GHz
  CPUModelNumber=46
  CPUFamilyNumber=6
* Itanium
  CPUVendor=GenuineIntel
  CPUModel=Dual-Core Intel(R) Itanium(R) Processor 9140N
  CPUModelNumber=1
  CPUFamilyNumber=32
* AMD
  CPUVendor=AuthenticAMD
  CPUModel=Dual Core AMD Opteron(tm) Processor 865
  CPUModelNumber=33
  CPUFamilyNumber=15
* MIC (Stampede)
  CPUVendor=GenuineIntel
  CPUModel=0b/01
  CPUModelNumber=1
  CPUFamilyNumber=11

On 23/01/2014 19:50, Ralph Castain wrote:
> That would be perfect! Thanks
>
> On Jan 23, 2014, at 10:27 AM, Brice Goglin <brice.gog...@inria.fr> wrote:
>
>> Should be easy on Linux, sure.
>> The model name is already known as CPUModel in hwloc.
>> We should likely add CPUVendor (would be GenuineIntel or AuthenticAMD),
>> CPUFamily (or CPUFamilyNumber if there's a name for these families?) and
>> CPUModelNumber ?
>>
>> Brice
>>
>> On 23/01/2014 19:09, Ralph Castain wrote:
>>> Hi folks
>>>
>>> Looking at the current topology info, I see you capture the model name for
>>> the socket, but not a couple of other key things Intel could use:
>>>
>>> cpu family : 6
>>> model : 44
>>> model name : Intel(R) Xeon(R) CPU E5645 @ 2.40GHz
>>>
>>> Both the cpu family and model are important to us - any issue with adding
>>> them to the "infos" array?
>>>
>>> Ralph
Re: [hwloc-devel] Attribute request
That would be perfect! Thanks

On Jan 23, 2014, at 10:27 AM, Brice Goglin <brice.gog...@inria.fr> wrote:

> Should be easy on Linux, sure.
> The model name is already known as CPUModel in hwloc.
> We should likely add CPUVendor (would be GenuineIntel or AuthenticAMD),
> CPUFamily (or CPUFamilyNumber if there's a name for these families?) and
> CPUModelNumber ?
>
> Brice
>
> On 23/01/2014 19:09, Ralph Castain wrote:
>> Hi folks
>>
>> Looking at the current topology info, I see you capture the model name for
>> the socket, but not a couple of other key things Intel could use:
>>
>> cpu family : 6
>> model : 44
>> model name : Intel(R) Xeon(R) CPU E5645 @ 2.40GHz
>>
>> Both the cpu family and model are important to us - any issue with adding
>> them to the "infos" array?
>>
>> Ralph
[hwloc-devel] Attribute request
Hi folks

Looking at the current topology info, I see you capture the model name for the socket, but not a couple of other key things Intel could use:

cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU E5645 @ 2.40GHz

Both the cpu family and model are important to us - any issue with adding them to the "infos" array?

Ralph
[hwloc-devel] Support for new architecture
Hi folks

We are seeing a new architecture appearing in the very near future, and I'm not sure how hwloc will handle it. Consider the following case:

* I have a rack that contains multiple "hosts"

* each host consists of a box/shelf with common support infrastructure in it - it has some kind of controller in it, and might have some networking support, maybe a pool of memory that can be allocated across the occupants.

* in the host, I have one or more "boards". Each board again has a controller in it with some common infrastructure to support its local sockets - might include some networking that would look like NICs (though not necessarily on a PCIe interface), a board-level memory pool, etc.

* each socket contains one or more die. Each die runs its own instance of an OS - probably a lightweight kernel - that can vary between dies (e.g., might have a tweaked configuration), and has its own associated memory that will physically reside outside the socket. You can think of each die as constituting a "shared memory locus" - i.e., processes running on that die can share memory between them as it would sit under the same OS instance.

* each die has some number of cores/hwthreads/caches etc.

Note that the sockets are not sitting in some PCIe bus - they appear to be directly connected to the overall network just like a "node" would appear today. However, there is a definite need for higher layers (RMs and MPIs) to understand this overall hierarchy and the "distances" between the individual elements.

Any thoughts on how we can support this?

Ralph
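As a way of picturing the hierarchy described above, the following is a rough sketch in C of one possible representation, with a crude "distance" defined as how many levels two cores must climb before they share a container. Everything here is an assumption made for illustration: the level names, struct location, levels_to_common_container, and same_memory_locus are hypothetical and do not correspond to any existing hwloc type or function.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical levels, outermost first, following the description above. */
enum level { LVL_RACK, LVL_HOST, LVL_BOARD, LVL_SOCKET, LVL_DIE, LVL_CORE,
             LVL_COUNT };

/* A core's position: one index per level, e.g. rack 0, host 1, ... */
struct location {
    unsigned idx[LVL_COUNT];
};

/* Number of levels to climb from a core until both locations sit in the
 * same container: 0 = same core, 1 = same die, 2 = same socket, ...,
 * LVL_COUNT = not even the same rack. */
unsigned levels_to_common_container(const struct location *a,
                                    const struct location *b)
{
    for (int l = 0; l < LVL_COUNT; l++)
        if (a->idx[l] != b->idx[l])
            return (unsigned)(LVL_COUNT - l);
    return 0;
}

/* Two cores can share memory directly only if they are on the same die,
 * i.e. under the same OS instance in the description above. */
bool same_memory_locus(const struct location *a, const struct location *b)
{
    return levels_to_common_container(a, b) <= 1;
}

/* Usage sketch: two cores on the same die versus on different dies. */
void example_usage(void)
{
    struct location a = {{0, 1, 0, 2, 1, 3}};
    struct location b = {{0, 1, 0, 2, 1, 7}};
    struct location c = {{0, 1, 0, 2, 0, 3}};
    assert(same_memory_locus(&a, &b));
    assert(!same_memory_locus(&a, &c));
}
```

A resource manager could weight these level counts differently (a hop across boards is presumably costlier than a hop across dies), but how the real distances should be obtained is exactly the open question of the message.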
[hwloc-devel] Strange difference
Yo guys

I was doing some work that involved traversing the hwloc topo tree, and encountered the following odd discrepancy.

hwloc_topology_get_depth => returns "unsigned"
hwloc_get_type_depth => returns "int"

Why the difference? Makes it hard sometimes to avoid the "comparison between unsigned and signed" warnings when using these functions.

Ralph
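One way to avoid the signed/unsigned comparison warning described above is to reject negative depths first and only then cast. The sketch below assumes that a negative value from hwloc_get_type_depth is a special marker (such as "unknown" or "multiple") rather than a real depth. depth_in_range is a hypothetical helper that takes plain integers, so it compiles without the hwloc headers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: true when `type_depth` (an int, as returned by
 * hwloc_get_type_depth) names a real level of a topology whose total
 * depth is `topo_depth` (an unsigned, as returned by
 * hwloc_topology_get_depth). Negative values are filtered out before
 * the cast, so the final comparison is unsigned against unsigned and
 * triggers no sign-compare warning. */
bool depth_in_range(int type_depth, unsigned topo_depth)
{
    if (type_depth < 0)
        return false;
    return (unsigned)type_depth < topo_depth;
}

/* Usage sketch: depth 0 is the root of a 3-level topology, while a
 * negative marker value is never a valid level. */
void example_usage(void)
{
    assert(depth_in_range(0, 3));
    assert(!depth_in_range(-1, 3));
}
```

In real code the two arguments would come from hwloc_get_type_depth(topology, type) and hwloc_topology_get_depth(topology) respectively.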
Re: [hwloc-devel] xml file load incompatibilities
Okay, I found it - it was a sequencing problem in OMPI itself (we "set" the new topology too late in the setup sequence). Sorry for the false alarm.

Thanks for the help!
Ralph

On Sep 20, 2013, at 11:36 PM, Brice Goglin <brice.gog...@inria.fr> wrote:

> Strange, the backtrace below looks totally crazy, I don't see how debug checks
> could still pass in that case.
> Any chance you valgrind that thing?
>
> Brice
>
> On 21/09/2013 01:09, Ralph Castain wrote:
>> Hmmm...nope, not a peep (no extra output at all). Just segfaulted like
>> before.
>>
>> On Sep 20, 2013, at 4:06 PM, Brice Goglin <brice.gog...@inria.fr> wrote:
>>
>>> Try adding HWLOC_DEBUG_CHECK=1 in your environment, it will enable many
>>> assertions at the end of hwloc_topology_load()
>>>
>>> Brice
>>>
>>> On 21/09/2013 01:03, Ralph Castain wrote:
>>>> I didn't try loading it with lstopo - just tried the OMPI trunk. It loads
>>>> okay, but segfaults when you try to find an object by depth
>>>>
>>>> #0 0x0001005fe5dc in opal_hwloc172_hwloc_get_obj_by_depth (topology=Cannot access memory at address 0xfff7) at traversal.c:623
>>>> #1 0x000100b6dfaa in opal_hwloc172_hwloc_get_root_obj (topology=Cannot access memory at address 0xfff7) at rmaps_rr_mappers.c:747
>>>> #2 0x000100b6e139 in orte_rmaps_rr_byslot (jdata=Cannot access memory at address 0xff77) at rmaps_rr_mappers.c:774
>>>> #3 0x000100b6d6da in orte_rmaps_rr_map (jdata=Cannot access memory at address 0xff17) at rmaps_rr.c:211
>>>> #4 0x000100353098 in orte_rmaps_base_map_job (fd=Cannot access memory at address 0xfe7b) at base/rmaps_base_map_job.c:320
>>>> #5 0x0001005ce28c in event_process_active_single_queue (base=Cannot access memory at address 0xffe7) at event.c:1367
>>>> #6 0x0001005ce500 in event_process_active (base=Cannot access memory at address 0xffe7) at event.c:1437
>>>> #7 0x0001005ceb71 in opal_libevent2021_event_base_loop (base=Cannot access memory at address 0xffb7) at event.c:1645
>>>> #8 0x0001002c5158 in orterun (argc=Cannot access memory at address 0xfd1b) at orterun.c:3039
>>>> #9 0x0001002c32a4 in main (argc=Cannot access memory at address 0xfffb) at main.c:14
>>>>
>>>> Looks to me like memory may be getting hosed
>>>>
>>>> On Sep 20, 2013, at 2:59 PM, Brice Goglin <brice.gog...@inria.fr> wrote:
>>>>
>>>>> I can't see any segfault. Where does the segfault occur for you? In OMPI
>>>>> only (or lstopo too)? When loading or when using the topology?
>>>>>
>>>>> I tried lstopo on that file with and without HWLOC_NO_LIBXML_IMPORT=1 (in
>>>>> case the bug is in one of XML backends), looks ok.
>>>>>
>>>>> Brice
>>>>>
>>>>> On 20/09/2013 23:53, Ralph Castain wrote:
>>>>>> Here are the two files I tried - not from the same machine. The foo.xml
>>>>>> works, the topo.xml segfaults
>>>>>>
>>>>>> One of our users reported it from their machine, but I don't have their
>>>>>> topo file.
>>>>>>
>>>>>> On Sep 20, 2013, at 2:41 PM, Brice Goglin <brice.gog...@inria.fr> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>> I don't see any reason for such an incompatibility. But there are
>>>>>>> many combinations, we can't test everything.
>>>>>>> I can't reproduce that on my machines. Can you send the XML output of
>>>>>>> both versions on one of your machines?
>>>>>>> Brice
>>>>>>>
>>>>>>> On 20/09/2013 23:32, Ralph Castain wrote:
Re: [hwloc-devel] xml file load incompatibilities
Hmmm...nope, not a peep (no extra output at all). Just segfaulted like before.

On Sep 20, 2013, at 4:06 PM, Brice Goglin <brice.gog...@inria.fr> wrote:

> Try adding HWLOC_DEBUG_CHECK=1 in your environment, it will enable many
> assertions at the end of hwloc_topology_load()
>
> Brice
>
> On 21/09/2013 01:03, Ralph Castain wrote:
>> I didn't try loading it with lstopo - just tried the OMPI trunk. It loads
>> okay, but segfaults when you try to find an object by depth
>>
>> #0 0x0001005fe5dc in opal_hwloc172_hwloc_get_obj_by_depth (topology=Cannot access memory at address 0xfff7) at traversal.c:623
>> #1 0x000100b6dfaa in opal_hwloc172_hwloc_get_root_obj (topology=Cannot access memory at address 0xfff7) at rmaps_rr_mappers.c:747
>> #2 0x000100b6e139 in orte_rmaps_rr_byslot (jdata=Cannot access memory at address 0xff77) at rmaps_rr_mappers.c:774
>> #3 0x000100b6d6da in orte_rmaps_rr_map (jdata=Cannot access memory at address 0xff17) at rmaps_rr.c:211
>> #4 0x000100353098 in orte_rmaps_base_map_job (fd=Cannot access memory at address 0xfe7b) at base/rmaps_base_map_job.c:320
>> #5 0x0001005ce28c in event_process_active_single_queue (base=Cannot access memory at address 0xffe7) at event.c:1367
>> #6 0x0001005ce500 in event_process_active (base=Cannot access memory at address 0xffe7) at event.c:1437
>> #7 0x0001005ceb71 in opal_libevent2021_event_base_loop (base=Cannot access memory at address 0xffb7) at event.c:1645
>> #8 0x0001002c5158 in orterun (argc=Cannot access memory at address 0xfd1b) at orterun.c:3039
>> #9 0x0001002c32a4 in main (argc=Cannot access memory at address 0xfffb) at main.c:14
>>
>> Looks to me like memory may be getting hosed
>>
>> On Sep 20, 2013, at 2:59 PM, Brice Goglin <brice.gog...@inria.fr> wrote:
>>
>>> I can't see any segfault. Where does the segfault occur for you? In OMPI
>>> only (or lstopo too)? When loading or when using the topology?
>>>
>>> I tried lstopo on that file with and without HWLOC_NO_LIBXML_IMPORT=1 (in
>>> case the bug is in one of XML backends), looks ok.
>>>
>>> Brice
>>>
>>> On 20/09/2013 23:53, Ralph Castain wrote:
>>>> Here are the two files I tried - not from the same machine. The foo.xml
>>>> works, the topo.xml segfaults
>>>>
>>>> One of our users reported it from their machine, but I don't have their
>>>> topo file.
>>>>
>>>> On Sep 20, 2013, at 2:41 PM, Brice Goglin <brice.gog...@inria.fr> wrote:
>>>>
>>>>> Hello,
>>>>> I don't see any reason for such an incompatibility. But there are
>>>>> many combinations, we can't test everything.
>>>>> I can't reproduce that on my machines. Can you send the XML output of
>>>>> both versions on one of your machines?
>>>>> Brice
>>>>>
>>>>> On 20/09/2013 23:32, Ralph Castain wrote:
>>>>>> Hi folks
>>>>>>
>>>>>> I've run across a rather strange behavior. We have two branches in OMPI
>>>>>> - the devel trunk (using hwloc v1.7.2) and our feature release series
>>>>>> (using hwloc 1.5.2). I have found the following:
>>>>>>
>>>>>> * the feature series can correctly load an xml file generated by lstopo
>>>>>> of versions 1.5 or greater
>>>>>>
>>>>>> * the devel series can correctly load an xml file generated by lstopo of
>>>>>> versions 1.7 or greater, but not files generated by prior versions. In
>>>>>> the latter case, I segfault as soon as I try to use the loaded topology.
>>>>>>
>>>>>> Any ideas why the discrepancy? Can I at least detect the version used to
>>>>>> create a file when loading it so I can error out instead of segfaulting?
>>>>>>
>>>>>> Ralph
Re: [hwloc-devel] xml file load incompatibilities
I didn't try loading it with lstopo - just tried the OMPI trunk. It loads okay, but segfaults when you try to find an object by depth

#0 0x0001005fe5dc in opal_hwloc172_hwloc_get_obj_by_depth (topology=Cannot access memory at address 0xfff7) at traversal.c:623
#1 0x000100b6dfaa in opal_hwloc172_hwloc_get_root_obj (topology=Cannot access memory at address 0xfff7) at rmaps_rr_mappers.c:747
#2 0x000100b6e139 in orte_rmaps_rr_byslot (jdata=Cannot access memory at address 0xff77) at rmaps_rr_mappers.c:774
#3 0x000100b6d6da in orte_rmaps_rr_map (jdata=Cannot access memory at address 0xff17) at rmaps_rr.c:211
#4 0x000100353098 in orte_rmaps_base_map_job (fd=Cannot access memory at address 0xfe7b) at base/rmaps_base_map_job.c:320
#5 0x0001005ce28c in event_process_active_single_queue (base=Cannot access memory at address 0xffe7) at event.c:1367
#6 0x0001005ce500 in event_process_active (base=Cannot access memory at address 0xffe7) at event.c:1437
#7 0x0001005ceb71 in opal_libevent2021_event_base_loop (base=Cannot access memory at address 0xffb7) at event.c:1645
#8 0x0001002c5158 in orterun (argc=Cannot access memory at address 0xfd1b) at orterun.c:3039
#9 0x0001002c32a4 in main (argc=Cannot access memory at address 0xfffb) at main.c:14

Looks to me like memory may be getting hosed

On Sep 20, 2013, at 2:59 PM, Brice Goglin <brice.gog...@inria.fr> wrote:

> I can't see any segfault. Where does the segfault occur for you? In OMPI
> only (or lstopo too)? When loading or when using the topology?
>
> I tried lstopo on that file with and without HWLOC_NO_LIBXML_IMPORT=1 (in
> case the bug is in one of XML backends), looks ok.
>
> Brice
>
> On 20/09/2013 23:53, Ralph Castain wrote:
>> Here are the two files I tried - not from the same machine. The foo.xml
>> works, the topo.xml segfaults
>>
>> One of our users reported it from their machine, but I don't have their topo
>> file.
>>
>> On Sep 20, 2013, at 2:41 PM, Brice Goglin <brice.gog...@inria.fr> wrote:
>>
>>> Hello,
>>> I don't see any reason for such an incompatibility. But there are
>>> many combinations, we can't test everything.
>>> I can't reproduce that on my machines. Can you send the XML output of
>>> both versions on one of your machines?
>>> Brice
>>>
>>> On 20/09/2013 23:32, Ralph Castain wrote:
>>>> Hi folks
>>>>
>>>> I've run across a rather strange behavior. We have two branches in OMPI -
>>>> the devel trunk (using hwloc v1.7.2) and our feature release series (using
>>>> hwloc 1.5.2). I have found the following:
>>>>
>>>> * the feature series can correctly load an xml file generated by lstopo of
>>>> versions 1.5 or greater
>>>>
>>>> * the devel series can correctly load an xml file generated by lstopo of
>>>> versions 1.7 or greater, but not files generated by prior versions. In the
>>>> latter case, I segfault as soon as I try to use the loaded topology.
>>>>
>>>> Any ideas why the discrepancy? Can I at least detect the version used to
>>>> create a file when loading it so I can error out instead of segfaulting?
>>>>
>>>> Ralph
Re: [hwloc-devel] xml file load incompatibilities
Here are the two files I tried - not from the same machine. The foo.xml works, the topo.xml segfaults.

topo.xml
Description: XML document

foo.xml
Description: XML document

One of our users reported it from their machine, but I don't have their topo file.

On Sep 20, 2013, at 2:41 PM, Brice Goglin <brice.gog...@inria.fr> wrote:

> Hello,
> I don't see any reason for such an incompatibility. But there are
> many combinations, we can't test everything.
> I can't reproduce that on my machines. Can you send the XML output of
> both versions on one of your machines?
> Brice
[hwloc-devel] xml file load incompatibilities
Hi folks

I've run across a rather strange behavior. We have two branches in OMPI - the devel trunk (using hwloc v1.7.2) and our feature release series (using hwloc 1.5.2). I have found the following:

* the feature series can correctly load an xml file generated by lstopo of versions 1.5 or greater

* the devel series can correctly load an xml file generated by lstopo of versions 1.7 or greater, but not files generated by prior versions. In the latter case, I segfault as soon as I try to use the loaded topology.

Any ideas why the discrepancy? Can I at least detect the version used to create a file when loading it so I can error out instead of segfaulting?

Ralph
Re: [hwloc-devel] v1.7
On Jan 7, 2013, at 6:05 AM, Samuel Thibault wrote:

> Hello,
>
> Brice Goglin, on Mon 31 Dec 2012 10:05:41 +0100, wrote:
>> + The HWLOC_COMPONENTS may now start with '^' to only define a list of
>> components to exclude.
>
> I'm finding it not intuitive and not generic enough, I'm wondering how
> that didn't affect Open-MPI, which IIUC uses this convention.
>
> It means that
>
> HWLOC_COMPONENTS=^cuda,opencl
>
> disables cuda *and* opencl,

FWIW: that is the OMPI convention

> while intuition would have told me that it
> disables cuda but enables opencl.
>
> Also, one would for instance want to be able to do this:
>
> HWLOC_COMPONENTS=x86,^cuda,^opencl,nvml
>
> To be able to enable x86 before the default linux, but disable cuda and
> opencl, but enable nvml, as well as all the other usual plugins (without
> having to know the list, which is important for future extensions).
>
> Samuel
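
To make the two readings of the '^' prefix concrete, here is a minimal C sketch. It is not hwloc's parser: component_excluded() and its list_wide switch are hypothetical names made up for illustration, and the sketch only answers "is this component excluded?", not the ordering question raised for x86.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, not part of hwloc: report whether component `name`
 * is excluded by the comma-separated list `spec`.
 *
 * list_wide == true:  a leading '^' turns the whole list into exclusions
 *                     (the reading described as the OMPI convention).
 * list_wide == false: '^' only negates the single entry it prefixes
 *                     (the per-entry reading suggested in the thread). */
static bool component_excluded(const char *spec, const char *name, bool list_wide)
{
    assert(spec != NULL && name != NULL);

    bool whole_list_negated = list_wide && spec[0] == '^';
    if (whole_list_negated)
        spec++;                          /* skip the list-wide marker */

    size_t name_len = strlen(name);
    while (*spec != '\0') {
        const char *entry = spec;
        size_t entry_len = strcspn(spec, ",");
        bool negated = whole_list_negated;

        if (!list_wide && entry_len > 0 && entry[0] == '^') {
            negated = true;              /* per-entry marker */
            entry++;
            entry_len--;
        }
        if (negated && entry_len == name_len &&
            strncmp(entry, name, name_len) == 0)
            return true;

        spec += strcspn(spec, ",");      /* move to the end of this entry */
        if (*spec == ',')
            spec++;                      /* and past the separator */
    }
    return false;
}
```

With the list-wide reading, component_excluded("^cuda,opencl", "opencl", true) reports opencl as excluded; with the per-entry reading (list_wide set to false) the same string excludes only cuda, which is the difference the thread is about.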
Re: [hwloc-devel] Cgroup resource limits
On Nov 4, 2012, at 7:28 PM, Christopher Samuel <sam...@unimelb.edu.au> wrote:

> On 03/11/12 09:05, Ralph Castain wrote:
>
>> System resource managers don't usually provide this capability, so
>> we will do it at the ORTE level.
>
> I would argue that the resource managers *should* be doing it

No argument from me - I would love for them to provide me with an easy API that mpirun can use to specify the requirements for a given application.

> - however,
> I will also argue that the resource managers should be doing it via
> hwloc (so I'm afraid it's not an out for you folks :-) ).

Agreed, though I leave that to the individual RMs to decide.

> It's also worth remembering that the memcg code has an appalling
> reputation with the kernel developers in terms of performance overhead,
> for instance at the recent Kernel Summit numbers were reported showing a
> substantial impact for just having the code present, but not used.
>
> Following that a patch set was sent out trying to avoid that impact if
> it's not in use which doesn't help here but does give a measure of the
> performance hit:
>
> http://lwn.net/Articles/517562/
>
> # So as one can see, the difference between base and nomemcg in terms
> # of both system time and elapsed time is quite drastic, and consistent
> # with the figures shown by Mel Gorman in the Kernel summit. This is a
> # ~7 % drop in performance, just by having memcg enabled. memcg
> # functions appear heavily in the profiles, even if all tasks lives in
> # the root memcg.

Yick! However, I would expect the community to reduce that impact over time. If systems don't want that capability, then they can and should disable it. On the other hand, if they do want it, then we want to support it.

> cheers,
> Chris
> --
> Christopher Samuel, Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/ http://twitter.com/vlsci
Re: [hwloc-devel] Cgroup resource limits
Hi Brice

I think Linux cgroups makes the most sense in terms of a mechanism for doing this. We don't already do it, but it is something our customers want to see in the platform - so we have to provide it.

The basic use-case is for an application to specify a max memory requirement, thus allowing us to subdivide the node when allocating resources. In that case, we need to ensure that the application remains within that memory limit so we don't start swapping. This is a typical "big data" requirement, and the apps know how to handle the situation where they run up against the limit (e.g., what to do when malloc returns NULL).

System resource managers don't usually provide this capability, so we will do it at the ORTE level. We already use hwloc there for resource discovery and process placement, so it seems natural to include the ability to specify limits. Since ORTE also does the process launching, it could do the final cgroup definition and pass it to Linux.

We envision an API that basically is modeled after the cgroup structure. What we would want hwloc to do is the final step - we pass in the resource constraints, including bind and memory policy specs, and hwloc does the "magic" to tell Linux what needs to be done.

Make sense?
Ralph

On Nov 2, 2012, at 2:18 PM, Brice Goglin <brice.gog...@inria.fr> wrote:

> Hello Ralph,
>
> I am not very familiar with these features. What system mechanism do you
> currently use for this? Linux cgroups? Any concrete example of what you
> would like to do?
>
> Brice
[hwloc-devel] Cgroup resource limits
Hi folks

We (Greenplum) have a need to support resource limits (e.g., memory and cpu usage) on processes running under Open MPI's RTE. OMPI uses hwloc for processor and memory affinity, so this seems a likely place to add the required support. Jeff tells me that it doesn't yet exist in hwloc - I'm wondering if you would welcome and/or be willing to consider contributions from our engineers towards adding this capability?

Obviously, we'd need to discuss how and where to do the extension. Just wanted to first see if this is an option, or if we should do it directly in OMPI.

Ralph
Re: [hwloc-devel] lstopo-nox strikes back
I don't have a strong opinion, but the historical "standard practice" for Linux/Unix has always been to default to command-line, non-graphical interfaces. Graphical output was optional. Of course, that stemmed from the days before everyone had a graphical display, but it is still generally followed.

On Apr 25, 2012, at 3:38 AM, Brice Goglin wrote:

> Hello,
>
> We recently got some complaints from redhat/centos users that wanted to
> install hwloc on their cluster but couldn't because it brought so many X
> libraries that they don't care about.
>
> Debian solves this by having two hwloc packages: the main hwloc one, and
> hwloc-nox where cairo is disabled. You just install one of them, packages are
> marked as conflicting with each other.
>
> I asked Jirka, our fellow RPM hwloc packager. He feels that RPM distros don't
> work that way. They usually have a core 'foo' package without X, and
> something such as 'foo-gui' with the X-enabled binary. So you'd have lstopo
> and lstopo-gui installed at the same time.
>
> I don't have any preference but RPM is much more widely used than deb in HPC,
> so we must consider the issue, either in hwloc or in RPM packaging. And we
> need a solution that is consistent across distros (we don't want users to get
> lost because Debian/Ubuntu lstopo is graphical while RPM lstopo is not and
> lstopo-gui is).
>
> It's not hard to build two lstopo binaries in the same hwloc (quick patch
> attached). But we'd need to decide their names (lstopo/lstopo-nox,
> lstopo/lstopo-nogui, lstopo-gui/lstopo), and find a good way to make the
> existing packages deal with them.
>
> How do people feel about this? Is it ok to choose between hwloc and hwloc-nox
> packages on Debian/Ubuntu? Does somebody want to *always* have a lstopo-nox
> installed? Should the default lstopo be graphical/cairo or not?
>
> Brice
Re: [hwloc-devel] hwloc + OMPI issue
Thanks! I'll add the latter to our code.

Ralph

On Oct 12, 2011, at 3:11 PM, Brice Goglin wrote:

> On 12/10/2011 22:56, Jeff Squyres wrote:
>> One of the OMPI devs found a problem when I upgraded the OMPI SVN trunk to
>> the hwloc 1.2.2ompi version last week that I think I am just now beginning
>> to understand.
>>
>> Brief reminder of our strategy:
>>
>> - on each compute node, OMPI launches a local "orted" helper daemon
>> - this orted fork/exec's the local MPI processes
>>
>> To avoid the penalty of each MPI process invoking hwloc discovery
>> more-or-less simultaneously upon startup (which, as we've seen on this list
>> before, can be painful when core counts are large), we have the orted do the
>> hwloc discovery, serialize this into XML, and send it to each of its local
>> processes. The local processes receive this XML and then load it into hwloc
>> and run from there.
>>
>> However, it looks like the resulting loaded-from-XML topology->is_thissystem
>> is set to 0, and therefore functions like hwloc_get_cpubind() actually get
>> wired up to dontget_thisproc_cpubind() (instead of the proper Linux backend,
>> for example).
>>
>> How do we avoid this? We need working hwloc functions after loading up an
>> XML topology string.
>
> export HWLOC_THISSYSTEM=1
> or
> hwloc_topology_set_flags(HWLOC_TOPOLOGY_FLAG_IS_THISSYSTEM) between
> init() and load()
>
> Brice
Re: [hwloc-devel] 1.3 -- wait!
On Oct 11, 2011, at 7:34 AM, Brice Goglin wrote:

> On 11/10/2011 15:04, Jeff Squyres wrote:
>> Looks like Ralph's size/linesize patch didn't make it to v1.3:
>>
>> Index: src/traversal.c
>> ===
>> --- src/traversal.c (revision 3883)
>> +++ src/traversal.c (working copy)
>> @@ -478,7 +478,7 @@
>> *assoc = '\0';
>> else
>> snprintf(assoc, sizeof(assoc), "%sways=%d", separator,
>> obj->attr->cache.associativity);
>> - res = hwloc_snprintf(tmp, tmplen, "%s%lu%s%sline=%u%s",
>> + res = hwloc_snprintf(tmp, tmplen, "%ssize=%lu%s%slinesize=%u%s",
>> prefix,
>> (unsigned long)
>> hwloc_memory_size_printf_value(obj->attr->cache.size, verbose),
>> hwloc_memory_size_printf_unit(obj->attr->cache.size,
>> verbose),
>>
>> Can this go in before 1.3 is released?
>
> I didn't think it was that important. I can backport it for sure.

Thanks!

> Do you
> want it in v1.2-ompi too?

Not necessary for v1.2-ompi - I already inserted it in our local copy, and I imagine we'll update to 1.3 shortly.

> Brice
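
For readers who want to see what the patch changes in the printed cache attributes, the sketch below applies the old and the new format string to sample values. format_cache_attrs() is a hypothetical stand-in for the hwloc_snprintf call, not hwloc code; the argument order after the unit (separator, line size, associativity string) is an assumption inferred from the format string, because the quoted diff is cut off at that point, and the sample values are invented.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the call touched by the patch.
 * patched == 0 uses the old format, patched != 0 the new one.
 * Returns what snprintf returns (length of the formatted text). */
static int format_cache_attrs(char *buf, size_t len, int patched,
                              const char *prefix, unsigned long size,
                              const char *unit, const char *separator,
                              unsigned linesize, const char *assoc)
{
    assert(buf != NULL && len > 0);

    if (patched)
        return snprintf(buf, len, "%ssize=%lu%s%slinesize=%u%s",
                        prefix, size, unit, separator, linesize, assoc);

    return snprintf(buf, len, "%s%lu%s%sline=%u%s",
                    prefix, size, unit, separator, linesize, assoc);
}
```

Called with an empty prefix, size 32, unit "KB", separator " ", line size 64 and associativity string " ways=8", the old format gives "32KB line=64 ways=8" and the patched one gives "size=32KB linesize=64 ways=8", so every field carries its own key after the patch.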
Re: [hwloc-devel] Something lighter-weight than XML?
Thanks!

On Sep 24, 2011, at 2:18 PM, Brice Goglin wrote:

> I fixed one parsing bug in commit 3660 on the v1.2-ompi branch. Things
> should work better now.
>
> Parsing XML distance matrices was broken when the XML file came from the
> no-libxml exporter. That's why you had problems on your dual-amd machine
> (those have distance matrices) and not on your mac (single processor, no
> distances, no bug).
>
> The v1.2 branch doesn't report parsing failure well, so it just crashed.
> Trunk exits with an error instead of crashing.
>
> Brice
Re: [hwloc-devel] Something lighter-weight than XML?
This is 1.2-ompi, running on Linux 2.6.18-274.el5 on x86_64

$ uname -a
Linux xxx 2.6.18-274.el5 #1 SMP Fri Jul 8 17:36:59 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

On Sep 24, 2011, at 12:43 PM, Brice Goglin wrote:

> What platform and distribution do you have?
>
> Brice
Re: [hwloc-devel] Something lighter-weight than XML?
Yep, it fails. Runs on my Mac, but not under Linux.

Program terminated with signal 11, Segmentation fault.
#0 0x2acdbedd in hwloc_bitmap_snprintf () from /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
(gdb) where
#0 0x2acdbedd in hwloc_bitmap_snprintf () from /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
#1 0x2acdc060 in hwloc_bitmap_asprintf () from /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
#2 0x2acd9b34 in hwloc__xml_export_object () from /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
#3 0x2acda325 in hwloc___nolibxml_prepare_export () from /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
#4 0x2acda39c in hwloc__nolibxml_prepare_export () from /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
#5 0x2acda4be in hwloc_topology_export_xmlbuffer () from /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
#6 0x004009b8 in main () at xmlbuffer.c:31

On Sep 24, 2011, at 9:45 AM, Brice Goglin wrote:

> Indeed, this object contains invalid pointers.
>
> Can you try to run tests/xmlbuffer.c from hwloc's tree? It does
> export+import+export+compare on the same machine. It would be good to
> know if it fails on one of the machines you're using here.
>
> https://svn.open-mpi.org/trac/hwloc/browser/branches/v1.2-ompi/tests/xmlbuffer.c?rev=3837=txt
>
> thanks
> Brice
Re: [hwloc-devel] Something lighter-weight than XML?
FWIW: I tried just printing out the contents of that root object immediately after importing the xml, and it clearly has a problem:

(gdb) print *obj
$2 = {type = OPAL_HWLOC122_hwloc_OBJ_SYSTEM, os_index = 0, name = 0x101 , memory = {
total_memory = 46912502995240, local_memory = 46912502995240, page_types_len = 0, page_types = 0x0}, attr = 0x2,
depth = 6900112, logical_index = 0, os_level = 6571424, next_cousin = 0x0, prev_cousin = 0x, parent = 0x0,
sibling_rank = 0, next_sibling = 0x0, prev_sibling = 0x0, arity = 145, children = 0x2b139738,
first_child = 0x2b139738, last_child = 0x0, userdata = 0x0, cpuset = 0x0, complete_cpuset = 0x0,
online_cpuset = 0x644700, allowed_cpuset = 0x691970, nodeset = 0x6919e0, complete_nodeset = 0x644c90,
allowed_nodeset = 0x644cb0, distances = 0x6948b0, distances_count = 690, infos = 0x0, infos_count = 0}

On Sep 24, 2011, at 9:02 AM, Ralph Castain wrote:

> Here's the trace:
>
> #0 0x2ae61737 in hwloc__xml_export_object (output=0x7fffd890, topology=0x695f10, obj=0x2b139b28) at topology-xml.c:1094
> #1 0x2ae61b69 in hwloc___nolibxml_prepare_export (topology=0x695f10, xmlbuffer=0x698a70 "\n topology SYSTEM \"hwloc.dtd\">\n\n os_level=\"-1424778408\" os_index=\"10922\" cpuset=\"0xf...f\" complete_cpuset=\"0xf...f\" onl"..., buflen=16384) at topology-xml.c:1193
> #2 0x2ae61be0 in hwloc__nolibxml_prepare_export (topology=0x695f10, bufferp=0x7fffd988, buflenp=0x7fffd97c) at topology-xml.c:1207
> #3 0x2ae61d02 in opal_hwloc122_hwloc_topology_export_xmlbuffer (topology=0x695f10, xmlbuffer=0x7fffd988, buflen=0x7fffd97c) at topology-xml.c:1281
> #4 0x2ae529f4 in opal_hwloc_compare (topo1=0x695f10, topo2=0x6915c0, type=22 '\026') at base/hwloc_base_dt.c:183
> #5 0x2adf348c in opal_dss_compare (value1=0x695f10, value2=0x6915c0, type=22 '\026') at dss/dss_compare.c:39
> #6 0x2ad9b5f7 in process_orted_launch_report (fd=-1, event=1, data=0x6444d0) at base/plm_base_launch_support.c:564
> #7 0x2ae3881f in event_process_active_single_queue (base=0x60dd60, activeq=0x6111e0) at event.c:1329
> #8 0x2ae38c71 in event_process_active (base=0x60dd60) at event.c:1396
> #9 0x2ae3902b in opal_libevent2012_event_base_loop (base=0x60dd60, flags=1) at event.c:1598
> #10 0x2adf080d in opal_progress () at runtime/opal_progress.c:189
> #11 0x2ad9bbfa in orte_plm_base_daemon_callback (num_daemons=2) at base/plm_base_launch_support.c:666
> #12 0x2ada49e1 in plm_slurm_launch_job (jdata=0x67a500) at plm_slurm_module.c:404
> #13 0x00403822 in orterun (argc=4, argv=0x7fffe1d8) at orterun.c:817
> #14 0x00402aa3 in main (argc=4, argv=0x7fffe1d8) at main.c:13
>
> And the error report
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x2ae61737 in hwloc__xml_export_object (output=0x7fffd890, topology=0x695f10, obj=0x2b139b28) at topology-xml.c:1094
> 1094 sprintf(tmp, "%llu", (unsigned long long) obj->memory.page_types[i].count);
> (gdb) print obj
> $1 = (opal_hwloc122_hwloc_obj_t) 0x2b139b28
> (gdb) print *obj
> $2 = {type = 2870188824, os_index = 10922, name = 0x2b139b18 "\b\233\023\253\252*", memory = {total_memory = 6579376,
> local_memory = 6579376, page_types_len = 2870188856, page_types = 0x2b139b38}, attr = 0x2b139b48,
> depth = 2870188872, logical_index = 10922, os_level = -1424778408, next_cousin = 0x2b139b58,
> prev_cousin = 0x2b139b68, parent = 0x2b139b68, sibling_rank = 2870188920, next_sibling = 0x2b139b78,
> prev_sibling = 0x2b139b88, arity = 2870188936, children = 0x2b139b98, first_child = 0x2b139b98,
> last_child = 0x2b139ba8, userdata = 0x2b139ba8, cpuset = 0x2b139bb8, complete_cpuset = 0x2b139bb8,
> online_cpuset = 0x2b139bc8, allowed_cpuset = 0x2b139bc8, nodeset = 0x2b139bd8,
> complete_nodeset = 0x2b139bd8, allowed_nodeset = 0x2b139be8, distances = 0x2b139be8,
> distances_count = 2870189048, infos = 0x2b139bf8, infos_count = 2870189064}
> (gdb) print obj->memory
> $3 = {total_memory = 6579376, local_memory = 6579376, page_types_len = 2870188856, page_types = 0x2b139b38}
> (gdb) print obj->memory.page_types
> $4 = (struct opal_hwloc122_hwloc_obj_memory_page_type_s *) 0x2b139b38
> (gdb) print i
> $5 = 1612
> (gdb) print obj->memory.page_types[1600]
> $6 = {size = 0, count = 0}
> (gdb) print obj->memory.page_types[1612]
> Ca
Re: [hwloc-devel] Something lighter-weight than XML?
Here's the trace:

#0 0x2ae61737 in hwloc__xml_export_object (output=0x7fffd890, topology=0x695f10, obj=0x2b139b28) at topology-xml.c:1094
#1 0x2ae61b69 in hwloc___nolibxml_prepare_export (topology=0x695f10, xmlbuffer=0x698a70 "\n topology SYSTEM \"hwloc.dtd\">\n\n os_level=\"-1424778408\" os_index=\"10922\" cpuset=\"0xf...f\" complete_cpuset=\"0xf...f\" onl"..., buflen=16384) at topology-xml.c:1193
#2 0x2ae61be0 in hwloc__nolibxml_prepare_export (topology=0x695f10, bufferp=0x7fffd988, buflenp=0x7fffd97c) at topology-xml.c:1207
#3 0x2ae61d02 in opal_hwloc122_hwloc_topology_export_xmlbuffer (topology=0x695f10, xmlbuffer=0x7fffd988, buflen=0x7fffd97c) at topology-xml.c:1281
#4 0x2ae529f4 in opal_hwloc_compare (topo1=0x695f10, topo2=0x6915c0, type=22 '\026') at base/hwloc_base_dt.c:183
#5 0x2adf348c in opal_dss_compare (value1=0x695f10, value2=0x6915c0, type=22 '\026') at dss/dss_compare.c:39
#6 0x2ad9b5f7 in process_orted_launch_report (fd=-1, event=1, data=0x6444d0) at base/plm_base_launch_support.c:564
#7 0x2ae3881f in event_process_active_single_queue (base=0x60dd60, activeq=0x6111e0) at event.c:1329
#8 0x2ae38c71 in event_process_active (base=0x60dd60) at event.c:1396
#9 0x2ae3902b in opal_libevent2012_event_base_loop (base=0x60dd60, flags=1) at event.c:1598
#10 0x2adf080d in opal_progress () at runtime/opal_progress.c:189
#11 0x2ad9bbfa in orte_plm_base_daemon_callback (num_daemons=2) at base/plm_base_launch_support.c:666
#12 0x2ada49e1 in plm_slurm_launch_job (jdata=0x67a500) at plm_slurm_module.c:404
#13 0x00403822 in orterun (argc=4, argv=0x7fffe1d8) at orterun.c:817
#14 0x00402aa3 in main (argc=4, argv=0x7fffe1d8) at main.c:13

And the error report

Program received signal SIGSEGV, Segmentation fault.
0x2ae61737 in hwloc__xml_export_object (output=0x7fffd890, topology=0x695f10, obj=0x2b139b28) at topology-xml.c:1094
1094 sprintf(tmp, "%llu", (unsigned long long) obj->memory.page_types[i].count);
(gdb) print obj
$1 = (opal_hwloc122_hwloc_obj_t) 0x2b139b28
(gdb) print *obj
$2 = {type = 2870188824, os_index = 10922, name = 0x2b139b18 "\b\233\023\253\252*", memory = {total_memory = 6579376,
local_memory = 6579376, page_types_len = 2870188856, page_types = 0x2b139b38}, attr = 0x2b139b48,
depth = 2870188872, logical_index = 10922, os_level = -1424778408, next_cousin = 0x2b139b58,
prev_cousin = 0x2b139b68, parent = 0x2b139b68, sibling_rank = 2870188920, next_sibling = 0x2b139b78,
prev_sibling = 0x2b139b88, arity = 2870188936, children = 0x2b139b98, first_child = 0x2b139b98,
last_child = 0x2b139ba8, userdata = 0x2b139ba8, cpuset = 0x2b139bb8, complete_cpuset = 0x2b139bb8,
online_cpuset = 0x2b139bc8, allowed_cpuset = 0x2b139bc8, nodeset = 0x2b139bd8,
complete_nodeset = 0x2b139bd8, allowed_nodeset = 0x2b139be8, distances = 0x2b139be8,
distances_count = 2870189048, infos = 0x2b139bf8, infos_count = 2870189064}
(gdb) print obj->memory
$3 = {total_memory = 6579376, local_memory = 6579376, page_types_len = 2870188856, page_types = 0x2b139b38}
(gdb) print obj->memory.page_types
$4 = (struct opal_hwloc122_hwloc_obj_memory_page_type_s *) 0x2b139b38
(gdb) print i
$5 = 1612
(gdb) print obj->memory.page_types[1600]
$6 = {size = 0, count = 0}
(gdb) print obj->memory.page_types[1612]
Cannot access memory at address 0x2b13fff8
(gdb) print obj->memory.page_types[1611]
$7 = {size = 0, count = 0}
(gdb)

The whole obj looks like trash to me. I looked a little more - the object referenced is the root object:

1193 hwloc__xml_export_object (, topology, hwloc_get_root_obj(topology));

I'm continuing to look in case I'm doing something stupid, but the code is pretty linear here - unpack, import, export for compare.

On Sep 24, 2011, at 8:59 AM, Jeff Squyres wrote:

> Here's some feedback from Ralph -- any idea what's going wrong here?
>
> -
>
> 1. I export a topology into xml using
>
> hwloc_topology_export_xmlbuffer(t, , );
>
> I then pack and send the string.
>
> 2. I unpack the string on the other end and import it into a topology
> hwloc_topology_init();
> if (0 != (rc = hwloc_topology_set_xmlbuffer(t, xmlbuffer,
> strlen(xmlbuffer)))) {
> hwloc_topology_destroy(t);
> goto cleanup;
> }
> hwloc_topology_load(t);
>
> 3. I then need to compare two topologies, so I export the topology I received
> into another xml string
> hwloc_topology_export_xmlbuffer(t1, , );
>
> It is this export that fails, which implies to me that somehow the import
> didn't work right. Note that this code worked fine with libxml2, so this is a
> regression.
>
> On Sep 22, 2011, at 9:39 AM, Jeff Squyres wrote:
>
>> Yes, I can get some testing of the ompi branch pretty quickly. I can bring
>> in a new copy of this later today and see what we can see.
>>
>> Many thanks!
>>
>> On Sep 19, 2011, at 9:05 AM, Brice Goglin wrote:
>>
>>> I pushed the new minimalistic XML import/export implementation without
>>> libxml2 to the nolibxml branch. If libxml2 is available, it's still used
>>> by default. --disable-libxml2 or some env variables can be used to
>>> force the minimalistic implementation if needed. The minimalistic implem
>>> is only guaranteed to import XML files that were generated by hwloc
>>> (even if libxml was enabled there).
>>>
>>> I also backported most of this to the new v1.2-ompi branch (required to
>>> backport some other XML cleanups from trunk). This branch will now serve
>>> as a base for Open MPI's embedded hwloc. The idea is to have a complete
>>> v1.2 + nolibxml somewhere so that we can at least run make check (Open
>>> MPI does not embed enough to run hwloc's make check).
>>>
>>> How do we proceed now? Can we have the OMPI guys test the new code soon?
>>> Should I wait for their feedback before merging the nolibxml branch into
>>> the trunk? I'd like to merge this in v1.3 too (and basically release rc2
>>> as the actual first feature-complete RC), so getting feedback early
>>> might be appreciated.
>>>
>>> Brice
Re: [hwloc-devel] Some practical hwloc API feedback
On Sep 24, 2011, at 5:52 AM, Jeff Squyres wrote:

> On Sep 24, 2011, at 7:46 AM, Jeff Squyres wrote:
>
>>> The funky thing here is that the parent/child links between the first
>>> socket and its core go across level 2 because nothing matches there. In
>>> the first socket, you have Socket(depth1)->Core(depth3) while in the
>>> second socket you have Socket(depth1)->Cache(depth2)->Core(depth3)
>
> Oh crap; this scenario is explicitly listed in the figure on page 21 of the
> letter PDF. :-\
>
> So... disregard my comments here.

You might want to point that section out in hwloc.h as this is somewhat non-intuitive.
Re: [hwloc-devel] Some practical hwloc API feedback
On Sep 22, 2011, at 3:05 PM, Brice Goglin wrote:

> On 22/09/2011 22:42, Ralph Castain wrote:
>> I guess I didn't get that from your documentation. Since caches sit
>> between socket and core, they appear to affect the depth of the core
>> in a given socket. Thus, if there are different numbers of caches in
>> the different sockets on a node, then the core/pu level would change
>> across the sockets.
>
> No, the level always contains all elements of the same type (+depth for
> caches), even if they are not at the same "distance" to the root (not
> "depth").
>
> Let's say you have two single-core sockets. One with no cache. One with
> a L1.
> What happens is:
> * first level/depth is socket, contains two sockets, covers all cores.
> * level 2 is L2, single element, *does not cover all cores*
> * level 3 is core, two elements.
>
> The funky thing here is that the parent/child links between the first
> socket and its core go across level 2 because nothing matches there. In
> the first socket, you have Socket(depth1)->Core(depth3) while in the
> second socket you have Socket(depth1)->Cache(depth2)->Core(depth3)
>
> So what we call "depth" in hwloc, is not the number of parent/child
> links between you and the root, it's really the number of levels between
> you and the root, even if you don't have any parent in some of these levels.
>
> Looks like we need to clarify this :)

Indeed - having the above example in hwloc.h would help. I think the key thing here is that the depth for a given type is being set across the entire node, and not by the local structure - i.e., the depth of the core in your example is determined by looking across the node at the max depth of any core in its local structure.

Those of us coming from the chip world will find that confusing, as we look at things one socket at a time, but we can adapt. All that said, if I put my dictionary away and can get the code to work, hopefully we won't have to parse thru it again. :-)

Thanks!
Re: [hwloc-devel] Patch for hwloc_obj_attr_snprintf
Resending - had to join list!

Oh - I should have noted that I took the labels directly from the xml output. So cachesize and cacheline are what you have in the xml output. I don't care if they match, though - just pointing it out.

On Sep 22, 2011, at 1:59 PM, Brice Goglin wrote:

> Hello Ralph,
>
> Indeed, adding something before the cache size might be good.
>
> But if I was picky, I would say "size=32kB linesize=64". The word "Cache" is
> already written above (in the object type), why would we duplicate it in
> "Cachesize" and "Cacheline"?
>
> Right now, lstopo shows:
> L3Cache L#3 (4096KB line=64)
> With your patch, it would say:
> L3Cache L#3 (Cachesize=4096KB Cacheline=64)
> With my variant, it would say:
> L3Cache L#3 (size=4096KB linesize=64)
>
> Brice
>
> On 22/09/2011 21:27, Jeff Squyres wrote:
>>
>> Ralph noticed the following when working on integrating hwloc deeply into
>> OMPI, and suggests the attached patch. Does it look good to you guys?
>>
>> -
>>
>> Something isn't right with hwloc_obj_attr_snprintf() when the object is a
>> cache.
I get this when printing the topology of my Mac: >> >> Detected Resources: Type: Machine Number of child objects: 1 >> Name=NULL >> total=3145728KB >> Backend=Darwin >> OSName=Darwin >> OSRelease=10.8.0 >> OSVersion="Darwin Kernel Version 10.8.0: Tue Jun 7 16:33:36 >> PDT 2011; root:xnu-1504.15.3~1/RELEASE_I386" >> Architecture=i386 >> Cpuset: 0x0003 >> Online: 0x0003 >> Allowed: 0x0003 >> Type: NUMANode Number of child objects: 1 >> Name=NULL >> local=3145728KB >> total=3145728KB >> Cpuset: 0x0003 >> Online: 0x0003 >> Allowed: 0x0003 >> Type: Socket Number of child objects: 1 >> Name=NULL >> >> Cpuset: 0x0003 >> Online: 0x0003 >> Allowed: 0x0003 >> Type: L2Cache Number of child objects: 2 >> Name=NULL >> 4096KB >> line=64 >> Cpuset: 0x0003 >> Online: 0x0003 >> Allowed: 0x0003 >> Type: L1Cache Number of child objects: 1 >> Name=NULL >> 32KB >> line=64 >> Cpuset: 0x0001 >> Online: 0x0001 >> Allowed: 0x0001 >> Type: Core Number of child >> objects: 1 >> Name=NULL >> >> Cpuset: 0x0001 >> Online: 0x0001 >> Allowed: 0x0001 >> Type: PU Number of >> child objects: 0 >> Name=NULL >> >> Cpuset: >> 0x0001 >> Online: >> 0x0001 >> Allowed: >> 0x0001 >> Type: L1Cache Number of child objects: 1 >> Name=NULL >> 32KB >> line=64 >> Cpuset: 0x0002 >> Online: 0x0002 >> Allowed: 0x0002 >> Type: Core Number of child >> objects: 1 >> Name=NULL >> >> Cpuset: 0x0002 >> Online: 0x0002 >> Allowed: 0x0002 >>