Try this patch. It applies to hwloc v1.9 through v1.11, so it should be OK against OMPI's tree. Your bridge 22:00.0 says it contains the master bus 00. That causes a cycle in hwloc's insert algorithm, which is what the assertion catches. The patch just ignores this invalid bridge entirely.
Brice

On 10/09/2015 21:23, George Bosilca wrote:
> It used to work. Now I don't know exactly when I last updated the
> trunk version on the cluster, but not more than 10 days ago.
>
> lstopo complains with the same assert. Interestingly enough, the same
> binary succeeds on the other nodes of the same cluster ...
>
> George.
>
>
> On Thu, Sep 10, 2015 at 3:20 PM, Brice Goglin <brice.gog...@inria.fr> wrote:
>
>> Did it work on the same machine before? Or did OMPI enable hwloc's
>> PCI discovery recently?
>>
>> Does lstopo complain the same?
>>
>> Brice
>>
>>
>> On 10/09/2015 21:10, George Bosilca wrote:
>>> With the current trunk version I keep getting an assert deep down
>>> in orted.
>>>
>>> orted: ../../../../../../../ompi/opal/mca/hwloc/hwloc1110/hwloc/src/pci-common.c:177:
>>> hwloc_pci_try_insert_siblings_below_new_bridge: Assertion `comp
>>> != HWLOC_PCI_BUSID_SUPERSET' failed.
>>>
>>> The stack looks like this:
>>>
>>> [dancer18:21100] *** Process received signal ***
>>> [dancer18:21100] Signal: Aborted (6)
>>> [dancer18:21100] Signal code: (-6)
>>> [dancer18:21100] [ 0] /lib64/libpthread.so.0(+0xf710)[0x7fc22ce61710]
>>> [dancer18:21100] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7fc22caf0625]
>>> [dancer18:21100] [ 2] /lib64/libc.so.6(abort+0x175)[0x7fc22caf1e05]
>>> [dancer18:21100] [ 3] /lib64/libc.so.6(+0x2b74e)[0x7fc22cae974e]
>>> [dancer18:21100] [ 4] /lib64/libc.so.6(__assert_perror_fail+0x0)[0x7fc22cae9810]
>>> [dancer18:21100] [ 5] /home/bosilca/opt/trunk/debug/lib/libopen-pal.so.0(+0xb0a62)[0x7fc22ddc6a62]
>>> [dancer18:21100] [ 6] /home/bosilca/opt/trunk/debug/lib/libopen-pal.so.0(+0xb0b60)[0x7fc22ddc6b60]
>>> [dancer18:21100] [ 7] /home/bosilca/opt/trunk/debug/lib/libopen-pal.so.0(opal_hwloc1110_hwloc_insert_pci_device_list+0x8f)[0x7fc22ddc724c]
>>> [dancer18:21100] [ 8] /home/bosilca/opt/trunk/debug/lib/libopen-pal.so.0(+0xbf2d6)[0x7fc22ddd52d6]
>>> [dancer18:21100] [ 9] /home/bosilca/opt/trunk/debug/lib/libopen-pal.so.0(+0xd22f7)[0x7fc22dde82f7]
>>> [dancer18:21100] [10] /home/bosilca/opt/trunk/debug/lib/libopen-pal.so.0(opal_hwloc1110_hwloc_topology_load+0x1a3)[0x7fc22dde8ee1]
>>> [dancer18:21100] [11] /home/bosilca/opt/trunk/debug/lib/libopen-pal.so.0(opal_hwloc_base_get_topology+0x80)[0x7fc22ddb6ece]
>>> [dancer18:21100] [12] /home/bosilca/opt/trunk/debug/lib/libopen-rte.so.0(orte_ess_base_orted_setup+0x127)[0x7fc22e0b3523]
>>> [dancer18:21100] [13] /home/bosilca/opt/trunk/debug/lib/openmpi/mca_ess_env.so(+0xe45)[0x7fc22c6bbe45]
>>> [dancer18:21100] [14] /home/bosilca/opt/trunk/debug/lib/libopen-rte.so.0(orte_init+0x2c6)[0x7fc22e06b55a]
>>> [dancer18:21100] [15] /home/bosilca/opt/trunk/debug/lib/libopen-rte.so.0(orte_daemon+0x5c1)[0x7fc22e09a895]
>>> [dancer18:21100] [16] /home/bosilca/opt/trunk/debug/bin/orted[0x40082a]
>>> [dancer18:21100] [17] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fc22cadcd5d]
>>> [dancer18:21100] [18] /home/bosilca/opt/trunk/debug/bin/orted[0x4006e9]
>>>
>>> Any ideas?
>>>
>>> George.
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2015/09/17993.php

commit c981b0f2ba7983ac2d8ca27a5d59c82e536ff376
Author: Brice Goglin <brice.gog...@inria.fr>
Date:   Thu Sep 10 23:51:11 2015 +0200

    pci: workaround buggy bridge secondary-subordinate buses

    George Bosilca reported a failed assertion
      http://www.open-mpi.org/community/lists/devel/2015/09/17993.php
    on a machine with a
      Pericom Semiconductor PCI Express to PCI-XPI7C9X130 PCI-X Bridge
    whose bus ID is 22:00.0 while primary/secondary/subordinate buses
    are 0 according to the config space.

    Primary bus bugs are not uncommon, we can work around them by
    overwriting with what's in the bus ID.
    Secondary-subordinate bus bugs cannot be fixed, and they make
    cycles in the insert algorithm.

    Add some basic checks for all these bus numbers and ignore the bridge
    entirely if failed.

    Ideally we would also check that [secondary-subordinate] is included
    in parent [secondary+1:subordinate].

diff --git a/include/hwloc/plugins.h b/include/hwloc/plugins.h
index b5294b5..8ef0cd3 100644
--- a/include/hwloc/plugins.h
+++ b/include/hwloc/plugins.h
@@ -426,6 +426,8 @@ HWLOC_DECLSPEC int hwloc_pci_find_linkspeed(const unsigned char *config, unsigne
 /** \brief Modify the PCI device object into a bridge and fill its attribute if a bridge is found in the PCI config space.
  *
  * This function requires 64 bytes of common configuration header at the beginning of config.
+ *
+ * Returns -1 and destroys /p obj if bridge fields are invalid.
  */
 HWLOC_DECLSPEC int hwloc_pci_prepare_bridge(hwloc_obj_t obj, const unsigned char *config);
 
diff --git a/src/pci-common.c b/src/pci-common.c
index 2154276..8756e81 100644
--- a/src/pci-common.c
+++ b/src/pci-common.c
@@ -6,6 +6,7 @@
 #include <private/autogen/config.h>
 #include <hwloc.h>
 #include <hwloc/plugins.h>
+#include <private/private.h>
 #include <private/debug.h>
 
 #ifdef HWLOC_DEBUG
@@ -510,5 +511,18 @@ hwloc_pci_prepare_bridge(hwloc_obj_t obj,
   battr->downstream.pci.secondary_bus = config[HWLOC_PCI_SECONDARY_BUS];
   battr->downstream.pci.subordinate_bus = config[HWLOC_PCI_SUBORDINATE_BUS];
 
+  if (battr->downstream.pci.secondary_bus <= pattr->bus
+      || battr->downstream.pci.subordinate_bus <= pattr->bus
+      || battr->downstream.pci.secondary_bus > battr->downstream.pci.subordinate_bus) {
+    hwloc_debug("  %04x:%02x:%02x.%01x bridge has invalid secondary-subordinate buses [%02x-%02x]\n",
+                pattr->domain, pattr->bus, pattr->dev, pattr->func,
+                battr->downstream.pci.secondary_bus, battr->downstream.pci.subordinate_bus);
+    hwloc_free_unlinked_object(obj);
+    return -1;
+  }
+  /* FIXME: Ideally we would also check that [secondary-subordinate] is included
+   * in the parent bridge [secondary+1:subordinate]
+   */
+
   return 0;
 }
diff --git a/src/topology-linux.c b/src/topology-linux.c
index 7f8430b..16bb04e 100644
--- a/src/topology-linux.c
+++ b/src/topology-linux.c
@@ -5147,7 +5147,8 @@ hwloc_look_linuxfs_pci(struct hwloc_backend *backend)
       fclose(file);
 
       /* is this a bridge? */
-      hwloc_pci_prepare_bridge(obj, config_space_cache);
+      if (hwloc_pci_prepare_bridge(obj, config_space_cache) < 0)
+        continue;
 
       /* get the revision */
       attr->revision = config_space_cache[HWLOC_PCI_REVISION_ID];
diff --git a/src/topology-pci.c b/src/topology-pci.c
index 0f20e42..c6cd0c1 100644
--- a/src/topology-pci.c
+++ b/src/topology-pci.c
@@ -206,7 +206,8 @@ hwloc_look_pci(struct hwloc_backend *backend)
     if (offset > 0 && offset + 20 /* size of PCI express block up to link status */ <= CONFIG_SPACE_CACHESIZE)
       hwloc_pci_find_linkspeed(config_space_cache, offset, &obj->attr->pcidev.linkspeed);
 
-    hwloc_pci_prepare_bridge(obj, config_space_cache);
+    if (hwloc_pci_prepare_bridge(obj, config_space_cache) < 0)
+      continue;
 
     if (obj->type == HWLOC_OBJ_PCI_DEVICE) {
       memcpy(&tmp16, &config_space_cache[PCI_SUBSYSTEM_VENDOR_ID], sizeof(tmp16));