RE: ib_types.h moving [was: Re: [ofa-general] [RFC] 3/5: IB ACM: libibacm]
Now I likely would agree with Ira that moving ib_types.h to libibumad is the least painful option. Do we have any better ideas?

Just a random thought, but what about adding, longer term, a second set of interfaces to libibumad? Basically, something more like the kernel ib_sa. I don't know that we need a new library just to expand the interface.

For ib_types.h, I'd rather see it broken up into separate header files, at least some of which get distributed with libibumad.

- Sean
___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] RE: Possible process deadlock in RMPP flow
ibnetdiscover D 80149b8d 0 26968 26544 (L-TLB)
 8102c900bd88 0046 81037e8e 81037e8e02e8
 8102c900bd78 000a 8102c5b50820 81038a929820
 011837bf6105 0ede 8102c5b50a08 0001
Call Trace:
 [80064207] wait_for_completion+0x79/0xa2
 [8008b4cc] default_wake_function+0x0/0xe
 [882271d9] :ib_mad:ib_cancel_rmpp_recvs+0x87/0xde
 [88224485] :ib_mad:ib_unregister_mad_agent+0x30d/0x424
 [883983e9] :ib_umad:ib_umad_close+0x9d/0xd6
 [80012e22] __fput+0xae/0x198
 [80023de6] filp_close+0x5c/0x64
 [800393df] put_files_struct+0x63/0xae
 [80015b26] do_exit+0x31c/0x911
 [8004971a] cpuset_exit+0x0/0x6c
 [8005e116] system_call+0x7e/0x83

From the dump it seems that the process is waiting on the call to flush_workqueue() in ib_cancel_rmpp_recvs(). The package they use is OFED 1.4.2.

Roland just submitted a patch in this area yesterday. I don't know if the patch would fix their issue, but it may be worth trying. What kernel does 1.4.2 map to?

What RMPP messages does ibnetdiscover use? If the program is completing successfully, there may be a different race with the rmpp cleanup. I'll see if anything else stands out in that area.

- Sean
[ofa-general] RE: [PATCH/RFC] IB/mad: Fix lock-lock-timer deadlock in RMPP code
OK so how about something like this? Just hold the lock to mark the items on the list as being canceled, and then actually cancel the delayed work without the lock. I think this doesn't leave any races or holes where the delayed work can mess up the cancel. This looks good to me. Thanks for looking at this. Reviewed-by: Sean Hefty sean.he...@intel.com
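The two-phase cancel described above can be sketched in user space. This is only an analogy, not the kernel patch under review: `struct recv_item`, `cancel_work_blocking()`, and `cancel_all()` are hypothetical names, and a pthread mutex stands in for the kernel lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical list entry; the real RMPP receive structures differ. */
struct recv_item {
	int canceled;		/* set while holding list_lock */
	int work_stopped;	/* set by the (possibly blocking) cancel */
	struct recv_item *next;
};

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Stand-in for a cancel that may block or take other locks.
 * It must not be called with list_lock held.
 */
static void cancel_work_blocking(struct recv_item *item)
{
	item->work_stopped = 1;
}

/* Returns the number of items whose work was canceled. */
static int cancel_all(struct recv_item *head)
{
	struct recv_item *it;
	int n = 0;

	/* Phase 1: only mark the items while the lock is held. */
	pthread_mutex_lock(&list_lock);
	for (it = head; it != NULL; it = it->next)
		it->canceled = 1;
	pthread_mutex_unlock(&list_lock);

	/*
	 * Phase 2: do the real cancel without the lock. The canceled flag
	 * is what tells any other path to leave the item alone.
	 */
	for (it = head; it != NULL; it = it->next) {
		assert(it->canceled);
		cancel_work_blocking(it);
		n++;
	}
	return n;
}
```

The point of the split is that the blocking call never runs under the lock, so a timer or work handler that needs the same lock cannot deadlock against the cancel.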
RE: [ofw] Re: [ofa-general] [RFC] 3/5: IB ACM: libibacm
Although not a fit IMO, the pragmatic solution is to move ib_types.h into libibumad. I think it is better there than OpenSM which was never quite right either. That can at least start to eliminate the duplications in this area.

ib_types.h includes complib header files...
RE: [ofw] Re: [ofa-general] [RFC] 3/5: IB ACM: libibacm
Rough hack. Does windows have stdint.h, byteswap.h, and endian.h? If not, adding the headers with the needed definitions is trivial.

+/* 16bit */
+#if __BYTE_ORDER == __LITTLE_ENDIAN
+#define CL_NTOH16( x ) (uint16_t)( \
+	(((uint16_t)(x) & 0x00FF) << 8) | \
+	(((uint16_t)(x) & 0xFF00) >> 8) )
+#else
+#define CL_NTOH16( x ) (x)
+#endif
+#define CL_HTON16 CL_NTOH16
+
+/* 32bit */
+#if __BYTE_ORDER == __LITTLE_ENDIAN
+#define CL_NTOH32( x ) (uint32_t)( \
+	(((uint32_t)(x) & 0x000000FF) << 24) | \
+	(((uint32_t)(x) & 0x0000FF00) << 8) | \
+	(((uint32_t)(x) & 0x00FF0000) >> 8) | \
+	(((uint32_t)(x) & 0xFF000000) >> 24) )
+#else
+#define CL_NTOH32( x ) (x)
+#endif
+#define CL_HTON32 CL_NTOH32
+
+/* 64bit */
+#if __BYTE_ORDER == __LITTLE_ENDIAN
+#define CL_NTOH64( x ) (uint64_t)( \
+	(((uint64_t)(x) & 0x00000000000000FFULL) << 56) | \
+	(((uint64_t)(x) & 0x000000000000FF00ULL) << 40) | \
+	(((uint64_t)(x) & 0x0000000000FF0000ULL) << 24) | \
+	(((uint64_t)(x) & 0x00000000FF000000ULL) << 8 ) | \
+	(((uint64_t)(x) & 0x000000FF00000000ULL) >> 8 ) | \
+	(((uint64_t)(x) & 0x0000FF0000000000ULL) >> 24) | \
+	(((uint64_t)(x) & 0x00FF000000000000ULL) >> 40) | \
+	(((uint64_t)(x) & 0xFF00000000000000ULL) >> 56) )
+#else
+#define CL_NTOH64( x ) (x)
+#endif
+#define CL_HTON64 CL_NTOH64
+
+#if __BYTE_ORDER == __LITTLE_ENDIAN
+#define cl_ntoh16(x) bswap_16(x)
+#define cl_hton16(x) bswap_16(x)
+#define cl_ntoh32(x) bswap_32(x)
+#define cl_hton32(x) bswap_32(x)
+#define cl_ntoh64(x) (uint64_t)bswap_64(x)
+#define cl_hton64(x) (uint64_t)bswap_64(x)
+#else /* Big Endian */
+#define cl_ntoh16(x) (x)
+#define cl_hton16(x) (x)
+#define cl_ntoh32(x) (x)
+#define cl_hton32(x) (x)
+#define cl_ntoh64(x) (x)
+#define cl_hton64(x) (x)
+#endif

Why the different defines for cl_ntoh and CL_NTOH?
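One plausible reason for keeping both forms, offered here as an assumption rather than something stated in the thread: the mask-and-shift macros are constant expressions, so they can appear in static initializers, while the lowercase forms map to byteswap.h helpers meant for run-time values. A minimal sketch of that distinction follows; `SWAP32_CONST`, `swap32()`, and `example_swapped` are hypothetical names that only mirror the layout of CL_NTOH32.

```c
#include <assert.h>
#include <stdint.h>

/* Constant-expression byte swap, laid out like the CL_NTOH32 macro. */
#define SWAP32_CONST(x) ((uint32_t)( \
	(((uint32_t)(x) & 0x000000FFu) << 24) | \
	(((uint32_t)(x) & 0x0000FF00u) << 8)  | \
	(((uint32_t)(x) & 0x00FF0000u) >> 8)  | \
	(((uint32_t)(x) & 0xFF000000u) >> 24)))

/* Legal at file scope because the macro expands to a constant expression. */
static const uint32_t example_swapped = SWAP32_CONST(0x80010000);

/* Run-time swap for variables; stands in for bswap_32(). */
static inline uint32_t swap32(uint32_t x)
{
	return SWAP32_CONST(x);
}

/* Swapping twice must give back the original value. */
static inline void swap32_self_check(void)
{
	assert(swap32(swap32(0x12345678u)) == 0x12345678u);
}
```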
[ofa-general] [RFC] 0/5: assistant to the IB communication manager
The following collection of pseudo-patches implement a new user space package (IB ACM) designed to assist with connection establishment. A description is given below, copied from the acm_notes.txt file included with the package. The complete package is available on git.openfabrics.org/~shefty/ibacm.git and also in svn under branches/winverbs/ulp/ibacm. This is a request for both general and detailed feedback. The IB ACM has had very limited testing. Testing has been restricted to using the provided test utility, and invoking it from the windows version of the librdmacm on a single, small cluster. Calling it from the linux librdmacm is more involved and still under development. Signed-off-by: Sean Hefty sean.he...@intel.com --- Assistant for InfiniBand Communication Management (IB ACM) Note: The IB ACM should be considered experimental. Overview The IB ACM package implements and provides a framework for experimental name, address, and route resolution services over InfiniBand. It is intended to address connection setup scalability issues running MPI applications on large clusters. The IB ACM provides information needed to establish a connection, but does not implement the CM protocol. Long term, the IB ACM may support multiple resolution mechanisms. The IB ACM is focused on being scalable and efficient. The current implementation limits network traffic, SA interactions, and centralized services. As a trade-off, it is not expected to support all cluster routing configurations. However, it is anticipated that additional functionality, such as path record caching, can be incorporated into the IB ACM to support a wider range of configurations. The IB ACM package is comprised of three components: the ib_acm service, a libibacm library, and a test/configuration utility - ib_acme. All are userspace components and are available for Linux and Windows. Additional details are given below. Quick Start Guide - 1. Prerequisites: libibverbs and libibumad must be installed. 
The IB stack should be running with IPoIB configured.
2. Install the IB ACM package.
   This installs libibacm, ib_acm, and ib_acme.
3. Run ib_acme -A -O
   This will generate the IB ACM address and options configuration files
   (acm_addr.cfg and acm_opts.cfg).
4. Run ib_acm and leave it running.
5. Optionally, run ib_acme -s source_ip -d dest_ip -v
   This will verify that the ib_acm service is running. It also verifies
   that the path is usable on the given cluster.
6. Install librdmacm.
7. Define the following environment variable: RDMA_CM_USE_IB_ACM=1

The librdmacm will automatically use the ib_acm service. On failures, the librdmacm will fall back to normal resolution.

Details
---

libibacm:
The libibacm is an end-user library with simple interfaces for communicating with the ib_acm service. The libibacm implements the ib_acm client protocol. Although the interfaces to the libibacm are considered experimental, it's expected that existing calls will be supported going forward.

For simplicity, all calls operate synchronously and are serialized. Possible future changes to the libibacm would be to process calls in parallel and add asynchronous interfaces.

ib_acme:
The ib_acme program serves a dual role. It acts as a utility to test ib_acm operation and help verify if the ib_acm is usable for a given cluster configuration. Additionally, it automatically generates ib_acm configuration files to assist with or eliminate manual setup.

acm configuration files:
The ib_acm service relies on two configuration files.

The acm_addr.cfg file contains name and address mappings for each IB device, port, pkey endpoint. Although the names in the acm_addr.cfg file can be anything, ib_acme maps the host name and IP addresses to the IB endpoints.

The acm_opts.cfg file provides a set of configurable options for the ib_acm service, such as timeout, number of retries, logging level, etc. ib_acme generates the acm_opts.cfg file using static information.
A future enhancement would adjust options based on the current system and cluster size. ib_acm: The ib_acm service is responsible for resolving names and addresses to InfiniBand path information and caching such data. It is currently implemented as an executable application, but is a conceptual service or daemon that should execute with administrative privileges. The ib_acm implements a client interface over TCP sockets, which is abstracted by the libibacm library. One or more back-end protocols are used by the ib_acm service to satisfy user requests. Although the ib_acm supports standard SA path record queries on the back-end, it provides an experimental resolution protocol in hope of achieving greater scalability. Conceptually, the ib_acm service implements an ARP like protocol and uses IB multicast records to construct path record data. It makes the assumption that a unicast path between two endpoints is realizable if those endpoints can communicate
[ofa-general] [RFC] 1/5: ib_acm: linux abstractions
The following abstractions are defined to support the IB ACM running on Linux.

Signed-off-by: Sean Hefty sean.he...@intel.com
---
/*
 * Copyright (c) 2009 Intel Corporation. All rights reserved.
 *
 * This software is available to you under the OpenFabrics.org BSD license
 * below:
 *
 * Redistribution and use in source and binary forms, with or
 * without modification, are permitted provided that the following
 * conditions are met:
 *
 * - Redistributions of source code must retain the above
 *   copyright notice, this list of conditions and the following
 *   disclaimer.
 *
 * - Redistributions in binary form must reproduce the above
 *   copyright notice, this list of conditions and the following
 *   disclaimer in the documentation and/or other materials
 *   provided with the distribution.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
 * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
 * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
 * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
 * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
 * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE.
 */

#if !defined(OSD_H)
#define OSD_H

#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <stddef.h>	/* offsetof, used by container_of */
#include <stdint.h>	/* uint64_t */
#include <unistd.h>
#include <errno.h>
#include <byteswap.h>
#include <pthread.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <malloc.h>
#include <arpa/inet.h>
#include <sys/time.h>
#include <netinet/in.h>

#define LIB_DESTRUCTOR __attribute__((destructor))
#define CDECL_FUNC

#define container_of(ptr, type, field) \
	((type *) ((void *) (ptr) - offsetof(type, field)))

/* Arguments are parenthesized; they are still evaluated twice. */
#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

#if __BYTE_ORDER == __LITTLE_ENDIAN
#define htonll(x) bswap_64(x)
#else
#define htonll(x) (x)
#endif
#define ntohll(x) htonll(x)

typedef struct { volatile int val; } atomic_t;
#define atomic_inc(v) (__sync_fetch_and_add(&(v)->val, 1) + 1)
#define atomic_dec(v) (__sync_fetch_and_sub(&(v)->val, 1) - 1)
#define atomic_get(v) ((v)->val)
#define atomic_set(v, s) ((v)->val = s)

#define stricmp strcasecmp
#define strnicmp strncasecmp

typedef struct { pthread_cond_t cond; pthread_mutex_t mutex; } event_t;

static inline void event_init(event_t *e)
{
	pthread_cond_init(&e->cond, NULL);
	pthread_mutex_init(&e->mutex, NULL);
}

#define event_signal(e) pthread_cond_signal(&(e)->cond)

/* Wait up to 'timeout' milliseconds for the event to be signaled. */
static inline int event_wait(event_t *e, int timeout)
{
	struct timeval curtime;
	struct timespec wait;
	long nsec;
	int ret;

	gettimeofday(&curtime, NULL);
	/* Carry nanosecond overflow into tv_sec so tv_nsec stays below 1e9. */
	nsec = (curtime.tv_usec + (((unsigned) timeout) % 1000) * 1000) * 1000;
	wait.tv_sec = curtime.tv_sec + ((unsigned) timeout) / 1000 +
		      nsec / 1000000000;
	wait.tv_nsec = nsec % 1000000000;
	pthread_mutex_lock(&e->mutex);
	ret = pthread_cond_timedwait(&e->cond, &e->mutex, &wait);
	pthread_mutex_unlock(&e->mutex);
	return ret;
}

#define lock_t pthread_mutex_t
#define lock_init(x) pthread_mutex_init(x, NULL)
#define lock_acquire pthread_mutex_lock
#define lock_release pthread_mutex_unlock

#define osd_init() 0
#define osd_close()

#define SOCKET int
#define SOCKET_ERROR -1
#define INVALID_SOCKET -1
#define socket_errno() errno
#define closesocket close

static inline uint64_t time_stamp_us(void)
{
	struct timeval curtime;

	timerclear(&curtime);
	gettimeofday(&curtime, NULL);
	return (uint64_t) curtime.tv_sec * 1000000 + (uint64_t) curtime.tv_usec;
}

#define time_stamp_ms() (time_stamp_us() / 1000)

static inline int beginthread(void (*func)(void *), void *arg)
{
	pthread_t thread;

	return pthread_create(&thread, NULL, (void *(*)(void*)) func, arg);
}

#endif /* OSD_H */

/*
 * Copyright (c) 2009 Intel Corporation. All rights reserved.
 *
 * This software is available to you under the OpenIB.org BSD license
 * below:
 *
 * Redistribution and use in source and binary forms, with or
 * without modification, are permitted provided that the following
 * conditions are met:
 *
 * - Redistributions of source code must retain the above
 *   copyright notice, this list of conditions and the following
 *   disclaimer.
 *
 * - Redistributions in binary form must reproduce the above
 *   copyright notice, this list of conditions and the following
 *   disclaimer in the documentation and/or other materials
 *   provided with the distribution.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
 * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
 * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
 * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
 * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT
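The deadline arithmetic in event_wait() is worth isolating: pthread_cond_timedwait() takes an absolute time, and POSIX requires tv_nsec to stay below one second, so adding a millisecond timeout to the current microseconds needs a carry into tv_sec. A minimal sketch of that step, assuming a hypothetical helper named `deadline_from()` that is not part of the posted osd.h:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>
#include <time.h>

/*
 * Hypothetical helper: compute the absolute deadline "now + timeout_ms"
 * as a struct timespec, carrying nanosecond overflow into tv_sec.
 */
static inline struct timespec deadline_from(const struct timeval *now,
					    unsigned int timeout_ms)
{
	struct timespec ts;
	long nsec;

	assert(now != NULL && now->tv_usec >= 0 && now->tv_usec < 1000000);

	/* At most (999999 + 999000) * 1000, which fits in a 32-bit long. */
	nsec = ((long) now->tv_usec + (long) (timeout_ms % 1000) * 1000) * 1000;
	ts.tv_sec = now->tv_sec + timeout_ms / 1000 + nsec / 1000000000L;
	ts.tv_nsec = nsec % 1000000000L;
	return ts;
}
```

For example, a current time of 10 s + 900000 us with a 250 ms timeout gives 11 s + 150000000 ns, rather than a tv_nsec above one second that pthread_cond_timedwait() would reject with EINVAL.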
[ofa-general] [RFC] 2/5: IB ACM: windows abstractions
The following abstractions are defined to support the IB ACM running on Windows. An attempt was made to limit the number of dependencies on external libraries, such as complib.

We add Windows support for the Linux 'search' binary tree interfaces. This is implemented on Windows using complib fleximap, but gets linked in statically.

Signed-off-by: Sean Hefty sean.he...@intel.com
---
/*
 * Copyright (c) 2009 Intel Corporation. All rights reserved.
 *
 * This software is available to you under the OpenFabrics.org BSD license
 * below:
 *
 * Redistribution and use in source and binary forms, with or
 * without modification, are permitted provided that the following
 * conditions are met:
 *
 * - Redistributions of source code must retain the above
 *   copyright notice, this list of conditions and the following
 *   disclaimer.
 *
 * - Redistributions in binary form must reproduce the above
 *   copyright notice, this list of conditions and the following
 *   disclaimer in the documentation and/or other materials
 *   provided with the distribution.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
 * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
 * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
 * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
 * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
 * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE.
 */

#if !defined(OSD_H)
#define OSD_H

#include <windows.h>
#include <process.h>
#include <winsock2.h>

#define __func__ __FUNCTION__
#define LIB_DESTRUCTOR
#define CDECL_FUNC __cdecl

typedef struct { volatile LONG val; } atomic_t;
#define atomic_inc(v) InterlockedIncrement(&(v)->val)
#define atomic_dec(v) InterlockedDecrement(&(v)->val)
#define atomic_get(v) ((v)->val)
#define atomic_set(v, s) ((v)->val = s)

#define event_t HANDLE
#define event_init(e) *(e) = CreateEvent(NULL, FALSE, FALSE, NULL)
#define event_signal(e) SetEvent(*(e))
#define event_wait(e, t) WaitForSingleObject(*(e), t)

#define lock_t CRITICAL_SECTION
#define lock_init InitializeCriticalSection
#define lock_acquire EnterCriticalSection
#define lock_release LeaveCriticalSection

static __inline int osd_init()
{
	WSADATA wsadata;

	return WSAStartup(MAKEWORD(2, 2), &wsadata);
}

static __inline void osd_close()
{
	WSACleanup();
}

#define stricmp _stricmp
#define strnicmp _strnicmp
#define socket_errno WSAGetLastError
#define SHUT_RDWR SD_BOTH

static __inline UINT64 time_stamp_us(void)
{
	LARGE_INTEGER cnt, freq;

	QueryPerformanceFrequency(&freq);
	QueryPerformanceCounter(&cnt);
	/* The integer division runs first, so this has one-second granularity. */
	return (UINT64) cnt.QuadPart / freq.QuadPart * 1000000;
}

/* Microseconds to milliseconds is a division, as in the Linux osd.h. */
#define time_stamp_ms() (time_stamp_us() / 1000)

#define getpid() ((int) GetCurrentProcessId())
#define beginthread(func, arg) (int) _beginthread(func, 0, arg)
#define container_of CONTAINING_RECORD

#endif /* OSD_H */

/*
 * Copyright (c) 2009 Intel Corp, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
 * licenses.
You may choose to be licensed under the terms of the GNU
 * General Public License (GPL) Version 2, available from the file
 * COPYING in the main directory of this source tree, or the
 * OpenIB.org BSD license below:
 *
 * Redistribution and use in source and binary forms, with or
 * without modification, are permitted provided that the following
 * conditions are met:
 *
 * - Redistributions of source code must retain the above
 *   copyright notice, this list of conditions and the following
 *   disclaimer.
 *
 * - Redistributions in binary form must reproduce the above
 *   copyright notice, this list of conditions and the following
 *   disclaimer in the documentation and/or other materials
 *   provided with the distribution.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
 * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
 * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
 * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
 * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
 * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE.
 *
 */

#ifndef _SEARCH_H_
#define _SEARCH_H_

#include <complib/cl_fleximap.h>

//typedef enum
//{
//	preorder,
//	postorder,
//	endorder,
//	leaf
//
//} VISIT;

void *tsearch(const void *key, void **rootp,
	      int (*compar)(const void *, const void *));
void *tfind(const void *key, void *const *rootp,
	    int (*compar)(const void *, const void *));
/* tdelete
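For readers unfamiliar with the 'search' interfaces being ported, here is a small sketch of how tsearch() and tfind() are typically used on a POSIX system. The wrappers `name_insert()` and `name_lookup()` are hypothetical and are not taken from the ib_acm sources; the key point is that both calls return a pointer to a tree node whose first element is the stored key pointer.

```c
#include <assert.h>
#include <search.h>
#include <stddef.h>
#include <string.h>

/* Comparison callback: keys are NUL-terminated strings. */
static int compare_names(const void *a, const void *b)
{
	return strcmp((const char *) a, (const char *) b);
}

/*
 * Hypothetical wrapper: add 'name' to the tree unless an equal key is
 * already present. Returns the key stored in the tree, or NULL if the
 * node could not be allocated.
 */
static const char *name_insert(void **root, const char *name)
{
	void *node;

	assert(root != NULL && name != NULL);
	node = tsearch(name, root, compare_names);
	return node ? *(const char **) node : NULL;
}

/* Hypothetical wrapper: returns the stored key, or NULL if absent. */
static const char *name_lookup(void *const *root, const char *name)
{
	void *node = tfind(name, root, compare_names);

	return node ? *(const char **) node : NULL;
}
```

The tree stores only the pointers it is given, so the caller has to keep each key alive for as long as it remains in the tree.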
[ofa-general] [RFC] 3/5: IB ACM: libibacm
Add an end-user library with simple interfaces for communicating with the ib_acm service. The linux and windows specific files for the library are simple and not shown for this review.

Signed-off-by: Sean Hefty sean.he...@intel.com
---
ib_acm.h: defines library interfaces. These are the end-user application interfaces to the ib acm.

/*
 * Copyright (c) 2009 Intel Corporation. All rights reserved.
 *
 * This software is available to you under the OpenFabrics.org BSD license
 * below:
 *
 * Redistribution and use in source and binary forms, with or
 * without modification, are permitted provided that the following
 * conditions are met:
 *
 * - Redistributions of source code must retain the above
 *   copyright notice, this list of conditions and the following
 *   disclaimer.
 *
 * - Redistributions in binary form must reproduce the above
 *   copyright notice, this list of conditions and the following
 *   disclaimer in the documentation and/or other materials
 *   provided with the distribution.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
 * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
 * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
 * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
 * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
 * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE.
 */

#if !defined(IB_ACM_H)
#define IB_ACM_H

#include <infiniband/verbs.h>

#if defined(_WIN32)
#define LIB_EXPORT __declspec(dllexport)
#else
#define LIB_EXPORT
#endif

#ifdef __cplusplus
extern "C" {
#endif

struct ib_acm_dev_addr {
	uint64_t guid;
	uint16_t pkey_index;
	uint8_t port_num;
	uint8_t reserved[5];
};

struct ib_acm_resolve_data {
	uint32_t reserved1;
	uint8_t init_depth;
	uint8_t resp_resources;
	uint8_t packet_lifetime;
	uint8_t mtu;
	uint8_t reserved2[8];
};

/**
 * ib_acm_resolve_name - Resolve path data between the specified names.
 * Description:
 * Discover path information, including identifying the local device,
 * between the given source and destination names.
 * Notes:
 * The source and destination names should match entries in acm_addr.cfg
 * configuration files on their respective systems. Typically, the
 * source and destination names will refer to system host names
 * assigned to an InfiniBand port.
 */
LIB_EXPORT
int ib_acm_resolve_name(char *src, char *dest,
	struct ib_acm_dev_addr *dev_addr, struct ibv_ah_attr *ah,
	struct ib_acm_resolve_data *data);

/**
 * ib_acm_resolve_ip - Resolve path data between the specified addresses.
 * Description:
 * Discover path information, including identifying the local device,
 * between the given source and destination addresses.
 * Notes:
 * The source and destination addresses should match entries in acm_addr.cfg
 * configuration files on their respective systems. Typically, the
 * source and destination addresses will refer to IP addresses assigned
 * to an IPoIB instance.
 */
LIB_EXPORT
int ib_acm_resolve_ip(struct sockaddr *src, struct sockaddr *dest,
	struct ib_acm_dev_addr *dev_addr, struct ibv_ah_attr *ah,
	struct ib_acm_resolve_data *data);

#define IB_PATH_RECORD_REVERSIBLE 0x80

struct ib_path_record {
	uint64_t service_id;
	union ibv_gid dgid;
	union ibv_gid sgid;
	uint16_t dlid;
	uint16_t slid;
	uint32_t flowlabel_hoplimit;	/* resv-31:28 flow label-27:8 hop limit-7:0 */
	uint8_t tclass;
	uint8_t reversible_numpath;	/* reversible-7:7 num path-6:0 */
	uint16_t pkey;
	uint16_t qosclass_sl;		/* qos class-15:4 sl-3:0 */
	uint8_t mtu;			/* mtu selector-7:6 mtu-5:0 */
	uint8_t rate;			/* rate selector-7:6 rate-5:0 */
	uint8_t packetlifetime;		/* lifetime selector-7:6 lifetime-5:0 */
	uint8_t preference;
	uint8_t reserved[6];
};

/**
 * ib_acm_resolve_path - Resolve path data meeting specified restrictions
 * Description:
 * Discover path information using the provided path record to
 * restrict the discovery.
 * Notes:
 * Uses the provided path record as input into a query for path
 * information. If successful, fills in any missing information. The
 * caller must provide at least the source and destination LIDs as input.
 */
LIB_EXPORT
int ib_acm_resolve_path(struct ib_path_record *path);

/**
 * ib_acm_query_path - Resolve path data meeting specified restrictions
 * Description:
 * Queries the IB SA for a path record using the provided path record to
 * restrict the query.
 * Notes:
 * Uses the provided path record
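Several ib_path_record members pack more than one value, as the bit-range comments in the struct indicate. The helpers below sketch how those ranges could be extracted. All function names are hypothetical, and the sketch assumes the multi-byte members have already been converted to host byte order (for example with ntohl() and ntohs()).

```c
#include <assert.h>
#include <stdint.h>

/* flowlabel_hoplimit: resv 31:28, flow label 27:8, hop limit 7:0 */
static inline uint32_t pr_flow_label(uint32_t flowlabel_hoplimit)
{
	return (flowlabel_hoplimit >> 8) & 0xFFFFF;
}

static inline uint8_t pr_hop_limit(uint32_t flowlabel_hoplimit)
{
	return (uint8_t) (flowlabel_hoplimit & 0xFF);
}

/* qosclass_sl: qos class 15:4, sl 3:0 */
static inline uint16_t pr_qos_class(uint16_t qosclass_sl)
{
	return (uint16_t) (qosclass_sl >> 4);
}

static inline uint8_t pr_sl(uint16_t qosclass_sl)
{
	return (uint8_t) (qosclass_sl & 0xF);
}

/* Build a qosclass_sl value from its two parts. */
static inline uint16_t pr_pack_qos_sl(uint16_t qos_class, uint8_t sl)
{
	assert(qos_class <= 0xFFF && sl <= 0xF);
	return (uint16_t) ((qos_class << 4) | sl);
}

/* mtu, rate, packetlifetime: selector 7:6, value 5:0 */
static inline uint8_t pr_selector(uint8_t field)
{
	return (uint8_t) (field >> 6);
}

static inline uint8_t pr_value(uint8_t field)
{
	return (uint8_t) (field & 0x3F);
}

/* reversible_numpath: reversible 7:7 (mask 0x80), num path 6:0 */
static inline int pr_reversible(uint8_t reversible_numpath)
{
	return (reversible_numpath & 0x80) != 0;
}
```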
[ofa-general] [RFC] 4/5: IB ACM: ib_acme test/configuration utility
Add a test/configuration utility to set up the ib_acm service and verify its operation.

Signed-off-by: Sean Hefty sean.he...@intel.com
---
One of the eventual goals is for the librdmacm library to use the ib acm, so a decision was made to avoid the ib acm package needing to depend on the librdmacm. This led to OS specific code being needed to map IP addresses to IB endpoints. If anyone has an easier solution for handling this mapping, I'm open to alternatives here.

acme.c: OS independent source file

/*
 * Copyright (c) 2009 Intel Corporation. All rights reserved.
 *
 * This software is available to you under the OpenIB.org BSD license
 * below:
 *
 * Redistribution and use in source and binary forms, with or
 * without modification, are permitted provided that the following
 * conditions are met:
 *
 * - Redistributions of source code must retain the above
 *   copyright notice, this list of conditions and the following
 *   disclaimer.
 *
 * - Redistributions in binary form must reproduce the above
 *   copyright notice, this list of conditions and the following
 *   disclaimer in the documentation and/or other materials
 *   provided with the distribution.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
 * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
 * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
 * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
 * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
 * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE.
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <getopt.h>
#include <netdb.h>
#include <arpa/inet.h>

#include <osd.h>
#include <infiniband/verbs.h>
#include <infiniband/ib_acm.h>

static char *dest_addr;
static char *src_addr;
static char addr_type = 'i';
static int verify;
static int make_addr;
static int make_opts;

struct ibv_context **verbs;
int dev_cnt;

extern int gen_addr_ip(FILE *f);

static void show_usage(char *program)
{
	printf("usage 1: %s\n", program);
	printf("   [-f addr_format] - i(p), n(ame), or l(id)\n");
	printf("   default: 'i'\n");
	printf("   -s src_addr - format defined by -f option\n");
	printf("   -d dest_addr - format defined by -f option\n");
	printf("   [-v] - verify ACM response against SA query response\n");
	printf("usage 2: %s\n", program);
	printf("   -A - generate local acm_addr.cfg configuration file\n");
	printf("   -O - generate local acm_opts.cfg options file\n");
}

static void gen_opts_temp(FILE *f)
{
	fprintf(f, "# InfiniBand Multicast Communication Manager for clusters configuration file\n");
	fprintf(f, "#\n");
	fprintf(f, "# Use ib_acme utility with -O option to automatically generate a sample\n");
	fprintf(f, "# acm_opts.cfg file for the current system.\n");
	fprintf(f, "#\n");
	fprintf(f, "# Entry format is:\n");
	fprintf(f, "# name value\n");
	fprintf(f, "\n");
	fprintf(f, "# log_file:\n");
	fprintf(f, "# Specifies the location of the ACM service output. The log file is used to\n");
	fprintf(f, "# assist with ACM service debugging and troubleshooting. The log_file can\n");
	fprintf(f, "# be set to 'stdout', 'stderr', or the base name of a file. If a file name\n");
	fprintf(f, "# is specified, the actual name is formed by appending a process ID and '.log'\n");
	fprintf(f, "# extension to the end of the specified file name.\n");
	fprintf(f, "# Examples:\n");
	fprintf(f, "# log_file stdout\n");
	fprintf(f, "# log_file stderr\n");
	fprintf(f, "# log_file /tmp/acm_\n");
	fprintf(f, "\n");
	fprintf(f, "log_file stdout\n");
	fprintf(f, "\n");
	fprintf(f, "# log_level:\n");
	fprintf(f, "# Indicates the amount of detailed data written to the log file. Log levels\n");
	fprintf(f, "# should be one of the following values:\n");
	fprintf(f, "# 0 - basic configuration errors\n");
	fprintf(f, "# 1 - verbose configuration errors\n");
	fprintf(f, "# 2 - verbose operation\n");
	fprintf(f, "\n");
	fprintf(f, "log_level 0\n");
	fprintf(f, "\n");
	fprintf(f, "# server_port:\n");
	fprintf(f, "# TCP port number that the server listens on.\n");
	fprintf(f, "# If this value is changed, then a corresponding change is required for\n");
	fprintf(f, "# client applications.\n");
	fprintf(f, "\n");
	fprintf(f, "server_port 6125\n");
	fprintf(f, "\n");
	fprintf(f, "# timeout:\n");
	fprintf(f, "# Additional time, in milliseconds, that the ACM service will wait for a\n");
	fprintf(f, "# response from a remote ACM service or the IB SA. The actual request\n");
	fprintf(f
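The generated options file uses a "name value" entry per line, with '#' starting a comment. As an illustration of that format only, here is a sketch of reading one line; `parse_opt_line()` is a hypothetical helper and is not claimed to match how the ib_acm service actually parses acm_opts.cfg.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: split one "name value" line.
 * Returns 1 if an entry was parsed, 0 for blank lines, '#' comments,
 * and lines that lack a value.
 */
static int parse_opt_line(const char *line, char name[64], char value[256])
{
	assert(line != NULL && name != NULL && value != NULL);

	line += strspn(line, " \t");	/* skip leading whitespace */
	if (*line == '#' || *line == '\n' || *line == '\0')
		return 0;

	/* Field widths leave room for the terminating NUL. */
	return sscanf(line, "%63s %255s", name, value) == 2;
}
```

A caller would typically read the file with fgets() and pass each line through this helper, comparing `name` against the options it knows (log_file, log_level, server_port, timeout, and so on).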
RE: [ofa-general] [RFC] 3/5: IB ACM: libibacm
#define IB_PATH_RECORD_REVERSIBLE 0x80

struct ib_path_record {
	uint64_t service_id;
	union ibv_gid dgid;
	union ibv_gid sgid;
	uint16_t dlid;
	uint16_t slid;
	uint32_t flowlabel_hoplimit;	/* resv-31:28 flow label-27:8 hop limit-7:0 */
	uint8_t tclass;
	uint8_t reversible_numpath;	/* reversible-7:7 num path-6:0 */
	uint16_t pkey;
	uint16_t qosclass_sl;		/* qos class-15:4 sl-3:0 */
	uint8_t mtu;			/* mtu selector-7:6 mtu-5:0 */
	uint8_t rate;			/* rate selector-7:6 rate-5:0 */
	uint8_t packetlifetime;		/* lifetime selector-7:6 lifetime-5:0 */
	uint8_t preference;
	uint8_t reserved[6];
};

I would prefer to use the structures already defined in ib_types.h... I understand your not wanting to make ACM dependent on the OpenSM packages so is it time to move ib_types.h out of the OpenSM tree and somewhere more generic? Perhaps libibumad? This also applies to ib_sa_mad in your 5th patch. OTOH, ib_types.h is a 10K line file with multiple long (10 lines) inlined functions. Perhaps it deserves its own library?

Defining some of these types in libibumad isn't a bad idea. Although, WinOF actually has 2 copies of ib_types.h (that differ...) I find using ib_types.h painful given its size; separate header files may help.

- Sean
RE: [ofw] Re: [ofa-general] [RFC] 3/5: IB ACM: libibacm
I'm not sure this is a good idea. ibutils (ibis and ibmgtsim) wants ib_types.h but does not want libibumad. Well, libibumad is pretty useless without some network structure definitions. Currently, the alternatives are to install opensm, which also requires installing libibmad, libibcommon, and complib, or for the app to define what they need, which is what was done here. I'm not sure how you pick up ib_types.h without libibumad getting installed, but you can make a reasonable argument that libibumad should define the MAD and SA attribute structures. - Sean
RE: [ofw] Re: [ofa-general] [RFC] 3/5: IB ACM: libibacm
libibcm needs to learn how to do PR queries, it should have a good PR query API since libibcm is pretty useless without being able to do PR queries.. PR queries don't work - regardless of what the API looks like or where it resides. Plus adding PR queries to libibcm doesn't solve the problem of where the structure definitions reside. - Sean
RE: [ofw] Re: [ofa-general] [RFC] 3/5: IB ACM: libibacm
PR queries work fine, I don't understand your comment. MPI does not use PR queries because it does not scale.
RE: [ofw] Re: [ofa-general] [RFC] 3/5: IB ACM: libibacm
Not all the world is MPI.

The focus of this package is for MPI though. The librdmacm interface does perform standard PR queries for applications that use that interface. I'm not fond of the mad interfaces, but I'm not trying to fix them with this.

We can debate whether an application should use an interface that exposes path records and the IB CM protocol directly, but the feedback from MPI and other developers is that connection establishment over IB requires too much code and is too difficult.

Short term, while the ib_acm is considered experimental, I want to call the ib_acm from under the librdmacm interface. This allows it to be used without applications needing to change. Long term, if the ib_acm can prove itself, then accessing it directly from the kernel is a possibility.

Your new acm stuff still does PR queries.

The primary reason for adding PR query was to verify that the path information returned by the ib_acm was usable. A user needs some way to know if the ib_acm can be used on their cluster. This was one of the last things that I added, and I think it has value, even if only for verification purposes. The central mechanism the ib_acm employs to acquire path data uses multicast.

Anyone using libibverbs multicast needs to do PR queries from userspace.

The ib_acm uses libibverbs multicast and does not do PR queries.

Anyone using libibcm needs to do PR queries from userspace.

Open MPI has coded to the libibcm and does not perform PR queries.

What's needed in either of the above cases is path information; however, there are alternate ways of obtaining this information without involving a direct query to the SA. MPI and DAPL can connect over IB today without doing PR queries. While there are limitations to determining path information without doing a PR query, there are also limitations to obtaining path information by doing one. Looking at current implementations, I would deduce that the latter is more limiting than the former in practice.
Therefore we should just jam the PR query stuff in libibcm, everyone can use that, and your acm can ride on the PR query code from libibcm for its own needs too.
These are the calls exposed through libibacm:
int ib_acm_resolve_name(char *src, char *dest, struct ib_acm_dev_addr *dev_addr, struct ibv_ah_attr *ah, struct ib_acm_resolve_data *data);
int ib_acm_resolve_ip(struct sockaddr *src, struct sockaddr *dest, struct ib_acm_dev_addr *dev_addr, struct ibv_ah_attr *ah, struct ib_acm_resolve_data *data);
int ib_acm_resolve_path(struct ib_path_record *path);
int ib_acm_query_path(struct ib_path_record *path);
int ib_acm_convert_to_path(struct ib_acm_dev_addr *dev_addr, struct ibv_ah_attr *ah, struct ib_acm_resolve_data *data, struct ib_path_record *path);
Of these, the one of most importance to the problem I'm trying to solve is ib_acm_resolve_ip(). I do not believe that we want to add what should be considered an experimental interface to libibcm, libibumad, or librdmacm based on socket addresses that would then need to be maintained. If your objection is that ib_acm_query_path() should be moved to libibcm, that's a possibility. libibacm already interfaces to libibumad, and it was trivial to add support for PR queries. libibcm does not currently depend on libibumad. And if you take a step back in the connection process, I don't know that support for just PR queries is sufficient for establishing a connection over IB. You first need to identify the endpoint, which opens up the possibility of other SA queries. - Sean
___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] RE: Does the CMA user space support join multicast for IPv6 too?
Does rdma_join_multicast support IPv6 addresses? If yes, from which version of librdmacm?
Hmm... I don't think so. It looks like the librdmacm and rdma_cm kernel modules could support it with a small change. The kernel module calls ip_ib_mc_map() to map IP addresses to MGIDs, which only works with IPv4. Does ipoib map IPv6 multicast addresses to MGIDs directly? - Sean
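To make the IPv4-only limitation concrete, here is a rough sketch of the IPv4 multicast to MGID mapping as I understand it from RFC 4391 (IPoIB). It is not the kernel's ip_ib_mc_map(); the function name ipv4_mc_to_mgid and its parameters are made up for illustration, and it builds only the 16-byte MGID rather than a full IPoIB hardware address.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not a kernel or librdmacm function).
 * group: IPv4 multicast address in host byte order
 * scope: 4-bit multicast scope (0x2 = link-local)
 * pkey:  partition key of the IPoIB broadcast group
 */
static void ipv4_mc_to_mgid(uint32_t group, uint8_t scope, uint16_t pkey,
			    uint8_t mgid[16])
{
	memset(mgid, 0, 16);
	mgid[0] = 0xff;				/* multicast prefix */
	mgid[1] = 0x10 | (scope & 0x0f);	/* transient flag + scope */
	mgid[2] = 0x40;				/* 0x401b: IPv4 signature */
	mgid[3] = 0x1b;
	mgid[4] = pkey >> 8;
	mgid[5] = pkey & 0xff;
	/* only the low 28 bits of the group address are carried */
	mgid[12] = (group >> 24) & 0x0f;
	mgid[13] = (group >> 16) & 0xff;
	mgid[14] = (group >> 8) & 0xff;
	mgid[15] = group & 0xff;
}
```

An IPv6 mapping would need a different signature and more address bits, which is why a function shaped like this cannot simply be reused for IPv6 groups.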
[ofa-general] RE: [PATCH/RFC] IB/mad: Fix lock-lock-timer deadlock in RMPP code (was: [NEW PATCH] IB/mad: Fix possible lock-lock-timer deadlock)
Holding agent->lock across cancel_delayed_work() (which does del_timer_sync()) in ib_cancel_rmpp_recvs() leads to lockdep reports of possible lock-timer deadlocks if a consumer ever does something that connects agent->lock to a lock taken in IRQ context (cf http://marc.info/?l=linux-rdma&m=125243699026045). However, it seems this locking is not necessary here, since the locking did not prevent the rmpp_list from having an item added immediately after the lock is dropped -- so there must be sufficient synchronization protecting the rmpp_list without the locking here. Therefore, we can fix the lockdep issue by simply deleting the locking.
The locking is needed to protect against items being removed from rmpp_list in recv_timeout_handler() and recv_cleanup_handler(). No new items should be added to the rmpp_list when ib_cancel_rmpp_recvs() is running (or there's a separate bug). - Sean
RE: [ofa-general] performance to call ibv_poll_cq() vs. call select() on completion channel
But I just checked the source code: ibv_poll_cq() is actually ibv_cmd_poll_cq(), and ibv_cmd_poll_cq() calls the write() system call on the IB device. Doesn't this write() system call switch to kernel mode and possibly cause a context switch?
See verbs.h:
static inline int ibv_poll_cq(struct ibv_cq *cq, int num_entries, struct ibv_wc *wc)
{
	return cq->context->ops.poll_cq(cq, num_entries, wc);
}
The userspace provider library sets poll_cq to an internal call. - Sean
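The dispatch described above can be sketched with stand-in types. Everything prefixed mock_ or provider_ below is hypothetical and only mirrors the shape of the verbs.h wrapper; the point is that the call goes through a function pointer installed by the provider library, so polling need not involve a write() to the device file.

```c
/* Hypothetical stand-ins for the libibverbs types; not the real definitions. */
struct mock_wc { int wr_id; };
struct mock_cq;

struct mock_context_ops {
	int (*poll_cq)(struct mock_cq *cq, int num_entries, struct mock_wc *wc);
};

struct mock_context {
	struct mock_context_ops ops;
};

struct mock_cq {
	struct mock_context *context;
	int pending;	/* completions waiting in the simulated queue */
};

/* What a provider might install: reads the queue from user-space memory. */
static int provider_poll_cq(struct mock_cq *cq, int num_entries,
			    struct mock_wc *wc)
{
	int n = 0;

	while (n < num_entries && cq->pending > 0) {
		wc[n].wr_id = cq->pending--;
		n++;
	}
	return n;	/* number of completions returned */
}

/* Same shape as the inline wrapper quoted from verbs.h. */
static inline int mock_poll_cq(struct mock_cq *cq, int num_entries,
			       struct mock_wc *wc)
{
	return cq->context->ops.poll_cq(cq, num_entries, wc);
}
```

A caller would set ops.poll_cq = provider_poll_cq once at context creation and then call mock_poll_cq() in its polling loop.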
RE: [ofa-general] Help - RDMA event files remain open after acknowledging them
What am I doing wrong? Is there something more I need to do than calling rdma_ack_cm_event after every rdma_get_cm_event to get these event files to be closed? As an fyi, I have even tried closing the rdma_id and destroying the event channel when the connection fails to force the event files to be closed without success.
The following calls result in opening files to the kernel:
ibv_create_comp_channel() - used to report cq events
rdma_create_event_channel() - used to report rdma cm events
Be sure that there are corresponding calls to:
ibv_destroy_comp_channel()
rdma_destroy_event_channel()
These are the calls that close the opened files. - Sean
RE: [ofa-general] will opensm respond to requests that do not originate from qp1
Based on a code audit, I've confirmed that this should work (osm_vendor_ibumad.c:osm_vendor_send takes care of doing this). I'm not sure it's been tried for SA but it has been exercised for other GS classes (sending to some QP other than QP1). Thanks for checking and pointing me at the right source file.
RE: How to destroy IB resources (was Re: [ofa-general] Help - RDMA event files remain open after acknowledging them)
I guess my question is, what's the best way to destroy IB resources? (Perhaps even, what's the best way to init them in the first place).
If you're destroying the CQ, there's no need to call ibv_get_cq_event() or ibv_poll_cq(), unless you need completion information (for example, from flushed receives). However, every successful call to ibv_get_cq_event() needs a corresponding call to ibv_ack_cq_events(). You can call ack(1) for each cq event, or count the number of times that get returns success and call ack(get_cnt) once before calling destroy. Note that the count refers to the number of cq events, and not the number of completions returned through ibv_poll_cq. For your drain_cq() function, you should be safe doing something like this:
while (ibv_poll_cq(...) > 0)
	/* optional processing of any left over completions */;
ibv_ack_cq_events(...this_cqs_total_event_cnt); /* or ack after get */
ibv_destroy_cq(...);
ibv_dealloc_pd(), ibv_destroy_cq() and ibv_destroy_comp_channel() all return error EBUSY
This sounds like a QP isn't being destroyed. I'm not sure that anything else fails CQ destruction with EBUSY. Btw, if you're using the rdma_cm interface, then it's simpler to use the rdma_create_qp/rdma_destroy_qp calls, which allow the rdma_cm to perform the QP state transitions for you. - Sean
[ofa-general] RE: [PATCH/RFC] IB/mad: Fix possible deadlock (cancel_delayed_work inside spinlock)
How about this approach? Basically it just open-codes delayed work by splitting the timer and the work struct, and switches to mod_timer() instead of del_timer() + add_timer(). It passes very light testing here (basically I started ipoib and nothing blew up).
The approach looks okay to me.
@@ -512,7 +523,8 @@ static void unregister_mad_agent(struct ib_mad_agent_private *mad_agent_priv)
 	 */
 	cancel_mads(mad_agent_priv);
 	port_priv = mad_agent_priv->qp_info->port_priv;
-	cancel_delayed_work(&mad_agent_priv->timed_work);
+	del_timer_sync(&mad_agent_priv->timeout_timer);
+	cancel_work_sync(&mad_agent_priv->timeout_work);
I had to check if there was a race between del_timer_sync() and the worker thread, but the call to cancel_mads() looks like it prevents any issues. - Sean
RE: [ofa-general] crash in cm_init_qp_rts_attr() - any ideas?
Call Trace:
882fb6d5{:rdma_cm:rdma_init_qp_attr+209}
88309285{:rdma_ucm:ucma_init_qp_attr+160}
802ea55a{thread_return+0}
8830832e{:rdma_ucm:ucma_write+115}
80186662{vfs_write+215}
80186c2b{sys_write+69}
8010adba{system_call+126}
The rdma_cm is being used, so alternate path information is not used.
static int cm_init_qp_rts_attr(struct cm_id_private *cm_id_priv, struct ib_qp_attr *qp_attr, int *qp_attr_mask)
{
	if (cm_id_priv->id.lap_state == IB_CM_LAP_UNINIT) {
		.
	} else {
		*qp_attr_mask = IB_QP_ALT_PATH | IB_QP_PATH_MIG_STATE;
		qp_attr->alt_port_num = cm_id_priv->alt_av.port->port_num; <- die
The rdma_cm should always send us through the if portion, and I would expect alt_av to be NULL. Maybe the cm_id is corrupted..? Is there any chance that the remote side is trying to load an alternate path? Getting the value of the lap_state may help, to see if it's at least a valid lap_state value. - Sean
RE: [ofa-general] crash in cm_init_qp_rts_attr() - any ideas?
Ah, I've got that - lap_state is IB_CM_MRA_LAP_SENT.
Errr... not sure how that happened. I don't know if ofed 1.3 has this feature or not, but can you cat: /sys/class/infiniband_cm/device/port_num/cm_tx_msgs/lap if it exists? Are both sides using the rdma_cm to communicate? Does anything in the app (either side) try to do something with alternate paths? - Sean
RE: [ofa-general] [PATCHv4 04/10] IB/umad: Enable support for RDMAoE ports
Might there be some GS service to expose? Vendor MADs perhaps? If not, then not exposing QP1 should be OK.
At some point, exposing QP1 may make sense. I was thinking more along the lines of limiting the user space interfaces until things can be standardized.
RE: [ofa-general] [PATCHv4 01/10] ib_core: Refine device personality from node type to port type
Can resources (PDs, CQs, MRs, etc.) between the different transports be shared? Does QP failover between transports work?
There is nothing in the architecture that precludes this; we are not currently focusing on this.
Does the implementation allow this? Right now PDs, CQs, etc. are allocated per device, not per port. I'm not immediately concerned about QP failover. However, I believe there needs to be some level of coordination between the InfiniBand side of the CM and the Ethernet side of the CM, since QPs are associated with CA GUIDs. I'm just trying to understand the impact of this coordination. - Sean
RE: [ofa-general] [PATCH] ib_send_bw -b can hang due to too few CQ entries
-	ctx->cq = ibv_create_cq(ctx->context, ctx->rx_depth, NULL, ctx->channel, 0);
+	ctx->cq = ibv_create_cq(ctx->context, ctx->tx_depth + ctx->rx_depth,
+				NULL, ctx->channel, 0);
I'm looking at a windows port of this test, but at least there, rx_depth is set to rx_depth + tx_depth.
RE: [ofa-general] [PATCH] ib_send_bw -b can hang due to too few CQ entries
Sure. Just above the call to ibv_create_cq(), ctx->rx_depth is set to ctx->rx_depth = rx_depth + tx_depth but the rest of the code does ibv_post_send() and ibv_post_recv() based on ctx->tx_depth and ctx->rx_depth which means the CQ needs to be ctx->tx_depth + ctx->rx_depth big.
If the tx_depth is the same on both sides, why would there ever be more than the initial tx_depth and rx_depth completions on the CQ? How many receive completions can there be on the CQ, and what throttles the sender? - Sean
RE: [ofa-general] [PATCH] ib_send_bw -b can hang due to too few CQ entries
Remember that this fix only affects the bi-directional test. Both client and server are going to post ctx->rx_depth receives and ctx->tx_depth sends and then check for completions. It won't post more sends or receives until the completions are seen.
Okay - I think I understand what's happening. The maximum number of outstanding sends is limited to tx_depth / 2. After posting that many sends, the code waits for completions. Once some sends complete, additional sends may be posted, up to the iteration count. There's nothing that coordinates posting the sends with completing receives on the remote side. (This is what I was missing.) Eventually, all posted receives could be complete and generate CQ entries. The send side is basically throttled by RNR NACKs. Now I don't understand the purpose behind doubling the rx_depth... - Sean
RE: [ofa-general] [PATCH] cma: fix access to freed memory
rdma_join_multicast() allocates struct cma_multicast and then proceeds to join to a multicast address. However, the join operation completes in another context and the allocated struct could be released if the user destroys either the rdma_id object or decides to leave the multicast group while the join is in progress. This patch uses reference counting to avoid such a situation. It also protects removal from id_priv->mc_list in cma_leave_mc_groups().
rdma_destroy_id and rdma_leave_multicast call ib_sa_free_multicast. This call will block until the join callback completes or is canceled. Can you describe the race with cma_ib_mc_handler in more detail? Also, cma_leave_mc_groups is only called from rdma_destroy_id. Locking around the mc-list shouldn't be required, since calls to join/leave aren't allowed. - Sean
RE: [ofa-general] [PATCHv4 01/10] ib_core: Refine device personality from node type to port type
As a preparation to devices that, in general, support different transport protocol for each port, specifically RDMAoE, this patch defines transport type for each of a device's ports. As a result rdma_node_get_transport() has been unexported and is used internally by the implementation of the new API, rdma_port_get_transport() which gives the transport protocol of the queried port. All references to rdma_node_get_transport() are changed to use rdma_port_get_transport(). Also, ib_port_attr is extended to contain enum rdma_transport_type.
Can resources (PDs, CQs, MRs, etc.) between the different transports be shared? Does QP failover between transports work?
diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 5130fc5..f930f1d 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -3678,9 +3678,7 @@ static void cm_add_one(struct ib_device *ib_device)
 	unsigned long flags;
 	int ret;
 	u8 i;
-
-	if (rdma_node_get_transport(ib_device->node_type) != RDMA_TRANSPORT_IB)
-		return;
Did you consider modifying rdma_node_get_transport_s_() and returning a bitmask of the supported transports available on the device? I'm wondering if something like this makes sense, to allow skipping devices that are not of interest to a particular module. This would be in addition to the rdma_port_get_transport call. There's just a lot of new checks to handle the transport on a port by port basis. - Sean
RE: [ofa-general] [PATCH] cma: fix access to freed memory
So where does this leave things? Is any part of Eli's patch needed?
I don't believe the patch is needed, and Eli agreed with this. - Sean
RE: [ofa-general] [PATCH] IB: Possible write outside array bounds
@@ -132,6 +136,9 @@ enum smi_action smi_handle_dr_smp_recv(struct ib_smp *smp, u8 node_type,
 	hop_ptr = smp->hop_ptr;
 	hop_cnt = smp->hop_cnt;
+	if (hop_cnt >= IB_SMP_MAX_PATH_HOPS)
+		return IB_SMI_DISCARD;
+
 	/* See section 14.2.2.2, Vol 1 IB spec */
 	if (!ib_get_smp_direction(smp)) {
 		/* C14-9:1 -- sender should have incremented hop_ptr */
@@ -140,7 +147,8 @@ enum smi_action smi_handle_dr_smp_recv(struct ib_smp *smp, u8 node_type,
 		/* C14-9:2 -- intermediate hop */
 		if (hop_ptr && hop_ptr < hop_cnt) {
-			if (node_type != RDMA_NODE_IB_SWITCH)
+			if (node_type != RDMA_NODE_IB_SWITCH ||
+			    hop_ptr + 1 >= IB_SMP_MAX_PATH_HOPS)
I believe at this point: hop_ptr < hop_cnt < IB_SMP_MAX_PATH_HOPS so, this test will always fail. - Sean
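The claim that the second added test can never be true is easy to check by brute force. The sketch below assumes the comparison operators lost from the quoted diff were >=, && and <, and that IB_SMP_MAX_PATH_HOPS is 64; the function name extra_test_can_fire and the MAX_PATH_HOPS macro are made up for illustration.

```c
#include <stdbool.h>

#define MAX_PATH_HOPS 64	/* assumed value of IB_SMP_MAX_PATH_HOPS */

/*
 * Returns true if (hop_ptr + 1 >= MAX_PATH_HOPS) can hold for any u8-range
 * hop_ptr/hop_cnt pair that survives the earlier checks in the patch.
 */
static bool extra_test_can_fire(void)
{
	for (unsigned hop_cnt = 0; hop_cnt < 256; hop_cnt++) {
		if (hop_cnt >= MAX_PATH_HOPS)
			continue;	/* already discarded by the first added check */
		for (unsigned hop_ptr = 0; hop_ptr < 256; hop_ptr++) {
			if (!(hop_ptr && hop_ptr < hop_cnt))
				continue;	/* not the intermediate-hop branch */
			if (hop_ptr + 1 >= MAX_PATH_HOPS)
				return true;
		}
	}
	return false;
}
```

Under those assumptions hop_ptr is at most MAX_PATH_HOPS - 2 inside the branch, so the function finds no pair for which the extra test fires.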
RE: [ofa-general] [PATCH] perftest Add rdma_cm retries
Here is version 2 of the patch. Based on observations of tests, I believe Steve Wise's comments are reasonable, so I removed the rdma_resolve_addr retry and simply changed the timeout value. Feel free to use whichever one of these patches you like best. However, I urge you to apply one of these, since the programs fail in a busy large fabric.
Why not just make the retry and timeout values command line parameters and allow adjusting both? - Sean
RE: [ofa-general] [PATCH] perftest Add rdma_cm retries
I'm not sure we need the retry.
On IB, resolve route is done using unreliable datagram with no lower level timeout or retry. - Sean
[ofa-general] RE: Running more than 894 processes doing rdma_listen
Is there an explicit limit on the number of ports that can be listening using rdma_cm?
There's no inherent limit built into the code.
It prints out
CMA: unable to open RDMA device
It then doesn't gracefully handle that problem, ending in
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 47695401269920 (LWP 30003)]
__ibv_close_device (context=0x0) at src/device.c:154
154 int async_fd = context->async_fd;
(gdb) where
#0 __ibv_close_device (context=0x0) at src/device.c:154
#1 0x0034e360184f in ucma_cleanup () at src/cma.c:165
#2 0x0034e3601a13 in ucma_init () at src/cma.c:257
#3 0x0034e3602080 in rdma_create_event_channel () at src/cma.c:299
#4 0x00403077 in main (argc=4, argv=0x7fffb739fcc8) at rdma_bw.c:1057
Thanks - I see where the bug is for this. - Sean
RE: [ofa-general] sending mad in parallel mode and perfquery
The ibumad library has functions send_mad and recv_mad which should be called sequentially. Is it possible to create a function which would send several MADs to several destinations and then wait for replies (in terms of the ib driver)?
I'm not sure that send_mad and recv_mad don't do what you want. To send to multiple destinations, call send_mad multiple times. The call returns after posting or queuing the send operation to the QP. It does not wait for a response or guarantee that the send has actually been placed on the wire before returning. recv_mad blocks until any response is received, and it can be called from multiple threads. recv_mad only has multi-threaded issues if MADs > 256 bytes are received. - Sean
RE: [ofa-general] rdma_listen() backlog
Maybe I've missed something, but the last time I checked it appeared to me that for kernel RDMA CM the 'backlog' parameter was not used at all unless for iWarp transport.
It's not used for kernel IB connections. Since connection requests are reported through a callback, there's nothing to queue and it's unneeded. It is used for userspace connections.
[ofa-general] RE: Question on rdma_resolve_route and retries
We are trying to use OpenMPI 1.3.2 with rdma_cm support on an InfiniBand fabric using OFED 1.4.1. When the MPI jobs get large enough, the event response to rdma_resolve_route becomes RDMA_CM_EVENT_ROUTE_ERROR with a status of ETIMEDOUT.
Yep - you pretty much need to connect out of band with all large MPI jobs using made up path data, or enable some sort of PR caching.
It seems pretty clear that the SA path record requests are being synchronized and bunching together, and in the end exhausting the resources of the subnet manager node so only the first N are actually received.
In our testing, we discovered that the SA almost never dropped any queries. The problem was that the backlog grew so huge that all requests had timed out before they could be acted on. There's probably something that could be done here to avoid storing received MADs for extended periods of time.
The sequence seems to be:
call librdmacm-1.0.8/src/cma.c's rdma_resolve_route
which translates directly into a kernel call into infiniband/core/cma.c's rdma_resolve_route
with an IB fabric becomes a call into cma_resolve_ib_route
which leads to a call to cma_query_ib_route
which gets to calling infiniband/core/sa_query.c's ib_sa_path_rec_get with the callback pointing to cma_query_handler
When cma_query_handler gets a callback with a bad status, it sets the returned event to RDMA_CM_EVENT_ROUTE_ERROR
Nowhere in there do I see any retry attempts. If the SA path record query packet, or its response packet, gets lost, then the timeout eventually happens and we see RDMA_CM_EVENT_ROUTE_ERROR with a status of ETIMEDOUT.
The kernel sa_query module does not issue retries. All retries are the responsibility of the caller. This gives greater flexibility to how timeouts are handled, but has the drawback that all 'retries' are really new transactions.
First question: Did I miss a retry buried somewhere in all of that?
I don't believe so.
Second question: How does somebody come up with a timeout value that makes sense? Assuming retries are the responsibility of the rdma_resolve_route caller, you would like to have a value that is long enough to avoid false timeouts when a response is eventually going to make it, but not any longer. This value seems like it would be dependent on the fabric and the capabilities of the node running the subnet manager, and should be a fabric-specific parameter instead of something chosen at random by each caller of rdma_resolve_route.
The timeout is also dependent on the load hitting the SA. I don't know that a fabric-specific parameter can work. - Sean
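Since retries are left to the caller, a caller-side retry loop might look roughly like the sketch below. The resolve_fn callback is a hypothetical stand-in for one route-resolution attempt (it is not an rdma_cm call), mock_resolve exists only to exercise the loop, and doubling the timeout on each attempt is my own assumption rather than anything suggested in the thread.

```c
#include <errno.h>

/* Hypothetical: one resolution attempt; returns 0 or a negative errno. */
typedef int (*resolve_fn)(void *ctx, int timeout_ms);

/* Retry on timeout only, doubling the timeout up to max_timeout_ms.
 * Each retry is a brand new transaction, as noted above. */
static int resolve_with_retries(resolve_fn resolve, void *ctx,
				int timeout_ms, int max_timeout_ms, int retries)
{
	int ret = -ETIMEDOUT;

	for (int attempt = 0; attempt <= retries; attempt++) {
		ret = resolve(ctx, timeout_ms);
		if (ret != -ETIMEDOUT)
			return ret;	/* success, or an error a retry won't fix */
		timeout_ms *= 2;
		if (timeout_ms > max_timeout_ms)
			timeout_ms = max_timeout_ms;
	}
	return ret;
}

/* Mock attempt for demonstration: succeeds once the timeout is long enough. */
static int mock_resolve(void *ctx, int timeout_ms)
{
	int needed_ms = *(int *)ctx;

	return timeout_ms >= needed_ms ? 0 : -ETIMEDOUT;
}
```

With a 1000 ms starting timeout and three retries, the attempts use 1000, 2000, 4000 and 8000 ms, so an SA that needs 4 seconds to answer is eventually reached while one that needs a minute still times out.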
[ofa-general] RE: Question on rdma_resolve_route and retries
This is encouraging. I did try testing with 10,000 ms timeouts and still got the failure with only 800 different processes, so I jumped to the conclusion that the queries were being dropped. Do you have a guess as to a timeout value that would always succeed?
We ended up around a 60 second timeout based on the number of connections and how quickly our SM node could process queries. This was done a while ago, and there have been a lot of improvements to opensm since then. I don't know of an easy way to test the performance of the SM. It's also possible that our test staggered the queries just enough that the SM could keep up receiving them.
Maybe I should have come up with a better name. By fabric-specific, I meant a specific implementation of the fabric, including the capability of the subnet manager node. How does somebody writing rdma_cm code come up with a number? That particular program might not put much of a load on the SA, but could run concurrently with other jobs that do (or don't). It would be nice to have a way to set up the retry mechanism so that it would work on any system it ran on.
Maybe the SA service could track the SA response time and adjust the timeout accordingly. E.g. guess = .2(last response) + .8(last guess). Users could indicate that the default timeout could be used. Apps could also help by staggering their start times to avoid hitting the SA with hundreds of thousands of queries at once. - Sean
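The smoothing formula suggested above, guess = .2(last response) + .8(last guess), could be written as follows. This is only a sketch: the function name next_timeout_guess is made up, and integer milliseconds are my choice of unit.

```c
/*
 * Exponentially weighted estimate of the SA response time, in milliseconds:
 *   guess = 0.2 * last_response + 0.8 * last_guess
 * computed in integer arithmetic (result is rounded down).
 */
static unsigned int next_timeout_guess(unsigned int last_response_ms,
				       unsigned int last_guess_ms)
{
	return (2 * last_response_ms + 8 * last_guess_ms) / 10;
}
```

Starting from a 1000 ms guess, a 2000 ms response moves the guess to 1200 ms, and a second 2000 ms response moves it to 1360 ms, so a single slow reply shifts the timeout only gradually.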
RE: [ofa-general] cmatose fails whereas rping passes on iWarp
I did this change and the hang went away as well. I think cmatose.c needs this fix. ucmatose completes when I change the following line: send_wr.send_flags = 0; to send_wr.send_flags = IBV_SEND_SIGNALED;
cmatose sets init_qp_attr.sq_sig_all = 1 when initializing the QP, so I wouldn't expect this flag to be used. - Sean
RE: [ofa-general] cmatose fails whereas rping passes on iWarp
If this test sends data from server side first you could be running into the iWARP requirement of sending from client first.
This was my thought as well. I think Chelsio supports sending from the server side first, but I'm not sure, or if it's enabled by default. - Sean
RE: [ofa-general] ib_rdma_bw - memory leaks?
As mentioned in my previous email, there are other 3 places of memory leaks, should I proceed and fix them up in rdma_bw.c file?
I think that makes sense; I was only commenting on the code that I maintain. Based on looking at the git trees, it appears that Owen Meron is the maintainer of ib_rdma_bw. - Sean
RE: [ofa-general] cmatose fails whereas rping passes on iWarp
I'm wondering if anybody else has seen this behavior. Is cmatose expected to work on iWarp?
It's intended to work on iWarp.
[r...@lv2 examples]# ./ucmatose -s 192.168.10.30
cmatose: starting client
cmatose: connecting
cmatose: event: RDMA_CM_EVENT_CONNECT_ERROR, error: -22
This looks like an asynchronous error occurring while trying to connect. I don't see anything obvious in cmatose.c that would lead to a connect error. Does anything occur on the server side? - Sean
RE: [ofa-general] ib_rdma_bw - memory leaks?
3. rdma_create_event_channel() calls ucma_init() but rdma_destroy_event_channel() does not call ucma_cleanup(); this results in a memory leak in the provider's library, since it does not call ibv_close_device() and is thus unable to do *->free_context().
This is the correct behavior. ucma_init() is called from several routines to ensure that the library performs proper initialization. Once initialized, it remains initialized until the library is no longer used. The cleanup is done in rdma_cma_fini(). - Sean
RE: [ofa-general] Sending two integers via RDMA_WRITE
I want to use a completion queue element on the completion queue associated with the receive queue (on the remote hca) to allow reading the data buffer. But I get nothing from the completion queue.
You need to send immediate data with an RDMA write to generate a completion on the remote side. Otherwise, a receive work request is not consumed.
In the specification, it says that a CQE should be created (in the remote hca) after performing an rdma write.
See C10-87 (page 511)
[ofa-general] [PATCH] dapl/windows: remove dlist.c
All dlist functions have been moved to the header file. Remove references to dlist.c.
Signed-off-by: Sean Hefty sean.he...@intel.com
---
 dapl/openib_cma/dapl_ib_util.c |1 -
 dapl/openib_scm/dapl_ib_cq.c |1 -
 2 files changed, 0 insertions(+), 2 deletions(-)
diff --git a/dapl/openib_cma/dapl_ib_util.c b/dapl/openib_cma/dapl_ib_util.c
index bf23d43..f48c1cb 100755
--- a/dapl/openib_cma/dapl_ib_util.c
+++ b/dapl/openib_cma/dapl_ib_util.c
@@ -56,7 +56,6 @@ struct dapl_llist_entry *g_hca_list;
 #if defined(_WIN64) || defined(_WIN32)
 #include ..\..\..\..\..\etc\user\comp_channel.cpp
-#include ..\..\..\..\..\etc\user\dlist.c
 #include rdma\winverbs.h
 struct ibvw_windata windata;
diff --git a/dapl/openib_scm/dapl_ib_cq.c b/dapl/openib_scm/dapl_ib_cq.c
index 2af1889..8a9a2ab 100644
--- a/dapl/openib_scm/dapl_ib_cq.c
+++ b/dapl/openib_scm/dapl_ib_cq.c
@@ -55,7 +55,6 @@
 #if defined(_WIN64) || defined(_WIN32)
 #include ..\..\..\..\..\etc\user\comp_channel.cpp
-#include ..\..\..\..\..\etc\user\dlist.c
 void dapli_cq_thread_destroy(struct dapl_hca *hca_ptr)
 {
RE: [ofa-general] verb level interoperability between vendor's hcas
Is a mixed HCA environment cluster not ready for prime time - yet?
Are the crashes in the kernel or userspace? Is there a specific HCA on the nodes that crash? Interop testing is done, but I do not know the details of the configurations and tests that are run.
RE: [ofa-general] Re: [PATCH 0/2] Opensm support for external routing engines
The idea is to include non-open source routing algorithms into opensm on demand, which is permitted by the BSD license.

It is permitted, but I don't think that we as open source community need to support such efforts.

I agree with this. This sets a precedent of opening up the source code to all sorts of changes that become difficult to test and maintain. Anyone is free to take opensm, integrate their own changes, and release separately, but the burden of maintaining those changes should not rest on the open source community at large.

- Sean
RE: [ofa-general] [PATCH 0/9] RDMAoE - RDMA over Ethernet
LL: the RDMA stack will see that the port has different link types. SLs map cleanly to VLAN user priorities.

LL: you need to emulate *enough* so that typical applications don't need to worry about the link type. SA path queries is the best example. Otherwise, every RDMA application (not necessarily a CMA app) will need to have different code paths depending on the link type.

Let's just say that at this point I completely disagree with where these patches try to abstract the differences, which are many. RDMA apps that want to use this and IB without going through an abstraction will need different code -- just like they would for iWarp, which also provides RDMA over Ethernet, and is a standard. IB mad and SA query modules are not appropriate places for abstracting the differences between IB, iWarp, and whatever name we give this.

This could change depending on whether this is really trying to be IB with a different L2, or is just another RDMA protocol that runs on Ethernet.

- Sean
RE: [ewg] RE: [ofa-general] [PATCH 4/9] ib_core: Add RDMAoE SA support
How can a user control this? An app needs the same qkey for unicast traffic.

In RDMAoE, the qkey has a fixed well-known value, which will be returned both by multicast and path queries.

The rdma_cm defines and uses a different well-known qkey.
RE: [ofa-general] [PATCH 0/9] RDMAoE - RDMA over Ethernet
RDMA over Ethernet (RDMAoE) allows running the IB transport protocol over Ethernet, providing IB capabilities for Ethernet fabrics. The packets are standard Ethernet frames with an Ethertype, an IB GRH, unmodified IB transport headers and payload. HCA RDMAoE ports are no different than regular IB ports from the RDMA stack perspective.

I would refer to this as IBoE, not RDMAoE. The RDMA stack should see these ports as different from regular IB HCA ports. There are a lot of differences that should not simply be hidden or incorrectly assumed: QP0, QoS, multiple paths, routing(?), no SA, etc.

IB subnet management and SA services are not required for RDMAoE operation;

Then I would not try to emulate it at all. As Hal mentioned in a separate post, there are too many ways to interact with the SA that an emulation won't cover.

Ethernet management practices are used instead. In Ethernet, nodes are commonly referred to by applications by means of an IP address. RDMAoE treats IP addresses that were assigned to the corresponding Ethernet port as GIDs, and makes use of the IP stack to bind a destination address to the corresponding netdevice (just as the CMA does today for IB and iWARP) and to obtain its L2 MAC addresses.

Is the actual L3 address an IP address, or just an encoded IP address in an IBoE L3 address? What L3 protocol is being used and will it interoperate with some peer L3 protocol (IP or IB)?

- Sean
RE: [ofa-general] [PATCH 2/9] ib_core: kernel API for GID -- MAC translations
A few support functions are added to allow the translation from GID to MAC which is required by hw drivers supporting RDMAoE.

Why not just use IP to MAC calls? Or use the MAC as the GUID? Do the GIDs follow the IB GID format?
RE: [ofa-general] [PATCH 4/9] ib_core: Add RDMAoE SA support
diff --git a/drivers/infiniband/core/multicast.c b/drivers/infiniband/core/multicast.c
index 107f170..2417f6b 100644
--- a/drivers/infiniband/core/multicast.c
+++ b/drivers/infiniband/core/multicast.c
@@ -488,6 +488,36 @@ retest:
 	}
 }

+struct eth_work {
+	struct work_struct	work;
+	struct mcast_member	*member;
+	struct ib_device	*device;
+	u8			port_num;
+};
+
+static void eth_mcast_work_handler(struct work_struct *work)
+{
+	struct eth_work *w = container_of(work, struct eth_work, work);
+	int err;
+	struct ib_port_attr port_attr;
+	int status = 0;
+
+	err = ib_query_port(w->device, w->port_num, &port_attr);
+	if (err)
+		status = err;
+	else if (port_attr.state != IB_PORT_ACTIVE)
+		status = -EAGAIN;
+
+	w->member->multicast.rec.qkey = cpu_to_be32(0xc2c);

How can a user control this? An app needs the same qkey for unicast traffic.

+	atomic_inc(&w->member->refcount);

This needs to be moved below...

+	err = w->member->multicast.callback(status, &w->member->multicast);
+	deref_member(w->member);
+	if (err)
+		ib_sa_free_multicast(&w->member->multicast);
+
+	kfree(w);
+}
+
 /*
  * Fail a join request if it is still active - at the head of the pending queue.
  */
@@ -586,21 +616,14 @@ found:
 	return group;
 }

-/*
- * We serialize all join requests to a single group to make our lives much
- * easier. Otherwise, two users could try to join the same group
- * simultaneously, with different configurations, one could leave while the
- * join is in progress, etc., which makes locking around error recovery
- * difficult.
- */
-struct ib_sa_multicast *
-ib_sa_join_multicast(struct ib_sa_client *client,
-		     struct ib_device *device, u8 port_num,
-		     struct ib_sa_mcmember_rec *rec,
-		     ib_sa_comp_mask comp_mask, gfp_t gfp_mask,
-		     int (*callback)(int status,
-				     struct ib_sa_multicast *multicast),
-		     void *context)
+static struct ib_sa_multicast *
+ib_join_multicast(struct ib_sa_client *client,
+		  struct ib_device *device, u8 port_num,
+		  struct ib_sa_mcmember_rec *rec,
+		  ib_sa_comp_mask comp_mask, gfp_t gfp_mask,
+		  int (*callback)(int status,
+				  struct ib_sa_multicast *multicast),
+		  void *context)
 {
 	struct mcast_device *dev;
 	struct mcast_member *member;
@@ -647,9 +670,81 @@ err:
 	kfree(member);
 	return ERR_PTR(ret);
 }
+
+static struct ib_sa_multicast *
+eth_join_multicast(struct ib_sa_client *client,
+		   struct ib_device *device, u8 port_num,
+		   struct ib_sa_mcmember_rec *rec,
+		   ib_sa_comp_mask comp_mask, gfp_t gfp_mask,
+		   int (*callback)(int status,
+				   struct ib_sa_multicast *multicast),
+		   void *context)
+{
+	struct mcast_device *dev;
+	struct eth_work *w;
+	struct mcast_member *member;
+	int err;
+
+	dev = ib_get_client_data(device, &mcast_client);
+	if (!dev)
+		return ERR_PTR(-ENODEV);
+
+	member = kzalloc(sizeof *member, gfp_mask);
+	if (!member)
+		return ERR_PTR(-ENOMEM);
+
+	w = kzalloc(sizeof *w, gfp_mask);
+	if (!w) {
+		err = -ENOMEM;
+		goto out1;
+	}
+	w->member = member;
+	w->device = device;
+	w->port_num = port_num;
+
+	member->multicast.context = context;
+	member->multicast.callback = callback;
+	member->client = client;
+	member->multicast.rec.mgid = rec->mgid;
+	init_completion(&member->comp);
+	atomic_set(&member->refcount, 1);
+
+	ib_sa_client_get(client);
+	INIT_WORK(&w->work, eth_mcast_work_handler);
+	queue_work(mcast_wq, &w->work);
+
+	return &member->multicast;

The user could leave/destroy the multicast join request before the queued work item runs. We need to hold an additional reference on the member until the work item completes.

+
+out1:
+	kfree(member);
+	return ERR_PTR(err);
+}
+
+/*
+ * We serialize all join requests to a single group to make our lives much
+ * easier. Otherwise, two users could try to join the same group
+ * simultaneously, with different configurations, one could leave while the
+ * join is in progress, etc., which makes locking around error recovery
+ * difficult.
+ */
+struct ib_sa_multicast *
+ib_sa_join_multicast(struct ib_sa_client *client,
+		     struct ib_device *device, u8 port_num,
+		     struct ib_sa_mcmember_rec *rec,
+		     ib_sa_comp_mask comp_mask, gfp_t gfp_mask,
+		     int (*callback)(int status,
+				     struct ib_sa_multicast *multicast),
+		     void *context)
+{
+	return
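The review comment about holding an additional reference until the work item completes can be sketched in plain user-space C. This is only a model under stated assumptions: `struct member`, `member_get`, `member_put`, `join_and_queue` and `work_handler` are hypothetical names, and C11 atomics stand in for the kernel's `atomic_t`. It is not the code from the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct mcast_member; not the kernel type. */
struct member {
    atomic_int refcount;
    bool callback_ran;
};

static struct member *member_alloc(void)
{
    struct member *m = calloc(1, sizeof *m);
    if (m)
        atomic_init(&m->refcount, 1);   /* reference owned by the caller */
    return m;
}

static void member_get(struct member *m)
{
    atomic_fetch_add(&m->refcount, 1);
}

/* Drop one reference; returns true if this call freed the member. */
static bool member_put(struct member *m)
{
    int old = atomic_fetch_sub(&m->refcount, 1);
    assert(old > 0);
    if (old == 1) {
        free(m);
        return true;
    }
    return false;
}

/* Queue side: take the work item's reference BEFORE the item can run. */
static struct member *join_and_queue(void)
{
    struct member *m = member_alloc();
    if (m)
        member_get(m);      /* held on behalf of the queued work item */
    return m;
}

/* Work handler: runs the callback, then drops the work item's reference. */
static bool work_handler(struct member *m)
{
    m->callback_ran = true;
    return member_put(m);
}
```

With the extra reference taken at queue time, a caller that leaves immediately only drops the count from 2 to 1, so the member is still valid when the handler runs.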
RE: [ofa-general] [PATCH 1/9] ib_core: Add API to query port link type
This allows to get the type of a port to be either Ethernet or IB which is required by following patches for implementing RDMA over Ethernet - RDMAoE.

I don't know if this makes more sense without studying the changes in more detail, but was there a reason why node_type just wasn't extended instead?
RE: [ofa-general] spin_lock_irqsave in ib_send_mad
 	spin_lock_irqsave(&qp_info->send_queue.lock, flags);
 	if (qp_info->send_queue.count < qp_info->send_queue.max_active) {
+		qp_info->send_queue.count++;
+		spin_unlock_irqrestore(&qp_info->send_queue.lock, flags);
 		ret = ib_post_send(mad_agent->qp, &mad_send_wr->send_wr, &bad_send_wr);
+		spin_lock_irqsave(&qp_info->send_queue.lock, flags);
 		list = &qp_info->send_queue.list;
 	} else {
 		ret = 0;
+		qp_info->send_queue.count++;
 		list = &qp_info->overflow_list;
 	}
 	if (!ret)
 		list_add_tail(&mad_send_wr->mad_list.list, list);
+	else
+		qp_info->send_queue.count--;

It's not quite this simple. Once the lock is released before calling ib_post_send, another thread could come down and queue a MAD to the overflow list. If ib_post_send fails, the overflow list must be checked to see if a queued mad should now be sent.

As for being able to hold a lock when calling ib_post_send, that's something that should be allowed.

- Sean
RE: [ofa-general] spin_lock_irqsave in ib_send_mad
Why check the overflow list only when the ib_post_send fails? Don't you need to do this regardless? It looks like you could get stuff into the overflow list even with the existing code...

You only need to check it when decrementing send_queue.count, which is currently only after a send completes.
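The race described in this thread can be modelled in a few lines of single-threaded C. This is a sketch, not the mad.c code: `struct send_queue`, `sq_send`, `sq_release_slot` and the `post` callbacks are hypothetical, a boolean stands in for the spinlock, and a "racing" sender is simulated by having the post callback queue a second send while the lock is dropped. As in the kernel logic quoted above, `count` covers both posted and overflow entries.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of the MAD send queue; not kernel code. */
struct send_queue {
    bool locked;        /* stands in for the spinlock */
    int count;          /* posted entries plus overflow entries */
    int max_active;
    int overflow;       /* entries parked on the overflow list */
    int promoted;       /* overflow entries that were later posted */
};

static void sq_lock(struct send_queue *q)   { assert(!q->locked); q->locked = true; }
static void sq_unlock(struct send_queue *q) { assert(q->locked);  q->locked = false; }

/* Every decrement of count must check whether an overflow entry can go out. */
static void sq_release_slot(struct send_queue *q)
{
    assert(q->locked);
    if (q->count-- > q->max_active) {
        q->overflow--;
        q->promoted++;  /* the parked send would be posted here */
    }
}

/* post() is a stand-in for ib_post_send(); 0 means success. */
static int sq_send(struct send_queue *q, int (*post)(struct send_queue *))
{
    int ret = 0;

    sq_lock(q);
    if (q->count < q->max_active) {
        q->count++;             /* reserve the slot before dropping the lock */
        sq_unlock(q);
        ret = post(q);          /* lock is not held across the post */
        sq_lock(q);
        if (ret)
            sq_release_slot(q); /* failed post: give back slot, check overflow */
    } else {
        q->count++;
        q->overflow++;
    }
    sq_unlock(q);
    return ret;
}

static int post_ok(struct send_queue *q) { (void)q; return 0; }

/* Simulates a second sender arriving while the lock is dropped, then a failed post. */
static int post_fail_with_race(struct send_queue *q)
{
    sq_send(q, post_ok);
    return -1;
}
```

With `max_active` set to 1, calling `sq_send(&q, post_fail_with_race)` parks the racing send on the overflow list and then promotes it when the failed post gives its slot back. Decrementing `count` without that check would leave the parked send stranded.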
[ofa-general] [ofw] [PATCH-resend] ib-mgmt/libibnetdisc: fix typecast warning
Signed-off-by: Sean Hefty sean.he...@intel.com
---
I tried converting ib_portid_t lid to a uint16_t, but that resulted in a cascade of changes, so I kept the simple approach. :)

Resending - I didn't see a response to this.

 infiniband-diags/libibnetdisc/src/ibnetdisc.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
index 1e93ff8..baea98e 100644
--- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c
+++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
@@ -188,7 +188,7 @@ extend_dpath(struct ibnd_fabric *f, ib_portid_t *portid, int nextport)
 				f->fabric.ibmad_port) < 0)
 			return -1;

-		portid->drpath.drslid = f->selfportid.lid;
+		portid->drpath.drslid = (uint16_t) f->selfportid.lid;
 		portid->drpath.drdlid = 0x;
 	}
RE: [ofa-general] spin_lock_irqsave in ib_send_mad
mad.c:ib_send_mad() calls ib_post_send() after taking spin_lock_irqsave(). Is it really necessary to take the spinlock during the entire time of ib_post_send()? It appears like it is only necessary for list manipulation!

It protects the list and the counters. It's technically not needed around ib_post_send.

- Sean
RE: [ofa-general] Memory registration redux
Are there any comparable Windows plans?

I believe that Windows already provides equivalent functionality as part of the OS (Windows 2008 / Vista). See SecureMemoryCacheCallback. There are no plans for WinOF to provide anything separately from this. (It's likely impossible without OS support anyway.)

- Sean
[ofa-general] RE: [ofw] skipping QP states during transitions
No, you need to move from reset to init to RTR and only then to RTS.

Ok - thanks.

Look at the IB spec on section 10.3

I was just exploring whether any hardware, separate from the existing software stacks, supported 'skipping' QP states -- assuming necessary values for the other states were also given. In theory, hardware could walk through the states internally. The motivation is to decrease the time to connect QPs by reducing the number of commands that need to be issued to the hardware.

And to be clear, I'm not suggesting that such a feature is all that important. I'm just exploring ideas.

- Sean
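The connect-path ordering discussed here (reset, init, RTR, RTS) can be written down as a small check. This is an illustrative sketch only: the enum and function names are hypothetical (real code would use `enum ibv_qp_state` and `ibv_modify_qp` from libibverbs), and it covers just the forward connect path, not the error, reset or drain transitions that the spec also defines.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the four states on the connect path. */
enum qp_state { QP_RESET, QP_INIT, QP_RTR, QP_RTS };

/* On the connect path each modify command advances exactly one state. */
static bool qp_step_allowed(enum qp_state from, enum qp_state to)
{
    return (int)to == (int)from + 1 && to <= QP_RTS;
}

/* Number of modify commands needed to move forward from one state to another. */
static int qp_commands_needed(enum qp_state from, enum qp_state to)
{
    assert(from <= to);
    return (int)to - (int)from;
}
```

Under this model, going from reset to RTS costs three commands, which is the overhead the question about 'skipping' states is trying to reduce.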
[ofa-general] skipping QP states during transitions
Does anyone know if the HCAs are capable of transitioning directly from reset to RTS using a single command?
[ofa-general] [PATCH] dapl/windows cma provider: add support for network devices based on index
The linux cma provider provides support for named network devices, such as 'ib0' or 'eth0'. This allows the same dapl configuration file to be used easily across a cluster.

To allow similar support on Windows, allow users to specify the device name 'rdma_devN' in the dapl.conf file. The given index, N, is mapped to a corresponding IP address that is associated with an RDMA device.

Signed-off-by: Sean Hefty sean.he...@intel.com
---
diff -up -r -X \mshefty\scm\winof\trunk\docs\dontdiff.txt -I '\$Id:' trunk\ulp\dapl2/dapl/openib_cma/dapl_ib_util.c branches\winverbs\ulp\dapl2/dapl/openib_cma/dapl_ib_util.c
--- trunk\ulp\dapl2/dapl/openib_cma/dapl_ib_util.c 2009-05-01 10:18:28.0 -0700
+++ branches\winverbs\ulp\dapl2/dapl/openib_cma/dapl_ib_util.c 2009-06-02 15:26:19.534649800 -0700
@@ -57,10 +57,50 @@ struct dapl_llist_entry *g_hca_list;
 #if defined(_WIN64) || defined(_WIN32)
 #include "..\..\..\..\..\etc\user\comp_channel.cpp"
 #include "..\..\..\..\..\etc\user\dlist.c"
+#include <rdma\winverbs.h>

-#define getipaddr_netdev(x,y,z) -1
 struct ibvw_windata windata;

+static int getipaddr_netdev(char *name, char *addr, int addr_len)
+{
+	IWVProvider *prov;
+	WV_DEVICE_ADDRESS devaddr;
+	struct addrinfo *res, *ai;
+	HRESULT hr;
+	int index;
+
+	if (strncmp(name, "rdma_dev", 8)) {
+		return EINVAL;
+	}
+
+	index = atoi(name + 8);
+
+	hr = WvGetObject(&IID_IWVProvider, (LPVOID *) &prov);
+	if (FAILED(hr)) {
+		return hr;
+	}
+
+	hr = getaddrinfo("..localmachine", NULL, NULL, &res);
+	if (hr) {
+		goto release;
+	}
+
+	for (ai = res; ai; ai = ai->ai_next) {
+		hr = prov->lpVtbl->TranslateAddress(prov, ai->ai_addr, &devaddr);
+		if (SUCCEEDED(hr) && (ai->ai_addrlen <= addr_len) && (index-- == 0)) {
+			memcpy(addr, ai->ai_addr, ai->ai_addrlen);
+			goto free;
+		}
+	}
+	hr = ENODEV;
+
+free:
+	freeaddrinfo(res);
+release:
+	prov->lpVtbl->Release(prov);
+	return hr;
+}
+
 static int dapls_os_init(void)
 {
 	return ibvw_get_windata(&windata, IBVW_WINDATA_VERSION);
diff -up -r -X \mshefty\scm\winof\trunk\docs\dontdiff.txt -I '\$Id:' trunk\ulp\dapl2/dapl/openib_cma/SOURCES branches\winverbs\ulp\dapl2/dapl/openib_cma/SOURCES
--- trunk\ulp\dapl2/dapl/openib_cma/SOURCES 2009-05-27 07:25:19.0 -0700
+++ branches\winverbs\ulp\dapl2/dapl/openib_cma/SOURCES 2009-06-02 10:38:04.799012200 -0700
@@ -45,10 +45,12 @@ TARGETLIBS= \
 	$(SDK_LIB_PATH)\ws2_32.lib \
 !if $(FREEBUILD)
 	$(TARGETPATH)\*\dat2.lib \
+	$(TARGETPATH)\*\winverbs.lib \
 	$(TARGETPATH)\*\libibverbs.lib \
 	$(TARGETPATH)\*\librdmacm.lib
 !else
 	$(TARGETPATH)\*\dat2d.lib \
+	$(TARGETPATH)\*\winverbsd.lib \
 	$(TARGETPATH)\*\libibverbsd.lib \
 	$(TARGETPATH)\*\librdmacmd.lib
 !endif
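The 'rdma_devN' naming rule from the patch description can be exercised on its own, without any Windows or WinVerbs dependency. The helper below is a hypothetical sketch, not the patch's code: the patch uses `strncmp` plus `atoi`, whereas this version swaps in `strtol` so that names such as 'rdma_dev' (no index) or 'rdma_dev1x' (trailing junk) are rejected instead of being silently read as index 0 or 1.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define RDMA_DEV_PREFIX "rdma_dev"

/*
 * Hypothetical helper: parse "rdma_devN" and store N in *index.
 * Returns 0 on success, EINVAL if the name is not of that form.
 */
static int parse_rdma_dev_index(const char *name, int *index)
{
    const size_t plen = strlen(RDMA_DEV_PREFIX);
    char *end;
    long val;

    assert(name != NULL && index != NULL);
    if (strncmp(name, RDMA_DEV_PREFIX, plen) != 0)
        return EINVAL;

    /* Require a digit right after the prefix: no sign, no whitespace. */
    if (!isdigit((unsigned char)name[plen]))
        return EINVAL;

    errno = 0;
    val = strtol(name + plen, &end, 10);
    if (*end != '\0' || errno != 0 || val > INT_MAX)
        return EINVAL;

    *index = (int)val;
    return 0;
}
```

The returned index would then select the Nth local address that translates to an RDMA device, as the loop in the patch does with `index--`.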
RE: [ofa-general] SubnAdmGet (6777)
I could not find anywhere in the spec how should the SA respond to SubnAdmGet() in case there is more than one record. What I did find is an example of path query mad, and it was with SubnAdmGetTable().

PR NumbPath - 'In a SubnAdmGet() query request, ignored; a value of 1 is used.'

I'm not sure how else you can interpret this except to mean the same as for SubnAdmGetTable: 'If more paths that satisfy the PathRecord query exist for a given SGID-DGID combination, only NumbPath paths shall be returned (implementation defined).'

- Sean
RE: [ofa-general] SubnAdmGet (6777)
No, it is correct as is (returning an error of too many records for this case). See p.944:

15.4.6 SUBNADMGET() / SUBNADMGETRESP(): GET AN ATTRIBUTE

C15-0.1.30: In response to a SubnAdmGet(), if a single attribute would be returned based on the access rules specified in 15.4.1 Restrictions on Access on page 938 and the matching of components specified by the ComponentMask, then SubnAdmGetResp() shall return that attribute with a zero status value.

C15-0.1.31: If SubnAdmGet() fails to satisfy C15-0.1.30:, SubnAdmGetResp() shall return with the status field providing the reason for failure (see Table 190 SA MAD Class-Specific Status Encodings on page 900).

This ignores NumbPath = 1 (or defines NumbPath differently for PR SubnAdmGet versus SubnAdmGetTable). With NumbPath = 1, only a single attribute should be returned.

- Sean
RE: [ofa-general] SubnAdmGet (6777)
Yes, it is different from GetTable in that SA pares the responses down to that but Get doesn't (have that additional language to pare them down).

This seems like an implementation issue (aka bug) with the SA to me.

The language about NumbPath for Get was originally added to indicate that the NumbPath was ignored on a Get even if it was included in the component mask.

It states that it's ignored and a value of 1 is used. What else would a NumbPath value of 1 mean if it's completely ignored? I consider this a spec bug. :)

From an implementation view, requiring users to use SubnAdmGetTable to get a single path record is less efficient than returning a single PR from SubnAdmGet. How have other SM implementations (not based on opensm) interpreted NumbPath for PR SubnAdmGet?
RE: [ofa-general] RDMA_CM--how to include an SRQ?
Is there an example of how to incorporate an SRQ into using RDMA CM and IB verbs? Thank you for any assistance or suggestions.

I don't believe so. libibverbs has an srq_pingpong example program that uses an SRQ.

Using an SRQ with the rdma_cm is basically trivial if the QP is created using rdma_create_qp. The rdma_cm reads the struct ibv_srq * field from the struct ibv_qp when establishing a connection. If the QP is created directly from libibverbs (ibv_create_qp), then the user should just indicate that an SRQ is in use when connecting.

Note that I don't believe there's any real requirement to indicate to the remote side of a connection that an SRQ is in use. The remote QP doesn't use this information.

- Sean
RE: [ofa-general] two questions about RDMA_CM_EVENT_TIMEWAIT_EXIT and the TimeWait state
Note that a lot of (most?) connections between QPs are established out of band using TCP, and these are not tracked by the CM or go through any sort of timewait before potentially being reused.

I don't quite understand this. Could you please point me to places (code, IB spec, so on) where I could poke around?

MPIs typically connect QPs by connecting over sockets and exchanging the QP information that way. The QPs are then modified directly using a combination of locally read and hard-coded values. The libibverbs examples along with the perftest programs can connect QPs in this fashion.
RE: [ofa-general] two questions about RDMA_CM_EVENT_TIMEWAIT_EXIT and the TimeWait state
In 12.9.6 of the InfiniBand Architecture v1.2, it seemed that a QP could enter the TimeWait state without having entered the Established state first, via the RTU timeout. Could a RDMA_CM_EVENT_TIMEWAIT_EXIT happen right after a RDMA_CM_EVENT_CONNECT_REQUEST without a RDMA_CM_EVENT_ESTABLISHED? If yes, our ULP would have to clean up some resources in case RDMA_CM_EVENT_TIMEWAIT_EXIT happens on passive side.

Yes, it's possible to enter timewait without going through established. I'd have to walk through the code at this point to identify all of the cases.

Note that a lot of (most?) connections between QPs are established out of band using TCP, and these are not tracked by the CM or go through any sort of timewait before potentially being reused.

- Sean
RE: [ofa-general] [PATCH] uDAPL (v2.0) linux_osd: use pthread_self instead of getpid for debug messages
Please copy the ofw mail list on dapl changes.

diff --git a/dapl/udapl/linux/dapl_osd.h b/dapl/udapl/linux/dapl_osd.h
index 1c098c5..0378a70 100644
--- a/dapl/udapl/linux/dapl_osd.h
+++ b/dapl/udapl/linux/dapl_osd.h
@@ -572,8 +572,7 @@ dapl_os_strtol(const char *nptr, char **endptr, int base)
 #define dapl_os_vprintf(fmt,args) vprintf(fmt,args)
 #define dapl_os_syslog(fmt,args) vsyslog(LOG_USER|LOG_WARNING,fmt,args)
-#define dapl_os_getpid getpid
-
+#define dapl_os_getpid (long int)pthread_self

Maybe add a new call, dapl_os_get_thread_id or something similar, to avoid confusion with the name and what the call returns.
RE: [ofa-general] [PATCH] uDAPL (v2.0) linux_osd: use pthread_self instead of getpid for debug messages
diff --git a/dapl/common/dapl_debug.c b/dapl/common/dapl_debug.c
index 20ee405..6c6eeb5 100644
--- a/dapl/common/dapl_debug.c
+++ b/dapl/common/dapl_debug.c
@@ -50,7 +50,7 @@ void dapl_internal_dbg_log(DAPL_DBG_TYPE type, const char *fmt, ...)
 	if (DAPL_DBG_DEST_STDOUT & g_dapl_dbg_dest) {
 		va_start(args, fmt);
 		fprintf(stdout, "%s:%lx: ", _ptr_host_,
-			dapl_os_getpid());
+			dapl_os_gettid());
 		dapl_os_vprintf(fmt, args);
 		va_end(args);
 	}
diff --git a/dapl/udapl/linux/dapl_osd.h b/dapl/udapl/linux/dapl_osd.h
index 0378a70..e0e30bf 100644
--- a/dapl/udapl/linux/dapl_osd.h
+++ b/dapl/udapl/linux/dapl_osd.h
@@ -572,7 +572,8 @@ dapl_os_strtol(const char *nptr, char **endptr, int base)
 #define dapl_os_vprintf(fmt,args) vprintf(fmt,args)
 #define dapl_os_syslog(fmt,args) vsyslog(LOG_USER|LOG_WARNING,fmt,args)
-#define dapl_os_getpid (long int)pthread_self
+#define dapl_os_getpid (int)getpid
+#define dapl_os_gettid (long int)pthread_self

That's fine - what about Windows? :)
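One possible shape for a thread-id helper that covers both platforms is sketched below. It is an assumption, not what dapl ended up with: `os_gettid` is a hypothetical name, the Windows branch uses the Win32 `GetCurrentThreadId` call, and the POSIX branch relies on `pthread_t` being an integer type, which holds on Linux but is not guaranteed by POSIX.

```c
#ifdef _WIN32
#include <windows.h>
#else
#include <pthread.h>
#endif

/* Hypothetical helper; returns a value suitable for a "%lx" debug prefix. */
static unsigned long os_gettid(void)
{
#ifdef _WIN32
    return (unsigned long)GetCurrentThreadId();
#else
    /* pthread_t is opaque in POSIX; this cast is a Linux-specific assumption. */
    return (unsigned long)pthread_self();
#endif
}
```

Keeping the process id and thread id behind separate names, as the second revision of the patch does, avoids a call named getpid that returns something else.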
RE: [ofa-general] How to establish IB communcation more effectively?
Just to make sure we're on the same page: both IPoIB and the RDMA-CM use SA path queries

But ipoib caches its path records...

- Sean
RE: [ofa-general] How to establish IB communcation more effectively?
Yes, of course. But, to start with, let's analyze the case of each node running --one-- rank and then take it from there to the case where each node runs C ranks.

The caching is independent of running MPI though. To get a fair comparison, you'd probably have to reboot the entire cluster before running the test and ensure that no other communication between the nodes occurs over ipoib.

For myself, I'm not sure that the tests are the same. The DAPL providers create and modify the QPs differently. I'd need to walk through the code to see whether QP creation time is included and verify that the QP modify calls are the same.

As for responding to the initial question, using sockets with hard-coded values seems to be the most common way to establish IB connections at scale, though I would guess that using the ib_cm with hard-coded values would work about the same.

- Sean
[ofa-general] [PATCH] ib-mgmt: fixup ibsendtrap for windows
Fix some typecast issues.

Signed-off-by: Sean Hefty sean.he...@intel.com
---
 infiniband-diags/src/ibsendtrap.c | 12 ++--
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/infiniband-diags/src/ibsendtrap.c b/infiniband-diags/src/ibsendtrap.c
index 469bc39..7ad588e 100644
--- a/infiniband-diags/src/ibsendtrap.c
+++ b/infiniband-diags/src/ibsendtrap.c
@@ -66,10 +66,10 @@ static int get_node_type(ib_portid_t *port)
 static void build_trap144(ib_mad_notice_attr_t * n, ib_portid_t *port)
 {
 	n->generic_type = 0x80 | IB_NOTICE_TYPE_INFO;
-	n->g_or_v.generic.prod_type_lsb = cl_hton16(get_node_type(port));
+	n->g_or_v.generic.prod_type_lsb = cl_hton16((uint16_t) get_node_type(port));
 	n->g_or_v.generic.trap_num = cl_hton16(144);
-	n->issuer_lid = cl_hton16(port->lid);
-	n->data_details.ntc_144.lid = cl_hton16(port->lid);
+	n->issuer_lid = cl_hton16((uint16_t) port->lid);
+	n->data_details.ntc_144.lid = n->issuer_lid;
 	n->data_details.ntc_144.local_changes =
 	    TRAP_144_MASK_OTHER_LOCAL_CHANGES;
 	n->data_details.ntc_144.change_flgs =
@@ -79,10 +79,10 @@ static void build_trap144(ib_mad_notice_attr_t * n, ib_portid_t *port)
 static void build_trap129(ib_mad_notice_attr_t * n, ib_portid_t *port)
 {
 	n->generic_type = 0x80 | IB_NOTICE_TYPE_URGENT;
-	n->g_or_v.generic.prod_type_lsb = cl_hton16(get_node_type(port));
+	n->g_or_v.generic.prod_type_lsb = cl_hton16((uint16_t) get_node_type(port));
 	n->g_or_v.generic.trap_num = cl_hton16(129);
-	n->issuer_lid = cl_hton16(port->lid);
-	n->data_details.ntc_129_131.lid = cl_hton16(port->lid);
+	n->issuer_lid = cl_hton16((uint16_t) port->lid);
+	n->data_details.ntc_129_131.lid = n->issuer_lid;
 	n->data_details.ntc_129_131.pad = 0;
 	n->data_details.ntc_129_131.port_num = (uint8_t) error_port;
 }
[ofa-general] RE: [PATCH 4/4] ib-mgmt/ibn3 branch: libibnetdisc add windows support
+#include <infiniband/mad_osd.h>

Why is this inclusion needed? mad_osd.h is included via mad.h.

It's not then, but I prefer to include necessary files directly, rather than relying on other include files to pick them up.
[ofa-general] RE: [PATCH 4/4] ib-mgmt/ibn3 branch: libibnetdisc add windows support
I would agree in general, but in this specific case it is *_osd.h - system dependent file which is not included directly, at least not in libibmad and infiniband-diags up to now (hypothetically in some implementations it may not exist at all).

libibmad mad.h includes mad_osd.h directly. I added it to ibnetdisc.h, because libibnetdisc is a new library and requires OS dependent mechanisms (i.e. MAD_EXPORT) to export the new interfaces. I agree in trying to keep mad_osd.h out of the diags, but libibnetdisc is special within the diags...

I really don't have a strong preference on this, so whatever you want is fine.
RE: [ofa-general] [PATCH/Resend] Fixed capability mask problem in ibstat introduec by commit 722b6c6428c9e4921a81f4a6db2838bcee660bb7
OTOH I cannot understand why port->capmask is defined as uint64_t and not as 32-bit. Kernel uses 32-bit value and it is shown in this file as 0x%0x. What about to convert type of port->capmask to uint32_t?

I think that makes the most sense.
RE: [ofa-general] [PATCH/Resend] Fixed capability mask problem in ibstat introduec by commit 722b6c6428c9e4921a81f4a6db2838bcee660bb7
diff --git a/infiniband-diags/src/ibstat.c b/infiniband-diags/src/ibstat.c
index 7985be1..99af9a8 100644
--- a/infiniband-diags/src/ibstat.c
+++ b/infiniband-diags/src/ibstat.c
@@ -111,7 +111,7 @@ port_dump(umad_port_t *port, int alone)
 	printf("%sBase lid: %d\n", pre, port->base_lid);
 	printf("%sLMC: %d\n", pre, port->lmc);
 	printf("%sSM lid: %d\n", pre, port->sm_lid);
-	printf("%sCapability mask: 0x%08x\n", pre, (unsigned)ntohll(port->capmask));
+	printf("%sCapability mask: 0x%08x\n", pre, (unsigned)(ntohl((uint32_t)(port->capmask))));

Casting from 64-bit to 32-bit, then byte swapping doesn't look right. I think the problem may be in libibumad, umad.c, line 166:

	if (sys_read_uint64(port_dir, SYS_PORT_CAPMASK, &port->capmask) < 0)
		goto clean;
	port->capmask = htonl(port->capmask);

capmask is read as a 64-bit value, but only 32-bit swap is used. (libibumad is not shared between Linux and Windows, so this problem doesn't show up on Windows.)

- Sean
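The width mismatch described above is easy to reproduce with explicit swap helpers. The sketch below is only an illustration: `swap32`, `swap64` and `swap32_on_u64` are hypothetical stand-ins for what `htonl`/`ntohll` do on a little-endian host, and the capability mask value is made up for the example.

```c
#include <stdint.h>

/* Unconditional byte swaps, written out so the result does not depend on the host. */
static uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

static uint64_t swap64(uint64_t v)
{
    return ((uint64_t)swap32((uint32_t)v) << 32) | swap32((uint32_t)(v >> 32));
}

/* A 32-bit swap applied to a 64-bit field: the result stays in the low half. */
static uint64_t swap32_on_u64(uint64_t v)
{
    return swap32((uint32_t)v);
}
```

Storing a mask with the 32-bit swap and reading it back with the 64-bit swap moves the data into the upper half of the word, so truncating to `unsigned` for `%08x` prints zero. Using the same width on both sides (for example by making the field 32-bit, as suggested elsewhere in this thread) round-trips correctly.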
RE: [ofa-general] [PATCH v3 1/3] Create a new library libibnetdisc
Where does the definition for ibdebug come from?
It is in ibdiag_common.c. Every infiniband-ibdiag tool is linked with it. And yes, using this in this library can be problematic since it introduces a hidden dependency.
How does that work? The library doesn't link ibdiag_common.c, so I'm not sure what definition it picks up. Maybe it defaults to undefined, assumed int... To get things to build and run on Windows, I defined it as a static in the library.
RE: [ofa-general] [PATCH v3 1/3] Create a new library libibnetdisc
There is also an ibdebug defined in libibmad. extern int ibdebug; This is the one it is using... :-/ I think there should be a wrapper function. Perhaps madrpc_show_errors?
Yes - that's the one it picks up. Adding a wrapper makes sense to me. (I don't think that declaring a variable as extern is sufficient to share it across library boundaries in windows.)
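A wrapper of the kind discussed above could look like the sketch below. All names here (NETDISC_EXPORT, netdisc_set_debug, netdisc_get_debug) are hypothetical and only illustrate the pattern; they are not the libibmad or libibnetdisc interfaces. The point is that the variable stays private to the library and only functions cross the library boundary, which avoids the extra import decoration that exported data needs on Windows.

```c
/* Hypothetical export macro; a real library would use its own (e.g. MAD_EXPORT). */
#if defined(_WIN32)
#define NETDISC_EXPORT __declspec(dllexport)
#else
#define NETDISC_EXPORT
#endif

/* Library-private state: never referenced directly by callers. */
static int debug_level;

/* Setter exported from the library instead of the variable itself. */
NETDISC_EXPORT void netdisc_set_debug(int level)
{
	debug_level = level;
}

/* Getter for code that needs to read the current level. */
NETDISC_EXPORT int netdisc_get_debug(void)
{
	return debug_level;
}
```

A tool would then call netdisc_set_debug(1) rather than assigning to an extern int.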
[ofa-general] RE: [PATCH v2] rdma_cm: Add debugfs entries to monitor rdma_cm connections
The output is much easier to read. :)

@@ -59,6 +62,10 @@ MODULE_LICENSE("Dual BSD/GPL");
 #define CMA_MAX_CM_RETRIES 15
 #define CMA_CM_MRA_SETTING (IB_CM_MRA_FLAG_DELAY | 24)
+#define CASE_RET(val, ret) case val: return #ret;

I would just drop this abstraction.

+static const char *format_node_type(enum rdma_node_type nt)
+{
+	enum rdma_transport_type tt;
+	if (nt) {
+		tt = rdma_node_get_transport(nt);
+		switch (tt) {

We don't really need the local variable tt.

+static int cma_rdma_id_seq_show(struct seq_file *file, void *v)
+{
+	struct rdma_id_private *id_priv;
+	char local_addr[64], remote_addr[64];
+
+	if (!v)
+		return 0;
+	if (v == SEQ_START_TOKEN) {
+		seq_printf(file,
+			   "%-5s"
+			   "%-8s"
+			   "%-5s"
+			   "%-8s"
+			   "%-52s"
+			   "%-52s"
+			   "%-6s"
+			   "%-15s"
+			   "%-8s"
+			   "\n",
+			   "TYPE", "DEVICE", "PORT", "NET_DEV", "SRC_ADDR", "DST_ADDR", "SPACE", "STATE", "QP_NUM");
+	} else {
+		id_priv = list_entry(v, struct rdma_id_private, list);
+		format_addr((struct sockaddr *)&id_priv->id.route.addr.src_addr,
+			    local_addr);
+		format_addr((struct sockaddr *)&id_priv->id.route.addr.dst_addr,
+			    remote_addr);
+
+		seq_printf(file,
+			   "%-5s"
+			   "%-8s"
+			   "%-5d"
+			   "%-8s"
+			   "%-52s"
+			   "%-52s"
+			   "%-6s"
+			   "%-15s"
+			   "%-8d"
+			   "\n",
+			   format_node_type(id_priv->id.route.addr.dev_addr.dev_type),
+			   (id_priv->id.device) ? id_priv->id.device->name : "",
+			   id_priv->id.port_num,
+			   (id_priv->id.route.addr.dev_addr.src_dev) ? id_priv->id.route.addr.dev_addr.src_dev->name : "",
+			   local_addr,
+			   remote_addr,
+			   format_port_space(id_priv->id.ps),
+			   format_cma_state(id_priv->state),
+			   id_priv->qp_num);
+	}

I still think this requires a lot of scrolling to get past a couple of print statements. Can we at least collapse the %-5s ... \n stuff down to a single line?
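The two review comments (drop the CASE_RET macro, collapse the format string) can be illustrated with a small user-space sketch. This is not the rdma_cm patch: enum conn_state, format_state, HDR_FMT, ROW_FMT, format_header and format_row are hypothetical names, the column set is shortened, and snprintf stands in for seq_printf.

```c
#include <stdio.h>

/* Hypothetical connection states, standing in for the cma state enum. */
enum conn_state { CONN_IDLE, CONN_CONNECT, CONN_DISCONNECT };

/* Plain switch instead of a CASE_RET-style macro. */
static const char *format_state(enum conn_state s)
{
	switch (s) {
	case CONN_IDLE:       return "IDLE";
	case CONN_CONNECT:    return "CONNECT";
	case CONN_DISCONNECT: return "DISCONNECT";
	default:              return "UNKNOWN";
	}
}

/* One format string per row kind, each on a single line. */
#define HDR_FMT "%-5s %-8s %-5s %-15s\n"
#define ROW_FMT "%-5s %-8s %-5d %-15s\n"

/* Writes the heading row; returns the snprintf result. */
static int format_header(char *buf, size_t size)
{
	return snprintf(buf, size, HDR_FMT, "TYPE", "DEVICE", "PORT", "STATE");
}

/* Writes one data row with readable strings instead of raw numeric codes. */
static int format_row(char *buf, size_t size, const char *type,
		      const char *dev, int port, enum conn_state state)
{
	return snprintf(buf, size, ROW_FMT, type, dev, port, format_state(state));
}
```

Keeping the header and row formats side by side makes it easy to check that the column widths match.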
[ofa-general] [PATCH 1/4] ib-mgmt/ibn3 branch: diags updated for continued windows support
Signed-off-by: Sean Hefty sean.he...@intel.com --- This patch is based on the ibn3 branch infiniband-diags/src/ibaddr.c|1 + infiniband-diags/src/iblinkinfo.c|4 ++-- infiniband-diags/src/ibnetdiscover.c |2 +- infiniband-diags/src/ibsendtrap.c|4 ++-- infiniband-diags/src/vendstat.c |4 ++-- 5 files changed, 8 insertions(+), 7 deletions(-) diff --git a/infiniband-diags/src/ibaddr.c b/infiniband-diags/src/ibaddr.c index bb22be9..7909a52 100644 --- a/infiniband-diags/src/ibaddr.c +++ b/infiniband-diags/src/ibaddr.c @@ -39,6 +39,7 @@ #include stdlib.h #include unistd.h #include getopt.h +#include arpa/inet.h #include infiniband/umad.h #include infiniband/mad.h diff --git a/infiniband-diags/src/iblinkinfo.c b/infiniband-diags/src/iblinkinfo.c index 1e43788..c6ce81b 100644 --- a/infiniband-diags/src/iblinkinfo.c +++ b/infiniband-diags/src/iblinkinfo.c @@ -48,7 +48,7 @@ #include errno.h #include inttypes.h -#include infiniband/complib/cl_nodenamemap.h +#include complib/cl_nodenamemap.h #include infiniband/ibnetdisc.h char *argv0 = iblinkinfotest; @@ -284,7 +284,7 @@ main(int argc, char **argv) { compat, 0, 0, 3}, { from, 1, 0, 'f'}, { R, 0, 0, 'R'}, - { } + { 0 } }; f = stdout; diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c index 99750f0..2ca696e 100644 --- a/infiniband-diags/src/ibnetdiscover.c +++ b/infiniband-diags/src/ibnetdiscover.c @@ -210,7 +210,7 @@ out_chassis(ibnd_fabric_t *fabric, int chassisnum) uint64_t guid; fprintf(f, \nChassis %d, chassisnum); - guid = ibnd_get_chassis_guid(fabric, chassisnum); + guid = ibnd_get_chassis_guid(fabric, (unsigned char) chassisnum); if (guid) fprintf(f, (guid 0x% PRIx64 ), guid); fprintf(f, \n); diff --git a/infiniband-diags/src/ibsendtrap.c b/infiniband-diags/src/ibsendtrap.c index d0afca0..13f125f 100644 --- a/infiniband-diags/src/ibsendtrap.c +++ b/infiniband-diags/src/ibsendtrap.c @@ -73,7 +73,7 @@ static void build_trap129(ib_mad_notice_attr_t * n, uint16_t lid) n-issuer_lid = 
cl_hton16(lid); n-data_details.ntc_129_131.lid = cl_hton16(lid); n-data_details.ntc_129_131.pad = 0; - n-data_details.ntc_129_131.port_num = error_port; + n-data_details.ntc_129_131.port_num = (uint8_t) error_port; } static int send_trap(const char *name, @@ -100,7 +100,7 @@ static int send_trap(const char *name, trap_rpc.dataoffs = IB_SMP_DATA_OFFS; memset(notice, 0, sizeof(notice)); - build(notice, selfportid.lid); + build(notice, (uint16_t) selfportid.lid); return mad_send_via(trap_rpc, sm_port, NULL, notice, srcport); } diff --git a/infiniband-diags/src/vendstat.c b/infiniband-diags/src/vendstat.c index 240c4cb..0bf9616 100644 --- a/infiniband-diags/src/vendstat.c +++ b/infiniband-diags/src/vendstat.c @@ -184,8 +184,8 @@ void config_counter_groups(ib_portid_t *portid, int port) cg_config = (is4_config_counter_groups_t *)buf; printf(counter_groups_config: configuring group0 %d group1 %d\n, cg0, cg1); - cg_config-group_selects[0].group_select = cg0; - cg_config-group_selects[1].group_select = cg1; + cg_config-group_selects[0].group_select = (uint8_t) cg0; + cg_config-group_selects[1].group_select = (uint8_t) cg1; if (!ib_vendor_call_via(buf, portid, call, srcport)) IBERROR(config counter group set); ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] [PATCH 2/4] ib-mgmt/ibn3 branch: libibmad update for windows support
Signed-off-by: Sean Hefty <sean.he...@intel.com>
---
 libibmad/src/portid.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/libibmad/src/portid.c b/libibmad/src/portid.c
index de9e2d3..6f8fea2 100644
--- a/libibmad/src/portid.c
+++ b/libibmad/src/portid.c
@@ -38,6 +38,7 @@
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
+#include <arpa/inet.h>
 #include <infiniband/mad.h>
[ofa-general] [PATCH 3/4] ib-mgmt/ibn3 branch: libibmad: remove ib_resolve_guid function prototype
This function isn't implemented.

Signed-off-by: Sean Hefty <sean.he...@intel.com>
---
 libibmad/include/infiniband/mad.h |    3 ---
 libibmad/src/libibmad.map         |    1 -
 2 files changed, 0 insertions(+), 4 deletions(-)

diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
index b8290a7..188b66b 100644
--- a/libibmad/include/infiniband/mad.h
+++ b/libibmad/include/infiniband/mad.h
@@ -844,9 +844,6 @@ MAD_EXPORT int ib_path_query_via(const struct ibmad_port *srcport,
 /* resolve.c */
 MAD_EXPORT int ib_resolve_smlid(ib_portid_t * sm_id, int timeout) DEPRECATED;
-MAD_EXPORT int ib_resolve_guid(ib_portid_t * portid, uint64_t * guid,
-			       ib_portid_t * sm_id, int timeout)
-    DEPRECATED;
 MAD_EXPORT int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str,
 				     enum MAD_DEST dest, ib_portid_t * sm_id)
     DEPRECATED;
diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map
index 4306dbc..daa9319 100644
--- a/libibmad/src/libibmad.map
+++ b/libibmad/src/libibmad.map
@@ -58,7 +58,6 @@ IBMAD_1.3 {
 		mad_register_server;
 		mad_register_client_via;
 		mad_register_server_via;
-		ib_resolve_guid;
 		ib_resolve_portid_str;
 		ib_resolve_self;
 		ib_resolve_smlid;
[ofa-general] [PATCH 4/4] ib-mgmt/ibn3 branch: libibnetdisc add windows support
Allow libibnetdisc to build and run on Windows as part of the WinOF distribution Signed-off-by: Sean Hefty sean.he...@intel.com --- .../libibnetdisc/include/infiniband/ibnetdisc.h| 48 --- infiniband-diags/libibnetdisc/src/chassis.c|4 +- infiniband-diags/libibnetdisc/src/ibnetdisc.c | 18 infiniband-diags/libibnetdisc/src/libibnetdisc.map |8 --- 4 files changed, 39 insertions(+), 39 deletions(-) diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h index a882994..370ae31 100644 --- a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h +++ b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h @@ -37,6 +37,7 @@ #include stdio.h #include infiniband/mad.h #include iba/ib_types.h +#include infiniband/mad_osd.h struct ib_fabric; /* forward declare */ struct chassis; /* forward declare */ @@ -140,11 +141,12 @@ typedef struct ib_fabric { /** = * Initialization (fabric operations) */ -void ibnd_debug(int i); -void ibnd_show_progress(int i); +MAD_EXPORT void ibnd_debug(int i); +MAD_EXPORT void ibnd_show_progress(int i); -ibnd_fabric_t *ibnd_discover_fabric(char *dev_name, int dev_port, - int timeout_ms, ib_portid_t *from, int hops); +MAD_EXPORT ibnd_fabric_t *ibnd_discover_fabric(char *dev_name, int dev_port, + int timeout_ms, + ib_portid_t *from, int hops); /** * dev_name: (required) local device name to use to access the fabric * dev_port: (required) local device port to use to access the fabric @@ -156,33 +158,35 @@ ibnd_fabric_t *ibnd_discover_fabric(char *dev_name, int dev_port, * hops: (optional) Specify how much of the fabric to traverse. 
* negative value == scan entire fabric */ -void ibnd_destroy_fabric(ibnd_fabric_t *fabric); +MAD_EXPORT void ibnd_destroy_fabric(ibnd_fabric_t *fabric); /** = * Node operations */ -ibnd_node_t *ibnd_find_node_guid(ibnd_fabric_t *fabric, uint64_t guid); -ibnd_node_t *ibnd_find_node_dr(ibnd_fabric_t *fabric, char *dr_str); -ibnd_node_t *ibnd_update_node(ibnd_node_t *node); +MAD_EXPORT ibnd_node_t *ibnd_find_node_guid(ibnd_fabric_t *fabric, uint64_t guid); +MAD_EXPORT ibnd_node_t *ibnd_find_node_dr(ibnd_fabric_t *fabric, char *dr_str); +MAD_EXPORT ibnd_node_t *ibnd_update_node(ibnd_node_t *node); typedef void (*ibnd_iter_node_func_t)(ibnd_node_t *node, void *user_data); -void ibnd_iter_nodes(ibnd_fabric_t *fabric, - ibnd_iter_node_func_t func, - void *user_data); -void ibnd_iter_nodes_type(ibnd_fabric_t *fabric, - ibnd_iter_node_func_t func, - int node_type, - void *user_data); +MAD_EXPORT void ibnd_iter_nodes(ibnd_fabric_t *fabric, + ibnd_iter_node_func_t func, + void *user_data); +MAD_EXPORT void ibnd_iter_nodes_type(ibnd_fabric_t *fabric, +ibnd_iter_node_func_t func, +int node_type, +void *user_data); /** = * Chassis queries */ -uint64_t ibnd_get_chassis_guid(ibnd_fabric_t *fabric, unsigned char chassisnum); -char *ibnd_get_chassis_type(ibnd_node_t *node); -char *ibnd_get_chassis_slot_str(ibnd_node_t *node, char *str, size_t size); - -int ibnd_is_xsigo_guid(uint64_t guid); -int ibnd_is_xsigo_tca(uint64_t guid); -int ibnd_is_xsigo_hca(uint64_t guid); +MAD_EXPORT uint64_t ibnd_get_chassis_guid(ibnd_fabric_t *fabric, + unsigned char chassisnum); +MAD_EXPORT char *ibnd_get_chassis_type(ibnd_node_t *node); +MAD_EXPORT char *ibnd_get_chassis_slot_str(ibnd_node_t *node, + char *str, size_t size); + +MAD_EXPORT int ibnd_is_xsigo_guid(uint64_t guid); +MAD_EXPORT int ibnd_is_xsigo_tca(uint64_t guid); +MAD_EXPORT int ibnd_is_xsigo_hca(uint64_t guid); #endif /* _IBNETDISC_H_ */ diff --git a/infiniband-diags/libibnetdisc/src/chassis.c 
b/infiniband-diags/libibnetdisc/src/chassis.c index 6b4930e..dbb0abe 100644 --- a/infiniband-diags/libibnetdisc/src/chassis.c +++ b/infiniband-diags/libibnetdisc/src/chassis.c @@ -156,6 +156,8 @@ static int is_xsigo_switch(uint64_t guid) static uint64_t xsigo_chassisguid(ibnd_node_t *node) { uint64_t sysimgguid
RE: [ofa-general] [PATCH] rdma_cm: Add debugfs entries to monitor rdma_cm connections
rdma_id is a suffix that leaves room for more, or in other words - I just wanted to leave room for other debug information in the future (e.g. the total count of incoming connections on a device)
ok - makes sense
TP=TyPe (Device type) PO=POrt (Port Number) PS=PortSpace ST=STate I tried to shorten the output line as much as possible to make the output look like an easy-to-read table (on most screens the output will be one line per rdma_id). The same thought made me print only the numeric value and not its string value.
I was able to figure these out by looking at the code, but if I look at the output of netstat, the headings and values are easy to interpret without needing to refer to source code. - Sean
RE: [ofa-general] [PATCH] rdma_cm: Add debugfs entries to monitor rdma_cm connections
If the path is: /sys/kernel/debug/rdma_cm/mthca0_rdma_id do we really need to append '_rdma_id' at the end? (I'll defer to others if debugfs is the right location or not.)

+	if (v == SEQ_START_TOKEN) {
+		seq_printf(file,
+			   "%-3s"
+			   "%-8s"
+			   "%-3s"
+			   "%-5s"
+			   "%-52s"
+			   "%-52s"
+			   "%-5s"
+			   "%-3s"
+			   "%-8s"
+			   "\n",
+			   "TP", "DEV", "PO", "NDEV", "SRC", "DST", "PS", "ST", "QPN");
{snip}
+		seq_printf(file,
+			   "%-3d"
+			   "%-8s"
+			   "%-3d"
+			   "%-5s"
+			   "%-52s"
+			   "%-52s"
+			   "%-5d"
+			   "%-3d"
+			   "%-8d"
+			   "\n",
+			   id_priv->id.route.addr.dev_addr.dev_type,
+			   (id_priv->id.device) ? id_priv->id.device->name : "",
+			   id_priv->id.port_num,
+			   (id_priv->id.route.addr.dev_addr.src_dev) ? id_priv->id.route.addr.dev_addr.src_dev->name : "",
+			   local_addr,
+			   remote_addr,
+			   id_priv->id.ps,
+			   id_priv->state,
+			   id_priv->qp_num);

nit: I'm not a big fan of one parameter per line. :) It's not readily apparent to me what several of the headings are (TP, PO, PS, ST) or what the numeric values map to (for TP, PS, ST).
RE: [ofa-general] RDMA over infiniband, differences between rdma_cm and libmthca-rdmav2
I am new to InfiniBand, and I am doing some research on RDMA. I have found two different ways of sending data on InfiniBand products using RDMA. The first one uses the rdma_cm module (from the kernel source code), and the second one uses libmthca-rdmav2/libibverbs. Could someone explain the differences between these two types of programming?
The library to send data is libibverbs. The rdma_cm (or librdmacm) is one method that can be used to set up the QPs for communication, i.e. exchange the QP numbers, LIDs, etc. You could also set up the QPs using libibcm or just exchange the data over a standard socket. If you look at the librdmacm code, you will see that it calls the libibverbs functions to allocate and modify the QP. - Sean
RE: [ofa-general] [PATCH v3 0/3] Create a new library libibnetdisc and convert iblinkinfo and ibnetdiscover to that library.
This new series uses the current master version ibmad to decode the data. If you accept the mad_*printf functions then I can convert later. For now I want to get this library in! :-D
It would be helpful to check libibnetdisc into a branch in the management.git tree. I need some time to add libibnetdisc to windows. (Where exactly is this library?) - Sean
RE: [ofa-general] [PATCH v3 0/3] Create a new library libibnetdisc and convert iblinkinfo and ibnetdiscover to that library.
The patch creates a subdirectory in infiniband-diags called libibnetdisc. Is that what you mean? Unfortunately I don't have a public git tree I can point you to here at the lab. :-(
My mailer tossed patch 1/3 into my junk mail folder, so I missed the patch for the actual library itself... If it's possible, I'd like for Sasha to add these to a branch in his management.git tree until I can set up the Windows build and verify that everything compiles. I should only need a few days to do this. - Sean
[ofa-general] RE: QoS setting and propagation
responding on general list:
do we set QoS parameters in SM only?
The SM must be configured with QoS. You'll need to look in the opensm QoS documentation to see how to set up QoS. (I don't know those details.)
I looked in cma.c and ib_cm and iw_cm and do not see any parameter passing for QoS. Am I missing something?
IB specifies QoS using the service ID and qos_class fields in the PR query. This is done during 'route resolution'. See cma_query_ib_route().
Can we set it in a transport-independent way?
See rdma_set_service_type(). This call is intended to be generic. - Sean
[ofa-general] RE: [PATCH] rdma_cm: Use rate from ipoib broadcast when joining ipoib multicast
When joining IPoIB multicast group, use the same rate as in the broadcast group. Otherwise, if rdma_cm creates this group before IPoIB does, it might get a different rate. This will cause IPoIB to fail joining to the same group later on, because IPoIB has a strict rate selection.
Should the rdma_cm be creating IPoIB multicast groups?
[ofa-general] RE: [PATCH] rdma_cm: create cm id even when port is down
When doing rdma_resolve_addr() and relevant port is down, the function fails and rdma_cm id is not bound to the device. Therefore, application does not have device handle and cannot wait for the port to become active. The function fails because ipoib is not joined to the multicast group and therefore sa does not have a multicast record to take a qkey from. The patch here is to make lazy qkey resolution - cma_set_qkey will set id_priv-qkey if it was not set, and will be called just before the qkey is really required. Signed-off-by: Yossi Etigin yos...@voltaire.com Acked-by: Sean Hefty sean.he...@intel.com --- Roland, a thread that discussed this starts here: http://lists.openfabrics.org/pipermail/general/2009-February/056895.html The subject never contained '[PATCH]', so it was probably missed, but Yossi's patch should be good for 2.6.30. drivers/infiniband/core/cma.c | 41 +++-- 1 file changed, 27 insertions(+), 14 deletions(-) Index: b/drivers/infiniband/core/cma.c === --- a/drivers/infiniband/core/cma.c2009-03-10 18:21:47.0 +0200 +++ b/drivers/infiniband/core/cma.c2009-03-10 19:22:18.0 +0200 @@ -297,21 +297,25 @@ static void cma_detach_from_dev(struct r id_priv-cma_dev = NULL; } -static int cma_set_qkey(struct ib_device *device, u8 port_num, - enum rdma_port_space ps, - struct rdma_dev_addr *dev_addr, u32 *qkey) +static int cma_set_qkey(struct rdma_id_private *id_priv) { struct ib_sa_mcmember_rec rec; int ret = 0; - switch (ps) { + if (id_priv-qkey) + return; + + switch (id_priv-id.ps) { case RDMA_PS_UDP: - *qkey = RDMA_UDP_QKEY; + id_priv-qkey = RDMA_UDP_QKEY; break; case RDMA_PS_IPOIB: - ib_addr_get_mgid(dev_addr, rec.mgid); - ret = ib_sa_get_mcmember_rec(device, port_num, rec.mgid, rec); - *qkey = be32_to_cpu(rec.qkey); + ib_addr_get_mgid(id_priv-id.route.addr.dev_addr, rec.mgid); + ret = ib_sa_get_mcmember_rec(id_priv-id.device, + id_priv-id.port_num, rec.mgid, + rec); + if (!ret) + id_priv-qkey = be32_to_cpu(rec.qkey); break; default: break; @@ -341,12 +345,7 
@@ static int cma_acquire_dev(struct rdma_i ret = ib_find_cached_gid(cma_dev-device, gid, id_priv-id.port_num, NULL); if (!ret) { - ret = cma_set_qkey(cma_dev-device, - id_priv-id.port_num, - id_priv-id.ps, dev_addr, - id_priv-qkey); - if (!ret) - cma_attach_to_dev(id_priv, cma_dev); + cma_attach_to_dev(id_priv, cma_dev); break; } } @@ -578,6 +577,10 @@ static int cma_ib_init_qp_attr(struct rd *qp_attr_mask = IB_QP_STATE | IB_QP_PKEY_INDEX | IB_QP_PORT; if (cma_is_ud_ps(id_priv-id.ps)) { + ret = cma_set_qkey(id_priv); + if (ret) + return ret; + qp_attr-qkey = id_priv-qkey; *qp_attr_mask |= IB_QP_QKEY; } else { @@ -2201,6 +2204,12 @@ static int cma_sidr_rep_handler(struct i event.status = ib_event-param.sidr_rep_rcvd.status; break; } + ret = cma_set_qkey(id_priv); + if (ret) { + event.event = RDMA_CM_EVENT_ADDR_ERROR; + event.status = -EINVAL; + break; + } if (id_priv-qkey != rep-qkey) { event.event = RDMA_CM_EVENT_UNREACHABLE; event.status = -EINVAL; @@ -2480,10 +2489,14 @@ static int cma_send_sidr_rep(struct rdma const void *private_data, int private_data_len) { struct ib_cm_sidr_rep_param rep; + int ret; memset(rep, 0, sizeof rep); rep.status = status; if (status == IB_SIDR_SUCCESS) { + ret = cma_set_qkey(id_priv); + if (ret) + return ret; rep.qp_num = id_priv-qp_num; rep.qkey = id_priv-qkey; } ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
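The lazy qkey resolution described in the patch can be sketched outside the kernel as follows. This is an illustration under stated assumptions, not the cma.c code: struct conn_id, set_qkey, the lookup callbacks and the sample qkey values are all hypothetical, and the callback stands in for the SA multicast-record query that fails while the port is down. A qkey of 0 is treated as "not resolved yet", as in the patch.

```c
#include <stdint.h>

#define UDP_QKEY 0x01234567u	/* illustrative fixed qkey for the UDP port space */

enum port_space { PS_UDP, PS_IPOIB };

struct conn_id {
	enum port_space ps;
	uint32_t qkey;		/* 0 means "not resolved yet" */
};

/* Stand-in for the SA lookup; returns 0 on success, nonzero on failure. */
typedef int (*qkey_lookup_fn)(uint32_t *qkey);

/* Sample lookups for demonstration: port down (fails) and port up (succeeds). */
static int lookup_port_down(uint32_t *qkey)
{
	(void)qkey;
	return -1;
}

static int lookup_port_up(uint32_t *qkey)
{
	*qkey = 0x0b1bu;	/* example value only */
	return 0;
}

/*
 * Resolve the qkey only when it is actually needed. A previous successful
 * resolution is cached; a failed lookup leaves the id untouched so a later
 * call can retry once the port comes up.
 */
static int set_qkey(struct conn_id *id, qkey_lookup_fn lookup)
{
	uint32_t qkey;
	int ret = 0;

	if (id->qkey)
		return 0;

	switch (id->ps) {
	case PS_UDP:
		id->qkey = UDP_QKEY;
		break;
	case PS_IPOIB:
		ret = lookup(&qkey);
		if (!ret)
			id->qkey = qkey;
		break;
	default:
		break;
	}
	return ret;
}
```

Because the early-exit path is in a function returning int, it returns 0 explicitly. Callers invoke set_qkey just before the qkey is consumed, so creating and binding the id no longer depends on the port being active.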
[ofa-general] RE: [ofw] What is the current support level for QoS in WinOF?
I do have urgent time-sensitive traffic, and non-urgent traffic. The urgent traffic and the non-urgent traffic are generated from different hosts. I would like to differentiate them by using different SLs, then configure QoS to give maximum priority to the urgent time-sensitive traffic, and minimum priority to the non-urgent traffic. Are you saying I can't do this in WinOF?
I don't think the WinOF opensm will support this, but I'm not certain.
And I can't do that even adding a Linux host that runs opensm (OFED version)?
I would expect that this is possible.
Traffic separation based on HCA port could be an option, but I need to think more about that. What can you do with that kind of QoS?
More simply, this would allow you to group hosts into different traffic priority groups.
Do you mark this HCA port as high-priority, that HCA port as low-priority, etc? What happens when a high-pri port sends traffic to a low-pri port? And vice-versa?
There should be rules in the opensm QoS config file that will determine this. I've copied the general list on this reply. Sasha, Hal, or someone that deals more directly with opensm will be able to direct you better.
What happens when a high-pri port sends traffic to a normal port (a port that is not marked as high-priority nor low-priority)? I'm using only RDMA Write with Imm in my system, although I'm interested in what happens on all types of traffic. If you know of a document that explains that, please let me know, I haven't found it so far.
The OFED opensm includes documentation on setting up QoS. It's in opensm/doc in the management source tree. - Sean
[ofa-general] RE: [PATCH] add c99 definitions within the ib_mad_f structure
this knowingly breaks the windows build...
[ofa-general] RE: [PATCH] add c99 definitions within the ib_mad_f structure
So what do you suggest?
Changing the WinOF build environment is something that could be brought up in Sonoma, if there will be enough representatives there. Alternatively, WinOF schedules regular con-calls.
Ira replied that he has no problems with it.
I remember Ira stating that he couldn't build or test his patches on Windows. I have no problem with that. I don't pull the ib-mgmt.git tree every day. When I do pull, if I hit any build issues, I'll just correct them and submit a patch. - Sean
RE: ***SPAM*** Re: [ofa-general] ***SPAM*** [PATCH] infiniband-diags/mcm_rereg_test.c: Add missing mad_rpc_close_port call
Someone made the decision to want to be able to switch back and forth earlier. This should be directed to them. It's certainly easy to eliminate the old code.
I wasn't suggesting that you fix the existing code, just not add to it. If someone wants to be able to switch back and forth, it makes way more sense to use an #if something_that_can_be_set_during_the_build, than #if 1, which requires source code changes in multiple places.
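The build-time switch suggested above might be sketched like this. USE_NEW_PORT_API and port_api_name are hypothetical names chosen for illustration; the idea is that a single macro, overridable from the compiler command line (for example -DUSE_NEW_PORT_API=0), replaces scattered #if 1 blocks.

```c
/* Default to the new code path unless the build overrides the macro. */
#ifndef USE_NEW_PORT_API
#define USE_NEW_PORT_API 1
#endif

/* Reports which code path was compiled in. */
static const char *port_api_name(void)
{
#if USE_NEW_PORT_API
	return "new";
#else
	return "old";
#endif
}
```

Switching paths then needs only a change to the build flags, not edits in every source file that carries the conditional.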
[ofa-general] RE: merging madeye into mainline
Yes, exposing snooping capabilities to user space and writing a user space app that does snooping sounds reasonable - what would it take to expose this capability to user-space - will it fit smoothly into the ib_umad and libibumad design/structure?
libibumad needs a way for the user to indicate that they want to snoop mads, so ib_umad calls ib_register_mad_snoop(). ib_umad would also need to store copies of the mad data, rather than queuing the actual mad. I wouldn't think it would be that difficult to add, though RMPP may cause a small headache. (I don't remember if snooping occurs before or after RMPP packets are reassembled. If it's before, it'll be easier to copy the mad.) - Sean
RE: [ofa-general] Re: [PATCH] infiniband-diags: Fix memory leaks on IBERROR and IBPANIC
It's not a matter of relying on exit for open fds but rather the allocated memory under the covers of mad_rpc_open_port so no longer can one rely on just exit and this needs to be made explicit.
The OS should reclaim any allocated memory not freed by the app when it exits. Is this your concern?
[ofa-general] RE: merging madeye into mainline
Have you ever considered pushing the madeye module (below) into the kernel to ease fabric debugging? I have tested it now against Linus' tree and it works fine.
I hadn't really thought about it, but I don't have any objection to someone submitting it. There may be a better way of doing this if we want to include this upstream - for example, expose snooping capabilities to userspace. - Sean