from:"Sean Hefty"

RE: ib_types.h moving [was: Re: [ofa-general] [RFC] 3/5: IB ACM: libibacm]

2009-09-25 Thread Sean Hefty

Now I likely would agree with Ira that moving ib_types.h to libibumad
is the least painful option. Do we have any better ideas?

Just a random thought, but what about longer term adding a second set of
interfaces to libibumad?  Basically, something more like the kernel ib_sa.  I
don't know that we need a new library just to expand the interface.

For ib_types.h, I'd rather see it broken up into separate header files, at least
some of which get distributed with libibumad.

- Sean

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

[ofa-general] RE: Possible process deadlock in RMPP flow

2009-09-23 Thread Sean Hefty

ibnetdiscover D 80149b8d 0 26968  26544
(L-TLB)
 8102c900bd88 0046 81037e8e 81037e8e02e8
 8102c900bd78 000a 8102c5b50820 81038a929820
 011837bf6105 0ede 8102c5b50a08 0001
Call Trace:
 [80064207] wait_for_completion+0x79/0xa2
 [8008b4cc] default_wake_function+0x0/0xe
 [882271d9] :ib_mad:ib_cancel_rmpp_recvs+0x87/0xde
 [88224485] :ib_mad:ib_unregister_mad_agent+0x30d/0x424
 [883983e9] :ib_umad:ib_umad_close+0x9d/0xd6
 [80012e22] __fput+0xae/0x198
 [80023de6] filp_close+0x5c/0x64
 [800393df] put_files_struct+0x63/0xae
 [80015b26] do_exit+0x31c/0x911
 [8004971a] cpuset_exit+0x0/0x6c
 [8005e116] system_call+0x7e/0x83

From the dump it seems that the process is waiting on the call to
flush_workqueue() in ib_cancel_rmpp_recvs(). The package they use is
OFED 1.4.2.

Roland just submitted a patch in this area yesterday.  I don't know if the patch
would fix their issue, but it may be worth trying.  What kernel does 1.4.2 map
to?

What RMPP messages does ibnetdiscover use?  If the program is completing
successfully, there may be a different race with the rmpp cleanup.  I'll see if
anything else stands out in that area.

- Sean

[ofa-general] RE: [PATCH/RFC] IB/mad: Fix lock-lock-timer deadlock in RMPP code

2009-09-22 Thread Sean Hefty

OK so how about something like this?  Just hold the lock to mark the
items on the list as being canceled, and then actually cancel the
delayed work without the lock.  I think this doesn't leave any races or
holes where the delayed work can mess up the cancel.

This looks good to me.  Thanks for looking at this.

Reviewed-by: Sean Hefty sean.he...@intel.com
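
The approach described above (mark the list entries as canceled while holding the lock, then cancel the delayed work after dropping it) can be sketched in user-space C with pthreads. This is only an illustration of the pattern, not the ib_mad code from the patch: `recv_entry`, `recv_list`, `stop_work` and `cancel_all` are hypothetical names, and the sketch assumes that once an entry is marked canceled no other thread removes it from the list.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical list entry standing in for an RMPP receive context. */
struct recv_entry {
	struct recv_entry *next;
	int canceled;      /* set while holding the list lock */
	int work_stopped;  /* set by stop_work(), which may block */
};

struct recv_list {
	pthread_mutex_t lock;
	struct recv_entry *head;
};

/* Stand-in for a blocking cancel (for example, waiting for delayed work
 * to finish).  It must not run with list->lock held, because the work
 * handler may itself need that lock. */
static void stop_work(struct recv_entry *e)
{
	e->work_stopped = 1;
}

/* Returns the number of entries whose work was stopped. */
static int cancel_all(struct recv_list *list)
{
	struct recv_entry *e;
	int n = 0;

	/* Phase 1: mark every entry while holding the lock. */
	pthread_mutex_lock(&list->lock);
	for (e = list->head; e; e = e->next)
		e->canceled = 1;
	pthread_mutex_unlock(&list->lock);

	/* Phase 2: do the blocking cancel without the lock.  Assumes that
	 * marked entries are no longer unlinked by anyone else. */
	for (e = list->head; e; e = e->next) {
		stop_work(e);
		n++;
	}
	return n;
}
```

A work handler in this scheme would take the lock, see `canceled` set, and return without touching the entry, so it can never block the thread that is waiting for it to finish.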

RE: [ofw] Re: [ofa-general] [RFC] 3/5: IB ACM: libibacm

2009-09-18 Thread Sean Hefty

Although not a fit IMO, the pragmatic solution is to move ib_types.h into
libibumad. I think it is better there than OpenSM which was never quite right
either. That can at least start to eliminate the duplications in this area.

ib_types.h includes complib header files...

RE: [ofw] Re: [ofa-general] [RFC] 3/5: IB ACM: libibacm

2009-09-18 Thread Sean Hefty

Rough hack.  Does windows have stdint.h, byteswap.h, and endian.h?

If not, adding the headers with the needed definitions is trivial.

+/* 16bit */
+#if __BYTE_ORDER == __LITTLE_ENDIAN
+#define CL_NTOH16( x )	(uint16_t)( \
+	(((uint16_t)(x) & 0x00FF) << 8) | \
+	(((uint16_t)(x) & 0xFF00) >> 8) )
+#else
+#define CL_NTOH16( x )	(x)
+#endif
+#define CL_HTON16	CL_NTOH16
+
+/* 32bit */
+#if __BYTE_ORDER == __LITTLE_ENDIAN
+#define CL_NTOH32( x )	(uint32_t)( \
+	(((uint32_t)(x) & 0x000000FF) << 24) | \
+	(((uint32_t)(x) & 0x0000FF00) << 8) | \
+	(((uint32_t)(x) & 0x00FF0000) >> 8) | \
+	(((uint32_t)(x) & 0xFF000000) >> 24) )
+#else
+#define CL_NTOH32( x )	(x)
+#endif
+#define CL_HTON32	CL_NTOH32
+
+/* 64bit */
+#if __BYTE_ORDER == __LITTLE_ENDIAN
+#define CL_NTOH64( x )	(uint64_t)( \
+	(((uint64_t)(x) & 0x00000000000000FFULL) << 56) | \
+	(((uint64_t)(x) & 0x000000000000FF00ULL) << 40) | \
+	(((uint64_t)(x) & 0x0000000000FF0000ULL) << 24) | \
+	(((uint64_t)(x) & 0x00000000FF000000ULL) << 8 ) | \
+	(((uint64_t)(x) & 0x000000FF00000000ULL) >> 8 ) | \
+	(((uint64_t)(x) & 0x0000FF0000000000ULL) >> 24) | \
+	(((uint64_t)(x) & 0x00FF000000000000ULL) >> 40) | \
+	(((uint64_t)(x) & 0xFF00000000000000ULL) >> 56) )
+#else
+#define CL_NTOH64( x )	(x)
+#endif
+#define CL_HTON64	CL_NTOH64
+
+#if __BYTE_ORDER == __LITTLE_ENDIAN
+#define cl_ntoh16(x)  bswap_16(x)
+#define cl_hton16(x)  bswap_16(x)
+#define cl_ntoh32(x)  bswap_32(x)
+#define cl_hton32(x)  bswap_32(x)
+#define cl_ntoh64(x)  (uint64_t)bswap_64(x)
+#define cl_hton64(x)  (uint64_t)bswap_64(x)
+#else /* Big Endian */
+#define cl_ntoh16(x)  (x)
+#define cl_hton16(x)  (x)
+#define cl_ntoh32(x)  (x)
+#define cl_hton32(x)  (x)
+#define cl_ntoh64(x)  (x)
+#define cl_hton64(x)  (x)
+#endif

Why the different defines for cl_ntoh and CL_NTOH?
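
One possible reason, which nothing in this thread confirms: a shift-and-mask macro is a pure constant expression, so it can appear in static initializers and case labels, while a form built on bswap_16() is meant for runtime values. The sketch below shows only the constant-expression side; `SWAP16_CONST`, `swapped_at_compile_time` and `is_swapped_1234` are hypothetical names, and the macro swaps unconditionally instead of checking `__BYTE_ORDER`.

```c
#include <stdint.h>

/* Unconditional 16-bit byte swap written as a constant expression. */
#define SWAP16_CONST(x) ((uint16_t)( \
	(((uint16_t)(x) & 0x00FF) << 8) | \
	(((uint16_t)(x) & 0xFF00) >> 8)))

/* Legal at file scope because the macro expands to a constant expression. */
static const uint16_t swapped_at_compile_time = SWAP16_CONST(0x1234);

/* Constant expressions are also usable as case labels. */
static int is_swapped_1234(uint16_t v)
{
	switch (v) {
	case SWAP16_CONST(0x1234):
		return 1;
	default:
		return 0;
	}
}
```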

[ofa-general] [RFC] 0/5: assistant to the IB communication manager

2009-09-17 Thread Sean Hefty

The following collection of pseudo-patches implement a new user space package
(IB ACM) designed to assist with connection establishment.  A description is
given below, copied from the acm_notes.txt file included with the package.  The
complete package is available on git.openfabrics.org/~shefty/ibacm.git and also
in svn under branches/winverbs/ulp/ibacm.

This is a request for both general and detailed feedback.  The IB ACM has had
very limited testing.  Testing has been restricted to using the provided test
utility, and invoking it from the windows version of the librdmacm on a single,
small cluster.  Calling it from the linux librdmacm is more involved and still
under development.

Signed-off-by: Sean Hefty sean.he...@intel.com
---
Assistant for InfiniBand Communication Management (IB ACM)

Note: The IB ACM should be considered experimental.


Overview

The IB ACM package implements and provides a framework for experimental name,
address, and route resolution services over InfiniBand.  It is intended to
address connection setup scalability issues running MPI applications on large
clusters.  The IB ACM provides information needed to establish a connection, but
does not implement the CM protocol.  Long term, the IB ACM may support multiple
resolution mechanisms.

The IB ACM is focused on being scalable and efficient.  The current
implementation limits network traffic, SA interactions, and centralized
services.  As a trade-off, it is not expected to support all cluster routing
configurations.  However, it is anticipated that additional functionality, such
as path record caching, can be incorporated into the IB ACM to support a wider
range of configurations.

The IB ACM package is comprised of three components: the ib_acm service, a
libibacm library, and a test/configuration utility - ib_acme.  All are userspace
components and are available for Linux and Windows.  Additional details are
given below.


Quick Start Guide
-
1. Prerequisites: libibverbs and libibumad must be installed.
   The IB stack should be running with IPoIB configured
2. Install the IB ACM package
   This installs libibacm, ib_acm, and ib_acme.
3. Run ib_acme -A -O
   This will generate IB ACM address and options configuration files.
   (acm_addr.cfg and acm_opts.cfg)
4. Run ib_acm and leave running
5. Optionally, run ib_acme -s source_ip -d dest_ip -v
   This will verify that the ib_acm service is running.
   It also verifies the path is usable on the given cluster.
6. Install librdmacm.
7. Define the following environment variable: RDMA_CM_USE_IB_ACM=1
   The librdmacm will automatically use the ib_acm service.
   On failures, the librdmacm will fall back to normal resolution.


Details
---
libibacm:
The libibacm is an end-user library with simple interfaces for communicating
with the ib_acm service.  The libibacm implements the ib_acm client protocol.
Although the interfaces to the libibacm are considered experimental, it's
expected that existing calls will be supported going forward.

For simplicity, all calls operate synchronously and are serialized.  Possible
future changes to the libibacm would be to process calls in parallel and add
asynchronous interfaces.


ib_acme:
The ib_acme program serves a dual role.  It acts as a utility to test ib_acm
operation and help verify if the ib_acm is usable for a given cluster
configuration.  Additionally, it automatically generates ib_acm configuration
files to assist with or eliminate manual setup.


acm configuration files:
The ib_acm service relies on two configuration files.  The acm_addr.cfg file
contains name and address mappings for each IB device, port, pkey endpoint.
Although the names in the acm_addr.cfg file can be anything, ib_acme maps the
host name and IP addresses to the IB endpoints.

The acm_opts.cfg file provides a set of configurable options for the ib_acm
service, such as timeout, number of retries, logging level, etc.  ib_acme
generates the acm_opts.cfg file using static information.  A future enhancement
would adjust options based on the current system and cluster size. 
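
As an illustration of the "name value" entry format that ib_acme writes into acm_opts.cfg, a minimal line parser might look like the sketch below. It is not the ib_acm service's actual parser, which is not shown in this posting; `parse_opt_line` is a hypothetical helper, and the 64-byte buffer size is an arbitrary assumption.

```c
#include <stdio.h>
#include <string.h>

/* Parse one line of a "name value" options file.
 * Returns 1 and fills name/value if the line holds an entry,
 * 0 for blank lines, '#' comments, and malformed lines. */
static int parse_opt_line(const char *line, char name[64], char value[64])
{
	/* Skip leading whitespace. */
	while (*line == ' ' || *line == '\t')
		line++;

	/* Comments and empty lines carry no entry. */
	if (*line == '#' || *line == '\n' || *line == '\r' || *line == '\0')
		return 0;

	/* An entry is two whitespace-separated tokens. */
	return sscanf(line, "%63s %63s", name, value) == 2;
}
```

Fed the generated defaults, a line such as `server_port 6125` yields the pair ("server_port", "6125"), while the explanatory `#` lines are skipped.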


ib_acm:
The ib_acm service is responsible for resolving names and addresses to
InfiniBand path information and caching such data.  It is currently implemented
as an executable application, but is a conceptual service or daemon that should
execute with administrative privileges.

The ib_acm implements a client interface over TCP sockets, which is abstracted
by the libibacm library.  One or more back-end protocols are used by the ib_acm
service to satisfy user requests.  Although the ib_acm supports standard SA path
record queries on the back-end, it provides an experimental resolution protocol
in hope of achieving greater scalability. 

Conceptually, the ib_acm service implements an ARP like protocol and uses IB
multicast records to construct path record data.  It makes the assumption that a
unicast path between two endpoints is realizable if those endpoints can
communicate

[ofa-general] [RFC] 1/5: ib_acm: linux abstractions

2009-09-17 Thread Sean Hefty

The following abstractions are defined to support the IB ACM running on Linux.

Signed-off-by: Sean Hefty sean.he...@intel.com
---
/*
 * Copyright (c) 2009 Intel Corporation.  All rights reserved.
 *
 * This software is available to you under the OpenFabrics.org BSD license
 * below:
 *
 * Redistribution and use in source and binary forms, with or
 * without modification, are permitted provided that the following
 * conditions are met:
 *
 *  - Redistributions of source code must retain the above
 *    copyright notice, this list of conditions and the following
 *    disclaimer.
 *
 *  - Redistributions in binary form must reproduce the above
 *    copyright notice, this list of conditions and the following
 *    disclaimer in the documentation and/or other materials
 *    provided with the distribution.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
 * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
 * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
 * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
 * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
 * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE.
 */

#if !defined(OSD_H)
#define OSD_H

#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <byteswap.h>
#include <pthread.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <malloc.h>
#include <arpa/inet.h>
#include <sys/time.h>
#include <netinet/in.h>

#define LIB_DESTRUCTOR __attribute__((destructor))
#define CDECL_FUNC

#define container_of(ptr, type, field) \
((type *) ((void *) ptr - offsetof(type, field)))

#define min(a, b) (a < b ? a : b)
#define max(a, b) (a > b ? a : b)

#if __BYTE_ORDER == __LITTLE_ENDIAN
#define htonll(x) bswap_64(x)
#else
#define htonll(x) (x)
#endif
#define ntohll(x) htonll(x)

typedef struct { volatile int val; } atomic_t;
#define atomic_inc(v) (__sync_fetch_and_add(&(v)->val, 1) + 1)
#define atomic_dec(v) (__sync_fetch_and_sub(&(v)->val, 1) - 1)
#define atomic_get(v) ((v)->val)
#define atomic_set(v, s) ((v)->val = s)

#define stricmp strcasecmp
#define strnicmp strncasecmp

typedef struct { pthread_cond_t cond; pthread_mutex_t mutex; } event_t;
static inline void event_init(event_t *e)
{
	pthread_cond_init(&e->cond, NULL);
	pthread_mutex_init(&e->mutex, NULL);
}
#define event_signal(e) pthread_cond_signal(&(e)->cond)
static inline int event_wait(event_t *e, int timeout)
{
	struct timeval curtime;
	struct timespec wait;
	int ret;

	gettimeofday(&curtime, NULL);
	wait.tv_sec = curtime.tv_sec + ((unsigned) timeout) / 1000;
	wait.tv_nsec = (curtime.tv_usec + (((unsigned) timeout) % 1000) * 1000) * 1000;
	pthread_mutex_lock(&e->mutex);
	ret = pthread_cond_timedwait(&e->cond, &e->mutex, &wait);
	pthread_mutex_unlock(&e->mutex);
	return ret;
}

#define lock_t          pthread_mutex_t
#define lock_init(x)    pthread_mutex_init(x, NULL)
#define lock_acquire    pthread_mutex_lock
#define lock_release    pthread_mutex_unlock

#define osd_init()  0
#define osd_close()

#define SOCKET int
#define SOCKET_ERROR -1
#define INVALID_SOCKET -1
#define socket_errno() errno
#define closesocket close

static inline uint64_t time_stamp_us(void)
{
	struct timeval curtime;
	timerclear(&curtime);
	gettimeofday(&curtime, NULL);
	return (uint64_t) curtime.tv_sec * 1000000 + (uint64_t) curtime.tv_usec;
}

#define time_stamp_ms() (time_stamp_us() / 1000)

static inline int beginthread(void (*func)(void *), void *arg)
{
	pthread_t thread;
	return pthread_create(&thread, NULL, (void *(*)(void*)) func, arg);
}

#endif /* OSD_H */
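
A note on event_wait() above: the tv_nsec expression adds the sub-second part of the timeout to the current microseconds without carrying into tv_sec, so the result can reach or exceed 1,000,000,000 ns, a value that pthread_cond_timedwait may reject with EINVAL. The sketch below shows one way to normalize the deadline. `abs_deadline` is a hypothetical helper and is not part of the posted patch.

```c
#include <sys/time.h>
#include <time.h>

/* Add timeout_ms to *now and return an absolute deadline whose
 * tv_nsec stays within [0, 999999999]. */
static struct timespec abs_deadline(const struct timeval *now,
				    unsigned timeout_ms)
{
	struct timespec ts;
	/* Microseconds past the whole second; can exceed one second. */
	long usec = now->tv_usec + (long) (timeout_ms % 1000) * 1000;

	/* Carry any whole second contained in usec into tv_sec. */
	ts.tv_sec = now->tv_sec + timeout_ms / 1000 + usec / 1000000;
	ts.tv_nsec = (usec % 1000000) * 1000;
	return ts;
}
```

With the current time at 10.9 s and a 250 ms timeout, the deadline becomes 11 s plus 150,000,000 ns, where the unnormalized expression would produce 1,150,000,000 ns.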


/*
 * Copyright (c) 2009 Intel Corporation. All rights reserved.
 *
 * This software is available to you under the OpenIB.org BSD license
 * below:
 *
 * Redistribution and use in source and binary forms, with or
 * without modification, are permitted provided that the following
 * conditions are met:
 *
 *  - Redistributions of source code must retain the above
 *    copyright notice, this list of conditions and the following
 *    disclaimer.
 *
 *  - Redistributions in binary form must reproduce the above
 *    copyright notice, this list of conditions and the following
 *    disclaimer in the documentation and/or other materials
 *    provided with the distribution.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
 * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
 * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
 * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
 * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT

[ofa-general] [RFC] 2/5: IB ACM: windows abstractions

2009-09-17 Thread Sean Hefty

The following abstractions are defined to support the IB ACM running on Windows.

An attempt was made to limit the number of dependencies on external libraries,
such as complib.  We add Windows support for the Linux 'search' binary
tree interfaces.  This is implemented on Windows using complib fleximap, but
gets linked in statically.
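
For readers unfamiliar with the 'search' interfaces mentioned above, this is roughly how the POSIX tsearch()/tfind() calls from <search.h> are used on Linux. The sketch says nothing about how the fleximap-based Windows port works internally; `cmp_str`, `tree_add` and `tree_has` are hypothetical helpers written for this example.

```c
#include <search.h>
#include <string.h>
#include <stddef.h>

/* Comparison callback: the tree stores pointers to C strings. */
static int cmp_str(const void *a, const void *b)
{
	return strcmp((const char *) a, (const char *) b);
}

/* Insert key if it is not already present.
 * Returns 1 if the tree contains key afterwards, 0 on allocation failure. */
static int tree_add(void **root, const char *key)
{
	return tsearch(key, root, cmp_str) != NULL;
}

/* Returns 1 if key is in the tree, 0 otherwise.  Never inserts. */
static int tree_has(void *const *root, const char *key)
{
	return tfind(key, root, cmp_str) != NULL;
}
```

The tree keeps only the key pointers, so the strings must outlive it; the root starts out as a NULL `void *`.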

Signed-off-by: Sean Hefty sean.he...@intel.com
---
/*
 * Copyright (c) 2009 Intel Corporation.  All rights reserved.
 *
 * This software is available to you under the OpenFabrics.org BSD license
 * below:
 *
 * Redistribution and use in source and binary forms, with or
 * without modification, are permitted provided that the following
 * conditions are met:
 *
 *  - Redistributions of source code must retain the above
 *    copyright notice, this list of conditions and the following
 *    disclaimer.
 *
 *  - Redistributions in binary form must reproduce the above
 *    copyright notice, this list of conditions and the following
 *    disclaimer in the documentation and/or other materials
 *    provided with the distribution.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
 * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
 * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
 * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
 * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
 * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE.
 */

#if !defined(OSD_H)
#define OSD_H

#include <windows.h>
#include <process.h>
#include <winsock2.h>

#define __func__ __FUNCTION__
#define LIB_DESTRUCTOR
#define CDECL_FUNC __cdecl

typedef struct { volatile LONG val; } atomic_t;
#define atomic_inc(v) InterlockedIncrement(&(v)->val)
#define atomic_dec(v) InterlockedDecrement(&(v)->val)
#define atomic_get(v) ((v)->val)
#define atomic_set(v, s) ((v)->val = s)

#define event_t HANDLE
#define event_init(e)   *(e) = CreateEvent(NULL, FALSE, FALSE, NULL)
#define event_signal(e) SetEvent(*(e))
#define event_wait(e, t) WaitForSingleObject(*(e), t)   

#define lock_t          CRITICAL_SECTION
#define lock_init       InitializeCriticalSection
#define lock_acquire    EnterCriticalSection
#define lock_release    LeaveCriticalSection

static __inline int osd_init()
{
	WSADATA wsadata;
	return WSAStartup(MAKEWORD(2, 2), &wsadata);
}

static __inline void osd_close()
{
	WSACleanup();
}

#define stricmp _stricmp
#define strnicmp _strnicmp

#define socket_errno WSAGetLastError
#define SHUT_RDWR SD_BOTH

static __inline UINT64 time_stamp_us(void)
{
	LARGE_INTEGER cnt, freq;
	QueryPerformanceFrequency(&freq);
	QueryPerformanceCounter(&cnt);
	return (UINT64) cnt.QuadPart / freq.QuadPart * 1000000;
}

#define time_stamp_ms() (time_stamp_us() / 1000)

#define getpid() ((int) GetCurrentProcessId())
#define beginthread(func, arg)  (int) _beginthread(func, 0, arg)
#define container_of CONTAINING_RECORD

#endif /* OSD_H */
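
A note on the Windows time_stamp_us() above: cnt.QuadPart / freq.QuadPart is an integer division carried out before the scaling multiply, so the returned value only advances in whole-second steps. The sketch below shows one way to keep the sub-second part. It uses plain uint64_t so that it does not depend on the Windows headers, and `ticks_to_us` is a hypothetical helper, not part of the posted patch.

```c
#include <stdint.h>

/* Convert a performance-counter reading to microseconds.
 * Whole seconds and the remainder are scaled separately so that the
 * sub-second ticks are not lost to integer division.  The remainder is
 * always below freq, so rem * 1000000 stays within 64 bits for any
 * frequency below about 1.8e13 Hz. */
static uint64_t ticks_to_us(uint64_t ticks, uint64_t freq)
{
	uint64_t sec = ticks / freq;
	uint64_t rem = ticks % freq;

	return sec * 1000000ULL + rem * 1000000ULL / freq;
}
```

For 15 ticks at 10 ticks per second this gives 1,500,000 us, where dividing first would give 1,000,000 us.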


/*
 * Copyright (c) 2009 Intel Corp, Inc.  All rights reserved.
 *
 * This software is available to you under a choice of one of two
 * licenses.  You may choose to be licensed under the terms of the GNU
 * General Public License (GPL) Version 2, available from the file
 * COPYING in the main directory of this source tree, or the
 * OpenIB.org BSD license below:
 *
 * Redistribution and use in source and binary forms, with or
 * without modification, are permitted provided that the following
 * conditions are met:
 *
 *  - Redistributions of source code must retain the above
 *    copyright notice, this list of conditions and the following
 *    disclaimer.
 *
 *  - Redistributions in binary form must reproduce the above
 *    copyright notice, this list of conditions and the following
 *    disclaimer in the documentation and/or other materials
 *    provided with the distribution.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
 * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
 * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
 * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
 * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
 * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE.
 *
 */

#ifndef _SEARCH_H_
#define _SEARCH_H_

#include <complib/cl_fleximap.h>

//typedef enum
//{
//  preorder,
//  postorder,
//  endorder,
//  leaf
//
//} VISIT;

void *tsearch(const void *key, void **rootp,
  int (*compar)(const void *, const void *));
void *tfind(const void *key, void *const *rootp,
int (*compar)(const void *, const void *));
/* tdelete

[ofa-general] [RFC] 3/5: IB ACM: libibacm

2009-09-17 Thread Sean Hefty

Add an end-user library with simple interfaces for communicating
with the ib_acm service.

The linux and windows specific files for the library are simple and not
shown for this review

Signed-off-by: Sean Hefty sean.he...@intel.com
---

ib_acm.h: defines library interfaces.
These are the end-user application interfaces to the ib acm.

/*
 * Copyright (c) 2009 Intel Corporation.  All rights reserved.
 *
 * This software is available to you under the OpenFabrics.org BSD license
 * below:
 *
 * Redistribution and use in source and binary forms, with or
 * without modification, are permitted provided that the following
 * conditions are met:
 *
 *  - Redistributions of source code must retain the above
 *    copyright notice, this list of conditions and the following
 *    disclaimer.
 *
 *  - Redistributions in binary form must reproduce the above
 *    copyright notice, this list of conditions and the following
 *    disclaimer in the documentation and/or other materials
 *    provided with the distribution.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
 * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
 * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
 * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
 * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
 * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE.
 */

#if !defined(IB_ACM_H)
#define IB_ACM_H

#include <infiniband/verbs.h>

#if defined(_WIN32)
#define LIB_EXPORT __declspec(dllexport)
#else
#define LIB_EXPORT
#endif

#ifdef __cplusplus
extern "C" {
#endif

struct ib_acm_dev_addr
{
uint64_t guid;
uint16_t pkey_index;
uint8_t  port_num;
uint8_t  reserved[5];
};

struct ib_acm_resolve_data
{
uint32_t reserved1;
uint8_t  init_depth;
uint8_t  resp_resources;
uint8_t  packet_lifetime;
uint8_t  mtu;
uint8_t  reserved2[8];
};

/**
 * ib_acm_resolve_name - Resolve path data between the specified names.
 * Description:
 *   Discover path information, including identifying the local device,
 *   between the given source and destination names.
 * Notes:
 *   The source and destination names should match entries in acm_addr.cfg
 *   configuration files on their respective systems.  Typically, the
 *   source and destination names will refer to system host names
 *   assigned to an Infiniband port.
 */
LIB_EXPORT
int ib_acm_resolve_name(char *src, char *dest,
struct ib_acm_dev_addr *dev_addr, struct ibv_ah_attr *ah,
struct ib_acm_resolve_data *data);

/**
 * ib_acm_resolve_ip - Resolve path data between the specified addresses.
 * Description:
 *   Discover path information, including identifying the local device,
 *   between the given source and destination addresses.
 * Notes:
 *   The source and destination addresses should match entries in acm_addr.cfg
 *   configuration files on their respective systems.  Typically, the
 *   source and destination addresses will refer to IP addresses assigned
 *   to an IPoIB instance.
 */
LIB_EXPORT
int ib_acm_resolve_ip(struct sockaddr *src, struct sockaddr *dest,
struct ib_acm_dev_addr *dev_addr, struct ibv_ah_attr *ah,
struct ib_acm_resolve_data *data);


#define IB_PATH_RECORD_REVERSIBLE 0x80

struct ib_path_record
{
	uint64_t        service_id;
	union ibv_gid   dgid;
	union ibv_gid   sgid;
	uint16_t        dlid;
	uint16_t        slid;
	uint32_t        flowlabel_hoplimit; /* resv-31:28 flow label-27:8 hop limit-7:0 */
	uint8_t         tclass;
	uint8_t         reversible_numpath; /* reversible-7:7 num path-6:0 */
	uint16_t        pkey;
	uint16_t        qosclass_sl;        /* qos class-15:4 sl-3:0 */
	uint8_t         mtu;                /* mtu selector-7:6 mtu-5:0 */
	uint8_t         rate;               /* rate selector-7:6 rate-5:0 */
	uint8_t         packetlifetime;     /* lifetime selector-7:6 lifetime-5:0 */
	uint8_t         preference;
	uint8_t         reserved[6];
};
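
The bit ranges in the comments of struct ib_path_record translate into shift-and-mask accessors along the lines of the sketch below. The `pr_*` helper names are hypothetical and not part of the proposed library, and the inputs are assumed to have been converted to host byte order already.

```c
#include <stdint.h>

/* flowlabel_hoplimit: resv 31:28, flow label 27:8, hop limit 7:0 */
static inline uint32_t pr_flow_label(uint32_t flowlabel_hoplimit)
{
	return (flowlabel_hoplimit >> 8) & 0xFFFFF;
}

static inline uint8_t pr_hop_limit(uint32_t flowlabel_hoplimit)
{
	return (uint8_t) (flowlabel_hoplimit & 0xFF);
}

/* qosclass_sl: qos class 15:4, sl 3:0 */
static inline uint16_t pr_qos_class(uint16_t qosclass_sl)
{
	return (uint16_t) (qosclass_sl >> 4);
}

static inline uint8_t pr_sl(uint16_t qosclass_sl)
{
	return (uint8_t) (qosclass_sl & 0xF);
}

/* mtu: selector 7:6, value 5:0 (rate and packetlifetime share this layout) */
static inline uint8_t pr_mtu_selector(uint8_t mtu)
{
	return (uint8_t) (mtu >> 6);
}

static inline uint8_t pr_mtu(uint8_t mtu)
{
	return (uint8_t) (mtu & 0x3F);
}

/* reversible_numpath: reversible 7:7 (0x80, IB_PATH_RECORD_REVERSIBLE) */
static inline int pr_reversible(uint8_t reversible_numpath)
{
	return (reversible_numpath & 0x80) != 0;
}
```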

/**
 * ib_acm_resolve_path - Resolve path data meeting specified restrictions
 * Description:
 *   Discover path information using the provided path record to
 *   restrict the discovery.
 * Notes:
 *   Uses the provided path record as input into a query for path
 *   information.  If successful, fills in any missing information.  The
 *   caller must provide at least the source and destination LIDs as input.
 */
LIB_EXPORT
int ib_acm_resolve_path(struct ib_path_record *path);

/**
 * ib_acm_query_path - Resolve path data meeting specified restrictions
 * Description:
 *   Queries the IB SA for a path record using the provided path record to
 *   restrict the query.
 * Notes:
 *   Uses the provided path record

[ofa-general] [RFC] 4/5: IB ACM: ib_acme test/configuration utility

2009-09-17 Thread Sean Hefty

Add a test/configuration utility to setup the ib_acm service and verify
its operation.

Signed-off-by: Sean Hefty sean.he...@intel.com
---
One of the eventual goals is for the librdmacm library to use the ib acm, so
a decision was made to avoid the ib acm package needing to depend on the
librdmacm.  This led to OS specific code being needed to map IP addresses
to IB endpoints.  If anyone has an easier solution for handling this mapping,
I'm open to alternatives here.

acme.c: OS independent source file

/*
 * Copyright (c) 2009 Intel Corporation.  All rights reserved.
 *
 * This software is available to you under the OpenIB.org BSD license
 * below:
 *
 * Redistribution and use in source and binary forms, with or
 * without modification, are permitted provided that the following
 * conditions are met:
 *
 *  - Redistributions of source code must retain the above
 *    copyright notice, this list of conditions and the following
 *    disclaimer.
 *
 *  - Redistributions in binary form must reproduce the above
 *    copyright notice, this list of conditions and the following
 *    disclaimer in the documentation and/or other materials
 *    provided with the distribution.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
 * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
 * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
 * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
 * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
 * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE.
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <getopt.h>
#include <netdb.h>
#include <arpa/inet.h>

#include <osd.h>
#include <infiniband/verbs.h>
#include <infiniband/ib_acm.h>

static char *dest_addr;
static char *src_addr;
static char addr_type = 'i';
static int verify;
static int make_addr;
static int make_opts;

struct ibv_context **verbs;
int dev_cnt;

extern int gen_addr_ip(FILE *f);


static void show_usage(char *program)
{
	printf("usage 1: %s\n", program);
	printf("   [-f addr_format] - i(p), n(ame), or l(id)\n");
	printf("                      default: 'i'\n");
	printf("   -s src_addr      - format defined by -f option\n");
	printf("   -d dest_addr     - format defined by -f option\n");
	printf("   [-v]             - verify ACM response against SA query response\n");
	printf("usage 2: %s\n", program);
	printf("   -A               - generate local acm_addr.cfg configuration file\n");
	printf("   -O               - generate local acm_opts.cfg options file\n");
}

static void gen_opts_temp(FILE *f)
{
	fprintf(f, "# InfiniBand Multicast Communication Manager for clusters configuration file\n");
	fprintf(f, "#\n");
	fprintf(f, "# Use ib_acme utility with -O option to automatically generate a sample\n");
	fprintf(f, "# acm_opts.cfg file for the current system.\n");
	fprintf(f, "#\n");
	fprintf(f, "# Entry format is:\n");
	fprintf(f, "# name value\n");
	fprintf(f, "\n");
	fprintf(f, "# log_file:\n");
	fprintf(f, "# Specifies the location of the ACM service output.  The log file is used to\n");
	fprintf(f, "# assist with ACM service debugging and troubleshooting.  The log_file can\n");
	fprintf(f, "# be set to 'stdout', 'stderr', or the base name of a file.  If a file name\n");
	fprintf(f, "# is specified, the actual name formed by appending a process ID and '.log'\n");
	fprintf(f, "# extension to the end of the specified file name.\n");
	fprintf(f, "# Examples:\n");
	fprintf(f, "# log_file stdout\n");
	fprintf(f, "# log_file stderr\n");
	fprintf(f, "# log_file /tmp/acm_\n");
	fprintf(f, "\n");
	fprintf(f, "log_file stdout\n");
	fprintf(f, "\n");
	fprintf(f, "# log_level:\n");
	fprintf(f, "# Indicates the amount of detailed data written to the log file.  Log levels\n");
	fprintf(f, "# should be one of the following values:\n");
	fprintf(f, "# 0 - basic configuration & errors\n");
	fprintf(f, "# 1 - verbose configuration & errors\n");
	fprintf(f, "# 2 - verbose operation\n");
	fprintf(f, "\n");
	fprintf(f, "log_level 0\n");
	fprintf(f, "\n");
	fprintf(f, "# server_port:\n");
	fprintf(f, "# TCP port number that the server listens on.\n");
	fprintf(f, "# If this value is changed, then a corresponding change is required for\n");
	fprintf(f, "# client applications.\n");
	fprintf(f, "\n");
	fprintf(f, "server_port 6125\n");
	fprintf(f, "\n");
	fprintf(f, "# timeout:\n");
	fprintf(f, "# Additional time, in milliseconds, that the ACM service will wait for a\n");
	fprintf(f, "# response from a remote ACM service or the IB SA.  The actual request\n");
fprintf(f

RE: [ofa-general] [RFC] 3/5: IB ACM: libibacm

2009-09-17 Thread Sean Hefty

 #define IB_PATH_RECORD_REVERSIBLE 0x80

 struct ib_path_record
 {
  uint64_t        service_id;
  union ibv_gid   dgid;
  union ibv_gid   sgid;
  uint16_t        dlid;
  uint16_t        slid;
  uint32_t        flowlabel_hoplimit; /* resv-31:28 flow label-27:8 hop limit-7:0 */
  uint8_t         tclass;
  uint8_t         reversible_numpath; /* reversible-7:7 num path-6:0 */
  uint16_t        pkey;
  uint16_t        qosclass_sl;        /* qos class-15:4 sl-3:0 */
  uint8_t         mtu;                /* mtu selector-7:6 mtu-5:0 */
  uint8_t         rate;               /* rate selector-7:6 rate-5:0 */
  uint8_t         packetlifetime;     /* lifetime selector-7:6 lifetime-5:0 */
  uint8_t         preference;
  uint8_t         reserved[6];
 };

I would prefer to use the structures already defined in ib_types.h...  I
understand your not wanting to make ACM dependent on the OpenSM packages so is
it time to move ib_types.h out of the OpenSM tree and somewhere more generic?
Perhaps libibumad?  This also applies to ib_sa_mad in your 5th patch.

OTOH, ib_types.h is a 10K line file with multiple long (10 lines) inlined
functions.  Perhaps it deserves its own library?

Defining some of these types in libibumad isn't a bad idea.  Although, WinOF
actually has 2 copies of ib_types.h (that differ...)  I find using ib_types.h
painful given its size; separate header files may help.

- Sean

RE: [ofw] Re: [ofa-general] [RFC] 3/5: IB ACM: libibacm

2009-09-17 Thread Sean Hefty

I'm not sure this is a good idea. ibutils (ibis and ibmgtsim) wants ib_types.h
but does not want libibumad.

Well, libibumad is pretty useless without some network structure definitions.
Currently, the alternatives are to install opensm, which also requires
installing libibmad, libibcommon, and complib, or for the app to define what
they need, which is what was done here.  I'm not sure how you pick up ib_types.h
without libibumad getting installed, but you can make a reasonable argument that
libibumad should define the MAD and SA attribute structures.

- Sean

RE: [ofw] Re: [ofa-general] [RFC] 3/5: IB ACM: libibacm

2009-09-17 Thread Sean Hefty

libibcm needs to learn how to do PR queries, it should have a good PR
query API since libibcm is pretty useless without being able to do PR
queries..

PR queries don't work - regardless of what the API looks like or where it
resides.  Plus adding PR queries to libibcm doesn't solve the problem of where
the structure definitions reside.

- Sean

RE: [ofw] Re: [ofa-general] [RFC] 3/5: IB ACM: libibacm

2009-09-17 Thread Sean Hefty

PR queries work fine, I don't understand your comment.

MPI does not use PR queries because it does not scale.

RE: [ofw] Re: [ofa-general] [RFC] 3/5: IB ACM: libibacm

2009-09-17 Thread Sean Hefty

Not all the world is MPI.

The focus of this package is for MPI though.  The librdmacm interface does
perform standard PR queries for applications that use that interface.  I'm not
fond of the mad interfaces, but I'm not trying to fix them with this.  We can
debate whether an application should use an interface that exposes path records
and the IB CM protocol directly, but the feedback from MPI and other developers
is that connection establishment over IB requires too much code and is too
difficult.

Short term, while the ib_acm is considered experimental, I want to call the
ib_acm from under the librdmacm interface.  This allows it to be used without
applications needing to change.  Long term, if the ib_acm can prove itself,
then accessing it directly from the kernel is a possibility.

Your new acm stuff still does PR queries.

The primary reason for adding PR query was to verify that the path information
returned by the ib_acm was usable.  A user needs some way to know if the ib_acm
can be used on their cluster.  This was one of the last things that I added, and
I think it has value, even if only for verification purposes.  The central
mechanism the ib_acm employs to acquire path data uses multicast.

Anyone using libibverbs multicast needs to do PR queries from
userspace.

The ib_acm uses libibverbs multicast and does not do PR queries.

Anyone using libibcm needs to do PR queries from userspace.

Open MPI has coded to the libibcm and does not perform PR queries.

What's needed in either of the above cases is path information; however, there
are alternate ways of obtaining this information without involving a direct
query to the SA.  MPI and DAPL can connect over IB today without doing PR
queries.  While there are limitations to determining path information without
doing a PR query, there are also limitations to obtaining path information doing
one.  Looking at current implementations, I would deduce that the latter is more
limiting than the former in practice.

Therefore we should just jam the PR query stuff in libibcm, everyone
can use that, and your acm can ride on the PR query code from
libibcm for its own needs too.

These are the calls exposed through libibacm:

int ib_acm_resolve_name(char *src, char *dest,
        struct ib_acm_dev_addr *dev_addr, struct ibv_ah_attr *ah,
        struct ib_acm_resolve_data *data);
int ib_acm_resolve_ip(struct sockaddr *src, struct sockaddr *dest,
        struct ib_acm_dev_addr *dev_addr, struct ibv_ah_attr *ah,
        struct ib_acm_resolve_data *data);
int ib_acm_resolve_path(struct ib_path_record *path);
int ib_acm_query_path(struct ib_path_record *path);
int ib_acm_convert_to_path(struct ib_acm_dev_addr *dev_addr,
        struct ibv_ah_attr *ah, struct ib_acm_resolve_data *data,
        struct ib_path_record *path);

Of these, the one of most importance to the problem I'm trying to solve is
ib_acm_resolve_ip().  I do not believe that we want to add what should be
considered an experimental interface to libibcm, libibumad, or librdmacm based
on socket addresses that would then need to be maintained.

If your objection is that ib_acm_query_path() should be moved to libibcm, that's
a possibility.  libibacm already interfaces to libibumad, and it was trivial to
add support for PR queries.  libibcm does not currently depend on libibumad.
And if you take a step back in the connection process, I don't know that support
for just PR queries is sufficient for establishing a connection over IB.  You
first need to identify the endpoint, which opens up the possibility of other SA
queries.

- Sean

[ofa-general] RE: Does the CMA user space support join multicast for IPv6 too?

2009-09-15 Thread Sean Hefty

Does rdma_join_multicast support IPv6 addresses?
If yes, from which version of librdmacm?

Hmm... I don't think so.  It looks like the librdmacm and rdma_cm kernel modules
could support it with a small change.  The kernel module calls ip_ib_mc_map() to
map IP addresses to MGIDs, which only works with IPv4.

Does ipoib map IPv6 multicast addresses to MGIDs directly?

- Sean

[ofa-general] RE: [PATCH/RFC] IB/mad: Fix lock-lock-timer deadlock in RMPP code (was: [NEW PATCH] IB/mad: Fix possible lock-lock-timer deadlock)

2009-09-09 Thread Sean Hefty

Holding agent->lock across cancel_delayed_work() (which does
del_timer_sync()) in ib_cancel_rmpp_recvs() leads to lockdep reports of
possible lock-timer deadlocks if a consumer ever does something that
connects agent->lock to a lock taken in IRQ context (cf
http://marc.info/?l=linux-rdma&m=125243699026045).

However, it seems this locking is not necessary here, since the locking
did not prevent the rmpp_list from having an item added immediately
after the lock is dropped -- so there must be sufficient synchronization
protecting the rmpp_list without the locking here.  Therefore, we can
fix the lockdep issue by simply deleting the locking.

The locking is needed to protect against items being removed from rmpp_list in
recv_timeout_handler() and recv_cleanup_handler().  No new items should be added
to the rmpp_list when ib_cancel_rmpp_recvs() is running (or there's a separate
bug).

- Sean

RE: [ofa-general] performance to call ibv_poll_cq() vs. call select() on completion channel

2009-09-02 Thread Sean Hefty

But I just checked the source code, ibv_poll_cq() is actually ibv_cmd_poll_cq(),
and ibv_cmd_poll_cq() calls the write() system call on the IB device.

Doesn't this write() system call switch to kernel mode and possibly cause
a context switch?

See verbs.h:

static inline int ibv_poll_cq(struct ibv_cq *cq, int num_entries,
                              struct ibv_wc *wc)
{
    return cq->context->ops.poll_cq(cq, num_entries, wc);
}

The userspace provider library sets poll_cq to an internal call.
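
To make the dispatch concrete, here is a small self-contained mock of the same
pattern.  It is only a sketch: the mock_* types and provider_poll_cq() are
invented for illustration and are not the libibverbs definitions.  The point is
that the poll goes through a function pointer installed by the provider, so the
call can complete entirely in user space.

```c
#include <assert.h>
#include <stddef.h>

/* Mock stand-ins for ibv_wc / ibv_cq / ibv_context (hypothetical names). */
struct mock_wc { int wr_id; };
struct mock_cq;
struct mock_ops {
    int (*poll_cq)(struct mock_cq *cq, int num_entries, struct mock_wc *wc);
};
struct mock_context { struct mock_ops ops; };
struct mock_cq { struct mock_context *context; int pending; };

/* "Provider" poll routine: drains pending entries, no system call involved. */
int provider_poll_cq(struct mock_cq *cq, int num_entries, struct mock_wc *wc)
{
    int n = 0;

    while (n < num_entries && cq->pending > 0) {
        wc[n].wr_id = cq->pending--;
        n++;
    }
    return n;
}

/* Same shape as the inline in verbs.h: just call through the ops table. */
int mock_poll_cq(struct mock_cq *cq, int num_entries, struct mock_wc *wc)
{
    assert(cq && cq->context && cq->context->ops.poll_cq);
    return cq->context->ops.poll_cq(cq, num_entries, wc);
}
```

A caller would fill in ops.poll_cq once (the provider does this when the device
is opened) and then call mock_poll_cq() in its polling loop.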

- Sean

RE: [ofa-general] Help - RDMA event files remain open after acknowledging them

2009-08-14 Thread Sean Hefty

What am I doing wrong? Is there something more I need to do than calling
rdma_ack_cm_event after every rdma_get_cm_event to get these event files to be
closed? As an fyi, I have even tried closing the rdma_id and destroying the
event channel when the connection fails to force the event files to be closed
without success.

The following calls result in opening files to the kernel:

ibv_create_comp_channel() - used to report cq events
rdma_create_event_channel() - used to report rdma cm events

Be sure that there are corresponding calls to:

ibv_destroy_comp_channel()
rdma_destroy_event_channel()

These are the calls that close the opened files.

- Sean

RE: [ofa-general] will opensm respond to requests that do not originate from qp1

2009-08-14 Thread Sean Hefty

Based on a code audit, I've confirmed that this should work
(osm_vendor_ibumad.c:osm_vendor_send takes care of doing this). I'm not sure
it's been tried for SA but it has been exercised for other GS classes (sending
to some QP other than QP1).

Thanks for checking and pointing me at the right source file.

RE: How to destroy IB resources (was Re: [ofa-general] Help - RDMA event files remain open after acknowledging them)

2009-08-14 Thread Sean Hefty

I guess my question is, what's the best way to destroy IB resources? (Perhaps
even, what's the best way to init them in the first place).

If you're destroying the CQ, there's no need to call ibv_get_cq_event() or
ibv_poll_cq(), unless you need completion information (for example, from flushed
receives).

However, every successful call to ibv_get_cq_event() needs a corresponding call
to ibv_ack_cq_events().  You can call ack(1) for each cq event, or count the
number of times that get returns success and call ack(get_cnt) once before
calling destroy.  Note that the count refers to the number of cq events, and not
the number of completions returned through ibv_poll_cq.

For your drain_cq() function, you should be safe doing something like this:

while (ibv_poll_cq(...) > 0)
    /* optional processing of any left over completions */;

ibv_ack_cq_events(...this_cqs_total_event_cnt); /* or ack after get */
ibv_destroy_cq(...);

ibv_dealloc_pd(), ibv_destroy_cq() and ibv_destroy_comp_channel() all return
error EBUSY

This sounds like a QP isn't being destroyed.  I'm not sure that anything else
fails CQ destruction with EBUSY.

Btw, if you're using the rdma_cm interface, then it's simpler to use the
rdma_create_qp/rdma_destroy_qp calls, which allows the rdma_cm to perform the QP
state transitions for you.

- Sean

[ofa-general] RE: [PATCH/RFC] IB/mad: Fix possible deadlock (cancel_delayed_work inside spinlock)

2009-08-14 Thread Sean Hefty

How about this approach?  Basically it just open-codes delayed work by
splitting the timer and the work struct, and switches to mod_timer()
instead of del_timer() + add_timer().  It passes very light testing here
(basically I started ipoib and nothing blew up).

The approach looks okay to me. 

@@ -512,7 +523,8 @@ static void unregister_mad_agent(struct
ib_mad_agent_private *mad_agent_priv)
*/
   cancel_mads(mad_agent_priv);
   port_priv = mad_agent_priv->qp_info->port_priv;
-  cancel_delayed_work(&mad_agent_priv->timed_work);
+  del_timer_sync(&mad_agent_priv->timeout_timer);
+  cancel_work_sync(&mad_agent_priv->timeout_work);

I had to check if there was a race between del_timer_sync() and the worker
thread, but the call to cancel_mads() looks like it prevents any issues.

- Sean

RE: [ofa-general] crash in cm_init_qp_rts_attr() - any ideas?

2009-08-12 Thread Sean Hefty

Call Trace: 882fb6d5{:rdma_cm:rdma_init_qp_attr+209}
   88309285{:rdma_ucm:ucma_init_qp_attr+160}
   802ea55a{thread_return+0}
8830832e{:rdma_ucm:ucma_write+115}
   80186662{vfs_write+215} 80186c2b{sys_write+69}
  8010adba{system_call+126}

The rdma_cm is being used, so alternate path information is not used.

static int cm_init_qp_rts_attr(struct cm_id_private *cm_id_priv,
   struct ib_qp_attr *qp_attr,
   int *qp_attr_mask)
{

if (cm_id_priv->id.lap_state == IB_CM_LAP_UNINIT) {
.
} else {
   *qp_attr_mask = IB_QP_ALT_PATH | IB_QP_PATH_MIG_STATE;
   qp_attr->alt_port_num = cm_id_priv->alt_av.port->port_num; <- die

The rdma_cm should always send us through the if portion, and I would expect
alt_av to be NULL.  Maybe the cm_id is corrupted..?  Is there any chance that
the remote side is trying to load an alternate path?  Getting the value of the
lap_state may help, to see if it's at least a valid lap_state value.

- Sean

RE: [ofa-general] crash in cm_init_qp_rts_attr() - any ideas?

2009-08-12 Thread Sean Hefty

Ah, I've got that - lap_state is IB_CM_MRA_LAP_SENT.

Errr... not sure how that happened.  I don't know if ofed 1.3 has this feature
or not, but can you cat:

/sys/class/infiniband_cm/<device>/<port_num>/cm_tx_msgs/lap

if it exists?  Are both sides using the rdma_cm to communicate?  Does anything
in the app (either side) try to do something with alternate paths?

- Sean

RE: [ofa-general] [PATCHv4 04/10] IB/umad: Enable support for RDMAoE ports

2009-08-10 Thread Sean Hefty

Might there be some GS service to expose ? Vendor MADs perhaps ? If not, then
not exposing QP1 should be OK.

At some point, exposing QP1 may make sense.  I was thinking more along the lines
of limiting the user space interfaces until things can be standardized. 

RE: [ofa-general] [PATCHv4 01/10] ib_core: Refine device personality from node type to port type

2009-08-06 Thread Sean Hefty

Can resources (PDs, CQs, MRs, etc.) between the different transports be shared?
Does QP failover between transports work?

There is nothing in the architecture that precludes this; we are not
currently focusing on this.

Does the implementation allow this?  Right now PDs, CQs, etc are allocated per
device, not per port.  I'm not immediately concerned about QP failover.
However, I believe there needs to be some level of coordination between the
Infiniband side of the CM and the Ethernet side of the CM, since QPs are
associated with CA GUIDs.  I'm just trying to understand the impact of this
coordination.

- Sean

RE: [ofa-general] [PATCH] ib_send_bw -b can hang due to too few CQ entries

2009-08-06 Thread Sean Hefty

-  ctx->cq = ibv_create_cq(ctx->context, ctx->rx_depth, NULL, ctx->channel,
0);
+  ctx->cq = ibv_create_cq(ctx->context, ctx->tx_depth + ctx->rx_depth,
+  NULL, ctx->channel, 0);

I'm looking at a windows port of this test, but at least there, rx_depth is set
to rx_depth + tx_depth.

RE: [ofa-general] [PATCH] ib_send_bw -b can hang due to too few CQ entries

2009-08-06 Thread Sean Hefty

Sure. Just above the call to ibv_create_cq(), ctx->rx_depth is set to
   ctx->rx_depth = rx_depth + tx_depth
but the rest of the code does ibv_post_send() and ibv_post_recv()
based on ctx->tx_depth and ctx->rx_depth which means the CQ needs
to be ctx->tx_depth + ctx->rx_depth big.

If the tx_depth is the same on both sides, why would there ever be more than the
initial tx_depth and rx_depth completions on the CQ?  How many receive
completions can there be on the CQ, and what throttles the sender? 

- Sean

RE: [ofa-general] [PATCH] ib_send_bw -b can hang due to too few CQ entries

2009-08-06 Thread Sean Hefty

Remember that this fix only affects the bi-directional test.
Both client and server are going to post ctx->rx_depth receives
and ctx->tx_depth sends and then check for completions.
It won't post more sends or receives until the completions are
seen.

Okay - I think I understand what's happening.

The maximum number of outstanding sends is limited to tx_depth / 2.  After
posting that many sends, the code waits for completions.  Once some sends
complete, additional sends may be posted, up to the iteration count.  There's
nothing that coordinates posting the sends with completing receives on the
remote side.  (This is what I was missing.)  Eventually, all posted receives
could be complete and generate CQ entries.  The send side is basically throttled
by RNR NACKs.

Now I don't understand the purpose behind doubling the rx_depth...

- Sean

RE: [ofa-general] [PATCH] cma: fix access to freed memory

2009-08-05 Thread Sean Hefty

rdma_join_multicast() allocates struct cma_multicast and then proceeds to join
to a multicast address. However, the join operation completes in another
context and the allocated struct could be released if the user destroys either
the rdma_id object or decides to leave the multicast group while the join is in
progress. This patch uses reference counting to avoid such a situation. It
also protects removal from id_priv->mc_list in cma_leave_mc_groups().

rdma_destroy_id and rdma_leave_multicast call ib_sa_free_multicast.  This call
will block until the join callback completes or is canceled.  Can you describe
the race with cma_ib_mc_handler in more detail?

Also, cma_leave_mc_groups is only called from rdma_destroy_id.  Locking around
the mc-list shouldn't be required, since calls to join/leave aren't allowed.

- Sean

RE: [ofa-general] [PATCHv4 01/10] ib_core: Refine device personality from node type to port type

2009-08-05 Thread Sean Hefty

As a preparation to devices that, in general, support different transport
protocol for each port, specifically RDMAoE, this patch defines transport type
for each of a device's ports. As a result rdma_node_get_transport() has been
unexported and is used internally by the implementation of the new API,
rdma_port_get_transport() which gives the transport protocol of the queried
port. All references to rdma_node_get_transport() are changed to use
rdma_port_get_transport(). Also, ib_port_attr is extended to contain enum
rdma_transport_type.

Can resources (PDs, CQs, MRs, etc.) between the different transports be shared?
Does QP failover between transports work?

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 5130fc5..f930f1d 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -3678,9 +3678,7 @@ static void cm_add_one(struct ib_device *ib_device)
   unsigned long flags;
   int ret;
   u8 i;
-
-  if (rdma_node_get_transport(ib_device->node_type) != RDMA_TRANSPORT_IB)
-  return;

Did you consider modifying rdma_node_get_transport_s_() and returning a bitmask
of the supported transports available on the device?  I'm wondering if something
like this makes sense, to allow skipping devices that are not of interest to a
particular module.  This would be in addition to the rdma_port_get_transport
call.

There's just a lot of new checks to handle the transport on a port by port
basis.
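
A rough sketch of what such a bitmask helper could look like, written as plain C
outside the kernel.  The enum values and function names below are hypothetical,
chosen only to illustrate the idea; they are not existing ib_core symbols.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-port transport flags, one bit per transport. */
enum xport_flag {
    XPORT_IB     = 1u << 0,
    XPORT_IWARP  = 1u << 1,
    XPORT_RDMAOE = 1u << 2
};

/* OR together the transport of every port to get a per-device mask. */
unsigned device_transport_mask(const unsigned *port_xport, size_t num_ports)
{
    unsigned mask = 0;

    assert(port_xport != NULL || num_ports == 0);
    for (size_t i = 0; i < num_ports; i++)
        mask |= port_xport[i];
    return mask;
}

/* A module could skip an uninteresting device with a single test. */
int device_supports(unsigned mask, unsigned xport)
{
    return (mask & xport) != 0;
}
```

With something along these lines, cm_add_one() could keep a single early return
for devices that have no IB port at all, and fall back to the per-port check only
for devices that do.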

- Sean

RE: [ofa-general] [PATCH] cma: fix access to freed memory

2009-08-05 Thread Sean Hefty

So where does this leave things?  Is any part of Eli's patch needed?

I don't believe the patch is needed, and Eli agreed with this.

- Sean

RE: [ofa-general] [PATCH] IB: Possible write outside array bounds

2009-07-29 Thread Sean Hefty

@@ -132,6 +136,9 @@ enum smi_action smi_handle_dr_smp_recv(struct ib_smp *smp,
u8 node_type,
   hop_ptr = smp->hop_ptr;
   hop_cnt = smp->hop_cnt;

+  if (hop_cnt >= IB_SMP_MAX_PATH_HOPS)
+  return IB_SMI_DISCARD;
+
   /* See section 14.2.2.2, Vol 1 IB spec */
   if (!ib_get_smp_direction(smp)) {
   /* C14-9:1 -- sender should have incremented hop_ptr */
@@ -140,7 +147,8 @@ enum smi_action smi_handle_dr_smp_recv(struct ib_smp *smp,
u8 node_type,

   /* C14-9:2 -- intermediate hop */
   if (hop_ptr && hop_ptr < hop_cnt) {
-  if (node_type != RDMA_NODE_IB_SWITCH)
+  if (node_type != RDMA_NODE_IB_SWITCH ||
+  hop_ptr + 1 >= IB_SMP_MAX_PATH_HOPS)

I believe at this point:

hop_ptr < hop_cnt < IB_SMP_MAX_PATH_HOPS

so, this test will always fail.
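
The claim can be checked by brute force over every u8 value.  The sketch below is
a standalone user-space check, not kernel code; it assumes IB_SMP_MAX_PATH_HOPS
is 64 and count_reachable_hits() is a made-up helper name.

```c
#include <assert.h>
#include <stdint.h>

#define IB_SMP_MAX_PATH_HOPS 64  /* assumed value for this sketch */

/*
 * Count (hop_ptr, hop_cnt) pairs that get past the new hop_cnt guard,
 * enter the "intermediate hop" branch, and still trip the added test.
 */
unsigned count_reachable_hits(void)
{
    unsigned hits = 0;

    assert(IB_SMP_MAX_PATH_HOPS <= UINT8_MAX);
    for (unsigned cnt = 0; cnt <= UINT8_MAX; cnt++) {
        if (cnt >= IB_SMP_MAX_PATH_HOPS)
            continue;           /* discarded by the first added check */
        for (unsigned ptr = 0; ptr <= UINT8_MAX; ptr++) {
            if (!(ptr && ptr < cnt))
                continue;       /* not the intermediate hop case */
            if (ptr + 1 >= IB_SMP_MAX_PATH_HOPS)
                hits++;         /* the second added check fires */
        }
    }
    return hits;
}
```

The count comes out to zero, since hop_ptr can be at most
IB_SMP_MAX_PATH_HOPS - 2 inside that branch.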

- Sean

RE: [ofa-general] [PATCH] perftest Add rdma_cm retries

2009-07-24 Thread Sean Hefty

Here is version 2 of the patch.  Based on observations of tests, I believe
Steve Wise's comments are reasonable, so I removed the rdma_resolve_addr
retry and simply changed the timeout value.  Feel free to use whichever one
of these patches you like best.  However, I urge you to apply one of these,
since the programs fail in a busy large fabric.

Why not just make the retry and timeout values command line parameters and allow
adjusting both?

- Sean

RE: [ofa-general] [PATCH] perftest Add rdma_cm retries

2009-07-24 Thread Sean Hefty

I'm not sure we need the retry.

On IB, resolve route is done using unreliable datagram with no lower level
timeout or retry.

- Sean

[ofa-general] RE: Running more than 894 processes doing rdma_listen

2009-07-22 Thread Sean Hefty

Is there an explicit limit on the number of ports that can be listening using
rdma_cm?

There's no inherent limit built into the code.

It prints out CMA: unable to open RDMA device

It then doesn't gracefully handle that problem, ending in

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 47695401269920 (LWP 30003)]
__ibv_close_device (context=0x0) at src/device.c:154
154 int async_fd = context->async_fd;
(gdb) where
#0  __ibv_close_device (context=0x0) at src/device.c:154
#1  0x0034e360184f in ucma_cleanup () at src/cma.c:165
#2  0x0034e3601a13 in ucma_init () at src/cma.c:257
#3  0x0034e3602080 in rdma_create_event_channel () at src/cma.c:299
#4  0x00403077 in main (argc=4, argv=0x7fffb739fcc8) at rdma_bw.c:1057

Thanks - I see where the bug is for this.

- Sean

RE: [ofa-general] sending mad in parallel mode and perfquery

2009-07-09 Thread Sean Hefty

The ibumad library has functions send_mad and recv_mad which have to be called
sequentially.
Is it possible to create a function which would send several MADs to
several destinations and then wait for replies (in terms of the ib driver)?

I'm not sure that send_mad and recv_mad don't do what you want.  To send to
multiple destinations, call send_mad multiple times.  The call returns after
posting or queuing the send operation to the QP.  It does not wait for a
response or guarantee that the send has actually been placed on the wire before
returning.

recv_mad blocks until any response is received, and it can be called from
multiple threads.  recv_mad only has multi-threaded issues if MADs > 256 bytes
are received.

- Sean

RE: [ofa-general] rdma_listen() backlog

2009-07-09 Thread Sean Hefty

Maybe I've missed something, but the last time I checked it appeared
to me that for kernel RDMA CM the 'backlog' parameter was not used at
all unless for iWarp transport.

It's not used for kernel IB connections.  Since connection requests are reported
through a callback, there's nothing to queue and it's unneeded.

It is used for userspace connections.

[ofa-general] RE: Question on rdma_resolve_route and retries

2009-07-08 Thread Sean Hefty

We are trying to use OpenMPI 1.3.2 with rdma_cm support on an Infiniband fabric
using OFED 1.4.1.  When the MPI jobs get large enough, the event response to
rdma_resolve_route becomes RDMA_CM_EVENT_ROUTE_ERROR with a status of
ETIMEDOUT.

Yep - you pretty much need to connect out of band with all large MPI jobs using
made up path data, or enable some sort of PR caching.

It seems pretty clear that the SA path record requests are being synchronized
and bunching together, and in the end exhausting the resources of the subnet
manager node so only the first N are actually received.

In our testing, we discovered that the SA almost never dropped any queries.  The
problem was that the backlog grew so huge, that all requests had timed out
before they could be acted on.  There's probably something that could be done
here to avoid storing received MADs for extended periods of time.

The sequence seems to be:

call librdmacm-1.0.8/src/cma.c's rdma_resolve_route

which translates directly into a kernel call into infiniband/core/cma.c's
rdma_resolve_route

with an IB fabric becomes a call into cma_resolve_ib_route

which leads to a call to cma_query_ib_route

which gets to calling infiniband/core/sa_query.c's ib_sa_path_rec_get with the
callback pointing to cma_query_handler

When cma_query_handler gets a callback with a bad status, it sets the returned
event to RDMA_CM_EVENT_ROUTE_ERROR

Nowhere in there do I see any retry attempts.  If the SA path record query
packet, or its response packet, gets lost, then the timeout eventually happens
and we see RDMA_CM_EVENT_ROUTE_ERROR with a status of ETIMEDOUT.

The kernel sa_query module does not issue retries.  All retries are the
responsibility of the caller.  This gives greater flexibility to how timeouts
are handled, but has the drawback that all 'retries' are really new
transactions.
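
Since each retry is a new transaction issued by the caller, a caller-side loop is
one way to handle it.  The sketch below is an illustration under assumptions:
resolve_fn stands in for whatever issues rdma_resolve_route() and waits for the
resulting event, and fake_resolve() is a made-up stub so the loop can be
exercised on its own.  The doubling backoff is an arbitrary choice, not
something sa_query does.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback: returns 0 on success or a negative errno. */
typedef int (*resolve_fn)(void *ctx, int timeout_ms);

/* Retry only on timeout, doubling the timeout for each new transaction. */
int resolve_with_retries(resolve_fn resolve, void *ctx,
                         int timeout_ms, int max_tries)
{
    int ret = -ETIMEDOUT;

    assert(resolve != NULL && max_tries > 0);
    for (int i = 0; i < max_tries; i++) {
        ret = resolve(ctx, timeout_ms);
        if (ret != -ETIMEDOUT)
            break;              /* success, or an error not worth retrying */
        timeout_ms *= 2;
    }
    return ret;
}

/* Stub: "succeeds" once the timeout reaches the threshold stored in ctx. */
int fake_resolve(void *ctx, int timeout_ms)
{
    int needed_ms = *(int *)ctx;

    return timeout_ms >= needed_ms ? 0 : -ETIMEDOUT;
}
```

An application would pass its own wrapper in place of fake_resolve() and pick
max_tries to bound the total wait.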

First question: Did I miss a retry buried somewhere in all of that?

I don't believe so.

Second question: How does somebody come up with a timeout value that makes
sense?  Assuming retries are the responsibility of the rdma_resolve_route
caller, you would like to have a value that is long enough to avoid false
timeouts when a response is eventually going to make it, but not any longer.
This value seems like it would be dependent on the fabric and the capabilities
of the node running the subnet manager, and should be a fabric-specific
parameter instead of something chosen at random by each caller of
rdma_resolve_route.

The timeout is also dependent on the load hitting the SA.  I don't know that a
fabric-specific parameter can work.

- Sean

[ofa-general] RE: Question on rdma_resolve_route and retries

2009-07-08 Thread Sean Hefty

This is encouraging.  I did try testing with 10,000 ms timeouts and still got
the failure with only 800 different processes, so I jumped to the conclusion
that the queries were being dropped.  Do you have a guess as to a timeout value
that would always succeed?

We ended up around a 60 second timeout based on the number of connections and
how quickly our SM node could process queries.  This was done a while ago, and
there have been a lot of improvements to opensm since then.  I don't know of an
easy way to test the performance of the SM.  It's also possible that our test
staggered the queries just enough that the SM could keep up receiving them.

Maybe I should have come up with a better name.  By fabric-specific, I meant a
specific implementation of the fabric, including the capability of the subnet
manager node.  How does somebody writing rdma_cm code come up with a number?
That particular program might not put much of a load on the SA, but could run
concurrently with other jobs that do (or don't).  It would be nice to have a
way to set up the retry mechanism so that it would work on any system it ran
on.

Maybe the SA service could track the SA response time and adjust the timeout
accordingly.  E.g. guess = .2(last response) + .8(last guess).  Users could
indicate that the default timeout could be used.
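
As a minimal sketch of that running estimate, in integer milliseconds.  The
struct and function names are hypothetical; nothing like this exists in the SA
code today.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-SA timeout estimator state. */
struct sa_timeout_est {
    unsigned guess_ms;
};

/* guess = .2(last response) + .8(last guess), using integer arithmetic. */
unsigned sa_timeout_update(struct sa_timeout_est *est,
                           unsigned last_response_ms)
{
    assert(est != NULL);
    est->guess_ms = (2 * last_response_ms + 8 * est->guess_ms) / 10;
    return est->guess_ms;
}
```

Starting from a 1000 ms guess, two consecutive 2000 ms responses would move the
estimate to 1200 ms and then 1360 ms, so a single slow response does not swing
the timeout much.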

Apps could also help by staggering their start times to avoid hitting the SA
with hundreds of thousands of queries at once.

- Sean

RE: [ofa-general] cmatose fails whereas rping passes on iWarp

2009-07-08 Thread Sean Hefty

I did this change and the hang went away as well.

I think cmatose.c needs this fix.


 ucmatose completes when I change the following line:
send_wr.send_flags = 0;
 to
send_wr.send_flags = IBV_SEND_SIGNALED;

cmatose sets init_qp_attr.sq_sig_all = 1 when initializing the QP, so I wouldn't
expect this flag to be used.

- Sean

RE: [ofa-general] cmatose fails whereas rping passes on iWarp

2009-07-03 Thread Sean Hefty

If this test sends data from server side first you could
be running into the iWARP requirement of sending from
client first.

This was my thought as well.  I think Chelsio supports sending from the server
side first, but I'm not sure, or if it's enabled by default.

- Sean

RE: [ofa-general] ib_rdma_bw - memory leaks?

2009-07-02 Thread Sean Hefty

As mentioned in my previous email, there are other 3 places of memory leaks,
should I proceed and fix them up in rdma_bw.c file?

I think that makes sense; I was only commenting on the code that I maintain.
Based on looking at the git trees, it appears that Owen Meron is the maintainer
of ib_rdma_bw.

- Sean

RE: [ofa-general] cmatose fails whereas rping passes on iWarp

2009-07-02 Thread Sean Hefty

I'm wondering if anybody else has seen this behavior.
Is cmatose expected to work on iWarp?

It's intended to work on iWarp.

[r...@lv2 examples]# ./ucmatose -s 192.168.10.30
cmatose: starting client
cmatose: connecting
cmatose: event: RDMA_CM_EVENT_CONNECT_ERROR, error: -22

This looks like an asynchronous error occurring while trying to connect.  I
don't see anything obvious in cmatose.c that would lead to a connect error.
Does anything occur on the server side?

- Sean

RE: [ofa-general] ib_rdma_bw - memory leaks?

2009-07-01 Thread Sean Hefty

3. rdma_create_event_channel() calls ucma_init() but
rdma_destroy_event_channel() does not call ucma_cleanup(). This results in a
memory leak in the provider's library since it does not call ibv_close_device()
and is thus unable to do *-free_context().

This is the correct behavior.  ucma_init() is called from several routines to
ensure that the library performs proper initialization.  Once initialized, it
remains initialized until the library is no longer used.  The cleanup is done in
rdma_cma_fini().

- Sean

RE: [ofa-general] Sending two integers via RDMA_WRITE

2009-06-30 Thread Sean Hefty

I want to use completion queue element on  completion queue associated
with received queue (on remote hca) to allow reading databuffer.
But I get nothing from the completion queue.

You need to send immediate data with an RDMA write to generate a completion on
the remote side.  Otherwise, a receive work request is not consumed.

In the specification, it says that a CQE should be created (in the
remote hca) after performing a rdma write

See C10-87 (page 511)

[ofa-general] [PATCH] dapl/windows: remove dlist.c

2009-06-29 Thread Sean Hefty

All dlist functions have been moved to the header file.  Remove
references to dlist.c.

Signed-off-by: Sean Hefty <sean.he...@intel.com>
---

 dapl/openib_cma/dapl_ib_util.c |1 -
 dapl/openib_scm/dapl_ib_cq.c   |1 -
 2 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/dapl/openib_cma/dapl_ib_util.c b/dapl/openib_cma/dapl_ib_util.c
index bf23d43..f48c1cb 100755
--- a/dapl/openib_cma/dapl_ib_util.c
+++ b/dapl/openib_cma/dapl_ib_util.c
@@ -56,7 +56,6 @@ struct dapl_llist_entry *g_hca_list;
 
 #if defined(_WIN64) || defined(_WIN32)
 #include "..\..\..\..\..\etc\user\comp_channel.cpp"
-#include "..\..\..\..\..\etc\user\dlist.c"
 #include <rdma\winverbs.h>
 
 struct ibvw_windata windata;
diff --git a/dapl/openib_scm/dapl_ib_cq.c b/dapl/openib_scm/dapl_ib_cq.c
index 2af1889..8a9a2ab 100644
--- a/dapl/openib_scm/dapl_ib_cq.c
+++ b/dapl/openib_scm/dapl_ib_cq.c
@@ -55,7 +55,6 @@
 
 #if defined(_WIN64) || defined(_WIN32)
 #include "..\..\..\..\..\etc\user\comp_channel.cpp"
-#include "..\..\..\..\..\etc\user\dlist.c"
 
 void dapli_cq_thread_destroy(struct dapl_hca *hca_ptr)
 {



RE: [ofa-general] verb level interoperability between vendor's hcas

2009-06-28 Thread Sean Hefty

Is a mixed HCA environment cluster not ready for prime time - yet?

Are the crashes in the kernel or userspace?  Is there a specific HCA on the
nodes that crash?

Interop testing is done, but I do not know the details of the configurations and
tests that are run. 

RE: [ofa-general] Re: [PATCH 0/2] Opensm support for external routing engines

2009-06-18 Thread Sean Hefty

 The idea is to include non-open source routing algorithms into opensm on
 demand, which is permitted by the BSD license.

It is permitted, but I don't think that we as open source community
need to support such efforts.

I agree with this.  This sets a precedent of opening up the source code to all
sorts of changes that become difficult to test and maintain.

Anyone is free to take opensm, integrate their own changes, and release
separately, but the burden of maintaining those changes should not rest on the
open source community at large.

- Sean

RE: [ofa-general] [PATCH 0/9] RDMAoE - RDMA over Ethernet

2009-06-17 Thread Sean Hefty

LL: the RDMA stack will see that the port has different link types.
SLs map cleanly to VLAN user priorities.

LL: you need to emulate *enough* so that typical applications don't need
to worry about the link type. SA path queries is the best example.
Otherwise, every RDMA application (not necessarily a CMA app) will need
to have different code paths depending on the link type.

Let's just say that at this point I completely disagree with where these patches
try to abstract the differences, which are many.

RDMA apps that want to use this and IB without going through an abstraction will
need different code -- just like they would for iWarp, which also provides RDMA
over Ethernet, and is a standard.  IB mad and SA query modules are not
appropriate places for abstracting the differences between IB, iWarp, and
whatever name we give this.

This could change depending on whether this is really trying to be IB with a
different L2, or is just another RDMA protocol that runs on Ethernet.

- Sean

RE: [ewg] RE: [ofa-general] [PATCH 4/9] ib_core: Add RDMAoE SA support

2009-06-17 Thread Sean Hefty

 How can a user control this?  An app needs the same qkey for unicast traffic.

In RDMAoE, the qkey has a fixed well-known value, which will be
returned both by multicast and path queries.

The rdma_cm defines and uses a different well-known qkey.

RE: [ofa-general] [PATCH 0/9] RDMAoE - RDMA over Ethernet

2009-06-16 Thread Sean Hefty

RDMA over Ethernet (RDMAoE) allows running the IB transport protocol over
Ethernet, providing IB capabilities for Ethernet fabrics. The packets are
standard Ethernet frames with an Ethertype, an IB GRH,  unmodified IB transport
headers and payload. HCA RDMAoE ports are no different than regular IB ports
from the RDMA stack perspective.

I would refer to this as IBoE, not RDMAoE.

The RDMA stack should see these ports differently from regular IB HCA ports.
There are a lot of differences that should not simply be hidden or incorrectly
assumed: QP0, QoS, multiple paths, routing(?), no SA, etc. 

IB subnet management and SA services are not required for RDMAoE operation;

Then I would not try to emulate it at all.  As Hal mentioned in a separate post,
there are too many ways to interact with the SA that an emulation won't cover.

Ethernet management practices are used instead. In Ethernet, nodes are commonly
referred to by applications by means of an IP address. RDMAoE treats IP
addresses that were assigned to the corresponding Ethernet port as GIDs, and
makes use of the IP stack to bind a destination address to the corresponding
netdevice (just as the CMA does today for IB and iWARP) and to obtain its L2
MAC addresses.

Is the actual L3 address an IP address, or just an encoded IP address in an IBoE
L3 address?  What L3 protocol is being used and will it interoperate with some
peer L3 protocol (IP or IB)?

- Sean

RE: [ofa-general] [PATCH 2/9] ib_core: kernel API for GID -- MAC translations

2009-06-16 Thread Sean Hefty

A few support functions are added to allow the translation from GID to MAC
which is required by hw drivers supporting RDMAoE.

Why not just use IP to MAC calls?  Or use the MAC as the GUID?

Do the GIDs follow the IB GID format?

RE: [ofa-general] [PATCH 4/9] ib_core: Add RDMAoE SA support

2009-06-16 Thread Sean Hefty

diff --git a/drivers/infiniband/core/multicast.c
b/drivers/infiniband/core/multicast.c
index 107f170..2417f6b 100644
--- a/drivers/infiniband/core/multicast.c
+++ b/drivers/infiniband/core/multicast.c
@@ -488,6 +488,36 @@ retest:
   }
 }

+struct eth_work {
+  struct work_struct   work;
+  struct mcast_member *member;
+  struct ib_device    *device;
+  u8                   port_num;
+};
+
+static void eth_mcast_work_handler(struct work_struct *work)
+{
+  struct eth_work *w = container_of(work, struct eth_work, work);
+  int err;
+  struct ib_port_attr port_attr;
+  int status = 0;
+
+  err = ib_query_port(w->device, w->port_num, &port_attr);
+  if (err)
+    status = err;
+  else if (port_attr.state != IB_PORT_ACTIVE)
+    status = -EAGAIN;
+
+  w->member->multicast.rec.qkey = cpu_to_be32(0xc2c);

How can a user control this?  An app needs the same qkey for unicast traffic.

+  atomic_inc(&w->member->refcount);

This needs to be moved below...

+  err = w->member->multicast.callback(status, &w->member->multicast);
+  deref_member(w->member);
+  if (err)
+    ib_sa_free_multicast(&w->member->multicast);
+
+  kfree(w);
+}
+
 /*
  * Fail a join request if it is still active - at the head of the pending
queue.
  */
@@ -586,21 +616,14 @@ found:
   return group;
 }

-/*
- * We serialize all join requests to a single group to make our lives much
- * easier.  Otherwise, two users could try to join the same group
- * simultaneously, with different configurations, one could leave while the
- * join is in progress, etc., which makes locking around error recovery
- * difficult.
- */
-struct ib_sa_multicast *
-ib_sa_join_multicast(struct ib_sa_client *client,
-   struct ib_device *device, u8 port_num,
-   struct ib_sa_mcmember_rec *rec,
-   ib_sa_comp_mask comp_mask, gfp_t gfp_mask,
-   int (*callback)(int status,
-   struct ib_sa_multicast *multicast),
-   void *context)
+static struct ib_sa_multicast *
+ib_join_multicast(struct ib_sa_client *client,
+struct ib_device *device, u8 port_num,
+struct ib_sa_mcmember_rec *rec,
+ib_sa_comp_mask comp_mask, gfp_t gfp_mask,
+int (*callback)(int status,
+struct ib_sa_multicast *multicast),
+void *context)
 {
   struct mcast_device *dev;
   struct mcast_member *member;
@@ -647,9 +670,81 @@ err:
   kfree(member);
   return ERR_PTR(ret);
 }
+
+static struct ib_sa_multicast *
+eth_join_multicast(struct ib_sa_client *client,
+ struct ib_device *device, u8 port_num,
+ struct ib_sa_mcmember_rec *rec,
+ ib_sa_comp_mask comp_mask, gfp_t gfp_mask,
+ int (*callback)(int status,
+ struct ib_sa_multicast *multicast),
+ void *context)
+{
+  struct mcast_device *dev;
+  struct eth_work *w;
+  struct mcast_member *member;
+  int err;
+
+  dev = ib_get_client_data(device, &mcast_client);
+  if (!dev)
+    return ERR_PTR(-ENODEV);
+
+  member = kzalloc(sizeof *member, gfp_mask);
+  if (!member)
+    return ERR_PTR(-ENOMEM);
+
+  w = kzalloc(sizeof *w, gfp_mask);
+  if (!w) {
+    err = -ENOMEM;
+    goto out1;
+  }
+  w->member = member;
+  w->device = device;
+  w->port_num = port_num;
+
+  member->multicast.context = context;
+  member->multicast.callback = callback;
+  member->client = client;
+  member->multicast.rec.mgid = rec->mgid;
+  init_completion(&member->comp);
+  atomic_set(&member->refcount, 1);
+
+  ib_sa_client_get(client);
+  INIT_WORK(&w->work, eth_mcast_work_handler);
+  queue_work(mcast_wq, &w->work);
+
+  return &member->multicast;

The user could leave/destroy the multicast join request before the queued work
item runs.  We need to hold an additional reference on the member until the work
item completes.

+
+out1:
+  kfree(member);
+  return ERR_PTR(err);
+}
+
+/*
+ * We serialize all join requests to a single group to make our lives much
+ * easier.  Otherwise, two users could try to join the same group
+ * simultaneously, with different configurations, one could leave while the
+ * join is in progress, etc., which makes locking around error recovery
+ * difficult.
+ */
+struct ib_sa_multicast *
+ib_sa_join_multicast(struct ib_sa_client *client,
+   struct ib_device *device, u8 port_num,
+   struct ib_sa_mcmember_rec *rec,
+   ib_sa_comp_mask comp_mask, gfp_t gfp_mask,
+   int (*callback)(int status,
+   struct ib_sa_multicast *multicast),
+   void *context)
+{
+  return

RE: [ofa-general] [PATCH 1/9] ib_core: Add API to query port link type

2009-06-15 Thread Sean Hefty

This allows querying the type of a port, either Ethernet or IB, which is
required by the following patches implementing RDMA over Ethernet (RDMAoE).

I don't know if this makes more sense without studying the changes in more
detail, but was there a reason why node_type just wasn't extended instead?


RE: [ofa-general] spin_lock_irqsave in ib_send_mad

2009-06-11 Thread Sean Hefty

    spin_lock_irqsave(&qp_info->send_queue.lock, flags);
    if (qp_info->send_queue.count < qp_info->send_queue.max_active) {
+       qp_info->send_queue.count++;

+       spin_unlock_irqrestore(&qp_info->send_queue.lock, flags);

        ret = ib_post_send(mad_agent->qp, &mad_send_wr->send_wr,
                           &bad_send_wr);

+       spin_lock_irqsave(&qp_info->send_queue.lock, flags);
        list = &qp_info->send_queue.list;
    } else {
        ret = 0;
+       qp_info->send_queue.count++;
        list = &qp_info->overflow_list;
    }

    if (!ret)
        list_add_tail(&mad_send_wr->mad_list.list, list);
+   else
+       qp_info->send_queue.count--;

It's not quite this simple.  Once the lock is released before calling
ib_post_send, another thread could come down and queue a MAD to the overflow
list.  If ib_post_send fails, the overflow list must be checked to see if a
queued mad should now be sent.
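
To make the race concrete, here is a rough single-threaded model of the counters involved.  It is a sketch only: the names are made up, it is not the ib_mad code, locking is left out (each function stands for a step taken with the lock held), and it assumes that re-posting a promoted send succeeds.

```c
#include <stdbool.h>

/* Hypothetical model; field and function names are illustrative only. */
struct sq_model {
	int count;	/* posted sends plus sends parked on overflow */
	int max_active;	/* limit on sends posted to the QP at once */
	int posted;	/* entries on the send list */
	int overflow;	/* entries on the overflow list */
};

/* Take a slot.  Returns true if the caller should post to the QP. */
static bool sq_reserve(struct sq_model *q)
{
	if (q->count++ < q->max_active)
		return true;
	q->overflow++;		/* no room: park on the overflow list */
	return false;
}

/*
 * Give a slot back.  If count exceeded max_active before the decrement,
 * a send is parked and must be moved to the send list; returning true
 * tells the caller to post it.
 */
static bool sq_release(struct sq_model *q)
{
	if (q->count-- > q->max_active) {
		q->overflow--;
		q->posted++;
		return true;
	}
	return false;
}

/* Lock retaken after the post attempt: record the result. */
static bool sq_post_done(struct sq_model *q, bool post_ok)
{
	if (post_ok) {
		q->posted++;
		return false;
	}
	return sq_release(q);	/* a failed post frees a slot too */
}

/* Send completion: drop the finished entry, then free its slot. */
static bool sq_complete(struct sq_model *q)
{
	q->posted--;
	return sq_release(q);
}
```

With max_active = 1: thread A reserves a slot and drops the lock, thread B arrives and is parked on overflow, then A's post fails.  Unless the failure path goes through the same check as a completion, B's MAD sits on the overflow list with nothing left to trigger it.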

As for being able to hold a lock when calling ib_post_send, that's something
that should be allowed.

- Sean

RE: [ofa-general] spin_lock_irqsave in ib_send_mad

2009-06-11 Thread Sean Hefty

Why check the overflow list only when the ib_post_send fails? Don't you
need to do this regardless? It looks like you could get stuff into the overflow
list even with the existing code...

You only need to check it when decrementing send_queue.count, which is currently
only after a send completes.

[ofa-general] [ofw] [PATCH-resend] ib-mgmt/libibnetdisc: fix typecast warning

2009-06-10 Thread Sean Hefty

Signed-off-by: Sean Hefty <sean.he...@intel.com>
---
I tried converting ib_portid_t lid to a uint16_t, but that resulted in a cascade
of changes, so I kept the simple approach.  :)

Resending - I didn't see a response to this.

 infiniband-diags/libibnetdisc/src/ibnetdisc.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c
b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
index 1e93ff8..baea98e 100644
--- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c
+++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
@@ -188,7 +188,7 @@ extend_dpath(struct ibnd_fabric *f, ib_portid_t *portid, int
nextport)
f->fabric.ibmad_port) < 0)
return -1;
 
-   portid->drpath.drslid = f->selfportid.lid;
+   portid->drpath.drslid = (uint16_t) f->selfportid.lid;
portid->drpath.drdlid = 0x;
}
 



___
ofw mailing list
o...@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw

RE: [ofa-general] spin_lock_irqsave in ib_send_mad

2009-06-10 Thread Sean Hefty

mad.c:ib_send_mad() calls ib_post_send() after taking spin_lock_irqsave().

Is it really necessary to take the spinlock during the entire time of
ib_post_send()? It appears like it is only necessary for list manipulation!

It protects the list and the counters.  It's technically not needed around
ib_post_send.

- Sean

RE: [ofa-general] Memory registration redux

2009-06-08 Thread Sean Hefty

Are there any comparable Windows plans?

I believe that Windows already provides equivalent functionality as part of
the OS (Windows 2008 / Vista).  See SecureMemoryCacheCallback.  There are no
plans for WinOF to provide anything separately from this.  (It's likely
impossible without OS support anyway.)

- Sean

[ofa-general] RE: [ofw] skipping QP states during transitions

2009-06-05 Thread Sean Hefty

No, you need to move from reset to init to RTR and only then to RTS.

Ok - thanks.

Look at the IB spec on section 10.3

I was just exploring whether any hardware, separate from the existing software
stacks, supported 'skipping' QP states -- assuming necessary values for the
other states were also given.  In theory, hardware could walk through the states
internally.  The motivation is to decrease the time to connect QPs by reducing
the number of commands that need to be issued to the hardware.
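
As a toy illustration of the idea (not a description of any real HCA or of the verbs API), the difference can be modeled as below.  The enum, structure, and function names are hypothetical, and only the forward connect path is modeled; real QPs have more states and transitions than this.

```c
/* Simplified model of the connect path only. */
enum qp_state { QP_RESET, QP_INIT, QP_RTR, QP_RTS };

struct qp_model {
	enum qp_state state;
	int commands;	/* commands issued to the (modeled) hardware */
};

/* Current rule: each command advances the QP by exactly one state. */
static int qp_modify(struct qp_model *qp, enum qp_state next)
{
	if ((int) next != (int) qp->state + 1)
		return -1;	/* skipping a state is rejected */
	qp->state = next;
	qp->commands++;
	return 0;
}

/* Hypothetical 'skip' command: hardware walks the states internally. */
static int qp_modify_skip(struct qp_model *qp, enum qp_state target)
{
	if (target <= qp->state)
		return -1;
	while (qp->state != target)
		qp->state = (enum qp_state) (qp->state + 1);
	qp->commands++;		/* software issued a single command */
	return 0;
}
```

Going from reset to RTS costs three commands with qp_modify() and one with qp_modify_skip(), which is the entire saving being asked about.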

And to be clear, I'm not suggesting that such a feature is all that important.
I'm just exploring ideas.

- Sean

[ofa-general] skipping QP states during transitions

2009-06-04 Thread Sean Hefty

Does anyone know if the HCAs are capable of transitioning directly from reset to
RTS using a single command?

[ofa-general] [PATCH] dapl/windows cma provider: add support for network devices based on index

2009-06-02 Thread Sean Hefty

The linux cma provider provides support for named network devices, such
as 'ib0' or 'eth0'.  This allows the same dapl configuration file to 
be used easily across a cluster.

To allow similar support on Windows, allow users to specify the device
name 'rdma_devN' in the dapl.conf file.  The given index, N, is mapped to a
corresponding IP address that is associated with an RDMA device.

Signed-off-by: Sean Hefty <sean.he...@intel.com>
---
diff -up -r -X \mshefty\scm\winof\trunk\docs\dontdiff.txt -I '\$Id:' 
trunk\ulp\dapl2/dapl/openib_cma/dapl_ib_util.c
branches\winverbs\ulp\dapl2/dapl/openib_cma/dapl_ib_util.c
--- trunk\ulp\dapl2/dapl/openib_cma/dapl_ib_util.c  2009-05-01 
10:18:28.0 -0700
+++ branches\winverbs\ulp\dapl2/dapl/openib_cma/dapl_ib_util.c  2009-06-02 
15:26:19.534649800 -0700
@@ -57,10 +57,50 @@ struct dapl_llist_entry *g_hca_list;
 #if defined(_WIN64) || defined(_WIN32)
 #include "..\..\..\..\..\etc\user\comp_channel.cpp"
 #include "..\..\..\..\..\etc\user\dlist.c"
+#include <rdma\winverbs.h>
 
-#define getipaddr_netdev(x,y,z) -1
 struct ibvw_windata windata;
 
+static int getipaddr_netdev(char *name, char *addr, int addr_len)
+{
+   IWVProvider *prov;
+   WV_DEVICE_ADDRESS devaddr;
+   struct addrinfo *res, *ai;
+   HRESULT hr;
+   int index;
+
+   if (strncmp(name, "rdma_dev", 8)) {
+       return EINVAL;
+   }
+
+   index = atoi(name + 8);
+
+   hr = WvGetObject(&IID_IWVProvider, (LPVOID *) &prov);
+   if (FAILED(hr)) {
+       return hr;
+   }
+
+   hr = getaddrinfo("..localmachine", NULL, NULL, &res);
+   if (hr) {
+       goto release;
+   }
+
+   for (ai = res; ai; ai = ai->ai_next) {
+       hr = prov->lpVtbl->TranslateAddress(prov, ai->ai_addr, &devaddr);
+       if (SUCCEEDED(hr) && (ai->ai_addrlen <= addr_len) && (index-- == 0)) {
+           memcpy(addr, ai->ai_addr, ai->ai_addrlen);
+           goto free;
+       }
+   }
+   hr = ENODEV;
+
+free:
+   freeaddrinfo(res);
+release:
+   prov->lpVtbl->Release(prov);
+   return hr;
+}
+
 static int dapls_os_init(void)
 {
    return ibvw_get_windata(&windata, IBVW_WINDATA_VERSION);
diff -up -r -X \mshefty\scm\winof\trunk\docs\dontdiff.txt -I '\$Id:' 
trunk\ulp\dapl2/dapl/openib_cma/SOURCES
branches\winverbs\ulp\dapl2/dapl/openib_cma/SOURCES
--- trunk\ulp\dapl2/dapl/openib_cma/SOURCES 2009-05-27 07:25:19.0 
-0700
+++ branches\winverbs\ulp\dapl2/dapl/openib_cma/SOURCES 2009-06-02 
10:38:04.799012200 -0700
@@ -45,10 +45,12 @@ TARGETLIBS= \
$(SDK_LIB_PATH)\ws2_32.lib \
 !if $(FREEBUILD)
$(TARGETPATH)\*\dat2.lib \
+   $(TARGETPATH)\*\winverbs.lib \
$(TARGETPATH)\*\libibverbs.lib \
$(TARGETPATH)\*\librdmacm.lib
 !else
$(TARGETPATH)\*\dat2d.lib \
+   $(TARGETPATH)\*\winverbsd.lib \
$(TARGETPATH)\*\libibverbsd.lib \
$(TARGETPATH)\*\librdmacmd.lib
 !endif


RE: [ofa-general] SubnAdmGet (6777)

2009-06-01 Thread Sean Hefty

I could not find anywhere in the spec how should the SA respond to
SubnAdmGet() in case there is more than one record. What I did find
is an example of path query mad, and it was with SubnAdmGetTable().

PR NumbPath - 'In a SubnAdmGet() query request, ignored; a value of 1 is used.'

I'm not sure how else you can interpret this except to mean the same as for
SubnAdmGetTable: 'If more paths that satisfy the PathRecord query exist for a
given SGID-DGID combination, only NumbPath paths shall be returned
(implementation defined).'

- Sean

RE: [ofa-general] SubnAdmGet (6777)

2009-06-01 Thread Sean Hefty

No, it is correct as is (returning an error of too many records for
this case). See p.944:

15.4.6 SUBNADMGET() / SUBNADMGETRESP(): GET AN ATTRIBUTE

C15-0.1.30: In response to a SubnAdmGet(), if a single attribute would
be returned based on the access rules specified in 15.4.1 Restrictions on
Access on page 938 and the matching of components specified by the
ComponentMask, then SubnAdmGetResp() shall return that attribute with
a zero status value.

C15-0.1.31: If SubnAdmGet() fails to satisfy C15-0.1.30, SubnAdmGetResp()
shall return with the status field providing the reason for failure
(see Table 190 SA MAD Class-Specific Status Encodings on page 900).

This ignores NumbPath = 1 (or defines NumbPath differently for PR SubnAdmGet
versus SubnAdmGetTable).  With NumbPath = 1, only a single attribute should be
returned.

- Sean

RE: [ofa-general] SubnAdmGet (6777)

2009-06-01 Thread Sean Hefty

Yes, it is different from GetTable in that SA pares the responses down
to that but Get doesn't (have that additional language to pare them
down).

This seems like an implementation issue (aka bug) with the SA to me.

The language about NumbPath for Get was originally added to indicate
that the NumbPath was ignored on a Get even if it was included in the
component mask.

It states that it's ignored and a value of 1 is used.  What else would a
NumbPath value of 1 mean if it's completely ignored?  I consider this a spec
bug.  :)

From an implementation view, requiring users to use SubnAdmGetTable to get a
single path record is less efficient than returning a single PR from SubnAdmGet.

How have other SM implementations (not based on opensm) interpreted NumbPath for
PR SubnAdmGet?

RE: [ofa-general] RDMA_CM--how to include an SRQ?

2009-06-01 Thread Sean Hefty

Is there an example of how to incorporate an SRQ into using RDMA CM and IB
verbs?  Thank you for any assistance or suggestions.

I don't believe so.  libibverbs has an srq_pingpong example program that uses an
SRQ.  Using an SRQ with the rdma_cm is basically trivial if the QP is created
using rdma_create_qp.  The rdma_cm reads the struct ibv_srq * field from the
struct ibv_qp when establishing a connection.  If the QP is created directly
from libibverbs (ibv_create_qp), then the user should just indicate that an SRQ
is in use when connecting.

Note that I don't believe there's any real requirement to indicate to the remote
side of a connection that an SRQ is in use.  The remote QP doesn't use this
information.

- Sean

RE: [ofa-general] two questions about RDMA_CM_EVENT_TIMEWAIT_EXIT and the TimeWait state

2009-05-27 Thread Sean Hefty

 Note that a lot of (most?) connections between QPs are established out of band
 using TCP, and these are not tracked by the CM or go through any sort of
 timewait before potentially being reused.

I don't quite understand this. Could you please point me to places
(code, IB spec, so on) where I could poke around?

MPIs typically connect QPs by connecting over sockets and exchanging the QP
information that way.  The QPs are then modified directly using a combination of
locally read and hard-coded values.  The libibverbs examples along with the
perftest programs can connect QPs in this fashion.
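
As a rough sketch of what gets exchanged over the socket, each side can pack its LID, QP number, and starting PSN into a short text message and parse the peer's copy.  This is loosely modeled on the style of the libibverbs pingpong examples; the structure and function names are hypothetical, the GID is left out, and the socket code itself is omitted.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical container for the values a peer needs to connect a QP. */
struct qp_info {
	uint16_t lid;	/* local port LID */
	uint32_t qpn;	/* 24-bit QP number */
	uint32_t psn;	/* 24-bit starting packet sequence number */
};

/* Format as "llll:qqqqqq:pppppp" (hex).  Returns 0 on success. */
static int qp_info_pack(const struct qp_info *info, char *buf, size_t len)
{
	int n = snprintf(buf, len, "%04x:%06x:%06x",
			 (unsigned int) info->lid,
			 (unsigned int) (info->qpn & 0xffffff),
			 (unsigned int) (info->psn & 0xffffff));

	return (n > 0 && (size_t) n < len) ? 0 : -1;
}

/* Parse a message produced by qp_info_pack().  Returns 0 on success. */
static int qp_info_unpack(const char *buf, struct qp_info *info)
{
	unsigned int lid, qpn, psn;

	if (sscanf(buf, "%x:%x:%x", &lid, &qpn, &psn) != 3)
		return -1;
	if (lid > 0xffff || qpn > 0xffffff || psn > 0xffffff)
		return -1;	/* out of range for the wire fields */

	info->lid = (uint16_t) lid;
	info->qpn = qpn;
	info->psn = psn;
	return 0;
}
```

Each side would then feed the peer's values into its own QP modify calls, with the remaining attributes (timeouts, retry counts, and so on) hard-coded.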

RE: [ofa-general] two questions about RDMA_CM_EVENT_TIMEWAIT_EXIT and the TimeWait state

2009-05-26 Thread Sean Hefty

In 12.9.6 of the Infiniband Architecture v1.2, it seemed that a QP
could enter the TimeWait state without having entered the Established
state first, via the RTU timeout. Could a RDMA_CM_EVENT_TIMEWAIT_EXIT
happen right after a RDMA_CM_EVENT_CONNECT_REQUEST without a
RDMA_CM_EVENT_ESTABLISHED? If yes, our ULP would have to cleanup some
resources in case RDMA_CM_EVENT_TIMEWAIT_EXIT happens on passive side.

Yes, it's possible to enter timewait without going through established.  I'd
have to walk through the code at this point to identify all of the cases.

Note that a lot of (most?) connections between QPs are established out of band
using TCP, and these are not tracked by the CM or go through any sort of
timewait before potentially being reused.

- Sean

RE: [ofa-general] [PATCH] uDAPL (v2.0) linux_osd: use pthread_self instead of getpid for debug messages

2009-05-18 Thread Sean Hefty

please copy the ofw mail list on dapl changes

diff --git a/dapl/udapl/linux/dapl_osd.h b/dapl/udapl/linux/dapl_osd.h
index 1c098c5..0378a70 100644
--- a/dapl/udapl/linux/dapl_osd.h
+++ b/dapl/udapl/linux/dapl_osd.h
@@ -572,8 +572,7 @@ dapl_os_strtol(const char *nptr, char **endptr, int base)
 #define dapl_os_vprintf(fmt,args) vprintf(fmt,args)
 #define dapl_os_syslog(fmt,args)  vsyslog(LOG_USER|LOG_WARNING,fmt,args)

-#define dapl_os_getpid getpid
-
+#define dapl_os_getpid (long int)pthread_self

Maybe add a new call, dapl_os_get_thread_id or something similar, to avoid
confusion with the name and what the call returns.

RE: [ofa-general] [PATCH] uDAPL (v2.0) linux_osd: use pthread_self instead of getpid for debug messages

2009-05-18 Thread Sean Hefty

diff --git a/dapl/common/dapl_debug.c b/dapl/common/dapl_debug.c
index 20ee405..6c6eeb5 100644
--- a/dapl/common/dapl_debug.c
+++ b/dapl/common/dapl_debug.c
@@ -50,7 +50,7 @@ void dapl_internal_dbg_log(DAPL_DBG_TYPE type, const char
*fmt, ...)
    if (DAPL_DBG_DEST_STDOUT & g_dapl_dbg_dest) {
        va_start(args, fmt);
        fprintf(stdout, "%s:%lx: ", _ptr_host_,
-           dapl_os_getpid());
+           dapl_os_gettid());
dapl_os_vprintf(fmt, args);
va_end(args);
}
diff --git a/dapl/udapl/linux/dapl_osd.h b/dapl/udapl/linux/dapl_osd.h
index 0378a70..e0e30bf 100644
--- a/dapl/udapl/linux/dapl_osd.h
+++ b/dapl/udapl/linux/dapl_osd.h
@@ -572,7 +572,8 @@ dapl_os_strtol(const char *nptr, char **endptr, int base)
 #define dapl_os_vprintf(fmt,args)  vprintf(fmt,args)
 #define dapl_os_syslog(fmt,args)   vsyslog(LOG_USER|LOG_WARNING,fmt,args)

-#define dapl_os_getpid (long int)pthread_self
+#define dapl_os_getpid (int)getpid
+#define dapl_os_gettid (long int)pthread_self

That's fine - what about Windows?  :)

RE: [ofa-general] How to establish IB communcation more effectively?

2009-05-12 Thread Sean Hefty

Just to make sure we're on the same page: both IPoIB and the RDMA-CM
use SA path queries

But ipoib caches its path records...

- Sean

RE: [ofa-general] How to establish IB communcation more effectively?

2009-05-12 Thread Sean Hefty

Yes, of course. But, to start with, let's analyze the case of each node
running --one-- rank and then take it from there to the case where
each node runs C ranks.

The caching is independent of running MPI though.  To get a fair comparison,
you'd probably have to reboot the entire cluster before running the test and
ensure that no other communication between the nodes occurs over ipoib.

For myself, I'm not sure that the tests are the same.  The DAPL providers create
and modify the QPs differently.  I'd need to walk through the code to see
whether QP creation time is included and verify that the QP modify calls are the
same.

As for responding to the initial question, using sockets with hard-coded values
seems to be the most common way to establish IB connections at scale, though I
would guess that using the ib_cm with hard-coded values would work about the
same.
 
- Sean


[ofa-general] [PATCH] ib-mgmt: fixup ibsendtrap for windows

2009-05-04 Thread Sean Hefty

Fix some typecast issues.

Signed-off-by: Sean Hefty sean.he...@intel.com
---

 infiniband-diags/src/ibsendtrap.c |   12 ++--
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/infiniband-diags/src/ibsendtrap.c b/infiniband-diags/src/ibsendtrap.c
index 469bc39..7ad588e 100644
--- a/infiniband-diags/src/ibsendtrap.c
+++ b/infiniband-diags/src/ibsendtrap.c
@@ -66,10 +66,10 @@ static int get_node_type(ib_portid_t *port)
 static void build_trap144(ib_mad_notice_attr_t * n, ib_portid_t *port)
 {
    n->generic_type = 0x80 | IB_NOTICE_TYPE_INFO;
-   n->g_or_v.generic.prod_type_lsb = cl_hton16(get_node_type(port));
+   n->g_or_v.generic.prod_type_lsb = cl_hton16((uint16_t) get_node_type(port));
    n->g_or_v.generic.trap_num = cl_hton16(144);
-   n->issuer_lid = cl_hton16(port->lid);
-   n->data_details.ntc_144.lid = cl_hton16(port->lid);
+   n->issuer_lid = cl_hton16((uint16_t) port->lid);
+   n->data_details.ntc_144.lid = n->issuer_lid;
    n->data_details.ntc_144.local_changes =
        TRAP_144_MASK_OTHER_LOCAL_CHANGES;
    n->data_details.ntc_144.change_flgs =
@@ -79,10 +79,10 @@ static void build_trap144(ib_mad_notice_attr_t * n, ib_portid_t *port)
 static void build_trap129(ib_mad_notice_attr_t * n, ib_portid_t *port)
 {
    n->generic_type = 0x80 | IB_NOTICE_TYPE_URGENT;
-   n->g_or_v.generic.prod_type_lsb = cl_hton16(get_node_type(port));
+   n->g_or_v.generic.prod_type_lsb = cl_hton16((uint16_t) get_node_type(port));
    n->g_or_v.generic.trap_num = cl_hton16(129);
-   n->issuer_lid = cl_hton16(port->lid);
-   n->data_details.ntc_129_131.lid = cl_hton16(port->lid);
+   n->issuer_lid = cl_hton16((uint16_t) port->lid);
+   n->data_details.ntc_129_131.lid = n->issuer_lid;
    n->data_details.ntc_129_131.pad = 0;
    n->data_details.ntc_129_131.port_num = (uint8_t) error_port;
 }




[ofa-general] RE: [PATCH 4/4] ib-mgmt/ibn3 branch: libibnetdisc add windows support

2009-04-27 Thread Sean Hefty

 +#include <infiniband/mad_osd.h>

Why is this inclusion needed? mad_osd.h is included via mad.h.

It's not then, but I prefer to include necessary files directly, rather than
relying on other include files to pick them up.


[ofa-general] RE: [PATCH 4/4] ib-mgmt/ibn3 branch: libibnetdisc add windows support

2009-04-27 Thread Sean Hefty

I would agree in general, but in this specific case it is *_osd.h -
system dependent file which is not included directly, at least not in
libibmad and infiniband-diags up to now (hypothetically in some
implementations it may not exist at all).

libibmad mad.h includes mad_osd.h directly.  I added it to ibnetdisc.h, because
libibnetdisc is a new library and requires OS dependent mechanisms (i.e.
MAD_EXPORT) to export the new interfaces.  I agree in trying to keep mad_osd.h
out of the diags, but libibnetdisc is special within the diags...

I really don't have a strong preference on this, so whatever you want is fine. 
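
For readers unfamiliar with why an OS-dependent header is needed to export library interfaces, here is a minimal sketch of the kind of macro involved. DEMO_EXPORT and demo_add are made-up names for illustration; the actual MAD_EXPORT definition in mad_osd.h is not shown in this thread and may differ.

```c
/* Hypothetical export macro (an assumption, not the mad_osd.h definition).
 * A Windows DLL must mark exported symbols explicitly; with GCC or Clang,
 * default visibility plays the same role for shared objects. */
#if defined(_WIN32)
#define DEMO_EXPORT __declspec(dllexport)
#elif defined(__GNUC__)
#define DEMO_EXPORT __attribute__((visibility("default")))
#else
#define DEMO_EXPORT
#endif

/* A library entry point decorated with the macro. */
DEMO_EXPORT int demo_add(int a, int b)
{
    return a + b;
}
```

Keeping the macro in one OS-specific header means the public prototypes can stay identical on both platforms.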


RE: [ofa-general] [PATCH/Resend] Fixed capability mask problem in ibstat introduced by commit 722b6c6428c9e4921a81f4a6db2838bcee660bb7

2009-04-27 Thread Sean Hefty

OTOH I cannot understand why port->capmask is defined as uint64_t and
not as 32-bit. The kernel uses a 32-bit value and it is shown in this file as
0x%0x.

What about converting the type of port->capmask to uint32_t?

I think that makes the most sense. 


RE: [ofa-general] [PATCH/Resend] Fixed capability mask problem in ibstat introduced by commit 722b6c6428c9e4921a81f4a6db2838bcee660bb7

2009-04-24 Thread Sean Hefty

diff --git a/infiniband-diags/src/ibstat.c b/infiniband-diags/src/ibstat.c
index 7985be1..99af9a8 100644
--- a/infiniband-diags/src/ibstat.c
+++ b/infiniband-diags/src/ibstat.c
@@ -111,7 +111,7 @@ port_dump(umad_port_t *port, int alone)
   printf("%sBase lid: %d\n", pre, port->base_lid);
   printf("%sLMC: %d\n", pre, port->lmc);
   printf("%sSM lid: %d\n", pre, port->sm_lid);
-  printf("%sCapability mask: 0x%08x\n", pre, (unsigned)ntohll(port->capmask));
+  printf("%sCapability mask: 0x%08x\n", pre, (unsigned)(ntohl((uint32_t)(port->capmask))));

Casting from 64-bit to 32-bit, then byte swapping doesn't look right.

I think the problem may be in libibumad, umad.c, line 166:

if (sys_read_uint64(port_dir, SYS_PORT_CAPMASK, &port->capmask) < 0)
    goto clean;

port->capmask = htonl(port->capmask);

capmask is read as a 64-bit value, but only 32-bit swap is used.  (libibumad is
not shared between Linux and Windows, so this problem doesn't show up on
Windows.)

- Sean
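
The width concern raised above can be seen in isolation with a small sketch. It assumes a little-endian host, where htonl() and ntohll() both reduce to byte swaps. The helpers demo_swap32, demo_swap64, demo_matched and demo_mismatched are made up for illustration and are not libibumad or ibstat code.

```c
#include <stdint.h>

/* Portable byte-swap helpers (illustrative only). */
static uint32_t demo_swap32(uint32_t v)
{
    return ((v & 0x000000ffu) << 24) | ((v & 0x0000ff00u) << 8) |
           ((v & 0x00ff0000u) >> 8)  | ((v & 0xff000000u) >> 24);
}

static uint64_t demo_swap64(uint64_t v)
{
    return ((uint64_t)demo_swap32((uint32_t)v) << 32) |
           demo_swap32((uint32_t)(v >> 32));
}

/* Store with a 32-bit swap and read back with a 32-bit swap: round-trips. */
uint32_t demo_matched(uint32_t host_mask)
{
    uint64_t field = demo_swap32(host_mask);   /* upper 32 bits stay zero */
    return demo_swap32((uint32_t)field);
}

/* Store with a 32-bit swap, read back with a 64-bit swap, then truncate:
 * the mask lands in the upper half of the result and the cast drops it. */
uint32_t demo_mismatched(uint32_t host_mask)
{
    uint64_t field = demo_swap32(host_mask);
    return (uint32_t)demo_swap64(field);
}
```

Under these assumptions the mismatched pair returns 0 for any mask, which is why the swap width used when storing the field and the width used when printing it need to agree.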


RE: [ofa-general] [PATCH v3 1/3] Create a new library libibnetdisc

2009-04-23 Thread Sean Hefty

 Where does the definition for ibdebug come from?

It is in ibdiag_common.c. Every infiniband-ibdiag tool is linked with
it. And yes, using this in this library can be problematic since
introduces a hidden dependency.

How does that work?  The library doesn't link ibdiag_common.c, so I'm not sure
what definition it picks up.  Maybe it defaults to undefined, assumed int...

To get things to build and run on Windows, I defined it as a static in the
library.


RE: [ofa-general] [PATCH v3 1/3] Create a new library libibnetdisc

2009-04-23 Thread Sean Hefty

There is also an ibdebug defined in libibmad.

extern int ibdebug;

This is the one it is using...  :-/  I think there should be a wrapper
function.  Perhaps madrpc_show_errors?

Yes - that's the one it picks up.  Adding a wrapper makes sense to me.  (I don't
think that declaring a variable as extern is sufficient to share it across
library boundaries in windows.)
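
A wrapper of the kind suggested here could look roughly like the sketch below. The names demo_set_debug and demo_get_debug are hypothetical; the thread only floats madrpc_show_errors as a possible entry point and does not define a new API.

```c
/* Library side: the debug level stays private to the library. */
static int demo_debug_level;

/* Hypothetical accessors. Exported functions are resolved across a
 * shared-library boundary like any other call, so callers never need to
 * link against the variable itself. */
void demo_set_debug(int level)
{
    demo_debug_level = level;
}

int demo_get_debug(void)
{
    return demo_debug_level;
}
```

A caller would use demo_set_debug(1) where it previously assigned to the global, which avoids relying on data symbols being shared between modules.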


[ofa-general] RE: [PATCH v2] rdma_cm: Add debugfs entries to monitor rdma_cm connections

2009-04-23 Thread Sean Hefty

The output is much easier to read.  :)

@@ -59,6 +62,10 @@ MODULE_LICENSE("Dual BSD/GPL");
 #define CMA_MAX_CM_RETRIES 15
 #define CMA_CM_MRA_SETTING (IB_CM_MRA_FLAG_DELAY | 24)

+#define CASE_RET(val, ret) case val: return #ret;

I would just drop this abstraction.

+static const char *format_node_type(enum rdma_node_type nt)
+{
+  enum rdma_transport_type tt;
+  if (nt) {
+  tt = rdma_node_get_transport(nt);
+  switch (tt) {

We don't really need the local variable tt.

+static int cma_rdma_id_seq_show(struct seq_file *file, void *v)
+{
+  struct rdma_id_private *id_priv;
+  char local_addr[64], remote_addr[64];
+
+  if (!v)
+      return 0;
+  if (v == SEQ_START_TOKEN) {
+      seq_printf(file,
+          "%-5s"
+          "%-8s"
+          "%-5s"
+          "%-8s"
+          "%-52s"
+          "%-52s"
+          "%-6s"
+          "%-15s"
+          "%-8s"
+          "\n",
+          "TYPE", "DEVICE", "PORT", "NET_DEV", "SRC_ADDR", "DST_ADDR",
+          "SPACE", "STATE", "QP_NUM");
+  } else {
+      id_priv = list_entry(v, struct rdma_id_private, list);
+      format_addr((struct sockaddr *)&id_priv->id.route.addr.src_addr,
+          local_addr);
+      format_addr((struct sockaddr *)&id_priv->id.route.addr.dst_addr,
+          remote_addr);
+
+      seq_printf(file,
+          "%-5s"
+          "%-8s"
+          "%-5d"
+          "%-8s"
+          "%-52s"
+          "%-52s"
+          "%-6s"
+          "%-15s"
+          "%-8d"
+          "\n",
+          format_node_type(id_priv->id.route.addr.dev_addr.dev_type),
+          (id_priv->id.device) ? id_priv->id.device->name : "",
+          id_priv->id.port_num,
+          (id_priv->id.route.addr.dev_addr.src_dev) ?
+              id_priv->id.route.addr.dev_addr.src_dev->name : "",
+          local_addr,
+          remote_addr,
+          format_port_space(id_priv->id.ps),
+          format_cma_state(id_priv->state),
+          id_priv->qp_num);
+  }

I still think this requires a lot of scrolling to get past a couple of print
statements.  Can we at least collapse the %-5s ... \n stuff down to a single
line?
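
As an illustration of collapsing the per-column pieces into one line, here is a user-space sketch using snprintf rather than seq_printf. DEMO_ROW_FMT and demo_format_header are hypothetical names, and the single space between columns is an assumption, since the separators in the quoted patch are not recoverable.

```c
#include <stdio.h>

/* Column layout kept in one place; widths mirror the quoted patch. */
#define DEMO_ROW_FMT "%-5s %-8s %-5s %-8s %-52s %-52s %-6s %-15s %-8s\n"

/* Writes the table header into buf and returns snprintf's result. */
int demo_format_header(char *buf, size_t len)
{
    return snprintf(buf, len, DEMO_ROW_FMT,
                    "TYPE", "DEVICE", "PORT", "NET_DEV", "SRC_ADDR",
                    "DST_ADDR", "SPACE", "STATE", "QP_NUM");
}
```

With the layout in a single macro, the header and the data rows can share the same widths, and each print statement shrinks to a few lines.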


[ofa-general] [PATCH 1/4] ib-mgmt/ibn3 branch: diags updated for continued windows support

2009-04-21 Thread Sean Hefty

Signed-off-by: Sean Hefty sean.he...@intel.com
---
This patch is based on the ibn3 branch

 infiniband-diags/src/ibaddr.c|1 +
 infiniband-diags/src/iblinkinfo.c|4 ++--
 infiniband-diags/src/ibnetdiscover.c |2 +-
 infiniband-diags/src/ibsendtrap.c|4 ++--
 infiniband-diags/src/vendstat.c  |4 ++--
 5 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/infiniband-diags/src/ibaddr.c b/infiniband-diags/src/ibaddr.c
index bb22be9..7909a52 100644
--- a/infiniband-diags/src/ibaddr.c
+++ b/infiniband-diags/src/ibaddr.c
@@ -39,6 +39,7 @@
 #include <stdlib.h>
 #include <unistd.h>
 #include <getopt.h>
+#include <arpa/inet.h>
 
 #include <infiniband/umad.h>
 #include <infiniband/mad.h>
diff --git a/infiniband-diags/src/iblinkinfo.c b/infiniband-diags/src/iblinkinfo.c
index 1e43788..c6ce81b 100644
--- a/infiniband-diags/src/iblinkinfo.c
+++ b/infiniband-diags/src/iblinkinfo.c
@@ -48,7 +48,7 @@
 #include <errno.h>
 #include <inttypes.h>
 
-#include <infiniband/complib/cl_nodenamemap.h>
+#include <complib/cl_nodenamemap.h>
 #include <infiniband/ibnetdisc.h>
 
 char *argv0 = "iblinkinfotest";
@@ -284,7 +284,7 @@ main(int argc, char **argv)
        { "compat", 0, 0, 3},
        { "from", 1, 0, 'f'},
        { "R", 0, 0, 'R'},
-       { }
+       { 0 }
    };
 
    f = stdout;
diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c
index 99750f0..2ca696e 100644
--- a/infiniband-diags/src/ibnetdiscover.c
+++ b/infiniband-diags/src/ibnetdiscover.c
@@ -210,7 +210,7 @@ out_chassis(ibnd_fabric_t *fabric, int chassisnum)
    uint64_t guid;
 
    fprintf(f, "\nChassis %d", chassisnum);
-   guid = ibnd_get_chassis_guid(fabric, chassisnum);
+   guid = ibnd_get_chassis_guid(fabric, (unsigned char) chassisnum);
    if (guid)
        fprintf(f, " (guid 0x%" PRIx64 ")", guid);
    fprintf(f, "\n");
diff --git a/infiniband-diags/src/ibsendtrap.c b/infiniband-diags/src/ibsendtrap.c
index d0afca0..13f125f 100644
--- a/infiniband-diags/src/ibsendtrap.c
+++ b/infiniband-diags/src/ibsendtrap.c
@@ -73,7 +73,7 @@ static void build_trap129(ib_mad_notice_attr_t * n, uint16_t lid)
    n->issuer_lid = cl_hton16(lid);
    n->data_details.ntc_129_131.lid = cl_hton16(lid);
    n->data_details.ntc_129_131.pad = 0;
-   n->data_details.ntc_129_131.port_num = error_port;
+   n->data_details.ntc_129_131.port_num = (uint8_t) error_port;
 }
 
 static int send_trap(const char *name,
@@ -100,7 +100,7 @@ static int send_trap(const char *name,
    trap_rpc.dataoffs = IB_SMP_DATA_OFFS;
 
    memset(&notice, 0, sizeof(notice));
-   build(&notice, selfportid.lid);
+   build(&notice, (uint16_t) selfportid.lid);
 
    return mad_send_via(&trap_rpc, &sm_port, NULL, &notice, srcport);
 }
diff --git a/infiniband-diags/src/vendstat.c b/infiniband-diags/src/vendstat.c
index 240c4cb..0bf9616 100644
--- a/infiniband-diags/src/vendstat.c
+++ b/infiniband-diags/src/vendstat.c
@@ -184,8 +184,8 @@ void config_counter_groups(ib_portid_t *portid, int port)
cg_config = (is4_config_counter_groups_t *)buf;
 
    printf("counter_groups_config: configuring group0 %d group1 %d\n", cg0, cg1);
-   cg_config->group_selects[0].group_select = cg0;
-   cg_config->group_selects[1].group_select = cg1;
+   cg_config->group_selects[0].group_select = (uint8_t) cg0;
+   cg_config->group_selects[1].group_select = (uint8_t) cg1;
 
    if (!ib_vendor_call_via(buf, portid, call, srcport))
        IBERROR("config counter group set");




[ofa-general] [PATCH 2/4] ib-mgmt/ibn3 branch: libibmad update for windows support

2009-04-21 Thread Sean Hefty

Signed-off-by: Sean Hefty sean.he...@intel.com
---

 libibmad/src/portid.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/libibmad/src/portid.c b/libibmad/src/portid.c
index de9e2d3..6f8fea2 100644
--- a/libibmad/src/portid.c
+++ b/libibmad/src/portid.c
@@ -38,6 +38,7 @@
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
+#include <arpa/inet.h>
 
 #include <infiniband/mad.h>
 




[ofa-general] [PATCH 3/4] ib-mgmt/ibn3 branch: libibmad: remove ib_resolve_guid function prototype

2009-04-21 Thread Sean Hefty

This function isn't implemented.

Signed-off-by: Sean Hefty sean.he...@intel.com
---

 libibmad/include/infiniband/mad.h |3 ---
 libibmad/src/libibmad.map |1 -
 2 files changed, 0 insertions(+), 4 deletions(-)

diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
index b8290a7..188b66b 100644
--- a/libibmad/include/infiniband/mad.h
+++ b/libibmad/include/infiniband/mad.h
@@ -844,9 +844,6 @@ MAD_EXPORT int ib_path_query_via(const struct ibmad_port *srcport,
 /* resolve.c */
 MAD_EXPORT int ib_resolve_smlid(ib_portid_t * sm_id, int timeout)
DEPRECATED;
-MAD_EXPORT int ib_resolve_guid(ib_portid_t * portid, uint64_t * guid,
-  ib_portid_t * sm_id, int timeout)
-   DEPRECATED;
 MAD_EXPORT int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str,
 enum MAD_DEST dest, ib_portid_t * sm_id)
DEPRECATED;
diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map
index 4306dbc..daa9319 100644
--- a/libibmad/src/libibmad.map
+++ b/libibmad/src/libibmad.map
@@ -58,7 +58,6 @@ IBMAD_1.3 {
mad_register_server;
mad_register_client_via;
mad_register_server_via;
-   ib_resolve_guid;
ib_resolve_portid_str;
ib_resolve_self;
ib_resolve_smlid;




[ofa-general] [PATCH 4/4] ib-mgmt/ibn3 branch: libibnetdisc add windows support

2009-04-21 Thread Sean Hefty

Allow libibnetdisc to build and run on Windows as part of the WinOF
distribution

Signed-off-by: Sean Hefty sean.he...@intel.com
---

 .../libibnetdisc/include/infiniband/ibnetdisc.h|   48 
---
 infiniband-diags/libibnetdisc/src/chassis.c|4 +-
 infiniband-diags/libibnetdisc/src/ibnetdisc.c  |   18 
 infiniband-diags/libibnetdisc/src/libibnetdisc.map |8 ---
 4 files changed, 39 insertions(+), 39 deletions(-)

diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
index a882994..370ae31 100644
--- a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
+++ b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
@@ -37,6 +37,7 @@
 #include <stdio.h>
 #include <infiniband/mad.h>
 #include <iba/ib_types.h>
+#include <infiniband/mad_osd.h>
 
 struct ib_fabric; /* forward declare */
 struct chassis; /* forward declare */
@@ -140,11 +141,12 @@ typedef struct ib_fabric {
 /** =
  * Initialization (fabric operations)
  */
-void   ibnd_debug(int i);
-void   ibnd_show_progress(int i);
+MAD_EXPORT void ibnd_debug(int i);
+MAD_EXPORT void ibnd_show_progress(int i);
 
-ibnd_fabric_t *ibnd_discover_fabric(char *dev_name, int dev_port,
-   int timeout_ms, ib_portid_t *from, int hops);
+MAD_EXPORT ibnd_fabric_t *ibnd_discover_fabric(char *dev_name, int dev_port,
+  int timeout_ms,
+  ib_portid_t *from, int hops);
/**
 * dev_name: (required) local device name to use to access the fabric
 * dev_port: (required) local device port to use to access the fabric
@@ -156,33 +158,35 @@ ibnd_fabric_t *ibnd_discover_fabric(char *dev_name, int dev_port,
 * hops: (optional) Specify how much of the fabric to traverse.
 *   negative value == scan entire fabric
 */
-void   ibnd_destroy_fabric(ibnd_fabric_t *fabric);
+MAD_EXPORT void ibnd_destroy_fabric(ibnd_fabric_t *fabric);
 
 /** =
  * Node operations
  */
-ibnd_node_t *ibnd_find_node_guid(ibnd_fabric_t *fabric, uint64_t guid);
-ibnd_node_t *ibnd_find_node_dr(ibnd_fabric_t *fabric, char *dr_str);
-ibnd_node_t *ibnd_update_node(ibnd_node_t *node);
+MAD_EXPORT ibnd_node_t *ibnd_find_node_guid(ibnd_fabric_t *fabric, uint64_t guid);
+MAD_EXPORT ibnd_node_t *ibnd_find_node_dr(ibnd_fabric_t *fabric, char *dr_str);
+MAD_EXPORT ibnd_node_t *ibnd_update_node(ibnd_node_t *node);
 
 typedef void (*ibnd_iter_node_func_t)(ibnd_node_t *node, void *user_data);
-void ibnd_iter_nodes(ibnd_fabric_t *fabric,
-   ibnd_iter_node_func_t func,
-   void *user_data);
-void ibnd_iter_nodes_type(ibnd_fabric_t *fabric,
-   ibnd_iter_node_func_t func,
-   int node_type,
-   void *user_data);
+MAD_EXPORT void ibnd_iter_nodes(ibnd_fabric_t *fabric,
+   ibnd_iter_node_func_t func,
+   void *user_data);
+MAD_EXPORT void ibnd_iter_nodes_type(ibnd_fabric_t *fabric,
+ibnd_iter_node_func_t func,
+int node_type,
+void *user_data);
 
 /** =
  * Chassis queries
  */
-uint64_t  ibnd_get_chassis_guid(ibnd_fabric_t *fabric, unsigned char chassisnum);
-char *ibnd_get_chassis_type(ibnd_node_t *node);
-char *ibnd_get_chassis_slot_str(ibnd_node_t *node, char *str, size_t size);
-
-int   ibnd_is_xsigo_guid(uint64_t guid);
-int   ibnd_is_xsigo_tca(uint64_t guid);
-int   ibnd_is_xsigo_hca(uint64_t guid);
+MAD_EXPORT uint64_t  ibnd_get_chassis_guid(ibnd_fabric_t *fabric,
+  unsigned char chassisnum);
+MAD_EXPORT char *ibnd_get_chassis_type(ibnd_node_t *node);
+MAD_EXPORT char *ibnd_get_chassis_slot_str(ibnd_node_t *node,
+  char *str, size_t size);
+
+MAD_EXPORT int   ibnd_is_xsigo_guid(uint64_t guid);
+MAD_EXPORT int   ibnd_is_xsigo_tca(uint64_t guid);
+MAD_EXPORT int   ibnd_is_xsigo_hca(uint64_t guid);
 
 #endif /* _IBNETDISC_H_ */
diff --git a/infiniband-diags/libibnetdisc/src/chassis.c b/infiniband-diags/libibnetdisc/src/chassis.c
index 6b4930e..dbb0abe 100644
--- a/infiniband-diags/libibnetdisc/src/chassis.c
+++ b/infiniband-diags/libibnetdisc/src/chassis.c
@@ -156,6 +156,8 @@ static int is_xsigo_switch(uint64_t guid)
 static uint64_t xsigo_chassisguid(ibnd_node_t *node)
 {
uint64_t sysimgguid

RE: [ofa-general] [PATCH] rdma_cm: Add debugfs entries to monitor rdma_cm connections

2009-04-20 Thread Sean Hefty

rdma_id is a suffix that leaves room for more; in other words, I just
wanted to leave room for other debug information in the future (e.g. a count
of total incoming connections on the device).

ok - makes sense

TP=TyPe (Device type)
PO=POrt (Port Number)
PS=PortSpace
ST=STate

I tried to shorten the output line as much as possible so that the output reads
as an easy-to-read table (on most screens the output will be one line per
rdma_id). The same thought made me print only the numeric value and not its
string value.

I was able to figure these out by looking at the code, but if I look at the
output of netstat, the headings and values are easy to interpret without needing
to refer to source code.

- Sean


RE: [ofa-general] [PATCH] rdma_cm: Add debugfs entries to monitor rdma_cm connections

2009-04-17 Thread Sean Hefty

If the path is:

/sys/kernel/debug/rdma_cm/mthca0_rdma_id

do we really need to append '_rdma_id' at the end?  (I'll defer to others if
debugfs is the right location or not.)

+  if (v == SEQ_START_TOKEN) {
+      seq_printf(file,
+          "%-3s"
+          "%-8s"
+          "%-3s"
+          "%-5s"
+          "%-52s"
+          "%-52s"
+          "%-5s"
+          "%-3s"
+          "%-8s"
+          "\n",
+          "TP", "DEV", "PO", "NDEV", "SRC", "DST", "PS", "ST", "QPN");

{snip}

+      seq_printf(file,
+          "%-3d"
+          "%-8s"
+          "%-3d"
+          "%-5s"
+          "%-52s"
+          "%-52s"
+          "%-5d"
+          "%-3d"
+          "%-8d"
+          "\n",
+          id_priv->id.route.addr.dev_addr.dev_type,
+          (id_priv->id.device) ? id_priv->id.device->name : "",
+          id_priv->id.port_num,
+          (id_priv->id.route.addr.dev_addr.src_dev) ?
+              id_priv->id.route.addr.dev_addr.src_dev->name : "",
+          local_addr,
+          remote_addr,
+          id_priv->id.ps,
+          id_priv->state,
+          id_priv->qp_num);

nit:
I'm
not
a
big
fan
of
one
parameter
per
line.

:)

It's not readily apparent to me what several of the headings are (TP, PO, PS,
ST) or what the numeric values map to (for TP, PS, ST).
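
One way to make such numeric columns self-describing is to print a short name for each value. The sketch below uses a made-up enum, demo_state, and a made-up helper, demo_state_name; the values are placeholders and are not the kernel's port space or cma state definitions.

```c
/* Hypothetical connection states, for illustration only. */
enum demo_state {
    DEMO_IDLE,
    DEMO_ADDR_RESOLVED,
    DEMO_CONNECTED
};

/* Returns a readable name for a state, or "UNKNOWN" for other values. */
const char *demo_state_name(enum demo_state s)
{
    switch (s) {
    case DEMO_IDLE:          return "IDLE";
    case DEMO_ADDR_RESOLVED: return "ADDR_RESOLVED";
    case DEMO_CONNECTED:     return "CONNECTED";
    default:                 return "UNKNOWN";
    }
}
```

Printing the returned string with a %s column lets a reader interpret the table without consulting the source.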


RE: [ofa-general] RDMA over infiniband, differences between rdma_cm and libmthca-rdmav2

2009-04-15 Thread Sean Hefty

I am new to InfiniBand, and I am doing some research on RDMA.
I have found two different ways of sending data on InfiniBand products
using RDMA.
The first one uses the rdma_cm module (from the kernel source code), and the
second one uses libmthca-rdmav2/libibverbs.

Can someone explain the differences between these two types of programming?

The library to send data is libibverbs.  The rdma_cm (or librdmacm) is one
method that can be used to set up the QPs for communication, i.e. to exchange
the QP numbers, LIDs, etc.  You could also set up the QPs using libibcm or just
exchange the data over a standard socket.  If you look at the librdmacm code,
you will see that it calls the libibverbs functions to allocate and modify the
QP.

- Sean
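
To make the "exchange the data over a standard socket" option concrete, here is a minimal sketch of packing the values two peers need into a text message. The struct demo_qp_info and both helpers are hypothetical; the socket I/O itself and the libibverbs calls that create and modify the QP are left out.

```c
#include <stdio.h>

/* Hypothetical bundle of what each side must learn about its peer. */
struct demo_qp_info {
    unsigned int lid;   /* local identifier of the port */
    unsigned int qpn;   /* queue pair number */
    unsigned int psn;   /* initial packet sequence number */
};

/* Encodes the info as fixed-width hex text that can be written to a socket. */
int demo_qp_info_pack(const struct demo_qp_info *info, char *buf, size_t len)
{
    return snprintf(buf, len, "%04x:%06x:%06x",
                    info->lid, info->qpn, info->psn);
}

/* Parses text produced by demo_qp_info_pack(); returns 0 on success. */
int demo_qp_info_unpack(const char *buf, struct demo_qp_info *info)
{
    return sscanf(buf, "%x:%x:%x",
                  &info->lid, &info->qpn, &info->psn) == 3 ? 0 : -1;
}
```

Each side would send its own packed record, parse the peer's, and then use the received values when moving its QP through the connection states with libibverbs.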


RE: [ofa-general] [PATCH v3 0/3] Create a new library libibnetdisc and convert iblinkinfo and ibnetdiscover to that library.

2009-04-03 Thread Sean Hefty

This new series uses the current master version ibmad to decode the data.  If
you accept the mad_*printf functions then I can convert later.  For now I want
to get this library in!  :-D

It would be helpful to check libibnetdisc into a branch in the management.git
tree.  I need some time to add libibnetdisc to windows.  (Where exactly is this
library?)

- Sean


RE: [ofa-general] [PATCH v3 0/3] Create a new library libibnetdisc and convert iblinkinfo and ibnetdiscover to that library.

2009-04-03 Thread Sean Hefty

The patch creates a subdirectory in infiniband-diags called libibnetdisc.  Is
that what you mean?  Unfortunately I don't have a public git tree I can point
you to here at the lab.  :-(

My mailer tossed patch 1/3 into my junk mail folder, so I missed the patch for
the actual library itself...

If it's possible, I'd like for Sasha to add these to a branch in his
management.git tree until I can set up the windows build and verify that
everything compiles.  I should only need a few days to do this.

- Sean


[ofa-general] RE: QoS setting and propagation

2009-04-02 Thread Sean Hefty

responding on general list:

do we set QoS parameters in SM only?

The SM must be configured with QoS.  You'll need to look in the opensm QoS
documentation to see how to set up QoS.  (I don't know those details.)

I looked in cma.c and ib_cm and iw_cm and do not see any parameter passing for
QoS.
Am I missing something?

IB specifies qos using the service ID and qos_class fields in the PR query.
This is done during 'route resolution'.  See cma_query_ib_route().

Can we set it in transport independent way?

See rdma_set_service_type().  This call is intended to be generic.

- Sean


[ofa-general] RE: [PATCH] rdma_cm: Use rate from ipoib broadcast when joining ipoib multicast

2009-03-30 Thread Sean Hefty

  When joining IPoIB multicast group, use the same rate as in the broadcast
group. Otherwise, if rdma_cm creates this group before IPoIB does, it might get
a different rate. This will cause IPoIB to fail joining to the same group later
on, because IPoIB has a strict rate selection.

Should the rdma_cm be creating IPoIB multicast groups?


[ofa-general] RE: [PATCH] rdma_cm: create cm id even when port is down

2009-03-27 Thread Sean Hefty

  When doing rdma_resolve_addr() while the relevant port is down, the function
fails and the rdma_cm id is not bound to the device. Therefore, the application
does not have a device handle and cannot wait for the port to become active.
The function fails because ipoib has not joined the multicast group, and
therefore the SA does not have a multicast record to take a qkey from.
  This patch makes qkey resolution lazy: cma_set_qkey will set id_priv->qkey
if it was not already set, and will be called just before the qkey is really
required.

Signed-off-by: Yossi Etigin yos...@voltaire.com
Acked-by: Sean Hefty sean.he...@intel.com
---

Roland, a thread that discussed this starts here:

http://lists.openfabrics.org/pipermail/general/2009-February/056895.html

The subject never contained '[PATCH]', so it was probably missed, but Yossi's
patch should be good for 2.6.30.

 drivers/infiniband/core/cma.c |   41 +++--
 1 file changed, 27 insertions(+), 14 deletions(-)

Index: b/drivers/infiniband/core/cma.c
===
--- a/drivers/infiniband/core/cma.c2009-03-10 18:21:47.0 +0200
+++ b/drivers/infiniband/core/cma.c2009-03-10 19:22:18.0 +0200
@@ -297,21 +297,25 @@ static void cma_detach_from_dev(struct r
    id_priv->cma_dev = NULL;
 }

-static int cma_set_qkey(struct ib_device *device, u8 port_num,
-                        enum rdma_port_space ps,
-                        struct rdma_dev_addr *dev_addr, u32 *qkey)
+static int cma_set_qkey(struct rdma_id_private *id_priv)
 {
    struct ib_sa_mcmember_rec rec;
    int ret = 0;

-   switch (ps) {
+   if (id_priv->qkey)
+       return 0;
+
+   switch (id_priv->id.ps) {
    case RDMA_PS_UDP:
-       *qkey = RDMA_UDP_QKEY;
+       id_priv->qkey = RDMA_UDP_QKEY;
        break;
    case RDMA_PS_IPOIB:
-       ib_addr_get_mgid(dev_addr, &rec.mgid);
-       ret = ib_sa_get_mcmember_rec(device, port_num, &rec.mgid, &rec);
-       *qkey = be32_to_cpu(rec.qkey);
+       ib_addr_get_mgid(&id_priv->id.route.addr.dev_addr, &rec.mgid);
+       ret = ib_sa_get_mcmember_rec(id_priv->id.device,
+                                    id_priv->id.port_num, &rec.mgid,
+                                    &rec);
+       if (!ret)
+           id_priv->qkey = be32_to_cpu(rec.qkey);
        break;
    default:
        break;
@@ -341,12 +345,7 @@ static int cma_acquire_dev(struct rdma_i
        ret = ib_find_cached_gid(cma_dev->device, &gid,
                                 &id_priv->id.port_num, NULL);
        if (!ret) {
-           ret = cma_set_qkey(cma_dev->device,
-                              id_priv->id.port_num,
-                              id_priv->id.ps, dev_addr,
-                              &id_priv->qkey);
-           if (!ret)
-               cma_attach_to_dev(id_priv, cma_dev);
+           cma_attach_to_dev(id_priv, cma_dev);
            break;
        }
    }
@@ -578,6 +577,10 @@ static int cma_ib_init_qp_attr(struct rd
    *qp_attr_mask = IB_QP_STATE | IB_QP_PKEY_INDEX | IB_QP_PORT;

    if (cma_is_ud_ps(id_priv->id.ps)) {
+       ret = cma_set_qkey(id_priv);
+       if (ret)
+           return ret;
+
        qp_attr->qkey = id_priv->qkey;
        *qp_attr_mask |= IB_QP_QKEY;
    } else {
@@ -2201,6 +2204,12 @@ static int cma_sidr_rep_handler(struct i
        event.status = ib_event->param.sidr_rep_rcvd.status;
        break;
    }
+   ret = cma_set_qkey(id_priv);
+   if (ret) {
+       event.event = RDMA_CM_EVENT_ADDR_ERROR;
+       event.status = -EINVAL;
+       break;
+   }
    if (id_priv->qkey != rep->qkey) {
        event.event = RDMA_CM_EVENT_UNREACHABLE;
        event.status = -EINVAL;
@@ -2480,10 +2489,14 @@ static int cma_send_sidr_rep(struct rdma
                             const void *private_data, int private_data_len)
 {
    struct ib_cm_sidr_rep_param rep;
+   int ret;

    memset(&rep, 0, sizeof rep);
    rep.status = status;
    if (status == IB_SIDR_SUCCESS) {
+       ret = cma_set_qkey(id_priv);
+       if (ret)
+           return ret;
        rep.qp_num = id_priv->qp_num;
        rep.qkey = id_priv->qkey;
    }
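
The lazy-resolution idea in this patch can be sketched outside the kernel as follows. Everything here is a stand-in: struct demo_id, demo_lookup_qkey and demo_set_qkey are hypothetical and only mimic the shape of the change (resolve on first use, cache the result, retry later if the lookup fails).

```c
/* Hypothetical stand-ins for the id structure and the expensive lookup. */
struct demo_id {
    unsigned int qkey;   /* 0 means "not resolved yet" */
    int lookups;         /* counts how often the lookup ran */
    int port_up;         /* the lookup only succeeds when the port is up */
};

static int demo_lookup_qkey(struct demo_id *id, unsigned int *qkey)
{
    id->lookups++;
    if (!id->port_up)
        return -1;       /* e.g. no multicast record available yet */
    *qkey = 0x1234;
    return 0;
}

/* Resolve the qkey on first use; later calls reuse the cached value. */
int demo_set_qkey(struct demo_id *id)
{
    unsigned int qkey;
    int ret;

    if (id->qkey)
        return 0;

    ret = demo_lookup_qkey(id, &qkey);
    if (!ret)
        id->qkey = qkey;
    return ret;
}
```

Because a failed lookup leaves the cached value unset, a caller can bind to the device while the port is down and simply retry when the qkey is actually needed.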


[ofa-general] RE: [ofw] What is the current support level for QoS in WinOF?

2009-03-24 Thread Sean Hefty

I do have urgent time-sensitive traffic, and non-urgent traffic.
The urgent traffic and the non-urgent traffic is generated from different
hosts.
I would like to differentiate them by using different SL, then configure QoS to
give maximum priority to the urgent time-sensitive
traffic, and minimum priority to the non-urgent traffic.

Are you saying I can't do this in WinOF?

I don't think the WinOF opensm will support this, but I'm not certain.

And I can't do that even adding a Linux host that runs opensm (OFED version)?

I would expect that this is possible.

Traffic separation based on HCA port could be an option, but I need to think
more about that.
What can you do with that kind of QoS?

More simply, this would allow you to group hosts into different traffic priority
groups.

Do you mark this HCA port as high-priority, that HCA port as low-priority, etc?
What happens when a high-pri port sends traffic to a low-pri port? And vice-
versa?

There should be rules in the opensm QoS config file that will determine this.
I've copied the general list on this reply.  Sasha, Hal, or someone that deals
more directly with opensm will be able to direct you better.

What happens when a high-pri port sends traffic to a normal port (a port that
is not marked as high-priority nor low-priority)?
I'm using only RDMA Write with Imm in my system, although I'm interested in
what happens on all types of traffic.
If you know of a document that explains that, please let me know; I haven't
found one so far.

The OFED opensm includes documentation on setting up QoS.  It's in opensm/doc in
the management source tree.

- Sean


[ofa-general] RE: [PATCH] add c99 definitions within the ib_mad_f structure

2009-03-18 Thread Sean Hefty

this knowingly breaks the windows build...


[ofa-general] RE: [PATCH] add c99 definitions within the ib_mad_f structure

2009-03-18 Thread Sean Hefty

So what do you suggest?

Changing the WinOF build environment is something that could be brought up in
Sonoma, if there will be enough representatives there.  Alternatively, WinOF
schedules regular con-calls.

Ira replied that he has no problems with it.

I remember Ira stating that he couldn't build or test his patches on Windows.  I
have no problem with that.  I don't pull the ib-mgmt.git tree every day.  When I
do pull, if I hit any build issues, I'll just correct them and submit a
patch.

- Sean

RE: SPAM Re: [ofa-general] SPAM [PATCH] infiniband-diags/mcm_rereg_test.c: Add missing mad_rpc_close_port call

2009-03-12 Thread Sean Hefty

Someone made the decision to want to be able to switch back and forth
earlier. This should be directed to them. It's certainly easy to
eliminate the old code.

I wasn't suggesting that you fix the existing code, just that you not add to it.

If someone wants to be able to switch back and forth, it makes way more sense to
use an #if something_that_can_be_set_during_the_build, than #if 1, which
requires source code changes in multiple places.
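
As a rough sketch of the difference (REREG_USE_NEW_API is a made-up macro name
for illustration, not something that exists in infiniband-diags), a build-time
switch could look like this. The default lives in one place, and the choice can
be overridden from the compiler command line without touching the source:

```c
/*
 * REREG_USE_NEW_API is a hypothetical macro name.  It defaults to 1 here,
 * but a build can override it with -DREREG_USE_NEW_API=0 instead of
 * editing "#if 1" lines scattered through the source.
 */
#ifndef REREG_USE_NEW_API
#define REREG_USE_NEW_API 1
#endif

/* Report which code path was compiled in: 1 for new, 0 for old. */
static int uses_new_api(void)
{
#if REREG_USE_NEW_API
	return 1;	/* new code path selected at build time */
#else
	return 0;	/* old code path kept for comparison */
#endif
}
```

Building with something like "cc -DREREG_USE_NEW_API=0 ..." would then select
the old path everywhere the macro is tested.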

[ofa-general] RE: merging madeye into mainline

2009-03-12 Thread Sean Hefty

Yes, exposing snooping capabilities to user space and writing a
user space app that does snooping sounds reasonable - what would
it take to expose this capability to user-space - will it fit
smoothly into the ib_umad and libibumad design/structure?

libibumad needs a way for the user to indicate that they want to snoop mads, so
ib_umad calls ib_register_mad_snoop().  ib_umad would also need to store copies
of the mad data, rather than queuing the actual mad.  I wouldn't think it would
be that difficult to add, though RMPP may cause a small headache.  (I don't
remember if snooping occurs before or after RMPP packets are reassembled.  If
it's before, it'll be easier to copy the mad.)
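
To illustrate the "store copies rather than queue the actual mad" idea, here is
a minimal user-space sketch.  It is not ib_umad code: the snoop_queue names and
the queue depth are invented for the example, and it assumes each snooped item
is a single 256-byte MAD (i.e. snooping happens before RMPP reassembly).

```c
#include <stddef.h>
#include <string.h>

#define MAD_SIZE        256	/* one IB MAD */
#define SNOOP_QUEUE_LEN 8	/* hypothetical queue depth */

/* Each entry owns its own copy of the snooped data. */
struct snoop_entry {
	unsigned char data[MAD_SIZE];
	size_t len;
};

/* Fixed-size ring of copies; all names here are hypothetical. */
struct snoop_queue {
	struct snoop_entry entries[SNOOP_QUEUE_LEN];
	size_t head;	/* index of the oldest entry */
	size_t count;	/* number of queued entries */
};

static void snoop_queue_init(struct snoop_queue *q)
{
	q->head = 0;
	q->count = 0;
}

/* Copy a snooped mad into the queue.  Returns 0, or -1 if full or too big. */
static int snoop_queue_push(struct snoop_queue *q, const void *mad, size_t len)
{
	struct snoop_entry *e;

	if (q->count == SNOOP_QUEUE_LEN || len > MAD_SIZE)
		return -1;
	e = &q->entries[(q->head + q->count) % SNOOP_QUEUE_LEN];
	memcpy(e->data, mad, len);	/* copy, so the original can be freed */
	e->len = len;
	q->count++;
	return 0;
}

/* Copy the oldest entry out.  Returns its length, or -1 if empty/too small. */
static long snoop_queue_pop(struct snoop_queue *q, void *buf, size_t buflen)
{
	struct snoop_entry *e;

	if (q->count == 0)
		return -1;
	e = &q->entries[q->head];
	if (buflen < e->len)
		return -1;
	memcpy(buf, e->data, e->len);
	q->head = (q->head + 1) % SNOOP_QUEUE_LEN;
	q->count--;
	return (long)e->len;
}
```

Because the queue holds a copy, the original mad can continue through normal
processing (or be freed) without affecting what the snooping reader later sees.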

- Sean

RE: [ofa-general] Re: [PATCH] infiniband-diags: Fix memory leaks on IBERROR and IBPANIC

2009-03-12 Thread Sean Hefty

It's not a matter of relying on exit to close open fds, but rather of the
memory allocated under the covers by mad_rpc_open_port. One can no longer
rely on exit alone, so the cleanup needs to be made explicit.

The OS should reclaim any allocated memory not freed by the app when it exits.
Is this your concern?

[ofa-general] RE: merging madeye into mainline

2009-03-11 Thread Sean Hefty

Have you ever considered pushing the madeye module (below)
into the kernel to ease with fabric debugging? I have tested
it now against Linus tree and it works fine.

I hadn't really thought about it, but I don't have any objection to someone
submitting it.  There may be a better way of doing this if we want to include
this upstream - for example, expose snooping capabilities to userspace.

- Sean
