On Sun, Oct 3, 2010 at 12:01 PM, Jack Morgenstein <[email protected]> wrote: > I have not yet submitted the patch to the list. Without the patch, we have a > race condition. > > Yevgeny, any ideas on scalability? What happens if opensm receives the trap > while it is performing a heavy sweep?
There's no requirement for a NodeDescription changed trap to cause a "heavy" sweep at the SM so the scalability of this should be fine. The mainstream OpenSM has implemented it this way for some months now. -- Hal > > -Jack > > -----Original Message----- > From: Or Gerlitz [mailto:[email protected]] > Sent: Sunday, October 03, 2010 5:31 PM > To: Jack Morgenstein; linux-rdma > Subject: re: mlx4: propagate node_description changes down to FW > > Hi Jack, > > I just came across this patch of yours which was placed in ofed 1.5.2, I > didn't see any track of it > here @ linux-rdma (any specific reason for that?) - some questions/issues to > discuss - > > 1st and most, (say) for 1k node cluster, is it correct that for each node > doing start/restart of the openibd > service a trap will be sent to opensm and the latter will heavy sweep?! this > doesn't sound very much scalable... > have you tested it over large clusters? what was the impact? > > Or. > > mlx4: propagate node_description changes down to FW. > > The Node Description cannot be changed via MADs (it is read-only). > Until now, it was changed in the driver, and the new Node Description > was simply overwritten by the driver on MAD responses. > > The node description was modified in the driver by openibd via sysfs. > However, that generated a race condition, where OpenSM could get the > FW node description rather than the overwritten description if OpenSM > queried the device before openibd had a chance to enter the new description. > > The solution is a new FW command (SET_NODE) which allows passing the > new node description to FW. When this command is invoked, FW issues > a 144 trap to OpenSM. Upon receiving this trap, OpenSM initiates a > heavy sweep, thus updating the node description properly -- and eliminating > the race. > > This patch works whether or not the new FW command is available. If SET_NODE > is not available, things work as before. > > Fixes FM82320 > > Signed-off-by: Jack Morgenstein <[email protected]> > > Index: ofed_kernel/drivers/infiniband/hw/mlx4/main.c > =================================================================== > --- ofed_kernel.orig/drivers/infiniband/hw/mlx4/main.c 2010-09-27 > 17:20:54.069787000 +0200 > +++ ofed_kernel/drivers/infiniband/hw/mlx4/main.c 2010-09-27 > 17:21:15.074810000 +0200 > @@ -421,14 +421,34 @@ out: > static int mlx4_ib_modify_device(struct ib_device *ibdev, int mask, > struct ib_device_modify *props) > { > + struct mlx4_cmd_mailbox *mailbox; > + int err; > + > if (mask & ~IB_DEVICE_MODIFY_NODE_DESC) > return -EOPNOTSUPP; > > - if (mask & IB_DEVICE_MODIFY_NODE_DESC) { > - spin_lock(&to_mdev(ibdev)->sm_lock); > - memcpy(ibdev->node_desc, props->node_desc, 64); > - spin_unlock(&to_mdev(ibdev)->sm_lock); > - } > + if (!(mask & IB_DEVICE_MODIFY_NODE_DESC)) > + return 0; > + > + spin_lock(&to_mdev(ibdev)->sm_lock); > + memcpy(ibdev->node_desc, props->node_desc, 64); > + spin_unlock(&to_mdev(ibdev)->sm_lock); > + > + /* if possible, pass node desc to FW, so it can generate > + * a 144 trap. If cmd fails, just ignore. > + */ > + mailbox = mlx4_alloc_cmd_mailbox(to_mdev(ibdev)->dev); > + if (IS_ERR(mailbox)) > + return 0; > + > + memset(mailbox->buf, 0, 256); > + memcpy(mailbox->buf, props->node_desc, 64); > + err = mlx4_cmd(to_mdev(ibdev)->dev, mailbox->dma, 1, 0, > + MLX4_CMD_SET_NODE, MLX4_CMD_TIME_CLASS_A); > + if (err) > + mlx4_ib_dbg("SET_NODE command failed (%d)", err); > + > + mlx4_free_cmd_mailbox(to_mdev(ibdev)->dev, mailbox); > > return 0; > } > Index: ofed_kernel/include/linux/mlx4/cmd.h > =================================================================== > --- ofed_kernel.orig/include/linux/mlx4/cmd.h 2010-09-27 17:20:40.519054000 > +0200 > +++ ofed_kernel/include/linux/mlx4/cmd.h 2010-09-27 17:21:15.081799000 > +0200 > @@ -58,6 +58,7 @@ enum { > MLX4_CMD_SENSE_PORT = 0x4d, > MLX4_CMD_HW_HEALTH_CHECK = 0x50, > MLX4_CMD_SET_PORT = 0xc, > + MLX4_CMD_SET_NODE = 0x5a, > MLX4_CMD_ACCESS_DDR = 0x2e, > MLX4_CMD_MAP_ICM = 0xffa, > MLX4_CMD_UNMAP_ICM = 0xff9, > Index: ofed_kernel/drivers/net/mlx4/cmd.c > =================================================================== > --- ofed_kernel.orig/drivers/net/mlx4/cmd.c 2010-09-27 17:20:32.995814000 > +0200 > +++ ofed_kernel/drivers/net/mlx4/cmd.c 2010-09-27 17:21:15.088792000 +0200 > @@ -242,8 +242,11 @@ static int mlx4_cmd_poll(struct mlx4_dev > __raw_readl(hcr + > HCR_OUT_PARAM_OFFSET + 4)); > stat = be32_to_cpu((__force __be32) __raw_readl(hcr + > HCR_STATUS_OFFSET)) >> 24; > err = mlx4_status_to_errno(stat); > - if (err) > - mlx4_err(dev, "command 0x%x failed: fw status = 0x%x\n", op, > stat); > + if (err) { > + if (op != MLX4_CMD_SET_NODE || stat != CMD_STAT_BAD_OP) > + mlx4_err(dev, "command 0x%x failed: fw status = > 0x%x\n", > + op, stat); > + } > > out: > up(&priv->cmd.poll_sem); > @@ -296,8 +299,9 @@ static int mlx4_cmd_wait(struct mlx4_dev > > err = context->result; > if (err) { > - mlx4_err(dev, "command 0x%x failed: fw status = 0x%x\n", > - op, context->fw_status); > + if (op != MLX4_CMD_SET_NODE || context->fw_status != > CMD_STAT_BAD_OP) > + mlx4_err(dev, "command 0x%x failed: fw status = > 0x%x\n", > + op, context->fw_status); > goto out; > } > > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to [email protected] > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to [email protected] More majordomo info at http://vger.kernel.org/majordomo-info.html
