Re: [PATCH v4 17/25] ibnbd: client: main functionality
On 27.09.19 11:32, Danil Kipnis wrote:
> On Fri, Sep 27, 2019 at 10:52 AM Roman Penyaev wrote:
>> No, it seems this thingy is a bit different. According to my
>> understanding patches 3 and 4 from this patchset do the following:
>> 1# split the whole queue depth equally over the number of hardware
>> queues and 2# return a tag number which is unique host-wide (more or
>> less similar to unique_tag, right?).
>>
>> 2# is not needed for ibtrs, and 1# can easily be done by dividing
>> queue_depth by the number of hw queues on tag set allocation, e.g.
>> something like the following:
>>
>>     ...
>>     tags->nr_hw_queues = num_online_cpus();
>>     tags->queue_depth  = sess->queue_depth / tags->nr_hw_queues;
>>
>>     blk_mq_alloc_tag_set(tags);
>>
>> And this trick won't work out for performance. The ibtrs client has
>> a single resource: the set of buffer chunks received from the server
>> side. And these buffers should be dynamically distributed between IO
>> producers according to the load. With a hard split of the whole
>> queue depth between hw queues we can forget about dynamic load
>> distribution; here is an example:
>>
>> - say the server shares 1024 buffer chunks for a session (I do not
>>   remember the actual number).
>>
>> - 1024 buffers are equally divided between hw queues, let's say 64
>>   (number of cpus), so each queue is 16 requests deep.
>>
>> - only several CPUs produce IO, and instead of occupying the whole
>>   "bandwidth" of a session, i.e. 1024 buffer chunks, we limit
>>   ourselves to the small queue depth of each hw queue.
>>
>> And performance drops significantly when the number of IO producers
>> is smaller than the number of hw queues (CPUs), and this can be
>> easily tested and proved.
>>
>> So for this particular ibtrs case tags should be globally shared,
>> and it seems (unfortunately) there are no similar requirements for
>> other block devices.
>
> I don't see any difference between what you describe here and 100 dm
> volumes sitting on top of a single NVMe device.

Hello Christoph, am I wrong?

Thank you,
Danil.
Re: [PATCH v4 17/25] ibnbd: client: main functionality
On Fri, Sep 27, 2019 at 10:52 AM Roman Penyaev wrote:
> No, it seems this thingy is a bit different. According to my
> understanding patches 3 and 4 from this patchset do the following:
> 1# split the whole queue depth equally over the number of hardware
> queues and 2# return a tag number which is unique host-wide (more or
> less similar to unique_tag, right?).
>
> 2# is not needed for ibtrs, and 1# can easily be done by dividing
> queue_depth by the number of hw queues on tag set allocation, e.g.
> something like the following:
>
>     ...
>     tags->nr_hw_queues = num_online_cpus();
>     tags->queue_depth  = sess->queue_depth / tags->nr_hw_queues;
>
>     blk_mq_alloc_tag_set(tags);
>
> And this trick won't work out for performance. The ibtrs client has
> a single resource: the set of buffer chunks received from the server
> side. And these buffers should be dynamically distributed between IO
> producers according to the load. With a hard split of the whole
> queue depth between hw queues we can forget about dynamic load
> distribution; here is an example:
>
> - say the server shares 1024 buffer chunks for a session (I do not
>   remember the actual number).
>
> - 1024 buffers are equally divided between hw queues, let's say 64
>   (number of cpus), so each queue is 16 requests deep.
>
> - only several CPUs produce IO, and instead of occupying the whole
>   "bandwidth" of a session, i.e. 1024 buffer chunks, we limit
>   ourselves to the small queue depth of each hw queue.
>
> And performance drops significantly when the number of IO producers
> is smaller than the number of hw queues (CPUs), and this can be
> easily tested and proved.
>
> So for this particular ibtrs case tags should be globally shared,
> and it seems (unfortunately) there are no similar requirements for
> other block devices.

I don't see any difference between what you describe here and 100 dm
volumes sitting on top of a single NVMe device.
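The arithmetic behind the example above can be sketched in a few lines of userspace C. This is only an illustration of the argument, using the hypothetical numbers from the mail (1024 buffer chunks, 64 hw queues); it is not driver code.

```c
#include <assert.h>

/* Hypothetical numbers from the example above: 1024 server buffer
 * chunks per session, 64 hw queues (one per CPU). */
enum { SESS_QUEUE_DEPTH = 1024, NR_HW_QUEUES = 64 };

/* Hard split: each hw queue owns queue_depth / nr_hw_queues tags, so
 * the queues of idle CPUs strand their share of the buffer pool. */
static int max_inflight_split(int active_cpus)
{
    return active_cpus * (SESS_QUEUE_DEPTH / NR_HW_QUEUES);
}

/* Globally shared pool: any producer may take any free buffer chunk,
 * so the whole session "bandwidth" stays available. */
static int max_inflight_shared(int active_cpus)
{
    (void)active_cpus;
    return SESS_QUEUE_DEPTH;
}
```

With only 4 active IO producers the hard split caps the session at 64 inflights, while the shared pool still allows all 1024.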
Re: [PATCH v4 21/25] ibnbd: server: functionality for IO submission to file or block dev
> > Bart, would it in your opinion be OK to drop the file_io support in > > IBNBD entirely? We implemented this feature in the beginning of the > > project to see whether it could be beneficial in some use cases, but > > never actually found any. > > I think that's reasonable since the loop driver can be used to convert a > file into a block device. Jack, shall we drop it?
Re: [PATCH v4 21/25] ibnbd: server: functionality for IO submission to file or block dev
On Thu, Sep 26, 2019 at 5:11 PM Bart Van Assche wrote:
> >>> +struct ibnbd_dev *ibnbd_dev_open(const char *path, fmode_t flags,
> >>> +                                 enum ibnbd_io_mode mode,
> >>> +                                 struct bio_set *bs,
> >>> +                                 ibnbd_dev_io_fn io_cb)
> >>> +{
> >>> +       struct ibnbd_dev *dev;
> >>> +       int ret;
> >>> +
> >>> +       dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> >>> +       if (!dev)
> >>> +               return ERR_PTR(-ENOMEM);
> >>> +
> >>> +       if (mode == IBNBD_BLOCKIO) {
> >>> +               dev->blk_open_flags = flags;
> >>> +               ret = ibnbd_dev_blk_open(dev, path, dev->blk_open_flags);
> >>> +               if (ret)
> >>> +                       goto err;
> >>> +       } else if (mode == IBNBD_FILEIO) {
> >>> +               dev->blk_open_flags = FMODE_READ;
> >>> +               ret = ibnbd_dev_blk_open(dev, path, dev->blk_open_flags);
> >>> +               if (ret)
> >>> +                       goto err;
> >>> +
> >>> +               ret = ibnbd_dev_vfs_open(dev, path, flags);
> >>> +               if (ret)
> >>> +                       goto blk_put;
> >>
> >> This looks really weird. Why call ibnbd_dev_blk_open() first for file
> >> I/O mode? Why set dev->blk_open_flags to FMODE_READ in file I/O mode?

Bart, would it in your opinion be OK to drop the file_io support in
IBNBD entirely? We implemented this feature at the beginning of the
project to see whether it could be beneficial in some use cases, but
never actually found any.
Re: [PATCH v4 06/25] ibtrs: client: main functionality
On Thu, Sep 26, 2019 at 1:21 AM Bart Van Assche wrote: > > On 9/25/19 3:53 PM, Danil Kipnis wrote: > > Oh, you mean we just need stub functions for those, so that nobody > > steps on a null? > > What I meant is that the memory that is backing a device must not be > freed until the reference count of a device has dropped to zero. If a > struct device is embedded in a larger structure that means signaling a > completion from inside the release function (ibtrs_clt_dev_release()) > and not freeing the struct device memory (kfree(clt) in free_clt()) > before that completion has been triggered. Got it, thank you. Will move free_clt into the release function.
Re: [PATCH v4 16/25] ibnbd: client: private header with client structs and functions
On Tue, Sep 17, 2019 at 6:36 PM Jinpu Wang wrote:
> On Sat, Sep 14, 2019 at 12:25 AM Bart Van Assche wrote:
> > On 6/20/19 8:03 AM, Jack Wang wrote:
> > > +       char            pathname[NAME_MAX];
> > [ ... ]
> > > +       char            blk_symlink_name[NAME_MAX];
> >
> > Please allocate path names dynamically instead of hard-coding the
> > upper length for a path.

Those strings are used to name directories and files under sysfs, which
I think makes NAME_MAX a natural limitation for them. Client and server
only exchange those strings on connection establishment, not in the IO
path. We do not really need to save 256K on a server with 1000 devices
mapped in parallel. A patch to allocate those strings makes the code
longer, introduces new error paths and in my opinion doesn't bring any
benefits.
Re: [PATCH v4 06/25] ibtrs: client: main functionality
On Wed, Sep 25, 2019 at 11:08 PM Bart Van Assche wrote: > > On 9/25/19 1:50 PM, Danil Kipnis wrote: > > On Wed, Sep 25, 2019 at 8:55 PM Bart Van Assche wrote: > >> There is plenty of code under drivers/base that calls get_device() and > >> put_device(). Are you sure that none of the code under drivers/base will > >> ever call get_device() and put_device() for the ibtrs client device? > > > > You mean how could multiple kernel modules share the same ibtrs > > session...? I really never thought that far... > > I meant something else: device_register() registers struct device > instances in multiple lists. The driver core may decide to iterate over > these lists and to call get_device() / put_device() on the devices it > finds in these lists. Oh, you mean we just need stub functions for those, so that nobody steps on a null?
Re: [PATCH v4 17/25] ibnbd: client: main functionality
On Wed, Sep 18, 2019 at 5:47 PM Bart Van Assche wrote: > Combining multiple queues (a) into a single queue (b) that is smaller > than the combined source queues without sacrificing performance is > tricky. We already have one such implementation in the block layer core > and it took considerable time to get that implementation right. See e.g. > blk_mq_sched_mark_restart_hctx() and blk_mq_sched_restart(). Roma, can you please estimate the performance impact in case we switch to it?
Re: [PATCH v4 03/25] ibtrs: private headers with IBTRS protocol structs and helpers
On Tue, Sep 24, 2019 at 12:50 AM Bart Van Assche wrote: > > On 6/20/19 8:03 AM, Jack Wang wrote: > > +#define P1 ) > > +#define P2 )) > > +#define P3 ))) > > +#define P4 > > +#define P(N) P ## N > > + > > +#define CAT(a, ...) PRIMITIVE_CAT(a, __VA_ARGS__) > > +#define PRIMITIVE_CAT(a, ...) a ## __VA_ARGS__ > > + > > +#define LIST(...)\ > > + __VA_ARGS__,\ > > + ({ unknown_type(); NULL; }) \ > > + CAT(P, COUNT_ARGS(__VA_ARGS__)) \ > > + > > +#define EMPTY() > > +#define DEFER(id) id EMPTY() > > + > > +#define _CASE(obj, type, member) \ > > + __builtin_choose_expr( \ > > + __builtin_types_compatible_p( \ > > + typeof(obj), type), \ > > + ((type)obj)->member > > +#define CASE(o, t, m) DEFER(_CASE)(o, t, m) > > + > > +/* > > + * Below we define retrieving of sessname from common IBTRS types. > > + * Client or server related types have to be defined by special > > + * TYPES_TO_SESSNAME macro. > > + */ > > + > > +void unknown_type(void); > > + > > +#ifndef TYPES_TO_SESSNAME > > +#define TYPES_TO_SESSNAME(...) ({ unknown_type(); NULL; }) > > +#endif > > + > > +#define ibtrs_prefix(obj)\ > > + _CASE(obj, struct ibtrs_con *, sess->sessname),\ > > + _CASE(obj, struct ibtrs_sess *, sessname), \ > > + TYPES_TO_SESSNAME(obj) \ > > + )) > > No preprocessor voodoo please. Please remove all of the above and modify > the logging statements such that these pass the proper name string as > first argument to logging macros. Hi Bart, do you think it would make sense we first submit a new patchset for IBTRS (with the changes you suggested plus closed security problem) and later submit a separate one for IBNBD only? Thank you, Danil
Re: [PATCH v4 06/25] ibtrs: client: main functionality
On Wed, Sep 25, 2019 at 8:55 PM Bart Van Assche wrote: > > On 9/25/19 10:36 AM, Danil Kipnis wrote: > > On Mon, Sep 23, 2019 at 11:51 PM Bart Van Assche wrote: > >> On 6/20/19 8:03 AM, Jack Wang wrote: > >>> +static void ibtrs_clt_dev_release(struct device *dev) > >>> +{ > >>> + /* Nobody plays with device references, so nop */ > >>> +} > >> > >> That comment sounds wrong. Have you reviewed all of the device driver > >> core code and checked that there is no code in there that manipulates > >> struct device refcounts? I think the code that frees struct ibtrs_clt > >> should be moved from free_clt() into the above function. > > > > We only use the device to create an entry under /sys/class. free_clt() > > is destroying sysfs first and unregisters the device afterwards. I > > don't really see the need to free from the callback instead... Will > > make it clear in the comment. > > There is plenty of code under drivers/base that calls get_device() and > put_device(). Are you sure that none of the code under drivers/base will > ever call get_device() and put_device() for the ibtrs client device? You mean how could multiple kernel modules share the same ibtrs session...? I really never thought that far... > Thanks, > > Bart. >
Re: [PATCH v4 06/25] ibtrs: client: main functionality
Hallo Bart, On Mon, Sep 23, 2019 at 11:51 PM Bart Van Assche wrote: > > On 6/20/19 8:03 AM, Jack Wang wrote: > > +static const struct ibtrs_ib_dev_pool_ops dev_pool_ops; > > +static struct ibtrs_ib_dev_pool dev_pool = { > > + .ops = &dev_pool_ops > > +}; > > Can the definitions in this file be reordered such that the forward > declaration of dev_pool_ops can be removed? Will try to. > > +static void ibtrs_rdma_error_recovery(struct ibtrs_clt_con *con); > > +static int ibtrs_clt_rdma_cm_handler(struct rdma_cm_id *cm_id, > > + struct rdma_cm_event *ev); > > +static void ibtrs_clt_rdma_done(struct ib_cq *cq, struct ib_wc *wc); > > +static void complete_rdma_req(struct ibtrs_clt_io_req *req, int errno, > > + bool notify, bool can_wait); > > +static int ibtrs_clt_write_req(struct ibtrs_clt_io_req *req); > > +static int ibtrs_clt_read_req(struct ibtrs_clt_io_req *req); > > Please also remove these forward declarations. OK > > +bool ibtrs_clt_sess_is_connected(const struct ibtrs_clt_sess *sess) > > +{ > > + return sess->state == IBTRS_CLT_CONNECTED; > > +} > > Is it really useful to introduce a one line function for testing the > session state? No, not in that case really, thanks. > > +static inline struct ibtrs_tag * > > +__ibtrs_get_tag(struct ibtrs_clt *clt, enum ibtrs_clt_con_type con_type) > > +{ > > + size_t max_depth = clt->queue_depth; > > + struct ibtrs_tag *tag; > > + int cpu, bit; > > + > > + cpu = get_cpu(); > > + do { > > + bit = find_first_zero_bit(clt->tags_map, max_depth); > > + if (unlikely(bit >= max_depth)) { > > + put_cpu(); > > + return NULL; > > + } > > + > > + } while (unlikely(test_and_set_bit_lock(bit, clt->tags_map))); > > + put_cpu(); > > + > > + tag = GET_TAG(clt, bit); > > + WARN_ON(tag->mem_id != bit); > > + tag->cpu_id = cpu; > > + tag->con_type = con_type; > > + > > + return tag; > > +} > > What is the role of the get_cpu() and put_cpu() calls in this function? 
> How can it make sense to assign the cpu number to tag->cpu_id after
> put_cpu() has been called?

We disable preemption while looking for a free "ibtrs_tag" (permit) in
our tags_map. We store the cpu number the ibtrs_clt_get_tag() function
has originally been called on in the ibtrs_tag we just found, so that
when the user later uses this ibtrs_tag for an rdma operation
(ibtrs_clt_request()), we select the rdma connection with the cq_vector
corresponding to this cpu. If IRQ affinity is configured accordingly,
this enables an IO response to be processed on the same cpu the IO
request was originally submitted on.

> > +static inline void ibtrs_clt_init_req(struct ibtrs_clt_io_req *req,
> > +                                     struct ibtrs_clt_sess *sess,
> > +                                     ibtrs_conf_fn *conf,
> > +                                     struct ibtrs_tag *tag, void *priv,
> > +                                     const struct kvec *vec, size_t usr_len,
> > +                                     struct scatterlist *sg, size_t sg_cnt,
> > +                                     size_t data_len, int dir)
> > +{
> > +       struct iov_iter iter;
> > +       size_t len;
> > +
> > +       req->tag = tag;
> > +       req->in_use = true;
> > +       req->usr_len = usr_len;
> > +       req->data_len = data_len;
> > +       req->sglist = sg;
> > +       req->sg_cnt = sg_cnt;
> > +       req->priv = priv;
> > +       req->dir = dir;
> > +       req->con = ibtrs_tag_to_clt_con(sess, tag);
> > +       req->conf = conf;
> > +       req->need_inv = false;
> > +       req->need_inv_comp = false;
> > +       req->inv_errno = 0;
> > +
> > +       iov_iter_kvec(&iter, READ, vec, 1, usr_len);
> > +       len = _copy_from_iter(req->iu->buf, usr_len, &iter);
> > +       WARN_ON(len != usr_len);
> > +
> > +       reinit_completion(&req->inv_comp);
> > +       if (sess->stats.enable_rdma_lat)
> > +               req->start_jiffies = jiffies;
> > +}
>
> A comment that explains what "req" stands for would be welcome. Since
> this function copies the entire payload, I assume that it is only used
> for control messages and not for reading or writing data from a block
> device?

Yes, we only copy the control message provided by the user. Will extend
the description.
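The permit allocation described above can be sketched in userspace as follows. This is an assumed simplification of __ibtrs_get_tag(): in the kernel the loop runs with preemption disabled between get_cpu()/put_cpu(); here the caller simply passes its cpu number in, and a C11 compare-and-swap stands in for test_and_set_bit_lock().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum { QUEUE_DEPTH = 8 };  /* illustrative, not the real session depth */

struct permit {
    int mem_id;  /* index in the bitmap */
    int cpu_id;  /* cpu the IO was submitted on, used later to pick
                  * the connection with the matching cq_vector */
};

static atomic_bool tags_map[QUEUE_DEPTH];
static struct permit permits[QUEUE_DEPTH];

static struct permit *get_permit(int cpu)
{
    for (int bit = 0; bit < QUEUE_DEPTH; bit++) {
        bool expected = false;
        /* test_and_set_bit_lock() analogue: claim the first free bit */
        if (atomic_compare_exchange_strong(&tags_map[bit], &expected, true)) {
            permits[bit].mem_id = bit;
            permits[bit].cpu_id = cpu;
            return &permits[bit];
        }
    }
    return NULL;  /* pool exhausted: caller must wait or requeue */
}

static void put_permit(struct permit *p)
{
    atomic_store(&tags_map[p->mem_id], false);
}
```

The recorded cpu_id is only a hint for response steering; the permit itself is what bounds the number of inflights against the shared server-side buffer pool.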
> > +static int ibtrs_clt_failover_req(struct ibtrs_clt *clt, > > + struct ibtrs_clt_io_req *fail_req) > > +{ > > + struct ibtrs_clt_sess *alive_sess; > > + struct ibtrs_clt_io_req *req; > > + int err = -ECONNABORTED; > > + struct path_it it; > > + > > + do_each_path(alive_sess, clt, &it) { > > + if (unlikely(alive_sess->state != IBTRS_CLT_CONNECTED)) > > + continue; > > + req = ibtrs_clt_get_copy_req(alive_sess, fail_req); > > + if (req->dir == DMA_TO_DEVICE) > > + err = ibtrs_clt_write_req(req); > > +
Re: [PATCH v4 02/25] ibtrs: public interface header to establish RDMA connections
On Mon, Sep 23, 2019 at 7:44 PM Bart Van Assche wrote: > > +/** > > + * enum ibtrs_clt_link_ev - Events about connectivity state of a client > > + * @IBTRS_CLT_LINK_EV_RECONNECTEDClient was reconnected. > > + * @IBTRS_CLT_LINK_EV_DISCONNECTED Client was disconnected. > > + */ > > +enum ibtrs_clt_link_ev { > > + IBTRS_CLT_LINK_EV_RECONNECTED, > > + IBTRS_CLT_LINK_EV_DISCONNECTED, > > +}; > > + > > +/** > > + * Source and destination address of a path to be established > > + */ > > +struct ibtrs_addr { > > + struct sockaddr_storage *src; > > + struct sockaddr_storage *dst; > > +}; > > Is it really useful to define a structure to hold two pointers or can > these two pointers also be passed as separate arguments? We always need both src and dst throughout ibnbd and ibtrs code and indeed one reason to introduce this struct is that "f(struct ibtrs_addr *addr, ...);" is shorter than "f(struct sockaddr_storage *src, struct sockaddr_storage *dst, ...);". But it also makes it easier to extend the address information describing one ibtrs path in the future. > > +/** > > + * ibtrs_clt_open() - Open a session to a IBTRS client > > + * @priv:User supplied private data. > > + * @link_ev: Event notification for connection state changes > > + * @priv: user supplied data that was passed to > > + * ibtrs_clt_open() > > + * @ev:Occurred event > > + * @sessname: name of the session > > + * @paths: Paths to be established defined by their src and dst addresses > > + * @path_cnt: Number of elemnts in the @paths array > > + * @port: port to be used by the IBTRS session > > + * @pdu_sz: Size of extra payload which can be accessed after tag > > allocation. > > + * @max_inflight_msg: Max. number of parallel inflight messages for the > > session > > + * @max_segments: Max. 
number of segments per IO request > > + * @reconnect_delay_sec: time between reconnect tries > > + * @max_reconnect_attempts: Number of times to reconnect on error before > > giving > > + * up, 0 for * disabled, -1 for forever > > + * > > + * Starts session establishment with the ibtrs_server. The function can > > block > > + * up to ~2000ms until it returns. > > + * > > + * Return a valid pointer on success otherwise PTR_ERR. > > + */ > > +struct ibtrs_clt *ibtrs_clt_open(void *priv, link_clt_ev_fn *link_ev, > > + const char *sessname, > > + const struct ibtrs_addr *paths, > > + size_t path_cnt, short port, > > + size_t pdu_sz, u8 reconnect_delay_sec, > > + u16 max_segments, > > + s16 max_reconnect_attempts); > > Having detailed kernel-doc headers for describing API functions is great > but I'm not sure a .h file is the best location for such documentation. > Many kernel developers keep kernel-doc headers in .c files because that > makes it more likely that the documentation and the implementation stay > in sync. What is better: to move it or to only copy it to the corresponding C file? > > > + > > +/** > > + * ibtrs_clt_close() - Close a session > > + * @sess: Session handler, is freed on return > ^^^ > handle? > > This sentence suggests that the handle is freed on return. I guess that > you meant that the session is freed upon return? Right, will fix the wording. > > > +/** > > + * ibtrs_clt_get_tag() - allocates tag for future RDMA operation > > + * @sess:Current session > > + * @con_type:Type of connection to use with the tag > > + * @wait:Wait type > > + * > > + * Description: > > + *Allocates tag for the following RDMA operation. Tag is used > > + *to preallocate all resources and to propagate memory pressure > > + *up earlier. 
> > + *
> > + * Context:
> > + *    Can sleep if @wait == IBTRS_TAG_WAIT
> > + */
> > +struct ibtrs_tag *ibtrs_clt_get_tag(struct ibtrs_clt *sess,
> > +                                   enum ibtrs_clt_con_type con_type,
> > +                                   int wait);
>
> Since struct ibtrs_tag has another role than what is called a tag in the
> block layer I think a better description is needed of what struct
> ibtrs_tag actually represents.

I think it would be better to rename it to ibtrs_permit in order to
avoid confusion with block layer tags. Will extend the description
also.

> > +/*
> > + * Here goes IBTRS server API
> > + */
>
> Most software either uses the client API or the server API but not both
> at the same time. Has it been considered to use separate header files
> for the client and server APIs?

I don't have any really good reason to put the API of server and client
into a single file. Except maybe that the reader can see the API calls
corresponding to the full sequence of request -> indication -> response
-> confirmation in one place.
Re: [PATCH v4 01/25] sysfs: export sysfs_remove_file_self()
On Mon, Sep 23, 2019 at 7:21 PM Bart Van Assche wrote: > > On 6/20/19 8:03 AM, Jack Wang wrote: > > Function is going to be used in transport over RDMA module > > in subsequent patches. > > It seems like several words are missing from this patch description. Will extend with corresponding description of the function from fs/sysfs/file.c and explanation why we need it.
Re: [PATCH v4 20/25] ibnbd: server: main functionality
On Fri, Sep 20, 2019 at 5:42 PM Bart Van Assche wrote:
> On 9/20/19 12:36 AM, Danil Kipnis wrote:
> > On Wed, Sep 18, 2019 at 7:41 PM Bart Van Assche wrote:
> >> On 6/20/19 8:03 AM, Jack Wang wrote:
> >>> +static int process_msg_sess_info(struct ibtrs_srv *ibtrs,
> >>> +                                struct ibnbd_srv_session *srv_sess,
> >>> +                                const void *msg, size_t len,
> >>> +                                void *data, size_t datalen)
> >>> +{
> >>> +       const struct ibnbd_msg_sess_info *sess_info_msg = msg;
> >>> +       struct ibnbd_msg_sess_info_rsp *rsp = data;
> >>> +
> >>> +       srv_sess->ver = min_t(u8, sess_info_msg->ver, IBNBD_PROTO_VER_MAJOR);
> >>> +       pr_debug("Session %s using protocol version %d (client version: %d,"
> >>> +                " server version: %d)\n", srv_sess->sessname,
> >>> +                srv_sess->ver, sess_info_msg->ver, IBNBD_PROTO_VER_MAJOR);
> >>
> >> Has this patch been verified with checkpatch? I think checkpatch
> >> recommends not to split literal strings.
> >
> > Yes it does complain about our split strings. But it's either a split
> > string, or a line over 80 chars, or "Avoid line continuations in
> > quoted strings" if we use a backslash on the previous line. I don't
> > know how to avoid all three of them.
>
> Checkpatch shouldn't complain about constant strings that exceed 80
> columns. If it complains about such strings then that's a checkpatch bug.

It doesn't indeed... Will concatenate those split quoted strings, thank
you.
Re: [PATCH v4 17/25] ibnbd: client: main functionality
On Wed, Sep 18, 2019 at 5:47 PM Bart Van Assche wrote:
> On 9/18/19 12:14 AM, Danil Kipnis wrote:
> > I'm not familiar with dm code, but don't they need to deal with the
> > same situation: if I configure 100 logical volumes on top of a single
> > NVME drive with X hardware queues, each queue_depth deep, then each dm
> > block device would need to advertise X hardware queues in order to
> > achieve highest performance in case only this one volume is accessed,
> > while in fact those X physical queues have to be shared among all 100
> > logical volumes, if they are accessed in parallel?
>
> Combining multiple queues (a) into a single queue (b) that is smaller
> than the combined source queues without sacrificing performance is
> tricky. We already have one such implementation in the block layer core
> and it took considerable time to get that implementation right. See e.g.
> blk_mq_sched_mark_restart_hctx() and blk_mq_sched_restart().

We will need some time to check if we can reuse those...

> dm drivers are expected to return DM_MAPIO_REQUEUE or
> DM_MAPIO_DELAY_REQUEUE if the queue (b) is full. It turned out to be
> difficult to get this right in the dm-mpath driver and at the same time
> to achieve good performance.

We also first tried to just return error codes in case we can't process
an incoming request, but this was causing huge performance degradation
when the number of devices mapped over the same session grows. Since we
introduced those per-CPU per-device lists of stopped queues, we do
scale very well.

> The ibnbd driver introduces a third implementation of code that combines
> multiple (per-cpu) queues into one queue per CPU. It is considered
> important in the Linux kernel to avoid code duplication. Hence my
> question whether ibnbd can reuse the block layer infrastructure for
> sharing tag sets.

Yes, will have to reiterate on this.
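The per-CPU lists of stopped queues mentioned above can be sketched roughly as follows. This is an assumed design inferred from the description, not the driver's actual code: a hw queue that finds no free permit parks itself on the list, and whenever a permit is released on that CPU, the oldest stopped queue is restarted first, which keeps the sharing fair across devices.

```c
#include <assert.h>
#include <stddef.h>

struct stopped_queue {
    int dev_id;                  /* which device's hw queue was stopped */
    struct stopped_queue *next;
};

/* One such list would exist per CPU; a single instance shown here. */
struct cpu_requeue_list {
    struct stopped_queue *head, *tail;
};

/* Called when a queue runs out of permits: park it, FIFO order. */
static void stop_queue(struct cpu_requeue_list *l, struct stopped_queue *q)
{
    q->next = NULL;
    if (l->tail)
        l->tail->next = q;
    else
        l->head = q;
    l->tail = q;
}

/* Called when a permit is freed: restart exactly one queue, oldest
 * first, so no device can starve the others. */
static struct stopped_queue *restart_one(struct cpu_requeue_list *l)
{
    struct stopped_queue *q = l->head;
    if (q) {
        l->head = q->next;
        if (!l->head)
            l->tail = NULL;
    }
    return q;
}
```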
Re: [PATCH v4 20/25] ibnbd: server: main functionality
On Wed, Sep 18, 2019 at 7:41 PM Bart Van Assche wrote: > > On 6/20/19 8:03 AM, Jack Wang wrote: > > +#undef pr_fmt > > +#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt > > Same comment here as for a previous patch - please do not include line > number information in pr_fmt(). Will drop it, thanks. > > +MODULE_AUTHOR("ib...@profitbricks.com"); > > +MODULE_VERSION(IBNBD_VER_STRING); > > +MODULE_DESCRIPTION("InfiniBand Network Block Device Server"); > > +MODULE_LICENSE("GPL"); > > Please remove the version number (MODULE_VERSION()). OK. > > +static char dev_search_path[PATH_MAX] = DEFAULT_DEV_SEARCH_PATH; > > Please change dev_search_path[] into a dynamically allocated string to > avoid a hard-coded length limit. OK. > > + if (dup[strlen(dup) - 1] == '\n') > > + dup[strlen(dup) - 1] = '\0'; > > Can this be changed into a call to strim()? A directory name can start and end with spaces, for example this works: mkdir " x " > > +static void ibnbd_endio(void *priv, int error) > > +{ > > + struct ibnbd_io_private *ibnbd_priv = priv; > > + struct ibnbd_srv_sess_dev *sess_dev = ibnbd_priv->sess_dev; > > + > > + ibnbd_put_sess_dev(sess_dev); > > + > > + ibtrs_srv_resp_rdma(ibnbd_priv->id, error); > > + > > + kfree(priv); > > +} > > Since ibtrs_srv_resp_rdma() starts an RDMA WRITE without waiting for the > write completion, shouldn't the session reference be retained until the > completion for that RDMA WRITE has been received? In other words, is > there a risk with the current approach that the buffer that is being > transferred to the client will be freed before the RDMA WRITE has finished? ibtrs-srv.c is keeping track of inflights. 
When closing session it first marks the queue as closing, so that no new write requests would be posted, when IBNBD calls ibtrs_srv_resp_rdma(): 1831 if (ibtrs_srv_change_state_get_old(sess, IBTRS_SRV_CLOSING, 1832&old_state) Then ibtrs-srv schedules the ibtrs_srv_close_work, that drains the queue and then waits for all inflights to return from IBNBD: ... 1274 ib_drain_qp(con->c.qp); 1275 } 1276 /* Wait for all inflights */ 1277 ibtrs_srv_wait_ops_ids(sess); Only then the resources can be deallocated: 1282 unmap_cont_bufs(sess); 1283 ibtrs_srv_free_ops_ids(sess); > > > +static struct ibnbd_srv_sess_dev * > > +ibnbd_get_sess_dev(int dev_id, struct ibnbd_srv_session *srv_sess) > > +{ > > + struct ibnbd_srv_sess_dev *sess_dev; > > + int ret = 0; > > + > > + read_lock(&srv_sess->index_lock); > > + sess_dev = idr_find(&srv_sess->index_idr, dev_id); > > + if (likely(sess_dev)) > > + ret = kref_get_unless_zero(&sess_dev->kref); > > + read_unlock(&srv_sess->index_lock); > > + > > + if (unlikely(!sess_dev || !ret)) > > + return ERR_PTR(-ENXIO); > > + > > + return sess_dev; > > +} > > Something that is not important: isn't the sess_dev check superfluous in > the if-statement just above the return statement? If ret == 1, does that > imply that sess_dev != 0 ? We want to have found the device (sess_dev != NULL) and we want to have been able to take reference to it (ret != 0)... You are right, if ret != 0 then sess_dev can't be NULL. > Has it been considered to return -ENODEV instead of -ENXIO if no device > is found? The backend block device, i.e. /dev/nullb0, is still there and might even be still exported over other session(s). So we thought "No such device or address" is more appropriate. > > > +static int create_sess(struct ibtrs_srv *ibtrs) > > +{ > > [ ... ] > > + strlcpy(srv_sess->sessname, sessname, sizeof(srv_sess->sessname)); > > Please change the session name into a dynamically allocated string such > that strdup() can be used instead of strlcpy(). OK. 
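The close sequence described above (mark the session as closing, refuse new requests, then wait for all inflights before freeing resources) can be sketched as a userspace analogue. The names are illustrative; the kernel code sleeps until the inflight ops complete (ibtrs_srv_wait_ops_ids()) rather than spinning.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int inflight;
static atomic_bool closing;

/* Called when a new request arrives. Increment first, then check the
 * closing flag, so a concurrent close_session() can never observe
 * inflight == 0 while a request is slipping in. */
static bool submit_begin(void)
{
    atomic_fetch_add(&inflight, 1);
    if (atomic_load(&closing)) {      /* raced with close: back out */
        atomic_fetch_sub(&inflight, 1);
        return false;
    }
    return true;
}

/* Called from the completion path (the ibnbd_endio() analogue). */
static void submit_end(void)
{
    atomic_fetch_sub(&inflight, 1);
}

static void close_session(void)
{
    atomic_store(&closing, true);     /* state -> IBTRS_SRV_CLOSING */
    while (atomic_load(&inflight) > 0)
        ;                             /* wait for all inflights */
    /* only now is it safe to unmap buffers and free the ops ids */
}
```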
> > > +static int process_msg_open(struct ibtrs_srv *ibtrs, > > + struct ibnbd_srv_session *srv_sess, > > + const void *msg, size_t len, > > + void *data, size_t datalen); > > + > > +static int process_msg_sess_info(struct ibtrs_srv *ibtrs, > > + struct ibnbd_srv_session *srv_sess, > > + const void *msg, size_t len, > > + void *data, size_t datalen); > > Can the code be reordered such that these forward declarations can be > dropped? Will try to. > > > +static struct ibnbd_srv_sess_dev * > > +ibnbd_srv_create_set_sess_dev(struct ibnbd_srv_session *srv_sess, > > + const struct ibnbd_msg_open *open_msg, > > + struct ibnbd_dev *ibnbd_dev, fmode_t open_flags, > > + struct ibnbd_srv_dev *srv_dev) > > +{ > > + struct ibnbd_srv_sess_dev *sdev = ibnbd_sess_dev_alloc(srv_sess); > > + > > + if (IS_ERR(sdev)) > > + return sdev; > > + > > +
Re: [PATCH v4 17/25] ibnbd: client: main functionality
> > > On Sat, Sep 14, 2019 at 1:46 AM Bart Van Assche > > > wrote: > > >> A more general question is why ibnbd needs its own queue management > > >> while no other block driver needs this? > > > > > > Each IBNBD device promises to have a queue_depth (of say 512) on each > > > of its num_cpus hardware queues. In fact we can only process a > > > queue_depth inflights at once on the whole ibtrs session connecting a > > > given client with a given server. Those 512 inflights (corresponding > > > to the number of buffers reserved by the server for this particular > > > client) have to be shared among all the devices mapped on this > > > session. This leads to the situation, that we receive more requests > > > than we can process at the moment. So we need to stop queues and start > > > them again later in some fair fashion. > > > > Can a single CPU really sustain a queue depth of 512 commands? Is it > > really necessary to have one hardware queue per CPU or is e.g. four > > queues per NUMA node sufficient? Has it been considered to send the > > number of hardware queues that the initiator wants to use and also the > > command depth per queue during login to the target side? That would > > allow the target side to allocate an independent set of buffers for each > > initiator hardware queue and would allow to remove the queue management > > at the initiator side. This might even yield better performance. > We needed a way which would allow us to address one particular > requirement: we'd like to be able to "enforce" that a response to an > IO would be processed on the same CPU the IO was originally submitted > on. In order to be able to do so we establish one rdma connection per > cpu, each having a separate cq_vector. The administrator can then > assign the corresponding IRQs to distinct CPUs. The server always > replies to an IO on the same connection he received the request on. 
If
> the administrator did configure /proc/irq/y/smp_affinity accordingly,
> the response sent by the server will generate an interrupt on the
> same cpu the IO was originally submitted on. The administrator can
> configure IRQs differently, for example assign a given irq
> (<->cq_vector) to a range of cpus belonging to a numa node, or
> whatever assignment is best for his use-case.
> Our transport module IBTRS establishes one connection per cpu between
> a client and a server. The user of the transport module (i.e. IBNBD)
> has no knowledge about the rdma connections, it only has a pointer to
> an abstract "session", which connects him somehow to a remote host.
> IBNBD as a user of IBTRS creates block devices and uses a given
> "session" to send IOs from all the block devices it created for that
> session. That means IBNBD is limited in the maximum number of its
> inflights toward a given remote host by the capability of the
> corresponding "session". So it needs to share the resources provided
> by the session (in our current model those resources are in fact some
> pre-registered buffers on the server side) among its devices.
> It is possible to extend the IBTRS API so that the user (IBNBD) could
> specify how many connections he wants to have on the session to be
> established. It is also possible to extend the ibtrs_clt_get_tag API
> (this is to get a send "permit") with a parameter specifying the
> connection the future IO is to be sent on.
> We now might have to change our communication model in IBTRS a bit in
> order to fix the potential security problem raised during the recent
> RDMA MC: https://etherpad.net/p/LPC2019_RDMA.
> I'm not familiar with dm code, but don't they need to deal with the same situation: if I configure 100 logical volumes on top of a single NVME drive with X hardware queues, each queue_depth deep, then each dm block device would need to advertise X hardware queues in order to achieve highest performance in case only this one volume is accessed, while in fact those X physical queues have to be shared among all 100 logical volumes, if they are accessed in parallel?
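Roman's numeric example can be made concrete with a tiny model (the model and all names below are ours, not from the patches): with a hard per-queue split of the session's queue depth, the achievable number of inflights is capped by the number of active IO producers, while a globally shared tag pool always exposes the full session depth.

```c
#include <assert.h>

/* Illustrative model only, not driver code: upper bound on in-flight
 * requests when only `active_queues` of the hw queues submit IO. */

/* Hard split: each hw queue owns queue_depth / nr_hw_queues tags. */
static unsigned int max_inflight_split(unsigned int queue_depth,
                                       unsigned int nr_hw_queues,
                                       unsigned int active_queues)
{
        return active_queues * (queue_depth / nr_hw_queues);
}

/* Globally shared tags: any queue may take any of the session's tags. */
static unsigned int max_inflight_shared(unsigned int queue_depth)
{
        return queue_depth;
}
```

With the numbers from the example (1024 buffer chunks, 64 hw queues, only 4 busy CPUs), the split scheme caps the session at 4 * 16 = 64 inflights, while the shared scheme still allows all 1024.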
Re: [PATCH v4 17/25] ibnbd: client: main functionality
On Mon, Sep 16, 2019 at 6:46 PM Bart Van Assche wrote: > > On 9/16/19 7:17 AM, Danil Kipnis wrote: > > On Sat, Sep 14, 2019 at 1:46 AM Bart Van Assche wrote: > >> On 6/20/19 8:03 AM, Jack Wang wrote: > >>> +/* > >>> + * This is for closing devices when unloading the module: > >>> + * we might be closing a lot (>256) of devices in parallel > >>> + * and it is better not to use the system_wq. > >>> + */ > >>> +static struct workqueue_struct *unload_wq; > >> > >> I think that a better motivation is needed for the introduction of a new > >> workqueue. > > > > We didn't want to pollute the system workqueue when unmapping a big > > number of devices at once in parallel. Will reiterate on it. > > There are multiple system workqueues. From <linux/workqueue.h>: > > extern struct workqueue_struct *system_wq; > extern struct workqueue_struct *system_highpri_wq; > extern struct workqueue_struct *system_long_wq; > extern struct workqueue_struct *system_unbound_wq; > extern struct workqueue_struct *system_freezable_wq; > extern struct workqueue_struct *system_power_efficient_wq; > extern struct workqueue_struct *system_freezable_power_efficient_wq; > > Has it been considered to use e.g. system_long_wq? Will try to switch to system_long_wq; I do agree that a new wq for just closing devices does make an impression of an overreaction. > > >> A more general question is why ibnbd needs its own queue management > >> while no other block driver needs this? > > > > Each IBNBD device promises to have a queue_depth (of say 512) on each > > of its num_cpus hardware queues. In fact we can only process a > > queue_depth inflights at once on the whole ibtrs session connecting a > > given client with a given server. Those 512 inflights (corresponding > > to the number of buffers reserved by the server for this particular > > client) have to be shared among all the devices mapped on this > > session. This leads to the situation, that we receive more requests > > than we can process at the moment. 
So we need to stop queues and start > > them again later in some fair fashion. > > Can a single CPU really sustain a queue depth of 512 commands? Is it > really necessary to have one hardware queue per CPU or is e.g. four > queues per NUMA node sufficient? Has it been considered to send the > number of hardware queues that the initiator wants to use and also the > command depth per queue during login to the target side? That would > allow the target side to allocate an independent set of buffers for each > initiator hardware queue and would allow to remove the queue management > at the initiator side. This might even yield better performance. We needed a way which would allow us to address one particular requirement: we'd like to be able to "enforce" that a response to an IO would be processed on the same CPU the IO was originally submitted on. In order to be able to do so we establish one rdma connection per cpu, each having a separate cq_vector. The administrator can then assign the corresponding IRQs to distinct CPUs. The server always replies to an IO on the same connection he received the request on. If the administrator did configure the /proc/irq/y/smp_affinity accordingly, the response sent by the server will generate an interrupt on the same CPU the IO was originally submitted on. The administrator can configure IRQs differently, for example assign a given irq (<->cq_vector) to a range of cpus belonging to a numa node, or whatever assignment is best for his use-case. Our transport module IBTRS establishes num_cpus connections between a client and a server. The user of the transport module (i.e. IBNBD) has no knowledge about the rdma connections, it only has a pointer to an abstract "session", which connects it to a remote host. IBNBD as a user of IBTRS creates block devices and uses a given "session" to send IOs from all the block devices it created for that session. 
That means IBNBD is limited in the maximum number of its inflights toward a given remote host by the capability of the corresponding "session". So it needs to share the resources provided by the session (in our current model those resources are in fact some pre-registered buffers on the server side) among its devices. It is possible to extend the IBTRS API so that the user (IBNBD) could specify how many connections it wants to have on the session to be established. It is also possible to extend the ibtrs_clt_get_tag API (this is to get a send "permit") with a parameter specifying the connection the future IO is to be sent on. We now might have to change our communication model in IBTRS a bit in order to fix the potential security problem raised during the recent RDMA MC: https://etherpad.net/p/LPC2019_RDMA.
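The shared-resource idea behind the send "permits" can be sketched as a counting pool (a hypothetical userspace model; the real ibtrs_clt_get_tag() additionally deals with fairness between devices and with stopping/waking queues, all of which is omitted here):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative sketch: all devices mapped over one session draw from
 * one shared pool of server-side buffers. Names are ours. */
struct session_permits {
        atomic_uint used;    /* permits currently taken */
        unsigned int depth;  /* buffers reserved by the server */
};

/* Try to take one permit; fail when the whole session is saturated,
 * in which case the caller would stop its hw queue. */
static bool permit_get(struct session_permits *s)
{
        unsigned int cur = atomic_load(&s->used);
        while (cur < s->depth) {
                /* cur is refreshed by the CAS on failure */
                if (atomic_compare_exchange_weak(&s->used, &cur, cur + 1))
                        return true;
        }
        return false;
}

static void permit_put(struct session_permits *s)
{
        atomic_fetch_sub(&s->used, 1);
}
```

Extending the API as proposed above would amount to passing a connection index into permit_get(), so a permit is tied to the connection the IO will be sent on.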
Re: [PATCH v4 17/25] ibnbd: client: main functionality
On Sat, Sep 14, 2019 at 1:46 AM Bart Van Assche wrote: > > On 6/20/19 8:03 AM, Jack Wang wrote: > > +MODULE_VERSION(IBNBD_VER_STRING); > > No version numbers in upstream code please. Will drop this, thanks. > > > +/* > > + * This is for closing devices when unloading the module: > > + * we might be closing a lot (>256) of devices in parallel > > + * and it is better not to use the system_wq. > > + */ > > +static struct workqueue_struct *unload_wq; > > I think that a better motivation is needed for the introduction of a new > workqueue. We didn't want to pollute the system workqueue when unmapping a big number of devices at once in parallel. Will reiterate on it. > > > +#define KERNEL_SECTOR_SIZE 512 > > Please use SECTOR_SIZE instead of redefining it. Right. > > > +static int ibnbd_clt_revalidate_disk(struct ibnbd_clt_dev *dev, > > + size_t new_nsectors) > > +{ > > + int err = 0; > > + > > + ibnbd_info(dev, "Device size changed from %zu to %zu sectors\n", > > +dev->nsectors, new_nsectors); > > + dev->nsectors = new_nsectors; > > + set_capacity(dev->gd, > > + dev->nsectors * (dev->logical_block_size / > > + KERNEL_SECTOR_SIZE)); > > + err = revalidate_disk(dev->gd); > > + if (err) > > + ibnbd_err(dev, "Failed to change device size from" > > + " %zu to %zu, err: %d\n", dev->nsectors, > > + new_nsectors, err); > > + return err; > > +} > > Since this function changes the block device size, I think that the name > ibnbd_clt_revalidate_disk() is confusing. Please rename this function. I guess ibnbd_clt_resize_disk() would be more appropriate. > > > +/** > > + * ibnbd_get_cpu_qlist() - finds a list with HW queues to be requeued > > + * > > + * Description: > > + * Each CPU has a list of HW queues, which needs to be requeed. If a > > list > > + * is not empty - it is marked with a bit. This function finds first > > + * set bit in a bitmap and returns corresponding CPU list. > > + */ > > What does it mean to requeue a queue? 
Queue elements can be requeued but > a queue in its entirety not. Please make this comment more clear. Will fix the comment. The right wording should probably be "..., which need to be rerun". We have a list of "stopped" queues for each cpu. We need to select a list and a queue on that list to rerun, when an IO is completed. > > > +/** > > + * ibnbd_requeue_if_needed() - requeue if CPU queue is marked as non empty > > + * > > + * Description: > > + * Each CPU has it's own list of HW queues, which should be requeued. > > + * Function finds such list with HW queues, takes a list lock, picks up > > + * the first HW queue out of the list and requeues it. > > + * > > + * Return: > > + * True if the queue was requeued, false otherwise. > > + * > > + * Context: > > + * Does not matter. > > + */ > > Same comment here. > > > +/** > > + * ibnbd_requeue_all_if_idle() - requeue all queues left in the list if > > + * session is idling (there are no requests in-flight). > > + * > > + * Description: > > + * This function tries to rerun all stopped queues if there are no > > + * requests in-flight anymore. This function tries to solve an obvious > > + * problem, when number of tags < than number of queues (hctx), which > > + * are stopped and put to sleep. If last tag, which has been just put, > > + * does not wake up all left queues (hctxs), IO requests hang forever. > > + * > > + * That can happen when all number of tags, say N, have been exhausted > > + * from one CPU, and we have many block devices per session, say M. > > + * Each block device has it's own queue (hctx) for each CPU, so > > eventually > > + * we can put that number of queues (hctxs) to sleep: M x nr_cpu_ids. > > + * If number of tags N < M x nr_cpu_ids finally we will get an IO hang. > > + * > > + * To avoid this hang last caller of ibnbd_put_tag() (last caller is > > the > > + * one who observes sess->busy == 0) must wake up all remaining queues. > > + * > > + * Context: > > + * Does not matter. 
> > + */ > > Same comment here. > > A more general question is why ibnbd needs its own queue management > while no other block driver needs this? Each IBNBD device promises to have a queue_depth (of say 512) on each of its num_cpus hardware queues. In fact we can only process a queue_depth inflights at once on the whole ibtrs session connecting a given client with a given server. Those 512 inflights (corresponding to the number of buffers reserved by the server for this particular client) have to be shared among all the devices mapped on this session. This leads to the situation, that we receive more requests than we can process at the moment. So we need to stop queues and start them again later in some fair fashion. > > > +static void ibnbd_softirq_done_fn(struct
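The wake-up scheme described in the quoted comments (per-CPU lists of stopped queues plus a bitmap marking non-empty lists, served round-robin so no device starves) can be sketched like this, with the per-CPU lists reduced to a plain 64-bit mask (this is our simplification, not the driver's data structure, which uses real per-cpu lists and locking):

```c
#include <assert.h>
#include <stdint.h>

#define NO_CPU (-1)

/* Return the first CPU after *last (cyclically) whose stopped-queue
 * list is marked non-empty in `mask`, updating *last so the next
 * completion starts searching after it; NO_CPU if the mask is empty. */
static int next_cpu_rr(uint64_t mask, int *last, int nr_cpus)
{
        for (int i = 1; i <= nr_cpus; i++) {
                int cpu = (*last + i) % nr_cpus;

                if (mask & (UINT64_C(1) << cpu)) {
                        *last = cpu;
                        return cpu;
                }
        }
        return NO_CPU;
}
```

Starting the search one past the previously served CPU is what gives the fairness: a CPU with a constantly non-empty list cannot shadow the others.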
Re: [PATCH v4 15/25] ibnbd: private headers with IBNBD protocol structs and helpers
> > > +/** > > > + * struct ibnbd_msg_open_rsp - response message to IBNBD_MSG_OPEN > > > + * @hdr: message header > > > + * @nsectors:number of sectors > > > > What is the size of a single sector? > 512b, will mention explicitly in the next round. We only have KERNEL_SECTOR_SIZE=512, defined in ibnbd-clt.c. Looks we only depend on this exact number to set the capacity of the block device on client side. I'm not sure whether it is worth extending the protocol to send the number from the server instead. > > > + * @max_segments:max segments hardware support in one transfer > > > > Does 'hardware' refer to the RDMA adapter that transfers the IBNBD > > message or to the storage device? In the latter case, I assume that > > transfer refers to a DMA transaction? > "hardware" refers to the storage device on the server-side. The field contains queue_max_segments() of the target block device. And is used to call blk_queue_max_segments() on the corresponding device on the client side. We do also have BMAX_SEGMENTS define in ibnbd-clt.h which sets an upper limit to max_segments and does refer to the capabilities of the RDMA adapter. This information should only be known to the transport module and ideally would be returned to IBNBD during the registration in IBTRS. > > Note: I plan to review the entire patch series but it may take some time > > before I have finished reviewing the entire patch series. > > > That will be great, thanks a lot, Bart Thank you Bart!
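For illustration, the capacity conversion performed by the quoted ibnbd_clt_revalidate_disk() boils down to the following, assuming (as the quoted code does) that nsectors counts logical blocks while set_capacity() expects 512-byte kernel sectors:

```c
#include <assert.h>
#include <stdint.h>

#define KERNEL_SECTOR_SIZE 512  /* i.e. SECTOR_SIZE upstream */

/* Convert a block count given in the device's logical block size into
 * the 512-byte sectors the kernel uses for capacity accounting. */
static uint64_t capacity_in_512b_sectors(uint64_t nsectors,
                                         uint32_t logical_block_size)
{
        return nsectors * (logical_block_size / KERNEL_SECTOR_SIZE);
}
```

For a 512-byte logical block size the factor is 1, which is why the question of which unit the protocol's nsectors field uses only becomes visible on 4K-native devices.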
Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
Hi Sagi, thanks a lot for the information. We are doing the right thing regarding the invalidation (your 2f122e4f5107), but we do use unsignalled sends and need to fix that. Please correct me if I'm wrong: The patches (4af7f7ff92a4, b4b591c87f2b) fix the problem that if the ack from the target is lost for some reason, the initiator's HCA will resend it even after the request is completed. But doesn't the same problem persist also the other way around: for the lost acks from the client? I mean, the target did a send for the "read" IOs; the client completed the request (after invalidation, refcount dropped to 0, etc), but the ack is not delivered to the HCA of the target, so the target will also resend it. This seems unfixable, since the client can't possibly know if the server received his ack or not? Doesn't the problem go away, if rdma_conn_param.retry_count is just set to 0? Thanks for your help, Best, Danil. On Tue, Jul 9, 2019 at 11:27 PM Sagi Grimberg wrote: > > > >> Thanks Jason for feedback. > >> Can you be more specific about "the invalidation model for MR was wrong" > > > > MR's must be invalidated before data is handed over to the block > > layer. It can't leave MRs open for access and then touch the memory > > the MR covers. > > Jason is referring to these fixes: > 2f122e4f5107 ("nvme-rdma: wait for local invalidation before completing > a request") > 4af7f7ff92a4 ("nvme-rdma: don't complete requests before a send work > request has completed") > b4b591c87f2b ("nvme-rdma: don't suppress send completions")
Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
On Fri, Jul 12, 2019 at 2:22 AM Sagi Grimberg wrote: > > > >> My main issues which were raised before are: > >> - IMO there isn't any justification to this ibtrs layering separation > >> given that the only user of this is your ibnbd. Unless you are > >> trying to submit another consumer, you should avoid adding another > >> subsystem that is not really general purpose. > > We designed ibtrs not only with the IBNBD in mind but also as the > > transport layer for a distributed SDS. We'd like to be able to do what > > ceph is capable of (automatic up/down scaling of the storage cluster, > > automatic recovery) but using in-kernel rdma-based IO transport > > drivers, thin-provisioned volume managers, etc. to keep the highest > > possible performance. > > Sounds lovely, but still very much bound to your ibnbd. And that part > is not included in the patch set, so I still don't see why should this > be considered as a "generic" transport subsystem (it clearly isn't). Having IBTRS sit on a storage enables that storage to communicate with other storages (forward requests, request reads from other storages, i.e. for sync traffic). IBTRS is generic in the sense that it removes the strict separation into initiator (converting BIOs into some hardware specific protocol messages) and target (which forwards those messages to some local device supporting that protocol). It appears less generic to me to talk SCSI or NVME between storages if some storages have SCSI disks, others NVME disks or LVM volumes, or a mixed setup. IBTRS just allows one to send, or request a read of, an sg-list between machines over rdma - the very minimum required to transport a BIO. It would indeed support our case for the library if we could propose at least two users of it. We now only have a very early stage prototype capable of organizing storages in pools, multiplexing io between different storages, etc. sitting on top of ibtrs; it's not functional yet. 
On the other hand ibnbd with ibtrs alone already make over 1 lines. > > All in all ibtrs is a library to establish a "fat", multipath, > > autoreconnectable connection between two hosts on top of rdma, > > optimized for transport of IO traffic. > > That is also dictating a wire-protocol which makes it useless to pretty > much any other consumer. Personally, I don't see how this library > would ever be used outside of your ibnbd. It's true, IBTRS also imposes a protocol for connection establishment and IO path. I think at least the IO part we did reduce to a bare minimum (excerpt from the protocol documentation):

* Write *

1. When processing a write request the client selects one of the memory chunks
on the server side and rdma writes there the user data, user header and the
IBTRS_MSG_RDMA_WRITE message. Apart from the type (write), the message only
contains the size of the user header. The client tells the server which chunk
has been accessed and at what offset the IBTRS_MSG_RDMA_WRITE can be found by
using the IMM field.

2. When confirming a write request the server sends an "empty" rdma message
with an immediate field. The 32 bit field is used to specify the outstanding
inflight IO and for the error code.

CLT                                          SRV
usr_data + usr_hdr + ibtrs_msg_rdma_write -> [IBTRS_IO_REQ_IMM]
[IBTRS_IO_RSP_IMM]                        <- (id + errno)

* Read *

1. When processing a read request the client selects one of the memory chunks
on the server side and rdma writes there the user header and the
IBTRS_MSG_RDMA_READ message. This message contains the type (read), size of
the user header, flags (specifying if memory invalidation is necessary) and
the list of addresses along with keys for the data to be read into.

2. When confirming a read request the server transfers the requested data
first, attaches an invalidation message if requested and finally an "empty"
rdma message with an immediate field. The 32 bit field is used to specify the
outstanding inflight IO and the error code.

CLT                                  SRV
usr_hdr + ibtrs_msg_rdma_read -->    [IBTRS_IO_REQ_IMM]
[IBTRS_IO_RSP_IMM]            <--    usr_data + (id + errno)
or in case the client requested invalidation:
[IBTRS_IO_RSP_IMM_W_INV]      <--    usr_data + (INV) + (id + errno)

> >> - ibtrs in general is using almost no infrastructure from the existing > >> kernel subsystems. Examples are: > >> - tag allocation mechanism (which I'm not clear why its needed) As you correctly noticed our client manages the buffers allocated and registered by the server on the connection establishment. Our tags are just a mechanism to take and release those buffers for incoming requests on the client side. Since the buffers allocated by the server are to be shared between all the devices mapped from that server and all their HW queues (each having num_cpus of them), the mechanism behind get_tag/put_tag also takes care of the fairness.
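The 32-bit immediate that carries both the request id and the errno implies a simple pack/unpack step on both sides of the wire. The actual bit layout used by ibtrs is not given in this thread, so the 16/16 split below is purely illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical encoding of the response immediate: request id in the
 * high 16 bits, errno in the low 16 bits. Not the real ibtrs layout. */
static uint32_t imm_pack(uint16_t msg_id, int16_t err)
{
        return ((uint32_t)msg_id << 16) | (uint16_t)err;
}

static void imm_unpack(uint32_t imm, uint16_t *msg_id, int16_t *err)
{
        *msg_id = imm >> 16;
        *err = (int16_t)(imm & 0xffff);
}
```

Whatever the real split is, the point of the scheme stands: the completion needs no payload at all, because the immediate alone identifies the outstanding IO and its status.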
Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
Hi Sagi, thanks a lot for the detailed reply. Answers inline below: On Tue, Jul 9, 2019 at 9:46 PM Sagi Grimberg wrote: > > Hi Danil and Jack, > > > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg, > > > > Could you please provide some feedback to the IBNBD driver and the > > IBTRS library? > > So far we addressed all the requests provided by the community > > That is not exactly correct AFAIR, > > My main issues which were raised before are: > - IMO there isn't any justification to this ibtrs layering separation >given that the only user of this is your ibnbd. Unless you are >trying to submit another consumer, you should avoid adding another >subsystem that is not really general purpose. We designed ibtrs not only with the IBNBD in mind but also as the transport layer for a distributed SDS. We'd like to be able to do what ceph is capable of (automatic up/down scaling of the storage cluster, automatic recovery) but using in-kernel rdma-based IO transport drivers, thin-provisioned volume managers, etc. to keep the highest possible performance. That modest plan of ours should among others cover for the following: When using IBNBD/SRP/NVMEoF to export devices (say, thin-provisioned volumes) from server to client and building an (md-)raid on top of the imported devices on client side in order to provide for redundancy across different machines, one gets very decent throughput and low latency, since the IOs are sent in parallel to the storage machines. One downside of this setup, is that the resync traffic has to flow over the client, where the md-raid is sitting. Ideally the resync traffic should flow directly between the two "legs" (storage machines) of the raid. The server side of such a "distributed raid" capable of this direct syncing between the array members would necessarily require to have some logic on server side and hence could also sit on top of ibtrs. 
(To continue the analogy, the "server" side of an md-raid built on top of, say, two NVMEoF devices consists of just two block devices, which can't communicate with each other) All in all ibtrs is a library to establish a "fat", multipath, autoreconnectable connection between two hosts on top of rdma, optimized for transport of IO traffic. > - ibtrs in general is using almost no infrastructure from the existing >kernel subsystems. Examples are: >- tag allocation mechanism (which I'm not clear why its needed) As you correctly noticed our client manages the buffers allocated and registered by the server on the connection establishment. Our tags are just a mechanism to take and release those buffers for incoming requests on the client side. Since the buffers allocated by the server are to be shared between all the devices mapped from that server and all their HW queues (each having num_cpus of them) the mechanism behind get_tag/put_tag also takes care of the fairness. >- rdma rw abstraction similar to what we have in the core On the one hand we have only a single IO-related function: ibtrs_clt_request(READ/WRITE, session,...), which executes rdma write with imm, or requests an rdma write with imm to be executed by the server. On the other hand we provide an abstraction to establish and manage what we call a "session", which consists of multiple paths (to do failover and multipath with different policies), where each path consists of num_cpu rdma connections. Once you have established a session you can add or remove paths from it on the fly. In case the connection to the server is lost, the client does periodic attempts to reconnect automatically. On the server side you get just sg-lists with a direction READ or WRITE as requested by the client. We designed this interface not only as the minimum required to build a block device on top of rdma but also with a distributed raid in mind. >- list_next_or_null_rr_rcu ?? We use that for multipath. 
The macro (and more importantly the way we use it) has been reviewed by Linus and quite closely by Paul E. McKenney. AFAIR the conclusion was that Roman's implementation is correct, but too tricky to use correctly to be included into the kernel as a public interface. See https://lkml.org/lkml/2018/5/18/659 >- few other examples sprinkled around.. To my best knowledge we addressed everything we got comments on and will definitely do so in the future. > Another question, from what I understand from the code, the client > always rdma_writes data on writes (with imm) from a remote pool of > server buffers dedicated to it. Essentially all writes are immediate (no > rdma reads ever). How is that different than using send wrs to a set of > pre-posted recv buffers (like all others are doing)? Is it faster? At the very beginning of the project we did some measurements and saw that it was faster. I'm not sure if this is still true, since the hardware, the drivers and the rdma subsystem have changed since then. Also it seemed to make the code simpler. > Also, given that the server pre-allocate a substantial amou
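The round-robin path selection that list_next_or_null_rr_rcu enables can be illustrated without the RCU machinery: walk a circular ring of paths starting just after the current one and skip paths that are down (the struct layout and the `connected` flag below are ours for illustration, not ibtrs code):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified multipath ring: each path points to the next one and the
 * last path points back to the first. */
struct path {
        struct path *next;
        int connected;   /* non-zero while the path is usable */
};

/* Round-robin: pick the next usable path after `cur`, wrapping around;
 * NULL when no path in the ring (including cur itself) is connected. */
static struct path *next_path_rr(struct path *cur)
{
        struct path *p = cur->next;

        while (p != cur && !p->connected)
                p = p->next;
        return p->connected ? p : NULL;
}
```

The tricky part the lkml thread discusses is doing exactly this walk while paths are added and removed concurrently under RCU; the selection logic itself is as small as shown.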
Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
Hi Sagi, thanks a lot for the analysis. I didn't know about the inline_data_size parameter in nvmet. It is at PAGE_SIZE on our systems. Will rerun our benchmarks with echo 2097152 > /sys/kernel/config/nvmet/ports/1/param_inline_data_size echo 2097152 > /sys/kernel/config/nvmet/ports/2/param_inline_data_size before enabling the port. Best Danil. On Wed, Jul 10, 2019 at 9:11 PM Sagi Grimberg wrote: > > > >> I still do not understand why this should give any notice-able > >> performance advantage. > > > > Usually omitting invalidations gives a healthy bump. > > > > Also, RDMA WRITE is generally faster than READ at the HW level in > > various ways. > > Yes, but this should be essentially identical to running nvme-rdma > with 512KB of immediate-data (the nvme term is in-capsule data). > > In the upstream nvme target we have inline_data_size port attribute > that is tunable for that (defaults to PAGE_SIZE).
Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
Hi Leon, thanks for the feedback! On Tue, Jul 9, 2019 at 1:00 PM Leon Romanovsky wrote: > > On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote: > > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg, > > > > Could you please provide some feedback to the IBNBD driver and the > > IBTRS library? > > So far we addressed all the requests provided by the community and > > continue to maintain our code up-to-date with the upstream kernel > > while having an extra compatibility layer for older kernels in our > > out-of-tree repository. > > I understand that SRP and NVMEoF which are in the kernel already do > > provide equivalent functionality for the majority of the use cases. > > IBNBD on the other hand is showing higher performance and more > > importantly includes the IBTRS - a general purpose library to > > establish connections and transport BIO-like read/write sg-lists over > > RDMA, while SRP is targeting SCSI and NVMEoF is addressing NVME. While > > I believe IBNBD does meet the kernel coding standards, it doesn't have > > a lot of users, while SRP and NVMEoF are widely accepted. Do you think > > it would make sense for us to rework our patchset and try pushing it > > for staging tree first, so that we can proof IBNBD is well maintained, > > beneficial for the eco-system, find a proper location for it within > > block/rdma subsystems? This would make it easier for people to try it > > out and would also be a huge step for us in terms of maintenance > > effort. > > The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of > > RDMA and is not bound to IB (We will evaluate IBTRS with ROCE in the > > near future). Do you think it would make sense to rename the driver to > > RNBD/RTRS? > > It is better to avoid "staging" tree, because it will lack attention of > relevant people and your efforts will be lost once you will try to move > out of staging. We are all remembering Lustre and don't want to see it > again. 
> > Back then, you were asked to provide support for performance superiority. I have only theories of why ibnbd is showing better numbers than nvmeof: 1. The way we utilize the MQ framework in IBNBD. We promise to have queue_depth (say 512) requests on each of the num_cpus hardware queues of each device, but in fact we have only queue_depth for the whole "session" toward a given server. The moment we have queue_depth inflights we need to stop the queue (on a device on a cpu) we get more requests on. We need to start them again after some requests are completed. We maintain per-cpu lists of stopped HW queues, a bitmap showing which lists are not empty, etc. to wake them up in a round-robin fashion to avoid starvation of any devices. 2. We only do rdma writes with imm. A server reserves queue_depth of max_io_size buffers for a given client. The client manages those himself. The client uses the imm field to tell the server which buffer has been written (and where) and the server uses the imm field to send back the errno. If our max_io_size is 64K and queue_depth 512 and the client only issues 4K IOs all the time, then 60K * 512 = 30MB of memory is wasted. On the other hand we do no buffer allocation/registration in the io path on the server side. The server sends rdma addresses and keys to those preregistered buffers on connection establishment and deallocates/unregisters them when a session is closed. That's for writes. For reads, the client registers user buffers (after fr) and sends the addresses and keys to the server (with an rdma write with imm). The server rdma writes into those buffers. The client does the unregistering/invalidation and completes the request. > Can you please share any numbers with us? Apart from github (https://github.com/ionos-enterprise/ibnbd/tree/master/performance/v4-v5.2-rc3) the performance results for v5.2-rc3 on two different systems can be accessed under dcd.ionos.com/ibnbd-performance-report. The page allows to filter out test scenarios interesting for comparison. > > Thanks
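The worst-case waste described in point 2 is easy to model (illustrative helper, not driver code): the server pins queue_depth buffers of max_io_size each per client, and a client that only ever issues io_size-sized writes leaves the remainder of every buffer idle.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of pre-registered server memory left unused per client when
 * every IO uses io_size bytes of a max_io_size-sized buffer. */
static uint64_t wasted_bytes(uint64_t queue_depth, uint64_t max_io_size,
                             uint64_t io_size)
{
        return queue_depth * (max_io_size - io_size);
}
```

For the numbers in the mail (512 buffers of 64K, all-4K IOs) that is 60K per buffer, 30MB per client in total; the trade-off bought is zero allocation/registration work in the server's IO path.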
Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg, Could you please provide some feedback to the IBNBD driver and the IBTRS library? So far we addressed all the requests provided by the community and continue to maintain our code up-to-date with the upstream kernel while having an extra compatibility layer for older kernels in our out-of-tree repository. I understand that SRP and NVMEoF, which are in the kernel already, do provide equivalent functionality for the majority of the use cases. IBNBD on the other hand is showing higher performance and more importantly includes the IBTRS - a general purpose library to establish connections and transport BIO-like read/write sg-lists over RDMA, while SRP is targeting SCSI and NVMEoF is addressing NVME. While I believe IBNBD does meet the kernel coding standards, it doesn't have a lot of users, while SRP and NVMEoF are widely accepted. Do you think it would make sense for us to rework our patchset and try pushing it to the staging tree first, so that we can prove IBNBD is well maintained, beneficial for the ecosystem, and find a proper location for it within the block/rdma subsystems? This would make it easier for people to try it out and would also be a huge step for us in terms of maintenance effort. The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of RDMA and is not bound to IB (we will evaluate IBTRS with ROCE in the near future). Do you think it would make sense to rename the driver to RNBD/RTRS? 
Thank you, Best Regards, Danil On Thu, Jun 20, 2019 at 5:03 PM Jack Wang wrote: > > Hi all, > > Here is v4 of IBNBD/IBTRS patches, which have minor changes > > Changelog > - > v4: > o Protocol extended to transport IO priorities > o Support for Mellanox ConnectX-4/X-5 > o Minor sysfs extentions (display access mode on server side) > o Bug fixes: cleaning up sysfs folders, race on deallocation of resources > o Style fixes > > v3: > o Sparse fixes: > - le32 -> le16 conversion > - pcpu and RCU wrong declaration > - sysfs: dynamically alloc array of sockaddr structures to reduce >size of a stack frame > > o Rename sysfs folder on client and server sides to show source and > destination addresses of the connection, i.e.: >...//paths// > > o Remove external inclusions from Makefiles. > * https://lwn.net/Articles/756994/ > > v2: > o IBNBD: > - No legacy request IO mode, only MQ is left. > > o IBTRS: > - No FMR registration, only FR is left. > > * https://lwn.net/Articles/755075/ > > v1: > - IBTRS: load-balancing and IO fail-over using multipath features were > added. > > - Major parts of the code were rewritten, simplified and overall code > size was reduced by a quarter. > > * https://lwn.net/Articles/746342/ > > v0: > - Initial submission > > * https://lwn.net/Articles/718181/ > > > Introduction > - > > IBTRS (InfiniBand Transport) is a reliable high speed transport library > which allows for establishing connection between client and server > machines via RDMA. It is based on RDMA-CM, so expect also to support RoCE > and iWARP, but we mainly tested in IB environment. It is optimized to > transfer (read/write) IO blocks in the sense that it follows the BIO > semantics of providing the possibility to either write data from a > scatter-gather list to the remote side or to request ("read") data > transfer from the remote side into a given set of buffers. > > IBTRS is multipath capable and provides I/O fail-over and load-balancing > functionality, i.e. 
in IBTRS terminology, an IBTRS path is a set of RDMA > CMs and a particular path is selected according to the load-balancing policy. > It can be used for other components not bound to IBNBD. > > > IBNBD (InfiniBand Network Block Device) is a pair of kernel modules > (client and server) that allow for remote access of a block device on > the server over the IBTRS protocol. After being mapped, the remote block > devices can be accessed on the client side as local block devices. > Internally IBNBD uses IBTRS as an RDMA transport library. > > > - IBNBD/IBTRS is developed in order to map thin provisioned volumes, > thus the internal protocol is simple. > - IBTRS was developed as an independent RDMA transport library, which > supports fail-over and load-balancing policies using multipath, thus > it can be used for any other IO needs rather than only for a block > device. > - IBNBD/IBTRS is fast. > Old comparison results: > https://www.spinics.net/lists/linux-rdma/msg48799.html > New comparison results: see performance measurements section below. > > Key features of the IBTRS transport library and the IBNBD block device: > > o High throughput and low latency due to: > - Only two RDMA messages per IO. > - IMM InfiniBand messages on responses to reduce round trip latency. > - Simplified memory management: memory allocation happens once on > server side when the IBTRS session is established
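The point made earlier in the thread about why the server-side buffer chunks must stay in one session-wide pool (rather than being hard-split across hardware queues) can be illustrated with a toy model. This is not IBTRS code; the numbers (1024 chunks, 64 queues) come from the example in the discussion, and the function names are made up:

```python
# Toy model (not IBTRS code): compare a hard per-hw-queue split of the
# server's buffer chunks against a single session-wide shared pool.

QUEUE_DEPTH = 1024   # buffer chunks the server shares per session
NR_HW_QUEUES = 64    # one hw queue per CPU, as in the thread's example

def max_inflight_hard_split(active_queues):
    """Each hw queue owns a fixed 1/NR_HW_QUEUES slice of the chunks."""
    per_queue = QUEUE_DEPTH // NR_HW_QUEUES   # 16 chunks per queue
    return active_queues * per_queue

def max_inflight_shared(active_queues):
    """All queues draw from one session-wide pool on demand."""
    return QUEUE_DEPTH if active_queues > 0 else 0

# With only 4 busy CPUs the hard split caps the session at 64 requests
# in flight, while the shared pool still offers the full 1024.
print(max_inflight_hard_split(4))  # 64
print(max_inflight_shared(4))      # 1024
```

This is exactly the performance drop described for the case where the number of IO producers is smaller than the number of hw queues: the unused queues' chunk quotas sit idle under a hard split.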
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
Hi Doug, thanks for the feedback. You read the cover letter correctly: our transport library implements multipath (load balancing and failover) on top of the RDMA API. Its name "IBTRS" is slightly misleading in that regard: it can sit on top of RoCE as well. The library allows for "bundling" multiple RDMA "paths" (source addr - destination addr pairs) into one "session". So our session consists of one or more paths and each path under the hood consists of as many QPs (each connecting source with destination) as there are CPUs on the client system. The user load (in our case IBNBD is a block device and generates some block requests) is load-balanced on a per-CPU basis. I understand this is something very different from what SMC-R is doing. Am I right? Do you know what stage MP-RDMA development currently is at? Best, Danil Kipnis. P.S. Sorry for the duplicate if any, the first mail was returned because of html. On Thu, Feb 8, 2018 at 7:10 PM Bart Van Assche wrote: > > On Thu, 2018-02-08 at 18:38 +0100, Danil Kipnis wrote: > > thanks for the link to the article. To the best of my understanding, > > the guys suggest to authenticate the devices first and only then > > authenticate the users who use the devices in order to get access to a > > corporate service. They also mention in the presentation the current > > trend of moving corporate services into the cloud. But I think this is > > not about the devices that the cloud itself is built of. Isn't a cloud > > first built out of devices connected via IB and then users (and their > > devices) are provided access to the services of that cloud as a whole? > > If a malicious user already plugged his device into an IB switch of a > > cloud internal infrastructure, isn't it game over anyway? Can't he > > just take the hard drives instead of mapping them? > > Hello Danil, > > It seems like we each have been focussing on different aspects of the article. 
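The session/path/QP structure described above (a session bundles several paths; each path carries one QP per client CPU; a policy balances across paths while the submitting CPU picks its QP) can be sketched roughly as follows. The class and attribute names are illustrative, not the real IBTRS API, and round-robin is just one example of a load-balancing policy:

```python
# Toy sketch (hypothetical names, not the real IBTRS API): a session
# bundles several paths; each path holds one QP per client CPU. The
# multipath policy selects a path, then the submitting CPU indexes
# into that path's per-CPU QP array.

import itertools

class Path:
    def __init__(self, src, dst, nr_cpus):
        # one QP per client CPU, all connecting src with dst
        self.qps = [f"QP({src}->{dst}, cpu {c})" for c in range(nr_cpus)]

class Session:
    def __init__(self, paths):
        self.paths = paths
        self._rr = itertools.cycle(range(len(paths)))  # round-robin policy

    def select_qp(self, cpu):
        path = self.paths[next(self._rr)]     # load-balance across paths
        return path.qps[cpu % len(path.qps)]  # per-CPU QP within the path

sess = Session([Path("10.0.0.1", "10.0.1.1", 4),
                Path("10.0.0.2", "10.0.1.2", 4)])
print(sess.select_qp(cpu=2))  # QP(10.0.0.1->10.0.1.1, cpu 2)
print(sess.select_qp(cpu=2))  # QP(10.0.0.2->10.0.1.2, cpu 2)
```

Successive requests from the same CPU alternate between the two paths, while each request stays on the submitting CPU's own QP within the chosen path.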
> The reason I referred to that article is because I read the following in > that article: "Unlike the conventional perimeter security model, BeyondCorp > doesn’t gate access to services and tools based on a user’s physical location > or the originating network [ ... ] The zero trust architecture spells trouble > for traditional attacks that rely on penetrating a tough perimeter to waltz > freely within an open internal network." Suppose e.g. that an organization > decides to use RoCE or iWARP for connectivity between block storage initiator > systems and block storage target systems and that it has a single company- > wide Ethernet network. If the target system does not restrict access based > on initiator IP address then any penetrator would be able to access all the > block devices exported by the target after a SoftRoCE or SoftiWARP initiator > driver has been loaded. If the target system however restricts access based > on the initiator IP address then that would make it harder for a penetrator > to access the exported block storage devices. Instead of just penetrating the > network access, IP address spoofing would have to be used or access would > have to be obtained to a system that has been granted access to the target > system. > > Thanks, > > Bart. > > -- Danil Kipnis Linux Kernel Developer
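The access restriction Bart describes, a target only accepting initiators whose source address was explicitly granted access to a device, amounts to a per-device allow-list consulted at login time. A toy illustration (made-up names and addresses, not LIO/SRP code):

```python
# Toy illustration (hypothetical names, not LIO/SRP code): a target
# that accepts a login only if the initiator's address is on the
# allow-list configured for the requested device.

allowed_initiators = {
    "vol0": {"10.0.0.5", "10.0.0.6"},  # initiators granted access to vol0
    "vol1": {"10.0.0.7"},
}

def may_login(device, initiator_ip):
    """Reject any initiator not explicitly granted access to the device."""
    return initiator_ip in allowed_initiators.get(device, set())

print(may_login("vol0", "10.0.0.5"))   # True: listed initiator
print(may_login("vol0", "10.99.0.1"))  # False: unknown address
```

As the article's argument goes, a penetrator who merely loads a SoftRoCE/SoftiWARP initiator driver then still has to spoof a listed address or compromise a listed host, rather than mapping every exported device for free.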
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
On Wed, Feb 7, 2018 at 6:32 PM, Bart Van Assche wrote: > On Wed, 2018-02-07 at 18:18 +0100, Roman Penyaev wrote: >> So the question is: are there real life setups where >> some of the local IB network members can be untrusted? > > Hello Roman, > > You may want to read more about the latest evolutions with regard to network > security. An article that I can recommend is the following: "Google reveals > own security regime policy trusts no network, anywhere, ever" > (https://www.theregister.co.uk/2016/04/06/googles_beyondcorp_security_policy/). > > If data-centers would start deploying RDMA among their entire data centers > (maybe they are already doing this) then I think they will want to restrict > access to block devices to only those initiator systems that need it. > > Thanks, > > Bart. > > Hi Bart, thanks for the link to the article. To the best of my understanding, the guys suggest to authenticate the devices first and only then authenticate the users who use the devices in order to get access to a corporate service. They also mention in the presentation the current trend of moving corporate services into the cloud. But I think this is not about the devices from which that cloud is build of. Isn't a cloud first build out of devices connected via IB and then users (and their devices) are provided access to the services of that cloud as a whole? If a malicious user already plugged his device into an IB switch of a cloud internal infrastructure, isn't it game over anyway? Can't he just take the hard drives instead of mapping them? Thanks, Danil.
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
On Mon, Feb 5, 2018 at 7:38 PM, Bart Van Assche wrote: > On 02/05/18 08:40, Danil Kipnis wrote: >> >> It just occurred to me that we could easily extend the interface in >> such a way that each client (i.e. each session) would have on the server >> side its own directory with the devices it can access. I.e. instead of >> just "dev_search_path" per server, any client would only be able to >> access devices under dev_search_path/session_name. (The session name >> must already be generated by each client in a unique way.) This way >> one could have explicit control over which devices can be accessed >> by which clients. Do you think that would do it? > > > Hello Danil, > > That sounds interesting to me. However, I think that approach requires > configuring client access completely before the kernel target side module is > loaded. It does not allow configuring permissions dynamically after the > kernel target module has been loaded. Additionally, I don't see how to > support attributes per (initiator, block device) pair with that approach. > LIO e.g. supports the > /sys/kernel/config/target/srpt/*/*/acls/*/lun_*/write_protect attribute. You > may want to implement similar functionality if you want to convince more > users to use IBNBD. > > Thanks, > > Bart. Hello Bart, the configuration (which devices can be accessed by a particular client) can also happen after the kernel target module is loaded. The directory dev_search_path is a module parameter and is fixed. It contains for example "/ibnbd_devices/". But a particular client X would only be able to access the devices located in the subdirectory "/ibnbd_devices/client_x/". (The session name here is client_x.) One can add or remove the devices from that directory (those are just symlinks to /dev/xxx) at any time - before or after the server module is loaded. But you are right, we need something additional in order to be able to specify which devices a client can access writable and which read-only. Maybe additional subdirectories "wr" and "ro" for each client: those under /ibnbd_devices/client_x/ro/ can only be read by client_x and those in /ibnbd_devices/client_x/wr/ can also be written to? Thanks, Danil.
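The per-client ro/wr layout proposed here can be sketched as a toy script. This is only an illustration of the directory scheme from the email (the real server module would resolve these paths in the kernel); the helper names and the use of a temporary directory in place of /ibnbd_devices are assumptions:

```python
# Sketch of the proposed server-side layout (illustrative, not the
# IBNBD server code): dev_search_path/<session_name>/{ro,wr}/ holds
# symlinks to the real /dev/xxx nodes, and the subdirectory a device
# sits in decides the client's access mode.

import os
import tempfile

dev_search_path = tempfile.mkdtemp()  # stands in for /ibnbd_devices

def export_device(session, device, writable):
    sub = "wr" if writable else "ro"
    d = os.path.join(dev_search_path, session, sub)
    os.makedirs(d, exist_ok=True)
    # symlink named after the device, pointing at the real block device
    os.symlink(os.path.join("/dev", device), os.path.join(d, device))

def access_mode(session, device):
    """Return 'wr', 'ro', or None if the device is not exported at all."""
    for sub in ("wr", "ro"):
        if os.path.islink(os.path.join(dev_search_path, session, sub, device)):
            return sub
    return None

export_device("client_x", "ram0", writable=False)
export_device("client_x", "ram1", writable=True)
print(access_mode("client_x", "ram0"))  # ro
print(access_mode("client_x", "ram1"))  # wr
```

Adding or removing a symlink at runtime grants or revokes access without reloading the server module, which is the dynamic-configuration property discussed above.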
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
On Mon, Feb 5, 2018 at 3:17 PM, Sagi Grimberg wrote: > Hi Bart, My another 2 cents:) On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche wrote: > > > On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote: >> >> >> o Simple configuration of IBNBD: >> - Server side is completely passive: volumes do not need to be >> explicitly exported. > > > > That sounds like a security hole? I think the ability to configure > whether or > not an initiator is allowed to log in is essential and also which > volumes > an > initiator has access to. Our design targets a well-controlled production environment, so security is handled in another layer. >>> >>> >>> >>> What will happen to a new adopter of the code you are contributing? >> >> >> Hi Sagi, Hi Bart, >> thanks for your feedback. >> We considered the "storage cluster" setup, where each ibnbd client has >> access to each ibnbd server. Each ibnbd server manages devices under >> its "dev_search_path" and can provide access to them to any ibnbd >> client in the network. > > > I don't understand how that helps? > >> On top of that the ibnbd server has an additional >> "artificial" restriction, that a device can be mapped in writable mode >> by only one client at a time. > > > I think one would still need the option to disallow readable export as > well. It just occurred to me that we could easily extend the interface in such a way that each client (i.e. each session) would have on the server side its own directory with the devices it can access. I.e. instead of just "dev_search_path" per server, any client would only be able to access devices under dev_search_path/session_name. (The session name must already be generated by each client in a unique way.) This way one could have explicit control over which devices can be accessed by which clients. Do you think that would do it?
Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
> >> Hi Bart, >> >> My another 2 cents:) >> On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche >> wrote: >>> >>> On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote: o Simple configuration of IBNBD: - Server side is completely passive: volumes do not need to be explicitly exported. >>> >>> >>> That sounds like a security hole? I think the ability to configure >>> whether or >>> not an initiator is allowed to log in is essential and also which volumes >>> an >>> initiator has access to. >> >> Our design targets a well-controlled production environment, so >> security is handled in another layer. > > > What will happen to a new adopter of the code you are contributing? Hi Sagi, Hi Bart, thanks for your feedback. We considered the "storage cluster" setup, where each ibnbd client has access to each ibnbd server. Each ibnbd server manages devices under its "dev_search_path" and can provide access to them to any ibnbd client in the network. On top of that the ibnbd server has an additional "artificial" restriction, that a device can be mapped in writable mode by only one client at a time. -- Danil