Re: [PATCH v4 17/25] ibnbd: client: main functionality

2019-09-27 Thread Danil Kipnis



On 27.09.19 11:32, Danil Kipnis wrote:

> On Fri, Sep 27, 2019 at 10:52 AM Roman Penyaev  wrote:

> > No, it seems this thingy is a bit different.  According to my
> > understanding patches 3 and 4 from this patchset do the
> > following: #1 split the whole queue depth equally over the number
> > of hardware queues and #2 return a tag number which is unique
> > host-wide (more or less similar to unique_tag, right?).

> > #2 is not needed for ibtrs, and #1 can easily be done by dividing
> > queue_depth by the number of hw queues on tag set allocation, e.g.
> > something like the following:

> >  ...
> >  tags->nr_hw_queues = num_online_cpus();
> >  tags->queue_depth  = sess->queue_depth / tags->nr_hw_queues;

> >  blk_mq_alloc_tag_set(tags);


> > And this trick won't work out for the performance.  ibtrs client
> > has a single resource: the set of buffer chunks received from the
> > server side.  And these buffers should be dynamically distributed
> > between IO producers according to the load.  With a hard split
> > of the whole queue depth between hw queues we can forget about
> > dynamic load distribution, here is an example:

> > - say the server shares 1024 buffer chunks for a session (I do not
> >   remember the actual number).

> > - 1024 buffers are equally divided between hw queues, let's
> >   say 64 (number of cpus), so each queue is 16 requests deep.

> > - only several CPUs produce IO, and instead of occupying the
> >   whole "bandwidth" of a session, i.e. 1024 buffer chunks,
> >   we limit ourselves to the small queue depth of each hw
> >   queue.

> > And performance drops significantly when the number of IO producers
> > is smaller than the number of hw queues (CPUs), and this can easily
> > be tested and proved.

> > So for this particular ibtrs case tags should be globally shared,
> > and it seems (unfortunately) there are no other similar requirements
> > for other block devices.

> I don't see any difference between what you describe here and 100 dm
> volumes sitting on top of a single NVME device.


Hallo Christoph,

am I wrong?

Thank you,

Danil.



Re: [PATCH v4 17/25] ibnbd: client: main functionality

2019-09-27 Thread Danil Kipnis
On Fri, Sep 27, 2019 at 10:52 AM Roman Penyaev  wrote:
>
> No, it seems this thingy is a bit different.  According to my
> understanding patches 3 and 4 from this patchset do the
> following: #1 split the whole queue depth equally over the number
> of hardware queues and #2 return a tag number which is unique
> host-wide (more or less similar to unique_tag, right?).
>
> #2 is not needed for ibtrs, and #1 can easily be done by dividing
> queue_depth by the number of hw queues on tag set allocation, e.g.
> something like the following:
>
> ...
> tags->nr_hw_queues = num_online_cpus();
> tags->queue_depth  = sess->queue_depth / tags->nr_hw_queues;
>
> blk_mq_alloc_tag_set(tags);
>
>
> And this trick won't work out for the performance.  ibtrs client
> has a single resource: the set of buffer chunks received from the
> server side.  And these buffers should be dynamically distributed
> between IO producers according to the load.  With a hard split
> of the whole queue depth between hw queues we can forget about
> dynamic load distribution, here is an example:
>
>- say the server shares 1024 buffer chunks for a session (I do not
>  remember the actual number).
>
>- 1024 buffers are equally divided between hw queues, let's
>  say 64 (number of cpus), so each queue is 16 requests deep.
>
>- only several CPUs produce IO, and instead of occupying the
>  whole "bandwidth" of a session, i.e. 1024 buffer chunks,
>  we limit ourselves to the small queue depth of each hw
>  queue.
>
> And performance drops significantly when the number of IO producers
> is smaller than the number of hw queues (CPUs), and this can easily
> be tested and proved.
>
> So for this particular ibtrs case tags should be globally shared,
> and it seems (unfortunately) there are no other similar requirements
> for other block devices.
I don't see any difference between what you describe here and 100 dm
volumes sitting on top of a single NVME device.


Re: [PATCH v4 21/25] ibnbd: server: functionality for IO submission to file or block dev

2019-09-26 Thread Danil Kipnis
> > Bart, would it in your opinion be OK to drop the file_io support in
> > IBNBD entirely? We implemented this feature in the beginning of the
> > project to see whether it could be beneficial in some use cases, but
> > never actually found any.
>
> I think that's reasonable since the loop driver can be used to convert a
> file into a block device.
Jack, shall we drop it?


Re: [PATCH v4 21/25] ibnbd: server: functionality for IO submission to file or block dev

2019-09-26 Thread Danil Kipnis
On Thu, Sep 26, 2019 at 5:11 PM Bart Van Assche  wrote:
> >>> +struct ibnbd_dev *ibnbd_dev_open(const char *path, fmode_t flags,
> >>> +  enum ibnbd_io_mode mode, struct bio_set *bs,
> >>> +  ibnbd_dev_io_fn io_cb)
> >>> +{
> >>> + struct ibnbd_dev *dev;
> >>> + int ret;
> >>> +
> >>> + dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> >>> + if (!dev)
> >>> + return ERR_PTR(-ENOMEM);
> >>> +
> >>> + if (mode == IBNBD_BLOCKIO) {
> >>> + dev->blk_open_flags = flags;
> >>> + ret = ibnbd_dev_blk_open(dev, path, dev->blk_open_flags);
> >>> + if (ret)
> >>> + goto err;
> >>> + } else if (mode == IBNBD_FILEIO) {
> >>> + dev->blk_open_flags = FMODE_READ;
> >>> + ret = ibnbd_dev_blk_open(dev, path, dev->blk_open_flags);
> >>> + if (ret)
> >>> + goto err;
> >>> +
> >>> + ret = ibnbd_dev_vfs_open(dev, path, flags);
> >>> + if (ret)
> >>> + goto blk_put;
> >>
> >> This looks really weird. Why to call ibnbd_dev_blk_open() first for file
> >> I/O mode? Why to set dev->blk_open_flags to FMODE_READ in file I/O mode?

Bart, would it in your opinion be OK to drop the file_io support in
IBNBD entirely? We implemented this feature in the beginning of the
project to see whether it could be beneficial in some use cases, but
never actually found any.


Re: [PATCH v4 06/25] ibtrs: client: main functionality

2019-09-26 Thread Danil Kipnis
On Thu, Sep 26, 2019 at 1:21 AM Bart Van Assche  wrote:
>
> On 9/25/19 3:53 PM, Danil Kipnis wrote:
> > Oh, you mean we just need stub functions for those, so that nobody
> > steps on a null?
>
> What I meant is that the memory that is backing a device must not be
> freed until the reference count of a device has dropped to zero. If a
> struct device is embedded in a larger structure that means signaling a
> completion from inside the release function (ibtrs_clt_dev_release())
> and not freeing the struct device memory (kfree(clt) in free_clt())
> before that completion has been triggered.

Got it, thank you. Will move free_clt into the release function.
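I.e. roughly (a sketch of the intended change, not the final code; the
struct layout is an assumption):

static void ibtrs_clt_dev_release(struct device *dev)
{
        struct ibtrs_clt *clt = container_of(dev, struct ibtrs_clt, dev);

        /* runs only once the last reference to the device is dropped */
        kfree(clt);
}

static void free_clt(struct ibtrs_clt *clt)
{
        /* sysfs cleanup elided */
        device_unregister(&clt->dev);
        /* no kfree(clt) here anymore -- the release callback does it */
}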


Re: [PATCH v4 16/25] ibnbd: client: private header with client structs and functions

2019-09-25 Thread Danil Kipnis
On Tue, Sep 17, 2019 at 6:36 PM Jinpu Wang  wrote:
>
> On Sat, Sep 14, 2019 at 12:25 AM Bart Van Assche  wrote:
> >
> > On 6/20/19 8:03 AM, Jack Wang wrote:
> > > + char pathname[NAME_MAX];
> > [ ... ]
> >  > + char blk_symlink_name[NAME_MAX];
> >
> > Please allocate path names dynamically instead of hard-coding the upper
> > length for a path.
Those strings are used to name directories and files under sysfs,
which I think makes NAME_MAX a natural limitation for them. Client and
server only exchange those strings on connection establishment, not in
the IO path. We do not really need to save 256K on a server with 1000
devices mapped in parallel. A patch to allocate those strings makes
the code longer, introduces new error paths and in my opinion doesn't
bring any benefits.


Re: [PATCH v4 06/25] ibtrs: client: main functionality

2019-09-25 Thread Danil Kipnis
On Wed, Sep 25, 2019 at 11:08 PM Bart Van Assche  wrote:
>
> On 9/25/19 1:50 PM, Danil Kipnis wrote:
> > On Wed, Sep 25, 2019 at 8:55 PM Bart Van Assche  wrote:
> >> There is plenty of code under drivers/base that calls get_device() and
> >> put_device(). Are you sure that none of the code under drivers/base will
> >> ever call get_device() and put_device() for the ibtrs client device?
> >
> > You mean how could multiple kernel modules share the same ibtrs
> > session...? I really never thought that far...
>
> I meant something else: device_register() registers struct device
> instances in multiple lists. The driver core may decide to iterate over
> these lists and to call get_device() / put_device() on the devices it
> finds in these lists.
Oh, you mean we just need stub functions for those, so that nobody
steps on a null?


Re: [PATCH v4 17/25] ibnbd: client: main functionality

2019-09-25 Thread Danil Kipnis
On Wed, Sep 18, 2019 at 5:47 PM Bart Van Assche  wrote:
> Combining multiple queues (a) into a single queue (b) that is smaller
> than the combined source queues without sacrificing performance is
> tricky. We already have one such implementation in the block layer core
> and it took considerable time to get that implementation right. See e.g.
> blk_mq_sched_mark_restart_hctx() and blk_mq_sched_restart().

Roma, can you please estimate the performance impact in case we switch to it?


Re: [PATCH v4 03/25] ibtrs: private headers with IBTRS protocol structs and helpers

2019-09-25 Thread Danil Kipnis
On Tue, Sep 24, 2019 at 12:50 AM Bart Van Assche  wrote:
>
> On 6/20/19 8:03 AM, Jack Wang wrote:
> > +#define P1 )
> > +#define P2 ))
> > +#define P3 )))
> > +#define P4 ))))
> > +#define P(N) P ## N
> > +
> > +#define CAT(a, ...) PRIMITIVE_CAT(a, __VA_ARGS__)
> > +#define PRIMITIVE_CAT(a, ...) a ## __VA_ARGS__
> > +
> > +#define LIST(...)\
> > + __VA_ARGS__,\
> > + ({ unknown_type(); NULL; }) \
> > + CAT(P, COUNT_ARGS(__VA_ARGS__)) \
> > +
> > +#define EMPTY()
> > +#define DEFER(id) id EMPTY()
> > +
> > +#define _CASE(obj, type, member) \
> > + __builtin_choose_expr(  \
> > + __builtin_types_compatible_p(   \
> > + typeof(obj), type), \
> > + ((type)obj)->member
> > +#define CASE(o, t, m) DEFER(_CASE)(o, t, m)
> > +
> > +/*
> > + * Below we define retrieving of sessname from common IBTRS types.
> > + * Client or server related types have to be defined by special
> > + * TYPES_TO_SESSNAME macro.
> > + */
> > +
> > +void unknown_type(void);
> > +
> > +#ifndef TYPES_TO_SESSNAME
> > +#define TYPES_TO_SESSNAME(...) ({ unknown_type(); NULL; })
> > +#endif
> > +
> > +#define ibtrs_prefix(obj)\
> > + _CASE(obj, struct ibtrs_con *,  sess->sessname),\
> > + _CASE(obj, struct ibtrs_sess *, sessname),  \
> > + TYPES_TO_SESSNAME(obj)  \
> > + ))
>
> No preprocessor voodoo please. Please remove all of the above and modify
> the logging statements such that these pass the proper name string as
> first argument to logging macros.

Hi Bart,

do you think it would make sense if we first submit a new patchset for
IBTRS (with the changes you suggested plus the security problem closed)
and later submit a separate one for IBNBD only?

Thank you,
Danil


Re: [PATCH v4 06/25] ibtrs: client: main functionality

2019-09-25 Thread Danil Kipnis
On Wed, Sep 25, 2019 at 8:55 PM Bart Van Assche  wrote:
>
> On 9/25/19 10:36 AM, Danil Kipnis wrote:
> > On Mon, Sep 23, 2019 at 11:51 PM Bart Van Assche  wrote:
> >> On 6/20/19 8:03 AM, Jack Wang wrote:
> >>> +static void ibtrs_clt_dev_release(struct device *dev)
> >>> +{
> >>> + /* Nobody plays with device references, so nop */
> >>> +}
> >>
> >> That comment sounds wrong. Have you reviewed all of the device driver
> >> core code and checked that there is no code in there that manipulates
> >> struct device refcounts? I think the code that frees struct ibtrs_clt
> >> should be moved from free_clt() into the above function.
> >
> > We only use the device to create an entry under /sys/class. free_clt()
> > is destroying sysfs first and unregisters the device afterwards. I
> > don't really see the need to free from the callback instead... Will
> > make it clear in the comment.
>
> There is plenty of code under drivers/base that calls get_device() and
> put_device(). Are you sure that none of the code under drivers/base will
> ever call get_device() and put_device() for the ibtrs client device?
You mean how could multiple kernel modules share the same ibtrs
session...? I really never thought that far...

> Thanks,
>
> Bart.
>


Re: [PATCH v4 06/25] ibtrs: client: main functionality

2019-09-25 Thread Danil Kipnis
Hallo Bart,

On Mon, Sep 23, 2019 at 11:51 PM Bart Van Assche  wrote:
>
> On 6/20/19 8:03 AM, Jack Wang wrote:
> > +static const struct ibtrs_ib_dev_pool_ops dev_pool_ops;
> > +static struct ibtrs_ib_dev_pool dev_pool = {
> > + .ops = &dev_pool_ops
> > +};
>
> Can the definitions in this file be reordered such that the forward
> declaration of dev_pool_ops can be removed?
Will try to.

> > +static void ibtrs_rdma_error_recovery(struct ibtrs_clt_con *con);
> > +static int ibtrs_clt_rdma_cm_handler(struct rdma_cm_id *cm_id,
> > +  struct rdma_cm_event *ev);
> > +static void ibtrs_clt_rdma_done(struct ib_cq *cq, struct ib_wc *wc);
> > +static void complete_rdma_req(struct ibtrs_clt_io_req *req, int errno,
> > +   bool notify, bool can_wait);
> > +static int ibtrs_clt_write_req(struct ibtrs_clt_io_req *req);
> > +static int ibtrs_clt_read_req(struct ibtrs_clt_io_req *req);
>
> Please also remove these forward declarations.
OK

> > +bool ibtrs_clt_sess_is_connected(const struct ibtrs_clt_sess *sess)
> > +{
> > + return sess->state == IBTRS_CLT_CONNECTED;
> > +}
>
> Is it really useful to introduce a one line function for testing the
> session state?
No, not in that case really, thanks.

> > +static inline struct ibtrs_tag *
> > +__ibtrs_get_tag(struct ibtrs_clt *clt, enum ibtrs_clt_con_type con_type)
> > +{
> > + size_t max_depth = clt->queue_depth;
> > + struct ibtrs_tag *tag;
> > + int cpu, bit;
> > +
> > + cpu = get_cpu();
> > + do {
> > + bit = find_first_zero_bit(clt->tags_map, max_depth);
> > + if (unlikely(bit >= max_depth)) {
> > + put_cpu();
> > + return NULL;
> > + }
> > +
> > + } while (unlikely(test_and_set_bit_lock(bit, clt->tags_map)));
> > + put_cpu();
> > +
> > + tag = GET_TAG(clt, bit);
> > + WARN_ON(tag->mem_id != bit);
> > + tag->cpu_id = cpu;
> > + tag->con_type = con_type;
> > +
> > + return tag;
> > +}
>
> What is the role of the get_cpu() and put_cpu() calls in this function?
> How can it make sense to assign the cpu number to tag->cpu_id after
> put_cpu() has been called?
We disable preemption while looking for a free "ibtrs_tag" (permit) in
our tags_map. We store the cpu number ibtrs_clt_get_tag() has
originally been called on in the ibtrs_tag we just found, so that when
the user later uses this ibtrs_tag for an rdma operation
(ibtrs_clt_request()), we select the rdma connection whose cq_vector
corresponds to this cpu. If IRQ affinity is configured accordingly,
this allows an IO response to be processed on the same cpu the IO
request was originally submitted on.
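Roughly, the mapping we rely on for that is the following (a sketch
only; the layout of the connection array in the session is an
assumption):

/* con[0] is assumed to carry session/admin messages, con[1..nr_cpu_ids]
 * are the per-CPU IO connections, each with its own cq_vector */
static struct ibtrs_clt_con *
ibtrs_tag_to_clt_con(struct ibtrs_clt_sess *sess, struct ibtrs_tag *tag)
{
        return sess->con[tag->cpu_id + 1];
}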

> > +static inline void ibtrs_clt_init_req(struct ibtrs_clt_io_req *req,
> > +   struct ibtrs_clt_sess *sess,
> > +   ibtrs_conf_fn *conf,
> > +   struct ibtrs_tag *tag, void *priv,
> > +   const struct kvec *vec, size_t usr_len,
> > +   struct scatterlist *sg, size_t sg_cnt,
> > +   size_t data_len, int dir)
> > +{
> > + struct iov_iter iter;
> > + size_t len;
> > +
> > + req->tag = tag;
> > + req->in_use = true;
> > + req->usr_len = usr_len;
> > + req->data_len = data_len;
> > + req->sglist = sg;
> > + req->sg_cnt = sg_cnt;
> > + req->priv = priv;
> > + req->dir = dir;
> > + req->con = ibtrs_tag_to_clt_con(sess, tag);
> > + req->conf = conf;
> > + req->need_inv = false;
> > + req->need_inv_comp = false;
> > + req->inv_errno = 0;
> > +
> > + iov_iter_kvec(&iter, READ, vec, 1, usr_len);
> > + len = _copy_from_iter(req->iu->buf, usr_len, &iter);
> > + WARN_ON(len != usr_len);
> > +
> > + reinit_completion(&req->inv_comp);
> > + if (sess->stats.enable_rdma_lat)
> > + req->start_jiffies = jiffies;
> > +}
>
> A comment that explains what "req" stands for would be welcome. Since
> this function copies the entire payload, I assume that it is only used
> for control messages and not for reading or writing data from a block
> device?
Yes, we only copy the control message provided by the user. Will extend
the description.

> > +static int ibtrs_clt_failover_req(struct ibtrs_clt *clt,
> > +   struct ibtrs_clt_io_req *fail_req)
> > +{
> > + struct ibtrs_clt_sess *alive_sess;
> > + struct ibtrs_clt_io_req *req;
> > + int err = -ECONNABORTED;
> > + struct path_it it;
> > +
> > + do_each_path(alive_sess, clt, &it) {
> > + if (unlikely(alive_sess->state != IBTRS_CLT_CONNECTED))
> > + continue;
> > + req = ibtrs_clt_get_copy_req(alive_sess, fail_req);
> > + if (req->dir == DMA_TO_DEVICE)
> > + err = ibtrs_clt_write_req(req);
> > +   

Re: [PATCH v4 02/25] ibtrs: public interface header to establish RDMA connections

2019-09-25 Thread Danil Kipnis
On Mon, Sep 23, 2019 at 7:44 PM Bart Van Assche  wrote:
> > +/**
> > + * enum ibtrs_clt_link_ev - Events about connectivity state of a client
> > + * @IBTRS_CLT_LINK_EV_RECONNECTED   Client was reconnected.
> > + * @IBTRS_CLT_LINK_EV_DISCONNECTED  Client was disconnected.
> > + */
> > +enum ibtrs_clt_link_ev {
> > + IBTRS_CLT_LINK_EV_RECONNECTED,
> > + IBTRS_CLT_LINK_EV_DISCONNECTED,
> > +};
> > +
> > +/**
> > + * Source and destination address of a path to be established
> > + */
> > +struct ibtrs_addr {
> > + struct sockaddr_storage *src;
> > + struct sockaddr_storage *dst;
> > +};
>
> Is it really useful to define a structure to hold two pointers or can
> these two pointers also be passed as separate arguments?
We always need both src and dst throughout ibnbd and ibtrs code and
indeed one reason to introduce this struct is that "f(struct
ibtrs_addr *addr, ...);" is shorter than "f(struct sockaddr_storage
*src, struct sockaddr_storage *dst, ...);". But it also makes it
easier to extend the address information describing one ibtrs path in
the future.

> > +/**
> > + * ibtrs_clt_open() - Open a session to a IBTRS client
> > + * @priv: User supplied private data.
> > + * @link_ev: Event notification for connection state changes
> > + *   @priv: user supplied data that was passed to
> > + *   ibtrs_clt_open()
> > + *   @ev: Occurred event
> > + * @sessname: name of the session
> > + * @paths: Paths to be established defined by their src and dst addresses
> > + * @path_cnt: Number of elements in the @paths array
> > + * @port: port to be used by the IBTRS session
> > + * @pdu_sz: Size of extra payload which can be accessed after tag allocation.
> > + * @max_inflight_msg: Max. number of parallel inflight messages for the session
> > + * @max_segments: Max. number of segments per IO request
> > + * @reconnect_delay_sec: time between reconnect tries
> > + * @max_reconnect_attempts: Number of times to reconnect on error before giving
> > + *   up, 0 for disabled, -1 for forever
> > + *
> > + * Starts session establishment with the ibtrs_server. The function can block
> > + * up to ~2000ms until it returns.
> > + *
> > + * Return a valid pointer on success otherwise PTR_ERR.
> > + */
> > +struct ibtrs_clt *ibtrs_clt_open(void *priv, link_clt_ev_fn *link_ev,
> > +  const char *sessname,
> > +  const struct ibtrs_addr *paths,
> > +  size_t path_cnt, short port,
> > +  size_t pdu_sz, u8 reconnect_delay_sec,
> > +  u16 max_segments,
> > +  s16 max_reconnect_attempts);
>
> Having detailed kernel-doc headers for describing API functions is great
> but I'm not sure a .h file is the best location for such documentation.
> Many kernel developers keep kernel-doc headers in .c files because that
> makes it more likely that the documentation and the implementation stay
> in sync.
What is better: to move it or to only copy it to the corresponding C file?

>
> > +
> > +/**
> > + * ibtrs_clt_close() - Close a session
> > + * @sess: Session handler, is freed on return
>   ^^^
>   handle?
>
> This sentence suggests that the handle is freed on return. I guess that
> you meant that the session is freed upon return?
Right, will fix the wording.

>
> > +/**
> > + * ibtrs_clt_get_tag() - allocates tag for future RDMA operation
> > + * @sess: Current session
> > + * @con_type: Type of connection to use with the tag
> > + * @wait: Wait type
> > + *
> > + * Description:
> > + *    Allocates tag for the following RDMA operation.  Tag is used
> > + *    to preallocate all resources and to propagate memory pressure
> > + *    up earlier.
> > + *
> > + * Context:
> > + *    Can sleep if @wait == IBTRS_TAG_WAIT
> > + */
> > +struct ibtrs_tag *ibtrs_clt_get_tag(struct ibtrs_clt *sess,
> > + enum ibtrs_clt_con_type con_type,
> > + int wait);
>
> Since struct ibtrs_tag has another role than what is called a tag in the
> block layer I think a better description is needed of what struct
> ibtrs_tag actually represents.
I think it would be better to rename it to ibtrs_permit in order to
avoid confusion with block layer tags. Will extend the description
also.

> > +/*
> > + * Here goes IBTRS server API
> > + */
>
> Most software either uses the client API or the server API but not both
> at the same time. Has it been considered to use separate header files
> for the client and server APIs?
I don't have any really good reason to put the API of server and client
into a single file, except maybe that the reader can see the API calls
corresponding to the full sequence of request -> indication -> response
-> confirmation in one place.


Re: [PATCH v4 01/25] sysfs: export sysfs_remove_file_self()

2019-09-25 Thread Danil Kipnis
On Mon, Sep 23, 2019 at 7:21 PM Bart Van Assche  wrote:
>
> On 6/20/19 8:03 AM, Jack Wang wrote:
> > Function is going to be used in transport over RDMA module
> > in subsequent patches.
>
> It seems like several words are missing from this patch description.
Will extend it with the corresponding description of the function from
fs/sysfs/file.c and an explanation of why we need it.


Re: [PATCH v4 20/25] ibnbd: server: main functionality

2019-09-23 Thread Danil Kipnis
On Fri, Sep 20, 2019 at 5:42 PM Bart Van Assche  wrote:
>
> On 9/20/19 12:36 AM, Danil Kipnis wrote:
> > On Wed, Sep 18, 2019 at 7:41 PM Bart Van Assche  wrote:
> >> On 6/20/19 8:03 AM, Jack Wang wrote:
> >>> +static int process_msg_sess_info(struct ibtrs_srv *ibtrs,
> >>> +  struct ibnbd_srv_session *srv_sess,
> >>> +  const void *msg, size_t len,
> >>> +  void *data, size_t datalen)
> >>> +{
> >>> + const struct ibnbd_msg_sess_info *sess_info_msg = msg;
> >>> + struct ibnbd_msg_sess_info_rsp *rsp = data;
> >>> +
> >>> + srv_sess->ver = min_t(u8, sess_info_msg->ver, IBNBD_PROTO_VER_MAJOR);
> >>> + pr_debug("Session %s using protocol version %d (client version: %d,"
> >>> +  " server version: %d)\n", srv_sess->sessname,
> >>> +  srv_sess->ver, sess_info_msg->ver, IBNBD_PROTO_VER_MAJOR);
> >>
> >> Has this patch been verified with checkpatch? I think checkpatch
> >> recommends not to split literal strings.
> >
> > Yes, it does complain about our split strings. But it's either a split
> > string or a line over 80 chars or "Avoid line continuations in quoted
> > strings" if we use a backslash on the previous line. I don't know how
> > to avoid all three of them.
>
> Checkpatch shouldn't complain about constant strings that exceed 80
> columns. If it complains about such strings then that's a checkpatch bug.
It doesn't indeed... Will concatenate those split quoted strings, thank you.
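I.e. the call will simply become something like:

pr_debug("Session %s using protocol version %d (client version: %d, server version: %d)\n",
         srv_sess->sessname, srv_sess->ver, sess_info_msg->ver, IBNBD_PROTO_VER_MAJOR);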


Re: [PATCH v4 17/25] ibnbd: client: main functionality

2019-09-20 Thread Danil Kipnis
On Wed, Sep 18, 2019 at 5:47 PM Bart Van Assche  wrote:
>
> On 9/18/19 12:14 AM, Danil Kipnis wrote:
> > I'm not familiar with dm code, but don't they need to deal with the
> > same situation: if I configure 100 logical volumes on top of a single
> > NVME drive with X hardware queues, each queue_depth deep, then each dm
> > block device would need to advertise X hardware queues in order to
> > achieve highest performance in case only this one volume is accessed,
> > while in fact those X physical queues have to be shared among all 100
> > logical volumes, if they are accessed in parallel?
>
> Combining multiple queues (a) into a single queue (b) that is smaller
> than the combined source queues without sacrificing performance is
> tricky. We already have one such implementation in the block layer core
> and it took considerable time to get that implementation right. See e.g.
> blk_mq_sched_mark_restart_hctx() and blk_mq_sched_restart().
We will need some time, to check if we can reuse those...

> dm drivers are expected to return DM_MAPIO_REQUEUE or
> DM_MAPIO_DELAY_REQUEUE if the queue (b) is full. It turned out to be
> difficult to get this right in the dm-mpath driver and at the same time
> to achieve good performance.
We also first tried to just return error codes in case we can't
process an incoming request, but this was causing huge performance
degradation when the number of devices mapped over the same session
grows. Since we introduced those per-CPU, per-device lists of
stopped queues, we do scale very well.
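For reference, the bookkeeping behind those lists is roughly the
following (a sketch; names simplified):

/* one instance per CPU; a per-session bitmap additionally marks which
 * of these lists are currently non-empty */
struct ibnbd_cpu_qlist {
        struct list_head        requeue_list;   /* stopped hctxs to rerun */
        spinlock_t              requeue_lock;
        unsigned int            cpu;
};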

>
> The ibnbd driver introduces a third implementation of code that combines
> multiple (per-cpu) queues into one queue per CPU. It is considered
> important in the Linux kernel to avoid code duplication. Hence my
> question whether ibnbd can reuse the block layer infrastructure for
> sharing tag sets.
Yes, will have to reiterate on this.


Re: [PATCH v4 20/25] ibnbd: server: main functionality

2019-09-20 Thread Danil Kipnis
On Wed, Sep 18, 2019 at 7:41 PM Bart Van Assche  wrote:
>
> On 6/20/19 8:03 AM, Jack Wang wrote:
> > +#undef pr_fmt
> > +#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
>
> Same comment here as for a previous patch - please do not include line
> number information in pr_fmt().

Will drop it, thanks.

> > +MODULE_AUTHOR("ib...@profitbricks.com");
> > +MODULE_VERSION(IBNBD_VER_STRING);
> > +MODULE_DESCRIPTION("InfiniBand Network Block Device Server");
> > +MODULE_LICENSE("GPL");
>
> Please remove the version number (MODULE_VERSION()).

OK.

> > +static char dev_search_path[PATH_MAX] = DEFAULT_DEV_SEARCH_PATH;
>
> Please change dev_search_path[] into a dynamically allocated string to
> avoid a hard-coded length limit.

OK.

> > + if (dup[strlen(dup) - 1] == '\n')
> > + dup[strlen(dup) - 1] = '\0';
>
> Can this be changed into a call to strim()?

A directory name can start and end with spaces, for example this
works: mkdir " x  "

> > +static void ibnbd_endio(void *priv, int error)
> > +{
> > + struct ibnbd_io_private *ibnbd_priv = priv;
> > + struct ibnbd_srv_sess_dev *sess_dev = ibnbd_priv->sess_dev;
> > +
> > + ibnbd_put_sess_dev(sess_dev);
> > +
> > + ibtrs_srv_resp_rdma(ibnbd_priv->id, error);
> > +
> > + kfree(priv);
> > +}
>
> Since ibtrs_srv_resp_rdma() starts an RDMA WRITE without waiting for the
> write completion, shouldn't the session reference be retained until the
> completion for that RDMA WRITE has been received? In other words, is
> there a risk with the current approach that the buffer that is being
> transferred to the client will be freed before the RDMA WRITE has finished?

ibtrs-srv.c is keeping track of inflights. When closing a session it
first marks the queue as closing, so that no new write requests are
posted when IBNBD calls ibtrs_srv_resp_rdma():
1831 if (ibtrs_srv_change_state_get_old(sess, IBTRS_SRV_CLOSING,
1832&old_state)
Then ibtrs-srv schedules the ibtrs_srv_close_work, that drains the
queue and then waits for all inflights to return from IBNBD:
...
1274 ib_drain_qp(con->c.qp);
1275 }
1276 /* Wait for all inflights */
1277 ibtrs_srv_wait_ops_ids(sess);

Only then the resources can be deallocated:
1282 unmap_cont_bufs(sess);
1283 ibtrs_srv_free_ops_ids(sess);

>
> > +static struct ibnbd_srv_sess_dev *
> > +ibnbd_get_sess_dev(int dev_id, struct ibnbd_srv_session *srv_sess)
> > +{
> > + struct ibnbd_srv_sess_dev *sess_dev;
> > + int ret = 0;
> > +
> > + read_lock(&srv_sess->index_lock);
> > + sess_dev = idr_find(&srv_sess->index_idr, dev_id);
> > + if (likely(sess_dev))
> > + ret = kref_get_unless_zero(&sess_dev->kref);
> > + read_unlock(&srv_sess->index_lock);
> > +
> > + if (unlikely(!sess_dev || !ret))
> > + return ERR_PTR(-ENXIO);
> > +
> > + return sess_dev;
> > +}
>
> Something that is not important: isn't the sess_dev check superfluous in
> the if-statement just above the return statement? If ret == 1, does that
> imply that sess_dev != 0 ?

We want to have found the device (sess_dev != NULL) and we want to
have been able to take a reference to it (ret != 0)... You are right, if
ret != 0 then sess_dev can't be NULL.

> Has it been considered to return -ENODEV instead of -ENXIO if no device
> is found?

The backend block device, i.e. /dev/nullb0, is still there and might
even be still exported over other session(s). So we thought "No such
device or address" is more appropriate.

>
> > +static int create_sess(struct ibtrs_srv *ibtrs)
> > +{
>  > [ ... ]
> > + strlcpy(srv_sess->sessname, sessname, sizeof(srv_sess->sessname));
>
> Please change the session name into a dynamically allocated string such
> that strdup() can be used instead of strlcpy().

OK.

>
> > +static int process_msg_open(struct ibtrs_srv *ibtrs,
> > + struct ibnbd_srv_session *srv_sess,
> > + const void *msg, size_t len,
> > + void *data, size_t datalen);
> > +
> > +static int process_msg_sess_info(struct ibtrs_srv *ibtrs,
> > +  struct ibnbd_srv_session *srv_sess,
> > +  const void *msg, size_t len,
> > +  void *data, size_t datalen);
>
> Can the code be reordered such that these forward declarations can be
> dropped?

Will try to.

>
> > +static struct ibnbd_srv_sess_dev *
> > +ibnbd_srv_create_set_sess_dev(struct ibnbd_srv_session *srv_sess,
> > +   const struct ibnbd_msg_open *open_msg,
> > +   struct ibnbd_dev *ibnbd_dev, fmode_t open_flags,
> > +   struct ibnbd_srv_dev *srv_dev)
> > +{
> > + struct ibnbd_srv_sess_dev *sdev = ibnbd_sess_dev_alloc(srv_sess);
> > +
> > + if (IS_ERR(sdev))
> > + return sdev;
> > +
> > + 

Re: [PATCH v4 17/25] ibnbd: client: main functionality

2019-09-18 Thread Danil Kipnis
> > > On Sat, Sep 14, 2019 at 1:46 AM Bart Van Assche  wrote:
> > >> A more general question is why ibnbd needs its own queue management
> > >> while no other block driver needs this?
> > >
> > > Each IBNBD device promises to have a queue_depth (of say 512) on each
> > > of its num_cpus hardware queues. In fact we can only process a
> > > queue_depth inflights at once on the whole ibtrs session connecting a
> > > given client with a given server. Those 512 inflights (corresponding
> > > to the number of buffers reserved by the server for this particular
> > > client) have to be shared among all the devices mapped on this
> > > session. This leads to the situation, that we receive more requests
> > > than we can process at the moment. So we need to stop queues and start
> > > them again later in some fair fashion.
> >
> > Can a single CPU really sustain a queue depth of 512 commands? Is it
> > really necessary to have one hardware queue per CPU or is e.g. four
> > queues per NUMA node sufficient? Has it been considered to send the
> > number of hardware queues that the initiator wants to use and also the
> > command depth per queue during login to the target side? That would
> > allow the target side to allocate an independent set of buffers for each
> > initiator hardware queue and would allow to remove the queue management
> > at the initiator side. This might even yield better performance.
> We needed a way which would allow us to address one particular
> requirement: we'd like to be able to "enforce" that a response to an
> IO is processed on the same CPU the IO was originally submitted
> on. In order to do so we establish one rdma connection per
> cpu, each having a separate cq_vector. The administrator can then
> assign the corresponding IRQs to distinct CPUs. The server always
> replies to an IO on the same connection it received the request on. If
> the administrator configured the /proc/irq/y/smp_affinity
> accordingly, the response sent by the server will generate an interrupt
> on the same cpu the IO was originally submitted on. The administrator
> can configure IRQs differently, for example assign a given irq
> (<->cq_vector) to a range of cpus belonging to a numa node, or
> whatever assignment is best for the use case.
> Our transport module IBTRS establishes one connection per CPU
> between a client and a server. The user of the transport module (i.e.
> IBNBD) has no knowledge about the rdma connections, it only has a
> pointer to an abstract "session", which connects it somehow to a
> remote host. IBNBD as a user of IBTRS creates block devices and uses a
> given "session" to send IOs from all the block devices it created for
> that session. That means IBNBD is limited in the maximum number of its
> inflights toward a given remote host by the capability of the
> corresponding "session". So it needs to share the resources provided
> by the session (in our current model those resources are in fact some
> pre-registered buffers on the server side) among its devices.
> It is possible to extend the IBTRS API so that the user (IBNBD) could
> specify how many connections it wants to have on the session to be
> established. It is also possible to extend the ibtrs_clt_get_tag API
> (this is to get a send "permit") with a parameter specifying the
> connection the future IO is to be sent on.
> We now might have to change our communication model in IBTRS a bit in
> order to fix the potential security problem raised during the recent
> RDMA MC: https://etherpad.net/p/LPC2019_RDMA.
>
I'm not familiar with dm code, but don't they need to deal with the
same situation: if I configure 100 logical volumes on top of a single
NVME drive with X hardware queues, each queue_depth deep, then each dm
block device would need to advertise X hardware queues in order to
achieve highest performance in case only this one volume is accessed,
while in fact those X physical queues have to be shared among all 100
logical volumes, if they are accessed in parallel?


Re: [PATCH v4 17/25] ibnbd: client: main functionality

2019-09-17 Thread Danil Kipnis
On Mon, Sep 16, 2019 at 6:46 PM Bart Van Assche  wrote:
>
> On 9/16/19 7:17 AM, Danil Kipnis wrote:
> > On Sat, Sep 14, 2019 at 1:46 AM Bart Van Assche  wrote:
> >> On 6/20/19 8:03 AM, Jack Wang wrote:
> >>> +/*
> >>> + * This is for closing devices when unloading the module:
> >>> + * we might be closing a lot (>256) of devices in parallel
> >>> + * and it is better not to use the system_wq.
> >>> + */
> >>> +static struct workqueue_struct *unload_wq;
> >>
> >> I think that a better motivation is needed for the introduction of a new
> >> workqueue.
>  >
> > We didn't want to pollute the system workqueue when unmapping a big
> > number of devices at once in parallel. Will reiterate on it.
>
> There are multiple system workqueues. From :
>
> extern struct workqueue_struct *system_wq;
> extern struct workqueue_struct *system_highpri_wq;
> extern struct workqueue_struct *system_long_wq;
> extern struct workqueue_struct *system_unbound_wq;
> extern struct workqueue_struct *system_freezable_wq;
> extern struct workqueue_struct *system_power_efficient_wq;
> extern struct workqueue_struct *system_freezable_power_efficient_wq;
>
> Has it been considered to use e.g. system_long_wq?
Will try to switch to system_long_wq, I do agree that a new wq for
just closing devices does make an impression of an overreaction.

>
> >> A more general question is why ibnbd needs its own queue management
> >> while no other block driver needs this?
> >
> > Each IBNBD device promises to have a queue_depth (of say 512) on each
> > of its num_cpus hardware queues. In fact we can only process a
> > queue_depth inflights at once on the whole ibtrs session connecting a
> > given client with a given server. Those 512 inflights (corresponding
> > to the number of buffers reserved by the server for this particular
> > client) have to be shared among all the devices mapped on this
> > session. This leads to the situation, that we receive more requests
> > than we can process at the moment. So we need to stop queues and start
> > them again later in some fair fashion.
>
> Can a single CPU really sustain a queue depth of 512 commands? Is it
> really necessary to have one hardware queue per CPU or is e.g. four
> queues per NUMA node sufficient? Has it been considered to send the
> number of hardware queues that the initiator wants to use and also the
> command depth per queue during login to the target side? That would
> allow the target side to allocate an independent set of buffers for each
> initiator hardware queue and would allow to remove the queue management
> at the initiator side. This might even yield better performance.
We needed a way which would allow us to address one particular
requirement: we'd like to be able to "enforce" that a response to an
IO is processed on the same CPU the IO was originally submitted
on. In order to do so we establish one rdma connection per
cpu, each having a separate cq_vector. The administrator can then
assign the corresponding IRQs to distinct CPUs. The server always
replies to an IO on the same connection it received the request on. If
the administrator configured the /proc/irq/y/smp_affinity
accordingly, the response sent by the server will generate an interrupt
on the same cpu the IO was originally submitted on. The administrator
can configure IRQs differently, for example assign a given irq
(<->cq_vector) to a range of cpus belonging to a numa node, or
whatever assignment is best for the use case.
Our transport module IBTRS establishes one connection per CPU
between a client and a server. The user of the transport module (i.e.
IBNBD) has no knowledge about the rdma connections, it only has a
pointer to an abstract "session", which connects it somehow to a
remote host. IBNBD as a user of IBTRS creates block devices and uses a
given "session" to send IOs from all the block devices it created for
that session. That means IBNBD is limited in the maximum number of its
inflights toward a given remote host by the capability of the
corresponding "session". So it needs to share the resources provided
by the session (in our current model those resources are in fact some
pre-registered buffers on the server side) among its devices.
It is possible to extend the IBTRS API so that the user (IBNBD) could
specify how many connections it wants to have on the session to be
established. It is also possible to extend the ibtrs_clt_get_tag API
(this is to get a send "permit") with a parameter specifying the
connection the future IO is to be sent on.
We now might have to change our communication model in IBTRS a bit in
order to fix the potential security problem raised during the recent
RDMA MC: https://etherpad.net/p/LPC2019_RDMA.

Re: [PATCH v4 17/25] ibnbd: client: main functionality

2019-09-16 Thread Danil Kipnis
On Sat, Sep 14, 2019 at 1:46 AM Bart Van Assche  wrote:
>
> On 6/20/19 8:03 AM, Jack Wang wrote:
> > +MODULE_VERSION(IBNBD_VER_STRING);
>
> No version numbers in upstream code please.
Will drop this, thanks.
>
> > +/*
> > + * This is for closing devices when unloading the module:
> > + * we might be closing a lot (>256) of devices in parallel
> > + * and it is better not to use the system_wq.
> > + */
> > +static struct workqueue_struct *unload_wq;
>
> I think that a better motivation is needed for the introduction of a new
> workqueue.
We didn't want to pollute the system workqueue when unmapping a big
number of devices at once in parallel. Will reiterate on it.

>
> > +#define KERNEL_SECTOR_SIZE  512
>
> Please use SECTOR_SIZE instead of redefining it.
Right.

>
> > +static int ibnbd_clt_revalidate_disk(struct ibnbd_clt_dev *dev,
> > +  size_t new_nsectors)
> > +{
> > + int err = 0;
> > +
> > + ibnbd_info(dev, "Device size changed from %zu to %zu sectors\n",
> > +dev->nsectors, new_nsectors);
> > + dev->nsectors = new_nsectors;
> > + set_capacity(dev->gd,
> > +  dev->nsectors * (dev->logical_block_size /
> > +   KERNEL_SECTOR_SIZE));
> > + err = revalidate_disk(dev->gd);
> > + if (err)
> > + ibnbd_err(dev, "Failed to change device size from"
> > +   " %zu to %zu, err: %d\n", dev->nsectors,
> > +   new_nsectors, err);
> > + return err;
> > +}
>
> Since this function changes the block device size, I think that the name
> ibnbd_clt_revalidate_disk() is confusing. Please rename this function.
I guess ibnbd_clt_resize_disk() would be more appropriate.

>
> > +/**
> > + * ibnbd_get_cpu_qlist() - finds a list with HW queues to be requeued
> > + *
> > + * Description:
> > + * Each CPU has a list of HW queues, which needs to be requeued.  If a list
> > + * is not empty - it is marked with a bit.  This function finds first
> > + * set bit in a bitmap and returns corresponding CPU list.
> > + */
>
> What does it mean to requeue a queue? Queue elements can be requeued but
> a queue in its entirety not. Please make this comment more clear.
Will fix the comment. The right wording should probably be "..., which
need to be rerun". We have a list of "stopped" queues for each cpu. We
need to select a list and a queue on that list to rerun, when an IO is
completed.
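Roughly, the selection looks like this (just a sketch; the bitmap and
field names are assumptions): we start searching from the current CPU
and wrap around, so every CPU list gets its turn:

static struct ibnbd_cpu_qlist *
ibnbd_get_cpu_qlist(struct ibnbd_clt_session *sess, int cpu)
{
        int bit;

        /* one bit per possible CPU, set when that CPU's list is non-empty */
        bit = find_next_bit(sess->cpu_queues_bm, nr_cpu_ids, cpu);
        if (bit < nr_cpu_ids)
                return per_cpu_ptr(sess->cpu_queues, bit);
        /* wrap around and check the CPUs below @cpu */
        bit = find_next_bit(sess->cpu_queues_bm, cpu, 0);
        if (bit < cpu)
                return per_cpu_ptr(sess->cpu_queues, bit);

        return NULL;
}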

>
> > +/**
> > + * ibnbd_requeue_if_needed() - requeue if CPU queue is marked as non empty
> > + *
> > + * Description:
> > + * Each CPU has it's own list of HW queues, which should be requeued.
> > + * Function finds such list with HW queues, takes a list lock, picks up
> > + * the first HW queue out of the list and requeues it.
> > + *
> > + * Return:
> > + * True if the queue was requeued, false otherwise.
> > + *
> > + * Context:
> > + * Does not matter.
> > + */
>
> Same comment here.
>
> > +/**
> > + * ibnbd_requeue_all_if_idle() - requeue all queues left in the list if
> > + * session is idling (there are no requests in-flight).
> > + *
> > + * Description:
> > + * This function tries to rerun all stopped queues if there are no
> > + * requests in-flight anymore.  This function tries to solve an obvious
> > + * problem, when number of tags < than number of queues (hctx), which
> > + * are stopped and put to sleep.  If last tag, which has been just put,
> > + * does not wake up all left queues (hctxs), IO requests hang forever.
> > + *
> > + * That can happen when all number of tags, say N, have been exhausted
> > + * from one CPU, and we have many block devices per session, say M.
> > + * Each block device has its own queue (hctx) for each CPU, so eventually
> > + * we can put that number of queues (hctxs) to sleep: M x nr_cpu_ids.
> > + * If number of tags N < M x nr_cpu_ids finally we will get an IO hang.
> > + *
> > + * To avoid this hang last caller of ibnbd_put_tag() (last caller is the
> > + * one who observes sess->busy == 0) must wake up all remaining queues.
> > + *
> > + * Context:
> > + * Does not matter.
> > + */
>
> Same comment here.
>
> A more general question is why ibnbd needs its own queue management
> while no other block driver needs this?
Each IBNBD device promises to have a queue_depth (of say 512) on each
of its num_cpus hardware queues. In fact we can only process a
queue_depth inflights at once on the whole ibtrs session connecting a
given client with a given server. Those 512 inflights (corresponding
to the number of buffers reserved by the server for this particular
client) have to be shared among all the devices mapped on this
session. This leads to the situation, that we receive more requests
than we can process at the moment. So we need to stop queues and start
them again later in some fair fashion.

>
> > +static void ibnbd_softirq_done_fn(struct

Re: [PATCH v4 15/25] ibnbd: private headers with IBNBD protocol structs and helpers

2019-09-16 Thread Danil Kipnis
> > > +/**
> > > + * struct ibnbd_msg_open_rsp - response message to IBNBD_MSG_OPEN
> > > + * @hdr: message header
> > > + * @nsectors: number of sectors
> >
> > What is the size of a single sector?
> 512b, will mention explicitly in the next round.
We only have KERNEL_SECTOR_SIZE=512, defined in ibnbd-clt.c. It looks like we
only depend on this exact number to set the capacity of the block
device on client side. I'm not sure whether it is worth extending the
protocol to send the number from the server instead.

> > > + * @max_segments: max segments the hardware supports in one transfer
> >
> > Does 'hardware' refer to the RDMA adapter that transfers the IBNBD
> > message or to the storage device? In the latter case, I assume that
> > transfer refers to a DMA transaction?
> "hardware" refers to the storage device on the server-side.
The field contains queue_max_segments() of the target block device.
It is used to call blk_queue_max_segments() on the corresponding
device on the client side.
We do also have BMAX_SEGMENTS define in ibnbd-clt.h which sets an
upper limit to max_segments and does refer to the capabilities of the
RDMA adapter. This information should only be known to the transport
module and ideally would be returned to IBNBD during the registration
in IBTRS.

> > Note: I plan to review the entire patch series but it may take some time
> > before I have finished reviewing the entire patch series.
> >
That will be great, thanks a lot, Bart
Thank you Bart!


Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2019-07-19 Thread Danil Kipnis
Hi Sagi,

thanks a lot for the information. We are doing the right thing
regarding the invalidation (your 2f122e4f5107), but we do use
unsignalled sends and need to fix that. Please correct me if I'm
wrong: The patches (b4b591c87f2b, b4b591c87f2b) fix the problem that
if the ack from the target is lost for some reason, the initiator's HCA
will resend it even after the request is completed.
But doesn't the same problem persist the other way around, too: for the
lost acks from the client? I mean, the target did a send for the "read"
IOs; the client completed the request (after invalidation, refcount
dropped to 0, etc.), but the ack is not delivered to the HCA of the
target, so the target will also resend it. This seems unfixable, since
the client can't possibly know whether the server received its ack or not?
Doesn't the problem go away if rdma_conn_param.retry_count is just set to 0?

Thanks for your help,
Best,
Danil.

On Tue, Jul 9, 2019 at 11:27 PM Sagi Grimberg  wrote:
>
>
> >> Thanks Jason for feedback.
> >> Can you be  more specific about  "the invalidation model for MR was wrong"
> >
> > MR's must be invalidated before data is handed over to the block
> > layer. It can't leave MRs open for access and then touch the memory
> > the MR covers.
>
> Jason is referring to these fixes:
> 2f122e4f5107 ("nvme-rdma: wait for local invalidation before completing
> a request")
> 4af7f7ff92a4 ("nvme-rdma: don't complete requests before a send work
> request has completed")
> b4b591c87f2b ("nvme-rdma: don't suppress send completions")


Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2019-07-12 Thread Danil Kipnis
On Fri, Jul 12, 2019 at 2:22 AM Sagi Grimberg  wrote:
>
>
> >> My main issues which were raised before are:
> >> - IMO there isn't any justification to this ibtrs layering separation
> >> given that the only user of this is your ibnbd. Unless you are
> >> trying to submit another consumer, you should avoid adding another
> >> subsystem that is not really general purpose.
> > We designed ibtrs not only with the IBNBD in mind but also as the
> > transport layer for a distributed SDS. We'd like to be able to do what
> > ceph is capable of (automatic up/down scaling of the storage cluster,
> > automatic recovery) but using in-kernel rdma-based IO transport
> > drivers, thin-provisioned volume managers, etc. to keep the highest
> > possible performance.
>
> Sounds lovely, but still very much bound to your ibnbd. And that part
> is not included in the patch set, so I still don't see why should this
> be considered as a "generic" transport subsystem (it clearly isn't).
Having IBTRS sit on a storage enables that storage to communicate with
other storages (forward requests, request reads from other storages,
e.g. for sync traffic). IBTRS is generic in the sense that it removes
the strict separation into initiator (converting BIOs into some
hardware-specific protocol messages) and target (which forwards those
messages to some local device supporting that protocol).
It appears less generic to me to talk SCSI or NVME between storages if
some storages have SCSI, others NVME disks or LVM volumes, or a mixed
setup. IBTRS allows one to just send or request a read of an sg-list
between machines over rdma - the very minimum required to transport a BIO.
It would indeed support our case for the library if we could propose
at least two users of it. We currently only have a very early stage
prototype capable of organizing storages in pools, multiplexing io
between different storages, etc., sitting on top of ibtrs; it's not
functional yet. On the other hand ibnbd with ibtrs alone already make
over 1 lines.

> > All in all ibtrs is a library to establish a "fat", multipath,
> > autoreconnectable connection between two hosts on top of rdma,
> > optimized for transport of IO traffic.
>
> That is also dictating a wire-protocol which makes it useless to pretty
> much any other consumer. Personally, I don't see how this library
> would ever be used outside of your ibnbd.
It's true, IBTRS also imposes a protocol for connection establishment
and the IO path. I think at least the IO part we did reduce to a bare
minimum:
350 * Write *
351
352 1. When processing a write request client selects one of the memory chunks
353 on the server side and rdma writes there the user data, user header and the
354 IBTRS_MSG_RDMA_WRITE message. Apart from the type (write), the message only
355 contains size of the user header. The client tells the server which chunk has
356 been accessed and at what offset the IBTRS_MSG_RDMA_WRITE can be found by
357 using the IMM field.
358
359 2. When confirming a write request server sends an "empty" rdma message with
360 an immediate field. The 32 bit field is used to specify the outstanding
361 inflight IO and for the error code.
362
363 CLT                                          SRV
364 usr_data + usr_hdr + ibtrs_msg_rdma_write -> [IBTRS_IO_REQ_IMM]
365 [IBTRS_IO_RSP_IMM]                        <- (id + errno)
366
367 * Read *
368
369 1. When processing a read request client selects one of the memory chunks
370 on the server side and rdma writes there the user header and the
371 IBTRS_MSG_RDMA_READ message. This message contains the type (read), size of
372 the user header, flags (specifying if memory invalidation is necessary) and the
373 list of addresses along with keys for the data to be read into.
374
375 2. When confirming a read request server transfers the requested data first,
376 attaches an invalidation message if requested and finally an "empty" rdma
377 message with an immediate field. The 32 bit field is used to specify the
378 outstanding inflight IO and the error code.
379
380 CLT                                SRV
381 usr_hdr + ibtrs_msg_rdma_read -->  [IBTRS_IO_REQ_IMM]
382 [IBTRS_IO_RSP_IMM]            <--  usr_data + (id + errno)
383 or in case client requested invalidation:
384 [IBTRS_IO_RSP_IMM_W_INV]      <--  usr_data + (INV) + (id + errno)
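In other words, the message travelling with the data for a write is
tiny; it is essentially just the following (a sketch matching the
description above; field widths are an assumption):

struct ibtrs_msg_rdma_write {
        __le16  type;           /* write */
        __le16  usr_len;        /* size of the user header */
};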

> >> - ibtrs in general is using almost no infrastructure from the existing
> >> kernel subsystems. Examples are:
> >> - tag allocation mechanism (which I'm not clear why its needed)
> > As you correctly noticed our client manages the buffers allocated and
> > registered by the server on the connection establishment. Our tags are
> > just a mechanism to take and release those buffers for incoming
> > requests on client side. Since the buffers allocated by the server are
> > to be shared between all the devices mapped from that server and all
their HW queues (each having num_cpus of them), the mechanism behind
get_tag/put_tag also takes care of the fairness.

Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2019-07-11 Thread Danil Kipnis
Hi Sagi,

thanks a lot for the detailed reply. Answers inline below:

On Tue, Jul 9, 2019 at 9:46 PM Sagi Grimberg  wrote:
>
> Hi Danil and Jack,
>
> > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,
> >
> > Could you please provide some feedback to the IBNBD driver and the
> > IBTRS library?
> > So far we addressed all the requests provided by the community
>
> That is not exactly correct AFAIR,
>
> My main issues which were raised before are:
> - IMO there isn't any justification to this ibtrs layering separation
>given that the only user of this is your ibnbd. Unless you are
>trying to submit another consumer, you should avoid adding another
>subsystem that is not really general purpose.
We designed ibtrs not only with the IBNBD in mind but also as the
transport layer for a distributed SDS. We'd like to be able to do what
ceph is capable of (automatic up/down scaling of the storage cluster,
automatic recovery) but using in-kernel rdma-based IO transport
drivers, thin-provisioned volume managers, etc. to keep the highest
possible performance. That modest plan of ours should among others
cover for the following:
When using IBNBD/SRP/NVMEoF to export devices (say, thin-provisioned
volumes) from server to client and building an (md-)raid on top of the
imported devices on client side in order to provide for redundancy
across different machines, one gets very decent throughput and low
latency, since the IOs are sent in parallel to the storage machines.
One downside of this setup is that the resync traffic has to flow
over the client, where the md-raid is sitting. Ideally the resync
traffic should flow directly between the two "legs" (storage machines)
of the raid. The server side of such a "distributed raid" capable of
this direct syncing between the array members would necessarily
require some logic on the server side and hence could also sit on
top of ibtrs. (To continue the analogy, the "server" side of an
md-raid built on top of, say, two NVMEoF devices is just two block
devices, which can't communicate with each other.)
All in all ibtrs is a library to establish a "fat", multipath,
autoreconnectable connection between two hosts on top of rdma,
optimized for transport of IO traffic.

> - ibtrs in general is using almost no infrastructure from the existing
>kernel subsystems. Examples are:
>- tag allocation mechanism (which I'm not clear why its needed)
As you correctly noticed, our client manages the buffers allocated and
registered by the server on connection establishment. Our tags are
just a mechanism to take and release those buffers for incoming
requests on the client side. Since the buffers allocated by the server
are to be shared between all the devices mapped from that server and
all their HW queues (each having num_cpus of them), the mechanism
behind get_tag/put_tag also takes care of the fairness.
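On the client side the usage is then simply the following (a sketch;
argument lists abbreviated and some names assumed):

tag = ibtrs_clt_get_tag(sess, con_type, IBTRS_TAG_WAIT);  /* may sleep */
err = ibtrs_clt_request(WRITE, sess, tag, ...);           /* rdma write with imm */
/* ... and in the confirmation path of that IO: */
ibtrs_clt_put_tag(sess, tag);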

>- rdma rw abstraction similar to what we have in the core
On the one hand we have only a single IO-related function:
ibtrs_clt_request(READ/WRITE, session,...), which executes an rdma write
with imm, or requests an rdma write with imm to be executed by the
server. On the other hand we provide an abstraction to establish and
manage what we call a "session", which consists of multiple paths (to do
failover and multipath with different policies), where each path
consists of num_cpu rdma connections. Once you have established a session
you can add or remove paths from it on the fly. In case the connection
to the server is lost, the client does periodic attempts to reconnect
automatically. On the server side you get just sg-lists with a
direction READ or WRITE as requested by the client. We designed this
interface not only as the minimum required to build a block device on
top of rdma but also with a distributed raid in mind.

>- list_next_or_null_rr_rcu ??
We use that for multipath. The macro (and more importantly the way we
use it) has been reviewed by Linus and quite closely by Paul E.
McKenney. AFAIR the conclusion was that Roman's implementation is
correct, but too tricky to use correctly in order to be included into
the kernel as a public interface. See https://lkml.org/lkml/2018/5/18/659

>- few other examples sprinkled around..
To the best of my knowledge we addressed everything we got comments on and
will definitely do so in the future.

> Another question, from what I understand from the code, the client
> always rdma_writes data on writes (with imm) from a remote pool of
> server buffers dedicated to it. Essentially all writes are immediate (no
> rdma reads ever). How is that different than using send wrs to a set of
> pre-posted recv buffers (like all others are doing)? Is it faster?
At the very beginning of the project we did some measurements and saw
that it is faster. I'm not sure if this is still true, since the
hardware, the drivers and the rdma subsystem have changed since then.
Also it seemed to make the code simpler.

> Also, given that the server pre-allocate a substantial amou

Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2019-07-11 Thread Danil Kipnis
Hi Sagi,

thanks a lot for the analysis. I didn't know about the
inline_data_size parameter in nvmet. It is at PAGE_SIZE on our
systems.
We will rerun our benchmarks with
echo 2097152 > /sys/kernel/config/nvmet/ports/1/param_inline_data_size
echo 2097152 > /sys/kernel/config/nvmet/ports/2/param_inline_data_size
before enabling the port.
Best
Danil.

On Wed, Jul 10, 2019 at 9:11 PM Sagi Grimberg  wrote:
>
>
> >> I still do not understand why this should give any notice-able
> >> performance advantage.
> >
> > Usually omitting invalidations gives a healthy bump.
> >
> > Also, RDMA WRITE is generally faster than READ at the HW level in
> > various ways.
>
> Yes, but this should be essentially identical to running nvme-rdma
> with 512KB of immediate-data (the nvme term is in-capsule data).
>
> In the upstream nvme target we have inline_data_size port attribute
> that is tunable for that (defaults to PAGE_SIZE).



-- 
Danil Kipnis
Linux Kernel Developer

1&1 IONOS Cloud GmbH | Greifswalder Str. 207 | 10405 Berlin | Germany
E-mail: danil.kip...@cloud.ionos.com | Web: www.ionos.de




Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2019-07-10 Thread Danil Kipnis
Hi Leon,

thanks for the feedback!

On Tue, Jul 9, 2019 at 1:00 PM Leon Romanovsky  wrote:
>
> On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote:
> > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,
> >
> > Could you please provide some feedback to the IBNBD driver and the
> > IBTRS library?
> > So far we addressed all the requests provided by the community and
> > continue to maintain our code up-to-date with the upstream kernel
> > while having an extra compatibility layer for older kernels in our
> > out-of-tree repository.
> > I understand that SRP and NVMEoF which are in the kernel already do
> > provide equivalent functionality for the majority of the use cases.
> > IBNBD on the other hand is showing higher performance and more
> > importantly includes the IBTRS - a general purpose library to
> > establish connections and transport BIO-like read/write sg-lists over
> > RDMA, while SRP is targeting SCSI and NVMEoF is addressing NVME. While
> > I believe IBNBD does meet the kernel coding standards, it doesn't have
> > a lot of users, while SRP and NVMEoF are widely accepted. Do you think
> > it would make sense for us to rework our patchset and try pushing it
> > for staging tree first, so that we can proof IBNBD is well maintained,
> > beneficial for the eco-system, find a proper location for it within
> > block/rdma subsystems? This would make it easier for people to try it
> > out and would also be a huge step for us in terms of maintenance
> > effort.
> > The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of
> > RDMA and is not bound to IB (We will evaluate IBTRS with ROCE in the
> > near future). Do you think it would make sense to rename the driver to
> > RNBD/RTRS?
>
> It is better to avoid "staging" tree, because it will lack attention of
> relevant people and your efforts will be lost once you will try to move
> out of staging. We are all remembering Lustre and don't want to see it
> again.
>
> Back then, you was asked to provide support for performance superiority.

I have only theories as to why ibnbd is showing better numbers than nvmeof:
1. The way we utilize the MQ framework in IBNBD. We promise to have
queue_depth (say 512) requests on each of the num_cpus hardware queues
of each device, but in fact we have only queue_depth for the whole
"session" toward a given server. The moment we have queue_depth
inflights, we have to stop any hardware queue (of whichever device, on
whichever cpu) on which further requests arrive, and start it again
after some requests have completed. We maintain per-cpu lists of
stopped HW queues, a bitmap showing which lists are not empty, etc.,
and wake them up in a round-robin fashion to avoid starvation of any
device (see the first sketch after this list).
2. We only do rdma writes with imm. A server reserves queue_depth
buffers of max_io_size for a given client. The client manages those
itself. The client uses the imm field to tell the server which buffer
has been written (and where), and the server uses the imm field to
send back the errno. If our max_io_size is 64K, queue_depth is 512 and
the client only issues 4K IOs all the time, then (64K - 4K) * 512 =
30M of memory is wasted. On the other hand, we do no buffer
allocation/registration in the IO path on the server side. The server
sends the rdma addresses and keys of those preregistered buffers on
connection establishment and deallocates/unregisters them when the
session is closed. That's for writes. For reads, the client registers
the user buffers (using fast registration) and sends their addresses
and keys to the server (with an rdma write with imm). The server
rdma-writes into those buffers. The client then does the
unregistering/invalidation and completes the request (see the second
sketch below for the imm encoding).
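To make point 1 a bit more concrete, here is a rough sketch of that
restart bookkeeping (per-cpu lists of stopped queues plus a bitmap of
non-empty lists, scanned round-robin); the data structures and names
are simplified assumptions, not the actual ibnbd code:

#include <linux/blk-mq.h>
#include <linux/cpumask.h>
#include <linux/list.h>
#include <linux/percpu.h>
#include <linux/spinlock.h>

/* a hardware queue that was stopped because no session buffer was free */
struct stopped_hctx_sketch {
        struct list_head        entry;
        struct blk_mq_hw_ctx    *hctx;
};

struct requeue_sketch {
        struct list_head __percpu *stopped;  /* stopped queues, per cpu */
        struct cpumask             nonempty; /* which per-cpu lists have entries */
        int                        last_cpu; /* where the previous wakeup happened */
        spinlock_t                 lock;
};

/* called when a request completes and a session buffer becomes free */
static void sketch_requeue_one(struct requeue_sketch *s)
{
        struct stopped_hctx_sketch *sq;
        struct list_head *list;
        int cpu;

        spin_lock_irq(&s->lock);
        /* round-robin over cpus with stopped queues to avoid starvation */
        cpu = cpumask_next(s->last_cpu, &s->nonempty);
        if (cpu >= nr_cpu_ids)
                cpu = cpumask_first(&s->nonempty);
        if (cpu >= nr_cpu_ids) {
                spin_unlock_irq(&s->lock);
                return;         /* nothing is stopped */
        }
        list = per_cpu_ptr(s->stopped, cpu);
        sq = list_first_entry(list, struct stopped_hctx_sketch, entry);
        list_del_init(&sq->entry);
        if (list_empty(list))
                cpumask_clear_cpu(cpu, &s->nonempty);
        s->last_cpu = cpu;
        spin_unlock_irq(&s->lock);

        blk_mq_start_stopped_hw_queue(sq->hctx, true);
}

And for point 2, a sketch of how the 32-bit immediate data could carry
the buffer number in one direction and the errno in the other; the
actual ibtrs wire encoding may differ:

#include <linux/types.h>

/* client -> server: index of the server buffer that was just written */
static inline u32 sketch_imm_from_io(u32 buf_id)
{
        return buf_id;                  /* 0 .. queue_depth - 1 */
}

/* server -> client: same index plus the IO result for that buffer */
static inline u32 sketch_imm_from_result(u32 buf_id, int err)
{
        return (buf_id & 0xffff) | ((u32)(-err & 0xffff) << 16);
}

static inline void sketch_imm_to_result(u32 imm, u32 *buf_id, int *err)
{
        *buf_id = imm & 0xffff;
        *err = -(int)((imm >> 16) & 0xffff);
}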

> Can you please share any numbers with us?
Apart from github
(https://github.com/ionos-enterprise/ibnbd/tree/master/performance/v4-v5.2-rc3),
the performance results for v5.2-rc3 on two different systems can be
accessed under dcd.ionos.com/ibnbd-performance-report. The page allows
filtering out the test scenarios of interest for comparison.

>
> Thanks


Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2019-07-09 Thread Danil Kipnis
Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,

Could you please provide some feedback to the IBNBD driver and the
IBTRS library?
So far we addressed all the requests provided by the community and
continue to maintain our code up-to-date with the upstream kernel
while having an extra compatibility layer for older kernels in our
out-of-tree repository.
I understand that SRP and NVMEoF, which are in the kernel already, do
provide equivalent functionality for the majority of the use cases.
IBNBD on the other hand is showing higher performance and, more
importantly, includes IBTRS - a general purpose library to establish
connections and transport BIO-like read/write sg-lists over RDMA,
while SRP is targeting SCSI and NVMEoF is addressing NVME. While I
believe IBNBD does meet the kernel coding standards, it doesn't have a
lot of users, while SRP and NVMEoF are widely accepted. Do you think
it would make sense for us to rework our patchset and try pushing it
to the staging tree first, so that we can prove IBNBD is well
maintained and beneficial for the ecosystem, and find a proper
location for it within the block/rdma subsystems? This would make it
easier for people to try it out and would also be a huge step for us
in terms of maintenance effort.
The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of
RDMA and is not bound to IB (we will evaluate IBTRS with RoCE in the
near future). Do you think it would make sense to rename the driver to
RNBD/RTRS?

Thank you,
Best Regards,
Danil

On Thu, Jun 20, 2019 at 5:03 PM Jack Wang  wrote:
>
> Hi all,
>
> Here is v4 of IBNBD/IBTRS patches, which have minor changes
>
>  Changelog
>  -
> v4:
>   o Protocol extended to transport IO priorities
>   o Support for Mellanox ConnectX-4/X-5
>   o Minor sysfs extentions (display access mode on server side)
>   o Bug fixes: cleaning up sysfs folders, race on deallocation of resources
>   o Style fixes
>
> v3:
>   o Sparse fixes:
>  - le32 -> le16 conversion
>  - pcpu and RCU wrong declaration
>  - sysfs: dynamically alloc array of sockaddr structures to reduce
>size of a stack frame
>
>   o Rename sysfs folder on client and server sides to show source and
> destination addresses of the connection, i.e.:
>...//paths//
>
>   o Remove external inclusions from Makefiles.
>   * https://lwn.net/Articles/756994/
>
> v2:
>   o IBNBD:
>  - No legacy request IO mode, only MQ is left.
>
>   o IBTRS:
>  - No FMR registration, only FR is left.
>
>   * https://lwn.net/Articles/755075/
>
> v1:
>   - IBTRS: load-balancing and IO fail-over using multipath features were 
> added.
>
>   - Major parts of the code were rewritten, simplified and overall code
> size was reduced by a quarter.
>
>   * https://lwn.net/Articles/746342/
>
> v0:
>   - Initial submission
>
>   * https://lwn.net/Articles/718181/
>
>
>  Introduction
>  -
>
> IBTRS (InfiniBand Transport) is a reliable high speed transport library
> which allows for establishing connection between client and server
> machines via RDMA. It is based on RDMA-CM, so expect also to support RoCE
> and iWARP, but we mainly tested in IB environment. It is optimized to
> transfer (read/write) IO blocks in the sense that it follows the BIO
> semantics of providing the possibility to either write data from a
> scatter-gather list to the remote side or to request ("read") data
> transfer from the remote side into a given set of buffers.
>
> IBTRS is multipath capable and provides I/O fail-over and load-balancing
> functionality, i.e. in IBTRS terminology, an IBTRS path is a set of RDMA
> CMs and particular path is selected according to the load-balancing policy.
> It can be used for other components not bind to IBNBD.
>
>
> IBNBD (InfiniBand Network Block Device) is a pair of kernel modules
> (client and server) that allow for remote access of a block device on
> the server over IBTRS protocol. After being mapped, the remote block
> devices can be accessed on the client side as local block devices.
> Internally IBNBD uses IBTRS as an RDMA transport library.
>
>
>- IBNBD/IBTRS is developed in order to map thin provisioned volumes,
>  thus internal protocol is simple.
>- IBTRS was developed as an independent RDMA transport library, which
>  supports fail-over and load-balancing policies using multipath, thus
>  it can be used for any other IO needs rather than only for block
>  device.
>- IBNBD/IBTRS is fast.
>  Old comparison results:
>  https://www.spinics.net/lists/linux-rdma/msg48799.html
>  New comparison results: see performance measurements section below.
>
> Key features of IBTRS transport library and IBNBD block device:
>
> o High throughput and low latency due to:
>- Only two RDMA messages per IO.
>- IMM InfiniBand messages on responses to reduce round trip latency.
>- Simplified memory management: memory allocation happens once on
>  server side when IBTRS session is esta

Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-06-04 Thread Danil Kipnis
Hi Doug,

thanks for the feedback. You read the cover letter correctly: our
transport library implements multipath (load balancing and failover)
on top of the RDMA API. Its name "IBTRS" is slightly misleading in
that regard: it can sit on top of RoCE as well. The library allows
"bundling" multiple rdma "paths" (source addr - destination addr
pairs) into one "session". So our session consists of one or more
paths, and each path under the hood consists of as many QPs (each
connecting source with destination) as there are CPUs on the client
system. The user load (in our case IBNBD is a block device and
generates block requests) is load-balanced on a per-cpu basis, as
sketched below.
I understand this is something very different from what smc-r is
doing. Am I right? Do you know at what stage MP-RDMA development
currently is?
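A tiny sketch of that per-cpu distribution, with illustrative names
(the real ibtrs structures differ):

#include <linux/smp.h>
#include <rdma/rdma_cm.h>

/* illustrative only: one RDMA connection (QP) per cpu within a path */
struct sketch_con {
        struct rdma_cm_id *cm_id;
};

struct sketch_path {
        struct sketch_con **cons;       /* num_cons connections */
        unsigned int        num_cons;   /* == number of online CPUs */
};

/* a request submitted on cpu n goes out over connection n of the path */
static struct sketch_con *sketch_pick_con(struct sketch_path *path)
{
        return path->cons[raw_smp_processor_id() % path->num_cons];
}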

Best,

Danil Kipnis.

P.S. Sorry for the duplicate if any, first mail was returned cause of html.

On Thu, Feb 8, 2018 at 7:10 PM Bart Van Assche  wrote:
>
> On Thu, 2018-02-08 at 18:38 +0100, Danil Kipnis wrote:
> > thanks for the link to the article. To the best of my understanding,
> > the guys suggest to authenticate the devices first and only then
> > authenticate the users who use the devices in order to get access to a
> > corporate service. They also mention in the presentation the current
> > trend of moving corporate services into the cloud. But I think this is
> > not about the devices from which that cloud is build of. Isn't a cloud
> > first build out of devices connected via IB and then users (and their
> > devices) are provided access to the services of that cloud as a whole?
> > If a malicious user already plugged his device into an IB switch of a
> > cloud internal infrastructure, isn't it game over anyway? Can't he
> > just take the hard drives instead of mapping them?
>
> Hello Danil,
>
> It seems like we each have been focussing on different aspects of the article.
> The reason I referred to that article is because I read the following in
> that article: "Unlike the conventional perimeter security model, BeyondCorp
> doesn’t gate access to services and tools based on a user’s physical location
> or the originating network [ ... ] The zero trust architecture spells trouble
> for traditional attacks that rely on penetrating a tough perimeter to waltz
> freely within an open internal network." Suppose e.g. that an organization
> decides to use RoCE or iWARP for connectivity between block storage initiator
> systems and block storage target systems and that it has a single company-
> wide Ethernet network. If the target system does not restrict access based
> on initiator IP address then any penetrator would be able to access all the
> block devices exported by the target after a SoftRoCE or SoftiWARP initiator
> driver has been loaded. If the target system however restricts access based
> on the initiator IP address then that would make it harder for a penetrator
> to access the exported block storage devices. Instead of just penetrating the
> network access, IP address spoofing would have to be used or access would
> have to be obtained to a system that has been granted access to the target
> system.
>
> Thanks,
>
> Bart.
>
>


-- 
Danil Kipnis
Linux Kernel Developer


Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-08 Thread Danil Kipnis
On Wed, Feb 7, 2018 at 6:32 PM, Bart Van Assche  wrote:
> On Wed, 2018-02-07 at 18:18 +0100, Roman Penyaev wrote:
>> So the question is: are there real life setups where
>> some of the local IB network members can be untrusted?
>
> Hello Roman,
>
> You may want to read more about the latest evolutions with regard to network
> security. An article that I can recommend is the following: "Google reveals
> own security regime policy trusts no network, anywhere, ever"
> (https://www.theregister.co.uk/2016/04/06/googles_beyondcorp_security_policy/).
>
> If data-centers would start deploying RDMA among their entire data centers
> (maybe they are already doing this) then I think they will want to restrict
> access to block devices to only those initiator systems that need it.
>
> Thanks,
>
> Bart.
>
>

Hi Bart,

thanks for the link to the article. To the best of my understanding,
the authors suggest authenticating the devices first and only then
authenticating the users who use those devices in order to get access
to a corporate service. They also mention in the presentation the
current trend of moving corporate services into the cloud. But I think
this is not about the devices that the cloud is built of. Isn't a
cloud first built out of devices connected via IB, and then users (and
their devices) are given access to the services of that cloud as a
whole? If a malicious user has already plugged his device into an IB
switch of the cloud's internal infrastructure, isn't it game over
anyway? Couldn't he just take the hard drives instead of mapping them?

Thanks,

Danil.


Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-06 Thread Danil Kipnis
On Mon, Feb 5, 2018 at 7:38 PM, Bart Van Assche  wrote:
> On 02/05/18 08:40, Danil Kipnis wrote:
>>
>> It just occurred to me, that we could easily extend the interface in
>> such a way that each client (i.e. each session) would have on server
>> side her own directory with the devices it can access. I.e. instead of
>> just "dev_search_path" per server, any client would be able to only
>> access devices under /session_name. (session name
>> must already be generated by each client in a unique way). This way
>> one could have an explicit control over which devices can be accessed
>> by which clients. Do you think that would do it?
>
>
> Hello Danil,
>
> That sounds interesting to me. However, I think that approach requires to
> configure client access completely before the kernel target side module is
> loaded. It does not allow to configure permissions dynamically after the
> kernel target module has been loaded. Additionally, I don't see how to
> support attributes per (initiator, block device) pair with that approach.
> LIO e.g. supports the
> /sys/kernel/config/target/srpt/*/*/acls/*/lun_*/write_protect attribute. You
> may want to implement similar functionality if you want to convince more
> users to use IBNBD.
>
> Thanks,
>
> Bart.

Hello Bart,

the configuration (which devices can be accessed by a particular
client) can also happen after the kernel target module is loaded. The
directory in "dev_search_path" is a module parameter and is fixed. It
contains for example "/ibnbd_devices/". But a particular client X
would only be able to access the devices located in the subdirectory
"/ibnbd_devices/client_x/" (the session name here is client_x). One
can add or remove devices from that directory (those are just symlinks
to /dev/xxx) at any time - before or after the server module is
loaded. But you are right, we need something additional in order to be
able to specify which devices a client can access writable and which
read-only. Maybe additional subdirectories "wr" and "ro" for each
client: those under /ibnbd_devices/client_x/ro/ can only be read by
client_x, and those in /ibnbd_devices/client_x/wr/ can also be written
to?
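As a rough illustration of that layout (the helper below is made up
for the example and is not part of the ibnbd code):

#include <linux/errno.h>
#include <linux/kernel.h>
#include <linux/types.h>

/*
 * Proposed per-session layout, e.g.:
 *   /ibnbd_devices/client_x/ro/<dev>  -> may only be read by client_x
 *   /ibnbd_devices/client_x/wr/<dev>  -> may also be written by client_x
 * The helper just builds the only path the server would be allowed to
 * open for a given client and access mode.
 */
static int sketch_build_dev_path(char *buf, size_t len,
                                 const char *dev_search_path,
                                 const char *session_name,
                                 const char *dev_name, bool writable)
{
        int ret = snprintf(buf, len, "%s/%s/%s/%s", dev_search_path,
                           session_name, writable ? "wr" : "ro", dev_name);

        return (ret < 0 || (size_t)ret >= len) ? -ENAMETOOLONG : 0;
}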

Thanks,

Danil.


Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-05 Thread Danil Kipnis
On Mon, Feb 5, 2018 at 3:17 PM, Sagi Grimberg  wrote:
>
>>>> Hi Bart,
>>>>
>>>> My another 2 cents:)
>>>> On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche wrote:
>>>>>
>>>>> On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
>>>>>>
>>>>>> o Simple configuration of IBNBD:
>>>>>>  - Server side is completely passive: volumes do not need to be
>>>>>>    explicitly exported.
>>>>>
>>>>> That sounds like a security hole? I think the ability to configure
>>>>> whether or not an initiator is allowed to log in is essential and
>>>>> also which volumes an initiator has access to.
>>>>
>>>> Our design target for well controlled production environment, so
>>>> security is handle in other layer.
>>>
>>>
>>>
>>> What will happen to a new adopter of the code you are contributing?
>>
>>
>> Hi Sagi, Hi Bart,
>> thanks for your feedback.
>> We considered the "storage cluster" setup, where each ibnbd client has
>> access to each ibnbd server. Each ibnbd server manages devices under
>> his "dev_search_path" and can provide access to them to any ibnbd
>> client in the network.
>
>
> I don't understand how that helps?
>
>> On top of that Ibnbd server has an additional
>> "artificial" restriction, that a device can be mapped in writable-mode
>> by only one client at once.
>
>
> I think one would still need the option to disallow readable export as
> well.

It just occurred to me that we could easily extend the interface in
such a way that each client (i.e. each session) would have its own
directory on the server side with the devices it can access. I.e.
instead of just "dev_search_path" per server, any client would only be
able to access devices under <dev_search_path>/session_name (the
session name is already generated by each client in a unique way).
This way one could have explicit control over which devices can be
accessed by which clients. Do you think that would do it?


Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-05 Thread Danil Kipnis
>
>> Hi Bart,
>>
>> My another 2 cents:)
>> On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche 
>> wrote:
>>>
>>> On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:

>>>> o Simple configuration of IBNBD:
>>>>  - Server side is completely passive: volumes do not need to be
>>>>    explicitly exported.
>>>
>>>
>>> That sounds like a security hole? I think the ability to configure
>>> whether or
>>> not an initiator is allowed to log in is essential and also which volumes
>>> an
>>> initiator has access to.
>>
>> Our design target for well controlled production environment, so
>> security is handle in other layer.
>
>
> What will happen to a new adopter of the code you are contributing?

Hi Sagi, Hi Bart,
thanks for your feedback.
We considered the "storage cluster" setup, where each ibnbd client has
access to each ibnbd server. Each ibnbd server manages the devices
under its "dev_search_path" and can provide access to them to any
ibnbd client in the network. On top of that, the ibnbd server has an
additional "artificial" restriction that a device can be mapped in
writable mode by only one client at a time.

-- 
Danil