On 2014/09/02, 7:05 AM, "Alexey Lyashkov" <[email protected]> wrote:
>we don't need too many sends to a single peer, except for LNet routers.
>as about other limits
>Number of RPC in flight == 1 for MDC<>MDT links,

Just a minor correction - while there is currently a limit of 1 modifying RPC in flight for MDC-MDT, there may be up to 8 non-modifying RPCs in flight (readdir, stat, statfs) to the MDT. Also, there is work underway to allow multiple modifying RPCs in flight to the MDT (LU-5319, proposed for discussion at LAD).

>and it isn't more than 32 for the OST, but we are limited to the 512 OST_IO threads.
>
>About credits - the number of credits used in the LNet calculation should
>depend on the buffers posted to the incoming process, and that number of
>buffers should depend on performance results - like the number of RPCs
>processed in some time window.
>That avoids over-buffering in all places, but it opens a question about
>credit distribution over the cluster.

There was some research done a few years ago by Yingjin Qian about having server-side control over RPCs in flight, instead of having a static tunable at the client. This showed some improvement in performance, especially at the transition when the workload is changing. However, no work was ever done to integrate this into a Lustre release.

http://www.computer.org/csdl/proceedings/msst/2013/0217/00/06558432-abs.html
http://storageconference.us/2013/Presentations/Yimo.pdf

Cheers, Andreas

>On Sep 2, 2014, at 4:40 PM, Zhen, Liang <[email protected]> wrote:
>
>> Precisely, a "credit" should be a concurrent send (of a ko2iblnd message)
>> to a single peer; it is not the number of in-flight Lustre RPCs. I
>> understand the memory issue of this, and by enabling map_on_demand,
>> ko2iblnd will create an FMR for large-fragment bulk IO (for example, 32+
>> fragments or 128K+), and only allow small IOs to use the current way,
>> avoiding the overhead of creating an FMR. Then we have up to 32 fragments
>> and the QP size is only 1/8 of what it is now.
>>
>> Regards
>> Liang
>>
>> On 9/2/14, 6:09 PM, "Alexey Lyashkov" <[email protected]> wrote:
>>
>>> Credits for Lustre? Does that work? Right now it's a strange number with
>>> no relation to the real network structure, and it produces over-buffering
>>> issues on the server side.
>>>
>>> On Sep 2, 2014, at 12:22 PM, Zhen, Liang <[email protected]> wrote:
>>>
>>>> Yes, I think this is the potential issue with this patch: for each 1M of
>>>> data, Lustre has 256 fragments (256 pages) on a 4K-pagesize system, which
>>>> means we can have up to (credits x 256) outstanding work requests for
>>>> each connection; decreasing max_send_wr may hit ib_post_send() failure
>>>> under heavy workload.
>>>>
>>>> I understand this may be a problem for the low-level stack when allocating
>>>> a big chunk of space, causing memory allocation failures. The solution is
>>>> enabling map_on_demand and using FMR; however, enabling this on some nodes
>>>> will prevent them from joining the cluster if other nodes have no
>>>> map_on_demand. We already have a patch for this which is pending review -
>>>> please check LU-3322.
>>>>
>>>> Thanks
>>>> Liang
>>>>
>>>> From: David McMillen <[email protected]>
>>>> Date: Sunday, August 31, 2014 at 6:48 PM
>>>> To: "[email protected]" <[email protected]>,
>>>> Eli Cohen <[email protected]>
>>>> Subject: Re: [Lustre-discuss] [PATCH] Avoid Lustre failure on temporary
>>>> failure
>>>>
>>>> Has this been tested with a significant I/O load? We had tried a
>>>> similar approach but ran into subsequent errors and connection drops
>>>> when the ib_post_send() failed. The code assumes that the original
>>>> init_qp_attr->cap.max_send_wr value succeeded. Is there a second part
>>>> to this patch?
>>>>
>>>> Dave
>>>>
>>>> On Sun, Aug 31, 2014 at 2:53 AM, Eli Cohen <[email protected]> wrote:
>>>>
>>>>> Lustre code tries to create a QP with a max_send_wr which depends on a
>>>>> module parameter. The device capabilities do provide the maximum number
>>>>> of send work requests that the device supports, but the actual number
>>>>> of work requests that can be supported in a specific case depends on
>>>>> other characteristics of the work queue, the transport type, etc. This
>>>>> is in compliance with the IB spec:
>>>>>
>>>>> 11.2.1.2 QUERY HCA
>>>>> Description:
>>>>> Returns the attributes for the specified HCA.
>>>>> The maximum values defined in this section are guaranteed
>>>>> not-to-exceed values. It is possible for an implementation to allocate
>>>>> some HCA resources from the same space. In that case, the maximum
>>>>> values returned are not guaranteed for all of those resources
>>>>> simultaneously.
>>>>>
>>>>> This patch tries to decrease the number of requested work requests to
>>>>> a level that can be supported by the HCA. This prevents unnecessary
>>>>> failures.
>>>>>
>>>>> Signed-off-by: Eli Cohen <eli at mellanox.com>
>>>>> ---
>>>>>  lnet/klnds/o2iblnd/o2iblnd.c | 25 ++++++++++++++++++-------
>>>>>  1 file changed, 18 insertions(+), 7 deletions(-)
>>>>>
>>>>> diff --git a/lnet/klnds/o2iblnd/o2iblnd.c b/lnet/klnds/o2iblnd/o2iblnd.c
>>>>> index 4061db00cba2..ef1c6e07cb45 100644
>>>>> --- a/lnet/klnds/o2iblnd/o2iblnd.c
>>>>> +++ b/lnet/klnds/o2iblnd/o2iblnd.c
>>>>> @@ -736,6 +736,7 @@ kiblnd_create_conn(kib_peer_t *peer, struct rdma_cm_id *cmid,
>>>>>         int cpt;
>>>>>         int rc;
>>>>>         int i;
>>>>> +       int orig_wr;
>>>>>
>>>>>         LASSERT(net != NULL);
>>>>>         LASSERT(!in_interrupt());
>>>>> @@ -862,13 +863,23 @@ kiblnd_create_conn(kib_peer_t *peer, struct rdma_cm_id *cmid,
>>>>>
>>>>>         conn->ibc_sched = sched;
>>>>>
>>>>> -       rc = rdma_create_qp(cmid, conn->ibc_hdev->ibh_pd, init_qp_attr);
>>>>> -       if (rc != 0) {
>>>>> -               CERROR("Can't create QP: %d, send_wr: %d, recv_wr: %d\n",
>>>>> -                      rc, init_qp_attr->cap.max_send_wr,
>>>>> -                      init_qp_attr->cap.max_recv_wr);
>>>>> -               goto failed_2;
>>>>> -       }
>>>>> +       orig_wr = init_qp_attr->cap.max_send_wr;
>>>>> +       do {
>>>>> +               rc = rdma_create_qp(cmid, conn->ibc_hdev->ibh_pd, init_qp_attr);
>>>>> +               if (!rc || init_qp_attr->cap.max_send_wr < 16)
>>>>> +                       break;
>>>>> +
>>>>> +               init_qp_attr->cap.max_send_wr /= 2;
>>>>> +       } while (rc);
>>>>> +       if (rc != 0) {
>>>>> +               CERROR("Can't create QP: %d, send_wr: %d, recv_wr: %d\n",
>>>>> +                      rc, init_qp_attr->cap.max_send_wr,
>>>>> +                      init_qp_attr->cap.max_recv_wr);
>>>>> +               goto failed_2;
>>>>> +       }
>>>>> +       if (orig_wr != init_qp_attr->cap.max_send_wr)
>>>>> +               pr_info("original send wr %d, created with %d\n",
>>>>> +                       orig_wr, init_qp_attr->cap.max_send_wr);
>>>>>
>>>>>         LIBCFS_FREE(init_qp_attr, sizeof(*init_qp_attr));
>>>>>
>>>>
>>>> _______________________________________________
>>>> Lustre-discuss mailing list
>>>> [email protected]
>>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>>
>>
>
>_______________________________________________
>Lustre-discuss mailing list
>[email protected]
>http://lists.lustre.org/mailman/listinfo/lustre-discuss
>

Cheers, Andreas
--
Andreas Dilger
Lustre Software Architect
Intel High Performance Data Division

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
