We don't need many concurrent sends to a single peer, except for LNet routers. As for the other limits: the number of RPCs in flight is 1 for MDC<>MDT links and no more than 32 for an OST, but we are limited to 512 OST_IO threads.
About credits: the number of credits used in the LNet calculation should depend on the number of buffers posted for incoming processing, and that number of buffers should in turn depend on performance results, such as the number of RPCs processed in a given time window. That avoids over-buffering everywhere, but it opens the question of how credits are distributed across the cluster.

On Sep 2, 2014, at 4:40 PM, Zhen, Liang <[email protected]> wrote:

> Precisely, "credit" should be the number of concurrent sends (of ko2iblnd
> messages) to a single peer; it is not the number of in-flight Lustre RPCs.
> I understand the memory issue of this. By enabling map_on_demand, ko2iblnd
> will create an FMR for large-fragment bulk IO (for example, 32+ fragments
> or 128K+) and only let small IOs use the current scheme, avoiding the
> overhead of creating an FMR. Then we have up to 32 fragments, and the QP
> size is only 1/8 of what it is now.
>
> Regards
> Liang
>
> On 9/2/14, 6:09 PM, "Alexey Lyashkov" <[email protected]> wrote:
>
>> Credits for Lustre? Does that work? Right now it is a strange number with
>> no relation to the real network structure, and it produces over-buffering
>> issues on the server side.
>>
>> On Sep 2, 2014, at 12:22 PM, Zhen, Liang <[email protected]> wrote:
>>
>>> Yes, I think this is the potential issue of this patch: for each 1M of
>>> data, Lustre has 256 fragments (256 pages) on a 4K page-size system,
>>> which means we can have up to (credits x 256) outstanding work requests
>>> for each connection. Decreasing max_send_wr may hit ib_post_send()
>>> failures under heavy workload.
>>>
>>> I understand this may be a problem for the low-level stack, which has to
>>> allocate a big chunk of space and can hit memory allocation failures.
>>> The solution is enabling map_on_demand and using FMR. However, enabling
>>> this on some nodes will prevent them from joining the cluster if other
>>> nodes have no map_on_demand; we already have a patch for this which is
>>> pending review, please check LU-3322.
>>>
>>> Thanks
>>> Liang
>>>
>>> From: David McMillen <[email protected]>
>>> Date: Sunday, August 31, 2014 at 6:48 PM
>>> To: [email protected], Eli Cohen <[email protected]>
>>> Subject: Re: [Lustre-discuss] [PATCH] Avoid Lustre failure on temporary
>>> failure
>>>
>>> Has this been tested with a significant I/O load? We had tried a
>>> similar approach but ran into subsequent errors and connection drops
>>> when ib_post_send() failed. The code assumes that the original
>>> init_qp_attr->cap.max_send_wr value succeeded. Is there a second part
>>> to this patch?
>>>
>>> Dave
>>>
>>> On Sun, Aug 31, 2014 at 2:53 AM, Eli Cohen <[email protected]> wrote:
>>>
>>>> Lustre code tries to create a QP with a max_send_wr value that depends
>>>> on a module parameter. The device capabilities do provide the maximum
>>>> number of send work requests that the device supports, but the actual
>>>> number of work requests that can be supported in a specific case
>>>> depends on other characteristics of the work queue, the transport
>>>> type, etc. This is in compliance with the IB spec:
>>>>
>>>> 11.2.1.2 QUERY HCA
>>>> Description:
>>>> Returns the attributes for the specified HCA.
>>>> The maximum values defined in this section are guaranteed
>>>> not-to-exceed values. It is possible for an implementation to allocate
>>>> some HCA resources from the same space. In that case, the maximum
>>>> values returned are not guaranteed for all of those resources
>>>> simultaneously.
>>>>
>>>> This patch tries to decrease the number of requested work requests to
>>>> a level that can be supported by the HCA. This prevents unnecessary
>>>> failures.
>>>>
>>>> Signed-off-by: Eli Cohen <eli at mellanox.com>
>>>> ---
>>>>  lnet/klnds/o2iblnd/o2iblnd.c | 25 ++++++++++++++++++-------
>>>>  1 file changed, 18 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/lnet/klnds/o2iblnd/o2iblnd.c b/lnet/klnds/o2iblnd/o2iblnd.c
>>>> index 4061db00cba2..ef1c6e07cb45 100644
>>>> --- a/lnet/klnds/o2iblnd/o2iblnd.c
>>>> +++ b/lnet/klnds/o2iblnd/o2iblnd.c
>>>> @@ -736,6 +736,7 @@ kiblnd_create_conn(kib_peer_t *peer, struct rdma_cm_id *cmid,
>>>>  	int cpt;
>>>>  	int rc;
>>>>  	int i;
>>>> +	int orig_wr;
>>>>
>>>>  	LASSERT(net != NULL);
>>>>  	LASSERT(!in_interrupt());
>>>> @@ -862,13 +863,23 @@ kiblnd_create_conn(kib_peer_t *peer, struct rdma_cm_id *cmid,
>>>>
>>>>  	conn->ibc_sched = sched;
>>>>
>>>> -	rc = rdma_create_qp(cmid, conn->ibc_hdev->ibh_pd, init_qp_attr);
>>>> -	if (rc != 0) {
>>>> -		CERROR("Can't create QP: %d, send_wr: %d, recv_wr: %d\n",
>>>> -		       rc, init_qp_attr->cap.max_send_wr,
>>>> -		       init_qp_attr->cap.max_recv_wr);
>>>> -		goto failed_2;
>>>> -	}
>>>> +	orig_wr = init_qp_attr->cap.max_send_wr;
>>>> +	do {
>>>> +		rc = rdma_create_qp(cmid, conn->ibc_hdev->ibh_pd, init_qp_attr);
>>>> +		if (!rc || init_qp_attr->cap.max_send_wr < 16)
>>>> +			break;
>>>> +
>>>> +		init_qp_attr->cap.max_send_wr /= 2;
>>>> +	} while (rc);
>>>> +	if (rc != 0) {
>>>> +		CERROR("Can't create QP: %d, send_wr: %d, recv_wr: %d\n",
>>>> +		       rc, init_qp_attr->cap.max_send_wr,
>>>> +		       init_qp_attr->cap.max_recv_wr);
>>>> +		goto failed_2;
>>>> +	}
>>>> +	if (orig_wr != init_qp_attr->cap.max_send_wr)
>>>> +		pr_info("original send wr %d, created with %d\n",
>>>> +			orig_wr, init_qp_attr->cap.max_send_wr);
>>>>
>>>>  	LIBCFS_FREE(init_qp_attr, sizeof(*init_qp_attr));
>>>
>>> _______________________________________________
>>> Lustre-discuss mailing list
>>> [email protected]
>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
