Hi Oleg -

Eric will probably fix this along the lines you suggested. However, I would like to hear why there are no race conditions in this protocol. How can we be certain that a canceled lock isn't used by the client? Also, is there a point to having these locks at all?

thanks.

- Peter -

Eric Barton wrote:
<green> eeb: it seems a major assumption about possibility of
        reply-less communication between client and server is broken by
        credits
<eeb>   credits are designed to limit the # of LND buffers we have to
        post but require a responsive peer
<green> right. so if we have enough of credits - everything should
        work
        I mean - we can send a lot of rpcs from servers without
        wedging anything - and we already post a lot of buffers at
        liblustre side without actually having a chance to use them
<eeb>   green: increasing credits lets you have more outstanding
        communications at the expense of memory
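
[A minimal sketch of the credit scheme described above, for illustration only. It assumes each send consumes one credit and that the credit only comes back once the peer handles the message. The class and method names are hypothetical and are not the LNET or ptllnd API.]

```python
from collections import deque


class CreditedPeer:
    """Toy model of per-peer send credits (hypothetical, not real LNET code)."""

    def __init__(self, peercredits):
        self.credits = peercredits   # sends allowed before a credit must return
        self.in_flight = 0           # messages sitting in buffers the peer posted
        self.queued = deque()        # messages stalled waiting for a credit

    def send(self, msg):
        """Consume a credit if one is available, otherwise queue the message."""
        if self.credits > 0:
            self.credits -= 1
            self.in_flight += 1
            return True
        self.queued.append(msg)
        return False

    def peer_responds(self):
        """A responsive peer handles one message and returns its credit."""
        if self.in_flight == 0:
            return
        self.in_flight -= 1
        self.credits += 1
        if self.queued:
            self.send(self.queued.popleft())


# With 2 credits and a peer that never responds, sends 3 and 4 stall.
peer = CreditedPeer(peercredits=2)
results = [peer.send(f"rpc-{i}") for i in range(4)]
```

[Raising peercredits lets more sends through before stalling, at the cost of more posted buffers on the receiving side, which is the trade-off eeb mentions.]
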
<green> right. With liblustre we have tons of credits posted already
        and no real chance to use them except for mds that can send
        AST cancels (this usage is already broken if more than
        peercredits needs to be posted). also there are connect
        replies and pings that are async and they depend on a number
        of credits that is a direct function of number of targets per
        server
        we cannot really wait for pings to arrive, btw, what if it
        takes several seconds for ping to be processed? what if it is
        80 seconds? That's colossal waste of time
<eeb>   What I meant by my comment is that I don't think "tinkering"
        with tunables is going to solve this issue - we have to design
        the solution for liblustre failover - until then, we know that
        ensuring the liblustre client is responsive to the network at
        all times when communications may be directed at them (the
        current working assumption) is a correct solution
<green> eeb: except we have a feature that directly depends on ability
        to send traffic to unresponsive liblustre client and for it to
        succeed
        eeb: And, for all cases but those ASTs from MDS we can
        directly predict how many packets we can expect servers to
        send to liblustre client at any given time if liblustre is
        unresponsive
<eeb>   which feature?  pinger replies?  yes you're right - that
        violates the assumption that liblustre always waits for RPC
        replies
<green> feature is "instant lock cancels" from MDS.
        there we send ASTs to clients and do not wait for them, clients
        will process those ASTs when they wake up.
<eeb>   looks like we have a communications issue between people - it
        has always been an article of faith as far as I have been
        aware that liblustre receives only solicited communications
        and doesn't return until it has received them
<green> no. This instant cancel feature was developed with dependence
        on unlimited amount of communication to be sent to liblustre
        and received there in mind, even if liblustre is not listening
        i think we even discussed that with you at the time and you
        said it's possible. I think credits did not exist back then
<eeb>   really?
<green> it was around 1.4.5 timeframe or even earlier. Did we always
        have credits?
<eeb>   the ptllnd has always had a credit flow scheme
        also, "large" GET/PUT (> 0.5K) has always relied on a
        responsive liblustre
<green> eeb: Ah. Well, we are not speaking of large RPCs here anyway.
        eeb: So, if we find this function of how many rpcs might be
        sent from server to unresponsive clients (pings & async
        connects, not taking ASTs into account now) - we can tune
        peercredits settings accordingly based on number of targets
<eeb>   now you're putting your finger on our fundamental disagreement
        - I think we need a "designed" solution not a "we can get by
        because we happen not to have exceeded this or that limit"
<green> This should mitigate 10706 and other similar issues I think?
        At expense of higher memory consumption on liblustre nodes
<eeb>   Peter (Braam) and I have already talked a bit about doing what
        you want - it's really required if liblustre is to be suitable
        for asynchronous I/O - but it's really a _big_deal_ to
        implement
        (where "what you want" == "arbitrary outstanding
        communications with an unresponsive liblustre")
<green> eeb: async i/o? I believe you!!
        eeb: As far as a design goes. Do you not think that
        collecting all usecases of sending RPCs for unresponsive
        client from servers and ensuring we always have enough credits
        constitutes a design too?
<eeb>   I can't totally agree I'm afraid - assuming by collecting use
        cases you mean enumerating _all_ allowable use cases,
        sufficient credits is still only 1 of the requirements for
        operation with an unresponsive liblustre client
<green> eeb: what are the others?
<eeb>   maximum PUT payload to an unresponsive client
        are GETs to unresponsive clients required?
<green> well, maximum put payload could also be calculated. We can
        forbid GETs to unresponsive clients at all (they do not happen
        anyway or we would have noticed)
        eeb: so what do you think?
        eeb: btw, is there a way to get peercredits from inside
        liblustre?
<eeb>   I'm very reluctant to overturn the "nothing tried to talk to
        liblustre clients unless liblustre has control" assumption
        until we're _sure_ we understand all the consequences
        s/nothing tried/nothing may try/
<green> but such a talk happening now is a matter of fact
        and can't be easily fixed
<eeb>   similarly, I'm not nearly _sure_ I understand all the
        consequences of that
<green> hm, I thought it all was simple - for PUT's, if there is a
        buffer posted - it's successful (credits aside), if no buffers
        - message is lost. gets would timeout ;)
<eeb>   do you have a formula to compute the number of buffers we need
        to post to prevent messages being lost?  How can it be OK for
        arbitrary messages to be lost?  If not, how do we discriminate
        between message we can lose and those we can't?
<green> Formula is simple - we need to post at least as many buffers
        as there might be messages. Losing messages is not ok, of
        course, but at least lnet server side would be aware of that
        immediately, I presume? And so no 50 seconds
        timeout. Generally we know that maximal number of messages to
        be sent to liblustre is not greater than there are targets on
        the server, plus whatever more messages lnet might be sending.
        the only exception is MDS ASTs - if we can know number of
        peercredits from liblustre - we can set lru_size to that
        number so that MDS won't try to send more cancels than we can
        tolerate
        (I count possible RPCs to unresponsive client by taking the
        fact currently there might be only one async rpc from client to
        every target - this is not taking into account MDS ASTs,
        again. But on MDS we only have one target anyway)
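
[A rough sketch of green's estimate above, under stated assumptions: at most one async RPC (ping or connect reply) per target, up to lru_size lock-cancel ASTs from an MDS, plus an unknown number of extra LNET messages. The function names and the extra_lnet parameter are hypothetical. As eeb notes elsewhere in this log, sufficient credits is only one of the requirements, so this is not a complete design.]

```python
def max_unsolicited_messages(targets_on_server, is_mds=False, lru_size=0, extra_lnet=0):
    """Upper bound on messages a server may send to an unresponsive client.

    One async RPC per target, plus up to lru_size ASTs if the server is an
    MDS, plus whatever LNET itself may send (extra_lnet is an assumption).
    """
    asts = lru_size if is_mds else 0
    return targets_on_server + asts + extra_lnet


def credits_sufficient(peercredits, **kwargs):
    """True if peercredits covers the estimated worst case."""
    return peercredits >= max_unsolicited_messages(**kwargs)


def max_lru_size(peercredits, targets_on_server=1, extra_lnet=0):
    """Largest lru_size an MDS client could use without exceeding peercredits."""
    return max(0, peercredits - targets_on_server - extra_lnet)


# Illustrative numbers only: an OSS with 16 targets and 8 peer credits.
oss_ok = credits_sufficient(8, targets_on_server=16)
# An MDS with one target: how many cached unused locks fit in 8 credits?
mds_lru = max_lru_size(8, targets_on_server=1)
```

[With these made-up numbers the OSS case fails the check, matching the observation that peercredits below the number of targets on a server would be exhausted.]
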
<eeb>   server-side LNET will only become aware of dropped messages if
        we start using ACKed Cray Portals PUTs
<green> oh.
<eeb>   But actually I don't think we can design a fix right here on
        IRC - neither of us are fully informed about all the issues
        AFAICS
<green> well, if messages were dropped - we can make lustre to cause
        reconnect. No messages should be dropped of course ;)
<eeb>   probably we have most of the info between us, but I don't
        think we can simply jump at what look like promising fixes
<green> eeb: Due to my unawareness of all the underlying stuff, fix
        actually looks pretty simple to me - as having enough credits
        available ;)
<eeb>   yes - I realise
<green> what we know for sure is if peercredits is less than number of
        targets on a server - credits WOULD BE exhausted
<eeb>   I'm really sorry to appear so uncooperative, but I need to
        take this a bit more slowly and devote 100% of my mind to it
        rather than replying "off the cuff" while I'm really busy on
        other things
<green> eeb: Sure, take your time. I am just trying to make sure you
        understand my idea of things.
        So is there a way to know peercredits number from liblustre to
        adjust lru_size accordingly?
<eeb>   what is "lru_size"?
        BTW, to help me think about this, can you tell me all the
        places servers can talk to clients when clients are not
        blocking in liblustre?
<green> this is amount of unused locks client is allowed to cache
        unused locks are locks we can get ASTs for.
        so the only places that can send to clients as I see it right
        now - async replies - currently known are connect and ping
        replies, all other places wait for replies. And ASTs - can
        happen on MDS only, we can regulate this number with amount of
        locks we hold per client
        thankfully, we do not have any real async i/o ;)
<eeb>   what do you think we gain by running the pinger on liblustre
        clients?
<green> currently failover depends on this, I believe. I need to
        refresh my memory on exact details. I remember that without
        pinger failover does not know what targets to choose
<eeb>   does that mean we could disable the pinger until we implement
        liblustre failover correctly (this includes ptllnd fixes)
<green> no. Cray actually uses failover on some of their
        installations. It is not proper failover, but rather an
        ability to choose between several targets for same service if
        main target is down for some reason
<eeb>   I would like braam to be involved in this discussion
<green> I have no objections
        but this is probably too lowlevel for him
        eeb: so are we done with this topic or is there anything else
        I can do for you with this issue?
<eeb>   I'll come and hassle you when I have some time...
<green> ok. I just would like to highlight that this bug is regarded
        as high priority by cray. they expect another system with high
        number of targets per oss to be entered into production soon
        and there they won't be able to use current workaround of
        disabling pinger
<eeb>   green: yes - I'm aware of that - it will get my full attention
        RSN
<green> great. thanks

                Cheers,
                        Eric

---------------------------------------------------
|Eric Barton        Barton Software               |
|9 York Gardens     Tel:    +44 (117) 330 1575    |
|Clifton            Mobile: +44 (7909) 680 356    |
|Bristol BS8 4LL    Fax:    call first            |
|United Kingdom     E-Mail: [EMAIL PROTECTED]|
---------------------------------------------------