Hi Oleg -
Eric will probably fix this along the lines you suggested.
However, I would like to hear why there are no race conditions in this
protocol. How can we be certain that a canceled lock isn't used by the
client? Also, is there a point to having these locks at all?
thanks.
- Peter -
Eric Barton wrote:
<green> eeb: it seems a major assumption about possibility of
reply-less communication between client and server is broken by
credits
<eeb> credits are designed to limit the # of LND buffers we have to
post but require a responsive peer
<green> right. so if we have enough credits - everything should
work
I mean - we can send a lot of rpcs from servers without
wedging anything - and we already post a lot of buffers at
liblustre side without actually having a chance to use them
<eeb> green: increasing credits lets you have more outstanding
communications at the expense of memory
<green> right. With liblustre we have tons of credits posted already
and no real chance to use them except for mds that can send
AST cancels (this usage is already broken if more than
peercredits needs to be posted). also there are connect
replies and pings that are async and they depend on a number
of credits that is a direct function of number of targets per
server
we cannot really wait for pings to arrive, btw, what if it
takes several seconds for ping to be processed? what if it is
80 seconds? That's a colossal waste of time
<eeb> What I meant by my comment is that I don't think "tinkering"
with tunables is going to solve this issue - we have to design
the solution for liblustre failover - until then, we know that
ensuring the liblustre client is responsive to the network at
all times when communications may be directed at them (the
current working assumption) is a correct solution
<green> eeb: except we have a feature that directly depends on ability
to send traffic to unresponsive liblustre client and for it to
succeed
eeb: And, for all cases but those ASTs from MDS we can
directly predict how many packets we can expect servers to
send to liblustre client at any given time if liblustre is
unresponsive
<eeb> which feature? pinger replies? yes you're right - that
violates the assumption that liblustre always waits for RPC
replies
<green> feature is "instant lock cancels" from MDS.
there we send ASTs to clients and do not wait for them, clients
will process those ASTs when they wake up.
<eeb> looks like we have a communications issue between people - it
has always been an article of faith as far as I have been
aware that liblustre receives only solicited communications
and doesn't return until it has received them
<green> no. This instant cancel feature was developed with dependence
on unlimited amount of communication to be sent to liblustre
and received there in mind, even if liblustre is not listening
i think we even discussed that with you at the time and you
said it's possible. I think credits did not exist back then
<eeb> really?
<green> it was around 1.4.5 timeframe or even earlier. Did we always
have credits?
<eeb> the ptllnd has always had a credit flow scheme
also, "large" GET/PUT (> 0.5K) has always relied on a
responsive liblustre
<green> eeb: Ah. Well, we are not speaking of large RPCs here anyway.
eeb: So, if we find this function of how many rpcs might be
sent from server to unresponsive clients (pings & async
connects, not taking ASTs into account now) - we can tune
peercredits settings accordingly based on number of targets
<eeb> now you're putting your finger on our fundamental disagreement
- I think we need a "designed" solution not a "we can get by
because we happen not to have exceeded this or that limit"
<green> This should mitigate 10706 and other similar issues I think?
At expense of higher memory consumption on liblustre nodes
<eeb> Peter (Braam) and I have already talked a bit about doing what
you want - it's really required if liblustre is to be suitable
for asynchronous I/O - but it's really a _big_deal_ to
implement
(where "what you want" == "arbitrary outstanding
communications with an unresponsive liblustre")
<green> eeb: async i/o? I believe you!!
eeb: As far as a design goes. Do you not think that
collecting all use cases of sending RPCs to an unresponsive
client from servers and ensuring we always have enough credits
constitutes a design too?
<eeb> I can't totally agree I'm afraid - assuming by collecting use
cases you mean enumerating _all_ allowable use cases,
sufficient credits is still only 1 of the requirements for
operation with an unresponsive liblustre client
<green> eeb: what are the others?
<eeb> maximum PUT payload to an unresponsive client
are GETs to an unresponsive client required?
<green> well, maximum put payload could also be calculated. We can
forbid GETs to unresponsive clients at all (they do not happen
anyway or we would have noticed)
eeb: so what do you think?
eeb: btw, is there a way to get peercredits from inside
liblustre?
<eeb> I'm very reluctant to overturn the "nothing tried to talk to
liblustre clients unless liblustre has control" assumption
until we're _sure_ we understand all the consequences
s/nothing tried/nothing may try/
<green> but such a talk happening now is a matter of fact
and can't be easily fixed
<eeb> similarly, I'm not nearly _sure_ I understand all the
consequences of that
<green> hm, I thought it all was simple - for PUTs, if there is a
buffer posted - it's successful (credits aside), if no buffers
- message is lost. gets would timeout ;)
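A minimal sketch of the model green describes here, assuming each small PUT consumes exactly one posted buffer and that an unresponsive client only drains buffers when it wakes up. The class and method names are hypothetical; this is a toy simulation, not liblustre, LNET, or ptllnd code, and it ignores credits entirely.

```python
class UnresponsiveClient:
    """Toy model of a client that posts buffers and then stops listening."""

    def __init__(self, posted_buffers):
        self.free_buffers = posted_buffers  # buffers posted before going quiet
        self.queued = []                    # PUTs that landed in a buffer
        self.dropped = 0                    # PUTs that found no buffer

    def receive_put(self, msg):
        """Deliver one small PUT. Returns True if a buffer was available.

        The return value is visible to the simulation only; a real sender
        would not learn about a drop unless the PUT were acknowledged.
        """
        if self.free_buffers > 0:
            self.free_buffers -= 1
            self.queued.append(msg)
            return True
        self.dropped += 1
        return False

    def wake_up(self):
        """Client regains control: process queued PUTs and repost buffers."""
        msgs, self.queued = self.queued, []
        self.free_buffers += len(msgs)
        return msgs
```

With two buffers posted, a third unsolicited PUT is lost, and the first two are processed only after `wake_up()` is called.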
<eeb> do you have a formula to compute the number of buffers we need
to post to prevent messages being lost? How can it be OK for
arbitrary messages to be lost? If not, how do we discriminate
between message we can lose and those we can't?
<green> Formula is simple - we need to post at least as many buffers
as there might be messages. Losing messages is not ok, of
course, but at least lnet server side would be aware of that
immediately, I presume? And so no 50 seconds
timeout. Generally we know that maximal number of messages to
be sent to liblustre is not greater than there are targets on
the server, plus whatever more messages lnet might be sending.
the only exception is MDS ASTs - if we can know number of
peercredits from liblustre - we can set lru_size to that
number so that MDS won't try to send more cancels than we can
tolerate
(I count possible RPCs to unresponsive client by taking the
fact currently there might be only one async rpc from client to
every target - this is not taking into account MDS ASTs,
again. But on MDS we only have one target anyway)
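The counting argument in green's message can be sketched as follows, assuming one outstanding async RPC (ping or connect reply) per target, an unspecified allowance for other LNET traffic, and AST cancels bounded by lru_size on the MDS. All function and parameter names are hypothetical. green proposes setting lru_size equal to peercredits; the cap below is a slightly more conservative variant that first subtracts the other expected messages.

```python
def max_unsolicited_messages(targets_on_server, extra_lnet_msgs=0, ast_limit=0):
    """Upper bound on messages a server may send to an unresponsive client.

    Assumes at most one async RPC per target, plus any extra LNET messages,
    plus AST cancels (nonzero only for an MDS, bounded by lru_size).
    """
    return targets_on_server + extra_lnet_msgs + ast_limit


def credits_sufficient(peercredits, targets_on_server, extra_lnet_msgs=0,
                       ast_limit=0):
    """True if peercredits covers the worst-case number of messages."""
    needed = max_unsolicited_messages(targets_on_server, extra_lnet_msgs,
                                      ast_limit)
    return peercredits >= needed


def lru_size_cap(peercredits, targets_on_server=1, extra_lnet_msgs=0):
    """Largest lru_size that keeps MDS AST cancels within the credits left
    after the per-target async RPCs and extra LNET messages are counted."""
    return max(0, peercredits - targets_on_server - extra_lnet_msgs)
```

For example, `credits_sufficient(8, 9)` is false: with more targets on a server than peercredits, the credits can be exhausted by pings and connect replies alone.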
<eeb> server-side LNET will only become aware of dropped messages if
we start using ACKed Cray Portals PUTs
<green> oh.
<eeb> But actually I don't think we can design a fix right here on
IRC - neither of us are fully informed about all the issues
AFAICS
<green> well, if messages were dropped - we can make lustre cause a
reconnect. No messages should be dropped of course ;)
<eeb> probably we have most of the info between us, but I don't
think we can simply jump at what look like promising fixes
<green> eeb: Due to my unawareness of all the underlying stuff, fix
actually looks pretty simple to me - as having enough credits
available ;)
<eeb> yes - I realise
<green> what we know for sure is that if peercredits is less than the
number of targets on a server - credits WOULD BE exhausted
<eeb> I'm really sorry to appear so uncooperative, but I need to
take this a bit more slowly and devote 100% of my mind to it
rather than replying "off the cuff" while I'm really busy on
other things
<green> eeb: Sure, take your time. I am just trying to make sure you
understand my idea of things.
So is there a way to know peercredits number from liblustre to
adjust lru_size accordingly?
<eeb> what is "lru_size"?
BTW, to help me think about this, can you tell me all the
places servers can talk to clients when clients are not
blocking in liblustre?
<green> this is amount of unused locks client is allowed to cache
unused locks are locks we can get ASTs for.
so the only places that can send to clients as I see it right
now - async replies - currently known are connect and ping
replies, all other places wait for replies. And ASTs - can
happen on MDS only, we can regulate this number with amount of
locks we hold per client
thankfully, we do not have any real async i/o ;)
<eeb> what do you think we gain by running the pinger on liblustre
clients?
<green> currently failover depends on this, I believe. I need to
refresh my memory on exact details. I remember that without
pinger failover does not know what targets to choose
<eeb> does that mean we could disable the pinger until we implement
liblustre failover correctly (this includes ptllnd fixes)
<green> no. Cray actually uses failover on some of their
installations. It is not proper failover, but rather an
ability to choose between several targets for same service if
main target is down for some reason
<eeb> I would like braam to be involved in this discussion
<green> I have no objections
but this is probably too lowlevel for him
eeb: so are we done with this topic or is there anything else
I can do for you with this issue?
<eeb> I'll come and hassle you when I have some time...
<green> ok. I just would like to highlight that this bug is regarded
as high priority by cray. they expect another system with high
number of targets per oss to be entered into production soon
and there they won't be able to use current workaround of
disabling pinger
<eeb> green: yes - I'm aware of that - it will get my full attention
RSN
<green> great. thanks
Cheers,
Eric
---------------------------------------------------
|Eric Barton Barton Software |
|9 York Gardens Tel: +44 (117) 330 1575 |
|Clifton Mobile: +44 (7909) 680 356 |
|Bristol BS8 4LL Fax: call first |
|United Kingdom E-Mail: [EMAIL PROTECTED]|
---------------------------------------------------
_______________________________________________
Lustre-devel mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-devel