On Mon, Mar 01, 2021 at 05:53:25PM +0100, Hannes Reinecke wrote:
> On 3/1/21 5:05 PM, Keith Busch wrote:
> > On Mon, Mar 01, 2021 at 02:55:30PM +0100, Hannes Reinecke wrote:
> > > On 3/1/21 2:26 PM, Daniel Wagner wrote:
> > > > On Sat, Feb 27, 2021 at 02:19:01AM +0900, Keith Busch wrote:
> > > > > Crashing is bad, silent data corruption is worse. Is there truly no
> > > > > defense against that? If not, why should anyone rely on this?
> > > >
> > > > If we receive an response for which we don't have a started request, we
> > > > know that something is wrong. Couldn't we in just reset the connection
> > > > in this case? We don't have to pretend nothing has happened and
> > > > continuing normally. This would avoid a host crash and would not create
> > > > (more) data corruption. Or I am just too naive?
> > > >
> > > This is actually a sensible solution.
> > > Please send a patch for that.
> >
> > Is a bad frame a problem that can be resolved with a reset?
> >
> > Even if so, the reset doesn't indicate to the user if previous commands
> > completed with bad data, so it still seems unreliable.
> >
> We need to distinguish two cases here.
> The one is use receiving a frame with an invalid tag, leading to a crash.
> This can be easily resolved by issuing a reset, as clearly the command was
> garbage and we need to invoke error handling (which is reset).
>
> The other case is us receiving a frame with a _duplicate_ tag, ie a tag
> which is _currently_ valid. This is a case which will fail _even now_, as we
> have simply no way of detecting this.
>
> So what again do we miss by fixing the first case?
> Apart from a system which does _not_ crash?

I'm just saying each case is a symptom of the same problem. The only
difference between observing one vs. the other is a race with the host's
dispatch. And since you're proposing this patch, it sounds like this
condition does happen on tcp, compared to other transports where we don't
observe it. I just thought the implication that data corruption happens
is alarming.