When using multipath with iscsi_tcp, data corruption is possible when
requests are failed over between paths during a short network
interruption.

The situation looks like this.

  1. A write is being sent on pathA, and the data is in the TCP transmit
     queue.

  2. The network connection for pathA is interrupted before the write
     data can be sent to the target.

  3. The connection error is detected and requests are failed.
     dm-multipath retries on pathB, and iscsid calls ep_disconnect,
     which for iscsi_tcp is a close().

  4. The write is successful on pathB.

  5. The network connection for pathA is restored before TCP has given
     up on a graceful shutdown.

  6. The write that was started on pathA is completed, and then the FIN
     is sent to close the connection.

At [6] the network packet in the transmit queue still holds a pointer
to the data page, whose contents may have changed by then, so this
stale write could be carrying a completely incorrect payload.

Basically, if requests are being failed it's important to abort the TCP
connection rather than let TCP wait and attempt a graceful shutdown.
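
As a rough illustration of the idea (not the code from the patch), the
sketch below enables SO_LINGER with a zero timeout so that a later
close() discards unsent data and resets the connection instead of
lingering.  The helper name abort_on_close() is hypothetical, and the
IN_LOGOUT check mentioned below is left out.

```c
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only: after this call, close()
 * on a connected TCP socket drops any data still in the transmit queue
 * and aborts the connection (RST) rather than attempting a graceful
 * FIN shutdown.
 *
 * Returns 0 on success, -1 on error with errno set by setsockopt().
 */
int abort_on_close(int fd)
{
	struct linger so_linger = {
		.l_onoff = 1,	/* enable SO_LINGER ... */
		.l_linger = 0,	/* ... with a zero linger time */
	};

	return setsockopt(fd, SOL_SOCKET, SO_LINGER,
			  &so_linger, sizeof(so_linger));
}
```

The caller would invoke this on the connection's socket just before
close() on the error-handling path; a normal logout would presumably
skip it so the session can still shut down gracefully.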

I'd really love to hear other opinions on this one. I don't have a solid trace
of this, but I'm working with a report of data corruption in a controlled test
setup that fits this scenario.  I had the reporter try an early version of this
patch that always set a 0 SO_LINGER time, without the IN_LOGOUT check, and they
could no longer reproduce the problem.

Chris Leech (1):
  iscsi_tcp set SO_LINGER to abort connection for error handling

 usr/io.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

-- 
2.5.0
