------- Comment From cha...@us.ibm.com 2016-10-11 17:05 EDT-------
*** Bug 147358 has been marked as a duplicate of this bug. ***

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1632466

Title:
  STC860:Tuleta-L:KVM:iap03:HTX logs miscompares on multiple adapters on
  ubuntu 16.04 guest.

Status in linux package in Ubuntu:
  New

Bug description:
  == Comment: #0 - Application Cdeadmin <cdead...@us.ibm.com> -
  2016-07-26 11:20:20 ==

  
  == Comment: #1 - Application Cdeadmin <cdead...@us.ibm.com> - 2016-07-26 
11:20:22 ==
  ==== State: Open by: panico on 25 July 2016 18:02:09 ====

  Contact Information:
  ====================
  Defect Originator:  Michael Panico
  Defect Originator   pan...@us.ibm.com

  Backup: Deepti Umarani
  Backup ST ID: Deepti S Umarani/India/IBM

  System Info:
  ============
  Machine Type:............8284-22A
  Card Type:...............FSP2_P8LE
  Current Boot Side:.......T
  Next Boot Side:..........T
  PT_Swap:.................1
  Current Side Driver:.....fips860/b0713a_1630.860 

  Ubuntu KVM Host:
  root@iaos1:/tmp# uname -a
  Linux iaos1 4.4.0-30-generic #49-Ubuntu SMP Fri Jul 1 10:00:36 UTC 2016 
ppc64le ppc64le ppc64le GNU/Linux
  root@iaos1:/tmp# lsb_release -a
  No LSB modules are available.
  Distributor ID:       Ubuntu
  Description:  Ubuntu Yakkety Yak (development branch)
  Release:      16.10
  Codename:     yakkety

  Ubuntu Guest:
  [htx@iap03]  [1m/usr/lpp/htx/bin# [0m uname -a
  Linux iap03 4.4.0-31-generic #50-Ubuntu SMP Wed Jul 13 00:05:18 UTC 2016 
ppc64le ppc64le ppc64le GNU/Linux
  [htx@iap03]  [1m/usr/lpp/htx/bin# [0m lsb_release -a
  No LSB modules are available.
  Distributor ID:       Ubuntu
  Description:  Ubuntu 16.04.1 LTS
  Release:      16.04
  Codename:     xenial

  HTX level on ubuntu 16.04 guest:
  [htx@iap03]  [1m/usr/lpp/htx/bin# [0m dpkg -l | grep htxubuntu
  ii  htxubuntu                          402                             ppc64el

  Problem Description:
  ====================
  Running networks (via bridging) on three different guests (other linux, 
ubuntu 16.10 and ubuntu 16.04) the ubuntu 16.04 guest logged a miscompare on 
the austin adapter and the shinerT adapter.  The shinerT ran for 24 hours and 
the austin for 72 hours before logging miscompares.

  From /tmp/htxerr:
  [htx@iap03]  [1m/usr/lpp/htx/bin# [0m cat /tmp/htxerr

  /dev/enp0s6       Jul 23 14:29:00 2016 err=00e30000 sev=4 hxecom        
  R(102.1.2.20.50854) connected to W(102.1.1.20.39073)
  Miscompare with packet 0, data size = 51062. miscompare_count=0 pakLen = 
51082 
  Miscompare at displacement (decimal) = 23876 
  wbuf = ffffffffffffffffffffffffffffffffffffffff
  rbuf = 55555555555555555555555555555555ffffffff

  For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531
  Consecutive prior good reads= 70739865,  good bytes read=     2868570843930

  /dev/enp0s6       Jul 23 14:29:00 2016 err=00d80000 sev=4 hxecom        
  Shutting down testing due to error flag - 1
  R(102.1.2.20.50854) connected to W(102.1.1.20.39073)
  For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531
  Consecutive prior good reads=        0,  good bytes read=                 0

  /dev/enp0s7       Jul 23 14:29:00 2016 err=00fb0068 sev=1 hxecom        
  W(102.1.1.20.39073) connected to R(102.1.2.20.50854)
  Wrote 878 characters, expected 963
  Write failed - Connection reset by peer.
  For stanza 1, TCP connection,  bufmin=3, bufmax=2869, bufinc=94
  Consecutive prior good writes= 25459432,  good bytes written=       
35283290616

  htx_messages      Jul 23 14:29:01 2016 err=00000000 sev=1 hxssup        
  hxecom HE program for enp0s6 terminated by exit(0) call.

  htx_messages      Jul 23 16:29:07 2016 err=00000000 sev=4 hang_monitor  
  hxecom for enp0s7 is HUNG!
  Max run time (set in mdt) = 7200 seconds.
  Current elasped time = 7206 seconds.

  htx_messages      Jul 23 18:29:07 2016 err=00000000 sev=4 hang_monitor  
  hxecom for enp0s7 is HUNG!
  Max run time (set in mdt) = 7200 seconds.
  Current elasped time = 14406 seconds.

  htx_messages      Jul 23 20:29:07 2016 err=00000000 sev=4 hang_monitor  
  hxecom for enp0s7 is HUNG!
  Max run time (set in mdt) = 7200 seconds.
  Current elasped time = 21606 seconds.

  htx_messages      Jul 23 22:29:08 2016 err=00000000 sev=4 hang_monitor  
  hxecom for enp0s7 is HUNG!
  Max run time (set in mdt) = 7200 seconds.
  Current elasped time = 28807 seconds.

  htx_messages      Jul 24 00:29:08 2016 err=00000000 sev=4 hang_monitor  
  hxecom for enp0s7 is HUNG!
  Max run time (set in mdt) = 7200 seconds.
  Current elasped time = 36007 seconds.

  /dev/enp0s4       Jul 25 04:56:13 2016 err=00e30000 sev=4 hxecom        
  R(103.1.2.20.36665) connected to W(103.1.1.20.40459)
  Miscompare with packet 0, data size = 30000. miscompare_count=0 pakLen = 
30020 
  Miscompare at displacement (decimal) = 17508 
  wbuf = ffffffffffffffffffffffffffffffffffffffff
  rbuf = e7f0e11400000000333039373600ffffffffffff

  For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531
  Consecutive prior good reads= 42699991,  good bytes read=     1731422119820

  /dev/enp0s4       Jul 25 04:56:13 2016 err=00d80000 sev=4 hxecom        
  Shutting down testing due to error flag - 1
  R(103.1.2.20.36665) connected to W(103.1.1.20.40459)
  For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531
  Consecutive prior good reads=        0,  good bytes read=                 0

  /dev/enp0s5       Jul 25 04:56:13 2016 err=00fb0068 sev=1 hxecom        
  W(103.1.1.20.40459) connected to R(103.1.2.20.36665)
  Wrote 13877 characters, expected 40551
  Write failed - Connection reset by peer.
  For stanza 0, TCP connection,  bufmin=30000, bufmax=65000, bufinc=10531
  Consecutive prior good writes= 42700201,  good bytes written=     
1731430540751

  htx_messages      Jul 25 04:56:14 2016 err=00000000 sev=1 hxssup        
  hxecom HE program for enp0s4 terminated by exit(0) call.

  /dev/enp0s5       Jul 25 04:56:13 2016 err=00d80000 sev=4 hxecom        
  Shutting down testing due to error flag - 10
  W(103.1.1.20.40459) connected to R(103.1.2.20.36665)
  For stanza 0, TCP connection,  bufmin=30000, bufmax=65000, bufinc=10531
  Consecutive prior good writes=        0,  good bytes written=                 0

  The bridges in the guest xml:
      <interface type='bridge'>
        <mac address='52:54:00:ab:08:11'/>
        <source bridge='brenP3p5s0f0'/>
        <model type='virtio'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x01' 
function='0x0'/>
      </interface>
      <interface type='bridge'>
        <mac address='52:54:00:96:cb:70'/>
        <source bridge='brenP3p5s0f2'/>
        <model type='virtio'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x04' 
function='0x0'/>
      </interface>
      <interface type='bridge'>
        <mac address='52:54:00:26:6f:ec'/>
        <source bridge='brenP3p5s0f3'/>
        <model type='virtio'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x05' 
function='0x0'/>
      </interface>
      <interface type='bridge'>
        <mac address='52:54:00:a8:b0:09'/>
        <source bridge='brenP1p3s0f0'/>
        <model type='virtio'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x06' 
function='0x0'/>
      </interface>
      <interface type='bridge'>
        <mac address='52:54:00:a1:e6:4d'/>
        <source bridge='brenP1p3s0f1'/>
        <model type='virtio'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x07' 
function='0x0'/>
      </interface>
      <interface type='bridge'>
        <mac address='52:54:00:b5:27:2a'/>
        <source bridge='brenp1s0f0'/>
        <model type='virtio'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x08' 
function='0x0'/>
      </interface>
      <interface type='bridge'>
        <mac address='52:54:00:be:da:6c'/>
        <source bridge='brenp1s0f1'/>
        <model type='virtio'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x09' 
function='0x0'/>
      </interface>

  There is also another defect opened (SW360179) on a glacier park logging a 
miscompare on another ubuntu 16.04 guest, but on a ubuntu 16.04 KVM host.
  ==== State: Open by: avkuttan on 26 July 2016 04:49:05 ====

  Summary :
  =========

  device                      : /dev/enp0s6
  Miscompare offset (decimal) : 23876
  Transfer size (decimal)     : 51082
  Miscompare length (decimal) : 16

  Reader with IP (102.1.2.20.50854) is connected to Writer with 
(102.1.1.20.39073) on enp0s6.
  This miscompare shows 16 bytes of unexpected data in the read buffer.

  Analysis details:
  ================

  /tmp/htxerr:
  ------------
  /dev/enp0s6       Jul 23 14:29:00 2016 err=00e30000 sev=4 hxecom              
      <<=== device (/dev/enp0s6)
  R(102.1.2.20.50854) connected to W(102.1.1.20.39073)                          
      <<=== TCP Connection details : Writer(102.1.1.20.39073) and Reader 
(102.1.2.20.50854)
  Miscompare with packet 0, data size = 51062. miscompare_count=0 pakLen = 
51082      <<=== Transfer size (51062 + 20 for header)
  Miscompare at displacement (decimal) = 23876                                  
      <<=== Miscompare offset (23876)
  wbuf = ffffffffffffffffffffffffffffffffffffffff
  rbuf = 55555555555555555555555555555555ffffffff

  For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531
  Consecutive prior good reads= 70739865,  good bytes read=     2868570843930

  /dev/enp0s6       Jul 23 14:29:00 2016 err=00d80000 sev=4 hxecom
  Shutting down testing due to error flag - 1
  R(102.1.2.20.50854) connected to W(102.1.1.20.39073)
  For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531
  Consecutive prior good reads=        0,  good bytes read=                 0

  Expected data (write buffer) at miscompare location:
  ----------------------------------------------------
  00005c80: 89d3 bb05 0000 0000 3233 3638 3000 ffff  ........23680...
  00005c90: 0100 2f64 6576 2f65 6e70 3073 3700 0000  ../dev/enp0s7...
  00005ca0: 0030 3030 3031 3432 3934 00ff ffff ffff  .000014294......
  00005cb0: ffff ffff ffff ffff ffff ffff ffff ffff  ................
  00005cc0: ffff ffff ffff ffff ffff ffff ffff ffff  ................
  00005cd0: ffff ffff ffff ffff ffff ffff ffff ffff  ................
  00005ce0: ffff ffff ffff ffff ffff ffff ffff ffff  ................
  00005cf0: ffff ffff ffff ffff ffff ffff ffff ffff  ................
  00005d00: 89d3 bb05 0000 0000 3233 3830 3800 ffff  ........23808...
  00005d10: 0100 2f64 6576 2f65 6e70 3073 3700 0000  ../dev/enp0s7...
  00005d20: 0030 3030 3031 3432 3934 00ff ffff ffff  .000014294......
  00005d30: ffff ffff ffff ffff ffff ffff ffff ffff  ................
  00005d40: ffff ffff ffff ffff ffff ffff ffff ffff  ................     <<=== 
start of miscompare
  00005d50: ffff ffff ffff ffff ffff ffff ffff ffff  ................     <<=== 
end of miscompare
  00005d60: ffff ffff ffff ffff ffff ffff ffff ffff  ................
  00005d70: ffff ffff ffff ffff ffff ffff ffff ffff  ................
  00005d80: 89d3 bb05 0000 0000 3233 3933 3600 ffff  ........23936...
  00005d90: 0100 2f64 6576 2f65 6e70 3073 3700 0000  ../dev/enp0s7...
  00005da0: 0030 3030 3031 3432 3934 00ff ffff ffff  .000014294......
  00005db0: ffff ffff ffff ffff ffff ffff ffff ffff  ................
  00005dc0: ffff ffff ffff ffff ffff ffff ffff ffff  ................
  00005dd0: ffff ffff ffff ffff ffff ffff ffff ffff  ................
  00005de0: ffff ffff ffff ffff ffff ffff ffff ffff  ................
  00005df0: ffff ffff ffff ffff ffff ffff ffff ffff  ................

  Actual data (read buffer) at miscompare location:
  -------------------------------------------------
  00005c80: 89d3 bb05 0000 0000 3233 3638 3000 ffff  ........23680...
  00005c90: 0100 2f64 6576 2f65 6e70 3073 3700 0000  ../dev/enp0s7...
  00005ca0: 0030 3030 3031 3432 3934 00ff ffff ffff  .000014294......
  00005cb0: ffff ffff ffff ffff ffff ffff ffff ffff  ................
  00005cc0: ffff ffff ffff ffff ffff ffff ffff ffff  ................
  00005cd0: ffff ffff ffff ffff ffff ffff ffff ffff  ................
  00005ce0: ffff ffff ffff ffff ffff ffff ffff ffff  ................
  00005cf0: ffff ffff ffff ffff ffff ffff ffff ffff  ................
  00005d00: 89d3 bb05 0000 0000 3233 3830 3800 ffff  ........23808...
  00005d10: 0100 2f64 6576 2f65 6e70 3073 3700 0000  ../dev/enp0s7...
  00005d20: 0030 3030 3031 3432 3934 00ff ffff ffff  .000014294......
  00005d30: ffff ffff ffff ffff ffff ffff ffff ffff  ................
  00005d40: ffff ffff 5555 5555 5555 5555 5555 5555  ....UUUUUUUUUUUU     <<=== 
 start of miscompare
  00005d50: 5555 5555 ffff ffff ffff ffff ffff ffff  UUUU............     <<=== 
end of miscompare
  00005d60: ffff ffff ffff ffff ffff ffff ffff ffff  ................
  00005d70: ffff ffff ffff ffff ffff ffff ffff ffff  ................
  00005d80: 89d3 bb05 0000 0000 3233 3933 3600 ffff  ........23936...
  00005d90: 0100 2f64 6576 2f65 6e70 3073 3700 0000  ../dev/enp0s7...
  00005da0: 0030 3030 3031 3432 3934 00ff ffff ffff  .000014294......
  00005db0: ffff ffff ffff ffff ffff ffff ffff ffff  ................
  00005dc0: ffff ffff ffff ffff ffff ffff ffff ffff  ................
  00005dd0: ffff ffff ffff ffff ffff ffff ffff ffff  ................
  00005de0: ffff ffff ffff ffff ffff ffff ffff ffff  ................
  00005df0: ffff ffff ffff ffff ffff ffff ffff ffff  ................

  Miscompare side wise:
  ====================
  00005d30: ffff ffff ffff ffff ffff ffff ffff ffff  ..........   00005d30: 
ffff ffff ffff ffff ffff ffff ffff ffff  ..........
  00005d40: ffff ffff ffff ffff ffff ffff ffff ffff  .......... | 00005d40: 
ffff ffff 5555 5555 5555 5555 5555 5555  ....UUUUUU
  00005d50: ffff ffff ffff ffff ffff ffff ffff ffff  .......... | 00005d50: 
5555 5555 ffff ffff ffff ffff ffff ffff  UUUU......
  00005d60: ffff ffff ffff ffff ffff ffff ffff ffff  ..........   00005d60: 
ffff ffff ffff ffff ffff ffff ffff ffff  ..........

  Miscompare started at = 0x00005d45
  Miscompare ended at   = 0x00005d55
  Miscompare length     = x00005d55 - 0x00005d45 = 0x10 = 16 (decimal)

  Buffer files are dumped into readable form as:
  /tmp/htxenp0s6.wbuf_1 ==>> /tmp/wbuf_1_dump
  /tmp/htxenp0s6.rbuf_1 ==>> /tmp/rbuf_1_dump

  Transfer this defect to Linux team for further analysis.
  ==== State: Open by: kalok on 26 July 2016 06:32:03 ====

  transferred to ppclinux_nonconfidential_BZilla using CMVC. updates not 
visible in BQ.
  ==== State: Open by: panico on 26 July 2016 10:11:29 ====

  The same host logged another two miscompares on an ubuntu 16.10 guest
  (iap02.aus.stglabs.ibm.com (9.3.242.19)) on the austin adapter in slot
  C10 and the glacier park adapter in slot C7.

  [htx@iap02]  [1m/usr/lpp/htx/bin# [0m cat /tmp/htxerr

  /dev/enp0s10      Jul 26 01:24:25 2016 err=00e30000 sev=4 hxecom        
  R(101.1.2.19.33643) connected to W(101.1.1.19.40555)
  Miscompare with packet 0, data size = 51062. miscompare_count=0 pakLen = 
51082 
  Miscompare at displacement (decimal) = 35954 
  wbuf = aaaaaaaaaaaaaaaaaaaaaaaaaaaa6b3af7020000
  rbuf = fffffffffffffffffffffffffffffffff7020000

  For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531
  Consecutive prior good reads= 32932090,  good bytes read=     1335345881380

  /dev/enp0s10      Jul 26 01:24:25 2016 err=00d80000 sev=4 hxecom        
  Shutting down testing due to error flag - 1
  R(101.1.2.19.33643) connected to W(101.1.1.19.40555)
  For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531
  Consecutive prior good reads=        0,  good bytes read=                 0

  htx_messages      Jul 26 01:24:26 2016 err=00000000 sev=1 hxssup        
  hxecom HE program for enp0s10 terminated by exit(0) call.

  /dev/enp0s11      Jul 26 01:24:26 2016 err=00fb0068 sev=1 hxecom        
  W(101.1.1.19.40555) connected to R(101.1.2.19.33643)
  Wrote 1980 characters, expected 51082
  Write failed - Connection reset by peer.
  For stanza 0, TCP connection,  bufmin=30000, bufmax=65000, bufinc=10531
  Consecutive prior good writes= 32932264,  good bytes written=     
1335354769648

  /dev/enp0s11      Jul 26 01:24:26 2016 err=00d80000 sev=4 hxecom        
  Shutting down testing due to error flag - 10
  W(101.1.1.19.40555) connected to R(101.1.2.19.33643)
  For stanza 0, TCP connection,  bufmin=30000, bufmax=65000, bufinc=10531
  Consecutive prior good writes=        0,  good bytes written=                 0

  /dev/enp0s11      Jul 26 01:24:26 2016 err=002a0068 sev=4 hxecom        
  SetUpConnect: Error Connecting to (9.3.242.19,41525) - Connection reset by 
peer.

  /dev/enp0s11      Jul 26 01:24:39 2016 err=002a006f sev=4 hxecom        
  SetUpConnect: Error Connecting to (9.3.242.19,41525) - Connection refused.

  /dev/enp0s4       Jul 26 01:25:14 2016 err=00e30000 sev=4 hxecom        
  R(103.1.2.19.43035) connected to W(103.1.1.19.53023)
  Miscompare with packet 0, data size = 51062. miscompare_count=0 pakLen = 
51082 
  Miscompare at displacement (decimal) = 16166 
  wbuf = 3433313400aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
  rbuf = 0200000000323232373200aaaa01002faaaaaaaa

  For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531
  Consecutive prior good reads= 12475849,  good bytes read=      505864438618

  /dev/enp0s4       Jul 26 01:25:14 2016 err=00d80000 sev=4 hxecom        
  Shutting down testing due to error flag - 1
  R(103.1.2.19.43035) connected to W(103.1.1.19.53023)
  For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531
  Consecutive prior good reads=        0,  good bytes read=                 0

  htx_messages      Jul 26 01:25:14 2016 err=00000000 sev=1 hxssup        
  hxecom HE program for enp0s4 terminated by exit(0) call.

  /dev/enp0s5       Jul 26 01:25:14 2016 err=00fb0068 sev=1 hxecom        
  W(103.1.1.19.53023) connected to R(103.1.2.19.43035)
  Wrote 10254 characters, expected 51082
  Write failed - Connection reset by peer.
  For stanza 0, TCP connection,  bufmin=30000, bufmax=65000, bufinc=10531
  Consecutive prior good writes= 12475989,  good bytes written=      
505871590098

  /dev/enp0s5       Jul 26 01:25:14 2016 err=00d80000 sev=4 hxecom        
  Shutting down testing due to error flag - 10
  W(103.1.1.19.53023) connected to R(103.1.2.19.43035)
  For stanza 0, TCP connection,  bufmin=30000, bufmax=65000, bufinc=10531
  Consecutive prior good writes=        0,  good bytes written=                 0

  /dev/enp0s5       Jul 26 01:25:14 2016 err=002a0068 sev=4 hxecom        
  SetUpConnect: Error Connecting to (9.3.242.19,50744) - Connection reset by 
peer.

  /dev/enp0s5       Jul 26 01:25:26 2016 err=002a006f sev=4 hxecom        
  SetUpConnect: Error Connecting to (9.3.242.19,50744) - Connection refused.

  == Comment: #84 - Dwip N. Banerjee <dnban...@us.ibm.com> - 2016-10-10 
11:43:26 ==
  This is going to be a long post (my apologies). But it has, I believe, the
  root cause analysis for this issue, based on what I have discovered till
  now.

  As has been mentioned this issue has been worked on, on a 24X7
  including myself.

  First I want to correct something I said in comment 61. I did the analysis as
  mentioned there. But when looking at the data on the receiver - which had
  captured the miscompare packet - I had misconstrued the position of the error.
  I had mistakenly picked the one at the end - and failed to notice the actual
  miscompare (the "XXXX...X" in a see of "YYYYYY"). That had mistakenly pushed
  me towards considering an additional problem (of possible reordering of
  packets as well). This caused a proliferation of possibilities to consider -
  on which I spent quite a bit of time before I went back to drawing board again
  to revalidate. I tracked the packets down through every concerned module and
  realized that there was no reordering. I then went back to the old data and
  discovered the missed sequences.

  So that at least narrowed down the problem to one (a nasty one albeit) - a
  case of data modification with a specific signature. the last 16 bytes of a
  page were getting corrupted. This was visible when transmitting the packet.
  The visibility/detection signature:
    -- always on the receive side
    -- always on the tun output and bridge on sender
    -- sometimes on the vhost transmit
    -- sometimes (rare) on the guest transmit.

  But from the signature, there is no clue as to whether the corruption is on
    - receive side
    - send side
  OR if it is on:
    - virtio/guest side
    - vhost/host side

  While my hunch was that the problem was likely on the receive side (because
  it was so close to the edge of a page, it almost seemed like someone had
  mistakenly read past their page), it has been very hard to narrow down or to
  validate. I had put in my own logs on the host and guest to capture the
  descriptor flows in order to be able to validate the access of the addresses
  and if the suspect address (the one NEXT to the corrupted one, in the next
  page) was being accessed - especially be receivers. As it turned out sometimes
  they were indeed accessed by receivers and also by transmitters. I had to put
  timestamps to sort them out.

  One had to  consider the possibility of races in the virtio/vhost 
communication
   -- memory barrier issues
   -- race possibilities
   -- cases where one side prematurely accessed descriptors before they were
      advertised by the producer
   -- descriptor corruption
   -- other scenarios.

  I did observe the following interesting points:
    - typically the virtio queues were running at close to full
       (especially for the slower Austin adapter)
    - the vhost receive queues for the MLX adapters were empty i.e. there were
       lots of skbs lying on the tun receive queues, waiting to be delivered to
       the guest.
  The above does bring into the picture the possibility of possible barrier/race
  causing premature exposure and leading to the corruption seen.

  I did consider various scenarios in this realm, both for send and receives on
  the guest/host but my logging didn't validate any of the scenarios. There is
  no point in cataloging them now - suffice it to say that none of them panned
  out.

  Eventually, I decided to focus on the receive side since it appeared to be
  the likely place for the data modification based on the signatures that were
  seen. I instrumented the receive side especially looking for the signatures
  we have observed. This was in addition to the logs I had encoded in the
  host/guest.

  This is when I started observing similar packet modification signatures as
  has been observed for this problem. I dumped the packet before and after
  looking for modification signatures like we are seeing.

  IN EVERY CASE I SAW A DATA CORRUPTION in the transmit path, I saw the SAME
  SIGNATURE WAS FLAGGED in the receive path on tun_recvmsg.

  Comparing timestamps, the modification observed on the tun receive path
  was always before (some 30/40ms to 100/200 ms before). Also the change
  happened during the receive:
    -- before the receive when the packet was dumped it had original data
    -- after the receive, it was a corrupted packet
    -- and if I let it go through it will then hit the miscompare on the
       transmit side

  The signature captured in this receive intercept was similar to the
  miscompare... An example below:

  ---------------------------------------------
  Previously (last two lines)
  ffffffffffffffffffffffffffffffff <END PAGE - 16>
  ffffffffffffffffffffffffffffffff <END PAGE>

  current  (last two lines)
  ffffffffffffffffffffffffffffffff <END PAGE - 16>
  cccccccccccccccccccccccccccccccc <END PAGE>
  cccccccccccccccccccccccccccccccc <START new PAGE>
  ---------------------------------------------

  When I let it run thru, this scenario will trigger a miscompare in
  the transmit path like we have been seeing all along.

  Thereafter, I put some instrumentation in the code to determine what
  was causing this. After going through a few this is the general
  sequence, which I am illustrating with an example below.

  Sequence leading to the corruption
  ---------------------------------
  A large packet is read in. Two iovecs assigned for this:

  0. Buffers are posted via vhost (the sizes are important... esp
  the first one ... here they are 0x7700 and 0x10000). In every
  instance I observed, the first one had a buffer size of 7700.

  c000001fb485dd88:  00003ffb5ad90000 0000000000007700   ...Z.?...w......
  c000001fb485dd98:  00003fff40cd0000 0000000000010000   ...@.?..........

  1.struct iovec {
    iov_base = 0x3ffb5ad90000,
    iov_len = 30464
  }

  
  2.<header> read from skb1
  3.several GSO segmnents read in from the fragments of skb1
  4.Once over, it reads the next skb2 from fraglist
  5.In this skb, there is no linear data,
  6.It reads the fragments from skb2.
  7.After four complete fragments, the fifth one can not be
  accommodated in this iovec. This is where it gets interesting.

  8. The iov_offset is at 716e when it issues a read for 0x5a8
  bytes (one MSS/1448 bytes). But it can fit only 0x592 bytes.

  c000000001995a30:  00000000000005a8 000000000000716e
                                      ^^^iov_offset^^^
  c000000001995a40:  0000000000000592 0000000000000000   ................
  c000000001995a50:  00000000000005a8 c000001ec7937842   ........Bx......
                                      ^^^^from^^^^^^
  c000000001995a60:  00003ffb5ad9716e 0000000000000592
                     ^^^to/userspc^^^  ^^^amount read^^

  The iov has only 7700-716e=0x592  space left, so it can not fit
  one full GSO segment.

  Now from is advanced: c000001ec7937242+0x592=c000001ec79377d4
  to/userspace_buf is advanced: 716e+592=7700 i.e. end of iov. So next
  iovec is invoked.

  struct iovec {
    iov_base = 0x3fff40cd0000,
    iov_len = 65536
  }

  Now we see the next read (into new iov)
  c000000001995aa0:  00000000000005a8 0000000000007700   .........w......
                                      ^^^offset: first iov done
  c000000001995ab0:  0000000000000016 0000000000000026   ........&.......
                     ^^amount to read ^^^left=returned by copy_to_user
  c000000001995ac0:  00000000000005a8 c000001ec7937dd4   .........}......
                     ^^amount wanted  ^^^from^^^^^^^^
  c000000001995ad0:  00003fff40cd0000 fffffffffffffff0   ...@.?..........
                     ^^^to^^^^^^^^^   ^^^len=-10!!!!!^^^

  So we wanted to read 0x16 bytes, but __copy_to_user_inatomic returned
  0x26 bytes as the amount still remaining to be read!!!!

  __copy_to_user_inatomic(00003fff40cd0000, c000001ec7937dd4, 0x16) =
  0x26

  THIS IS THE PROBLEM.

  Now our iov_len becomes negative = 0x16-0x26=fffffffffffffff0

  We then apply this negative offset and modify the 16 buytes before the
  start of this page on the next read. Here is the next read:

  c000000001995b50:  fffffffffffffff0 0000000000000026   ........&.......
                     ^^^offset^^^^^^  ^^^ bytes to write
  c000000001995b60:  0000000000000000 00000000000005a8   ................
                                      ^^^total wanted - from the initial
  c000000001995b70:  c000001ec7937dc4 00003fff40ccfff0   .}.........@.?..
                     ^^^from^^^^^^^^  ^^^^to- previous page
  c000000001995b80:  0000000000000026 0000000000000016   &...............
                     ^^^iov_len=amount read
  note that both from and to has regressed!!!
  from=c000001ec7937dd4+fffffffffffffff0=c000001ec7937dc4
  to=0x3fff40cd0000+fffffffffffffff0=00003fff40ccfff0
  i.e. we are re-reading 16 bytes from from
  and we are reading into the last 16 bytes of the previous page

  our offset is -16(dec) (will read into i.e. overwrite the the previous 16
  bytes

  After this the reads normalize and behave the way they ought to.
  But those 16 bytes were completely messed up (16 re-read and 16 read
  into wrong place) - which were obviously wrong/unnecessary.

  For the instances, I observed, the following were common:
    - read into first iov of size 0x7700
    - next iov starts on a new page
    - GSO segments read properly till truncation due to first iov
      running out of space. Last read=0x592. So 0x16 bytes remaining.
      At this point the "from" address seems to end in "d4" - not sure
      if that is just a coincidence.
    - next iov is start of page
    - __copy_to_user_inatomic(from, to, 0x16)=0x26 -- CULPRIT!!!
    - Now our len, offset will become negative -16(decimal) and we
      read into the last 16 bytes of the previous page
    - after this everything normalizes. Other than the 16 bytes of
      mayhem everything is fine!

  
  I investigated the call chain. Here is the call chain in invoking
  the copy_from_user routine...

  Call Chain
  ==========
  tun_put_user
    --> skb_copy_datagram_iter
          --> copy_to_iter
          --> copy_page_to_iter
              --> copy_page_to_iter_iovec
                   -->__copy_to_user_inatomic  (using atomic kmaps)
                       --> __copy_tofrom_user

  ******************************************
  __copy_to_user_inatomic pretty much passes everything to __copy_tofrom_user
  So that is likely where the focus has to be.

  The __copy_tofrom_user is in Linux/arch/powerpc/lib/copyuser_64.S
  -----------------------------------------------------------------

  _GLOBAL_TOC(__copy_tofrom_user)
  BEGIN_FTR_SECTION
          nop
  FTR_SECTION_ELSE
          b       __copy_tofrom_user_power7
  ALT_FTR_SECTION_END_IFCLR(CPU_FTR_VMX_COPY)
  _GLOBAL(__copy_tofrom_user_base)
          /* first check for a whole page copy on a page boundary */
          cmpldi  cr1,r5,16
  ...
  ******************************************

  Note that NOT all of the above corruption will be detected!! In our
  case, if the page corrupted is one of the pages being CURRENTLY
  transmitted (on the vhost send queue) we will detect it. But the
  queue size is pretty small (256) - so probability is quite low. Or
  if we have to hold on to packets due to the possibility of TCP
  retransmission, they will be long-lived and will increase the
  probability as well. I did observe one TCP retransmission in
  one of the instances but I wasn't actively seeking those
  occurrences.
  The corrupted packets will be delivered to the guest - I did see
  a packet of size 0x16 on the guest in one of the crashes - which
  seemed bizarre (how can that be generated?) but now makes sense.
  It was just thrown away.

  I let the code run with some checks in copy_page_to_iter_iovec to
  capture the anomalies in  __copy_tofrom_user. The instances are
  pretty rare and generally conform to the following pattern:

   0x26=__copy_tofrom_user(to, from, 0x16)

  However, it is not always 26/16. This is confirmed/validated by
  prior observations of the corruption being less than 16 bytes as
  well (any where from 4 bytes to 16 bytes - 16 being most common).

  [ 2365.918381] error in copying ... buf=00003fff1e7e0000 
from=c000001f6e03cd6c left=26 copy=16
  [ 8789.412510] error in copying ... buf=00003ffeef060000 
from=c000001f6ecb99f6 left=26 copy=16
  [11930.325914] error in copying ... buf=00003fff23f40000 
from=c000001f53041fd4 left=26 copy=16
  [15400.086279] error in copying ... buf=00003fff165b0000 
from=c000001f5380ab66 left=26 copy=16
  [22717.969918] error in copying ... buf=00003fff199a0000 
from=c000001f947a6606 left=24 copy=14 <<<<<-----NOTE
  [27274.429858] error in copying ... buf=00003fff14120000 
from=c000001f677c31d4 left=26 copy=16
  [32103.463712] error in copying ... buf=00003ffc272e0000 
from=c000001f25e7e68c left=26 copy=16

  The size of the read along with the new user page (and possibly
  even the from address offset) may have something to do with
  this (just conjecture again).

  I believe the solution to the problem lies in figuring out why
  __copy_tofrom_user is returning a number greater than the read
  as "left". As mentioned, the code appears to be in
  linux/arch/powerpc/lib/copyuser_64.S

  While it is possible to put a check in copy_page_to_iter_iovec
  to test for a case where "left" is greater than "copy", the
  correct approach (I believe) is to determine why it is behaving in
  the way observed. Interestingly, "left" is expected to be between
  0 and "copy" - no check for negative or greater than copy. So clearly
  the function is expected behave in a tight range and so the
  community may want to keep it that way (just conjecture).

  This issue doesn't seem unique to POWER. It seems Sparc observed
  something similar as well.

  In linux 4.8, in copy_page_to_iter_iovec() the calls to the
  atomic kmaps are enveloped with
    (IS_ENABLED(CONFIG_HIGHMEM)
  additionally. The reasoning given is efficiency. I don't believe
  it has any impact in our scenario (both Ubuntu 16.04 and PowerKVM
  don't seem to have it set)

  Similar Issues
  -------------
  I did a little rudimentary research regarding any similar issues
  reported in the past. The following URLs report similar relevant issues:

  1 https://lkml.org/lkml/2016/7/8/775
  2 http://www.spinics.net/lists/stable/msg94139.html
  3 https://lkml.org/lkml/2016/8/3/181
  4 https://patchwork.kernel.org/patch/9260733/

  Note that all of them are quite recent.

  Case 1 appears very interesting. I was wondering if we are having
  any page fault related issues for the new user page (we will be
  reading into) especially for a small read. (Case 2 is just a
  little usage case).

  Cases 3 and 4 seem to refer to similar issues where the
  corruption was narrowed down to the commit which had recoded
  the module to use copy_page_to_iter(). I don't believe that
  the root cause was found (but I didn't put too much effort
  into digging up the history because in our case we have a
  smoking gun already, I believe).

  While working on this, I observed another bug in tun_put_user()
  although it didn't have any impact in our case (I verified)

  tun_put_user calls
    skb_copy_datagram_iter(skb, vlan_offset, iter, skb->len - vlan_offset);

  but never checks the return code (which can be EFAULT). In that case,
  it will blithely copy junk to the vhost (caller's) buffers. That
  though appears to be a secondary issue for this problem.

  I believe though, that we need to figure out how
    __copy_tofrom_user
  is returning the anomalous data described above.

  While I can go through the assembly and make some progress, I believe
  there are experts here who can probably rapidly figure out the issue
  here from what has been described above.

  == Comment: #85 - Dwip N. Banerjee <dnban...@us.ibm.com> - 2016-10-10 
13:03:24 ==
  In the meanwhile, I will try the foillowing:
    -- disable atomic kmaps
    -- immediately switch to non-atomic kmaps in case left  > copy

  == Comment: #86 - Pradeep Satyanarayana <prad...@us.ibm.com> - 2016-10-10 
15:12:51 ==
  Paul, Can you please take a look at Dwip's excellent analysis and let us know 
your thoughts?

  == Comment: #87 - Dwip N. Banerjee <dnban...@us.ibm.com> - 2016-10-11 
00:53:47 ==
  I wanted to add a couple of updates and a note...

  My effort to try to maintain the performance profile by keeping the
  atomic kmaps and switching over to the non-atomic mode in case of the
  weird read failed. It was fairly quick. The corruption was visible
  within half an hour.

  So I tried to switch off the atomic completely by essentially
  following the code from 4.8 (using a CONFIG_HIGHMEM check although its
  stated introduction purpose was different). It's still running , after
  almost 12 hours (on this machine it reproduces fairly quickly). David
  also did the the same thing but by commenting out the atomic portion.
  I think that is probably still running as well. This may provide a
  helpful data point.

  Finally, I wanted to mention that __copy_tofrom_user() is also called
  in the transmit path via __copy_from_user_inatomic().

  The call sequence is:

  
handle_tx->tun_sendmsg->tun_get_user->skb_copy_datagram_from_iter->copy_page_from_iter->copy_page_from_iter_iovec->__copy_from_user_inatomic->__copy_tofrom_user

  This is done for the non-zerocopy path. By default it is off but the
  vhost code is self-tuning w.r.t zero-copy buffers even when it
  zerocopy mode set, i.e. beyond a threshold of buffers still in use
  (ubufs yet to be released) the code will switch to a copy mode. That
  brings into the picture another invocation sequence of
  __copy_tofrom_user() (with the src/dst reversed w.r.t. kernel/user
  space).

  So that will be a further possible source of mayhem...

  == Comment: #88 - Paul Mackerras <p...@au1.ibm.com> - 2016-10-11 07:09:38 ==
  Dwip, thank you for the excellent analysis. I dug into __copy_tofrom_user 
today to find out how it could return more than the number of bytes it was 
asked to copy. It turns out that there is a bug in __copy_tofrom_user_base 
which has been there since 2002! There is one store instruction in that 
function for which the exception handling goes to the wrong place, resulting in 
the return value being 16 greater than it should be. The fix is very simple and 
I will post it to the upstream mailing list and attach it to this bug.

  == Comment: #89 - Paul Mackerras <p...@au1.ibm.com> - 2016-10-11
  07:21:52 ==

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1632466/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

Reply via email to