The problem also affects transfers from RDS instances - for example, it
prevents creating an export file with pg_dump from an RDS PostgreSQL
database.
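
A plain pg_dump run against the RDS endpoint is enough to hit it; something
like the following (hostname, user and database name are placeholders) stalls
partway through once the window collapses:

$ pg_dump -h mydb.xxxxxxxx.eu-west-1.rds.amazonaws.com -U myuser -d mydb -Fc -f mydb.dump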

The problem seems to be caused by a sudden collapse of the downloader's
TCP receive window to "1" (usually with a window scale of 7, so
1*2^7=128 bytes) after several tens of GB have been transferred over a
single connection. The TCP receive window never recovers.
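
The shrinking receive buffer can also be watched live on the affected
instance while the transfer stalls; a minimal sketch, run on the downloader
(10.16.14.237 in the capture below), filtering on the RDS endpoint address:

$ ss -tmi dst 10.16.10.102

The rcv_space value and the skmem counters reported for the stalled socket
should mirror the collapsed window seen in the capture.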

Analysis of a single stalled transfer from RDS PostgreSQL:

$ tcpdump -nn -vvv -r pcap --dont-verify-checksums | xz -1 > pcap.txt.xz

$ xzgrep wscale pcap.txt.xz 
    10.16.14.237.33578 > 10.16.10.102.5432: Flags [S], seq 3273208485, win 
26883, options [mss 8961,sackOK,TS val 1284903196 ecr 0,nop,wscale 7], length 0
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [S.], seq 2908863056, ack 
3273208486, win 28960, options [mss 1460,sackOK,TS val 120076048 ecr 
1284903196,nop,wscale 10], length 0

The window scale announced by the downloader (10.16.14.237) for this TCP
connection is 7, so each "win N" value it advertises has to be multiplied
by 2^7=128.

Near the beginning of the capture the downloader's advertised window is,
for example, 852, which after scaling means 852*2^7=109056 bytes:
$ xzgrep '10.16.14.237.33578' pcap.txt.xz | head -100000 | tail -1
    10.16.14.237.33578 > 10.16.10.102.5432: Flags [.], seq 2256009, ack 
201198741, win 852, options [nop,nop,TS val 1284911021 ecr 120078004], length 0

But near the end of the capture it is 1, which after scaling means 1*2^7=128 bytes:
$ xzcat pcap.txt.xz | tail -1000 | grep '10.16.14.237.33578' | head -1
    10.16.14.237.33578 > 10.16.10.102.5432: Flags [.], seq 2266022, ack 
3742538664, win 1, options [nop,nop,TS val 1286238401 ecr 120409852], length 0
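
To see where the collapse happens without paging through the whole dump, the
downloader's advertised window can be extracted in one pass; a sketch
(uniq -c collapses consecutive repeats, so the transition to "win 1" stands out):

$ xzcat pcap.txt.xz | grep '10.16.14.237.33578 > ' | grep -o 'win [0-9]*' | uniq -c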

And indeed the RDS server is sending 128 bytes of data and waiting for the
ACK before sending the next 128:
13:47:27.170534 IP (tos 0x0, ttl 255, id 11479, offset 0, flags [DF], proto TCP 
(6), length 180)
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [P.], seq 
3742538664:3742538792, ack 2266022, win 174, options [nop,nop,TS val 120409903 
ecr 1286238401], length 128
13:47:27.170542 IP (tos 0x0, ttl 64, id 28256, offset 0, flags [DF], proto TCP 
(6), length 52)
    10.16.14.237.33578 > 10.16.10.102.5432: Flags [.], seq 2266022, ack 
3742538792, win 1, options [nop,nop,TS val 1286238605 ecr 120409903], length 0
13:47:27.374539 IP (tos 0x0, ttl 255, id 11480, offset 0, flags [DF], proto TCP 
(6), length 180)
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [P.], seq 
3742538792:3742538920, ack 2266022, win 174, options [nop,nop,TS val 120409954 
ecr 1286238605], length 128
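
Judging from the timestamps above (roughly 200 ms between consecutive
128-byte segments), the effective throughput at this point is on the order
of 128/0.2 = 640 bytes/s, i.e. the transfer has effectively stalled.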

The switch from "win 600" (76800 bytes) to "win 1" (128 bytes) is sudden and I 
have no idea what could have caused it:
13:33:26.230782 IP (tos 0x0, ttl 64, id 24124, offset 0, flags [DF], proto TCP 
(6), length 52)
    10.16.14.237.33578 > 10.16.10.102.5432: Flags [.], seq 2266022, ack 
3741933728, win 600, options [nop,nop,TS val 1285397677 ecr 120199669], length 0
13:33:26.230868 IP (tos 0x0, ttl 255, id 7295, offset 0, flags [DF], proto TCP 
(6), length 4396)
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [.], seq 
3741933728:3741938072, ack 2266022, win 174, options [nop,nop,TS val 120199669 
ecr 1285397677], length 4344
13:33:26.230918 IP (tos 0x0, ttl 255, id 7298, offset 0, flags [DF], proto TCP 
(6), length 43492)
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [.], seq 
3741938072:3741981512, ack 2266022, win 174, options [nop,nop,TS val 120199669 
ecr 1285397677], length 43440
13:33:26.230932 IP (tos 0x0, ttl 255, id 7328, offset 0, flags [DF], proto TCP 
(6), length 13708)
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [P.], seq 
3741981512:3741995168, ack 2266022, win 174, options [nop,nop,TS val 120199669 
ecr 1285397677], length 13656
13:33:26.230948 IP (tos 0x0, ttl 255, id 7338, offset 0, flags [DF], proto TCP 
(6), length 2948)
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [.], seq 
3741995168:3741998064, ack 2266022, win 174, options [nop,nop,TS val 120199669 
ecr 1285397677], length 2896
13:33:26.230969 IP (tos 0x0, ttl 255, id 7340, offset 0, flags [DF], proto TCP 
(6), length 5348)
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [P.], seq 
3741998064:3742003360, ack 2266022, win 174, options [nop,nop,TS val 120199669 
ecr 1285397677], length 5296
13:33:26.231759 IP (tos 0x0, ttl 255, id 7344, offset 0, flags [DF], proto TCP 
(6), length 4396)
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [.], seq 
3742003360:3742007704, ack 2266022, win 174, options [nop,nop,TS val 120199669 
ecr 1285397677], length 4344
13:33:26.231775 IP (tos 0x0, ttl 255, id 7347, offset 0, flags [DF], proto TCP 
(6), length 2876)
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [.], seq 
3742007704:3742010528, ack 2266022, win 174, options [nop,nop,TS val 120199669 
ecr 1285397677], length 2824
13:33:26.233238 IP (tos 0x0, ttl 64, id 24125, offset 0, flags [DF], proto TCP 
(6), length 52)
    10.16.14.237.33578 > 10.16.10.102.5432: Flags [.], seq 2266022, ack 
3742010528, win 1, options [nop,nop,TS val 1285397679 ecr 120199669], length 0
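
To locate the exact packet at which the window first drops, the text dump can
be searched for the first "win 1" advertised by the downloader; a sketch
(grep -n reports the line number within the dump, so the packets just before
the drop are easy to inspect):

$ xzcat pcap.txt.xz | grep -n '10.16.14.237.33578 > ' | grep -m1 'win 1,'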

----

There's one changelog entry which seems to touch TCP receive window
calculations: "tcp: avoid integer overflows in tcp_rcv_space_adjust()".
I don't know whether this is the change that caused the regression.
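
A sketch for pinpointing that changelog entry and any bug references next to
it (assuming the changelog for the installed image can still be fetched):

$ apt-get changelog linux-image-4.15.0-1023-aws | grep -n -C3 tcp_rcv_space_adjust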

https://bugs.launchpad.net/bugs/1796469

Title:
  aws s3 cp --recursive hangs on the last file on a large file transfer
  to instance

Status in linux-aws package in Ubuntu:
  Confirmed

Bug description:
  aws s3 cp --recursive hangs on the last file on a large transfer to an
  instance

  I have confirmed that this works on version Linux/4.15.0-1021-aws

  aws cli version
  aws-cli/1.16.23 Python/2.7.15rc1 Linux/4.15.0-1023-aws botocore/1.12.13

  Ubuntu version
  Description:  Ubuntu 18.04.1 LTS
  Release:      18.04
  eu-west-1 - ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20180912 - 
ami-00035f41c82244dab

  Package version
  linux-aws:
    Installed: 4.15.0.1023.23
    Candidate: 4.15.0.1023.23
    Version table:
   *** 4.15.0.1023.23 500
          500 http://eu-west-1.ec2.archive.ubuntu.com/ubuntu 
bionic-updates/main amd64 Packages
          500 http://security.ubuntu.com/ubuntu bionic-security/main amd64 
Packages
          100 /var/lib/dpkg/status
       4.15.0.1007.7 500
          500 http://eu-west-1.ec2.archive.ubuntu.com/ubuntu bionic/main amd64 
Packages
