I'm still seeing 10,000 queries/second, even after reducing my bandwidth in the pool by 90%. I'm seeing some weird things that are probably clues, but that I don't know how to interpret.
My first hypothesis was that reducing my bandwidth setting in the pool would pretty quickly lead to a proportional drop in queries, because it was probably badly-behaved clients constantly trying to sync with whatever was in the pool. I dropped my bandwidth setting from 100 Mbps to 50 Mbps and saw no decline over a few days. I dropped from 50 to 10 Mbps today, and am still doing 10k qps (10,967/second over the past 5 minutes or so).

I was also seeing this insane query volume (though I wasn't quantifying it) even when I had fallen out of the pool for a low score. This isn't unreasonable, of course; another server I run (not in Brazil) is still getting queries months after I took it out of the pool. But if it were simply a giant mass of clients, I'd expect to start seeing _some_ change pretty quickly when I dropped my overall bandwidth by 90%. I did not.

Also very interesting to me -- a bit over 98% of incoming queries are NTPv3. On other servers that are or have been in the pool, that number is below 50%. Here's the latest sysstat:

$ /usr/local/bin/ntpq -c sysstat
uptime: 545693
sysstats reset: 545693
packets received: 5215604712
current version: 85369465
older version: 5130135559
bad length or format: 98762
authentication failed: 64726
declined: 19
restricted: 1248
rate limited: 464893288
KoD responses: 77028191
processed for time: 2772

(Only 'current version' and 'older version' are reported, but tcpdump and the mrulist show that it's pretty much all v3 and v4.)

My theory, and one that looking at "mrulist limited" seems to support, is that the giant increase in traffic is all v3. If I look at just 'current version', there are 85,369,465 queries over 545,693 seconds -- about 156 qps, which is well in keeping with the 250-400 qps range I used to see (of which less than half was v3).

The other thing I can't figure out -- if I run something like "mrulist limited", the overwhelming majority of clients that are listed are of the form "191-247-nnn-nn.3g.claro.net.br" (with the 'nnn' representing IP octets, of course).
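To make that split concrete, the per-version rates are just the sysstat counters divided by uptime. A quick sanity check, with the numbers copied straight from the output above:

```python
# Back-of-the-envelope per-version query rates from the sysstat counters above.
# All numbers are copied from my `ntpq -c sysstat` output; uptime is in seconds.
uptime = 545693
current_version = 85369465    # NTPv4 packets
older_version = 5130135559    # NTPv3 and earlier

v4_qps = current_version / uptime
v3_qps = older_version / uptime

print(f"v4: {v4_qps:.0f} qps, v3 and older: {v3_qps:.0f} qps")
```

That puts v4 at roughly 156 qps and v3-and-older at roughly 9,400 qps, which is where the "the increase is all v3" theory comes from.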
I tried adding "sortorder=count" to rule out it being sorted by IP range or something. So on the surface, it seemed like that netblock was the culprit.

I added iptables rules to try to prove it -- I accept traffic from a bunch of subnets I saw coming up often, and then default-accept the rest of the traffic, so I can compare the byte counters. Of the enumerated IP ranges, 191.244.0.0/14 is indeed the busiest, with 17 GB passed since I last restarted iptables. The next-closest subnet is about 2 GB. BUT, the default-accept rule has matched 311 GB, so that netblock is not the source of the bulk of my traffic. I think I need to dig into that 311 GB and break it down a bit more.

The "3g" bit of the hostnames piques my interest as well -- could they be cell phones? Cellular modems? Maybe they have a faulty NTP client? (But I thought GSM had its own way of syncing time?)

People have suggested two theories that seem very reasonable, and that fit Occam's Razor, but that I'm not sure quite explain what I'm seeing: first, that it's a DDoS attack; second, that it's caused by the drop in the number of servers in the Brazil zone. (Though both might be contributing in a small part.)

According to the mrulist, the most abusive client (whether it's an actual IP or a spoofed one that attackers want to have me attack) has queried me about 750k times -- certainly badly behaved, but less than 2 pps on average. And only 9 IPs total have sent me more than 15k queries. (Though is there a way for me to see how often this list has rolled over? It's probably being purged a lot at this scale.) It doesn't seem like attackers would be accomplishing anything more than sending a few kbps of traffic to any given IP.

I also don't know whether the decline in the number of servers can entirely explain this. Previously, I was doing a few hundred queries per second; 10k/second is an enormous leap that doesn't seem to make sense when half the servers left.
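As a first stab at breaking down that 311 GB, something like this could tally tcpdump sources per subnet -- a sketch only, assuming tcpdump's default numeric (-n) IPv4 output format:

```python
import re
from collections import Counter

# Matches the source address in a default `tcpdump -n` line such as:
#   12:00:00.000000 IP 191.247.3.4.52011 > 10.0.0.1.123: NTPv3, Client, length 48
# (IPv4 only; IPv6 lines are simply skipped.)
SRC_RE = re.compile(r" IP (\d+\.\d+\.\d+\.\d+)\.\d+ > ")

def subnet_counts(lines, prefix=16):
    """Tally tcpdump lines by source subnet; prefix must be 8, 16, or 24."""
    keep = prefix // 8  # number of leading octets that identify the subnet
    counts = Counter()
    for line in lines:
        m = SRC_RE.search(line)
        if m:
            octets = m.group(1).split(".")
            subnet = ".".join(octets[:keep] + ["0"] * (4 - keep))
            counts[f"{subnet}/{prefix}"] += 1
    return counts
```

Feeding it live traffic would just be piping something like `tcpdump -l -n udp dst port 123` into it (e.g. via subprocess, passing the process's stdout as `lines`) and printing `most_common(20)` every so often.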
And if I look at only NTPv4 traffic, my count is pretty similar to what it used to be. (Though I don't have those metrics from before this started.)

I'm thinking of writing a small tool to stream tcpdump output and keep a per-subnet counter, unless there's already something that does this?

I've put my server in Canada, which sits on a 100 Mbps unmetered connection, into the pool and asked if it might be moved into the Brazil zone to try to absorb some of the load. I can't keep paying for this level of bandwidth, plus the larger instance I moved this to in order to handle the load. But I also suspect that something, somewhere has broken to cause this, and I can't figure out what it is.

-- Matt

On Fri, May 22, 2015 at 2:28 PM, Matt Wagner <[email protected]> wrote:
>
> Does anyone else here run an NTP server in Brazil? I'm wondering if you are seeing the same crazy load I am.
>
> For a long time I saw maybe 400 queries/second, but I got email last weekend that I had fallen out of the pool for being unreachable. Indeed, I couldn't even SSH in. It turns out that my server (a t1.micro instance) was dying under the load, which is close to 10,000 queries per second right now. For giggles, I upsized to a larger instance and moved the IP over to watch what was happening on a machine that could handle the load.
>
> Yes, I'm patched against the old monlist exploit.
>
> $ /usr/local/bin/ntpq -c sysstat
> uptime: 77729
> sysstats reset: 77729
> packets received: 670434339
> current version: 10573419
> older version: 659857017
> bad length or format: 3276
> authentication failed: 7916
> declined: 3
> restricted: 126
> rate limited: 60293937
> KoD responses: 10096867
> processed for time: 636
>
> There are definitely some abusive clients, but it's not a crazy DoS from one IP or anything. Less than 10% of requests hit rate limits, and if I watch tcpdump or something, it's from a huge range of IPs.
> Only a handful of clients have made more than 50,000 requests (over the ~77,000-second uptime), and none are way over that. Trying to profile random IPs from tcpdump, none seem to be behaving too wildly. It seems like I'm just serving a huge number of clients.
>
> My bandwidth is set at 100 Mbps, which it has been at for a while. The jump from a few hundred queries/second to 10,000 queries/second seems to have come out of nowhere.
>
> Is anyone else seeing this? I'm happy to keep soaking up some of the load, but I'm not eager to pay for 50 GB of NTP traffic a day for too long.

_______________________________________________
pool mailing list
[email protected]
http://lists.ntp.org/listinfo/pool
