[GitHub] [accumulo] ctubbsii edited a comment on issue #2016: Github QA occasionally hangs while running unit-tests

GitBox Wed, 14 Apr 2021 14:50:51 -0700


ctubbsii edited a comment on issue #2016:
URL: https://github.com/apache/accumulo/issues/2016#issuecomment-819793131



   Okay, so fixing the timeout works, and I was able to get the logs. It looks 
like services start up okay, but cannot talk to each other. The services 
register themselves using the local host name, which is determined by a reverse 
DNS lookup on the local IP address. When services are reached on localhost, 
everything works fine (e.g. services can talk to zookeeper on `localhost:33647` 
just fine). I don't see any errors sending to the tracer service, but I do see 
it listening on an IP address (`[tracer.AsyncSpanReceiver] INFO : starting span 
receiver with hostname 10.1.0.83`) instead of a resolved hostname.
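   
   A minimal sketch of how the reverse/forward lookup could be checked on a 
runner, assuming the servers derive their name roughly the way 
`InetAddress.getLocalHost().getCanonicalHostName()` does (not confirmed 
against the actual Accumulo call path). The class name `HostnameRoundTrip` and 
its helper methods are hypothetical:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;

public class HostnameRoundTrip {

  /** True if any forward-resolved address equals the address we started from. */
  static boolean roundTrips(InetAddress local, InetAddress[] forward) {
    return Arrays.asList(forward).contains(local);
  }

  /** One-line summary of a forward lookup, flagging loopback addresses. */
  static String describe(String name, InetAddress[] forward) {
    StringBuilder sb = new StringBuilder(name).append(" ->");
    for (InetAddress a : forward) {
      sb.append(' ').append(a.getHostAddress());
      if (a.isLoopbackAddress()) {
        sb.append("(loopback)");
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) throws UnknownHostException {
    InetAddress local = InetAddress.getLocalHost();
    // getCanonicalHostName() performs the reverse lookup; it falls back to
    // the textual IP address when no name can be found.
    String name = local.getCanonicalHostName();
    // Forward lookup of whatever name the reverse lookup produced.
    InetAddress[] forward = InetAddress.getAllByName(name);
    System.out.println("local address: " + local.getHostAddress());
    System.out.println(describe(name, forward));
    System.out.println("round trip ok: " + roundTrips(local, forward));
  }
}
```

   If the printed name is just the IP address, or the forward lookup does not 
lead back to the local address, that would be consistent with a DNS/rDNS 
mismatch on the runner.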
   
   Tservers and the master in 1.10 (the build I was testing) show that they are 
listening on hostname `fv-az95-160`, but when the master tries to talk to the 
tservers, it fails to connect and times out:
   
   ```
   2021-04-14T18:10:40,775 [manager.Manager] INFO : New servers: [fv-az95-160:45343[100000b12f00006], fv-az95-160:36911[100000b12f00002]]
   2021-04-14T18:10:40,794 [manager.EventCoordinator] INFO : There are now 2 tablet servers
   2021-04-14T18:10:40,803 [manager.Manager] INFO : tserver availability check disabled, continuing with-2 servers. To enable, set manager.startup.tserver.avail.min.count
   2021-04-14T18:10:40,956 [server.ServerUtil] WARN : System swappiness setting is greater than ten (60) which can cause time-sensitive operations to be delayed. Accumulo is time sensitive because it needs to maintain distributed lock agreement.
   2021-04-14T18:10:40,980 [manager.Manager] INFO : Setting manager lock data to fv-az95-160:35861
   2021-04-14T18:10:41,040 [metrics.ManagerMetricsFactory] INFO : Registered replication metrics module
   2021-04-14T18:10:41,061 [metrics.ManagerMetricsFactory] INFO : Registered FATE metrics module
   2021-04-14T18:10:41,061 [manager.Manager] INFO : All metrics modules registered
   2021-04-14T18:10:41,330 [balancer.TableLoadBalancer] INFO : Loaded class org.apache.accumulo.core.spi.balancer.SimpleLoadBalancer for table +r
   2021-04-14T18:10:41,331 [manager.Manager] INFO : Assigning 1 tablets
   2021-04-14T18:11:20,829 [rpc.ThriftUtil] WARN : Failed to open transport to fv-az95-160:36911
   2021-04-14T18:11:20,830 [rpc.ThriftUtil] WARN : Failed to open transport to fv-az95-160:45343
   2021-04-14T18:11:20,830 [manager.Manager] ERROR: unable to get tablet server status fv-az95-160:36911[100000b12f00002]
   org.apache.thrift.transport.TTransportException: java.nio.channels.ClosedByInterruptException
   ```
   
   There is an additional stack trace further along, but it adds nothing new: 
it only shows that there was a timeout trying to connect to the tserver.
   
   So, either there is a problem with DNS/rDNS mapping between the hostname and 
IP address of the runner, or there is some other security / firewall policy 
preventing services from talking on the non-localhost IP address.
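   
   One way to separate the two hypotheses would be to probe the same port via 
`localhost` and via the advertised hostname. The following is only a sketch 
under assumptions: the class `ConnectProbe`, its methods, and the 5 second 
timeout are hypothetical, and the argument is expected in the `host:port` or 
`host:port[session]` form seen in the manager log:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ConnectProbe {

  /** Parses "host:port" or "host:port[session]" as printed in the manager log. */
  static InetSocketAddress parse(String logEntry) {
    int bracket = logEntry.indexOf('[');
    String hostPort = bracket < 0 ? logEntry : logEntry.substring(0, bracket);
    int colon = hostPort.lastIndexOf(':');
    String host = hostPort.substring(0, colon);
    int port = Integer.parseInt(hostPort.substring(colon + 1).trim());
    // Left unresolved on purpose: name resolution should happen at connect time.
    return InetSocketAddress.createUnresolved(host, port);
  }

  /** Returns true if a TCP connection opens within timeoutMillis. */
  static boolean canConnect(String host, int port, int timeoutMillis) {
    try (Socket socket = new Socket()) {
      socket.connect(new InetSocketAddress(host, port), timeoutMillis);
      return true;
    } catch (IOException e) {
      // Covers refused connections, timeouts, and unresolvable host names.
      return false;
    }
  }

  public static void main(String[] args) {
    // Example argument: fv-az95-160:36911[100000b12f00002]
    InetSocketAddress target = parse(args[0]);
    for (String host : new String[] {"localhost", target.getHostString()}) {
      System.out.println(host + ":" + target.getPort() + " reachable: "
          + canConnect(host, target.getPort(), 5000));
    }
  }
}
```

   If `localhost` is reachable and the hostname is not, while the hostname 
itself resolves, that would point toward a network policy rather than DNS. A 
server bound only to the non-loopback address would also be unreachable via 
`localhost`, so the result needs to be read together with the bind address in 
the logs.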
   
   This is clearly the result of some change in GitHub Actions runners, and not 
in our code, since it also affects minicluster in 1.10.
   
   The most likely change I can think of that could have caused this is the 
switch of `ubuntu-latest` from mapping to `ubuntu-18.04` to `ubuntu-20.04`. 
However, I don't have an Ubuntu instance to experiment with at the moment, so 
this is where I'm stuck for now.
   
   There are a few options going forward, if it is an issue with Ubuntu 20.04:
   1. ~Force using ubuntu-18.04 instead of ubuntu-latest~ didn't work, test 
still hung and timed out with the same errors connecting
   2. ~Disable firewalld (or other firewall, if it's running on Ubuntu 20.04 
runners)~ didn't work, the firewall is already inactive
   3. ~Add firewall rules (if the problem is the firewall)~ firewall is already 
inactive
   4. Fix rDNS name lookup for the local IP address by adding a hosts entry to 
/etc/hosts or doing something in /etc/nsswitch.conf to use the myhostname local 
name service rather than DNS and hostnamectl (if the problem is DNS)
   5. Force minicluster services to use localhost
   6. Get GitHub to fix it, if it's a problem with the runner
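   
   For option 4, a hosts entry is just the IP address followed by the name. A 
rough sketch of generating that line for the current machine follows; the class 
`HostsEntry` and its method are hypothetical, and actually appending the line 
to /etc/hosts would need root and a separate workflow step that is not shown 
here:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class HostsEntry {

  /** Formats one /etc/hosts line: the address first, then the host name. */
  static String hostsLine(String ipAddress, String hostname) {
    return ipAddress + " " + hostname;
  }

  public static void main(String[] args) throws UnknownHostException {
    // Assumption: getLocalHost() reports the same address and name that the
    // services would register with; this has not been verified on a runner.
    InetAddress local = InetAddress.getLocalHost();
    System.out.println(hostsLine(local.getHostAddress(), local.getHostName()));
  }
}
```

   With the values from the logs above, `hostsLine("10.1.0.83", 
"fv-az95-160")` would produce the line `10.1.0.83 fv-az95-160`.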
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
