ctubbsii edited a comment on issue #2016:
URL: https://github.com/apache/accumulo/issues/2016#issuecomment-819793131
Okay, so fixing the timeout works. I was able to get the logs. It looks like
services are starting up okay, but cannot talk to each other. The services
register themselves using the local host name determined by using reverse DNS
on the local IP address. When services are reached on localhost, everything
works fine (e.g. services can talk to zookeeper on `localhost:33647` just
fine). I don't see any errors with sending to the tracer service, but do see it
listening on an IP address (`[tracer.AsyncSpanReceiver] INFO : starting span
receiver with hostname 10.1.0.83`) instead of resolving a hostname.
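
For anyone who wants to reproduce the name resolution on a runner, here is a minimal sketch using only standard JDK calls. It is not the code Accumulo uses to pick its advertised hostname; the class name `RdnsCheck` and the helper `looksLikeIpLiteral` are made up for illustration. It prints the local address, the canonical name obtained by reverse lookup, and whether that name resolves back to the same address. `getCanonicalHostName()` falls back to the textual IP address when the reverse lookup fails, which would match what the tracer logged.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;

public class RdnsCheck {

  // Hypothetical helper: true when a "hostname" is really an IP literal,
  // which is what getCanonicalHostName() returns if the reverse lookup fails.
  static boolean looksLikeIpLiteral(String name) {
    return name.matches("\\d{1,3}(\\.\\d{1,3}){3}") || name.contains(":");
  }

  public static void main(String[] args) throws UnknownHostException {
    InetAddress local = InetAddress.getLocalHost();
    String canonical = local.getCanonicalHostName();
    System.out.println("local address:  " + local.getHostAddress());
    System.out.println("canonical name: " + canonical);

    if (looksLikeIpLiteral(canonical)) {
      System.out.println("reverse lookup did not produce a name");
      return;
    }

    try {
      // Forward lookup of the canonical name; it should include the local address.
      InetAddress[] forward = InetAddress.getAllByName(canonical);
      System.out.println("forward lookup: " + Arrays.toString(forward));
      boolean roundTrips = Arrays.asList(forward).contains(local);
      System.out.println("name maps back to local address: " + roundTrips);
    } catch (UnknownHostException e) {
      System.out.println("forward lookup failed for " + canonical);
    }
  }
}
```

Running this in a workflow step before the tests would show whether the runner's hostname and IP address map to each other in both directions.
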
Tservers and the master in 1.10 (the build I was testing) show that they are
listening on hostname `fv-az95-160`, but when the master tries to talk to the
tservers, it fails to connect and times out:
```
2021-04-14T18:10:40,775 [manager.Manager] INFO : New servers:
[fv-az95-160:45343[100000b12f00006], fv-az95-160:36911[100000b12f00002]]
2021-04-14T18:10:40,794 [manager.EventCoordinator] INFO : There are now 2
tablet servers
2021-04-14T18:10:40,803 [manager.Manager] INFO : tserver availability check
disabled, continuing with-2 servers. To enable, set
manager.startup.tserver.avail.min.count
2021-04-14T18:10:40,956 [server.ServerUtil] WARN : System swappiness setting
is greater than ten (60) which can cause time-sensitive operations to be
delayed. Accumulo is time sensitive because it needs to maintain distributed
lock agreement.
2021-04-14T18:10:40,980 [manager.Manager] INFO : Setting manager lock data
to fv-az95-160:35861
2021-04-14T18:10:41,040 [metrics.ManagerMetricsFactory] INFO : Registered
replication metrics module
2021-04-14T18:10:41,061 [metrics.ManagerMetricsFactory] INFO : Registered
FATE metrics module
2021-04-14T18:10:41,061 [manager.Manager] INFO : All metrics modules
registered
2021-04-14T18:10:41,330 [balancer.TableLoadBalancer] INFO : Loaded class
org.apache.accumulo.core.spi.balancer.SimpleLoadBalancer for table +r
2021-04-14T18:10:41,331 [manager.Manager] INFO : Assigning 1 tablets
2021-04-14T18:11:20,829 [rpc.ThriftUtil] WARN : Failed to open transport to
fv-az95-160:36911
2021-04-14T18:11:20,830 [rpc.ThriftUtil] WARN : Failed to open transport to
fv-az95-160:45343
2021-04-14T18:11:20,830 [manager.Manager] ERROR: unable to get tablet server
status fv-az95-160:36911[100000b12f00002]
org.apache.thrift.transport.TTransportException:
java.nio.channels.ClosedByInterruptException
```
There is an additional stack trace further along, but it doesn't have any
additional information, just that there was a timeout trying to connect to the
tserver.
So, either there is a problem with DNS/rDNS mapping between the hostname and
IP address of the runner, or there is some other security / firewall policy
preventing services from talking on the non-localhost IP address.
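
One way to tell those two cases apart might be a small probe like the sketch below, which is independent of Accumulo and uses only standard JDK sockets. The class name `ConnectProbe` and the method `listenAndConnect` are hypothetical. It opens a listener on an ephemeral port on all interfaces, then tries to connect to it once through `localhost` and once through the machine's own hostname, with a timeout. If `localhost` connects and the hostname does not, the problem is in name resolution or in a policy on the non-loopback address rather than in the services.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class ConnectProbe {

  // Opens a listener on an ephemeral port on all interfaces, then tries to
  // connect to it through the given host name. Returns false on any failure
  // (unresolvable name, refused connection, or timeout).
  static boolean listenAndConnect(String host, int timeoutMillis) {
    try (ServerSocket server = new ServerSocket(0)) {
      try (Socket client = new Socket()) {
        client.connect(new InetSocketAddress(host, server.getLocalPort()), timeoutMillis);
        return true;
      }
    } catch (IOException e) {
      return false;
    }
  }

  public static void main(String[] args) throws IOException {
    String hostname = InetAddress.getLocalHost().getHostName();
    for (String host : new String[] {"localhost", hostname}) {
      String result = listenAndConnect(host, 5000) ? "connected" : "FAILED";
      System.out.println(host + " -> " + result);
    }
  }
}
```

The 5000 ms timeout is an arbitrary choice for the sketch. The probe does not say which of the two causes is responsible, only whether connections over the hostname fail outside of Accumulo as well.
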
This is clearly the result of some change in GitHub Actions runners, and not
in our code, since it also affects minicluster in 1.10.
The most likely change I can think of that could have caused this is the
switch of `ubuntu-latest` from mapping to `ubuntu-18.04` to `ubuntu-20.04`.
However, I don't have an Ubuntu instance to experiment with at the moment, so
this is where I'm stuck for now.
There are a few options going forward, if it is an issue with Ubuntu 20.04:
1. ~Force using ubuntu-18.04 instead of ubuntu-latest~ (didn't work; the test
still hung and timed out with the same connection errors)
2. ~Disable firewalld (or other firewall, if it's running on Ubuntu 20.04
runners)~ (didn't work; the firewall is already inactive)
3. ~Add firewall rules (if the problem is the firewall)~ (not applicable; the
firewall is already inactive)
4. Fix rDNS name lookup for the local IP address by adding a hosts entry to
/etc/hosts or doing something in /etc/nsswitch.conf to use the myhostname local
name service rather than DNS and hostnamectl (if the problem is DNS)
5. Force minicluster services to use localhost
6. Get GitHub to fix it, if it's a problem with the runner