Josh Elser created HBASE-25692:
----------------------------------
Summary: Failure to instantiate WALCellCodec leaks socket
Key: HBASE-25692
URL: https://issues.apache.org/jira/browse/HBASE-25692
Project: HBase
Issue Type: Bug
Components: Replication
Affects Versions: 2.4.2, 2.4.1, 2.3.4, 2.3.2, 2.2.6, 2.2.5, 2.4.0, 2.2.4,
2.1.9, 2.3.3, 2.2.3, 2.1.8, 2.2.2, 2.1.7, 2.1.6, 2.2.1, 2.1.5, 2.0.6, 2.1.4,
2.3.1, 2.3.0, 2.1.3, 2.1.2, 2.1.1, 2.2.0, 2.1.0
Reporter: Josh Elser
Assignee: Josh Elser
I was looking at an HBase user's deployment with [~danilocop]. They had two
otherwise identical clusters, and one of them regularly had sockets in
CLOSE_WAIT going from RegionServers to a distributed storage appliance.
After a lot of analysis, we eventually figured out that these sockets in
CLOSE_WAIT were directly related to an FSDataInputStream which we forgot to
close inside of the RegionServer. The subtlety was that only one of these HBase
clusters was set up to do replication (to the other cluster). The HBase cluster
experiencing this problem was shipping edits to a peer, and had previously been
using Phoenix. At some point, the cluster had Phoenix removed from it.
What we found was that replication still had WALs to ship which were for
Phoenix tables. Phoenix, in this version, still used a custom WALCellCodec;
however, this codec class was missing from the RS classpath after the owner of
the cluster removed Phoenix.
When we try to instantiate the Codec implementation via ReflectionUtils, we end
up throwing an UnsupportedOperationException which wraps a
NoClassDefFoundException. However, in WALFactory, we _only_ close the
FSDataInputStream when we catch an IOException.
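To make the failure mode concrete, here is a minimal, self-contained sketch of
the pattern. It is not the actual WALFactory code: TrackingStream, initReader,
openLeaky and openSafe are hypothetical names, and a plain in-memory stream
stands in for FSDataInputStream. It only illustrates why a catch block limited
to IOException misses the unchecked exception, and one possible way to close
the stream on any failure.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReaderLeakSketch {
    /** Hypothetical stand-in for FSDataInputStream; records whether close() ran. */
    public static class TrackingStream extends ByteArrayInputStream {
        public boolean closed = false;

        public TrackingStream() {
            super(new byte[0]);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /**
     * Hypothetical stand-in for reader/codec initialization. It always fails
     * with an unchecked exception, mimicking a codec class that is missing
     * from the classpath.
     */
    public static void initReader(InputStream in) throws IOException {
        throw new UnsupportedOperationException("Unable to find codec class");
    }

    /** Leaky pattern: the stream is closed only when an IOException is caught. */
    public static void openLeaky(TrackingStream stream) throws IOException {
        try {
            initReader(stream);
        } catch (IOException e) {
            stream.close();
            throw e;
        }
        // An UnsupportedOperationException bypasses the catch block above,
        // so the stream (and its socket) is never closed.
    }

    /** One possible fix: close the stream on any failure, checked or not. */
    public static void openSafe(TrackingStream stream) throws IOException {
        boolean success = false;
        try {
            initReader(stream);
            success = true;
        } finally {
            if (!success) {
                stream.close();
            }
        }
    }
}
```

With this sketch, openLeaky leaves the stream open when initReader throws,
while openSafe closes it before the exception propagates to the caller.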
Thus, replication sits in a "fast" loop, trying to ship these edits, each time
leaking a new socket because of the InputStream not being closed. There is an
obvious workaround for this specific issue, but we should not leak this inside
HBase.