[
https://issues.apache.org/jira/browse/HBASE-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Stack updated HBASE-25692:
----------------------------------
Fix Version/s: 2.3.6, 2.4.3, 2.5.0, 3.0.0-alpha-1
Status: In Progress (was: Patch Available)
> Failure to instantiate WALCellCodec leaks socket in replication
> ---------------------------------------------------------------
>
> Key: HBASE-25692
> URL: https://issues.apache.org/jira/browse/HBASE-25692
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Affects Versions: 2.4.2, 2.4.1, 2.3.4, 2.3.2, 2.2.6, 2.2.5, 2.4.0, 2.2.4,
> 2.1.9, 2.3.3, 2.2.3, 2.1.8, 2.2.2, 2.1.7, 2.1.6, 2.2.1, 2.1.5, 2.0.6, 2.1.4,
> 2.3.1, 2.3.0, 2.1.3, 2.1.2, 2.1.1, 2.2.0, 2.1.0
> Reporter: Josh Elser
> Assignee: Josh Elser
> Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.3, 2.3.6
>
>
> I was looking at an HBase user's deployment with [~danilocop]: they had two
> otherwise identical clusters, one of which regularly had sockets in
> CLOSE_WAIT going from RegionServers to a distributed storage appliance.
> After a lot of analysis, we eventually figured out that these sockets in
> CLOSE_WAIT were directly related to an FSDataInputStream which we forgot to
> close inside of the RegionServer. The subtlety was that only one of these
> HBase clusters was set up to do replication (to the other cluster). The HBase
> cluster experiencing this problem was shipping edits to a peer, and had
> previously been using Phoenix. At some point, the cluster had Phoenix removed
> from it.
> What we found was that replication still had WALs to ship which were for
> Phoenix tables. Phoenix, in this version, still used the custom WALCellCodec;
> however, this codec class was missing from the RS classpath after the owner
> of the cluster removed Phoenix.
> When we try to instantiate the Codec implementation via ReflectionUtils, we
> end up throwing an UnsupportedOperationException which wraps a
> ClassNotFoundException. However, in WALFactory, we _only_ close the
> FSDataInputStream when we catch an IOException, so this unchecked exception
> skips the close.
> Thus, replication sits in a "fast" loop, trying to ship these edits, each
> time leaking a new socket because of the InputStream not being closed. There
> is an obvious workaround for this specific issue, but we should not leak this
> inside HBase.
> The approximate 2.1.x stack trace which led us to this is below.
> {noformat}
> 2021-03-11 18:19:20,364 ERROR org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader: Failed to read stream of replication entries
> java.io.IOException: Cannot get log reader
>     at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:366)
>     at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:303)
>     at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:291)
>     at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:427)
>     at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:354)
>     at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:302)
>     at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:293)
>     at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:174)
>     at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:100)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.readWALEntries(ReplicationSourceWALReader.java:192)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:138)
> Caused by: java.lang.UnsupportedOperationException: Unable to find org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec
>     at org.apache.hadoop.hbase.util.ReflectionUtils.instantiateWithCustomCtor(ReflectionUtils.java:47)
>     at org.apache.hadoop.hbase.regionserver.wal.WALCellCodec.create(WALCellCodec.java:106)
>     at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.getCodec(ProtobufLogReader.java:301)
>     at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.initAfterCompression(ProtobufLogReader.java:311)
>     at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:81)
>     at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.init(ProtobufLogReader.java:168)
>     at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:321)
>     ... 10 more
> Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec
>     at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>     at java.lang.Class.forName0(Native Method)
>     at java.lang.Class.forName(Class.java:264)
>     at org.apache.hadoop.hbase.util.ReflectionUtils.instantiateWithCustomCtor(ReflectionUtils.java:43)
>     ... 16 more
> {noformat}
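
The leak described above can be illustrated with a minimal, self-contained sketch. This is not the HBase code or the committed patch: TrackedStream, initReader, openLeaky, openSafe and the class name org.example.MissingCodec are all hypothetical stand-ins, and the open-stream counter stands in for sockets left in CLOSE_WAIT. It only shows why a catch block limited to IOException misses an unchecked exception, and one way (closing on any failure) the stream could be released instead.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

class ReaderLeakSketch {
    // Number of streams currently open; a stand-in for sockets held by the RegionServer.
    static final AtomicInteger OPEN_STREAMS = new AtomicInteger();

    // Hypothetical stand-in for FSDataInputStream.
    static final class TrackedStream implements Closeable {
        private boolean closed;

        TrackedStream() {
            OPEN_STREAMS.incrementAndGet();
        }

        @Override
        public void close() {
            if (!closed) {
                closed = true;
                OPEN_STREAMS.decrementAndGet();
            }
        }
    }

    // Mimics reader initialization failing because the codec class is not on the classpath.
    static void initReader(TrackedStream in) throws IOException {
        try {
            Class.forName("org.example.MissingCodec"); // hypothetical, deliberately absent
        } catch (ClassNotFoundException e) {
            // Unchecked wrapper, as in the stack trace above.
            throw new UnsupportedOperationException("Unable to find codec", e);
        }
    }

    // Leaky pattern: the stream is closed only when an IOException is caught.
    static TrackedStream openLeaky() throws IOException {
        TrackedStream in = new TrackedStream();
        try {
            initReader(in);
            return in;
        } catch (IOException e) {
            in.close();
            throw e;
        }
    }

    // Safer pattern: the stream is closed on any failure, checked or unchecked.
    static TrackedStream openSafe() throws IOException {
        TrackedStream in = new TrackedStream();
        boolean initialized = false;
        try {
            initReader(in);
            initialized = true;
            return in;
        } finally {
            if (!initialized) {
                in.close();
            }
        }
    }

    // Simulates the retry loop and reports how many streams were left open.
    static int leakedAfter(boolean safe, int attempts) {
        int before = OPEN_STREAMS.get();
        for (int i = 0; i < attempts; i++) {
            try {
                if (safe) {
                    openSafe();
                } else {
                    openLeaky();
                }
            } catch (IOException | RuntimeException e) {
                // A real reader would log the failure and retry.
            }
        }
        return OPEN_STREAMS.get() - before;
    }

    public static void main(String[] args) {
        System.out.println("leaky: " + leakedAfter(false, 5));
        System.out.println("safe:  " + leakedAfter(true, 5));
    }
}
```

In this sketch every failed attempt through openLeaky leaves one stream open, which matches the one-socket-per-retry behavior reported in the description, while openSafe leaves none.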