[ 
https://issues.apache.org/jira/browse/HBASE-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack resolved HBASE-25692.
-----------------------------------
    Hadoop Flags: Reviewed
      Resolution: Fixed

Merged to 2.3+. Shout if you want it to go elsewhere [~elserj].

> Failure to instantiate WALCellCodec leaks socket in replication
> ---------------------------------------------------------------
>
>                 Key: HBASE-25692
>                 URL: https://issues.apache.org/jira/browse/HBASE-25692
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 2.1.0, 2.2.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, 2.3.1, 2.1.4, 
> 2.0.6, 2.1.5, 2.2.1, 2.1.6, 2.1.7, 2.2.2, 2.1.8, 2.2.3, 2.3.3, 2.1.9, 2.2.4, 
> 2.4.0, 2.2.5, 2.2.6, 2.3.2, 2.3.4, 2.4.1, 2.4.2
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>            Priority: Major
>             Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.3, 2.3.6
>
>
> I was looking at an HBase user's cluster with [~danilocop] where they saw two 
> otherwise identical clusters where one of them was regularly had sockets in 
> CLOSE_WAIT going from RegionServers to a distributed storage appliance.
> After a lot of analysis, we eventually figured out that these sockets in 
> CLOSE_WAIT were directly related to an FSDataInputStream which we forgot to 
> close inside of the RegionServer. The subtlety was that only one of these 
> HBase clusters was set up to do replication (to the other cluster). The HBase 
> cluster experiencing this problem was shipping edits to a peer, and had 
> previously been using Phoenix. At some point, the cluster had Phoenix removed 
> from it.
> What we found was that replication still had WALs to ship which were for 
> Phoenix tables. Phoenix, in this version, still used the custom WALCellCodec; 
> however, this codec class was missing from the RS classpath after the owner 
> of the cluster removed Phoenix.
> When we try to instantiate the Codec implementation via ReflectionUtils, we 
> end up throwing an UnsupportedOperationException which wraps a 
> NoClassDefFoundException. However, in WALFactory, we _only_ close the 
> FSDataInputStream when we catch an IOException. 
> Thus, replication sits in a "fast" loop, trying to ship these edits, each 
> time leaking a new socket because of the InputStream not being closed. There 
> is an obvious workaround for this specific issue, but we should not leak this 
> inside HBase.
> Approximate, 2.1.x stack trace which lead us to this is below.
> {noformat}
> 2021-03-11 18:19:20,364 ERROR 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader: 
> Failed to read stream of replication entries
> java.io.IOException: Cannot get log reader
>       at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:366)
>       at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:303)
>       at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:291)
>       at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:427)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:354)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:302)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:293)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:174)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:100)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.readWALEntries(ReplicationSourceWALReader.java:192)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:138)
> Caused by: java.lang.UnsupportedOperationException: Unable to find 
> org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec
>       at 
> org.apache.hadoop.hbase.util.ReflectionUtils.instantiateWithCustomCtor(ReflectionUtils.java:47)
>       at 
> org.apache.hadoop.hbase.regionserver.wal.WALCellCodec.create(WALCellCodec.java:106)
>       at 
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.getCodec(ProtobufLogReader.java:301)
>       at 
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.initAfterCompression(ProtobufLogReader.java:311)
>       at 
> org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:81)
>       at 
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.init(ProtobufLogReader.java:168)
>       at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:321)
>       ... 10 more
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec
>       at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>       at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>       at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
>       at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>       at java.lang.Class.forName0(Native Method)
>       at java.lang.Class.forName(Class.java:264)
>       at 
> org.apache.hadoop.hbase.util.ReflectionUtils.instantiateWithCustomCtor(ReflectionUtils.java:43)
>       ... 16 more
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to