[ https://issues.apache.org/jira/browse/HDDS-10882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848306#comment-17848306 ]

Wei-Chiu Chuang commented on HDDS-10882:
----------------------------------------

I think it's hard to tell where this supposed open file descriptor leak comes 
from: the Ozone client or the HBase RegionServer (most likely the Ozone 
client, though).

It's unrelated to HDDS-10488, because that one happens on the Ozone DataNode side.

Symptom:

lsof -p shows a large number of open sockets. The details are unknown because 
the entries look like this:

{noformat}
java    166480 hbase 1085u     sock                0,7       0t0  871469913 
protocol: TCP
{noformat}
The output doesn't show the state of the socket or where it connects to/from.

Moreover, netstat -p doesn't show those sockets, and nsenter doesn't reveal any 
more details.
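For reference, these anonymous entries can be counted and cross-checked against netstat like this (a sketch; PID is a placeholder for the RegionServer pid, and it assumes lsof/netstat are installed):

```shell
# PID is a placeholder for the RegionServer process id (assumption).
PID=166480

# Count fds that lsof reports only as "sock ... protocol: TCP",
# i.e. sockets with no resolvable address or state (TYPE is column 5).
lsof -p "$PID" 2>/dev/null | awk '$5 == "sock" && /protocol: TCP/' | wc -l

# Cross-check: netstat only lists sockets that appear in /proc/net/*,
# so these anonymous sockets should be absent here.
netstat -anp 2>/dev/null | grep -c "$PID/" || true
```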

Combining these two pieces of evidence suggests that the sockets were created 
but never used to listen or connect.
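This hypothesis is easy to reproduce: a freshly created, unconnected TCP socket consumes an fd but has no entry in /proc/net/tcp, which is exactly the pattern where lsof shows a bare "sock" line while netstat shows nothing. A minimal sketch, assuming python3 and lsof on a Linux host:

```shell
# Hold an unconnected TCP socket open in a background process.
python3 -c 'import socket, time; s = socket.socket(); time.sleep(30)' &
PYPID=$!
sleep 1

# lsof shows the fd as an anonymous "sock ... protocol: TCP" entry,
# with no state and no endpoints, matching the symptom above.
lsof -p "$PYPID" | awk '$5 == "sock"'

# netstat does not list it: the socket was never bound or connected.
kill "$PYPID"
```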

These two discussion threads look related: 
https://serverfault.com/questions/153983/sockets-found-by-lsof-but-not-by-netstat
https://serverfault.com/questions/1000338/in-lsof-output-what-are-those-sock-lines

Let's dig deeper using the file leak detector agent to find out where 
the sockets get created: https://file-leak-detector.kohsuke.org/
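For the record, attaching the agent is a JVM flag on the RegionServer process. The jar path, port, and threshold below are placeholders, and the option names are from my reading of the project page, so double-check them there:

```shell
# Sketch: start the JVM with the file leak detector agent attached.
# "threshold" starts dumping once that many descriptors are open;
# "http" serves the table of open fds with their creation stack traces.
java \
  -javaagent:/path/to/file-leak-detector.jar=http=19999,threshold=2000 \
  -jar your-app.jar   # placeholder for the RegionServer launch command

# Then fetch the live list of open descriptors and where each was created:
curl http://localhost:19999/
```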

> [Hbase-Ozone] Regionservers down with "newSocketStream(..) failed: Too many 
> open files"
> ---------------------------------------------------------------------------------------
>
>                 Key: HDDS-10882
>                 URL: https://issues.apache.org/jira/browse/HDDS-10882
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: SCM
>            Reporter: Pratyush Bhatt
>            Priority: Major
>
> HBase RegionServers go down with "{_}Errors$NativeIoException: 
> newSocketStream(..) failed: Too many open files{_}". Multiple errors 
> related to this appear in the log just before the RS aborts.
> {code:java}
> 2024-05-17 07:15:12,599 ERROR 
> org.apache.hadoop.hbase.regionserver.StoreEngine: Failed to commit store file 
> ofs://ozone1712905605/volhbase22-renew/buckethbase1/hbase/data/default/table_1715608461/05ba06b87308b00d9d02029812ce68c2/.tmp/family/6f4c8b33255044468b99b8711afaa47c
> org.apache.hadoop.hbase.io.hfile.CorruptHFileException: Problem reading HFile 
> Trailer from file 
> ofs://ozone1712905605/volhbase22-renew/buckethbase1/hbase/data/default/table_1715608461/05ba06b87308b00d9d02029812ce68c2/family/6f4c8b33255044468b99b8711afaa47c
>         at 
> org.apache.hadoop.hbase.io.hfile.HFileInfo.initTrailerAndContext(HFileInfo.java:349)
>         at 
> org.apache.hadoop.hbase.io.hfile.HFileInfo.<init>(HFileInfo.java:123)
>         at 
> org.apache.hadoop.hbase.regionserver.StoreFileInfo.initHFileInfo(StoreFileInfo.java:706)
>         at 
> org.apache.hadoop.hbase.regionserver.HStoreFile.open(HStoreFile.java:364)
>         at 
> org.apache.hadoop.hbase.regionserver.HStoreFile.initReader(HStoreFile.java:485)
>         at 
> org.apache.hadoop.hbase.regionserver.StoreEngine.createStoreFileAndReader(StoreEngine.java:224)
>         at 
> org.apache.hadoop.hbase.regionserver.StoreEngine.createStoreFileAndReader(StoreEngine.java:217)
>         at 
> org.apache.hadoop.hbase.regionserver.StoreEngine.commitStoreFiles(StoreEngine.java:434)
>         at 
> org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.commit(HStore.java:2045)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2838)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2573)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2545)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2415)
>         at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:602)
>         at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:572)
>         at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$1000(MemStoreFlusher.java:65)
>         at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:344)
> Caused by: java.io.IOException: java.util.concurrent.ExecutionException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNKNOWN
>         at 
> org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithRetry(XceiverClientGrpc.java:435)
>         at 
> org.apache.hadoop.hdds.scm.XceiverClientGrpc.lambda$sendCommandWithTraceIDAndRetry$0(XceiverClientGrpc.java:343)
>         at 
> org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:169)
>         at 
> org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:159)
>         at 
> org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithTraceIDAndRetry(XceiverClientGrpc.java:338)
>         at 
> org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommand(XceiverClientGrpc.java:319)
>         at 
> org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.getBlock(ContainerProtocolCalls.java:235)
>         at 
> org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.lambda$getBlock$0(ContainerProtocolCalls.java:209)
>         at 
> org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.tryEachDatanode(ContainerProtocolCalls.java:158)
>         at 
> org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.getBlock(ContainerProtocolCalls.java:208)
>         at 
> org.apache.hadoop.hdds.scm.storage.BlockInputStream.getBlockDataUsingClient(BlockInputStream.java:288)
>         at 
> org.apache.hadoop.hdds.scm.storage.BlockInputStream.getBlockData(BlockInputStream.java:260)
>         at 
> org.apache.hadoop.hdds.scm.storage.BlockInputStream.initialize(BlockInputStream.java:164)
>         at 
> org.apache.hadoop.hdds.scm.storage.BlockInputStream.readWithStrategy(BlockInputStream.java:370)
>         at 
> org.apache.hadoop.hdds.scm.storage.ExtendedInputStream.read(ExtendedInputStream.java:56)
>         at 
> org.apache.hadoop.hdds.scm.storage.ByteArrayReader.readFromBlock(ByteArrayReader.java:54)
>         at 
> org.apache.hadoop.hdds.scm.storage.MultipartInputStream.readWithStrategy(MultipartInputStream.java:96)
>         at 
> org.apache.hadoop.hdds.scm.storage.ExtendedInputStream.read(ExtendedInputStream.java:56)
>         at 
> org.apache.hadoop.fs.ozone.OzoneFSInputStream.read(OzoneFSInputStream.java:81)
>         at java.io.DataInputStream.readFully(DataInputStream.java:195)
>         at 
> org.apache.hadoop.hbase.io.hfile.FixedFileTrailer.readFromStream(FixedFileTrailer.java:394)
>         at 
> org.apache.hadoop.hbase.io.hfile.HFileInfo.initTrailerAndContext(HFileInfo.java:339)
>         ... 16 more
> Caused by: java.util.concurrent.ExecutionException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNKNOWN
>         at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
>         at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
>         at 
> org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithRetry(XceiverClientGrpc.java:406)
>         ... 37 more
> Caused by: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNKNOWN
>         at 
> org.apache.ratis.thirdparty.io.grpc.Status.asRuntimeException(Status.java:537)
>         at 
> org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:481)
>         at 
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:567)
>         at 
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:71)
>         at 
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:735)
>         at 
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:716)
>         at 
> org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
>         at 
> org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.ratis.thirdparty.io.netty.channel.ChannelException: 
> Unable to create Channel from class class 
> org.apache.ratis.thirdparty.io.netty.channel.epoll.EpollSocketChannel
>         at 
> org.apache.ratis.thirdparty.io.netty.channel.ReflectiveChannelFactory.newChannel(ReflectiveChannelFactory.java:46)
>         at 
> org.apache.ratis.thirdparty.io.netty.bootstrap.AbstractBootstrap.initAndRegister(AbstractBootstrap.java:310)
>         at 
> org.apache.ratis.thirdparty.io.netty.bootstrap.AbstractBootstrap.register(AbstractBootstrap.java:227)
>         at 
> org.apache.ratis.thirdparty.io.grpc.netty.NettyClientTransport.start(NettyClientTransport.java:265)
>         at 
> org.apache.ratis.thirdparty.io.grpc.internal.ForwardingConnectionClientTransport.start(ForwardingConnectionClientTransport.java:34)
>         at 
> org.apache.ratis.thirdparty.io.grpc.internal.ForwardingConnectionClientTransport.start(ForwardingConnectionClientTransport.java:34)
>         at 
> org.apache.ratis.thirdparty.io.grpc.internal.InternalSubchannel.startNewTransport(InternalSubchannel.java:259)
>         at 
> org.apache.ratis.thirdparty.io.grpc.internal.InternalSubchannel.access$400(InternalSubchannel.java:66)
>         at 
> org.apache.ratis.thirdparty.io.grpc.internal.InternalSubchannel$1EndOfCurrentBackoff.run(InternalSubchannel.java:280)
>         at 
> org.apache.ratis.thirdparty.io.grpc.SynchronizationContext$ManagedRunnable.run(SynchronizationContext.java:213)
>         at 
> org.apache.ratis.thirdparty.io.grpc.SynchronizationContext.drain(SynchronizationContext.java:94)
>         at 
> org.apache.ratis.thirdparty.io.grpc.SynchronizationContext.execute(SynchronizationContext.java:126)
>         at 
> org.apache.ratis.thirdparty.io.grpc.SynchronizationContext$1.run(SynchronizationContext.java:154)
>         at 
> org.apache.ratis.thirdparty.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
>         at 
> org.apache.ratis.thirdparty.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
>         at 
> org.apache.ratis.thirdparty.io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
>         at 
> org.apache.ratis.thirdparty.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
>         at 
> org.apache.ratis.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
>         at 
> org.apache.ratis.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:416)
>         at 
> org.apache.ratis.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
>         at 
> org.apache.ratis.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>         at 
> org.apache.ratis.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>         ... 1 more
> Caused by: java.lang.reflect.InvocationTargetException
>         at sun.reflect.GeneratedConstructorAccessor48.newInstance(Unknown 
> Source)
>         at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>         at 
> org.apache.ratis.thirdparty.io.netty.channel.ReflectiveChannelFactory.newChannel(ReflectiveChannelFactory.java:44)
>         ... 22 more
> Caused by: org.apache.ratis.thirdparty.io.netty.channel.ChannelException: 
> org.apache.ratis.thirdparty.io.netty.channel.unix.Errors$NativeIoException: 
> newSocketStream(..) failed: Too many open files
>         at 
> org.apache.ratis.thirdparty.io.netty.channel.unix.Socket.newSocketStream0(Socket.java:598)
>         at 
> org.apache.ratis.thirdparty.io.netty.channel.epoll.LinuxSocket.newSocketStream(LinuxSocket.java:389)
>         at 
> org.apache.ratis.thirdparty.io.netty.channel.epoll.LinuxSocket.newSocketStream(LinuxSocket.java:397)
>         at 
> org.apache.ratis.thirdparty.io.netty.channel.epoll.EpollSocketChannel.<init>(EpollSocketChannel.java:50)
>         ... 26 more
> Caused by: 
> org.apache.ratis.thirdparty.io.netty.channel.unix.Errors$NativeIoException: 
> newSocketStream(..) failed: Too many open files {code}
> And then the RS aborts just after:
> {code:java}
> 2024-05-17 07:15:13,190 ERROR 
> org.apache.hadoop.hbase.regionserver.HRegionServer: ***** ABORTING region 
> server vc0128.xyz.com,22101,1715605583305: Replay of WAL required. Forcing 
> server shutdown *****
> org.apache.hadoop.hbase.DroppedSnapshotException: region: 
> table_1715608461,user2349,1715608475244.05ba06b87308b00d9d02029812ce68c2.
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2896)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2573)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2545)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2415)
>         at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:602)
>         at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:572)
>         at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$1000(MemStoreFlusher.java:65)
>         at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:344)
> Caused by: java.io.IOException: Failed to commit the flush
>         at 
> org.apache.hadoop.hbase.regionserver.StoreEngine.commitStoreFiles(StoreEngine.java:451)
>         at 
> org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.commit(HStore.java:2045)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2838)
>         ... 7 more {code}
> The aborted RegionServers stay down with the same issue, and soon after all 
> RegionServers are down, the HMasters also crash.
> {code:java}
> 2024-05-17 08:05:32,502 ERROR org.apache.hadoop.hbase.master.HMaster: ***** 
> ABORTING master vc0121.xyz.com,22001,1715605585306: FAILED persisting 
> region=e46df411ff7ff6517882a251d7ea469f state=OPENING *****
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
> attempts=31, exceptions:
> 2024-05-17T14:58:21.993Z, 
> RpcRetryingCaller{globalStartTime=2024-05-17T14:58:21.897Z, pause=100, 
> maxAttempts=31}, java.io.IOException: java.io.IOException: Cannot append; log 
> is closed, regionName = hbase:meta,,1.1588230740
>         at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.stampSequenceIdAndPublishToRingBuffer(AbstractFSWAL.java:1173)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.append(FSHLog.java:515)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.appendData(AbstractFSWAL.java:1232)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.doWALAppend(HRegion.java:8463)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutate(HRegion.java:4520)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:4444)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:4374)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.doBatchMutate(HRegion.java:4850)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.doBatchMutate(HRegion.java:4844)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.doBatchMutate(HRegion.java:4840)
>         at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:3152)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.put(RSRpcServices.java:3006)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.mutate(RSRpcServices.java:2969)
>         at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:45947)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:387)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:139)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:369)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:349) 
> {code}
> cc: [~sammichen] [~ashishk] [~weichiu]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
