ChenSammi opened a new pull request, #6690:
URL: https://github.com/apache/ozone/pull/6690

   ## What changes were proposed in this pull request?
   A DN ran out of memory (OOM) in a test cluster with the following stack trace.
   
   ```
   6:52:03.601 AM  WARN  KeyValueHandler Operation: ReadChunk , Trace ID:  , Message: java.io.IOException: Map failed , Result: IO_EXCEPTION , StorageContainerException Occurred.
   
   org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException: java.io.IOException: Map failed
     at org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.wrapInStorageContainerException(ChunkUtils.java:471)
     at org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.readData(ChunkUtils.java:226)
     at org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.readData(ChunkUtils.java:260)
     at org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.readData(ChunkUtils.java:194)
     at org.apache.hadoop.ozone.container.keyvalue.impl.FilePerBlockStrategy.readChunk(FilePerBlockStrategy.java:197)
     at org.apache.hadoop.ozone.container.keyvalue.impl.ChunkManagerDispatcher.readChunk(ChunkManagerDispatcher.java:112)
     at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handleReadChunk(KeyValueHandler.java:773)
     at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.dispatchRequest(KeyValueHandler.java:262)
     at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handle(KeyValueHandler.java:225)
     at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(HddsDispatcher.java:335)
     at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.lambda$dispatch$0(HddsDispatcher.java:183)
     at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89)
     at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(HddsDispatcher.java:182)
     at org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService$1.onNext(GrpcXceiverService.java:112)
     at org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService$1.onNext(GrpcXceiverService.java:105)
     at org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:262)
     at org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener.onMessage(ForwardingServerCallListener.java:33)
     at org.apache.hadoop.hdds.tracing.GrpcServerInterceptor$1.onMessage(GrpcServerInterceptor.java:49)
     at org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailableInternal(ServerCallImpl.java:329)
     at org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:314)
     at org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:833)
     at org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
     at org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
     at java.lang.Thread.run(Thread.java:748)
   Caused by: java.io.IOException: Map failed
     at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:938)
     at org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.lambda$readData$5(ChunkUtils.java:264)
     at org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.lambda$readData$4(ChunkUtils.java:218)
     at org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.processFileExclusively(ChunkUtils.java:411)
     at org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.readData(ChunkUtils.java:215)
     ... 24 more
   Caused by: java.lang.OutOfMemoryError: Map failed
     at sun.nio.ch.FileChannelImpl.map0(Native Method)
     at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:935)
     ... 28 more
   ```
   
   The root cause is a platform limit on the number of mapped regions per 
process.  On Linux this limit (max_map_count) is 65530 by default. 
Every FileChannel.map() call consumes one entry of this quota.  Once the DN 
exhausts max_map_count, the next map call fails with this OOM exception.  
   A mapped buffer is released by Java GC once the data has been sent out.  
Under a heavy read workload, the DN may therefore exceed max_map_count at 
some point. 
   
   This task adds an upper limit, "ozone.chunk.read.mapped.buffer.max.count", 
which defaults to 0. 
   Since max_map_count can vary from platform to platform, or even on the 
same platform, it is better to let the admin/user choose an appropriate max 
count value themselves. 
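
As a rough illustration of the idea, the sketch below bounds the number of live mapped buffers with a semaphore and falls back to a plain heap read when no permit is free. It is not the code from this PR: `BoundedMappedReader`, its methods, the heap fallback, and the rule that a non-positive limit disables mapping are all assumptions made for the example. The explicit `release()` call is also a simplification; the DN logs in this description show permits being returned from the Finalizer thread instead.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Semaphore;

/** Hypothetical helper for illustration only; not a class from Ozone. */
final class BoundedMappedReader {
  // One permit per mapped buffer that may be alive at the same time.
  // Assumption of this sketch: a limit <= 0 disables mapping entirely.
  private final Semaphore permits;

  BoundedMappedReader(int maxMappedBuffers) {
    this.permits = maxMappedBuffers > 0 ? new Semaphore(maxMappedBuffers) : null;
  }

  /** Returns a mapped buffer if a permit is free, otherwise a heap copy. */
  ByteBuffer read(Path file, long offset, int len) throws IOException {
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
      if (permits != null && permits.tryAcquire()) {
        try {
          // The mapping stays valid after the channel is closed.
          return ch.map(FileChannel.MapMode.READ_ONLY, offset, len);
        } catch (IOException | OutOfMemoryError e) {
          // Mapping failed (for example "Map failed"): give the permit back
          // and fall through to the heap read below.
          permits.release();
        }
      }
      ByteBuffer heap = ByteBuffer.allocate(len);
      long pos = offset;
      while (heap.hasRemaining()) {
        int n = ch.read(heap, pos);
        if (n < 0) {
          break; // reached end of file
        }
        pos += n;
      }
      heap.flip();
      return heap;
    }
  }

  /** Returns the permit held by a mapped buffer; heap buffers hold none. */
  void release(ByteBuffer buf) {
    if (permits != null && buf.isDirect()) {
      permits.release();
    }
  }

  int availablePermits() {
    return permits == null ? 0 : permits.availablePermits();
  }

  public static void main(String[] args) throws IOException {
    Path f = Files.createTempFile("chunk", ".bin");
    Files.write(f, new byte[1024]);
    BoundedMappedReader reader = new BoundedMappedReader(1);
    ByteBuffer first = reader.read(f, 0, 1024);  // takes the only permit
    ByteBuffer second = reader.read(f, 0, 1024); // no permit left, heap read
    System.out.println(first.isDirect() + " " + second.isDirect());
    reader.release(first);
  }
}
```

Using `tryAcquire()` rather than a blocking `acquire()` keeps readers from stalling when the limit is reached; whether to block, fall back, or fail is a design choice this sketch makes on its own.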
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-10488
   
   ## How was this patch tested?
   Manual test
   1. setup a docker cluster
   2. put a 56MB file
   3. get the same file
   4. check DN logs
       ```
   2024-05-17 07:25:26,649 [490a03d2-63e7-45c0-be99-2c534a32c625-ChunkReader-4] INFO helpers.ChunkUtils: memmap semaphore permits decreased by 1 to total 127
   2024-05-17 07:25:26,653 [490a03d2-63e7-45c0-be99-2c534a32c625-ChunkReader-4] INFO helpers.ChunkUtils: mapped: offset=0, readLen=0, n=1048576, class java.nio.DirectByteBufferR
   2024-05-17 07:25:26,713 [490a03d2-63e7-45c0-be99-2c534a32c625-ChunkReader-9] INFO helpers.ChunkUtils: memmap semaphore permits decreased by 1 to total 126
   2024-05-17 07:25:26,714 [490a03d2-63e7-45c0-be99-2c534a32c625-ChunkReader-9] INFO helpers.ChunkUtils: mapped: offset=0, readLen=0, n=1048576, class java.nio.DirectByteBufferR
   2024-05-17 07:25:26,747 [490a03d2-63e7-45c0-be99-2c534a32c625-ChunkReader-5] INFO helpers.ChunkUtils: memmap semaphore permits decreased by 1 to total 125
   2024-05-17 07:25:26,748 [490a03d2-63e7-45c0-be99-2c534a32c625-ChunkReader-5] INFO helpers.ChunkUtils: mapped: offset=0, readLen=0, n=1048576, class java.nio.DirectByteBufferR
   2024-05-17 07:25:26,765 [490a03d2-63e7-45c0-be99-2c534a32c625-ChunkReader-1] INFO helpers.ChunkUtils: memmap semaphore permits decreased by 1 to total 124
   2024-05-17 07:25:26,765 [490a03d2-63e7-45c0-be99-2c534a32c625-ChunkReader-1] INFO helpers.ChunkUtils: mapped: offset=0, readLen=0, n=1048576, class java.nio.DirectByteBufferR
   2024-05-17 07:25:26,889 [490a03d2-63e7-45c0-be99-2c534a32c625-ChunkReader-5] INFO helpers.ChunkUtils: memmap semaphore permits decreased by 1 to total 123
   2024-05-17 07:25:26,889 [490a03d2-63e7-45c0-be99-2c534a32c625-ChunkReader-5] INFO helpers.ChunkUtils: mapped: offset=0, readLen=0, n=1048576, class java.nio.DirectByteBufferR
   2024-05-17 07:25:26,905 [490a03d2-63e7-45c0-be99-2c534a32c625-ChunkReader-8] INFO helpers.ChunkUtils: memmap semaphore permits decreased by 1 to total 122
   2024-05-17 07:25:26,905 [490a03d2-63e7-45c0-be99-2c534a32c625-ChunkReader-8] INFO helpers.ChunkUtils: mapped: offset=0, readLen=0, n=1048576, class java.nio.DirectByteBufferR
   2024-05-17 07:25:26,929 [490a03d2-63e7-45c0-be99-2c534a32c625-ChunkReader-1] INFO helpers.ChunkUtils: memmap semaphore permits decreased by 1 to total 121
   2024-05-17 07:25:26,930 [490a03d2-63e7-45c0-be99-2c534a32c625-ChunkReader-1] INFO helpers.ChunkUtils: mapped: offset=0, readLen=0, n=1048576, class java.nio.DirectByteBufferR
   2024-05-17 07:26:05,652 [490a03d2-63e7-45c0-be99-2c534a32c625-BlockDeletingService#0] INFO interfaces.ContainerDeletionChoosingPolicyTemplate: Chosen 0/5000 blocks from 0 candidate containers.
   2024-05-17 07:27:05,654 [490a03d2-63e7-45c0-be99-2c534a32c625-BlockDeletingService#2] INFO interfaces.ContainerDeletionChoosingPolicyTemplate: Chosen 0/5000 blocks from 0 candidate containers.
   2024-05-17 07:28:05,655 [490a03d2-63e7-45c0-be99-2c534a32c625-BlockDeletingService#1] INFO interfaces.ContainerDeletionChoosingPolicyTemplate: Chosen 0/5000 blocks from 0 candidate containers.
   2024-05-17 07:28:51,294 [Finalizer] INFO helpers.ChunkUtils: memmap semaphore permits increased by 1 to total 122
   2024-05-17 07:28:51,294 [Finalizer] INFO helpers.ChunkUtils: memmap semaphore permits increased by 1 to total 123
   2024-05-17 07:28:51,295 [Finalizer] INFO helpers.ChunkUtils: memmap semaphore permits increased by 1 to total 124
   2024-05-17 07:28:51,295 [Finalizer] INFO helpers.ChunkUtils: memmap semaphore permits increased by 1 to total 125
   2024-05-17 07:28:51,295 [Finalizer] INFO helpers.ChunkUtils: memmap semaphore permits increased by 1 to total 126
   2024-05-17 07:28:51,295 [Finalizer] INFO helpers.ChunkUtils: memmap semaphore permits increased by 1 to total 127
   2024-05-17 07:28:51,295 [Finalizer] INFO helpers.ChunkUtils: memmap semaphore permits increased by 1 to total 128
   ```
   
   pmap was used to verify the mapped region status.  After the memmap 
semaphore permits were released, the mapped regions no longer appeared in the 
pmap output. 
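
A similar check can be made from inside a JVM on Linux. The sketch below reads the process-wide limit from `/proc/sys/vm/max_map_count` and counts the entries in `/proc/self/maps`, where each line is one mapped region. `MapCountProbe` and its method names are made up for this example, it inspects its own process rather than the DN, and it returns -1 where procfs is not available.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical Linux-only probe for illustration; not part of Ozone. */
final class MapCountProbe {
  private static final Path LIMIT = Paths.get("/proc/sys/vm/max_map_count");
  private static final Path MAPS = Paths.get("/proc/self/maps");

  /** Returns the configured max_map_count, or -1 if it cannot be read. */
  static long maxMapCount() {
    try (BufferedReader in =
             Files.newBufferedReader(LIMIT, StandardCharsets.ISO_8859_1)) {
      String line = in.readLine();
      return line == null ? -1 : Long.parseLong(line.trim());
    } catch (IOException | NumberFormatException | SecurityException e) {
      return -1;
    }
  }

  /** Returns the number of mapped regions of this JVM, or -1 if unknown. */
  static long currentMappings() {
    try (BufferedReader in =
             Files.newBufferedReader(MAPS, StandardCharsets.ISO_8859_1)) {
      long count = 0;
      while (in.readLine() != null) {
        count++; // one line per mapped region
      }
      return count;
    } catch (IOException | SecurityException e) {
      return -1;
    }
  }

  public static void main(String[] args) {
    System.out.println("max_map_count    = " + maxMapCount());
    System.out.println("current mappings = " + currentMappings());
  }
}
```

Comparing the two numbers before and after a read burst gives a rough view of how close a process is to the limit, which may help when picking a value for the new max count setting.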
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

