[ https://issues.apache.org/jira/browse/HDDS-10488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sammi Chen updated HDDS-10488:
------------------------------
    Description: 
When I run the command "yarn jar 
/**/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO 
-Dfs.ofs.impl=org.apache.hadoop.fs.ozone.RootedOzoneFileSystem 
-Dfs.AbstractFileSystem.ofs.impl=org.apache.hadoop.fs.ozone.RootedOzFs 
-Dfs.defaultFS=ofs://ozone1708515436 -Dozone.client.bytes.per.checksum=1KB 
-Dtest.build.data=ofs://ozone1708515436/s3v/testdfsio -write -nrFiles 64 
-fileSize 1024MB" on an installed Ozone cluster, several DNs crashed due to OOM. 
Following is the exception stack:
{code:java}
6:52:03.601 AM  WARN  KeyValueHandler Operation: ReadChunk , Trace ID:  , 
Message: java.io.IOException: Map failed , Result: IO_EXCEPTION , 
StorageContainerException Occurred.
org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException: 
java.io.IOException: Map failed
  at 
org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.wrapInStorageContainerException(ChunkUtils.java:471)
  at 
org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.readData(ChunkUtils.java:226)
  at 
org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.readData(ChunkUtils.java:260)
  at 
org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.readData(ChunkUtils.java:194)
  at 
org.apache.hadoop.ozone.container.keyvalue.impl.FilePerBlockStrategy.readChunk(FilePerBlockStrategy.java:197)
  at 
org.apache.hadoop.ozone.container.keyvalue.impl.ChunkManagerDispatcher.readChunk(ChunkManagerDispatcher.java:112)
  at 
org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handleReadChunk(KeyValueHandler.java:773)
  at 
org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.dispatchRequest(KeyValueHandler.java:262)
  at 
org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handle(KeyValueHandler.java:225)
  at 
org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(HddsDispatcher.java:335)
  at 
org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.lambda$dispatch$0(HddsDispatcher.java:183)
  at 
org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89)
  at 
org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(HddsDispatcher.java:182)
  at 
org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService$1.onNext(GrpcXceiverService.java:112)
  at 
org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService$1.onNext(GrpcXceiverService.java:105)
  at 
org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:262)
  at 
org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener.onMessage(ForwardingServerCallListener.java:33)
  at 
org.apache.hadoop.hdds.tracing.GrpcServerInterceptor$1.onMessage(GrpcServerInterceptor.java:49)
  at 
org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailableInternal(ServerCallImpl.java:329)
  at 
org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:314)
  at 
org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:833)
  at 
org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
  at 
org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Map failed
  at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:938)
  at 
org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.lambda$readData$5(ChunkUtils.java:264)
  at 
org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.lambda$readData$4(ChunkUtils.java:218)
  at 
org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.processFileExclusively(ChunkUtils.java:411)
  at 
org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.readData(ChunkUtils.java:215)
  ... 24 more
Caused by: java.lang.OutOfMemoryError: Map failed
  at sun.nio.ch.FileChannelImpl.map0(Native Method)
  at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:935)
  ... 28 more
{code}
In the "Dynamic libraries" section of file hs_err_pid1560151.log, there are 
261425 mapped regions for different block files, for example 
"/hadoop-ozone/datanode/data/hdds/CID-303529f3-9f2b-4427-b389-6909971e960a/current/containerDir3/2005/chunks/113750153625618247.block"

On OS level, the max mmap handler count is saved in file 
/proc/sys/vm/max_map_count, which has value 262144.
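
As a quick way to compare a process's current number of mapped regions against 
that limit on a Linux host, a small standalone check like the one below can be 
used. This is only a diagnostic sketch (the class name MapCountCheck is made up 
and it is not part of Ozone); pass the datanode pid as the argument to inspect 
the DN process.
{code:java}
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class MapCountCheck {
  public static void main(String[] args) throws Exception {
    // Pass the DN pid to inspect the datanode; defaults to this JVM itself.
    String pid = args.length > 0 ? args[0] : "self";
    // Per-process limit on the number of memory-mapped regions (Linux only).
    String limit = Files.readAllLines(Paths.get("/proc/sys/vm/max_map_count")).get(0).trim();
    // /proc/<pid>/maps lists one mapped region per line.
    try (Stream<String> maps = Files.lines(Paths.get("/proc", pid, "maps"))) {
      System.out.println("vm.max_map_count=" + limit + ", mapped regions=" + maps.count());
    }
  }
}
{code}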

HDDS-7117 introduced MappedByteBuffer to improve chunk read performance. 
The property "ozone.chunk.read.mapped.buffer.threshold", whose value is 32KB, is 
defined as the cut-off that decides whether a MappedByteBuffer or a regular 
ByteBuffer is used to read data. 
If the read length is smaller than "ozone.chunk.read.mapped.buffer.threshold", a 
MappedByteBuffer should not be used, but this is not enforced in the current 
implementation. 
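
For illustration, a minimal sketch of the intended decision is below. The method 
name readPiece and the way the threshold is passed in are hypothetical and do 
not reflect the actual ChunkUtils code; the point is only that reads below the 
threshold should go through a regular buffer and never call FileChannel.map.
{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

final class ThresholdedRead {
  // Hypothetical helper: enforce the mapped-buffer threshold on every read.
  static ByteBuffer readPiece(FileChannel channel, long offset, int len,
      int mappedBufferThreshold) throws IOException {
    if (len < mappedBufferThreshold) {
      // Small read: copy into a regular buffer, no mmap region is consumed.
      ByteBuffer buf = ByteBuffer.allocate(len);
      channel.read(buf, offset);   // short reads are ignored to keep the sketch small
      buf.flip();
      return buf;
    }
    // Large read: map the range read-only instead of copying it.
    return channel.map(FileChannel.MapMode.READ_ONLY, offset, len);
  }
}
{code}
With a check like this, the 8192-byte pieces in the logs below would fall back 
to regular buffers instead of each consuming a mapped region.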
Here are the logs when the debug log level is enabled on the DN:
{code:java}
2024-02-27 15:19:28,676 DEBUG 
[f22679a0-7e8c-4006-a6ab-874736e9c75a-ChunkReader-3]-org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils:
 mapped: offset=213565440, readLen=0, n=8192, class java.nio.DirectByteBufferR
2024-02-27 15:19:28,676 DEBUG 
[f22679a0-7e8c-4006-a6ab-874736e9c75a-ChunkReader-3]-org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils:
 mapped: offset=213565440, readLen=8192, n=8192, class 
java.nio.DirectByteBufferR
...
2024-02-27 15:19:28,676 DEBUG 
[f22679a0-7e8c-4006-a6ab-874736e9c75a-ChunkReader-3]-org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils:
 mapped: offset=213565440, readLen=327680, n=8192, class 
java.nio.DirectByteBufferR
2024-02-27 15:19:28,676 DEBUG 
[f22679a0-7e8c-4006-a6ab-874736e9c75a-ChunkReader-3]-org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils:
 mapped: offset=213565440, readLen=335872, n=8192, class 
java.nio.DirectByteBufferR
2024-02-27 15:19:28,676 DEBUG 
[f22679a0-7e8c-4006-a6ab-874736e9c75a-ChunkReader-3]-org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils:
 Read 344064 bytes starting at offset 213565440 from 
/hadoop-ozone/datanode/data/hdds/CID-303529f3-9f2b-4427-b389-6909971e960a/current/containerDir9/5007/chunks/113750153625622210.block

{code}
Due to this, every 8192-byte piece of a chunk read gets its own mapped buffer; 
the single 344064-byte read above already creates 42 DirectByteBufferR mappings. 
These mappings accumulate until the vm.max_map_count limit is reached, after 
which further map calls fail and the DN crashes with the OutOfMemoryError shown 
above.
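
As rough back-of-the-envelope arithmetic (using only the numbers from the log 
above, and assuming the mapped buffers are not released in between, which in 
practice depends on GC), the limit is reached quickly:
{code:java}
long readLen = 344064;       // bytes of one ReadChunk, from the last log line above
long pieceSize = 8192;       // n=8192: every piece gets its own mapping
long maxMapCount = 262144;   // /proc/sys/vm/max_map_count

long mappingsPerRead = readLen / pieceSize;            // 42 mappings per chunk read
long readsUntilLimit = maxMapCount / mappingsPerRead;  // ~6241 chunk reads to exhaust the limit
{code}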


> Datanode OOM due to running out of mmap handlers
> --------------------------------------------
>
>                 Key: HDDS-10488
>                 URL: https://issues.apache.org/jira/browse/HDDS-10488
>             Project: Apache Ozone
>          Issue Type: Improvement
>            Reporter: Sammi Chen
>            Assignee: Sammi Chen
>            Priority: Major
>


