Hi Meiko,

I hope you were able to debug some of the issues already, but here are some things you might want to check/consider:

- The client side also needs hugepages. The out-of-memory error means the client couldn't allocate hugepages locally. The number of hugepages required is determined by the cachelimit size in the client configuration. If you have allocated hugepages on the client but still get the error, check that you have the appropriate permissions and that "crail.cachepath" is set correctly in the crail conf on the client side.
- You need to set "crail.namenode.rpctype" to the same RPC type on the namenode, the datanode and the client, e.g. "org.apache.crail.namenode.rpc.darpc.DaRPCNameNode" for RDMA.
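To make the hugepage sizing concrete, here is a minimal sketch of the arithmetic, assuming the crail.cachelimit from your crail-site.conf and the 2 MiB Hugepagesize your /proc/meminfo reports:

```shell
# Hugepages the client needs for its buffer cache alone.
# 268435456 is crail.cachelimit from your crail-site.conf;
# 2 MiB is the Hugepagesize reported in /proc/meminfo.
CACHELIMIT=268435456
HUGEPAGE_BYTES=$((2 * 1024 * 1024))
NEEDED=$((CACHELIMIT / HUGEPAGE_BYTES))
echo "hugepages needed for crail.cachelimit: $NEEDED"   # prints 128

# To reserve them on the client (as root), something like:
#   sysctl -w vm.nr_hugepages=$NEEDED
# and verify afterwards with:
#   grep -i hugepages /proc/meminfo
```

The datanode's crail.storage.rdma.storagelimit is backed by hugepages the same way (via crail.storage.rdma.datapath), so the same arithmetic should apply on that side too.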
Best,
Jonas

On Tuesday, March 25th, 2025 at 11:03, Prilop, Meiko <e12123...@student.tuwien.ac.at> wrote:

> Dear Sir or Madam,
>
> since my first email, I was able to set up Crail natively on my Ubuntu 18.04
> machine, with a namenode and datanode that recognize each other, instead of
> trying to use Docker. However, when I now try to test my setup from another
> machine using Crail's built-in tools, I run into some issues that I wasn't
> able to resolve myself.
>
> TL;DR: All approaches to setting up the namenode and datanode resulted in
> the client machine running into:
> java.lang.OutOfMemoryError: Map failed
>
> On two separate machines, I've set up Soft-iWARP, following
> https://github.com/animeshtrivedi/blog/blob/master/post/2019-06-26-siw.md
> and getting the expected output from ibv_devices.
>
> Furthermore, rping is able to establish a connection between both machines.
>
> I then set up Crail following the descriptions in
> https://crail.readthedocs.io/en/latest/source.html
> https://crail.readthedocs.io/en/latest/config.html
> where I set up my crail-site.conf on both machines to look like:
>
> crail.namenode.address           crail://128.131.57.140:9060
> crail.namenode.rpctype           org.apache.crail.namenode.rpc.darpc.DaRPCNameNode
> crail.cachepath                  /dev/hugepages/cache
> crail.regionsize                 268435456
> crail.cachelimit                 268435456
> crail.storage.types              org.apache.crail.storage.rdma.RdmaStorageTier
> crail.storage.rdma.interface     enp1s0
> crail.storage.rdma.datapath      /dev/hugepages/data
> crail.storage.rdma.storagelimit  268435456
>
> Here I drastically reduced the default sizing values. I also changed
> core-site.xml to hold the address at fs.defaultFS.
> Furthermore, when checking cat /proc/meminfo, I get on the client:
>
> MemTotal: 16424160 kB
> MemFree: 4983380 kB
> MemAvailable: 9580404 kB
> Buffers: 975056 kB
> Cached: 3347588 kB
> SwapCached: 1920 kB
> Active: 3856868 kB
> Inactive: 2377380 kB
> Active(anon): 1339104 kB
> Inactive(anon): 552592 kB
> Active(file): 2517764 kB
> Inactive(file): 1824788 kB
> Unevictable: 0 kB
> Mlocked: 0 kB
> SwapTotal: 4194300 kB
> SwapFree: 4176624 kB
> Dirty: 108 kB
> Writeback: 0 kB
> AnonPages: 1909928 kB
> Mapped: 632464 kB
> Shmem: 132512 kB
> Slab: 866876 kB
> SReclaimable: 594772 kB
> SUnreclaim: 272104 kB
> KernelStack: 16768 kB
> PageTables: 20932 kB
> NFS_Unstable: 0 kB
> Bounce: 0 kB
> WritebackTmp: 0 kB
> CommitLimit: 10309228 kB
> Committed_AS: 8433608 kB
> VmallocTotal: 34359738367 kB
> VmallocUsed: 0 kB
> VmallocChunk: 0 kB
> HardwareCorrupted: 0 kB
> AnonHugePages: 34816 kB
> ShmemHugePages: 0 kB
> ShmemPmdMapped: 0 kB
> CmaTotal: 0 kB
> CmaFree: 0 kB
> HugePages_Total: 2048
> HugePages_Free: 2044
> HugePages_Rsvd: 2044
> HugePages_Surp: 0
> Hugepagesize: 2048 kB
> DirectMap4k: 728940 kB
> DirectMap2M: 11853824 kB
> DirectMap1G: 6291456 kB
>
> And on my server:
>
> MemTotal: 16424160 kB
> MemFree: 3771716 kB
> MemAvailable: 6517888 kB
> Buffers: 309848 kB
> Cached: 2576436 kB
> SwapCached: 0 kB
> Active: 2398708 kB
> Inactive: 1465044 kB
> Active(anon): 977800 kB
> Inactive(anon): 356 kB
> Active(file): 1420908 kB
> Inactive(file): 1464688 kB
> Unevictable: 0 kB
> Mlocked: 0 kB
> SwapTotal: 4194300 kB
> SwapFree: 4194300 kB
> Dirty: 76 kB
> Writeback: 0 kB
> AnonPages: 977544 kB
> Mapped: 233700 kB
> Shmem: 820 kB
> Slab: 294456 kB
> SReclaimable: 202184 kB
> SUnreclaim: 92272 kB
> KernelStack: 6096 kB
> PageTables: 12184 kB
> NFS_Unstable: 0 kB
> Bounce: 0 kB
> WritebackTmp: 0 kB
> CommitLimit: 8212076 kB
> Committed_AS: 2616856 kB
> VmallocTotal: 34359738367 kB
> VmallocUsed: 0 kB
> VmallocChunk: 0 kB
> HardwareCorrupted: 0 kB
> AnonHugePages: 0 kB
> ShmemHugePages: 0 kB
> ShmemPmdMapped: 0 kB
> CmaTotal: 0 kB
> CmaFree: 0 kB
> HugePages_Total: 4096
> HugePages_Free: 3968
> HugePages_Rsvd: 0
> HugePages_Surp: 0
> Hugepagesize: 2048 kB
> DirectMap4k: 210796 kB
> DirectMap2M: 7129088 kB
> DirectMap1G: 11534336 kB
>
> Checking whether hugetlbfs is mounted using mount | grep huge, on both the
> server and the client the output is:
>
> hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
>
> Spinning up a namenode on the server, the output is as follows:
>
> 25/03/25 13:57:14 INFO crail: initalizing namenode
> 25/03/25 13:57:14 INFO crail: crail.version 3101
> 25/03/25 13:57:14 INFO crail: crail.directorydepth 16
> 25/03/25 13:57:14 INFO crail: crail.tokenexpiration 10
> 25/03/25 13:57:14 INFO crail: crail.blocksize 1048576
> 25/03/25 13:57:14 INFO crail: crail.cachelimit 268435456
> 25/03/25 13:57:14 INFO crail: crail.cachepath /dev/hugepages/cache
> 25/03/25 13:57:14 INFO crail: crail.user crail
> 25/03/25 13:57:14 INFO crail: crail.shadowreplication 1
> 25/03/25 13:57:14 INFO crail: crail.debug false
> 25/03/25 13:57:14 INFO crail: crail.statistics true
> 25/03/25 13:57:14 INFO crail: crail.rpctimeout 1000
> 25/03/25 13:57:14 INFO crail: crail.datatimeout 1000
> 25/03/25 13:57:14 INFO crail: crail.buffersize 1048576
> 25/03/25 13:57:14 INFO crail: crail.slicesize 524288
> 25/03/25 13:57:14 INFO crail: crail.singleton true
> 25/03/25 13:57:14 INFO crail: crail.regionsize 268435456
> 25/03/25 13:57:14 INFO crail: crail.directoryrecord 512
> 25/03/25 13:57:14 INFO crail: crail.directoryrandomize true
> 25/03/25 13:57:14 INFO crail: crail.cacheimpl org.apache.crail.memory.MappedBufferCache
> 25/03/25 13:57:14 INFO crail: crail.locationmap
> 25/03/25 13:57:14 INFO crail: crail.namenode.address crail://128.131.57.140:9060?id=0&size=1
> 25/03/25 13:57:14 INFO crail: crail.namenode.blockselection roundrobin
> 25/03/25 13:57:14 INFO crail: crail.namenode.fileblocks 16
> 25/03/25 13:57:14 INFO crail: crail.namenode.rpctype org.apache.crail.namenode.rpc.darpc.DaRPCNameNode
> 25/03/25 13:57:14 INFO crail: crail.namenode.rpcservice org.apache.crail.namenode.NameNodeService
> 25/03/25 13:57:14 INFO crail: crail.namenode.log
> 25/03/25 13:57:14 INFO crail: crail.storage.types org.apache.crail.storage.rdma.RdmaStorageTier
> 25/03/25 13:57:14 INFO crail: crail.storage.classes 1
> 25/03/25 13:57:14 INFO crail: crail.storage.rootclass 0
> 25/03/25 13:57:14 INFO crail: crail.storage.keepalive 2
> 25/03/25 13:57:14 INFO crail: crail.elasticstore.scaleup 0.4
> 25/03/25 13:57:14 INFO crail: crail.elasticstore.scaledown 0.1
> 25/03/25 13:57:14 INFO crail: crail.elasticstore.maxnodes 10
> 25/03/25 13:57:14 INFO crail: crail.elasticstore.minnodes 1
> 25/03/25 13:57:14 INFO crail: crail.elasticstore.policyrunner.interval 1000
> 25/03/25 13:57:14 INFO crail: crail.elasticstore.logging false
> 25/03/25 13:57:14 INFO crail: round robin block selection
> 25/03/25 13:57:14 INFO crail: rpc group started, recvQueue 32
> 25/03/25 13:57:14 INFO darpc: running resource management, index 0, affinity 2, timeout 2147483647
> 25/03/25 13:57:14 INFO crail: crail.namenode.darpc.polling false
> 25/03/25 13:57:14 INFO crail: crail.namenode.darpc.type passive
> 25/03/25 13:57:14 INFO crail: crail.namenode.darpc.affinity 1
> 25/03/25 13:57:14 INFO crail: crail.namenode.darpc.maxinline 0
> 25/03/25 13:57:14 INFO crail: crail.namenode.darpc.recvQueue 32
> 25/03/25 13:57:14 INFO crail: crail.namenode.darpc.sendQueue 32
> 25/03/25 13:57:14 INFO crail: crail.namenode.darpc.pollsize 32
> 25/03/25 13:57:14 INFO crail: crail.namenode.darpc.clustersize 128
> 25/03/25 13:57:14 INFO crail: crail.namenode.darpc.backlog 100
> 25/03/25 13:57:14 INFO crail: crail.namenode.darpc.connecttimeout 1000
> 25/03/25 13:57:14 INFO crail: opened server at /128.131.57.140:9060
>
> Now spinning up the datanode using $CRAIL_HOME/bin/crail datanode -t
> org.apache.crail.storage.rdma.RdmaStorageTier:
>
> 25/03/25 13:59:05 INFO crail: crail.version 3101
> 25/03/25 13:59:05 INFO crail: crail.directorydepth 16
> 25/03/25 13:59:05 INFO crail: crail.tokenexpiration 10
> 25/03/25 13:59:05 INFO crail: crail.blocksize 1048576
> 25/03/25 13:59:05 INFO crail: crail.cachelimit 268435456
> 25/03/25 13:59:05 INFO crail: crail.cachepath /dev/hugepages/cache
> 25/03/25 13:59:05 INFO crail: crail.user crail
> 25/03/25 13:59:05 INFO crail: crail.shadowreplication 1
> 25/03/25 13:59:05 INFO crail: crail.debug false
> 25/03/25 13:59:05 INFO crail: crail.statistics true
> 25/03/25 13:59:05 INFO crail: crail.rpctimeout 1000
> 25/03/25 13:59:05 INFO crail: crail.datatimeout 1000
> 25/03/25 13:59:05 INFO crail: crail.buffersize 1048576
> 25/03/25 13:59:05 INFO crail: crail.slicesize 524288
> 25/03/25 13:59:05 INFO crail: crail.singleton true
> 25/03/25 13:59:05 INFO crail: crail.regionsize 268435456
> 25/03/25 13:59:05 INFO crail: crail.directoryrecord 512
> 25/03/25 13:59:05 INFO crail: crail.directoryrandomize true
> 25/03/25 13:59:05 INFO crail: crail.cacheimpl org.apache.crail.memory.MappedBufferCache
> 25/03/25 13:59:05 INFO crail: crail.locationmap
> 25/03/25 13:59:05 INFO crail: crail.namenode.address crail://128.131.57.140:9060
> 25/03/25 13:59:05 INFO crail: crail.namenode.blockselection roundrobin
> 25/03/25 13:59:05 INFO crail: crail.namenode.fileblocks 16
> 25/03/25 13:59:05 INFO crail: crail.namenode.rpctype org.apache.crail.namenode.rpc.darpc.DaRPCNameNode
> 25/03/25 13:59:05 INFO crail: crail.namenode.rpcservice org.apache.crail.namenode.NameNodeService
> 25/03/25 13:59:05 INFO crail: crail.namenode.log
> 25/03/25 13:59:05 INFO crail: crail.storage.types org.apache.crail.storage.rdma.RdmaStorageTier
> 25/03/25 13:59:05 INFO crail: crail.storage.classes 1
> 25/03/25 13:59:05 INFO crail: crail.storage.rootclass 0
> 25/03/25 13:59:05 INFO crail: crail.storage.keepalive 2
> 25/03/25 13:59:05 INFO crail: crail.elasticstore.scaleup 0.4
> 25/03/25 13:59:05 INFO crail: crail.elasticstore.scaledown 0.1
> 25/03/25 13:59:05 INFO crail: crail.elasticstore.maxnodes 10
> 25/03/25 13:59:05 INFO crail: crail.elasticstore.minnodes 1
> 25/03/25 13:59:05 INFO crail: crail.elasticstore.policyrunner.interval 1000
> 25/03/25 13:59:05 INFO crail: crail.elasticstore.logging false
> 25/03/25 13:59:05 INFO crail: crail.storage.rdma.interface enp1s0
> 25/03/25 13:59:05 INFO crail: crail.storage.rdma.port 50020
> 25/03/25 13:59:05 INFO crail: crail.storage.rdma.storagelimit 268435456
> 25/03/25 13:59:05 INFO crail: crail.storage.rdma.allocationsize 268435456
> 25/03/25 13:59:05 INFO crail: crail.storage.rdma.datapath /dev/hugepages/data
> 25/03/25 13:59:05 INFO crail: crail.storage.rdma.localmap true
> 25/03/25 13:59:05 INFO crail: crail.storage.rdma.queuesize 32
> 25/03/25 13:59:05 INFO crail: crail.storage.rdma.type passive
> 25/03/25 13:59:05 INFO crail: crail.storage.rdma.backlog 100
> 25/03/25 13:59:05 INFO crail: crail.storage.rdma.connecttimeout 1000
> 25/03/25 13:59:05 INFO crail: rdma storage server started, address /128.131.57.140:50020, persistent false, maxWR 1, maxSge 1, cqSize 1
> 25/03/25 13:59:05 INFO crail: rpc group started, recvQueue 32
> 25/03/25 13:59:05 INFO crail: crail.namenode.darpc.polling false
> 25/03/25 13:59:05 INFO crail: crail.namenode.darpc.type passive
> 25/03/25 13:59:05 INFO crail: crail.namenode.darpc.affinity 1
> 25/03/25 13:59:05 INFO crail: crail.namenode.darpc.maxinline 0
> 25/03/25 13:59:05 INFO crail: crail.namenode.darpc.recvQueue 32
> 25/03/25 13:59:05 INFO crail: crail.namenode.darpc.sendQueue 32
> 25/03/25 13:59:05 INFO crail: crail.namenode.darpc.pollsize 32
> 25/03/25 13:59:05 INFO crail: crail.namenode.darpc.clustersize 128
> 25/03/25 13:59:05 INFO crail: crail.namenode.darpc.backlog 100
> 25/03/25 13:59:05 INFO crail: crail.namenode.darpc.connecttimeout 1000
> 25/03/25 13:59:06 INFO crail: connected to namenode(s) /128.131.57.140:9060
> 25/03/25 13:59:06 INFO crail: datanode statistics, freeBlocks 256
> 25/03/25 13:59:06 INFO crail: datanode statistics, freeBlocks 256
>
> Finally, on my client I try to test using the built-in iobench and fsck
> tools, and additionally I have set up a DiSNI client following the examples
> and the configuration found in
> https://github.com/brianfrankcooper/YCSB/tree/master/crail/src/main/java/site/ycsb/db/crail
>
> The first attempt at connecting to the namenode is the fsck ping
> experiment. With $CRAIL_HOME/bin/crail fsck -t ping I get:
>
> 25/03/25 14:10:58 INFO crail: crail.version 3101
> 25/03/25 14:10:58 INFO crail: crail.directorydepth 16
> 25/03/25 14:10:58 INFO crail: crail.tokenexpiration 10
> 25/03/25 14:10:58 INFO crail: crail.blocksize 1048576
> 25/03/25 14:10:58 INFO crail: crail.cachelimit 268435456
> 25/03/25 14:10:58 INFO crail: crail.cachepath /dev/hugepages/cache
> 25/03/25 14:10:58 INFO crail: crail.user crail
> 25/03/25 14:10:58 INFO crail: crail.shadowreplication 1
> 25/03/25 14:10:58 INFO crail: crail.debug false
> 25/03/25 14:10:58 INFO crail: crail.statistics true
> 25/03/25 14:10:58 INFO crail: crail.rpctimeout 1000
> 25/03/25 14:10:58 INFO crail: crail.datatimeout 1000
> 25/03/25 14:10:58 INFO crail: crail.buffersize 1048576
> 25/03/25 14:10:58 INFO crail: crail.slicesize 524288
> 25/03/25 14:10:58 INFO crail: crail.singleton true
> 25/03/25 14:10:58 INFO crail: crail.regionsize 268435456
> 25/03/25 14:10:58 INFO crail: crail.directoryrecord 512
> 25/03/25 14:10:58 INFO crail: crail.directoryrandomize true
> 25/03/25 14:10:58 INFO crail: crail.cacheimpl org.apache.crail.memory.MappedBufferCache
> 25/03/25 14:10:58 INFO crail: crail.locationmap
> 25/03/25 14:10:58 INFO crail: crail.namenode.address crail://128.131.57.140:9060
> 25/03/25 14:10:58 INFO crail: crail.namenode.blockselection roundrobin
> 25/03/25 14:10:58 INFO crail: crail.namenode.fileblocks 16
> 25/03/25 14:10:58 INFO crail: crail.namenode.rpctype org.apache.crail.namenode.rpc.tcp.TcpNameNode
> 25/03/25 14:10:58 INFO crail: crail.namenode.rpcservice org.apache.crail.namenode.NameNodeService
> 25/03/25 14:10:58 INFO crail: crail.namenode.log
> 25/03/25 14:10:58 INFO crail: crail.storage.types org.apache.crail.storage.rdma.RdmaStorageTier
> 25/03/25 14:10:58 INFO crail: crail.storage.classes 1
> 25/03/25 14:10:58 INFO crail: crail.storage.rootclass 0
> 25/03/25 14:10:58 INFO crail: crail.storage.keepalive 2
> 25/03/25 14:10:58 INFO crail: crail.elasticstore.scaleup 0.4
> 25/03/25 14:10:58 INFO crail: crail.elasticstore.scaledown 0.1
> 25/03/25 14:10:58 INFO crail: crail.elasticstore.maxnodes 10
> 25/03/25 14:10:58 INFO crail: crail.elasticstore.minnodes 1
> 25/03/25 14:10:58 INFO crail: crail.elasticstore.policyrunner.interval 1000
> 25/03/25 14:10:58 INFO crail: crail.elasticstore.logging false
> 25/03/25 14:10:58 INFO crail: buffer cache, allocationCount 1, bufferCount 256
> 25/03/25 14:10:58 INFO crail: crail.storage.rdma.interface enp1s0
> 25/03/25 14:10:58 INFO crail: crail.storage.rdma.port 50020
> 25/03/25 14:10:58 INFO crail: crail.storage.rdma.storagelimit 268435456
> 25/03/25 14:10:58 INFO crail: crail.storage.rdma.allocationsize 268435456
> 25/03/25 14:10:58 INFO crail: crail.storage.rdma.datapath /dev/hugepages/data
> 25/03/25 14:10:58 INFO crail: crail.storage.rdma.localmap true
> 25/03/25 14:10:58 INFO crail: crail.storage.rdma.queuesize 32
> 25/03/25 14:10:58 INFO crail: crail.storage.rdma.type passive
> 25/03/25 14:10:58 INFO crail: crail.storage.rdma.backlog 100
> 25/03/25 14:10:58 INFO crail: crail.storage.rdma.connecttimeout 1000
> 25/03/25 14:10:58 INFO narpc: new NaRPC client group v1.5.0, queueDepth 32, messageSize 512, nodealy true
> 25/03/25 14:10:58 INFO crail: crail.namenode.tcp.queueDepth 32
> 25/03/25 14:10:58 INFO crail: crail.namenode.tcp.messageSize 512
> 25/03/25 14:10:58 INFO crail: crail.namenode.tcp.cores 1
> 25/03/25 14:10:58 INFO crail: connected to namenode(s) /128.131.57.140:9060
>
> where it then gets stuck. Here I noticed that it falls back to NaRPC
> instead of DaRPC. This is the case for all experiments. Changing anything
> in the crail-site.conf on the client did not achieve anything; even setting
> it to something obviously wrong was simply ignored. On the server side,
> this ping went unnoticed.
>
> I then tried the iobench benchmark using the default $CRAIL_HOME/bin/crail
> iobench -t write -f /filename -s $((1024*1024)) -k 1024. This resulted in:
>
> 25/03/25 14:13:25 INFO crail: creating singleton crail file system
> 25/03/25 14:13:25 INFO crail: crail.version 3101
> 25/03/25 14:13:25 INFO crail: crail.directorydepth 16
> 25/03/25 14:13:25 INFO crail: crail.tokenexpiration 10
> 25/03/25 14:13:25 INFO crail: crail.blocksize 1048576
> 25/03/25 14:13:25 INFO crail: crail.cachelimit 268435456
> 25/03/25 14:13:25 INFO crail: crail.cachepath /dev/hugepages/cache
> 25/03/25 14:13:25 INFO crail: crail.user crail
> 25/03/25 14:13:25 INFO crail: crail.shadowreplication 1
> 25/03/25 14:13:25 INFO crail: crail.debug false
> 25/03/25 14:13:25 INFO crail: crail.statistics true
> 25/03/25 14:13:25 INFO crail: crail.rpctimeout 1000
> 25/03/25 14:13:25 INFO crail: crail.datatimeout 1000
> 25/03/25 14:13:25 INFO crail: crail.buffersize 1048576
> 25/03/25 14:13:25 INFO crail: crail.slicesize 524288
> 25/03/25 14:13:25 INFO crail: crail.singleton true
> 25/03/25 14:13:25 INFO crail: crail.regionsize 268435456
> 25/03/25 14:13:25 INFO crail: crail.directoryrecord 512
> 25/03/25 14:13:25 INFO crail: crail.directoryrandomize true
> 25/03/25 14:13:25 INFO crail: crail.cacheimpl org.apache.crail.memory.MappedBufferCache
> 25/03/25 14:13:25 INFO crail: crail.locationmap
> 25/03/25 14:13:25 INFO crail: crail.namenode.address crail://128.131.57.140:9060
> 25/03/25 14:13:25 INFO crail: crail.namenode.blockselection roundrobin
> 25/03/25 14:13:25 INFO crail: crail.namenode.fileblocks 16
> 25/03/25 14:13:25 INFO crail: crail.namenode.rpctype org.apache.crail.namenode.rpc.tcp.TcpNameNode
> 25/03/25 14:13:25 INFO crail: crail.namenode.rpcservice org.apache.crail.namenode.NameNodeService
> 25/03/25 14:13:25 INFO crail: crail.namenode.log
> 25/03/25 14:13:25 INFO crail: crail.storage.types org.apache.crail.storage.rdma.RdmaStorageTier
> 25/03/25 14:13:25 INFO crail: crail.storage.classes 1
> 25/03/25 14:13:25 INFO crail: crail.storage.rootclass 0
> 25/03/25 14:13:25 INFO crail: crail.storage.keepalive 2
> 25/03/25 14:13:25 INFO crail: crail.elasticstore.scaleup 0.4
> 25/03/25 14:13:25 INFO crail: crail.elasticstore.scaledown 0.1
> 25/03/25 14:13:25 INFO crail: crail.elasticstore.maxnodes 10
> 25/03/25 14:13:25 INFO crail: crail.elasticstore.minnodes 1
> 25/03/25 14:13:25 INFO crail: crail.elasticstore.policyrunner.interval 1000
> 25/03/25 14:13:25 INFO crail: crail.elasticstore.logging false
> 25/03/25 14:13:25 INFO crail: buffer cache, allocationCount 1, bufferCount 256
> 25/03/25 14:13:25 INFO crail: crail.storage.rdma.interface enp1s0
> 25/03/25 14:13:25 INFO crail: crail.storage.rdma.port 50020
> 25/03/25 14:13:25 INFO crail: crail.storage.rdma.storagelimit 268435456
> 25/03/25 14:13:25 INFO crail: crail.storage.rdma.allocationsize 268435456
> 25/03/25 14:13:25 INFO crail: crail.storage.rdma.datapath /dev/hugepages/data
> 25/03/25 14:13:25 INFO crail: crail.storage.rdma.localmap true
> 25/03/25 14:13:25 INFO crail: crail.storage.rdma.queuesize 32
> 25/03/25 14:13:25 INFO crail: crail.storage.rdma.type passive
> 25/03/25 14:13:25 INFO crail: crail.storage.rdma.backlog 100
> 25/03/25 14:13:25 INFO crail: crail.storage.rdma.connecttimeout 1000
> 25/03/25 14:13:25 INFO narpc: new NaRPC client group v1.5.0, queueDepth 32, messageSize 512, nodealy true
> 25/03/25 14:13:25 INFO crail: crail.namenode.tcp.queueDepth 32
> 25/03/25 14:13:25 INFO crail: crail.namenode.tcp.messageSize 512
> 25/03/25 14:13:25 INFO crail: crail.namenode.tcp.cores 1
> 25/03/25 14:13:25 INFO crail: connected to namenode(s) /128.131.57.140:9060
> write, filename /filename, size 1048576, loop 1024, storageClass 0, locationClass 0, buffered true
> Exception in thread "main" java.io.IOException: Map failed
>     at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:938)
>     at org.apache.crail.memory.MappedBufferCache.allocateRegion(MappedBufferCache.java:94)
>     at org.apache.crail.memory.BufferCache.allocateBuffer(BufferCache.java:95)
>     at org.apache.crail.core.CoreDataStore.allocateBuffer(CoreDataStore.java:482)
>     at org.apache.crail.tools.CrailBenchmark.write(CrailBenchmark.java:85)
>     at org.apache.crail.tools.CrailBenchmark.main(CrailBenchmark.java:1070)
> Caused by: java.lang.OutOfMemoryError: Map failed
>     at sun.nio.ch.FileChannelImpl.map0(Native Method)
>     at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:935)
>     ... 5 more
>
> where again it falls back to NaRPC instead of DaRPC. Furthermore, the call
> went unnoticed on the server side, although the client states it connected
> to the namenode.
>
> I then changed the setting on my server in crail-site.conf back to NaRPC.
> Even when I only removed this aspect of the configuration, the server was
> still able to spin up. I tested the fsck experiments again: namenodeDump,
> getLocations and ping worked; however, directoryDump, blockStatistics and
> createDirectory ran into the same java.lang.OutOfMemoryError: Map failed.
> These connection attempts were recognized on the namenode side.
>
> After that, I tried iobench again, resulting in the same OutOfMemoryError.
> Even reducing the size and loop count to a bare minimum of -s $((4*4)) -k 4,
> I still got the same issue.
>
> Next, I set the whole crail-site.conf configuration back to TCP instead of
> RDMA, to check whether this might give me some insight. With this setup, I
> got the same issues with fsck -t createDirectory and with iobench.
>
> At this point I am absolutely stuck. As Crail is a perfect fit for my
> master's thesis, I don't want to give up on finishing this setup.
>
> I hope I didn't miss any important information. I will try to find a way to
> debug this in the meantime. I am grateful for any advice.
>
> For the last email I sent, I didn't receive any reply in my email provider.
> Would it be possible to put me and my work email "meiko.prilop-...@ibm.com"
> in CC, just in case?
>
> Thanks in advance!
>
> Best,
> Meiko Prilop