I wrote a small test program for Meiko that probes the maximum possible memory registration per RDMA device. On the client side it showed a maximum registrable memory region of just 32 MB. This is almost nothing; it is far below the 268435456 bytes (256 MB) that the client's crail.cachelimit and crail.regionsize ask for. The finding should be sufficient to close the current issue from the Crail side.
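To put the 32 MB in relation to the configuration quoted below, here is a minimal sketch. It is not the test program itself and it does not touch any RDMA device: it only compares sizes taken from the quoted crail-site.conf and logs against a limit you pass in. The helper name `settings_over_limit` is hypothetical, and which of these sizes Crail actually registers in one piece is an assumption, not something established in this thread.

```python
# Hypothetical helper, not part of Crail: list the configured sizes that are
# larger than a measured per-device registration limit.
MIB = 1024 * 1024


def settings_over_limit(settings, limit_bytes):
    """Return {name: size_in_MiB} for every setting larger than limit_bytes."""
    return {
        name: size // MIB
        for name, size in settings.items()
        if size > limit_bytes
    }


# Sizes copied from the crail-site.conf and log output quoted in this thread.
client_settings = {
    "crail.cachelimit": 268435456,
    "crail.regionsize": 268435456,
    "crail.storage.rdma.storagelimit": 268435456,
    "crail.buffersize": 1048576,
}

# The limit observed on the client. Assumption: it applies to each region
# that gets registered as a whole.
measured_limit = 32 * MIB

for name, mib in settings_over_limit(client_settings, measured_limit).items():
    print(f"{name}: {mib} MiB exceeds the {measured_limit // MIB} MiB limit")
```

With the quoted values, the three 256 MiB settings are reported and crail.buffersize (1 MiB) is not.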
Best,
Bernard.

> -----Original Message-----
> From: Jonas <peppe...@japf.ch>
> Sent: Saturday, April 5, 2025 4:17 PM
> To: dev@crail.apache.org
> Cc: d...@crail.incubator.apache.org; Bernard Metzler <b...@zurich.ibm.com>; Thomas Pusztai <t.pusz...@dsg.tuwien.ac.at>; Meiko Prilop <Meiko.Prilop-c...@ibm.com>
> Subject: [EXTERNAL] Re: Master thesis - Problems setting up Crail
>
> Hi Meiko,
>
> I hope you were able to debug some of the issues already, but here are some things you might want to check/consider:
> - The client side also needs hugepages. The out-of-memory error means it couldn't allocate hugepages locally. The number of hugepages required is determined by the cachelimit size in the client configuration. If you have allocated hugepages on the client but still get the error, check whether you have the appropriate permissions and whether "crail.cachepath" is set correctly in the Crail configuration on the client side.
> - You need to set "crail.namenode.rpctype" to the same RPC type on the namenode, the datanode and the client, e.g. "org.apache.crail.namenode.rpc.darpc.DaRPCNameNode" for RDMA.
>
> Best,
> Jonas
>
> On Tuesday, March 25th, 2025 at 11:03, Prilop, Meiko <e12123...@student.tuwien.ac.at> wrote:
>
> > Dear Sir or Madam,
> >
> > since my first email, I was able to set up Crail natively on my Ubuntu 18.04 machine, with a namenode and a datanode that recognize each other, instead of trying to use Docker. However, when I now try to test my setup using Crail's built-in tools on another machine, I get some issues that I wasn't able to resolve myself.
> >
> > TLDR: All approaches to setting up the namenode and datanode resulted in the client machine running into:
> > java.lang.OutOfMemoryError: Map failed
> >
> > On two machines separately, I've set up Soft-iWARP, using
> > https://github.com/animeshtrivedi/blog/blob/master/post/2019-06-26-siw.md
> > and getting the expected output using ibv_devices.
> >
> > Further, rping is able to establish a connection between both machines.
> >
> > I then set up Crail following the description of
> > https://crail.readthedocs.io/en/latest/source.html
> > https://crail.readthedocs.io/en/latest/config.html
> > where I set up my crail-site.conf to look like:
> >
> > crail.namenode.address crail://128.131.57.140:9060
> > crail.namenode.rpctype org.apache.crail.namenode.rpc.darpc.DaRPCNameNode
> > crail.cachepath /dev/hugepages/cache
> > crail.regionsize 268435456
> > crail.cachelimit 268435456
> > crail.storage.types org.apache.crail.storage.rdma.RdmaStorageTier
> > crail.storage.rdma.interface enp1s0
> > crail.storage.rdma.datapath /dev/hugepages/data
> > crail.storage.rdma.storagelimit 268435456
> >
> >
> > On both machines. Here I drastically reduced the default values on sizing. I changed core-site.xml to hold the address here as well at fs.defaultFS.
> >
> > Further, when checking cat /proc/meminfo I get for the client:
> >
> > MemTotal: 16424160 kB
> > MemFree: 4983380 kB
> > MemAvailable: 9580404 kB
> > Buffers: 975056 kB
> > Cached: 3347588 kB
> > SwapCached: 1920 kB
> > Active: 3856868 kB
> > Inactive: 2377380 kB
> > Active(anon): 1339104 kB
> > Inactive(anon): 552592 kB
> > Active(file): 2517764 kB
> > Inactive(file): 1824788 kB
> > Unevictable: 0 kB
> > Mlocked: 0 kB
> > SwapTotal: 4194300 kB
> > SwapFree: 4176624 kB
> > Dirty: 108 kB
> > Writeback: 0 kB
> > AnonPages: 1909928 kB
> > Mapped: 632464 kB
> > Shmem: 132512 kB
> > Slab: 866876 kB
> > SReclaimable: 594772 kB
> > SUnreclaim: 272104 kB
> > KernelStack: 16768 kB
> > PageTables: 20932 kB
> > NFS_Unstable: 0 kB
> > Bounce: 0 kB
> > WritebackTmp: 0 kB
> > CommitLimit: 10309228 kB
> > Committed_AS: 8433608 kB
> > VmallocTotal: 34359738367 kB
> > VmallocUsed: 0 kB
> > VmallocChunk: 0 kB
> > HardwareCorrupted: 0 kB
> > AnonHugePages: 34816 kB
> > ShmemHugePages: 0 kB
> > ShmemPmdMapped: 0 kB
> > CmaTotal: 0 kB
> > CmaFree: 0 kB
> > HugePages_Total: 2048
> > HugePages_Free: 2044
> > HugePages_Rsvd: 2044
> > HugePages_Surp: 0
> > Hugepagesize: 2048 kB
> > DirectMap4k: 728940 kB
> > DirectMap2M: 11853824 kB
> > DirectMap1G: 6291456 kB
> >
> > And on my server:
> >
> > MemTotal: 16424160 kB
> > MemFree: 3771716 kB
> > MemAvailable: 6517888 kB
> > Buffers: 309848 kB
> > Cached: 2576436 kB
> > SwapCached: 0 kB
> > Active: 2398708 kB
> > Inactive: 1465044 kB
> > Active(anon): 977800 kB
> > Inactive(anon): 356 kB
> > Active(file): 1420908 kB
> > Inactive(file): 1464688 kB
> > Unevictable: 0 kB
> > Mlocked: 0 kB
> > SwapTotal: 4194300 kB
> > SwapFree: 4194300 kB
> > Dirty: 76 kB
> > Writeback: 0 kB
> > AnonPages: 977544 kB
> > Mapped: 233700 kB
> > Shmem: 820 kB
> > Slab: 294456 kB
> > SReclaimable: 202184 kB
> > SUnreclaim: 92272 kB
> > KernelStack: 6096 kB
> > PageTables: 12184 kB
> > NFS_Unstable: 0 kB
> > Bounce: 0 kB
> > WritebackTmp: 0 kB
> > CommitLimit: 8212076 kB
> > Committed_AS: 2616856 kB
> > VmallocTotal: 34359738367 kB
> > VmallocUsed: 0 kB
> > VmallocChunk: 0 kB
> > HardwareCorrupted: 0 kB
> > AnonHugePages: 0 kB
> > ShmemHugePages: 0 kB
> > ShmemPmdMapped: 0 kB
> > CmaTotal: 0 kB
> > CmaFree: 0 kB
> > HugePages_Total: 4096
> > HugePages_Free: 3968
> > HugePages_Rsvd: 0
> > HugePages_Surp: 0
> > Hugepagesize: 2048 kB
> > DirectMap4k: 210796 kB
> > DirectMap2M: 7129088 kB
> > DirectMap1G: 11534336 kB
> >
> > Checking whether my hugepages are mounted using mount | grep huge on the server and client, the output is:
> > hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
> >
> > Spinning up a namenode on the server, the output is as follows:
> > 25/03/25 13:57:14 INFO crail: initalizing namenode
> > 25/03/25 13:57:14 INFO crail: crail.version 3101
> > 25/03/25 13:57:14 INFO crail: crail.directorydepth 16
> > 25/03/25 13:57:14 INFO crail: crail.tokenexpiration 10
> > 25/03/25 13:57:14 INFO crail: crail.blocksize 1048576
> > 25/03/25 13:57:14 INFO crail: crail.cachelimit 268435456
> > 25/03/25 13:57:14 INFO crail: crail.cachepath /dev/hugepages/cache
> > 25/03/25 13:57:14 INFO crail: crail.user crail
> > 25/03/25 13:57:14 INFO crail: crail.shadowreplication 1
> > 25/03/25 13:57:14 INFO crail: crail.debug false
> > 25/03/25 13:57:14 INFO crail: crail.statistics true
> > 25/03/25 13:57:14 INFO crail: crail.rpctimeout 1000
> > 25/03/25 13:57:14 INFO crail: crail.datatimeout 1000
> > 25/03/25 13:57:14 INFO crail: crail.buffersize 1048576
> > 25/03/25 13:57:14 INFO crail: crail.slicesize 524288
> > 25/03/25 13:57:14 INFO crail: crail.singleton true
> > 25/03/25 13:57:14 INFO crail: crail.regionsize 268435456
> > 25/03/25 13:57:14 INFO crail: crail.directoryrecord 512
> > 25/03/25 13:57:14 INFO crail: crail.directoryrandomize true
> > 25/03/25 13:57:14 INFO crail: crail.cacheimpl org.apache.crail.memory.MappedBufferCache
> > 25/03/25 13:57:14 INFO crail: crail.locationmap
> > 25/03/25 13:57:14 INFO crail: crail.namenode.address crail://128.131.57.140:9060?id=0&size=1
> > 25/03/25 13:57:14 INFO crail: crail.namenode.blockselection roundrobin
> > 25/03/25 13:57:14 INFO crail: crail.namenode.fileblocks 16
> > 25/03/25 13:57:14 INFO crail: crail.namenode.rpctype org.apache.crail.namenode.rpc.darpc.DaRPCNameNode
> > 25/03/25 13:57:14 INFO crail: crail.namenode.rpcservice org.apache.crail.namenode.NameNodeService
> > 25/03/25 13:57:14 INFO crail: crail.namenode.log
> > 25/03/25 13:57:14 INFO crail: crail.storage.types org.apache.crail.storage.rdma.RdmaStorageTier
> > 25/03/25 13:57:14 INFO crail: crail.storage.classes 1
> > 25/03/25 13:57:14 INFO crail: crail.storage.rootclass 0
> > 25/03/25 13:57:14 INFO crail: crail.storage.keepalive 2
> > 25/03/25 13:57:14 INFO crail: crail.elasticstore.scaleup 0.4
> > 25/03/25 13:57:14 INFO crail: crail.elasticstore.scaledown 0.1
> > 25/03/25 13:57:14 INFO crail: crail.elasticstore.maxnodes 10
> > 25/03/25 13:57:14 INFO crail: crail.elasticstore.minnodes 1
> > 25/03/25 13:57:14 INFO crail: crail.elasticstore.policyrunner.interval 1000
> > 25/03/25 13:57:14 INFO crail: crail.elasticstore.logging false
> > 25/03/25 13:57:14 INFO crail: round robin block selection
> > 25/03/25 13:57:14 INFO crail: rpc group started, recvQueue 32
> > 25/03/25 13:57:14 INFO darpc: running resource management, index 0, affinity 2, timeout 2147483647
> > 25/03/25 13:57:14 INFO crail: crail.namenode.darpc.polling false
> > 25/03/25 13:57:14 INFO crail: crail.namenode.darpc.type passive
> > 25/03/25 13:57:14 INFO crail: crail.namenode.darpc.affinity 1
> > 25/03/25 13:57:14 INFO crail: crail.namenode.darpc.maxinline 0
> > 25/03/25 13:57:14 INFO crail: crail.namenode.darpc.recvQueue 32
> > 25/03/25 13:57:14 INFO crail: crail.namenode.darpc.sendQueue 32
> > 25/03/25 13:57:14 INFO crail: crail.namenode.darpc.pollsize 32
> > 25/03/25 13:57:14 INFO crail: crail.namenode.darpc.clustersize 128
> > 25/03/25 13:57:14 INFO crail: crail.namenode.darpc.backlog 100
> > 25/03/25 13:57:14 INFO crail: crail.namenode.darpc.connecttimeout 1000
> > 25/03/25 13:57:14 INFO crail: opened server at /128.131.57.140:9060
> >
> >
> > Now spinning up the datanode using $CRAIL_HOME/bin/crail datanode -t org.apache.crail.storage.rdma.RdmaStorageTier:
> >
> > 25/03/25 13:59:05 INFO crail: crail.version 3101
> > 25/03/25 13:59:05 INFO crail: crail.directorydepth 16
> > 25/03/25 13:59:05 INFO crail: crail.tokenexpiration 10
> > 25/03/25 13:59:05 INFO crail: crail.blocksize 1048576
> > 25/03/25 13:59:05 INFO crail: crail.cachelimit 268435456
> > 25/03/25 13:59:05 INFO crail: crail.cachepath /dev/hugepages/cache
> > 25/03/25 13:59:05 INFO crail: crail.user crail
> > 25/03/25 13:59:05 INFO crail: crail.shadowreplication 1
> > 25/03/25 13:59:05 INFO crail: crail.debug false
> > 25/03/25 13:59:05 INFO crail: crail.statistics true
> > 25/03/25 13:59:05 INFO crail: crail.rpctimeout 1000
> > 25/03/25 13:59:05 INFO crail: crail.datatimeout 1000
> > 25/03/25 13:59:05 INFO crail: crail.buffersize 1048576
> > 25/03/25 13:59:05 INFO crail: crail.slicesize 524288
> > 25/03/25 13:59:05 INFO crail: crail.singleton true
> > 25/03/25 13:59:05 INFO crail: crail.regionsize 268435456
> > 25/03/25 13:59:05 INFO crail: crail.directoryrecord 512
> > 25/03/25 13:59:05 INFO crail: crail.directoryrandomize true
> > 25/03/25 13:59:05 INFO crail: crail.cacheimpl org.apache.crail.memory.MappedBufferCache
> > 25/03/25 13:59:05 INFO crail: crail.locationmap
> > 25/03/25 13:59:05 INFO crail: crail.namenode.address crail://128.131.57.140:9060
> > 25/03/25 13:59:05 INFO crail: crail.namenode.blockselection roundrobin
> > 25/03/25 13:59:05 INFO crail: crail.namenode.fileblocks 16
> > 25/03/25 13:59:05 INFO crail: crail.namenode.rpctype org.apache.crail.namenode.rpc.darpc.DaRPCNameNode
> > 25/03/25 13:59:05 INFO crail: crail.namenode.rpcservice org.apache.crail.namenode.NameNodeService
> > 25/03/25 13:59:05 INFO crail: crail.namenode.log
> > 25/03/25 13:59:05 INFO crail: crail.storage.types org.apache.crail.storage.rdma.RdmaStorageTier
> > 25/03/25 13:59:05 INFO crail: crail.storage.classes 1
> > 25/03/25 13:59:05 INFO crail: crail.storage.rootclass 0
> > 25/03/25 13:59:05 INFO crail: crail.storage.keepalive 2
> > 25/03/25 13:59:05 INFO crail: crail.elasticstore.scaleup 0.4
> > 25/03/25 13:59:05 INFO crail: crail.elasticstore.scaledown 0.1
> > 25/03/25 13:59:05 INFO crail: crail.elasticstore.maxnodes 10
> > 25/03/25 13:59:05 INFO crail: crail.elasticstore.minnodes 1
> > 25/03/25 13:59:05 INFO crail: crail.elasticstore.policyrunner.interval 1000
> > 25/03/25 13:59:05 INFO crail: crail.elasticstore.logging false
> > 25/03/25 13:59:05 INFO crail: crail.storage.rdma.interface enp1s0
> > 25/03/25 13:59:05 INFO crail: crail.storage.rdma.port 50020
> > 25/03/25 13:59:05 INFO crail: crail.storage.rdma.storagelimit 268435456
> > 25/03/25 13:59:05 INFO crail: crail.storage.rdma.allocationsize 268435456
> > 25/03/25 13:59:05 INFO crail: crail.storage.rdma.datapath /dev/hugepages/data
> > 25/03/25 13:59:05 INFO crail: crail.storage.rdma.localmap true
> > 25/03/25 13:59:05 INFO crail: crail.storage.rdma.queuesize 32
> > 25/03/25 13:59:05 INFO crail: crail.storage.rdma.type passive
> > 25/03/25 13:59:05 INFO crail: crail.storage.rdma.backlog 100
> > 25/03/25 13:59:05 INFO crail: crail.storage.rdma.connecttimeout 1000
> > 25/03/25 13:59:05 INFO crail: rdma storage server started, address /128.131.57.140:50020, persistent false, maxWR 1, maxSge 1, cqSize 1
> > 25/03/25 13:59:05 INFO crail: rpc group started, recvQueue 32
> > 25/03/25 13:59:05 INFO crail: crail.namenode.darpc.polling false
> > 25/03/25 13:59:05 INFO crail: crail.namenode.darpc.type passive
> > 25/03/25 13:59:05 INFO crail: crail.namenode.darpc.affinity 1
> > 25/03/25 13:59:05 INFO crail: crail.namenode.darpc.maxinline 0
> > 25/03/25 13:59:05 INFO crail: crail.namenode.darpc.recvQueue 32
> > 25/03/25 13:59:05 INFO crail: crail.namenode.darpc.sendQueue 32
> > 25/03/25 13:59:05 INFO crail: crail.namenode.darpc.pollsize 32
> > 25/03/25 13:59:05 INFO crail: crail.namenode.darpc.clustersize 128
> > 25/03/25 13:59:05 INFO crail: crail.namenode.darpc.backlog 100
> > 25/03/25 13:59:05 INFO crail: crail.namenode.darpc.connecttimeout 1000
> > 25/03/25 13:59:06 INFO crail: connected to namenode(s) /128.131.57.140:9060
> > 25/03/25 13:59:06 INFO crail: datanode statistics, freeBlocks 256
> > 25/03/25 13:59:06 INFO crail: datanode statistics, freeBlocks 256
> >
> >
> > Finally, on my client I try to test using the built-in tools iobench and fsck. Further, I have set up a DiSNI client following the examples and the configuration found in
> > https://github.com/brianfrankcooper/YCSB/tree/master/crail/src/main/java/site/ycsb/db/crail
> >
> > The first attempt at connecting to the namenode is the fsck ping experiment.
> > With $CRAIL_HOME/bin/crail fsck -t ping I get:
> > 25/03/25 14:10:58 INFO crail: crail.version 3101
> > 25/03/25 14:10:58 INFO crail: crail.directorydepth 16
> > 25/03/25 14:10:58 INFO crail: crail.tokenexpiration 10
> > 25/03/25 14:10:58 INFO crail: crail.blocksize 1048576
> > 25/03/25 14:10:58 INFO crail: crail.cachelimit 268435456
> > 25/03/25 14:10:58 INFO crail: crail.cachepath /dev/hugepages/cache
> > 25/03/25 14:10:58 INFO crail: crail.user crail
> > 25/03/25 14:10:58 INFO crail: crail.shadowreplication 1
> > 25/03/25 14:10:58 INFO crail: crail.debug false
> > 25/03/25 14:10:58 INFO crail: crail.statistics true
> > 25/03/25 14:10:58 INFO crail: crail.rpctimeout 1000
> > 25/03/25 14:10:58 INFO crail: crail.datatimeout 1000
> > 25/03/25 14:10:58 INFO crail: crail.buffersize 1048576
> > 25/03/25 14:10:58 INFO crail: crail.slicesize 524288
> > 25/03/25 14:10:58 INFO crail: crail.singleton true
> > 25/03/25 14:10:58 INFO crail: crail.regionsize 268435456
> > 25/03/25 14:10:58 INFO crail: crail.directoryrecord 512
> > 25/03/25 14:10:58 INFO crail: crail.directoryrandomize true
> > 25/03/25 14:10:58 INFO crail: crail.cacheimpl org.apache.crail.memory.MappedBufferCache
> > 25/03/25 14:10:58 INFO crail: crail.locationmap
> > 25/03/25 14:10:58 INFO crail: crail.namenode.address crail://128.131.57.140:9060
> > 25/03/25 14:10:58 INFO crail: crail.namenode.blockselection roundrobin
> > 25/03/25 14:10:58 INFO crail: crail.namenode.fileblocks 16
> > 25/03/25 14:10:58 INFO crail: crail.namenode.rpctype org.apache.crail.namenode.rpc.tcp.TcpNameNode
> > 25/03/25 14:10:58 INFO crail: crail.namenode.rpcservice org.apache.crail.namenode.NameNodeService
> > 25/03/25 14:10:58 INFO crail: crail.namenode.log
> > 25/03/25 14:10:58 INFO crail: crail.storage.types org.apache.crail.storage.rdma.RdmaStorageTier
> > 25/03/25 14:10:58 INFO crail: crail.storage.classes 1
> > 25/03/25 14:10:58 INFO crail: crail.storage.rootclass 0
> > 25/03/25 14:10:58 INFO crail: crail.storage.keepalive 2
> > 25/03/25 14:10:58 INFO crail: crail.elasticstore.scaleup 0.4
> > 25/03/25 14:10:58 INFO crail: crail.elasticstore.scaledown 0.1
> > 25/03/25 14:10:58 INFO crail: crail.elasticstore.maxnodes 10
> > 25/03/25 14:10:58 INFO crail: crail.elasticstore.minnodes 1
> > 25/03/25 14:10:58 INFO crail: crail.elasticstore.policyrunner.interval 1000
> > 25/03/25 14:10:58 INFO crail: crail.elasticstore.logging false
> > 25/03/25 14:10:58 INFO crail: buffer cache, allocationCount 1, bufferCount 256
> > 25/03/25 14:10:58 INFO crail: crail.storage.rdma.interface enp1s0
> > 25/03/25 14:10:58 INFO crail: crail.storage.rdma.port 50020
> > 25/03/25 14:10:58 INFO crail: crail.storage.rdma.storagelimit 268435456
> > 25/03/25 14:10:58 INFO crail: crail.storage.rdma.allocationsize 268435456
> > 25/03/25 14:10:58 INFO crail: crail.storage.rdma.datapath /dev/hugepages/data
> > 25/03/25 14:10:58 INFO crail: crail.storage.rdma.localmap true
> > 25/03/25 14:10:58 INFO crail: crail.storage.rdma.queuesize 32
> > 25/03/25 14:10:58 INFO crail: crail.storage.rdma.type passive
> > 25/03/25 14:10:58 INFO crail: crail.storage.rdma.backlog 100
> > 25/03/25 14:10:58 INFO crail: crail.storage.rdma.connecttimeout 1000
> > 25/03/25 14:10:58 INFO narpc: new NaRPC client group v1.5.0, queueDepth 32, messageSize 512, nodealy true
> > 25/03/25 14:10:58 INFO crail: crail.namenode.tcp.queueDepth 32
> > 25/03/25 14:10:58 INFO crail: crail.namenode.tcp.messageSize 512
> > 25/03/25 14:10:58 INFO crail: crail.namenode.tcp.cores 1
> > 25/03/25 14:10:58 INFO crail: connected to namenode(s) /128.131.57.140:9060
> >
> > At this point it gets stuck. Here I noticed that it falls back to NaRPC instead of DaRPC. This is the case for all experiments. Changing anything in crail-site.conf on the client did not achieve anything; even setting it to something obviously wrong was simply ignored. On the server side, this ping went unnoticed.
> >
> > I then tried iobench using the default $CRAIL_HOME/bin/crail iobench -t write -f /filename -s $((1024*1024)) -k 1024. This resulted in
> >
> > 25/03/25 14:13:25 INFO crail: creating singleton crail file system
> > 25/03/25 14:13:25 INFO crail: crail.version 3101
> > 25/03/25 14:13:25 INFO crail: crail.directorydepth 16
> > 25/03/25 14:13:25 INFO crail: crail.tokenexpiration 10
> > 25/03/25 14:13:25 INFO crail: crail.blocksize 1048576
> > 25/03/25 14:13:25 INFO crail: crail.cachelimit 268435456
> > 25/03/25 14:13:25 INFO crail: crail.cachepath /dev/hugepages/cache
> > 25/03/25 14:13:25 INFO crail: crail.user crail
> > 25/03/25 14:13:25 INFO crail: crail.shadowreplication 1
> > 25/03/25 14:13:25 INFO crail: crail.debug false
> > 25/03/25 14:13:25 INFO crail: crail.statistics true
> > 25/03/25 14:13:25 INFO crail: crail.rpctimeout 1000
> > 25/03/25 14:13:25 INFO crail: crail.datatimeout 1000
> > 25/03/25 14:13:25 INFO crail: crail.buffersize 1048576
> > 25/03/25 14:13:25 INFO crail: crail.slicesize 524288
> > 25/03/25 14:13:25 INFO crail: crail.singleton true
> > 25/03/25 14:13:25 INFO crail: crail.regionsize 268435456
> > 25/03/25 14:13:25 INFO crail: crail.directoryrecord 512
> > 25/03/25 14:13:25 INFO crail: crail.directoryrandomize true
> > 25/03/25 14:13:25 INFO crail: crail.cacheimpl org.apache.crail.memory.MappedBufferCache
> > 25/03/25 14:13:25 INFO crail: crail.locationmap
> > 25/03/25 14:13:25 INFO crail: crail.namenode.address crail://128.131.57.140:9060
> > 25/03/25 14:13:25 INFO crail: crail.namenode.blockselection roundrobin
> > 25/03/25 14:13:25 INFO crail: crail.namenode.fileblocks 16
> > 25/03/25 14:13:25 INFO crail: crail.namenode.rpctype org.apache.crail.namenode.rpc.tcp.TcpNameNode
> > 25/03/25 14:13:25 INFO crail: crail.namenode.rpcservice org.apache.crail.namenode.NameNodeService
> > 25/03/25 14:13:25 INFO crail: crail.namenode.log
> > 25/03/25 14:13:25 INFO crail: crail.storage.types org.apache.crail.storage.rdma.RdmaStorageTier
> > 25/03/25 14:13:25 INFO crail: crail.storage.classes 1
> > 25/03/25 14:13:25 INFO crail: crail.storage.rootclass 0
> > 25/03/25 14:13:25 INFO crail: crail.storage.keepalive 2
> > 25/03/25 14:13:25 INFO crail: crail.elasticstore.scaleup 0.4
> > 25/03/25 14:13:25 INFO crail: crail.elasticstore.scaledown 0.1
> > 25/03/25 14:13:25 INFO crail: crail.elasticstore.maxnodes 10
> > 25/03/25 14:13:25 INFO crail: crail.elasticstore.minnodes 1
> > 25/03/25 14:13:25 INFO crail: crail.elasticstore.policyrunner.interval 1000
> > 25/03/25 14:13:25 INFO crail: crail.elasticstore.logging false
> > 25/03/25 14:13:25 INFO crail: buffer cache, allocationCount 1, bufferCount 256
> > 25/03/25 14:13:25 INFO crail: crail.storage.rdma.interface enp1s0
> > 25/03/25 14:13:25 INFO crail: crail.storage.rdma.port 50020
> > 25/03/25 14:13:25 INFO crail: crail.storage.rdma.storagelimit 268435456
> > 25/03/25 14:13:25 INFO crail: crail.storage.rdma.allocationsize 268435456
> > 25/03/25 14:13:25 INFO crail: crail.storage.rdma.datapath /dev/hugepages/data
> > 25/03/25 14:13:25 INFO crail: crail.storage.rdma.localmap true
> > 25/03/25 14:13:25 INFO crail: crail.storage.rdma.queuesize 32
> > 25/03/25 14:13:25 INFO crail: crail.storage.rdma.type passive
> > 25/03/25 14:13:25 INFO crail: crail.storage.rdma.backlog 100
> > 25/03/25 14:13:25 INFO crail: crail.storage.rdma.connecttimeout 1000
> > 25/03/25 14:13:25 INFO narpc: new NaRPC client group v1.5.0, queueDepth 32, messageSize 512, nodealy true
> > 25/03/25 14:13:25 INFO crail: crail.namenode.tcp.queueDepth 32
> > 25/03/25 14:13:25 INFO crail: crail.namenode.tcp.messageSize 512
> > 25/03/25 14:13:25 INFO crail: crail.namenode.tcp.cores 1
> > 25/03/25 14:13:25 INFO crail: connected to namenode(s) /128.131.57.140:9060
> > write, filename /filename, size 1048576, loop 1024, storageClass 0, locationClass 0, buffered true
> > Exception in thread "main" java.io.IOException: Map failed
> >     at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:938)
> >     at org.apache.crail.memory.MappedBufferCache.allocateRegion(MappedBufferCache.java:94)
> >     at org.apache.crail.memory.BufferCache.allocateBuffer(BufferCache.java:95)
> >     at org.apache.crail.core.CoreDataStore.allocateBuffer(CoreDataStore.java:482)
> >     at org.apache.crail.tools.CrailBenchmark.write(CrailBenchmark.java:85)
> >     at org.apache.crail.tools.CrailBenchmark.main(CrailBenchmark.java:1070)
> > Caused by: java.lang.OutOfMemoryError: Map failed
> >     at sun.nio.ch.FileChannelImpl.map0(Native Method)
> >     at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:935)
> >     ... 5 more
> >
> >
> > Here again it falls back to NaRPC instead of DaRPC. Further, the call went unnoticed on the server side, although the client states that it connected to the namenode.
> > I then changed the setting in crail-site.conf on my server back to NaRPC. With only this part of the configuration removed, the server was still able to spin up. I ran the fsck experiments again: namenodeDump, getLocations and ping worked, but directoryDump, blockStatistics and createDirectory ran into the same java.lang.OutOfMemoryError: Map failed. The connection attempts were recognized on the namenode side.
> >
> > After that, I tried iobench again, resulting in the same OutOfMemoryError. Reducing the size and loop to a bare minimum of $((4*4)) -k 4, I still got the same issue.
> >
> >
> > Next, I reverted all the configuration I had done in crail-site.conf to use TCP instead of RDMA, to check whether this might give me some insights. For this setup, I got the same issues with fsck -t createDirectory and with iobench.
> >
> > At this point I am absolutely stuck. As Crail is a perfect fit for my master's thesis, I don't really want to give up on trying to finish this setup.
> >
> > I hope I didn't miss any important information. I will try to find a way to debug this in the meantime.
> > I am grateful for any advice on this.
> >
> > For the last email I sent, I didn't receive any reply in my email provider. Would it be possible to put me and my working email "meiko.prilop-...@ibm.com" in CC, just in case?
> >
> > Thanks in advance!
> >
> > Best
> > Meiko Prilop
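
For the hugepage check Jonas describes above, a minimal sketch may help. It derives the number of hugepages a given cachelimit needs and compares it with the pages that are free and not already reserved, using the Huge* fields of the client's /proc/meminfo as quoted in this thread. The function names are hypothetical, and the sketch assumes no hugepage overcommit (HugePages_Surp is 0 in the quoted output) and that Crail maps the whole cachelimit from hugetlbfs. It is a way to read the numbers, not a statement about the root cause.

```python
import math
import re


def hugepage_fields(meminfo_text):
    """Extract the integer HugePages_* and Hugepagesize fields from meminfo text."""
    fields = {}
    for line in meminfo_text.splitlines():
        match = re.match(r"\s*(HugePages_\w+|Hugepagesize):\s+(\d+)", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def hugepages_needed(size_bytes, hugepage_kb):
    """Number of hugepages required to back size_bytes."""
    return math.ceil(size_bytes / (hugepage_kb * 1024))


def hugepages_unreserved(fields):
    """Free pages minus pages that are free but already promised to a mapping."""
    return fields["HugePages_Free"] - fields["HugePages_Rsvd"]


# Huge* fields copied from the client output quoted above. On a live system,
# open("/proc/meminfo").read() could be used instead.
CLIENT_MEMINFO = """\
HugePages_Total:    2048
HugePages_Free:     2044
HugePages_Rsvd:     2044
HugePages_Surp:        0
Hugepagesize:       2048 kB
"""

fields = hugepage_fields(CLIENT_MEMINFO)
needed = hugepages_needed(268435456, fields["Hugepagesize"])  # crail.cachelimit
print(f"needed {needed}, unreserved {hugepages_unreserved(fields)}")
```

With the quoted client values this reports 128 pages needed and 0 unreserved, so it may be worth checking what is holding the 2044 reserved pages (for example leftover files under /dev/hugepages) before looking further.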