I wrote a small test program for Meiko that measures the maximum
amount of memory that can be registered per RDMA device. On the
client side it reported a maximum registrable memory region of
just 32 Mbytes, which is almost nothing. This finding should be
sufficient to close the current issue from the Crail side.
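
To put the 32 Mbytes in perspective, here is a minimal Python sketch that
compares that limit with the sizes from the crail-site.conf and client log
quoted below. It is only an illustration: it treats 32 Mbytes as 32 MiB, the
names MAX_REGISTRATION_BYTES and settings_over_limit are made up for this
example, and which Crail setting ends up as a single RDMA registration is an
assumption, not something taken from the Crail code.

```python
MIB = 1024 * 1024

# Assumption: the "32 Mbytes" measured on the client means 32 MiB.
MAX_REGISTRATION_BYTES = 32 * MIB

# Values copied from the crail-site.conf and the client log quoted below.
CRAIL_SIZES = {
    "crail.cachelimit": 268435456,
    "crail.regionsize": 268435456,
    "crail.buffersize": 1048576,
}


def settings_over_limit(sizes, limit):
    """Return the settings whose size exceeds the limit, mapped to size/limit."""
    return {name: size / limit for name, size in sizes.items() if size > limit}


over = settings_over_limit(CRAIL_SIZES, MAX_REGISTRATION_BYTES)
for name, ratio in over.items():
    print(f"{name} is {ratio:g}x the registration limit")
```

With these numbers, crail.cachelimit and crail.regionsize both come out 8
times larger than the limit, while a single 1 MiB buffer stays below it.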

Best,
Bernard.


> -----Original Message-----
> From: Jonas <peppe...@japf.ch>
> Sent: Saturday, April 5, 2025 4:17 PM
> To: dev@crail.apache.org
> Cc: d...@crail.incubator.apache.org; Bernard Metzler <b...@zurich.ibm.com>;
> Thomas Pusztai <t.pusz...@dsg.tuwien.ac.at>; Meiko Prilop <Meiko.Prilop-
> c...@ibm.com>
> Subject: [EXTERNAL] Re: Master thesis - Problems setting up Crail
> 
> Hi Meiko,
> 
> I hope you were able to debug some of the issues already, but here are some
> things you might want to check/consider:
> - The client side also needs hugepages. The out-of-memory error means it
> couldn't allocate hugepages locally. The number of hugepages required is
> determined by the cachelimit size in the client configuration. If you have
> allocated hugepages on the client, but you still get the error, check if
> you have the appropriate permissions and "crail.cachepath" is set correctly
> in the crail conf at the client side.
> - You need to set "crail.namenode.rpctype" to the same RPC type on the
> namenode, the datanode and the client, e.g.
> "org.apache.crail.namenode.rpc.darpc.DaRPCNameNode" for RDMA.
> 
> Best,
> Jonas
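
Jonas's first point can be checked with a little arithmetic. The sketch below
is a hedged illustration, not part of Crail: the helper names are hypothetical,
the sample values are the client's hugepage lines from the /proc/meminfo dump
quoted further down, and it assumes the whole crail.cachelimit has to be backed
by hugepages. It relies on the kernel's definition of HugePages_Rsvd (pages
already promised to a mapping but not yet allocated), so the pages still
available to a new mapping are HugePages_Free minus HugePages_Rsvd.

```python
import re


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: int}; a trailing 'kB' is dropped."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"\s*([^:\s]+):\s+(\d+)", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def hugepages_needed(size_bytes, hugepage_kb):
    """Number of hugepages needed to back size_bytes, rounded up."""
    page_bytes = hugepage_kb * 1024
    return -(-size_bytes // page_bytes)


def hugepages_unreserved(fields):
    """Free hugepages that are not already reserved by another mapping."""
    return fields["HugePages_Free"] - fields["HugePages_Rsvd"]


# Hugepage lines from the client's /proc/meminfo as quoted in this thread.
CLIENT_SAMPLE = """\
HugePages_Total: 2048
HugePages_Free: 2044
HugePages_Rsvd: 2044
HugePages_Surp: 0
Hugepagesize: 2048 kB
"""

fields = parse_meminfo(CLIENT_SAMPLE)
needed = hugepages_needed(268435456, fields["Hugepagesize"])  # crail.cachelimit
unreserved = hugepages_unreserved(fields)
print(f"needed={needed} unreserved={unreserved}")
```

On a live system the same functions can be fed the contents of /proc/meminfo
instead of the sample string. Whether the result explains the "Map failed"
error on this particular client is not established here.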
> 
> On Tuesday, March 25th, 2025 at 11:03, Prilop, Meiko
> <e12123...@student.tuwien.ac.at> wrote:
> 
> >
> >
> > Dear Sir or Madam,
> >
> > since my first email, I was able to set up Crail natively on my Ubuntu
> > 18.04 machine, with a namenode and datanode that recognize each other,
> > instead of trying to use Docker. However, when I now try to test my setup
> > using Crail's built-in tools on another machine, I run into some issues
> > that I wasn't able to resolve myself.
> >
> > TLDR: All approaches to setting up the namenode and datanode resulted in
> > the client machine running into:
> >       java.lang.OutOfMemoryError: Map failed
> >
> > On two machines separately, I've set up Soft-iWARP, using
> > https://github.com/animeshtrivedi/blog/blob/master/post/2019-06-26-siw.md
> > and getting the expected output using ibv_devices.
> >
> > Further, rping is able to establish a connection between both machines.
> >
> > I then set up Crail following the descriptions at
> > https://crail.readthedocs.io/en/latest/source.html
> > https://crail.readthedocs.io/en/latest/config.html
> > where I set up my crail-site.conf to look like:
> >       
> > crail.namenode.address crail://128.131.57.140:9060
> > crail.namenode.rpctype org.apache.crail.namenode.rpc.darpc.DaRPCNameNode
> > crail.cachepath /dev/hugepages/cache
> > crail.regionsize 268435456
> > crail.cachelimit 268435456
> > crail.storage.types org.apache.crail.storage.rdma.RdmaStorageTier
> > crail.storage.rdma.interface enp1s0
> > crail.storage.rdma.datapath /dev/hugepages/data
> > crail.storage.rdma.storagelimit 268435456
> >
> >
> > On both machines. Here I drastically reduced the default size values.
> > I also changed core-site.xml to hold this address at fs.defaultFS.
> >
> > Further, when checking cat /proc/meminfo I get for the client:
> >
> > MemTotal: 16424160 kB
> > MemFree: 4983380 kB
> > MemAvailable: 9580404 kB
> > Buffers: 975056 kB
> > Cached: 3347588 kB
> > SwapCached: 1920 kB
> > Active: 3856868 kB
> > Inactive: 2377380 kB
> > Active(anon): 1339104 kB
> > Inactive(anon): 552592 kB
> > Active(file): 2517764 kB
> > Inactive(file): 1824788 kB
> > Unevictable: 0 kB
> > Mlocked: 0 kB
> > SwapTotal: 4194300 kB
> > SwapFree: 4176624 kB
> > Dirty: 108 kB
> > Writeback: 0 kB
> > AnonPages: 1909928 kB
> > Mapped: 632464 kB
> > Shmem: 132512 kB
> > Slab: 866876 kB
> > SReclaimable: 594772 kB
> > SUnreclaim: 272104 kB
> > KernelStack: 16768 kB
> > PageTables: 20932 kB
> > NFS_Unstable: 0 kB
> > Bounce: 0 kB
> > WritebackTmp: 0 kB
> > CommitLimit: 10309228 kB
> > Committed_AS: 8433608 kB
> > VmallocTotal: 34359738367 kB
> > VmallocUsed: 0 kB
> > VmallocChunk: 0 kB
> > HardwareCorrupted: 0 kB
> > AnonHugePages: 34816 kB
> > ShmemHugePages: 0 kB
> > ShmemPmdMapped: 0 kB
> > CmaTotal: 0 kB
> > CmaFree: 0 kB
> > HugePages_Total: 2048
> > HugePages_Free: 2044
> > HugePages_Rsvd: 2044
> > HugePages_Surp: 0
> > Hugepagesize: 2048 kB
> > DirectMap4k: 728940 kB
> > DirectMap2M: 11853824 kB
> > DirectMap1G: 6291456 kB
> >
> > And on my server:
> >
> > MemTotal: 16424160 kB
> > MemFree: 3771716 kB
> > MemAvailable: 6517888 kB
> > Buffers: 309848 kB
> > Cached: 2576436 kB
> > SwapCached: 0 kB
> > Active: 2398708 kB
> > Inactive: 1465044 kB
> > Active(anon): 977800 kB
> > Inactive(anon): 356 kB
> > Active(file): 1420908 kB
> > Inactive(file): 1464688 kB
> > Unevictable: 0 kB
> > Mlocked: 0 kB
> > SwapTotal: 4194300 kB
> > SwapFree: 4194300 kB
> > Dirty: 76 kB
> > Writeback: 0 kB
> > AnonPages: 977544 kB
> > Mapped: 233700 kB
> > Shmem: 820 kB
> > Slab: 294456 kB
> > SReclaimable: 202184 kB
> > SUnreclaim: 92272 kB
> > KernelStack: 6096 kB
> > PageTables: 12184 kB
> > NFS_Unstable: 0 kB
> > Bounce: 0 kB
> > WritebackTmp: 0 kB
> > CommitLimit: 8212076 kB
> > Committed_AS: 2616856 kB
> > VmallocTotal: 34359738367 kB
> > VmallocUsed: 0 kB
> > VmallocChunk: 0 kB
> > HardwareCorrupted: 0 kB
> > AnonHugePages: 0 kB
> > ShmemHugePages: 0 kB
> > ShmemPmdMapped: 0 kB
> > CmaTotal: 0 kB
> > CmaFree: 0 kB
> > HugePages_Total: 4096
> > HugePages_Free: 3968
> > HugePages_Rsvd: 0
> > HugePages_Surp: 0
> > Hugepagesize: 2048 kB
> > DirectMap4k: 210796 kB
> > DirectMap2M: 7129088 kB
> > DirectMap1G: 11534336 kB
> >
> > Checking whether my hugepages are mounted using mount | grep huge on the
> > server and client, the output is:
> >       hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
> >
> > Spinning up a namenode on the server, the output is as follows:
> >       25/03/25 13:57:14 INFO crail: initalizing namenode
> >       25/03/25 13:57:14 INFO crail: crail.version 3101
> >       25/03/25 13:57:14 INFO crail: crail.directorydepth 16
> >       25/03/25 13:57:14 INFO crail: crail.tokenexpiration 10
> >       25/03/25 13:57:14 INFO crail: crail.blocksize 1048576
> >       25/03/25 13:57:14 INFO crail: crail.cachelimit 268435456
> >       25/03/25 13:57:14 INFO crail: crail.cachepath /dev/hugepages/cache
> >       25/03/25 13:57:14 INFO crail: crail.user crail
> >       25/03/25 13:57:14 INFO crail: crail.shadowreplication 1
> >       25/03/25 13:57:14 INFO crail: crail.debug false
> >       25/03/25 13:57:14 INFO crail: crail.statistics true
> >       25/03/25 13:57:14 INFO crail: crail.rpctimeout 1000
> >       25/03/25 13:57:14 INFO crail: crail.datatimeout 1000
> >       25/03/25 13:57:14 INFO crail: crail.buffersize 1048576
> >       25/03/25 13:57:14 INFO crail: crail.slicesize 524288
> >       25/03/25 13:57:14 INFO crail: crail.singleton true
> >       25/03/25 13:57:14 INFO crail: crail.regionsize 268435456
> >       25/03/25 13:57:14 INFO crail: crail.directoryrecord 512
> >       25/03/25 13:57:14 INFO crail: crail.directoryrandomize true
> >       25/03/25 13:57:14 INFO crail: crail.cacheimpl
> org.apache.crail.memory.MappedBufferCache
> >       25/03/25 13:57:14 INFO crail: crail.locationmap
> >       25/03/25 13:57:14 INFO crail: crail.namenode.address
> crail://128.131.57.140:9060?id=0&size=1
> >       25/03/25 13:57:14 INFO crail: crail.namenode.blockselection
> roundrobin
> >       25/03/25 13:57:14 INFO crail: crail.namenode.fileblocks 16
> >       25/03/25 13:57:14 INFO crail: crail.namenode.rpctype
> org.apache.crail.namenode.rpc.darpc.DaRPCNameNode
> >       25/03/25 13:57:14 INFO crail: crail.namenode.rpcservice
> org.apache.crail.namenode.NameNodeService
> >       25/03/25 13:57:14 INFO crail: crail.namenode.log
> >       25/03/25 13:57:14 INFO crail: crail.storage.types
> org.apache.crail.storage.rdma.RdmaStorageTier
> >       25/03/25 13:57:14 INFO crail: crail.storage.classes 1
> >       25/03/25 13:57:14 INFO crail: crail.storage.rootclass 0
> >       25/03/25 13:57:14 INFO crail: crail.storage.keepalive 2
> >       25/03/25 13:57:14 INFO crail: crail.elasticstore.scaleup 0.4
> >       25/03/25 13:57:14 INFO crail: crail.elasticstore.scaledown 0.1
> >       25/03/25 13:57:14 INFO crail: crail.elasticstore.maxnodes 10
> >       25/03/25 13:57:14 INFO crail: crail.elasticstore.minnodes 1
> >       25/03/25 13:57:14 INFO crail:
> crail.elasticstore.policyrunner.interval 1000
> >       25/03/25 13:57:14 INFO crail: crail.elasticstore.logging false
> >       25/03/25 13:57:14 INFO crail: round robin block selection
> >       25/03/25 13:57:14 INFO crail: rpc group started, recvQueue 32
> >       25/03/25 13:57:14 INFO darpc: running resource management, index 0,
> affinity 2, timeout 2147483647
> >       25/03/25 13:57:14 INFO crail: crail.namenode.darpc.polling false
> >       25/03/25 13:57:14 INFO crail: crail.namenode.darpc.type passive
> >       25/03/25 13:57:14 INFO crail: crail.namenode.darpc.affinity 1
> >       25/03/25 13:57:14 INFO crail: crail.namenode.darpc.maxinline 0
> >       25/03/25 13:57:14 INFO crail: crail.namenode.darpc.recvQueue 32
> >       25/03/25 13:57:14 INFO crail: crail.namenode.darpc.sendQueue 32
> >       25/03/25 13:57:14 INFO crail: crail.namenode.darpc.pollsize 32
> >       25/03/25 13:57:14 INFO crail: crail.namenode.darpc.clustersize 128
> >       25/03/25 13:57:14 INFO crail: crail.namenode.darpc.backlog 100
> >       25/03/25 13:57:14 INFO crail: crail.namenode.darpc.connecttimeout
> 1000
> >       25/03/25 13:57:14 INFO crail: opened server at /128.131.57.140:9060
> >
> >
> > Now spinning up the datanode using $CRAIL_HOME/bin/crail datanode -t
> org.apache.crail.storage.rdma.RdmaStorageTier:
> >
> >       25/03/25 13:59:05 INFO crail: crail.version 3101
> >       25/03/25 13:59:05 INFO crail: crail.directorydepth 16
> >       25/03/25 13:59:05 INFO crail: crail.tokenexpiration 10
> >       25/03/25 13:59:05 INFO crail: crail.blocksize 1048576
> >       25/03/25 13:59:05 INFO crail: crail.cachelimit 268435456
> >       25/03/25 13:59:05 INFO crail: crail.cachepath /dev/hugepages/cache
> >       25/03/25 13:59:05 INFO crail: crail.user crail
> >       25/03/25 13:59:05 INFO crail: crail.shadowreplication 1
> >       25/03/25 13:59:05 INFO crail: crail.debug false
> >       25/03/25 13:59:05 INFO crail: crail.statistics true
> >       25/03/25 13:59:05 INFO crail: crail.rpctimeout 1000
> >       25/03/25 13:59:05 INFO crail: crail.datatimeout 1000
> >       25/03/25 13:59:05 INFO crail: crail.buffersize 1048576
> >       25/03/25 13:59:05 INFO crail: crail.slicesize 524288
> >       25/03/25 13:59:05 INFO crail: crail.singleton true
> >       25/03/25 13:59:05 INFO crail: crail.regionsize 268435456
> >       25/03/25 13:59:05 INFO crail: crail.directoryrecord 512
> >       25/03/25 13:59:05 INFO crail: crail.directoryrandomize true
> >       25/03/25 13:59:05 INFO crail: crail.cacheimpl
> org.apache.crail.memory.MappedBufferCache
> >       25/03/25 13:59:05 INFO crail: crail.locationmap
> >       25/03/25 13:59:05 INFO crail: crail.namenode.address
> crail://128.131.57.140:9060
> >       25/03/25 13:59:05 INFO crail: crail.namenode.blockselection
> roundrobin
> >       25/03/25 13:59:05 INFO crail: crail.namenode.fileblocks 16
> >       25/03/25 13:59:05 INFO crail: crail.namenode.rpctype
> org.apache.crail.namenode.rpc.darpc.DaRPCNameNode
> >       25/03/25 13:59:05 INFO crail: crail.namenode.rpcservice
> org.apache.crail.namenode.NameNodeService
> >       25/03/25 13:59:05 INFO crail: crail.namenode.log
> >       25/03/25 13:59:05 INFO crail: crail.storage.types
> org.apache.crail.storage.rdma.RdmaStorageTier
> >       25/03/25 13:59:05 INFO crail: crail.storage.classes 1
> >       25/03/25 13:59:05 INFO crail: crail.storage.rootclass 0
> >       25/03/25 13:59:05 INFO crail: crail.storage.keepalive 2
> >       25/03/25 13:59:05 INFO crail: crail.elasticstore.scaleup 0.4
> >       25/03/25 13:59:05 INFO crail: crail.elasticstore.scaledown 0.1
> >       25/03/25 13:59:05 INFO crail: crail.elasticstore.maxnodes 10
> >       25/03/25 13:59:05 INFO crail: crail.elasticstore.minnodes 1
> >       25/03/25 13:59:05 INFO crail:
> crail.elasticstore.policyrunner.interval 1000
> >       25/03/25 13:59:05 INFO crail: crail.elasticstore.logging false
> >       25/03/25 13:59:05 INFO crail: crail.storage.rdma.interface enp1s0
> >       25/03/25 13:59:05 INFO crail: crail.storage.rdma.port 50020
> >       25/03/25 13:59:05 INFO crail: crail.storage.rdma.storagelimit
> 268435456
> >       25/03/25 13:59:05 INFO crail: crail.storage.rdma.allocationsize
> 268435456
> >       25/03/25 13:59:05 INFO crail: crail.storage.rdma.datapath
> /dev/hugepages/data
> >       25/03/25 13:59:05 INFO crail: crail.storage.rdma.localmap true
> >       25/03/25 13:59:05 INFO crail: crail.storage.rdma.queuesize 32
> >       25/03/25 13:59:05 INFO crail: crail.storage.rdma.type passive
> >       25/03/25 13:59:05 INFO crail: crail.storage.rdma.backlog 100
> >       25/03/25 13:59:05 INFO crail: crail.storage.rdma.connecttimeout 1000
> >       25/03/25 13:59:05 INFO crail: rdma storage server started, address
> /128.131.57.140:50020, persistent false, maxWR 1, maxSge 1, cqSize 1
> >       25/03/25 13:59:05 INFO crail: rpc group started, recvQueue 32
> >       25/03/25 13:59:05 INFO crail: crail.namenode.darpc.polling false
> >       25/03/25 13:59:05 INFO crail: crail.namenode.darpc.type passive
> >       25/03/25 13:59:05 INFO crail: crail.namenode.darpc.affinity 1
> >       25/03/25 13:59:05 INFO crail: crail.namenode.darpc.maxinline 0
> >       25/03/25 13:59:05 INFO crail: crail.namenode.darpc.recvQueue 32
> >       25/03/25 13:59:05 INFO crail: crail.namenode.darpc.sendQueue 32
> >       25/03/25 13:59:05 INFO crail: crail.namenode.darpc.pollsize 32
> >       25/03/25 13:59:05 INFO crail: crail.namenode.darpc.clustersize 128
> >       25/03/25 13:59:05 INFO crail: crail.namenode.darpc.backlog 100
> >       25/03/25 13:59:05 INFO crail: crail.namenode.darpc.connecttimeout
> 1000
> >       25/03/25 13:59:06 INFO crail: connected to namenode(s)
> /128.131.57.140:9060
> >       25/03/25 13:59:06 INFO crail: datanode statistics, freeBlocks 256
> >       25/03/25 13:59:06 INFO crail: datanode statistics, freeBlocks 256
> >
> >
> > Finally, on my client I try to test using the built-in tools
> > iobench and fsck. In addition, I have set up a DiSNI client following the
> > examples and the configuration found at
> > https://github.com/brianfrankcooper/YCSB/tree/master/crail/src/main/java/site/ycsb/db/crail
> >
> > The first attempt at connecting to the namenode is the fsck ping
> > experiment. With $CRAIL_HOME/bin/crail fsck -t ping I get:
> >       25/03/25 14:10:58 INFO crail: crail.version 3101
> >       25/03/25 14:10:58 INFO crail: crail.directorydepth 16
> >       25/03/25 14:10:58 INFO crail: crail.tokenexpiration 10
> >       25/03/25 14:10:58 INFO crail: crail.blocksize 1048576
> >       25/03/25 14:10:58 INFO crail: crail.cachelimit 268435456
> >       25/03/25 14:10:58 INFO crail: crail.cachepath /dev/hugepages/cache
> >       25/03/25 14:10:58 INFO crail: crail.user crail
> >       25/03/25 14:10:58 INFO crail: crail.shadowreplication 1
> >       25/03/25 14:10:58 INFO crail: crail.debug false
> >       25/03/25 14:10:58 INFO crail: crail.statistics true
> >       25/03/25 14:10:58 INFO crail: crail.rpctimeout 1000
> >       25/03/25 14:10:58 INFO crail: crail.datatimeout 1000
> >       25/03/25 14:10:58 INFO crail: crail.buffersize 1048576
> >       25/03/25 14:10:58 INFO crail: crail.slicesize 524288
> >       25/03/25 14:10:58 INFO crail: crail.singleton true
> >       25/03/25 14:10:58 INFO crail: crail.regionsize 268435456
> >       25/03/25 14:10:58 INFO crail: crail.directoryrecord 512
> >       25/03/25 14:10:58 INFO crail: crail.directoryrandomize true
> >       25/03/25 14:10:58 INFO crail: crail.cacheimpl
> org.apache.crail.memory.MappedBufferCache
> >       25/03/25 14:10:58 INFO crail: crail.locationmap
> >       25/03/25 14:10:58 INFO crail: crail.namenode.address
> crail://128.131.57.140:9060
> >       25/03/25 14:10:58 INFO crail: crail.namenode.blockselection
> roundrobin
> >       25/03/25 14:10:58 INFO crail: crail.namenode.fileblocks 16
> >       25/03/25 14:10:58 INFO crail: crail.namenode.rpctype
> org.apache.crail.namenode.rpc.tcp.TcpNameNode
> >       25/03/25 14:10:58 INFO crail: crail.namenode.rpcservice
> org.apache.crail.namenode.NameNodeService
> >       25/03/25 14:10:58 INFO crail: crail.namenode.log
> >       25/03/25 14:10:58 INFO crail: crail.storage.types
> org.apache.crail.storage.rdma.RdmaStorageTier
> >       25/03/25 14:10:58 INFO crail: crail.storage.classes 1
> >       25/03/25 14:10:58 INFO crail: crail.storage.rootclass 0
> >       25/03/25 14:10:58 INFO crail: crail.storage.keepalive 2
> >       25/03/25 14:10:58 INFO crail: crail.elasticstore.scaleup 0.4
> >       25/03/25 14:10:58 INFO crail: crail.elasticstore.scaledown 0.1
> >       25/03/25 14:10:58 INFO crail: crail.elasticstore.maxnodes 10
> >       25/03/25 14:10:58 INFO crail: crail.elasticstore.minnodes 1
> >       25/03/25 14:10:58 INFO crail:
> crail.elasticstore.policyrunner.interval 1000
> >       25/03/25 14:10:58 INFO crail: crail.elasticstore.logging false
> >       25/03/25 14:10:58 INFO crail: buffer cache, allocationCount 1,
> bufferCount 256
> >       25/03/25 14:10:58 INFO crail: crail.storage.rdma.interface enp1s0
> >       25/03/25 14:10:58 INFO crail: crail.storage.rdma.port 50020
> >       25/03/25 14:10:58 INFO crail: crail.storage.rdma.storagelimit
> 268435456
> >       25/03/25 14:10:58 INFO crail: crail.storage.rdma.allocationsize
> 268435456
> >       25/03/25 14:10:58 INFO crail: crail.storage.rdma.datapath
> /dev/hugepages/data
> >       25/03/25 14:10:58 INFO crail: crail.storage.rdma.localmap true
> >       25/03/25 14:10:58 INFO crail: crail.storage.rdma.queuesize 32
> >       25/03/25 14:10:58 INFO crail: crail.storage.rdma.type passive
> >       25/03/25 14:10:58 INFO crail: crail.storage.rdma.backlog 100
> >       25/03/25 14:10:58 INFO crail: crail.storage.rdma.connecttimeout 1000
> >       25/03/25 14:10:58 INFO narpc: new NaRPC client group v1.5.0,
> queueDepth 32, messageSize 512, nodealy true
> >       25/03/25 14:10:58 INFO crail: crail.namenode.tcp.queueDepth 32
> >       25/03/25 14:10:58 INFO crail: crail.namenode.tcp.messageSize 512
> >       25/03/25 14:10:58 INFO crail: crail.namenode.tcp.cores 1
> >       25/03/25 14:10:58 INFO crail: connected to namenode(s)
> /128.131.57.140:9060
> >
> > It then gets stuck there. I noticed that it falls back to NaRPC
> > instead of DaRPC. This is the case for all experiments. Changing anything
> > in the crail-site.conf on the client had no effect; even setting this to
> > something obviously wrong was simply ignored. On the server side, this
> > ping went unnoticed.
> >
> > I then tried the I/O benchmark using the default $CRAIL_HOME/bin/crail
> > iobench -t write -f /filename -s $((1024*1024)) -k 1024. This resulted in
> >
> >       25/03/25 14:13:25 INFO crail: creating singleton crail file system
> >       25/03/25 14:13:25 INFO crail: crail.version 3101
> >       25/03/25 14:13:25 INFO crail: crail.directorydepth 16
> >       25/03/25 14:13:25 INFO crail: crail.tokenexpiration 10
> >       25/03/25 14:13:25 INFO crail: crail.blocksize 1048576
> >       25/03/25 14:13:25 INFO crail: crail.cachelimit 268435456
> >       25/03/25 14:13:25 INFO crail: crail.cachepath /dev/hugepages/cache
> >       25/03/25 14:13:25 INFO crail: crail.user crail
> >       25/03/25 14:13:25 INFO crail: crail.shadowreplication 1
> >       25/03/25 14:13:25 INFO crail: crail.debug false
> >       25/03/25 14:13:25 INFO crail: crail.statistics true
> >       25/03/25 14:13:25 INFO crail: crail.rpctimeout 1000
> >       25/03/25 14:13:25 INFO crail: crail.datatimeout 1000
> >       25/03/25 14:13:25 INFO crail: crail.buffersize 1048576
> >       25/03/25 14:13:25 INFO crail: crail.slicesize 524288
> >       25/03/25 14:13:25 INFO crail: crail.singleton true
> >       25/03/25 14:13:25 INFO crail: crail.regionsize 268435456
> >       25/03/25 14:13:25 INFO crail: crail.directoryrecord 512
> >       25/03/25 14:13:25 INFO crail: crail.directoryrandomize true
> >       25/03/25 14:13:25 INFO crail: crail.cacheimpl
> org.apache.crail.memory.MappedBufferCache
> >       25/03/25 14:13:25 INFO crail: crail.locationmap
> >       25/03/25 14:13:25 INFO crail: crail.namenode.address
> crail://128.131.57.140:9060
> >       25/03/25 14:13:25 INFO crail: crail.namenode.blockselection
> roundrobin
> >       25/03/25 14:13:25 INFO crail: crail.namenode.fileblocks 16
> >       25/03/25 14:13:25 INFO crail: crail.namenode.rpctype
> org.apache.crail.namenode.rpc.tcp.TcpNameNode
> >       25/03/25 14:13:25 INFO crail: crail.namenode.rpcservice
> org.apache.crail.namenode.NameNodeService
> >       25/03/25 14:13:25 INFO crail: crail.namenode.log
> >       25/03/25 14:13:25 INFO crail: crail.storage.types
> org.apache.crail.storage.rdma.RdmaStorageTier
> >       25/03/25 14:13:25 INFO crail: crail.storage.classes 1
> >       25/03/25 14:13:25 INFO crail: crail.storage.rootclass 0
> >       25/03/25 14:13:25 INFO crail: crail.storage.keepalive 2
> >       25/03/25 14:13:25 INFO crail: crail.elasticstore.scaleup 0.4
> >       25/03/25 14:13:25 INFO crail: crail.elasticstore.scaledown 0.1
> >       25/03/25 14:13:25 INFO crail: crail.elasticstore.maxnodes 10
> >       25/03/25 14:13:25 INFO crail: crail.elasticstore.minnodes 1
> >       25/03/25 14:13:25 INFO crail:
> crail.elasticstore.policyrunner.interval 1000
> >       25/03/25 14:13:25 INFO crail: crail.elasticstore.logging false
> >       25/03/25 14:13:25 INFO crail: buffer cache, allocationCount 1,
> bufferCount 256
> >       25/03/25 14:13:25 INFO crail: crail.storage.rdma.interface enp1s0
> >       25/03/25 14:13:25 INFO crail: crail.storage.rdma.port 50020
> >       25/03/25 14:13:25 INFO crail: crail.storage.rdma.storagelimit
> 268435456
> >       25/03/25 14:13:25 INFO crail: crail.storage.rdma.allocationsize
> 268435456
> >       25/03/25 14:13:25 INFO crail: crail.storage.rdma.datapath
> /dev/hugepages/data
> >       25/03/25 14:13:25 INFO crail: crail.storage.rdma.localmap true
> >       25/03/25 14:13:25 INFO crail: crail.storage.rdma.queuesize 32
> >       25/03/25 14:13:25 INFO crail: crail.storage.rdma.type passive
> >       25/03/25 14:13:25 INFO crail: crail.storage.rdma.backlog 100
> >       25/03/25 14:13:25 INFO crail: crail.storage.rdma.connecttimeout 1000
> >       25/03/25 14:13:25 INFO narpc: new NaRPC client group v1.5.0,
> queueDepth 32, messageSize 512, nodealy true
> >       25/03/25 14:13:25 INFO crail: crail.namenode.tcp.queueDepth 32
> >       25/03/25 14:13:25 INFO crail: crail.namenode.tcp.messageSize 512
> >       25/03/25 14:13:25 INFO crail: crail.namenode.tcp.cores 1
> >       25/03/25 14:13:25 INFO crail: connected to namenode(s)
> /128.131.57.140:9060
> >       write, filename /filename, size 1048576, loop 1024, storageClass 0,
> locationClass 0, buffered true
> >       Exception in thread "main" java.io.IOException: Map failed
> >        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:938)
> >        at
> org.apache.crail.memory.MappedBufferCache.allocateRegion(MappedBufferCache.
> java:94)
> >        at
> org.apache.crail.memory.BufferCache.allocateBuffer(BufferCache.java:95)
> >        at
> org.apache.crail.core.CoreDataStore.allocateBuffer(CoreDataStore.java:482)
> >        at
> org.apache.crail.tools.CrailBenchmark.write(CrailBenchmark.java:85)
> >        at
> org.apache.crail.tools.CrailBenchmark.main(CrailBenchmark.java:1070)
> >       Caused by: java.lang.OutOfMemoryError: Map failed
> >        at sun.nio.ch.FileChannelImpl.map0(Native Method)
> >        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:935)
> >        ... 5 more
> >
> >
> > Again it falls back to NaRPC instead of DaRPC. Further, the call went
> > unnoticed on the server side, although the client states that it connected
> > to the namenode.
> > I then changed the setting on my server in crail-site.conf back to use
> > NaRPC. With only this part of the configuration removed, the server was
> > still able to spin up. I tested the fsck experiments again: namenodeDump,
> > getLocations and ping worked, but directoryDump, blockStatistics and
> > createDirectory ran into the same java.lang.OutOfMemoryError: Map failed.
> > The connection attempts were recognized on the namenode side.
> >
> > After that, I tried iobench again, resulting in the same
> > OutOfMemoryError. Reducing the size and loop to a bare minimum of $((44))
> > -k 4, I still got the same issue.
> >
> >
> > Next, I reverted all the configuration I had done in crail-site.conf to
> > use TCP instead of RDMA, to check whether this might give me some
> > insights. With this setup, I got the same issues with fsck -t
> > createDirectory and iobench.
> >
> > At this point I am absolutely stuck. As Crail is a perfect fit for my
> > master's thesis, I don't really want to give up on trying to finish this
> > setup.
> >
> > I hope I didn't miss any important information. I will try to find a way
> > to debug this in the meantime. I am grateful for any advice on this.
> >
> > For the last email I sent, I didn't receive any reply in my email
> > provider. Would it be possible to put me and my working email
> > "meiko.prilop-...@ibm.com" in CC, just in case?
> >
> > Thanks in advance!
> >
> > Best
> > Meiko Prilop
