I’m starting to run terasort now with the shuffle data going to crail type-1. The type-2 store is set to nqn.2018-12.com.StorEdgeSystems:cntlr13 at 192.168.3.100, which appears to be working okay for the result data. When I introduce type-1 into my config, it looks like the namenode gets confused and picks the nqn of the type-2 crailstore instead of the one it is assigned (marked with "<--" in the log below).
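One way to confirm where the wrong nqn shows up is to scan the client log for the "Connecting to NVMf target" lines and compare each subsystem NQN against the one expected for that target IP. The following is only a minimal sketch: `find_nqn_mismatches` and `EXPECTED_NQN` are hypothetical names, not part of Crail, and the mapping holds just the two IP/NQN pairs named in this message (the other workers would need their own entries).

```python
import re

# Expected subsystem NQN per target IP (assumption: filled in by hand).
# Only the two pairs mentioned in this message are listed.
EXPECTED_NQN = {
    "192.168.3.100": "nqn.2018-12.com.StorEdgeSystems:cntlr13",
    "192.168.3.104": "nqn.2018-12.com.StorEdgeSystems:worker-4",
}

# Matches the client's connect message as it appears in the log.
CONNECT_RE = re.compile(
    r"Connecting to NVMf target at Transport address = /"
    r"(?P<ip>[\d.]+):(?P<port>\d+), subsystem NQN = (?P<nqn>\S+)"
)


def find_nqn_mismatches(log_text, expected=EXPECTED_NQN):
    """Return (ip, seen_nqn, expected_nqn) for each connect line whose NQN differs."""
    mismatches = []
    for match in CONNECT_RE.finditer(log_text):
        ip, nqn = match.group("ip"), match.group("nqn")
        want = expected.get(ip)
        # IPs without an entry in the mapping are skipped, not flagged.
        if want is not None and nqn != want:
            mismatches.append((ip, nqn, want))
    return mismatches
```

Running it over the log text would return one tuple per connect attempt that used an nqn other than the one listed for that IP.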
In my test run, I have 5 type-2 nodes (192.168.3.100, ns=1-5) and then 5 type-1 added to the spark worker nodes (192.168.3.101-105). All are running the same image as the namenode, which is the spark master. Is there some additional setting I need to include for this? These are the variables that get set for each container (added via environment):

# NVMf storage
crail.storage.nvmf.ip $NVMF_IP
crail.storage.nvmf.port $NVMF_PORT
crail.storage.nvmf.nqn $NVMF_NQN
crail.storage.nvmf.hostnqn $NVMF_HOSTNQN
crail.storage.nvmf.namespace $NVMF_NAMESPACE

19/06/25 10:46:43 INFO crail: creating singleton crail file system
19/06/25 10:46:43 INFO crail: crail.version 3101
19/06/25 10:46:43 INFO crail: crail.directorydepth 16
19/06/25 10:46:43 INFO crail: crail.tokenexpiration 10
19/06/25 10:46:43 INFO crail: crail.blocksize 1048576
19/06/25 10:46:43 INFO crail: crail.cachelimit 0
19/06/25 10:46:43 INFO crail: crail.cachepath /dev/hugepages/cache
19/06/25 10:46:43 INFO crail: crail.user crail
19/06/25 10:46:43 INFO crail: crail.shadowreplication 1
19/06/25 10:46:43 INFO crail: crail.debug true
19/06/25 10:46:43 INFO crail: crail.statistics true
19/06/25 10:46:43 INFO crail: crail.rpctimeout 1000
19/06/25 10:46:43 INFO crail: crail.datatimeout 1000
19/06/25 10:46:43 INFO crail: crail.buffersize 1048576
19/06/25 10:46:43 INFO crail: crail.slicesize 65536
19/06/25 10:46:43 INFO crail: crail.singleton true
19/06/25 10:46:43 INFO crail: crail.regionsize 1073741824
19/06/25 10:46:43 INFO crail: crail.directoryrecord 512
19/06/25 10:46:43 INFO crail: crail.directoryrandomize true
19/06/25 10:46:43 INFO crail: crail.cacheimpl org.apache.crail.memory.MappedBufferCache
19/06/25 10:46:43 INFO crail: crail.locationmap
19/06/25 10:46:43 INFO crail: crail.namenode.address crail://192.168.1.164:9060
19/06/25 10:46:43 INFO crail: crail.namenode.blockselection roundrobin
19/06/25 10:46:43 INFO crail: crail.namenode.fileblocks 16
19/06/25 10:46:43 INFO crail: crail.namenode.rpctype org.apache.crail.namenode.rpc.tcp.TcpNameNode
19/06/25 10:46:43 INFO crail: crail.namenode.log
19/06/25 10:46:43 INFO crail: crail.storage.types org.apache.crail.storage.nvmf.NvmfStorageTier
19/06/25 10:46:43 INFO crail: crail.storage.classes 2
19/06/25 10:46:43 INFO crail: crail.storage.rootclass 0
19/06/25 10:46:43 INFO crail: crail.storage.keepalive 2
19/06/25 10:46:43 INFO crail: buffer cache, allocationCount 0, bufferCount 1024
19/06/25 10:46:43 INFO crail: Initialize Nvmf storage client
19/06/25 10:46:43 INFO crail: crail.storage.nvmf.ip 192.168.3.100
19/06/25 10:46:43 INFO crail: crail.storage.nvmf.port 4420
19/06/25 10:46:43 INFO crail: crail.storage.nvmf.nqn nqn.2018-12.com.StorEdgeSystems:cntlr13
19/06/25 10:46:43 INFO crail: crail.storage.nvmf.hostnqn nqn.2014-08.org.nvmexpress:uuid:1b4e28ba-2fa1-11d2-883f-0016d3cca420
19/06/25 10:46:43 INFO crail: crail.storage.nvmf.allocationsize 1073741824
19/06/25 10:46:43 INFO crail: crail.storage.nvmf.queueSize 64
19/06/25 10:46:43 INFO narpc: new NaRPC server group v1.0, queueDepth 32, messageSize 512, nodealy true
19/06/25 10:46:43 INFO crail: crail.namenode.tcp.queueDepth 32
19/06/25 10:46:43 INFO crail: crail.namenode.tcp.messageSize 512
19/06/25 10:46:43 INFO crail: crail.namenode.tcp.cores 1
19/06/25 10:46:43 INFO crail: connected to namenode(s) /192.168.1.164:9060
19/06/25 10:46:43 INFO CrailDispatcher: creating main dir /spark
19/06/25 10:46:43 INFO crail: lookupDirectory: path /spark
19/06/25 10:46:43 INFO crail: lookup: name /spark, success, fd 2
19/06/25 10:46:43 INFO CrailDispatcher: creating main dir /spark
19/06/25 10:46:43 INFO crail: delete: name /spark, recursive true
19/06/25 10:46:43 INFO crail: CoreOutputStream, open, path /, fd 0, streamId 1, isDir true, writeHint 0
19/06/25 10:46:43 INFO crail: Connecting to NVMf target at Transport address = /192.168.3.100:4420, subsystem NQN = nqn.2018-12.com.StorEdgeSystems:cntlr13
19/06/25 10:46:43 INFO disni: creating RdmaProvider of type 'nat'
19/06/25 10:46:43 INFO disni: jverbs jni version 32
19/06/25 10:46:43 INFO disni: sock_addr_in size mismatch, jverbs size 28, native size 16
19/06/25 10:46:43 INFO disni: IbvRecvWR size match, jverbs size 32, native size 32
19/06/25 10:46:43 INFO disni: IbvSendWR size mismatch, jverbs size 72, native size 128
19/06/25 10:46:43 INFO disni: IbvWC size match, jverbs size 48, native size 48
19/06/25 10:46:43 INFO disni: IbvSge size match, jverbs size 16, native size 16
19/06/25 10:46:43 INFO disni: Remote addr offset match, jverbs size 40, native size 40
19/06/25 10:46:43 INFO disni: Rkey offset match, jverbs size 48, native size 48
19/06/25 10:46:43 INFO disni: createEventChannel, objId 139964194703408
19/06/25 10:46:43 INFO disni: launching cm processor, cmChannel 0
19/06/25 10:46:43 INFO disni: createId, id 139964194793600
19/06/25 10:46:43 INFO disni: new client endpoint, id 0, idPriv 0
19/06/25 10:46:43 INFO disni: resolveAddr, addres /192.168.3.100:4420
19/06/25 10:46:43 INFO disni: resolveRoute, id 0
19/06/25 10:46:43 INFO disni: allocPd, objId 139964194800080
19/06/25 10:46:43 INFO disni: setting up protection domain, context 461, pd 1
19/06/25 10:46:43 INFO disni: new endpoint CQ processor
19/06/25 10:46:43 INFO disni: createCompChannel, context 139962246807056
19/06/25 10:46:43 INFO disni: createCQ, objId 139964194801488, ncqe 64
19/06/25 10:46:43 INFO disni: createQP, objId 139964194803176, send_wr size 32, recv_wr_size 32
19/06/25 10:46:43 INFO disni: connect, id 0
19/06/25 10:46:43 INFO disni: got event type + RDMA_CM_EVENT_ESTABLISHED, srcAddress /192.168.3.13:45059, dstAddress /192.168.3.100:4420
19/06/25 10:46:43 INFO disni: createId, id 139964195036000
19/06/25 10:46:43 INFO disni: new client endpoint, id 1, idPriv 0
19/06/25 10:46:43 INFO disni: resolveAddr, addres /192.168.3.100:4420
19/06/25 10:46:43 INFO disni: resolveRoute, id 0
19/06/25 10:46:43 INFO disni: setting up protection domain, context 461, pd 1
19/06/25 10:46:43 INFO disni: new endpoint CQ processor
19/06/25 10:46:43 INFO disni: createCompChannel, context 139962246807056
19/06/25 10:46:43 INFO disni: createCQ, objId 139964195036752, ncqe 128
19/06/25 10:46:43 INFO disni: createQP, objId 139964195037304, send_wr size 64, recv_wr_size 64
19/06/25 10:46:43 INFO disni: connect, id 0
19/06/25 10:46:43 INFO disni: got event type + RDMA_CM_EVENT_ESTABLISHED, srcAddress /192.168.3.13:57619, dstAddress /192.168.3.100:4420
19/06/25 10:46:43 INFO crail: EndpointCache miss /192.168.3.100:4420, fsId 0, cache size 1
19/06/25 10:46:43 INFO crail: delete: name /spark, recursive true, success
19/06/25 10:46:43 INFO crail: CoreOutputStream, close, path /, fd 0, streamId 1, capacity 262656
19/06/25 10:46:43 INFO crail: createNode: name /spark, type DIRECTORY, storageAffinity 0, locationAffinity 0
19/06/25 10:46:43 INFO crail: CoreOutputStream, open, path /, fd 0, streamId 2, isDir true, writeHint 0
19/06/25 10:46:43 INFO crail: EndpointCache hit /192.168.3.100:4420, fsId 0
19/06/25 10:46:43 INFO crail: createFile: name /spark, success, fd 4, token 0
19/06/25 10:46:43 INFO crail: CoreOutputStream, close, path /, fd 0, streamId 2, capacity 524800
19/06/25 10:46:43 INFO crail: createNode: name /spark/broadcast, type DIRECTORY, storageAffinity 0, locationAffinity 0
19/06/25 10:46:43 INFO crail: CoreOutputStream, open, path /spark, fd 4, streamId 3, isDir true, writeHint 0
19/06/25 10:46:43 INFO crail: Connecting to NVMf target at Transport address = /192.168.3.104:4420, subsystem NQN = nqn.2018-12.com.StorEdgeSystems:cntlr13 (<-- should be nqn.2018-12.com.StorEdgeSystems:worker-4)
19/06/25 10:46:43 INFO disni: createEventChannel, objId 139964195131168
19/06/25 10:46:43 INFO disni: createId, id 139964195141520
19/06/25 10:46:43 INFO disni: new client endpoint, id 2, idPriv 0
19/06/25 10:46:43 INFO disni: launching cm processor, cmChannel 0
19/06/25 10:46:43 INFO disni: resolveAddr, addres /192.168.3.104:4420
19/06/25 10:46:43 INFO disni: resolveRoute, id 0
19/06/25 10:46:43 INFO disni: setting up protection domain, context 461, pd 1
19/06/25 10:46:43 INFO disni: new endpoint CQ processor
19/06/25 10:46:43 INFO disni: createCompChannel, context 139962246807056
19/06/25 10:46:43 INFO disni: createCQ, objId 139964195142224, ncqe 64
19/06/25 10:46:43 INFO disni: createQP, objId 139964195151896, send_wr size 32, recv_wr_size 32
19/06/25 10:46:43 INFO disni: connect, id 0
19/06/25 10:46:43 INFO disni: got event type + RDMA_CM_EVENT_ESTABLISHED, srcAddress /192.168.3.13:35873, dstAddress /192.168.3.104:4420
19/06/25 10:46:43 INFO crail: ERROR: failed data operation
com.ibm.jnvmf.UnsuccessfulComandException: Command was not successful. {StatusCodeType: 1 - Command Specific, SatusCode: 132 - The host is not allowed to establish an association to any controller in the NVM subsystem or the host is not allowed to establish an association to the specified controller., CID: 0, Do_not_retry: false, More: false, SQHD: 0}
    at com.ibm.jnvmf.QueuePair.connect(QueuePair.java:128)
    at com.ibm.jnvmf.AdminQueuePair.connect(AdminQueuePair.java:36)

Regards,
David