It occurred to me that, since I’m using NVMf, the namenode may only be able to
handle one subsystem, and that for type-1 these resources would need to be added
under the same subsystem, but as different namespaces. Is this correct? I did try
this and it appears to be working. If this is the correct way of utilizing these,
how do I know which namespace(s) are being accessed? I don’t see anything in the
logs that identifies which one is being addressed.
Regards,
David
________________________________
From: David Crespi <david.cre...@storedgesystems.com>
Sent: Tuesday, June 25, 2019 11:09:56 AM
To: dev@crail.apache.org
Subject: Question on crail type-1 with terasort shuffle data
I’m starting to run terasort now with the shuffle data going to crail type-1.
I’ve got the type-2 set to nqn: nqn.2018-12.com.StorEdgeSystems:cntlr13 @192.168.3.100,
which appears to be working okay for the result data. When I introduce type-1 into
my config, it looks like the namenode gets confused and picks the nqn of the type-2
crailstore instead of the one it is assigned (marked with "<--" in the log below).
In my test run, I have 5 type-2 nodes (192.168.3.100, ns=1-5) and then 5 type-1
added to the spark worker nodes (192.168.3.101-105). All are running the same image
as the namenode, which is the spark master.
Is there some additional setting I need to include for this?
These are the variables that get set for each container (added via the environment):
# NVMf storage
crail.storage.nvmf.ip $NVMF_IP
crail.storage.nvmf.port $NVMF_PORT
crail.storage.nvmf.nqn $NVMF_NQN
crail.storage.nvmf.hostnqn $NVMF_HOSTNQN
crail.storage.nvmf.namespace $NVMF_NAMESPACE
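To make the substitution above concrete, here is a minimal sketch of how those five lines could be rendered from a container's environment. It is only an illustration: the function name `render_nvmf_conf` and the idea of generating the lines from a script are my own assumptions, not how the image actually does it. Only the property names and variable names come from the list above.

```python
import os

# Property name -> environment variable, exactly as listed above.
NVMF_PROPERTIES = {
    "crail.storage.nvmf.ip": "NVMF_IP",
    "crail.storage.nvmf.port": "NVMF_PORT",
    "crail.storage.nvmf.nqn": "NVMF_NQN",
    "crail.storage.nvmf.hostnqn": "NVMF_HOSTNQN",
    "crail.storage.nvmf.namespace": "NVMF_NAMESPACE",
}


def render_nvmf_conf(env=None):
    """Return the NVMf config lines with the variables filled in.

    `env` is any mapping of variable names to values; it defaults to the
    process environment. (Hypothetical helper, for illustration only.)
    """
    env = os.environ if env is None else env
    # Fail loudly if a container was started without one of the variables,
    # rather than writing an empty value into the config.
    missing = [var for var in NVMF_PROPERTIES.values() if not env.get(var)]
    if missing:
        raise ValueError("unset NVMf variables: " + ", ".join(missing))
    lines = ["# NVMf storage"]
    for prop, var in NVMF_PROPERTIES.items():
        lines.append(f"{prop} {env[var]}")
    return "\n".join(lines)
```

Printing the result inside each container would show which NQN and namespace that container was actually handed, which can then be compared with what the client reports when it connects.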
19/06/25 10:46:43 INFO crail: creating singleton crail file system
19/06/25 10:46:43 INFO crail: crail.version 3101
19/06/25 10:46:43 INFO crail: crail.directorydepth 16
19/06/25 10:46:43 INFO crail: crail.tokenexpiration 10
19/06/25 10:46:43 INFO crail: crail.blocksize 1048576
19/06/25 10:46:43 INFO crail: crail.cachelimit 0
19/06/25 10:46:43 INFO crail: crail.cachepath /dev/hugepages/cache
19/06/25 10:46:43 INFO crail: crail.user crail
19/06/25 10:46:43 INFO crail: crail.shadowreplication 1
19/06/25 10:46:43 INFO crail: crail.debug true
19/06/25 10:46:43 INFO crail: crail.statistics true
19/06/25 10:46:43 INFO crail: crail.rpctimeout 1000
19/06/25 10:46:43 INFO crail: crail.datatimeout 1000
19/06/25 10:46:43 INFO crail: crail.buffersize 1048576
19/06/25 10:46:43 INFO crail: crail.slicesize 65536
19/06/25 10:46:43 INFO crail: crail.singleton true
19/06/25 10:46:43 INFO crail: crail.regionsize 1073741824
19/06/25 10:46:43 INFO crail: crail.directoryrecord 512
19/06/25 10:46:43 INFO crail: crail.directoryrandomize true
19/06/25 10:46:43 INFO crail: crail.cacheimpl
org.apache.crail.memory.MappedBufferCache
19/06/25 10:46:43 INFO crail: crail.locationmap
19/06/25 10:46:43 INFO crail: crail.namenode.address
crail://192.168.1.164:9060
19/06/25 10:46:43 INFO crail: crail.namenode.blockselection
roundrobin
19/06/25 10:46:43 INFO crail: crail.namenode.fileblocks 16
19/06/25 10:46:43 INFO crail: crail.namenode.rpctype
org.apache.crail.namenode.rpc.tcp.TcpNameNode
19/06/25 10:46:43 INFO crail: crail.namenode.log
19/06/25 10:46:43 INFO crail: crail.storage.types
org.apache.crail.storage.nvmf.NvmfStorageTier
19/06/25 10:46:43 INFO crail: crail.storage.classes 2
19/06/25 10:46:43 INFO crail: crail.storage.rootclass 0
19/06/25 10:46:43 INFO crail: crail.storage.keepalive 2
19/06/25 10:46:43 INFO crail: buffer cache, allocationCount 0,
bufferCount 1024
19/06/25 10:46:43 INFO crail: Initialize Nvmf storage client
19/06/25 10:46:43 INFO crail: crail.storage.nvmf.ip 192.168.3.100
19/06/25 10:46:43 INFO crail: crail.storage.nvmf.port 4420
19/06/25 10:46:43 INFO crail: crail.storage.nvmf.nqn
nqn.2018-12.com.StorEdgeSystems:cntlr13
19/06/25 10:46:43 INFO crail: crail.storage.nvmf.hostnqn
nqn.2014-08.org.nvmexpress:uuid:1b4e28ba-2fa1-11d2-883f-0016d3cca420
19/06/25 10:46:43 INFO crail: crail.storage.nvmf.allocationsize
1073741824
19/06/25 10:46:43 INFO crail: crail.storage.nvmf.queueSize 64
19/06/25 10:46:43 INFO narpc: new NaRPC server group v1.0,
queueDepth 32, messageSize 512, nodealy true
19/06/25 10:46:43 INFO crail: crail.namenode.tcp.queueDepth 32
19/06/25 10:46:43 INFO crail: crail.namenode.tcp.messageSize 512
19/06/25 10:46:43 INFO crail: crail.namenode.tcp.cores 1
19/06/25 10:46:43 INFO crail: connected to namenode(s)
/192.168.1.164:9060
19/06/25 10:46:43 INFO CrailDispatcher: creating main dir /spark
19/06/25 10:46:43 INFO crail: lookupDirectory: path /spark
19/06/25 10:46:43 INFO crail: lookup: name /spark, success, fd 2
19/06/25 10:46:43 INFO CrailDispatcher: creating main dir /spark
19/06/25 10:46:43 INFO crail: delete: name /spark, recursive true
19/06/25 10:46:43 INFO crail: CoreOutputStream, open, path /, fd 0,
streamId 1, isDir true, writeHint 0
19/06/25 10:46:43 INFO crail: Connecting to NVMf target at Transport
address = /192.168.3.100:4420, subsystem NQN =
nqn.2018-12.com.StorEdgeSystems:cntlr13
19/06/25 10:46:43 INFO disni: creating RdmaProvider of type 'nat'
19/06/25 10:46:43 INFO disni: jverbs jni version 32
19/06/25 10:46:43 INFO disni: sock_addr_in size mismatch, jverbs
size 28, native size 16
19/06/25 10:46:43 INFO disni: IbvRecvWR size match, jverbs size 32,
native size 32
19/06/25 10:46:43 INFO disni: IbvSendWR size mismatch, jverbs size
72, native size 128
19/06/25 10:46:43 INFO disni: IbvWC size match, jverbs size 48,
native size 48
19/06/25 10:46:43 INFO disni: IbvSge size match, jverbs size 16,
native size 16
19/06/25 10:46:43 INFO disni: Remote addr offset match, jverbs size
40, native size 40
19/06/25 10:46:43 INFO disni: Rkey offset match, jverbs size 48,
native size 48
19/06/25 10:46:43 INFO disni: createEventChannel, objId
139964194703408
19/06/25 10:46:43 INFO disni: launching cm processor, cmChannel 0
19/06/25 10:46:43 INFO disni: createId, id 139964194793600
19/06/25 10:46:43 INFO disni: new client endpoint, id 0, idPriv 0
19/06/25 10:46:43 INFO disni: resolveAddr, addres
/192.168.3.100:4420
19/06/25 10:46:43 INFO disni: resolveRoute, id 0
19/06/25 10:46:43 INFO disni: allocPd, objId 139964194800080
19/06/25 10:46:43 INFO disni: setting up protection domain, context
461, pd 1
19/06/25 10:46:43 INFO disni: new endpoint CQ processor
19/06/25 10:46:43 INFO disni: createCompChannel, context
139962246807056
19/06/25 10:46:43 INFO disni: createCQ, objId 139964194801488, ncqe
64
19/06/25 10:46:43 INFO disni: createQP, objId 139964194803176,
send_wr size 32, recv_wr_size 32
19/06/25 10:46:43 INFO disni: connect, id 0
19/06/25 10:46:43 INFO disni: got event type +
RDMA_CM_EVENT_ESTABLISHED, srcAddress /192.168.3.13:45059, dstAddress
/192.168.3.100:4420
19/06/25 10:46:43 INFO disni: createId, id 139964195036000
19/06/25 10:46:43 INFO disni: new client endpoint, id 1, idPriv 0
19/06/25 10:46:43 INFO disni: resolveAddr, addres
/192.168.3.100:4420
19/06/25 10:46:43 INFO disni: resolveRoute, id 0
19/06/25 10:46:43 INFO disni: setting up protection domain, context
461, pd 1
19/06/25 10:46:43 INFO disni: new endpoint CQ processor
19/06/25 10:46:43 INFO disni: createCompChannel, context
139962246807056
19/06/25 10:46:43 INFO disni: createCQ, objId 139964195036752, ncqe
128
19/06/25 10:46:43 INFO disni: createQP, objId 139964195037304,
send_wr size 64, recv_wr_size 64
19/06/25 10:46:43 INFO disni: connect, id 0
19/06/25 10:46:43 INFO disni: got event type +
RDMA_CM_EVENT_ESTABLISHED, srcAddress /192.168.3.13:57619, dstAddress
/192.168.3.100:4420
19/06/25 10:46:43 INFO crail: EndpointCache miss
/192.168.3.100:4420, fsId 0, cache size 1
19/06/25 10:46:43 INFO crail: delete: name /spark, recursive true,
success
19/06/25 10:46:43 INFO crail: CoreOutputStream, close, path /, fd 0,
streamId 1, capacity 262656
19/06/25 10:46:43 INFO crail: createNode: name /spark, type
DIRECTORY, storageAffinity 0, locationAffinity 0
19/06/25 10:46:43 INFO crail: CoreOutputStream, open, path /, fd 0,
streamId 2, isDir true, writeHint 0
19/06/25 10:46:43 INFO crail: EndpointCache hit /192.168.3.100:4420,
fsId 0
19/06/25 10:46:43 INFO crail: createFile: name /spark, success, fd
4, token 0
19/06/25 10:46:43 INFO crail: CoreOutputStream, close, path /, fd 0,
streamId 2, capacity 524800
19/06/25 10:46:43 INFO crail: createNode: name /spark/broadcast,
type DIRECTORY, storageAffinity 0, locationAffinity 0
19/06/25 10:46:43 INFO crail: CoreOutputStream, open, path /spark,
fd 4, streamId 3, isDir true, writeHint 0
19/06/25 10:46:43 INFO crail: Connecting to NVMf target at Transport address = /192.168.3.104:4420, subsystem NQN = nqn.2018-12.com.StorEdgeSystems:cntlr13 (<-- should be nqn.2018-12.com.StorEdgeSystems:worker-4)
19/06/25 10:46:43 INFO disni: createEventChannel, objId
139964195131168
19/06/25 10:46:43 INFO disni: createId, id 139964195141520
19/06/25 10:46:43 INFO disni: new client endpoint, id 2, idPriv 0
19/06/25 10:46:43 INFO disni: launching cm processor, cmChannel 0
19/06/25 10:46:43 INFO disni: resolveAddr, addres
/192.168.3.104:4420
19/06/25 10:46:43 INFO disni: resolveRoute, id 0
19/06/25 10:46:43 INFO disni: setting up protection domain, context
461, pd 1
19/06/25 10:46:43 INFO disni: new endpoint CQ processor
19/06/25 10:46:43 INFO disni: createCompChannel, context
139962246807056
19/06/25 10:46:43 INFO disni: createCQ, objId 139964195142224, ncqe
64
19/06/25 10:46:43 INFO disni: createQP, objId 139964195151896,
send_wr size 32, recv_wr_size 32
19/06/25 10:46:43 INFO disni: connect, id 0
19/06/25 10:46:43 INFO disni: got event type +
RDMA_CM_EVENT_ESTABLISHED, srcAddress /192.168.3.13:35873, dstAddress
/192.168.3.104:4420
19/06/25 10:46:43 INFO crail: ERROR: failed data operation
com.ibm.jnvmf.UnsuccessfulComandException: Command was not successful. {StatusCodeType: 1 - Command Specific, SatusCode: 132 - The host is not allowed to establish an association to any controller in the NVM subsystem or the host is not allowed to establish an association to the specified controller., CID: 0, Do_not_retry: false, More: false, SQHD: 0}
    at com.ibm.jnvmf.QueuePair.connect(QueuePair.java:128)
    at com.ibm.jnvmf.AdminQueuePair.connect(AdminQueuePair.java:36)
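
To spot this kind of mismatch without reading the whole log by eye, something along these lines might help. It is only a sketch: the function name and the expected-NQN table are hypothetical, and the regex is based solely on the "Connecting to NVMf target" message as it appears in this log. It does not use any Crail API.

```python
import re

# Matches the client's connect message. Every gap is \s+ or \s* so the
# pattern still matches when the log line has been hard-wrapped.
CONNECT_RE = re.compile(
    r"Connecting\s+to\s+NVMf\s+target\s+at\s+Transport\s+address\s*=\s*"
    r"/(?P<ip>[\d.]+):(?P<port>\d+),\s*subsystem\s+NQN\s*=\s*(?P<nqn>\S+)"
)


def find_nqn_mismatches(log_text, expected_nqn_by_ip):
    """Return (ip, nqn_in_log, expected_nqn) for every connect message
    whose NQN differs from the one expected for that target IP.

    Targets that are not in `expected_nqn_by_ip` are skipped.
    """
    mismatches = []
    for match in CONNECT_RE.finditer(log_text):
        ip = match.group("ip")
        nqn = match.group("nqn")
        expected = expected_nqn_by_ip.get(ip)
        if expected is not None and nqn != expected:
            mismatches.append((ip, nqn, expected))
    return mismatches


# Only the two targets whose intended NQN is stated in this message.
EXPECTED_NQN_BY_IP = {
    "192.168.3.100": "nqn.2018-12.com.StorEdgeSystems:cntlr13",
    "192.168.3.104": "nqn.2018-12.com.StorEdgeSystems:worker-4",
}
```

Calling `find_nqn_mismatches(log_text, EXPECTED_NQN_BY_IP)` on the client log would list each target that was contacted with a different subsystem NQN than the one assigned to it.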
Regards,
David