Thanks for the help. I've been able to get the RDMA setup working and am
troubleshooting a few issues with the bench tests.  The issues so far have
all been configuration-related: the ulimit -l setting and an incorrect
value for "crail.namenode.rpctype".
I am ignoring the TCP tier for now since I don't really need it yet.
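
In case it helps others hitting the same "Memory registration failed"
error, here is a minimal sketch of the ulimit -l check. The helper name
memlock_ok and the 1 GiB threshold are my own assumptions for illustration,
not Crail settings; pick a threshold that matches the amount of memory your
datanode actually registers.

```shell
#!/usr/bin/env bash
# Sketch only: memlock_ok and the 1 GiB threshold below are illustrative
# assumptions, not part of Crail or DiSNI.

# memlock_ok CURRENT REQUIRED_KB
# CURRENT is the value printed by `ulimit -l`: a size in kB or "unlimited".
# Returns 0 if CURRENT is at least REQUIRED_KB, non-zero otherwise.
memlock_ok() {
  local current="$1" required_kb="$2"
  if [ "$current" = "unlimited" ]; then
    return 0
  fi
  [ "$current" -ge "$required_kb" ]
}

current="$(ulimit -l)"            # locked-memory limit of this shell
required_kb=$((1024 * 1024))      # assumed requirement: 1 GiB, in kB

if memlock_ok "$current" "$required_kb"; then
  echo "memlock limit OK: $current"
else
  echo "memlock limit too low: $current kB (< $required_kb kB)"
fi
```

The check has to run as the same user, and in the same kind of session,
that starts the datanode, since the limit is per process. Raising it is
what the memlock setting mentioned in the quoted reply below is for.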

I have more questions about data locality and Spark which I'll ask in
another post.

Thanks for all your help,

Sumit

On Sat, Jun 9, 2018 at 1:30 AM Animesh Trivedi <animesh.triv...@gmail.com>
wrote:

> Hi Sumit,
>
> Great that you attended the talk. Please also join the crail mailing list
> (cr...@crail.apache.org, cc'ed) and post issues there so that others can
> benefit from it. As you might have figured out, we are a new project, so
> we are still learning the ropes :)
>
> Having said that:
>
> 1) The RDMA tier failure looks like (i) the InfiniBand device is not set
> up properly (what does ibv_devices show?); and/or (ii) you do not have
> permission to register large memory segments (check with ulimit -l). I
> think the default is 64 kB. If that is so, then you have to increase the
> memory limit (https://access.redhat.com/solutions/61334, memlock). For
> the RDMA tier, crail needs to register memory that is typically more than
> just a few kB.
>
> 2) The TCP tier error is more cryptic, so maybe other developers have an
> idea what might be wrong. Could you also please post your crail
> configuration?
>
> Cheers,
> --
> Animesh
>
>
> On Sat, Jun 9, 2018 at 1:00 AM, Sumit Sen <sumit.1....@gmail.com> wrote:
>
>> Hi Animesh,
>>
>> I've just started trying to use Crail on a cluster running SLES12. I
>> attended the talk at Spark Summit which mentioned crail.  Our nodes are
>> connected with both ethernet and infiniband.  I want to run some of the
>> benchmarks to see what sort of performance I can get.  However I am running
>> into problems and haven't been able to figure out what to do.  Can you help
>> me or give me the name of someone else who can help?  I've given some
>> details below. I'd appreciate any help I can get to come up to speed on
>> this.
>>
>> Thanks,
>> Sumit
>>
>> Here are the issues I'm facing:
>> *RDMA configuration:*
>> Unable to start data node:
>> Exception in thread "main" java.io.IOException: Memory registration
>> failed with -1
>>         at
>> com.ibm.disni.rdma.verbs.impl.NatRegMrCall.execute(NatRegMrCall.java:80)
>>         at
>> com.ibm.disni.rdma.verbs.impl.NatRegMrCall.execute(NatRegMrCall.java:33)
>>         at
>> org.apache.crail.storage.rdma.RdmaStorageServer.allocateResource(RdmaStorageServer.java:120)
>>         at
>> org.apache.crail.storage.StorageServer.main(StorageServer.java:152)
>>
>> *TCP configuration:*
>> - both namenode and datanode start up
>> However, I can't run "iobench -t write". I get an immediate error that
>> crashes the JVM on the datanode.
>> I see the following stack trace on the iobench console:
>> warmUp, warmupFile /tmp.dat2001725267, operations 32
>> Exception in thread "main" java.util.concurrent.ExecutionException:
>> java.util.concurrent.ExecutionException:
>> java.util.concurrent.ExecutionException: java.io.IOException: Connection
>> reset by peer
>>         at org.apache.crail.utils.MultiFuture.get(MultiFuture.java:93)
>>         at
>> org.apache.crail.tools.CrailBenchmark.warmUp(CrailBenchmark.java:978)
>>         at
>> org.apache.crail.tools.CrailBenchmark.write(CrailBenchmark.java:97)
>>         at
>> org.apache.crail.tools.CrailBenchmark.main(CrailBenchmark.java:1070)
>> Caused by: java.util.concurrent.ExecutionException:
>> java.util.concurrent.ExecutionException: java.io.IOException: Connection
>> reset by peer
>>         at org.apache.crail.utils.MultiFuture.get(MultiFuture.java:93)
>>         at org.apache.crail.utils.MultiFuture.get(MultiFuture.java:78)
>>         ... 3 more
>> Caused by: java.util.concurrent.ExecutionException: java.io.IOException:
>> Connection reset by peer
>>         at com.ibm.narpc.NaRPCFuture.get(NaRPCFuture.java:73)
>>         at
>> org.apache.crail.storage.tcp.TcpStorageFuture.get(TcpStorageFuture.java:56)
>>         at
>> org.apache.crail.storage.tcp.TcpStorageFuture.get(TcpStorageFuture.java:30)
>>         at org.apache.crail.utils.MultiFuture.get(MultiFuture.java:78)
>>         ... 4 more
>> Caused by: java.io.IOException: Connection reset by peer
>>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>         at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>>         at com.ibm.narpc.NaRPCChannel.fetchBuffer(NaRPCChannel.java:51)
>>         at com.ibm.narpc.NaRPCEndpoint.pollResponse(NaRPCEndpoint.java:74)
>>         at com.ibm.narpc.NaRPCFuture.get(NaRPCFuture.java:70)
>>         ... 7 more
>>
>>
>
