Hi Sumit,

Great that you attended the talk. Please also join the Crail mailing list (
cr...@crail.apache.org, cc'ed) and post issues there so that others can
benefit from them. As you might have figured out, we are a new project,
so we are still learning the ropes :)

Having said that:

1) The RDMA tier failure looks like either (i) the InfiniBand device is not
set up properly (what does ibv_devices show?), and/or (ii) you do not have
permission to register large memory segments (check with ulimit -l). I
think the default is 64 kB. If that is so, then you have to increase the
memory limit (https://access.redhat.com/solutions/61334, memlock). For the
RDMA tier, Crail needs to register memory that is typically much more than
just a few kB.
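
If it helps, here is a minimal sketch in Python for checking (ii) from the
account that starts the datanode. It only reads the locked-memory limit of
the current process (the same value ulimit -l reports, but in bytes) and
compares it with a region size. The 1 GiB figure and the helper names are
assumptions for illustration only, not values taken from Crail; use
whatever size your configuration actually asks for.

```python
import resource

# Hypothetical size of the memory region the datanode would register.
# This is an illustrative assumption, not a value read from Crail.
REQUIRED_BYTES = 1 << 30  # 1 GiB


def memlock_ok(required_bytes, soft_limit=None):
    """Return True if the soft locked-memory limit allows required_bytes.

    If soft_limit is None, the limit of the current process is used.
    Limits are expressed in bytes, as returned by resource.getrlimit.
    """
    if soft_limit is None:
        soft_limit, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= required_bytes


def describe(limit):
    """Format a limit in kB, the unit that ulimit -l prints."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit // 1024} kB"


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    print("soft memlock limit:", describe(soft))
    print("hard memlock limit:", describe(hard))
    if not memlock_ok(REQUIRED_BYTES, soft):
        print("limit is too small for a region of", describe(REQUIRED_BYTES))
```

Run it as the same user, and in the same kind of shell or service, that
starts the datanode, since the limit is per process and a login shell may
see a different value.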

2) The TCP tier error is more cryptic, so maybe other developers have
an idea what might be wrong. Could you also please post your Crail
configuration?

Cheers,
--
Animesh


On Sat, Jun 9, 2018 at 1:00 AM, Sumit Sen <sumit.1....@gmail.com> wrote:

> Hi Animesh,
>
> I've just started trying to use Crail on a cluster running SLES12. I
> attended the talk at Spark Summit which mentioned crail.  Our nodes are
> connected with both ethernet and infiniband.  I want to run some of the
> benchmarks to see what sort of performance I can get.  However I am running
> into problems and haven't been able to figure out what to do.  Can you help
> me or give me the name of someone else who can help?  I've given some
> details below. I'd appreciate any help I can get to come up to speed on
> this.
>
> Thanks,
> Sumit
>
> Here are the issues I'm facing:
> *RDMA configuration:*
> Unable to start data node:
> Exception in thread "main" java.io.IOException: Memory registration failed
> with -1
>         at com.ibm.disni.rdma.verbs.impl.NatRegMrCall.execute(
> NatRegMrCall.java:80)
>         at com.ibm.disni.rdma.verbs.impl.NatRegMrCall.execute(
> NatRegMrCall.java:33)
>         at org.apache.crail.storage.rdma.RdmaStorageServer.
> allocateResource(RdmaStorageServer.java:120)
>         at org.apache.crail.storage.StorageServer.main(
> StorageServer.java:152)
>
> *TCP configuration:*
> - both namenode and datanode start up
> However, I can't run "iobench -t write". I get an immediate error that
> crashes the jvm on the datanode
> I see the following stack on the iobench console:
> warmUp, warmupFile /tmp.dat2001725267, operations 32
> Exception in thread "main" java.util.concurrent.ExecutionException:
> java.util.concurrent.ExecutionException: 
> java.util.concurrent.ExecutionException:
> java.io.IOException: Connection reset by peer
>         at org.apache.crail.utils.MultiFuture.get(MultiFuture.java:93)
>         at org.apache.crail.tools.CrailBenchmark.warmUp(
> CrailBenchmark.java:978)
>         at org.apache.crail.tools.CrailBenchmark.write(
> CrailBenchmark.java:97)
>         at org.apache.crail.tools.CrailBenchmark.main(
> CrailBenchmark.java:1070)
> Caused by: java.util.concurrent.ExecutionException: 
> java.util.concurrent.ExecutionException:
> java.io.IOException: Connection reset by peer
>         at org.apache.crail.utils.MultiFuture.get(MultiFuture.java:93)
>         at org.apache.crail.utils.MultiFuture.get(MultiFuture.java:78)
>         ... 3 more
> Caused by: java.util.concurrent.ExecutionException: java.io.IOException:
> Connection reset by peer
>         at com.ibm.narpc.NaRPCFuture.get(NaRPCFuture.java:73)
>         at org.apache.crail.storage.tcp.TcpStorageFuture.get(
> TcpStorageFuture.java:56)
>         at org.apache.crail.storage.tcp.TcpStorageFuture.get(
> TcpStorageFuture.java:30)
>         at org.apache.crail.utils.MultiFuture.get(MultiFuture.java:78)
>         ... 4 more
> Caused by: java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>         at com.ibm.narpc.NaRPCChannel.fetchBuffer(NaRPCChannel.java:51)
>         at com.ibm.narpc.NaRPCEndpoint.pollResponse(NaRPCEndpoint.java:74)
>         at com.ibm.narpc.NaRPCFuture.get(NaRPCFuture.java:70)
>         ... 7 more
>
>
