Hello,
I have run into a production issue that I think stems from either a defect in
the com.basho.riak:riak-client:jar:1.0.4 library, or a misunderstanding in my
use of it. I'm actively trying to fix the issue, but I thought I'd put a
feeler out to this list to see if others have encountered the issue, or whether
there's a clear problem in our use of the library.
Environment:
---------------------
* Java web application running in Tomcat:
** JDK: jdk1.6.0_24-jce6
** Tomcat: apache-tomcat-7.0.23
** Basho Riak Client version: com.basho.riak:riak-client:jar:1.0.4
* 6 node Riak cluster running Riak 1.0.1
Error sequence:
-----------------------
The production issue has happened a few times, and it follows this sequence:
1. We get a rash of SocketException: Connection Reset errors
java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:168)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
at java.io.DataInputStream.readInt(DataInputStream.java:370)
at com.basho.riak.pbc.RiakConnection.receive(RiakConnection.java:92)
at com.basho.riak.pbc.RiakClient.processFetchReply(RiakClient.java:278)
at com.basho.riak.pbc.RiakClient.fetch(RiakClient.java:252)
at com.basho.riak.pbc.RiakClient.fetch(RiakClient.java:241)
at
com.basho.riak.client.raw.pbc.PBClientAdapter.fetch(PBClientAdapter.java:156)
at
com.basho.riak.client.raw.pbc.PBClientAdapter.fetch(PBClientAdapter.java:139)
at com.basho.riak.client.raw.ClusterClient.fetch(ClusterClient.java:107)
2. Followed 50 milliseconds later by a steady stream of SocketException: Broken
pipe messages, until we restart the Tomcat container.
java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
at java.io.DataOutputStream.flush(DataOutputStream.java:106)
at com.basho.riak.pbc.RiakConnection.send(RiakConnection.java:82)
at com.basho.riak.pbc.RiakClient.fetch(RiakClient.java:251)
at com.basho.riak.pbc.RiakClient.fetch(RiakClient.java:241)
at
com.basho.riak.client.raw.pbc.PBClientAdapter.fetch(PBClientAdapter.java:156)
at
com.basho.riak.client.raw.pbc.PBClientAdapter.fetch(PBClientAdapter.java:139)
at com.basho.riak.client.raw.ClusterClient.fetch(ClusterClient.java:107)
3. Within our 6-node Riak cluster, almost exactly 1 minute after the initial
connection reset errors, one node emits a crash log:
2012-03-27 14:56:51 =ERROR REPORT====
** Generic server <0.289.0> terminating
** Last message in was {inet_async,#Port<0.3319>,41462,{ok,#Port<0.73913715>}}
** When Server state == {state,riak_kv_pb_listener,#Port<0.3319>,{state,8087}}
** Reason for termination ==
** {timeout,{'gen_server2',call,[<0.1838.1574>,{set_socket,#Port<0.73913715>}]}}
2012-03-27 14:57:08 =CRASH REPORT====
crasher:
initial call: gen_nb_server:init/1
pid: <0.289.0>
registered_name: []
exception exit:
{timeout,{'gen_server2',call,[<0.1838.1574>,{set_socket,#Port<0.73913715>}]}}
in function gen_server:terminate/6
in call from proc_lib:init_p_do_apply/3
ancestors: [riak_kv_sup,<0.194.0>]
messages: [{#Ref<0.0.704.111074>,ok}]
links: [<0.200.0>]
dictionary: []
trap_exit: false
status: running
heap_size: 377
stack_size: 24
reductions: 117962
neighbours:
2012-03-27 14:57:09 =SUPERVISOR REPORT====
Supervisor: {local,riak_kv_sup}
Context: child_terminated
Reason:
{timeout,{'gen_server2',call,[<0.1838.1574>,{set_socket,#Port<0.73913715>}]}}
Offender:
[{pid,<0.289.0>},{name,riak_kv_pb_listener},{mfargs,{riak_kv_pb_listener,start_link,[]}},{restart_type,permanent},{shutdown,5000},{child_type,worker}]
The Riak cluster seems to bounce back to health (all nodes connected and
responding) by the time we see the errors and check it, but the clients (the
Tomcat application) never recover until we restart them. It seems pretty clear
that a process within the Riak cluster is dying and taking its sockets with it,
after which the clients are not recovering.
Theory
----------
The working theory is that the client library is never flushing out bad
connections. You can see here that connections are always returned to the
pool. I have not yet seen any evidence that connections are ever tested for
health once allocated.
>From com.basho.riak.pbc.RiakClient, line 224-237:
public RiakObject[] fetch(ByteString bucket, ByteString key, int
readQuorum)
throws IOException {
RpbGetReq req = RPB.RpbGetReq.newBuilder().setBucket(bucket)
.setKey(key).setR(readQuorum).build();
RiakConnection c = getConnection();
try {
c.send(MSG_GetReq, req);
return processFetchReply(c, bucket, key).getObjects();
} finally {
release(c);
}
}
And this is how our reference to the client is created, using client-side
cluster configs:
// set up a PBClientConfig per host
for(String host : hosts) {
PBClientConfig pbcConfig = new PBClientConfig.Builder()
.withInitialPoolSize(config.getInitialPoolSize())
.withPort(config.getRiakPort())
.withHost(host.trim())
.withConnectionTimeoutMillis(config.getConnectionWaitTimeoutMillis())
.withIdleConnectionTTLMillis(config.getIdleConnectionTTLMillis())
.withSocketBufferSizeKb(config.getBufferSizeKb())
.withPoolSize(config.getMaximunPoolSize())
.build();
clusterConfig.addClient(pbcConfig);
}
// Connection pooling is done internally in the PBClusterClient
created by the factory
RawClient rawClient =
PBClusterClientFactory.getInstance().newClient(clusterConfig);
this.client = rawClient;
Is the expectation within the client library that our own application code
would detect bad connections and recreate the pool / client once we've detected
them?
Thanks,
Will
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com