We're seeing instances of a JVM app that talks to Riak run out of memory when Riak operations rise in latency or Riak otherwise becomes unresponsive. A heap dump of the JVM taken at the time of the OOM shows that 91% of the 1 GB (active) heap is consumed by large byte[] instances. In our case, 3 of those byte[]s are in the 200 MB range, with sizes dropping off after that. The byte[] instances cannot be traced back to a specific field, as they appear to be referenced only by local variables on a thread's stack. Based on the name of the thread, though, we can tell that it is doing a store operation against riak@localhost.
Inspection of the data in one of these byte[]s shows what looks like an r_object response, with header and footer boilerplate around our object payload. This 200+ MB byte[] is filled with 0s after the 338th element, which is really confusing and suggests that far too much space is being allocated to read the protobuf payload. Here's a dump of one of these instances: https://gist.github.com/40ef9b2ff561e973a72c

It's also worth mentioning that, according to /stats, get_fsm_objsize_100 is consistently under 1 MB, so there is no reason to think that our objects are actually this large.

At this point I suspect the following code is creating too large a byte[], possibly from too large a value returned by dis.readInt(): https://github.com/basho/riak-java-client/blob/master/src/main/java/com/basho/riak/pbc/RiakConnection.java#L110

I'm unsure whether that indicates a problem in the driver or in the server-side Erlang protobuf server. My guess is that requests pile up and many of these byte[]s are left hanging around, enough of them to cause an OOM. It's possible that they are always very large but are normally short-lived enough not to cause a problem until latencies rise, which briefly increases their numbers.

Thoughts?

Thanks,
D
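
P.S. To make the suspicion concrete, here is a minimal sketch of the pattern I have in mind, with a sanity cap added. It is not the driver's code: BoundedFrameReader, readPayload, and the cap value are hypothetical names of my own, and I'm assuming the framing is a 4-byte big-endian length (covering the message code plus payload), then a 1-byte message code, then the payload. The point is that the array is sized from the length prefix before any payload bytes arrive, so a bogus or oversized prefix allocates the whole buffer up front, and a slow read leaves it mostly zeros.

```java
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical helper for illustration only; not part of riak-java-client.
final class BoundedFrameReader {
    // Assumed cap; choose something comfortably above your largest object.
    static final int DEFAULT_MAX_FRAME_BYTES = 16 * 1024 * 1024;

    /**
     * Reads one frame, assuming the layout: 4-byte big-endian length,
     * 1-byte message code, then (length - 1) payload bytes.
     * Rejects the frame before allocating if the length is implausible.
     */
    static byte[] readPayload(DataInputStream in, int maxFrameBytes) throws IOException {
        int len = in.readInt(); // assumed to cover message code + payload
        if (len < 1 || len > maxFrameBytes) {
            throw new IOException("Implausible frame length: " + len);
        }
        in.readUnsignedByte(); // message code, ignored in this sketch

        // Allocation happens here, before any payload has been read.
        byte[] payload = new byte[len - 1];
        in.readFully(payload); // blocks until the whole payload arrives
        return payload;
    }
}
```

With a cap like this, a length prefix in the 200 MB range would fail fast with an IOException instead of pinning a huge, mostly empty buffer for as long as the read is stalled.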
