Chaim,

Some comments inline:

On Mon, Aug 11, 2014 at 4:14 AM, Chaim Solomon <[email protected]> wrote:

> Hi,
>
> I've been running into an issue with the yz search acting up.
>
> I've been getting a lot of these:
>
> 2014-08-11 06:45:22.005 [error] <0.913.0>@yz_kv:index:206 failed to index
> object {<<"bucketname">>,<<"123">>} with error {"Failed to index
> docs",{error,req_timedout}} because
> [{yz_solr,index,3,[{file,"src/yz_solr.erl"},{line,192}]},
> {yz_kv,index,7,[{file,"src/yz_kv.erl"},{line,258}]},
> {yz_kv,index,3,[{file,"src/yz_kv.erl"},{line,193}]},
> {riak_kv_vnode,actual_put,6,[{file,"src/riak_kv_vnode.erl"},{line,1416}]},
> {riak_kv_vnode,perform_put,3,[{file,"src/riak_kv_vnode.erl"},{line,1404}]},
> {riak_kv_vnode,do_put,7,[{file,"src/riak_kv_vnode.erl"},{line,1199}]},
> {riak_kv_vnode,handle_command,3,[{file,"src/riak_kv_vnode.erl"},{line,485}]},
> {riak_core_vnode,vnode_command,3,[{file,"src/riak_core_vnode.erl"},{line,345}]}]
>
> and the Java process uses a lot of CPU and eventually runs out of memory
> or something like that and gets stuck. Killing the process gets the cluster
> back up and running.
>
> I am guessing that it may be data corruption on the yz data on one node.
>
> Clearing away the yz data on that node and restarting riak makes the
> system work again - and I guess AAE will rebuild the index.

This sounds very similar to the issue last week. I would certainly like to
rule out any sort of data corruption (do you think your disks are corrupting
the data, or are you assuming Solr is?). However, it is also possible, as
with the last issue, that the node/cluster simply does not have enough
memory. When you delete the data, Solr no longer has anything to cache in
memory and thus uses significantly less. As discussed, the recommended minimum

> But I'm wondering why a crashing Java on one node practically takes down
> the search on the cluster. Shouldn't Riak be more resilient than that?

The hard part here is that, at least initially, the Java process doesn't
crash; it just starts to time out. In distributed systems a slow node is
often worse than a down node. Riak, prior to 1.4, had something called
"health check" that would mark a node down in this situation. Unfortunately,
in some workloads (and, given your cluster's limited resources, I believe it
would happen here) this often results in excessive work being offloaded to
another node, which also does not have sufficient resources, and around we go
until the entire cluster falls over. A capacity problem, typically, can only
be solved by adding more capacity.

> Is there an explicit reindex command for the full text search subsystem?
>
> Could Riak keep an eye on the java process and restart it if it crashes or
> runs away?

Riak does manage the JVM process (starting/stopping/restarting). I agree that
if we could include runaway processes, like in your case, that would be even
better. I would have to think a bit more about how this would work (to
prevent the same problems mentioned above with the old-style health check).

Jordan

> Chaim Solomon
>
> _______________________________________________
> riak-users mailing list
> [email protected]
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
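
To make the cascading-overload point concrete, here is a minimal toy sketch in Python. It is not how Riak's health check was implemented; the function name, the per-node "capacity" and "load" numbers, and the even redistribution of work are all assumptions made purely for illustration. It marks any node whose load exceeds its capacity as down and spreads that node's work evenly over the nodes still up, repeating until nothing is overloaded or nothing is left.

```python
def simulate_health_check(capacities, loads):
    """Toy model (hypothetical, not Riak code): mark overloaded nodes down
    and spread their load evenly over the nodes that are still up."""
    loads = list(loads)                      # work on a copy
    up = set(range(len(capacities)))         # indices of nodes still up
    marked_down = []                         # nodes marked down, per round
    while up:
        overloaded = sorted(i for i in up if loads[i] > capacities[i])
        if not overloaded:
            break                            # cluster is stable
        marked_down.append(overloaded)
        shed = sum(loads[i] for i in overloaded)
        up.difference_update(overloaded)
        for i in overloaded:
            loads[i] = 0
        for i in up:                         # skipped when no node is left
            loads[i] += shed / len(up)
    return sorted(up), marked_down


# Five nodes with little headroom: one overloaded node drags down the rest.
print(simulate_health_check([100] * 5, [110, 90, 90, 90, 90]))
# Same shape of load, but with spare capacity: the cluster absorbs the loss.
print(simulate_health_check([150] * 5, [160, 90, 90, 90, 90]))
```

In the first call no node survives, because the four remaining nodes each inherit more work than they can handle; in the second call the remaining four nodes have enough headroom and stay up. Under these assumptions, the only difference between the two outcomes is capacity.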
