If Solr is stumbling over bad data, your node's solr.log should be filled with errors. If Yokozuna is stumbling over bad data that it keeps trying to send to Solr in a loop, console.log should be full. If Yokozuna is indexing bad values (such as unparsable JSON), it will index a blank object with a _yz_err field (just search for existence of that field). If you have a case of sibling explosion, you'll have many duplicates of the same object with different _yz_vtag fields (again, search for existence).
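As a rough sketch of the existence searches mentioned above: the snippet below builds a query URL for a field-existence search (field:*) and, optionally, fetches the match count. The host and port (localhost:8098), the index name "my_index", and the helper names are assumptions for illustration only. The /search/query/<index> path is the Riak 2.0 search HTTP endpoint as I understand it, so adjust it if your setup differs.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def existence_query_url(base_url, index, field, rows=10):
    """Build a search URL matching every document that has `field` set.

    `field:*` is the Solr idiom for "this field exists".
    """
    params = urlencode({"wt": "json", "q": f"{field}:*", "rows": rows})
    return f"{base_url}/search/query/{index}?{params}"


def count_matches(url):
    """Fetch the query and return Solr's numFound (requires a live node)."""
    with urlopen(url) as resp:
        body = json.load(resp)
    return body["response"]["numFound"]


if __name__ == "__main__":
    # Hypothetical node address and index name; replace with your own.
    base, index = "http://localhost:8098", "my_index"
    # _yz_err marks extraction failures; _yz_vtag is set on siblings.
    for field in ("_yz_err", "_yz_vtag"):
        print(existence_query_url(base, index, field))
```

A non-zero count for _yz_err would point at extraction failures, and a large count for _yz_vtag would point at siblings.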
You said it's not a resource issue, but just to rule that out, how much RAM does each node have? Also, how much is made available to Solr? You can adjust the max heap size given to Solr in riak.conf by changing the search.solr.jvm_options max heap size value from -Xmx1g to -Xmx2g or more.

Eric

On Aug 11, 2014, at 8:03 AM, Chaim Solomon <[email protected]> wrote:

> Hi,
>
> I don't think that it is a resource issue now.
>
> After removing the data, the other nodes had low load and are handling the
> workload just fine.
> And the Java process - when it crashed - was really dead, on shutting down
> Riak it stayed around and needed a -9 to go away.
>
> I don't think the disks are a problem but rather suspect that a crash may
> have caused Solr to stumble over bad data and then crash.
>
> Chaim Solomon
>
>
> On Mon, Aug 11, 2014 at 5:47 PM, Jordan West <[email protected]> wrote:
> Chaim,
>
> Some comments inline:
>
> On Mon, Aug 11, 2014 at 4:14 AM, Chaim Solomon <[email protected]>
> wrote:
> Hi,
>
> I've been running into an issue with the yz search acting up.
>
> I've been getting a lot of these:
>
> 2014-08-11 06:45:22.005 [error] <0.913.0>@yz_kv:index:206 failed to index
> object {<<"bucketname">>,<<"123">>} with error {"Failed to index
> docs",{error,req_timedout}} because
> [{yz_solr,index,3,[{file,"src/yz_solr.erl"},{line,192}]},
> {yz_kv,index,7,[{file,"src/yz_kv.erl"},{line,258}]},
> {yz_kv,index,3,[{file,"src/yz_kv.erl"},{line,193}]},
> {riak_kv_vnode,actual_put,6,[{file,"src/riak_kv_vnode.erl"},{line,1416}]},
> {riak_kv_vnode,perform_put,3,[{file,"src/riak_kv_vnode.erl"},{line,1404}]},
> {riak_kv_vnode,do_put,7,[{file,"src/riak_kv_vnode.erl"},{line,1199}]},
> {riak_kv_vnode,handle_command,3,[{file,"src/riak_kv_vnode.erl"},{line,485}]},
> {riak_core_vnode,vnode_command,3,[{file,"src/riak_core_vnode.erl"},{line,345}]}]
>
> and the Java process uses a lot of CPU and eventually runs out of memory or
> something like that and gets stuck. Killing the process gets the cluster back
> up and running.
>
> I am guessing that it may be data corruption on the yz data on one node.
>
> Clearing away the yz data on that node and restarting riak makes the system
> work again - and I guess AAE will rebuild the index.
>
>
> This sounds very similar to the issue last week. I would certainly like to
> rule out any sort of data corruption (are you thinking your disks are
> corrupting the data or are you assuming Solr is?).
>
> However, it is also possible, like the last issue, that the node/cluster
> simply does not have enough memory. When you delete the data Solr no longer
> has anything to cache in-memory thus using significantly less. As discussed,
> the recommended minimum
>
> But I'm wondering why a crashing Java on one node practically takes down the
> search on the cluster. Shouldn't Riak be more resilient than that?
>
> The hard part here is, at least initially, the Java process doesn't crash, it
> just starts to timeout. In distributed systems a slow-node is often worse
> than a down node. Riak, prior to 1.4 had something called "health check" that
> would mark a node down in this situation. Unfortunately in some workloads,
> and I believe given your cluster's limited resources it would happen here,
> this often results in excessive work being offloaded to another node, which
> also does not have sufficient resources and around we go until the entire
> cluster falls over. A capacity problem, typically, can only be solved by
> adding more capacity.
>
>
> Is there an explicit reindex command for the full text search subsystem?
>
> Could Riak keep an eye on the java process and restart it if it crashes or
> runs away?
>
>
> Riak does manage the JVM process (starting/stopping/restarting). I agree that
> if we could include run-away process, like in your case, that would be even
> better. I would have to think a bit more about how this would work (to
> prevent the same problems mentioned above with the old-style health check)
>
> Jordan
>
>
> Chaim Solomon
>
>
> _______________________________________________
> riak-users mailing list
> [email protected]
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
