Hi Ryan,
Thanks for your thoughts on this. However, we are afraid that is not really
the case.
To rule this out further, we went ahead and reinstalled all the nodes in our
cluster by downloading the current Riak release from here:
http://downloads.basho.com.s3-website-us-east-1.amazonaws.com/riak/CURRENT/riak-1.2.1.tar.gz
~~~~~~~~~~~~
We start the Riak nodes with the following changes in the etc/app.config file:
{pb_ip, "172.17.3.82" },
{http, [ {"172.17.3.82", 8098 } ]},
{storage_backend, riak_kv_eleveldb_backend},
{riak_search, [{enabled, true}]},
and of course we update etc/vm.args so that its first line is:
-name [email protected]
~~~~~~~~~~~~
Finally, we start the 2 nodes in the cluster, then join, plan, and commit, which gives us:
$ ./bin/riak-admin member-status
================================= Membership ==================================
Status Ring Pending Node
-------------------------------------------------------------------------------
valid 50.0% -- '[email protected]'
valid 50.0% -- '[email protected]'
-------------------------------------------------------------------------------
Valid:2 / Leaving:0 / Exiting:0 / Joining:0 / Down:0
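
To double-check this output after each reinstall, a small Python sketch along
these lines could parse it and confirm that every member is valid, nothing is
pending, and the ring ownership adds up to 100%. It assumes the four-column
layout shown above; the function names are our own and not part of Riak.

```python
def parse_member_status(output):
    """Parse `riak-admin member-status` text into a list of member dicts.

    Assumes the four-column layout shown above: Status, Ring, Pending, Node.
    """
    members = []
    for line in output.splitlines():
        # Split into at most 4 fields so the node name stays in one piece.
        parts = line.split(None, 3)
        if len(parts) == 4 and parts[1].endswith("%"):
            status, ring, pending, node = parts
            members.append({
                "status": status,
                "ring": float(ring.rstrip("%")),
                "pending": pending,
                "node": node.strip().strip("'"),
            })
    return members


def cluster_looks_settled(members):
    """True if all members are 'valid', none is pending, and ring sums to ~100%."""
    return (
        bool(members)
        and all(m["status"] == "valid" for m in members)
        and all(m["pending"] == "--" for m in members)
        and abs(sum(m["ring"] for m in members) - 100.0) < 0.5
    )
```

Feeding it the output above should report both nodes as valid with 50.0% each.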
~~~~~~~~~~~~
After joining the nodes, we also install the necessary search hook by running
the following command on the [email protected] node only:
$ ./bin/search-cmd install ejabberd_doc
:: Installing Riak Search <--> KV hook on bucket 'ejabberd_doc'.
~~~~~~~~~~~~
All search queries run fine when there are no docs inside Riak. However, as soon
as we insert even a single document, search queries start to fail.
As before, here are the logs from Riak's error.log file (the query is made
to node [email protected]):
error.log on [email protected] contains:
2012-12-20 16:27:37.877 [error] <0.1821.0>@mi_server:handle_info:524
lookup/range failure:
{{badfun,#Fun<riak_search_client.9.56347389>},[{mi_server,iterate,6},{mi_server,lookup,8}]}
2012-12-20 16:27:37.878 [error] emulator Error in process <0.4075.0> on node
'[email protected]' with exit value:
{{badfun,#Fun<riak_search_client.9.56347389>},[{mi_server,iterate,6},{mi_server,lookup,8}]}
2012-12-20 16:27:37.878 [error] <0.1940.0>@mi_server:handle_info:524
lookup/range failure:
{{badfun,#Fun<riak_search_client.9.56347389>},[{mi_server,iterate,6},{mi_server,lookup,8}]}
2012-12-20 16:27:37.882 [error] emulator Error in process <0.4077.0> on node
'[email protected]' with exit value:
{{badfun,#Fun<riak_search_client.9.56347389>},[{mi_server,iterate,6},{mi_server,lookup,8}]}
error.log on [email protected] contains:
2012-12-20 16:28:17.944 [error] <0.2511.0> gen_server <0.2511.0> terminated
with reason:
{error,{case_clause,timeout},[{riak_search_client,search_doc,8,[{file,"src/riak_search_client.erl"},{line,165}]},{riak_search_utils,run_query,7,[{file,"src/riak_search_utils.erl"},{line,283}]},{riak_search_pb_query,run_query,7,[{file,"src/riak_search_pb_query.erl"},{line,96}]},{riak_search_pb_query,process,2,[{file,"src/riak_search_pb_query.erl"},{line,80}]},{riak_api_pb_server,process_message,4,[{file,"src/riak_api_pb_server.erl"},{line,203}]},{riak_api_pb_server,handle_info,2,[{file,"src/..."},...]},...]}
2012-12-20 16:28:17.949 [error] <0.2511.0> CRASH REPORT Process <0.2511.0> with
2 neighbours exited with reason:
{error,{case_clause,timeout},[{riak_search_client,search_doc,8,[{file,"src/riak_search_client.erl"},{line,165}]},{riak_search_utils,run_query,7,[{file,"src/riak_search_utils.erl"},{line,283}]},{riak_search_pb_query,run_query,7,[{file,"src/riak_search_pb_query.erl"},{line,96}]},{riak_search_pb_query,process,2,[{file,"src/riak_search_pb_query.erl"},{line,80}]},{riak_api_pb_server,process_message,4,[{file,"src/riak_api_pb_server.erl"},{line,203}]},{riak_api_pb_server,handle_info,2,[{file,"src/..."},...]},...]}
in gen_server:terminate/6 line 737
2012-12-20 16:28:17.952 [error] <0.172.0> Supervisor riak_api_pb_sup had child
undefined started with {riak_api_pb_server,start_link,undefined} at <0.2511.0>
exit with reason
{error,{case_clause,timeout},[{riak_search_client,search_doc,8,[{file,"src/riak_search_client.erl"},{line,165}]},{riak_search_utils,run_query,7,[{file,"src/riak_search_utils.erl"},{line,283}]},{riak_search_pb_query,run_query,7,[{file,"src/riak_search_pb_query.erl"},{line,96}]},{riak_search_pb_query,process,2,[{file,"src/riak_search_pb_query.erl"},{line,80}]},{riak_api_pb_server,process_message,4,[{file,"src/riak_api_pb_server.erl"},{line,203}]},{riak_api_pb_server,handle_info,2,[{file,"src/..."},...]},...]}
in context child_terminated
2012-12-20 16:28:37.888 [error] <0.2504.0> gen_server <0.2504.0> terminated
with reason:
{error,{case_clause,timeout},[{riak_search_client,search_doc,8,[{file,"src/riak_search_client.erl"},{line,165}]},{riak_search_utils,run_query,7,[{file,"src/riak_search_utils.erl"},{line,283}]},{riak_search_pb_query,run_query,7,[{file,"src/riak_search_pb_query.erl"},{line,96}]},{riak_search_pb_query,process,2,[{file,"src/riak_search_pb_query.erl"},{line,80}]},{riak_api_pb_server,process_message,4,[{file,"src/riak_api_pb_server.erl"},{line,203}]},{riak_api_pb_server,handle_info,2,[{file,"src/..."},...]},...]}
2012-12-20 16:28:37.901 [error] <0.2504.0> CRASH REPORT Process <0.2504.0> with
1 neighbours exited with reason:
{error,{case_clause,timeout},[{riak_search_client,search_doc,8,[{file,"src/riak_search_client.erl"},{line,165}]},{riak_search_utils,run_query,7,[{file,"src/riak_search_utils.erl"},{line,283}]},{riak_search_pb_query,run_query,7,[{file,"src/riak_search_pb_query.erl"},{line,96}]},{riak_search_pb_query,process,2,[{file,"src/riak_search_pb_query.erl"},{line,80}]},{riak_api_pb_server,process_message,4,[{file,"src/riak_api_pb_server.erl"},{line,203}]},{riak_api_pb_server,handle_info,2,[{file,"src/..."},...]},...]}
in gen_server:terminate/6 line 737
2012-12-20 16:28:37.904 [error] <0.172.0> Supervisor riak_api_pb_sup had child
undefined started with {riak_api_pb_server,start_link,undefined} at <0.2504.0>
exit with reason
{error,{case_clause,timeout},[{riak_search_client,search_doc,8,[{file,"src/riak_search_client.erl"},{line,165}]},{riak_search_utils,run_query,7,[{file,"src/riak_search_utils.erl"},{line,283}]},{riak_search_pb_query,run_query,7,[{file,"src/riak_search_pb_query.erl"},{line,96}]},{riak_search_pb_query,process,2,[{file,"src/riak_search_pb_query.erl"},{line,80}]},{riak_api_pb_server,process_message,4,[{file,"src/riak_api_pb_server.erl"},{line,203}]},{riak_api_pb_server,handle_info,2,[{file,"src/..."},...]},...]}
in context child_terminated
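
To make the two logs easier to compare, here is a minimal Python sketch that
counts the two error signatures seen above: the badfun failure on one node and
the {case_clause,timeout} crash on the other. The signature labels and the
function name are our own; the patterns are taken directly from the log lines
above.

```python
import re
from collections import Counter

# Labels are hypothetical; patterns match the Erlang terms in the logs above.
SIGNATURES = {
    "badfun": re.compile(r"\{badfun,"),
    "search_timeout": re.compile(r"\{case_clause,timeout\}"),
}


def count_signatures(log_text):
    """Count how many log lines contain each known error signature."""
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Running it over each node's error.log separately should show whether the badfun
lines appear only on the node that serves the index lookup and the timeouts only
on the node that received the query, as they do in the excerpts above.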
~~~~~~~~~~~~
We don't actually have mixed Riak releases. But yes, we do have mixed Erlang
releases. We are not sure whether that makes any difference here.
[email protected]
Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:0]
[kernel-poll:false]
[email protected]
Erlang R15B (erts-5.9) [source] [64-bit] [smp:4:4] [async-threads:0] [hipe]
[kernel-poll:false]
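
In case the Erlang release does turn out to matter, here is a minimal Python
sketch of how the banners above could be compared across nodes. The node labels
in the usage are placeholders and the function names are our own; it only
assumes the "Erlang <release> (erts-<version>)" banner format shown above.

```python
import re

# Matches banners such as "Erlang R14B04 (erts-5.8.5) [source] ...".
BANNER_RE = re.compile(r"Erlang (R\d+[AB]\d*) \(erts-([\d.]+)\)")


def parse_banner(banner):
    """Return (otp_release, erts_version) extracted from an Erlang banner."""
    match = BANNER_RE.search(banner)
    if match is None:
        raise ValueError("unrecognized Erlang banner: %r" % banner)
    return match.group(1), match.group(2)


def mismatched_releases(banners):
    """Given {node_label: banner}, return the parsed versions if nodes differ.

    Returns an empty dict when every node reports the same release.
    """
    parsed = {node: parse_banner(text) for node, text in banners.items()}
    return parsed if len(set(parsed.values())) > 1 else {}


if __name__ == "__main__":
    # "node_a" and "node_b" are placeholder labels for the two nodes above.
    print(mismatched_releases({
        "node_a": "Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:2:2]",
        "node_b": "Erlang R15B (erts-5.9) [source] [64-bit] [smp:4:4]",
    }))
```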
~~~~~~~~~~~~
Unfortunately, none of these errors happen in our local dev environment.
On my local dev box, I run a 5-node cluster (of course, with all nodes on the
same physical machine).
I would really appreciate it if you could share your thoughts further on this
particular issue.
There is still some time before our application goes into production, and we
would really like to fix this before the app release.
Thanks in advance for your time.
--
Abhinav Singh
http://abhinavsingh.com/
On 14-Dec-2012, at 10:04 PM, Ryan Zezeski <[email protected]> wrote:
>
> Hi, comments inline
>
> On Wed, Dec 5, 2012 at 8:10 AM, Abhinav Singh <[email protected]> wrote:
>
> We are facing an issue where search queries work fine on my local dev box
> (which has riak-1.2.1rc2 installed).
> However, the same queries time out on our production boxes (which have
> riak-1.2.1 installed):
>
> 2012-12-05 14:49:59.777 [error] <0.1035.0>@mi_server:handle_info:524
> lookup/range failure:
> {{badfun,#Fun<riak_search_client.9.56347389>},[{mi_server,iterate,6},{mi_server,lookup,8}]}
>
> Did you recently upgrade your production boxes? The 'badfun' error is an
> indication that you currently have a "mixed" cluster. The error will occur
> when two or more machines are involved and they are not all the same version.
> This is a bug in Riak Search.
>
>
> This query does succeed sometimes (1-5%), but fails most of the time.
> I want to know whether the above logs point to a particular error with our
> riak cluster?
>
> Yes, so in 1-5% of the cases the nodes involved in a query are all the same
> version. The reason this is non-deterministic is that Riak Search uses
> some randomness during query time to help spread load around.
>
>
> Since this query has never failed on my local development box,
> I suspect either it has to do with something that changed between 1.2.1rc2
> and 1.2.1-stable release or something that is related to our production riak
> cluster.
>
>
>
> As I said above. I strongly suspect a mixed cluster scenario. That's the
> only time I've seen an error like the above. The second email also strongly
> indicates a mixed cluster scenario given the behavior you are seeing.
>
> -Z
>
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com