[
https://issues.apache.org/jira/browse/COUCHDB-3009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15276415#comment-15276415
]
Jason Gordon commented on COUCHDB-3009:
---------------------------------------
Thanks for the pointer! You're right. What seems to be happening is that the
./dev/run script places shards on one node fewer than requested (n-1). With
./dev/run -n 2, shards are only placed on node1. With the default ./dev/run
(n=3), shards get placed on node1 and node2. And with ./dev/run -n 4, shards
get placed on node1, node2, and node3.
BTW, I tried the same thing on a Mac instead of CentOS and did not see this
issue.
This is the scenario where two nodes are running on a single CentOS machine.
With all nodes up and everything running fine (./dev/run -n 2 -a admin:xxxxxxxxx):
curl -X GET "http://198.72.252.244:15984/_membership" --user admin:xxxxxxxxx
{"all_nodes":["[email protected]","[email protected]"],"cluster_nodes":["[email protected]","[email protected]"]}
curl -X GET "http://198.72.252.244:25984/_membership" --user admin:xxxxxxxxx
{"all_nodes":["[email protected]","[email protected]"],"cluster_nodes":["[email protected]","[email protected]"]}
curl -X GET "http://198.72.252.244:15984/_users" --user admin:xxxxxxxxx
{"db_name":"_users","update_seq":"1-g1AAAAFreJzLYWBg4MhgTmEQyctPSTV0MLS00DM01DMyNdIzMjHJAcoyJTIkyf___z8rkQG_uiQFIJlkT5RSB5DSeLBSRgJKE0BK64kxNY8FSDI0ACmg6vlEKl8AUb6fSOUHIMrvE6n8AUQ5yO1ZAL6wXv8","sizes":{"file":38110,"external":2003,"active":2199},"purge_seq":0,"other":{"data_size":2003},"doc_del_count":0,"doc_count":1,"disk_size":38110,"disk_format_version":6,"data_size":2199,"compact_running":false,"instance_start_time":"0"}
curl -X GET "http://198.72.252.244:25984/_users" --user admin:xxxxxxxxx
{"db_name":"_users","update_seq":"1-g1AAAAFreJzLYWBg4MhgTmEQyctPSTV0MLS00DM01DMyNdIzMjHJAcoyJTIkyf___z8rkQG_uiQFIJlkT5RSB5DSeLBSRgJKE0BK64kxNY8FSDI0ACmg6vlEKl8AUb6fSOUHIMrvE6n8AUQ5yO1ZAL6wXv8","sizes":{"file":38110,"external":2003,"active":2199},"purge_seq":0,"other":{"data_size":2003},"doc_del_count":0,"doc_count":1,"disk_size":38110,"disk_format_version":6,"data_size":2199,"compact_running":false,"instance_start_time":"0"}
curl -X GET "http://198.72.252.244:15984/_users/_shards" --user admin:xxxxxxxxx
{"shards":{"00000000-1fffffff":["[email protected]"],"20000000-3fffffff":["[email protected]"],"40000000-5fffffff":["[email protected]"],"60000000-7fffffff":["[email protected]"],"80000000-9fffffff":["[email protected]"],"a0000000-bfffffff":["[email protected]"],"c0000000-dfffffff":["[email protected]"],"e0000000-ffffffff":["[email protected]"]}}
curl -X GET "http://198.72.252.244:25984/_users/_shards" --user admin:xxxxxxxxx
{"shards":{"00000000-1fffffff":["[email protected]"],"20000000-3fffffff":["[email protected]"],"40000000-5fffffff":["[email protected]"],"60000000-7fffffff":["[email protected]"],"80000000-9fffffff":["[email protected]"],"a0000000-bfffffff":["[email protected]"],"c0000000-dfffffff":["[email protected]"],"e0000000-ffffffff":["[email protected]"]}}
> Cluster node databases unreadable when first node in cluster is down
> --------------------------------------------------------------------
>
> Key: COUCHDB-3009
> URL: https://issues.apache.org/jira/browse/COUCHDB-3009
> Project: CouchDB
> Issue Type: Bug
> Components: BigCouch, Database Core
> Affects Versions: 2.0.0
> Reporter: Jason Gordon
>
> After creating 3 nodes in a cluster, if the first node is taken down, the
> other two nodes' default databases (_global_changes, _metadata, _replicator,
> _users) become unreadable with the error 500
> {"error":"nodedown","reason":"progress not possible"}.
> Bringing up the first node restores access. However, if the first node is
> down, restarting nodes 2 and 3 does not restore access and also causes the
> user databases to become unreachable.
> Note: only the first node created in the cluster causes this problem. As
> long as node1 is up, nodes 2 and 3 can go up and down without any
> issue.
> Log messages seen on nodes 2 and 3:
> 15:23:46.388 [notice] cassim_metadata_cache changes listener died
> {{nocatch,{error,timeout}},[{fabric_view_changes,send_changes,6,[{file,"src/fabric_view_changes.erl"},{line,190}]},{fabric_view_changes,keep_sending_changes,8,[{file,"src/fabric_view_changes.erl"},{line,82}]},{fabric_view_changes,go,5,[{file,"src/fabric_view_changes.erl"},{line,43}]}]}
> 15:23:46.388 [error] Error in process <0.27407.0> on node
> '[email protected]' with exit value:
> {{nocatch,{error,timeout}},[{fabric_view_changes,send_changes,6,[{file,"src/fabric_view_changes.erl"},{line,190}]},{fabric_view_changes,keep_sending_changes,8,[{file,"src/fabric_view_changes.erl"},{line,82}]},{fabric_view_changes,go,5,[{file,"src/fabric_view_changes.erl"},{line,43}]}]}
> 15:23:46.389 [notice] chttpd_auth_cache changes listener died
> {{nocatch,{error,timeout}},[{fabric_view_changes,send_changes,6,[{file,"src/fabric_view_changes.erl"},{line,190}]},{fabric_view_changes,keep_sending_changes,8,[{file,"src/fabric_view_changes.erl"},{line,82}]},{fabric_view_changes,go,5,[{file,"src/fabric_view_changes.erl"},{line,43}]}]}
> 15:23:46.389 [error] Error in process <0.27414.0> on node
> '[email protected]' with exit value:
> {{nocatch,{error,timeout}},[{fabric_view_changes,send_changes,6,[{file,"src/fabric_view_changes.erl"},{line,190}]},{fabric_view_changes,keep_sending_changes,8,[{file,"src/fabric_view_changes.erl"},{line,82}]},{fabric_view_changes,go,5,[{file,"src/fabric_view_changes.erl"},{line,43}]}]}
> 15:23:51.391 [error] gen_server chttpd_auth_cache terminated with reason: no
> case clause matching {error,read_failure} in
> chttpd_auth_cache:ensure_auth_ddoc_exists/2 line 187
> 15:23:51.391 [error] CRASH REPORT Process chttpd_auth_cache with 1 neighbours
> exited with reason: no case clause matching {error,read_failure} in
> chttpd_auth_cache:ensure_auth_ddoc_exists/2 line 187 in
> gen_server:terminate/7 line 826
> 15:23:51.391 [error] Supervisor chttpd_sup had child undefined started with
> chttpd_auth_cache:start_link() at <0.27413.0> exit with reason no case clause
> matching {error,read_failure} in chttpd_auth_cache:ensure_auth_ddoc_exists/2
> line 187 in context child_terminated
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)