Actually there are two monitors (my bad in the previous e-mail).
One at the MASTER and one at the CLIENT.

The monitor on the CLIENT is failing with the following:

2014-03-05 13:08:38.821135 7f76ba82b700  1 mon.client1@0(leader).paxos(paxos active c 25603..26314) is_readable now=2014-03-05 13:08:38.821136 lease_expire=2014-03-05 13:08:40.845978 has v0 lc 26314
2014-03-05 13:08:40.599287 7f76bb22c700  0 mon.client1@0(leader).data_health(86) update_stats avail 4% total 51606140 used 46645692 avail 2339008
2014-03-05 13:08:40.599527 7f76bb22c700 -1 mon.client1@0(leader).data_health(86) reached critical levels of available space on data store -- shutdown!
2014-03-05 13:08:40.599530 7f76bb22c700  0 ** Shutdown via Data Health Service **
2014-03-05 13:08:40.599557 7f76b9328700 -1 mon.client1@0(leader) e2 *** Got Signal Interrupt ***
2014-03-05 13:08:40.599568 7f76b9328700  1 mon.client1@0(leader) e2 shutdown
2014-03-05 13:08:40.599602 7f76b9328700  0 quorum service shutdown
2014-03-05 13:08:40.599609 7f76b9328700 0 mon.client1@0(shutdown).health(86) HealthMonitor::service_shutdown 1 services
2014-03-05 13:08:40.599613 7f76b9328700  0 quorum service shutdown


The thing is that there is plenty of space on that host (CLIENT):

# df -h
Filesystem                  Size  Used Avail Use% Mounted on
/dev/mapper/vg_one-lv_root   50G   45G  2.3G  96% /
tmpfs                       5.9G     0  5.9G   0% /dev/shm
/dev/sda1                    485M   76M  384M  17% /boot
/dev/mapper/vg_one-lv_home   862G  249G  569G  31% /home
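
Note, though, that the monitor's data store normally lives under /var/lib/ceph/mon, which here sits on the root filesystem (96% used, only 2.3G free), not on /home; the numbers in the update_stats log line above (total 51606140 KB, avail 2339008 KB) match the root filesystem. Assuming the default data path, this can be confirmed with:

# df -h /var/lib/ceph/mon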


On the other hand, the other host (MASTER) is also running low on disk space (93% full).

But why is the CLIENT failing while the MASTER is still running, even though it is also running low on disk space?
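
A plausible explanation: the monitor shuts itself down when available space on its data store drops below mon_data_avail_crit, which defaults to 5%. The CLIENT's root filesystem is at 4% available, just under the threshold, while the MASTER at 93% used still has 7% available. A sketch for checking the current thresholds on the CLIENT via the admin socket (the socket path assumes the default naming for mon.client1):

# ceph --admin-daemon /var/run/ceph/ceph-mon.client1.asok config show | grep mon_data_avail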

I'll try to free some space and see what happens next...

Best,

G.



On Wed, 05 Mar 2014 11:50:57 +0100, Wido den Hollander wrote:
On 03/05/2014 11:21 AM, Georgios Dimitrakakis wrote:
My setup consists of two nodes.

The first node (master) is running:

-mds
-mon
-osd.0



and the second node (CLIENT) is running:

-osd.1


Therefore I've restarted the ceph services on both nodes.


Leaving "ceph -w" running for as long as it can, after a few seconds
the error that is produced is this:

2014-03-05 12:08:17.715699 7fba13fff700  0 monclient: hunting for new mon
2014-03-05 12:08:17.716108 7fba102f8700  0 -- 192.168.0.10:0/1008298 >> X.Y.Z.X:6789/0 pipe(0x7fba08008e50 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fba080090b0).fault


(where X.Y.Z.X is the public IP of the CLIENT node).

And it keeps going on...
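
The .fault line means the client cannot complete a connection to the monitor at X.Y.Z.X:6789. A quick reachability check from the node where "ceph -w" was started (assuming nc is available and the monitor listens on the default port 6789):

# nc -z -v X.Y.Z.X 6789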

"ceph-health" after a few minutes shows the following

2014-03-05 12:12:58.355677 7effc52fb700  0 monclient(hunting):
authenticate timed out after 300
2014-03-05 12:12:58.355717 7effc52fb700  0 librados: client.admin
authentication error (110) Connection timed out
Error connecting to cluster: TimedOut
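
Even when cluster-wide commands hang like this, a monitor can still be queried locally through its admin socket. A sketch, assuming the default socket path and a hypothetical monitor id of "master" (substitute the actual id on the first node):

# ceph --admin-daemon /var/run/ceph/ceph-mon.master.asok mon_status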


Any ideas now??


Is the monitor actually running on the first node? If not, check
the logs in /var/log/ceph as to why it isn't running.

Or maybe you just need to start it.
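
For example, something along these lines (assuming the sysvinit script that "service ceph restart" implies, and the default log location):

# ps aux | grep ceph-mon
# service ceph start mon
# tail -n 50 /var/log/ceph/ceph-mon.*.log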

Wido

Best,

G.

On Wed, 5 Mar 2014 15:10:25 +0530, Srinivasa Rao Ragolu wrote:
First try to start the OSDs by restarting the ceph service on the ceph
nodes. If it works fine, you should be able to see the ceph-osd process
running in the process list. You do not need to add any public or private
network in ceph.conf. If none of the OSDs run, then you need to
reconfigure them from the monitor node.

Please check whether the ceph-mon process is running on the monitor node.
ceph-mds does not need to be running.

Also check that the /etc/hosts file has valid IP addresses for the cluster nodes.

Finally, check that ceph.client.admin.keyring and ceph.bootstrap-osd.keyring
match on all the cluster nodes, for example as sketched below.
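
A minimal sketch of these checks (the keyring paths assume a default ceph-deploy layout; compare the checksums across nodes):

# ps aux | grep ceph-osd
# md5sum /etc/ceph/ceph.client.admin.keyring
# md5sum /var/lib/ceph/bootstrap-osd/ceph.keyring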

Best of luck.
Srinivas.

On Wed, Mar 5, 2014 at 3:04 PM, Georgios Dimitrakakis  wrote:

Hi!

I have installed Ceph and created two OSDs, and was very happy with
that, but apparently not everything was correct.

Today, after a system reboot, the cluster comes up and for a few
moments it seems that it's OK (using the "ceph health" command), but
after a few seconds the "ceph health" command doesn't produce any
output at all.

It just stays there without anything on the screen...

ceph -w is doing the same as well...

If I restart the ceph services ("service ceph restart"), again for a
few seconds it is working, but after a few more it freezes.

Initially I thought that this was a firewall problem, but apparently
it isn't.

Then I thought that this had to do with the

public_network

cluster_network

options not being defined in ceph.conf, and changed that.
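
For reference, these options belong in the [global] section of ceph.conf; a sketch with hypothetical subnets (substitute your own):

[global]
public_network = 192.168.0.0/24
cluster_network = 10.0.0.0/24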

No matter what I do, the cluster works for a few seconds after the
service restart and then it stops responding...

Any help much appreciated!!!

Best,

G.

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
