Re: [ceph-users] OSD won't go up after node reboot

2015-09-01 Thread Евгений Д .
The data lives in another container that is attached to the OSD container as a
Docker volume. According to `deis ps -a`, this volume was created two weeks
ago, though all the files in `current` are very recent. I suspect that
something removed the files in the data volume after the reboot. As the reboot
was caused by a CoreOS update, it might be the newer version of Docker
(1.6 -> 1.7) that introduced the problem, or maybe it was the container
initialization process that somehow removed and recreated the files.
I don't have this data volume anymore, so I can only guess.
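
If it happens again, one way to test the theory (a rough sketch, not
Deis-specific; the container name below is hypothetical) would be to record
the host paths backing the data volume before and after a restart:

# Show the volumes/mounts of the OSD's data container; the exact field name
# ("Volumes" vs "Mounts") depends on the Docker version.
docker inspect deis-store-daemon-data

# Then check the age of the host directory reported there, e.g.:
ls -ld <host path reported by docker inspect>
ls -l /var/lib/ceph/osd/ceph-0/current | head

If the backing path changes across the restart, the volume was recreated
rather than re-attached.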

2015-08-31 18:28 GMT+03:00 Jan Schermer :

> Is it possible that something else was mounted there?
> Or is it possible nothing was mounted there?
> That would explain such behaviour...
>
> Jan
>
> On 31 Aug 2015, at 17:07, Евгений Д.  wrote:
>
> No, it really was in the cluster. Before the reboot the cluster was HEALTH_OK.
> Though now I've checked the `current` directory and it doesn't contain any
> data:
>
> root@staging-coreos-1:/var/lib/ceph/osd/ceph-0# ls current
> commit_op_seq  meta  nosnap  omap
>
> while the other OSDs' `current` directories do. It really looks like something
> was broken on reboot, probably during container start, so it's not really
> related to Ceph. I'll go with recreating the OSD.
>
> Thank you.
>
> 2015-08-31 11:50 GMT+03:00 Gregory Farnum :
>
>> On Sat, Aug 29, 2015 at 3:32 PM, Евгений Д.  wrote:
>> > I'm running a 3-node cluster with Ceph (it's a Deis cluster, so the Ceph
>> > daemons are containerized). There are 3 OSDs and 3 mons. After rebooting
>> > all nodes one by one, all monitors are up, but only two of the three OSDs
>> > are up. The 'down' OSD is actually running but is never marked up/in.
>> > All three mons are reachable from inside the OSD container.
>> > I've run `log dump` for this OSD and found this line:
>> >
>> > Aug 29 06:19:39 staging-coreos-1 sh[7393]: -99> 2015-08-29 06:18:51.855432
>> > 7f5902009700  3 osd.0 0 handle_osd_map epochs [1,90], i have 0, src has
>> > [1,90]
>> >
>> > Is this the reason why the OSD cannot connect to the cluster? If so, why
>> > could it happen? I haven't removed any data from /var/lib/ceph/osd.
>> > Is it possible to bring this OSD back into the cluster without completely
>> > recreating it?
>> >
>> > Ceph version is:
>> >
>> > root@staging-coreos-1:/# ceph -v
>> > ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
>>
>> It's pretty unlikely. I presume (since the OSD has no maps) that it's
>> never actually been up and in the cluster? Or else its data store has
>> been pretty badly corrupted since it doesn't have any of the requisite
>> metadata. In which case you'll probably be best off recreating it
>> (with 3 OSDs I assume all your PGs are still active).
>> -Greg
>>
>


Re: [ceph-users] OSD won't go up after node reboot

2015-08-31 Thread Евгений Д .
No, it really was in the cluster. Before the reboot the cluster was HEALTH_OK.
Though now I've checked the `current` directory and it doesn't contain any data:

root@staging-coreos-1:/var/lib/ceph/osd/ceph-0# ls current
commit_op_seq  meta  nosnap  omap

while the other OSDs' `current` directories do. It really looks like something
was broken on reboot, probably during container start, so it's not really
related to Ceph. I'll go with recreating the OSD.
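
For the record, the plan is roughly the standard manual remove-and-recreate
procedure for an OSD (a sketch for ceph 0.94; osd id 0, the host name and the
data path are taken from this thread, and the CRUSH weight is a placeholder):

# remove the broken OSD (it is already down)
ceph osd out 0
ceph osd crush remove osd.0
ceph auth del osd.0
ceph osd rm 0

# recreate it with the same id and data directory
ceph osd create                  # should hand back the freed id, 0
ceph-osd -i 0 --mkfs --mkkey
ceph auth add osd.0 osd 'allow *' mon 'allow profile osd' \
    -i /var/lib/ceph/osd/ceph-0/keyring
ceph osd crush add osd.0 1.0 host=staging-coreos-1
# then restart the OSD container so the daemon starts on the fresh store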

Thank you.

2015-08-31 11:50 GMT+03:00 Gregory Farnum :

> On Sat, Aug 29, 2015 at 3:32 PM, Евгений Д.  wrote:
> > I'm running a 3-node cluster with Ceph (it's a Deis cluster, so the Ceph
> > daemons are containerized). There are 3 OSDs and 3 mons. After rebooting
> > all nodes one by one, all monitors are up, but only two of the three OSDs
> > are up. The 'down' OSD is actually running but is never marked up/in.
> > All three mons are reachable from inside the OSD container.
> > I've run `log dump` for this OSD and found this line:
> >
> > Aug 29 06:19:39 staging-coreos-1 sh[7393]: -99> 2015-08-29 06:18:51.855432
> > 7f5902009700  3 osd.0 0 handle_osd_map epochs [1,90], i have 0, src has
> > [1,90]
> >
> > Is this the reason why the OSD cannot connect to the cluster? If so, why
> > could it happen? I haven't removed any data from /var/lib/ceph/osd.
> > Is it possible to bring this OSD back into the cluster without completely
> > recreating it?
> >
> > Ceph version is:
> >
> > root@staging-coreos-1:/# ceph -v
> > ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
>
> It's pretty unlikely. I presume (since the OSD has no maps) that it's
> never actually been up and in the cluster? Or else its data store has
> been pretty badly corrupted since it doesn't have any of the requisite
> metadata. In which case you'll probably be best off recreating it
> (with 3 OSDs I assume all your PGs are still active).
> -Greg
>


Re: [ceph-users] OSD won't go up after node reboot

2015-08-31 Thread Gregory Farnum
On Sat, Aug 29, 2015 at 3:32 PM, Евгений Д.  wrote:
> I'm running a 3-node cluster with Ceph (it's a Deis cluster, so the Ceph
> daemons are containerized). There are 3 OSDs and 3 mons. After rebooting all
> nodes one by one, all monitors are up, but only two of the three OSDs are up.
> The 'down' OSD is actually running but is never marked up/in.
> All three mons are reachable from inside the OSD container.
> I've run `log dump` for this OSD and found this line:
>
> Aug 29 06:19:39 staging-coreos-1 sh[7393]: -99> 2015-08-29 06:18:51.855432
> 7f5902009700  3 osd.0 0 handle_osd_map epochs [1,90], i have 0, src has
> [1,90]
>
> Is this the reason why the OSD cannot connect to the cluster? If so, why
> could it happen? I haven't removed any data from /var/lib/ceph/osd.
> Is it possible to bring this OSD back into the cluster without completely
> recreating it?
>
> Ceph version is:
>
> root@staging-coreos-1:/# ceph -v
> ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)

It's pretty unlikely. I presume (since the OSD has no maps) that it's
never actually been up and in the cluster? Or else its data store has
been pretty badly corrupted since it doesn't have any of the requisite
metadata. In which case you'll probably be best off recreating it
(with 3 OSDs I assume all your PGs are still active).
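
A quick way to sanity-check that before wiping anything (generic commands, not
specific to your setup):

ceph -s          # overall health; degraded/undersized PGs are OK here, stale/down PGs are not
ceph osd tree    # confirms which OSD is down and where it sits in the CRUSH map
ceph pg stat     # all PGs should still show as active before you recreate the OSD
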
-Greg


[ceph-users] OSD won't go up after node reboot

2015-08-29 Thread Евгений Д .
I'm running a 3-node cluster with Ceph (it's a Deis cluster, so the Ceph
daemons are containerized). There are 3 OSDs and 3 mons. After rebooting all
nodes one by one, all monitors are up, but only two of the three OSDs are up.
The 'down' OSD is actually running but is never marked up/in.
All three mons are reachable from inside the OSD container.
I've run `log dump` for this OSD and found this line:

Aug 29 06:19:39 staging-coreos-1 sh[7393]: -99> 2015-08-29
06:18:51.855432 7f5902009700  3 osd.0 0 handle_osd_map epochs [1,90],
i have 0, src has [1,90]

Is this the reason why the OSD cannot connect to the cluster? If so, why
could it happen? I haven't removed any data from /var/lib/ceph/osd.
Is it possible to bring this OSD back into the cluster without completely
recreating it?

Ceph version is:

root@staging-coreos-1:/# ceph -v
ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
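
For reference, these are the sort of checks I can run from inside the OSD
container (a rough sketch; osd.0 and its admin socket are assumed from the log
line above):

ceph osd tree               # shows which OSDs the monitors consider up/down
ceph daemon osd.0 status    # admin socket: state, oldest_map, newest_map
ceph daemon osd.0 log dump  # dumps the daemon's recent in-memory log entries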
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com