Re: [ceph-users] Ceph PG Incomplete = Cluster unusable

2015-01-09 Thread Christian Balzer
On Thu, 8 Jan 2015 21:17:12 -0700 Robert LeBlanc wrote:

 On Thu, Jan 8, 2015 at 8:31 PM, Christian Balzer ch...@gol.com wrote:
  On Thu, 8 Jan 2015 11:41:37 -0700 Robert LeBlanc wrote:
  Which of course currently means a strongly consistent lockup in these
  scenarios. ^o^
 
 That is one way of putting it
 
If I had the time and more importantly the talent to help with code, I'd
do so. 
Failing that, pointing out the often painful truth is something I can do.

  Slightly off-topic and snarky: that strong consistency is of course of
  limited use when, in the case of a corrupted PG, Ceph basically asks you
  to toss a coin.
  As in: minor corruption, impossible for a mere human to tell which
  replica is the good one, because one OSD is down and the 2 remaining
  ones differ by one bit or so.
 
 This is where checksumming is supposed to come in. I think Sage has been
 leading that initiative. 

Yeah, I'm aware of that effort. 
Of course in the meantime even a very simple majority vote would be most
welcome and helpful in nearly all cases (with 3 replicas available).
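
Even something as dumb as the following would go a long way when all three
replicas are readable. A toy sketch in Python (nothing to do with how Ceph
actually stores or compares replicas, it just assumes the three payloads can
be read side by side):

import hashlib
from collections import Counter

def pick_majority(replicas):
    """Return the replica that at least 2 of the 3 copies agree on,
    or None if all three differ (a human still has to decide then)."""
    digests = [hashlib.sha256(r).hexdigest() for r in replicas]
    digest, votes = Counter(digests).most_common(1)[0]
    return replicas[digests.index(digest)] if votes >= 2 else None

# Example: the second replica has a single flipped bit.
good = b"object payload"
assert pick_majority([good, b"object pbyload", good]) == good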

One wonders if this is basically an acknowledgement that, while offloading
things like checksums to the underlying layer/FS is desirable from a
codebase/effort/complexity point of view, neither BTRFS nor ZFS is fully
production ready, nor will either be for some time.

 Basically, when an OSD reads an object it should
 be able to tell if there was bit rot by hashing what it just read and
 checking it against the MD5SUM it computed when it first received the
 object. If it doesn't match, it can ask another OSD until it finds one
 that matches.
 
 This provides a number of benefits:
 
1. Protect against bit rot. Checked on read and on deep scrub.
2. Automatically recover the correct version of the object.
3. If the client computes the MD5SUM before the data is sent over the wire,
   integrity can be guaranteed across the memory of several
   machines/devices/cables/etc.
4. Getting by with size 2 is less risky for those who really want to
   do that.
 
 With all these benefits, there is a trade-off associated with it, mostly
 CPU. However, with the inclusion of AES in silicon, it may not be a huge
 issue now. But I'm not a programmer, nor familiar enough with that aspect
 of the Ceph code, to be authoritative in any way.

Yup, all very useful and pertinent points.

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/


Re: [ceph-users] Ceph PG Incomplete = Cluster unusable

2015-01-09 Thread Robert LeBlanc
On Thu, Jan 8, 2015 at 8:31 PM, Christian Balzer ch...@gol.com wrote:
 On Thu, 8 Jan 2015 11:41:37 -0700 Robert LeBlanc wrote:
 Which of course currently means a strongly consistent lockup in these
 scenarios. ^o^

That is one way of putting it

 Slightly off-topic and snarky: that strong consistency is of course of
 limited use when, in the case of a corrupted PG, Ceph basically asks you to
 toss a coin.
 As in: minor corruption, impossible for a mere human to tell which
 replica is the good one, because one OSD is down and the 2 remaining ones
 differ by one bit or so.

This is where checksumming is supposed to come in. I think Sage has been
leading that initiative. Basically, when an OSD reads an object it should
be able to tell if there was bit rot by hashing what it just read and
checking it against the MD5SUM it computed when it first received the
object. If it doesn't match, it can ask another OSD until it finds one that
matches.

This provides a number of benefits:

   1. Protect against bit rot. Checked on read and on deep scrub.
   2. Automatically recover the correct version of the object.
   3. If the client computes the MD5SUM before the data is sent over the
   wire, integrity can be guaranteed across the memory of several
   machines/devices/cables/etc.
   4. Getting by with size 2 is less risky for those who really want to
   do that.

With all these benefits, there is a trade-off associated with it, mostly
CPU. However, with the inclusion of AES in silicon, it may not be a huge
issue now. But I'm not a programmer, nor familiar enough with that aspect of
the Ceph code, to be authoritative in any way.
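
To make that read path concrete, here is a toy sketch (not actual Ceph code;
the FakeOSD class and its get() method are invented purely for illustration):

import hashlib

def read_object(oid, osds, stored_checksum):
    """Toy read path: ask each OSD holding a replica of oid until one
    returns data whose digest matches the checksum recorded at write time."""
    for osd in osds:
        data = osd.get(oid)
        if hashlib.md5(data).hexdigest() == stored_checksum:
            return data  # clean copy found
        # digest mismatch: bit rot on this replica, try the next one
    raise IOError("no replica of %r matches its recorded checksum" % oid)

class FakeOSD:
    def __init__(self, blob):
        self.blob = blob
    def get(self, oid):
        return self.blob

payload = b"some object data"
checksum = hashlib.md5(payload).hexdigest()   # computed once, at write time
replicas = [FakeOSD(b"some object dbta"), FakeOSD(payload)]  # first copy rotted
assert read_object("obj1", replicas, checksum) == payload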


Re: [ceph-users] Ceph PG Incomplete = Cluster unusable

2015-01-08 Thread Christian Balzer
On Thu, 8 Jan 2015 11:41:37 -0700 Robert LeBlanc wrote:

 On Wed, Jan 7, 2015 at 10:55 PM, Christian Balzer ch...@gol.com wrote:
  Which of course begs the question of why not have min_size at 1
  permanently, so that in the (hopefully rare) case of losing 2 OSDs at
  the same time your cluster still keeps working (as it should with a
  size of 3).
 
 The idea is that when a write happens, at least min_size copies are
 committed to disk before the write is acknowledged back to the client,
 just in case something happens to the disk before it can be replicated.
 Running with min_size 1 also goes against the strongly consistent model
 of Ceph.
 
Which of course currently means a strongly consistent lockup in these
scenarios. ^o^

Slightly off-topic and snarky: that strong consistency is of course of
limited use when, in the case of a corrupted PG, Ceph basically asks you to
toss a coin.
As in: minor corruption, impossible for a mere human to tell which
replica is the good one, because one OSD is down and the 2 remaining ones
differ by one bit or so.

 I believe there is work to resolve the issue when the number of
 replicas drops below min_size. Ceph should automatically start
 backfilling to get back to at least min_size so that I/O can continue. I
 believe this work is also tied to prioritizing backfills, so that PGs
 like this are backfilled first to reach min_size, and then backfilled
 further to get back to size.
 
Yeah, I suppose that is what Greg referred to. 
Hopefully soon and backported if possible.

 I am interested in a not-so-strict eventual consistency option in Ceph,
 so that under normal circumstances, instead of needing [size] writes to
 OSDs to complete, only [min_size] are needed, and the primary OSD then
 ensures that the laggy OSD(s) eventually get the write committed.
 
This is exactly where I was coming from / what I was getting at.

And it is basically what artificially setting min_size to 1 in a replica-3
cluster should get you, unless I'm missing something.

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/


Re: [ceph-users] Ceph PG Incomplete = Cluster unusable

2015-01-08 Thread Gregory Farnum
On Wed, Jan 7, 2015 at 9:55 PM, Christian Balzer ch...@gol.com wrote:
 On Wed, 7 Jan 2015 17:07:46 -0800 Craig Lewis wrote:

 On Mon, Dec 29, 2014 at 4:49 PM, Alexandre Oliva ol...@gnu.org wrote:

  However, I suspect that temporarily setting min size to a lower number
  could be enough for the PGs to recover.  If "ceph osd pool <pool> set
  min_size 1" doesn't get the PGs going, I suppose restarting at least
  one of the OSDs involved in the recovery, so that the PG undergoes
  peering again, would get you going again.
 

 It depends on how incomplete your incomplete PGs are.

 min_size is defined as "Sets the minimum number of replicas required for
 I/O".  By default, size is 3 and min_size is 2 on recent versions of
 Ceph.

 If the number of replicas you have drops below min_size, then Ceph will
 mark the PG as incomplete.  As long as you have one copy of the PG, you
 can recover by lowering the min_size to the number of copies you do
 have, then restoring the original value after recovery is complete.  I
 did this last week when I deleted the wrong PGs as part of a toofull
 experiment.

 Which of course begs the question of why not have min_size at 1
 permanently, so that in the (hopefully rare) case of losing 2 OSDs at the
 same time your cluster still keeps working (as it should with a size of 3).

You no longer have write durability if you only have one copy of a PG.

Sam is fixing things up so that recovery will work properly as long as
you have a whole copy of the PG, which should make things behave as
people expect.
-Greg


Re: [ceph-users] Ceph PG Incomplete = Cluster unusable

2015-01-08 Thread Robert LeBlanc
On Wed, Jan 7, 2015 at 10:55 PM, Christian Balzer ch...@gol.com wrote:
 Which of course begs the question of why not have min_size at 1
 permanently, so that in the (hopefully rare) case of losing 2 OSDs at the
 same time your cluster still keeps working (as it should with a size of 3).

The idea is that when a write happens, at least min_size copies are
committed to disk before the write is acknowledged back to the client,
just in case something happens to the disk before it can be replicated.
Running with min_size 1 also goes against the strongly consistent model of
Ceph.

I believe there is work to resolve the issue when the number of
replicas drops below min_size. Ceph should automatically start
backfilling to get back to at least min_size so that I/O can continue. I
believe this work is also tied to prioritizing backfills, so that PGs
like this are backfilled first to reach min_size, and then backfilled
further to get back to size.

I am interested in a not-so-strict eventual consistency option in Ceph,
so that under normal circumstances, instead of needing [size] writes to
OSDs to complete, only [min_size] are needed, and the primary OSD then
ensures that the laggy OSD(s) eventually get the write committed.
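
Roughly what I have in mind, again only as a sketch (this is not how the OSD
code actually works; the Replica class and its commit() call are invented for
the example):

import threading

class Replica:
    """Stand-in for a replica OSD; commit() pretends to persist a write."""
    def __init__(self):
        self.store = {}
    def commit(self, obj, data):
        self.store[obj] = data

def write(primary_store, replicas, obj, data, min_num):
    """Ack the client once min_num copies (counting the primary) are durable,
    then push the write to the remaining, possibly laggy, replicas."""
    primary_store[obj] = data                # primary commits locally: 1 copy
    copies = 1
    laggy = []
    for r in replicas:
        if copies < min_num:
            r.commit(obj, data)              # synchronous, counted for the ack
            copies += 1
        else:
            laggy.append(r)                  # handled in the background
    def catch_up():
        for r in laggy:
            r.commit(obj, data)              # primary replays the write later
    threading.Thread(target=catch_up, daemon=True).start()
    return "ack"                             # client unblocks here

primary, reps = {}, [Replica(), Replica()]
write(primary, reps, "obj1", b"data", min_num=2)  # ack after primary + 1 replica

The background catch-up is what makes this eventually consistent rather than
strongly consistent: the client gets its ack before every replica has the data.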


Re: [ceph-users] Ceph PG Incomplete = Cluster unusable

2015-01-07 Thread Craig Lewis
On Mon, Dec 29, 2014 at 4:49 PM, Alexandre Oliva ol...@gnu.org wrote:

 However, I suspect that temporarily setting min size to a lower number
 could be enough for the PGs to recover.  If "ceph osd pool <pool> set
 min_size 1" doesn't get the PGs going, I suppose restarting at least one
 of the OSDs involved in the recovery, so that the PG undergoes peering
 again, would get you going again.


It depends on how incomplete your incomplete PGs are.

min_size is defined as "Sets the minimum number of replicas required for
I/O".  By default, size is 3 and min_size is 2 on recent versions of Ceph.

If the number of replicas you have drops below min_size, then Ceph will
mark the PG as incomplete.  As long as you have one copy of the PG, you can
recover by lowering the min_size to the number of copies you do have, then
restoring the original value after recovery is complete.  I did this last
week when I deleted the wrong PGs as part of a toofull experiment.
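
The whole dance is short enough to script. A rough sketch in Python: the ceph
commands are the standard CLI ones, but "rbd" below is only an example pool
name, and the original min_size of 2 is an assumption (check the real value
with "ceph osd pool get <pool> min_size" first):

import subprocess

def run(cmd):
    # Print and execute one ceph CLI command; example helper only.
    print("+ " + cmd)
    subprocess.check_call(cmd, shell=True)

pool = "rbd"   # example pool name, substitute your own
run("ceph osd pool set %s min_size 1" % pool)   # let PGs with one copy recover
run("ceph pg dump_stuck inactive")              # watch the incomplete PGs peer
# ...wait until recovery finishes (ceph health is back to HEALTH_OK)...
run("ceph osd pool set %s min_size 2" % pool)   # restore the original value

Keep the window with the lowered min_size as short as possible; with only one
copy you have no write durability if that last disk dies.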

If the number of replicas drops to 0, I think you can use ceph pg
force_create_pg, but I haven't tested it.


Re: [ceph-users] Ceph PG Incomplete = Cluster unusable

2015-01-07 Thread Christian Balzer
On Wed, 7 Jan 2015 17:07:46 -0800 Craig Lewis wrote:

 On Mon, Dec 29, 2014 at 4:49 PM, Alexandre Oliva ol...@gnu.org wrote:
 
  However, I suspect that temporarily setting min size to a lower number
  could be enough for the PGs to recover.  If "ceph osd pool <pool> set
  min_size 1" doesn't get the PGs going, I suppose restarting at least
  one of the OSDs involved in the recovery, so that the PG undergoes
  peering again, would get you going again.
 
 
 It depends on how incomplete your incomplete PGs are.
 
 min_size is defined as "Sets the minimum number of replicas required for
 I/O".  By default, size is 3 and min_size is 2 on recent versions of
 Ceph.
 
 If the number of replicas you have drops below min_size, then Ceph will
 mark the PG as incomplete.  As long as you have one copy of the PG, you
 can recover by lowering the min_size to the number of copies you do
 have, then restoring the original value after recovery is complete.  I
 did this last week when I deleted the wrong PGs as part of a toofull
 experiment.
 
Which of course begs the question of why not have min_size at 1
permanently, so that in the (hopefully rare) case of losing 2 OSDs at the
same time your cluster still keeps working (as it should with a size of 3).

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/


Re: [ceph-users] Ceph PG Incomplete = Cluster unusable

2014-12-30 Thread Christian Eichelmann
Hi Nico and all others who answered,

After some more attempts to somehow get the PGs into a working state (I've
tried force_create_pg, which put them into the creating state, but that was
obviously not real progress, since after rebooting one of the OSDs
containing them they went back to incomplete), I decided to save what can
be saved.

I've created a new pool, created a new image there, and mapped the old image
from the old pool and the new image from the new pool to a machine, to
copy the data over at the POSIX level.

Unfortunately, formatting the image from the new pool hangs after some
time. So it seems that the new pool is suffering from the same problem
as the old pool, which is totally incomprehensible to me.

Right now, it seems like Ceph is giving me no option to either save
some of the still-intact RBD volumes, or to create a new pool alongside the
old one to at least enable our clients to send data to Ceph again.

To tell the truth, I guess that will result in the end of our Ceph
project (which has already been running for 9 months).

Regards,
Christian

On 29.12.2014 15:59, Nico Schottelius wrote:
 Hey Christian,
 
 Christian Eichelmann [Mon, Dec 29, 2014 at 10:56:59AM +0100]:
 [incomplete PG / RBD hanging, osd lost also not helping]
 
 that is very interesting to hear, because we had a similar situation
 with ceph 0.80.7 and had to re-create a pool, after I deleted 3 PG
 directories to allow OSDs to start after the disk filled up completely.
 
 So I am sorry not to be able to give you a good hint, but I am very
 interested in seeing your problem solved, as it is a show stopper for
 us, too. (*)
 
 Cheers,
 
 Nico
 
 (*) We migrated from sheepdog to gluster to ceph and so far sheepdog
 seems to run much smoother. The first one is however not supported
 by opennebula directly, the second one not flexible enough to host
 our heterogeneous infrastructure (mixed disk sizes/amounts) - so we 
 are using ceph at the moment.
 


-- 
Christian Eichelmann
System Administrator

1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
Brauerstraße 48 · DE-76135 Karlsruhe
Phone: +49 721 91374-8026
christian.eichelm...@1und1.de

Amtsgericht Montabaur / HRB 6484
Executive Board: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert
Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen
Chairman of the Supervisory Board: Michael Scheeren


Re: [ceph-users] Ceph PG Incomplete = Cluster unusable

2014-12-30 Thread Eneko Lacunza

Hi Christian,

Have you tried to migrate the disk from the old storage (pool) to the 
new one?


I think it should show the same problem, but I think it'd be a much 
easier path to recover than the posix copy.


How full is your storage?

Maybe you can customize the crushmap, so that some OSDs are left in the
bad (default) pool and other OSDs are set aside for the new pool. I think
(I'm still learning Ceph) that this will give each pool different PGs and
also different OSDs; maybe this way you can overcome the issue.


Cheers
Eneko

On 30/12/14 12:17, Christian Eichelmann wrote:

[full quote of Christian's earlier message snipped]






--
Technical Director
Binovo IT Human Project, S.L.
Telf. 943575997
  943493611
Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
www.binovo.es



Re: [ceph-users] Ceph PG Incomplete = Cluster unusable

2014-12-30 Thread Christian Eichelmann
Hi Eneko,

I was trying an "rbd cp" before, but that was hanging as well, and I
couldn't tell whether the source image or the destination image was causing
the hang. That's why I decided to try a POSIX copy.

Our cluster is still nearly empty (12TB / 867TB). But as far as I
understand (if not, somebody please correct me), placement groups are
generally not shared between pools at all.

Regards,
Christian

On 30.12.2014 12:23, Eneko Lacunza wrote:
 [quoted message snipped]


 
 


-- 
Christian Eichelmann
System Administrator

1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
Brauerstraße 48 · DE-76135 Karlsruhe
Phone: +49 721 91374-8026
christian.eichelm...@1und1.de

Amtsgericht Montabaur / HRB 6484
Executive Board: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert
Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen
Chairman of the Supervisory Board: Michael Scheeren


Re: [ceph-users] Ceph PG Incomplete = Cluster unusable

2014-12-30 Thread Eneko Lacunza

Hi Christian,

Do the new pool's PGs also show as incomplete?

Did you notice anything remarkable in the ceph logs during the format of the
new pool's image?


On 30/12/14 12:31, Christian Eichelmann wrote:

[quoted message snipped]








--
Technical Director
Binovo IT Human Project, S.L.
Telf. 943575997
  943493611
Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
www.binovo.es



Re: [ceph-users] Ceph PG Incomplete = Cluster unusable

2014-12-30 Thread Christian Eichelmann
Hi Eneko,

Nope, the new pool has all PGs active+clean, and there were no errors during
image creation. The format command just hangs, without any error.



On 30.12.2014 12:33, Eneko Lacunza wrote:
 [quoted message snipped]



 
 


-- 
Christian Eichelmann
System Administrator

1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
Brauerstraße 48 · DE-76135 Karlsruhe
Phone: +49 721 91374-8026
christian.eichelm...@1und1.de

Amtsgericht Montabaur / HRB 6484
Executive Board: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert
Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen
Chairman of the Supervisory Board: Michael Scheeren


Re: [ceph-users] Ceph PG Incomplete = Cluster unusable

2014-12-29 Thread Nico Schottelius
Hey Christian,

Christian Eichelmann [Mon, Dec 29, 2014 at 10:56:59AM +0100]:
 [incomplete PG / RBD hanging, osd lost also not helping]

that is very interesting to hear, because we had a similar situation
with ceph 0.80.7 and had to re-create a pool, after I deleted 3 PG
directories to allow OSDs to start after the disk filled up completely.

So I am sorry not to be able to give you a good hint, but I am very
interested in seeing your problem solved, as it is a show stopper for
us, too. (*)

Cheers,

Nico

(*) We migrated from sheepdog to gluster to ceph and so far sheepdog
seems to run much smoother. The first one is however not supported
by opennebula directly, the second one not flexible enough to host
our heterogeneous infrastructure (mixed disk sizes/amounts) - so we 
are using ceph at the moment.

-- 
New PGP key: 659B 0D91 E86E 7E24 FD15  69D0 C729 21A1 293F 2D24


Re: [ceph-users] Ceph PG Incomplete = Cluster unusable

2014-12-29 Thread Chad William Seys
Hi Christian,
   I had a similar problem about a month ago.
   After trying lots of helpful suggestions, I found that none of them worked
and I could only delete the affected pools and start over.

  I opened a feature request in the tracker:
http://tracker.ceph.com/issues/10098

  If you find a way, let us know!

Chad.


Re: [ceph-users] Ceph PG Incomplete = Cluster unusable

2014-12-29 Thread Alexandre Oliva
On Dec 29, 2014, Christian Eichelmann christian.eichelm...@1und1.de wrote:

 After we got everything up and running again, we still have 3 PGs in the
 state incomplete. I was checking one of them directly on the systems
 (replication factor is 3).

I have run into this myself at least twice before.  I had not lost or
replaced the OSDs altogether, though; I had just rolled too many of them
back to earlier snapshots, which required them to be backfilled to
catch up.  It looks like a PG won't get out of the incomplete state, even
to backfill other replicas, if this would keep the PG's active size under
the min_size for the pool.

In my case, I brought the current-ish snapshot of the OSD back up to
enable backfilling of enough replicas, so that I could then roll the
remaining OSDs back again and have them backfilled too.

However, I suspect that temporarily setting min size to a lower number
could be enough for the PGs to recover.  If "ceph osd pool <pool> set
min_size 1" doesn't get the PGs going, I suppose restarting at least one
of the OSDs involved in the recovery, so that the PG undergoes peering
again, would get you going again.

Once backfilling completes for all formerly-incomplete PGs, or maybe
even as soon as backfilling begins, bringing the pool min_size back up
to (presumably) 2 is advisable.  You don't want to be running too long
with a too-low min size :-)

I hope this helps,

Happy GNU Year,

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist|Red Hat Brasil GNU Toolchain Engineer