Re: [ceph-users] Ceph PG Incomplete = Cluster unusable
On Thu, 8 Jan 2015 21:17:12 -0700 Robert LeBlanc wrote:
> On Thu, Jan 8, 2015 at 8:31 PM, Christian Balzer <ch...@gol.com> wrote:
>> On Thu, 8 Jan 2015 11:41:37 -0700 Robert LeBlanc wrote:
>> Which of course currently means a strongly consistent lockup in these
>> scenarios. ^o^
>
> That is one way of putting it.

If I had the time and, more importantly, the talent to help with code,
I'd do so. Failing that, pointing out the often painful truth is
something I can do.

>> Slightly off-topic and snarky: that strong consistency is of course of
>> limited use when, in the case of a corrupted PG, Ceph basically asks
>> you to toss a coin. As in: minor corruption, impossible for a mere
>> human to tell which replica is the good one, because one OSD is down
>> and the 2 remaining ones differ by one bit or so.
>
> This is where checksumming is supposed to come in. I think Sage has
> been leading that initiative.

Yeah, I'm aware of that effort.
Of course in the meantime even a very simple majority vote would be most
welcome and helpful in nearly all cases (with 3 replicas available).

One wonders if this is basically acknowledging that while offloading
some things like checksums to the underlying layer/FS is desirable from
a codebase/effort/complexity view, neither BTRFS nor ZFS is fully
production ready, and they won't be for some time.

> Basically, when an OSD reads an object it should be able to tell if
> there was bit rot by hashing what it just read and checking it against
> the MD5SUM it computed when it first received the object. If it doesn't
> match, it can ask another OSD until it finds one that matches. This
> provides a number of benefits:
>
> 1. Protection against bit rot, checked on read and on deep scrub.
> 2. Automatic recovery of the correct version of the object.
> 3. If the client computes the MD5SUM before the data is sent over the
>    wire, the data can be verified across the memory of several
>    machines/devices/cables/etc.
> 4. Getting by with size 2 is less risky for those who really want to
>    do that.
>
> With all these benefits there is a trade-off, mostly CPU. With the
> inclusion of AES in silicon, though, it may not be a huge issue now.
> But I'm not a programmer, nor familiar enough with that aspect of the
> Ceph code, to be authoritative in any way.

Yup, all very useful and pertinent points.

Christian

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
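In the absence of built-in voting, a manual majority vote is possible
today on FileStore OSDs; a rough sketch, assuming the default
/var/lib/ceph/osd layout (the pool, object, and PG names below are
placeholders):

    # Find which PG and OSDs hold a suspect object
    ceph osd map rbd rbd_data.1234.0000000000000000

    # On each of the OSD hosts in the acting set, hash the on-disk copy
    md5sum /var/lib/ceph/osd/ceph-*/current/3.1f_head/*data.1234*

    # Two matching sums out of three identify the good replicas; after
    # replacing the odd one out (with its OSD stopped), re-verify:
    ceph pg deep-scrub 3.1f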
Re: [ceph-users] Ceph PG Incomplete = Cluster unusable
On Thu, Jan 8, 2015 at 8:31 PM, Christian Balzer <ch...@gol.com> wrote:
> On Thu, 8 Jan 2015 11:41:37 -0700 Robert LeBlanc wrote:
> Which of course currently means a strongly consistent lockup in these
> scenarios. ^o^

That is one way of putting it.

> Slightly off-topic and snarky: that strong consistency is of course of
> limited use when, in the case of a corrupted PG, Ceph basically asks
> you to toss a coin. As in: minor corruption, impossible for a mere
> human to tell which replica is the good one, because one OSD is down
> and the 2 remaining ones differ by one bit or so.

This is where checksumming is supposed to come in. I think Sage has been
leading that initiative.

Basically, when an OSD reads an object it should be able to tell if
there was bit rot by hashing what it just read and checking it against
the MD5SUM it computed when it first received the object. If it doesn't
match, it can ask another OSD until it finds one that matches. This
provides a number of benefits:

1. Protection against bit rot, checked on read and on deep scrub.
2. Automatic recovery of the correct version of the object.
3. If the client computes the MD5SUM before the data is sent over the
   wire, the data can be verified across the memory of several
   machines/devices/cables/etc.
4. Getting by with size 2 is less risky for those who really want to do
   that.

With all these benefits there is a trade-off, mostly CPU. With the
inclusion of AES in silicon, though, it may not be a huge issue now. But
I'm not a programmer, nor familiar enough with that aspect of the Ceph
code, to be authoritative in any way.
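The store-on-write, verify-on-read scheme can be illustrated with plain
filesystem tools; this is a toy sketch only, not Ceph's actual
mechanism (the xattr name and paths are made up):

    # On first write: record the object's MD5 in an extended attribute
    md5=$(md5sum /srv/osd/obj.bin | cut -d' ' -f1)
    setfattr -n user.md5 -v "$md5" /srv/osd/obj.bin

    # On read: recompute and compare; a mismatch means this replica
    # rotted and another OSD should be asked for its copy
    stored=$(getfattr --only-values -n user.md5 /srv/osd/obj.bin)
    actual=$(md5sum /srv/osd/obj.bin | cut -d' ' -f1)
    [ "$stored" = "$actual" ] || echo "bit rot: try another replica"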
Re: [ceph-users] Ceph PG Incomplete = Cluster unusable
On Thu, 8 Jan 2015 11:41:37 -0700 Robert LeBlanc wrote:
> On Wed, Jan 7, 2015 at 10:55 PM, Christian Balzer <ch...@gol.com> wrote:
>> Which of course begs the question of why not have min_size at 1
>> permanently, so that in the (hopefully rare) case of losing 2 OSDs at
>> the same time your cluster still keeps working (as it should with a
>> size of 3).
>
> The idea is that when a write happens, at least min_size copies are
> committed on disk before the write is acknowledged back to the client,
> just in case something happens to the disk before the data can be
> replicated. It also goes against the strongly consistent model of Ceph.

Which of course currently means a strongly consistent lockup in these
scenarios. ^o^

Slightly off-topic and snarky: that strong consistency is of course of
limited use when, in the case of a corrupted PG, Ceph basically asks you
to toss a coin. As in: minor corruption, impossible for a mere human to
tell which replica is the good one, because one OSD is down and the 2
remaining ones differ by one bit or so.

> I believe there is work to resolve the issue when the number of
> replicas drops below min_size. Ceph should automatically start
> backfilling to get back to at least min_size so that I/O can continue.
> I believe this work is also tied to prioritizing backfills, so that
> cases like this are backfilled first, and backfilling from min_size
> back up to size comes second.

Yeah, I suppose that is what Greg referred to. Hopefully soon, and
backported if possible.

> I am interested in a not-so-strict eventual-consistency option in Ceph,
> so that under normal circumstances, instead of needing [size] writes to
> OSDs to complete, only [min_size] are needed, and the primary OSD then
> ensures that the laggy OSD(s) eventually get the write committed.

This is exactly where I was coming from/getting at. And it is basically
what artificially setting min_size to 1 in a replica-3 cluster should
get you, unless I'm missing something.

Christian

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
Re: [ceph-users] Ceph PG Incomplete = Cluster unusable
On Wed, Jan 7, 2015 at 9:55 PM, Christian Balzer <ch...@gol.com> wrote:
> On Wed, 7 Jan 2015 17:07:46 -0800 Craig Lewis wrote:
>> On Mon, Dec 29, 2014 at 4:49 PM, Alexandre Oliva <ol...@gnu.org> wrote:
>>> However, I suspect that temporarily setting min_size to a lower
>>> number could be enough for the PGs to recover. If "ceph osd pool set
>>> <pool> min_size 1" doesn't get the PGs going, I suppose restarting at
>>> least one of the OSDs involved in the recovery, so that the PG
>>> undergoes peering again, would get you going again.
>>
>> It depends on how incomplete your incomplete PGs are. min_size is
>> defined as "Sets the minimum number of replicas required for I/O". By
>> default, size is 3 and min_size is 2 on recent versions of Ceph. If
>> the number of replicas you have drops below min_size, then Ceph will
>> mark the PG as incomplete.
>>
>> As long as you have one copy of the PG, you can recover by lowering
>> min_size to the number of copies you do have, then restoring the
>> original value after recovery is complete. I did this last week when I
>> deleted the wrong PGs as part of a "toofull" experiment.
>
> Which of course begs the question of why not have min_size at 1
> permanently, so that in the (hopefully rare) case of losing 2 OSDs at
> the same time your cluster still keeps working (as it should with a
> size of 3).

You no longer have write durability if you only have one copy of a PG.

Sam is fixing things up so that recovery will work properly as long as
you have a whole copy of the PG, which should make things behave as
people expect.
-Greg
Re: [ceph-users] Ceph PG Incomplete = Cluster unusable
On Wed, Jan 7, 2015 at 10:55 PM, Christian Balzer <ch...@gol.com> wrote:
> Which of course begs the question of why not have min_size at 1
> permanently, so that in the (hopefully rare) case of losing 2 OSDs at
> the same time your cluster still keeps working (as it should with a
> size of 3).

The idea is that when a write happens, at least min_size copies are
committed on disk before the write is acknowledged back to the client,
just in case something happens to the disk before the data can be
replicated. It also goes against the strongly consistent model of Ceph.

I believe there is work to resolve the issue when the number of replicas
drops below min_size. Ceph should automatically start backfilling to get
back to at least min_size so that I/O can continue. I believe this work
is also tied to prioritizing backfills, so that cases like this are
backfilled first, and backfilling from min_size back up to size comes
second.

I am interested in a not-so-strict eventual-consistency option in Ceph,
so that under normal circumstances, instead of needing [size] writes to
OSDs to complete, only [min_size] are needed, and the primary OSD then
ensures that the laggy OSD(s) eventually get the write committed.
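Both knobs are per-pool and can be inspected at runtime; for example
(the pool name "rbd" is just an example):

    ceph osd pool get rbd size        # replicas written in total
    ceph osd pool get rbd min_size    # replicas required before I/O is
                                      # acknowledged / allowed to proceed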
Re: [ceph-users] Ceph PG Incomplete = Cluster unusable
On Mon, Dec 29, 2014 at 4:49 PM, Alexandre Oliva <ol...@gnu.org> wrote:
> However, I suspect that temporarily setting min_size to a lower number
> could be enough for the PGs to recover. If "ceph osd pool set <pool>
> min_size 1" doesn't get the PGs going, I suppose restarting at least
> one of the OSDs involved in the recovery, so that the PG undergoes
> peering again, would get you going again.

It depends on how incomplete your incomplete PGs are. min_size is
defined as "Sets the minimum number of replicas required for I/O". By
default, size is 3 and min_size is 2 on recent versions of Ceph. If the
number of replicas you have drops below min_size, then Ceph will mark
the PG as incomplete.

As long as you have one copy of the PG, you can recover by lowering
min_size to the number of copies you do have, then restoring the
original value after recovery is complete. I did this last week when I
deleted the wrong PGs as part of a "toofull" experiment.

If the number of replicas drops to 0, I think you can use "ceph pg
force_create_pg", but I haven't tested it.
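Spelled out, that recovery dance looks roughly like this (pool name and
values are examples; check your own pool's defaults first):

    # See which PGs are stuck and why
    ceph health detail | grep incomplete
    ceph pg dump_stuck inactive

    # Temporarily accept I/O and recovery with a single surviving copy
    ceph osd pool set <pool> min_size 1

    # Once backfill finishes, restore the original durability floor
    ceph osd pool set <pool> min_size 2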
Re: [ceph-users] Ceph PG Incomplete = Cluster unusable
On Wed, 7 Jan 2015 17:07:46 -0800 Craig Lewis wrote:
> On Mon, Dec 29, 2014 at 4:49 PM, Alexandre Oliva <ol...@gnu.org> wrote:
>> However, I suspect that temporarily setting min_size to a lower number
>> could be enough for the PGs to recover. If "ceph osd pool set <pool>
>> min_size 1" doesn't get the PGs going, I suppose restarting at least
>> one of the OSDs involved in the recovery, so that the PG undergoes
>> peering again, would get you going again.
>
> It depends on how incomplete your incomplete PGs are. min_size is
> defined as "Sets the minimum number of replicas required for I/O". By
> default, size is 3 and min_size is 2 on recent versions of Ceph. If the
> number of replicas you have drops below min_size, then Ceph will mark
> the PG as incomplete.
>
> As long as you have one copy of the PG, you can recover by lowering
> min_size to the number of copies you do have, then restoring the
> original value after recovery is complete. I did this last week when I
> deleted the wrong PGs as part of a "toofull" experiment.

Which of course begs the question of why not have min_size at 1
permanently, so that in the (hopefully rare) case of losing 2 OSDs at
the same time your cluster still keeps working (as it should with a size
of 3).

Christian

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
Re: [ceph-users] Ceph PG Incomplete = Cluster unusable
Hi Nico and all others who answered,

After some more attempts to somehow get the PGs into a working state
(I've tried force_create_pg, which put them into "creating" state -- but
that was obviously not true, since after rebooting one of the containing
OSDs they went back to "incomplete"), I decided to save what can be
saved.

I've created a new pool, created a new image there, and mapped the old
image from the old pool and the new image from the new pool to one
machine, to copy the data at the POSIX level. Unfortunately, formatting
the image from the new pool hangs after some time. So it seems that the
new pool is suffering from the same problem as the old pool, which is
totally incomprehensible to me.

Right now it looks like Ceph is giving me no options to either save some
of the still-intact RBD volumes, or to create a new pool alongside the
old one to at least enable our clients to send data to Ceph again. To
tell the truth, I guess that will mean the end of our Ceph project
(running for nine months already).

Regards,
Christian

Am 29.12.2014 15:59, schrieb Nico Schottelius:
> Hey Christian,
>
> Christian Eichelmann [Mon, Dec 29, 2014 at 10:56:59AM +0100]:
>> [incomplete PG / RBD hanging, osd lost also not helping]
>
> that is very interesting to hear, because we had a similar situation
> with ceph 0.80.7 and had to re-create a pool, after I deleted 3 pg
> directories to allow OSDs to start after a disk filled up completely.
>
> So I am sorry not to be able to give you a good hint, but I am very
> interested in seeing your problem solved, as it is a show stopper for
> us, too. (*)
>
> Cheers,
>
> Nico
>
> (*) We migrated from sheepdog to gluster to ceph, and so far sheepdog
> seems to run much smoother. The first one is however not supported by
> opennebula directly, and the second one is not flexible enough to host
> our heterogeneous infrastructure (mixed disk sizes/amounts) -- so we
> are using ceph at the moment.

--
Christian Eichelmann
Systemadministrator
1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
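The rescue path described above, spelled out (pool/image names, sizes,
and the /dev/rbd* device numbers are examples; "rbd map" prints the
actual device it attaches):

    ceph osd pool create rescue 2048
    rbd create rescue/vm01 --size 102400
    rbd map oldpool/vm01           # e.g. /dev/rbd0
    rbd map rescue/vm01            # e.g. /dev/rbd1
    mkfs.xfs /dev/rbd1             # <- the step that hung here
    mount /dev/rbd0 /mnt/old
    mount /dev/rbd1 /mnt/new
    rsync -a /mnt/old/ /mnt/new/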
Re: [ceph-users] Ceph PG Incomplete = Cluster unusable
Hi Christian,

Have you tried to migrate the disk from the old storage (pool) to the
new one? I think it will show the same problem, but I also think it
would be a much easier path to recover from than the POSIX copy.

How full is your storage? Maybe you can customize the crushmap, so that
some OSDs are left in the bad (default) pool, and other OSDs are set
aside for the new pool. I think (I'm still learning Ceph) that this will
give each pool different PGs and also different OSDs; maybe this way you
can overcome the issue.

Cheers
Eneko

On 30/12/14 12:17, Christian Eichelmann wrote:
> Hi Nico and all others who answered,
>
> After some more attempts to somehow get the PGs into a working state
> [...] I decided to save what can be saved.
>
> I've created a new pool, created a new image there, and mapped the old
> image from the old pool and the new image from the new pool to one
> machine, to copy the data at the POSIX level. Unfortunately, formatting
> the image from the new pool hangs after some time. So it seems that the
> new pool is suffering from the same problem as the old pool. [...]

--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943575997 / 943493611
Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
www.binovo.es
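One crushmap-based way to give a new pool its own OSDs, sketched under
the assumption of a spare host "node5" (the bucket, rule, and pool names
are made up; the rule id comes from "ceph osd crush rule dump"):

    ceph osd crush add-bucket rescue-root root
    ceph osd crush move node5 root=rescue-root
    ceph osd crush rule create-simple rescue-rule rescue-root host
    ceph osd pool set rescue crush_ruleset <rule-id>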
Re: [ceph-users] Ceph PG Incomplete = Cluster unusable
Hi Eneko,

I was trying "rbd cp" before, but that was hanging as well, and I
couldn't tell whether the source image or the destination image was
causing the hang. That's why I decided to try a POSIX copy.

Our cluster is still nearly empty (12TB / 867TB). But as far as I
understand (if not, somebody please correct me), placement groups are
generally not shared between pools at all.

Regards,
Christian

Am 30.12.2014 12:23, schrieb Eneko Lacunza:
> Hi Christian,
>
> Have you tried to migrate the disk from the old storage (pool) to the
> new one? I think it will show the same problem, but I also think it
> would be a much easier path to recover from than the POSIX copy.
>
> How full is your storage? Maybe you can customize the crushmap, so that
> some OSDs are left in the bad (default) pool, and other OSDs are set
> aside for the new pool. [...]

--
Christian Eichelmann
Systemadministrator
1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
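That understanding is correct: PG ids are prefixed with the pool's
numeric id, so pools never share PGs -- but they do share OSDs, which is
one way a fresh pool can inherit an old pool's trouble. A quick check
(pool and object names are placeholders):

    ceph osd lspools                     # shows each pool's numeric id
    ceph osd map oldpool rbd_directory   # -> a PG in the old pool
    ceph osd map newpool rbd_directory   # -> a PG in the new pool,
                                         #    possibly on the same OSDs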
Re: [ceph-users] Ceph PG Incomplete = Cluster unusable
Hi Christian,

Do the new pool's PGs also show as incomplete? Did you notice anything
remarkable in the Ceph logs while formatting the new pool's image?

On 30/12/14 12:31, Christian Eichelmann wrote:
> Hi Eneko,
>
> I was trying "rbd cp" before, but that was hanging as well, and I
> couldn't tell whether the source image or the destination image was
> causing the hang. That's why I decided to try a POSIX copy.
>
> Our cluster is still nearly empty (12TB / 867TB). But as far as I
> understand (if not, somebody please correct me), placement groups are
> generally not shared between pools at all. [...]

--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
www.binovo.es
Re: [ceph-users] Ceph PG Incomplete = Cluster unusable
Hi Eneko,

Nope, the new pool has all PGs active+clean, and there were no errors
during image creation. The format command just hangs, without any error.

Am 30.12.2014 12:33, schrieb Eneko Lacunza:
> Hi Christian,
>
> Do the new pool's PGs also show as incomplete? Did you notice anything
> remarkable in the Ceph logs while formatting the new pool's image?
> [...]

--
Christian Eichelmann
Systemadministrator
1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
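One way to tell whether RBD or the pool itself is at fault is to bypass
RBD and write to the pool with rados directly (pool and object names
below are examples):

    rados -p rescue bench 10 write -b 4096 -t 1   # small timed writes
    rados -p rescue put testobj /etc/hosts        # single-object write
    rados -p rescue stat testobj

    # If these hang too, look for slow/blocked requests on the OSDs
    ceph health detail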
Re: [ceph-users] Ceph PG Incomplete = Cluster unusable
Hey Christian,

Christian Eichelmann [Mon, Dec 29, 2014 at 10:56:59AM +0100]:
> [incomplete PG / RBD hanging, osd lost also not helping]

that is very interesting to hear, because we had a similar situation
with ceph 0.80.7 and had to re-create a pool, after I deleted 3 pg
directories to allow OSDs to start after a disk filled up completely.

So I am sorry not to be able to give you a good hint, but I am very
interested in seeing your problem solved, as it is a show stopper for
us, too. (*)

Cheers,

Nico

(*) We migrated from sheepdog to gluster to ceph, and so far sheepdog
seems to run much smoother. The first one is however not supported by
opennebula directly, and the second one is not flexible enough to host
our heterogeneous infrastructure (mixed disk sizes/amounts) -- so we are
using ceph at the moment.

--
New PGP key: 659B 0D91 E86E 7E24 FD15 69D0 C729 21A1 293F 2D24
Re: [ceph-users] Ceph PG Incomplete = Cluster unusable
Hi Christian,

I had a similar problem about a month ago. After trying lots of helpful
suggestions, I found none of them worked, and I could only delete the
affected pools and start over.

I opened a feature request in the tracker:
http://tracker.ceph.com/issues/10098

If you find a way, let us know!

Chad.
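For the record, that last resort looks like this. It irrevocably
destroys every object in the pool, which is why the pool name must be
repeated and the flag is deliberately scary (the name and pg count are
examples):

    ceph osd pool delete rbd rbd --yes-i-really-really-mean-it
    ceph osd pool create rbd 2048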
Re: [ceph-users] Ceph PG Incomplete = Cluster unusable
On Dec 29, 2014, Christian Eichelmann <christian.eichelm...@1und1.de> wrote:
> After we got everything up and running again, we still have 3 PGs in
> the state incomplete. I was checking one of them directly on the
> systems (replication factor is 3).

I have run into this myself at least twice before. I had not lost or
replaced the OSDs altogether, though; I had just rolled too many of them
back to earlier snapshots, which required them to be backfilled to catch
up. It looks like an OSD won't get out of the incomplete state, even to
backfill others, if doing so would keep the PG's active size under the
min size for the pool.

In my case, I brought the current-ish snapshot of the OSD back up to
enable backfilling of enough replicas, so that I could then roll the
remaining OSDs back again and have them backfilled too.

However, I suspect that temporarily setting min_size to a lower number
could be enough for the PGs to recover. If "ceph osd pool set <pool>
min_size 1" doesn't get the PGs going, I suppose restarting at least one
of the OSDs involved in the recovery, so that the PG undergoes peering
again, would get you going again.

Once backfilling completes for all formerly-incomplete PGs, or maybe
even as soon as backfilling begins, bringing the pool's min_size back up
to (presumably) 2 is advisable. You don't want to run too long with a
too-low min_size :-)

I hope this helps,

Happy GNU Year,

--
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist | Red Hat Brasil GNU Toolchain Engineer
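The suggested sequence, spelled out (the pool name, PG id, and OSD
number are examples; "ceph pg <pgid> query" lists the OSDs actually
involved, and the restart command depends on your init system):

    ceph osd pool set rbd min_size 1

    # If PGs stay incomplete, re-trigger peering on one involved OSD
    ceph pg 3.1f query                 # note the acting/probing OSDs
    /etc/init.d/ceph restart osd.7     # or: service ceph restart osd.7

    # Watch recovery, then restore the durability floor
    ceph -w
    ceph osd pool set rbd min_size 2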