[ceph-users] Restarting OSD leads to lower CPU usage
Hi, hoping someone can point me in the right direction. Some of my OSDs have higher CPU usage (and op latencies) than others. If I restart the OSD everything runs nicely for some time, then it creeps up again.

1) Most of my OSDs have ~40% CPU (core) usage (user+sys); some are closer to 80%. Restarting means the offending OSDs only use 40% again.
2) Average latencies and CPU usage on the host are the same - so it's not caused by the host that the OSD is running on.
3) I can't say exactly when or how the issue happens. I can't even say if it's the same OSDs. It seems it either happens when something heavy happens in the cluster (like dropping very old snapshots, or rebalancing) and then doesn't go away, or maybe it happens slowly over time and I can't find it in the graphs. Looking at the graphs it seems to be the former.

I have just one suspicion and that is the "fd cache size" - we have it set to 16384, but the open fds suggest there are more open files for the osd process (over 17K fds) - it varies by some hundreds between the OSDs. Maybe some are just slightly over the limit and the cache misses cause this? Restarting the OSD clears them (~2K) and they increase over time. I increased it to 32768 yesterday and it is consistently nice now, but it might take another few days to manifest… Could this explain it? Any other tips?

Thanks
Jan
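For reference, this is the FileStore-era fd cache setting; a minimal sketch of the config change described above, plus a way to compare it against the actual fd count of each OSD on a host:

    [osd]
    # FileStore fd cache; keep it above the observed per-OSD fd count
    filestore fd cache size = 32768

    # count open fds for every ceph-osd process on this host
    for pid in $(pidof ceph-osd); do
        echo "pid $pid: $(ls /proc/$pid/fd | wc -l) open fds"
    done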
Re: [ceph-users] krbd splitting large IO's into smaller IO's
On Thu, Jun 11, 2015 at 2:23 PM, Ilya Dryomov idryo...@gmail.com wrote:
On Wed, Jun 10, 2015 at 7:07 PM, Ilya Dryomov idryo...@gmail.com wrote:
On Wed, Jun 10, 2015 at 7:04 PM, Nick Fisk n...@fisk.me.uk wrote:
-Original Message-
From: Ilya Dryomov [mailto:idryo...@gmail.com]
Sent: 10 June 2015 14:06
To: Nick Fisk
Cc: ceph-users
Subject: Re: [ceph-users] krbd splitting large IO's into smaller IO's

On Wed, Jun 10, 2015 at 2:47 PM, Nick Fisk n...@fisk.me.uk wrote:

Hi,

Using the kernel RBD client with kernel 4.0.3 (I have also tried some older kernels with the same effect), IO is being split into smaller IOs, which is having a negative impact on performance.

cat /sys/block/sdc/queue/max_hw_sectors_kb
4096
cat /sys/block/rbd0/queue/max_sectors_kb
4096

Using DD: dd if=/dev/rbd0 of=/dev/null bs=4M

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
rbd0 0.00 0.00 201.50 0.00 25792.00 0.00 256.00 1.99 10.15 10.15 0.00 4.96 100.00

Using FIO with 4M blocks:

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
rbd0 0.00 0.00 232.00 0.00 118784.00 0.00 1024.00 11.29 48.58 48.58 0.00 4.31 100.00

Any ideas why IO sizes are limited to 128k (256 blocks) in DD's case and 512k in FIO's case?

128k vs 512k is probably buffered vs direct IO - add iflag=direct to your dd invocation.

Yes, thanks for this, that was the case.

As for the 512k - I'm pretty sure it's a regression in our switch to blk-mq. I tested it around 3.18-3.19 and saw steady 4M IOs. I hope we are just missing a knob - I'll take a look.

I've tested both 4.0.3 and 3.16 and both seem to split into 512k. Let me know if you need me to test any other particular version.

With 3.16 you are going to need to adjust max_hw_sectors_kb / max_sectors_kb as discussed in Dan's thread. The patch that fixed that in the block layer went into 3.19, blk-mq into 4.0 - try 3.19.

Sorry, should have mentioned, I had adjusted both of them on the 3.16 kernel to 4096. I will try 3.19 and let you know.

Better with 3.19, but should I not be seeing around 8192, or am I getting my blocks and bytes mixed up?

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
rbd0 72.00 0.00 24.00 0.00 49152.00 0.00 4096.00 1.96 82.67 82.67 0.00 41.58 99.80

I'd expect 8192. I'm getting a box for investigation.

OK, so this is a bug in the blk-mq part of the block layer. There is no plugging going on in the single hardware queue (i.e. krbd) case - it never once plugs the queue, and that means no request merging is done for your direct sequential read test. It gets 512k bios and those same 512k requests are issued to krbd. While queue plugging may not make sense in the multi-queue case, I'm pretty sure it's supposed to plug in the single-queue case. Looks like the use_plug logic in blk_sq_make_request() is busted.

It turns out to be a year-old regression. Before commit 07068d5b8ed8 ("blk-mq: split make request handler for multi and single queue") it used to be (reads are considered sync)

use_plug = !is_flush_fua && ((q->nr_hw_queues == 1) || !is_sync);

and now it is

use_plug = !is_flush_fua && !is_sync;

in a function that is only called if q->nr_hw_queues == 1. This is getting fixed by "blk-mq: fix plugging in blk_sq_make_request" from Jeff Moyer - http://article.gmane.org/gmane.linux.kernel/1941750. Looks like it's on its way to mainline along with some other blk-mq plugging fixes.

Thanks,
Ilya
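The symptoms above are easy to reproduce; a minimal sketch, assuming a mapped /dev/rbd0:

    # make sure the queue limits allow large requests
    cat /sys/block/rbd0/queue/max_sectors_kb
    echo 4096 > /sys/block/rbd0/queue/max_sectors_kb

    # direct sequential read; buffered IO gets split into 128k readahead-sized requests
    dd if=/dev/rbd0 of=/dev/null bs=4M iflag=direct &

    # watch avgrq-sz (in 512-byte sectors): 8192 = 4M requests, 1024 = 512k
    iostat -x 1 rbd0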
Re: [ceph-users] [Fwd: adding a monitor will result in cephx: verify_reply couldn't decrypt with error: error decoding block for decryption]
It is necessary to synchronize time.

2015-06-11 11:09 GMT+03:00 Makkelie, R (ITCDCC) - KLM ramon.makke...@klm.com:

I'm trying to add an extra monitor to my already existing cluster. I do this with ceph-deploy with the following command: ceph-deploy mon add mynewhost

ceph-deploy says it's all finished, but when I take a look at my new monitor host in the logs I see the following error:

cephx: verify_reply couldn't decrypt with error: error decoding block for decryption

and when I take a look in my existing monitor logs I see this error:

cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190

I tried gathering keys, copying keys, and reinstalling/purging the new monitor node.

greetz
Ramon

--
Best regards, Irek Fasikhov
Mob.: +79229045757
Re: [ceph-users] [Fwd: adding a monitor will result in cephx: verify_reply couldn't decrypt with error: error decoding block for decryption]
@Irek Fasikhov: as I said in my previous post, all my server clocks are in sync. I double-checked it several times just to be sure. I hope you have some other clues.

-Original Message-
From: Irek Fasikhov malm...@gmail.com
To: Makkelie, R (ITCDCC) - KLM ramon.makke...@klm.com
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] [Fwd: adding a monitor will result in cephx: verify_reply couldn't decrypt with error: error decoding block for decryption]
Date: Thu, 11 Jun 2015 12:38:10 +0300

Run the following command by hand: ntpdate NTPADDRESS

2015-06-11 12:36 GMT+03:00 Makkelie, R (ITCDCC) - KLM ramon.makke...@klm.com:

All Ceph-related servers have the same NTP server, and I double-checked the time and timezones - they are all correct.

-Original Message-
From: Irek Fasikhov malm...@gmail.com
To: Makkelie, R (ITCDCC) - KLM ramon.makke...@klm.com
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] [Fwd: adding a monitor will result in cephx: verify_reply couldn't decrypt with error: error decoding block for decryption]
Date: Thu, 11 Jun 2015 12:16:53 +0300

It is necessary to synchronize time.

2015-06-11 11:09 GMT+03:00 Makkelie, R (ITCDCC) - KLM ramon.makke...@klm.com:

I'm trying to add an extra monitor to my already existing cluster. I do this with ceph-deploy with the following command: ceph-deploy mon add mynewhost

ceph-deploy says it's all finished, but when I take a look at my new monitor host in the logs I see the following error:

cephx: verify_reply couldn't decrypt with error: error decoding block for decryption

and when I take a look in my existing monitor logs I see this error:

cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190

I tried gathering keys, copying keys, and reinstalling/purging the new monitor node.

greetz
Ramon

--
Best regards, Irek Fasikhov
Mob.: +79229045757
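When the clocks really are in sync, these decrypt errors usually point at mismatched keys rather than time skew. A sketch of one way to compare what the new mon has on disk with what the cluster expects (default paths and a cluster named "ceph" assumed):

    # on an existing, healthy monitor: the mon. key the cluster uses
    ceph auth get mon.

    # on the new monitor host: the key it was deployed with
    cat /var/lib/ceph/mon/ceph-$(hostname -s)/keyring

If the two mon. secrets differ, redeploy the new monitor with the correct ceph.mon.keyring.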
Re: [ceph-users] Restarting OSD leads to lower CPU usage
Hi Jan,

Can you get perf top running? It should show you where the OSDs are spinning...

Cheers, Dan

On Thu, Jun 11, 2015 at 11:21 AM, Jan Schermer j...@schermer.cz wrote:

Hi, hoping someone can point me in the right direction. Some of my OSDs have higher CPU usage (and op latencies) than others. If I restart the OSD everything runs nicely for some time, then it creeps up again.

1) Most of my OSDs have ~40% CPU (core) usage (user+sys); some are closer to 80%. Restarting means the offending OSDs only use 40% again.
2) Average latencies and CPU usage on the host are the same - so it's not caused by the host that the OSD is running on.
3) I can't say exactly when or how the issue happens. I can't even say if it's the same OSDs. It seems it either happens when something heavy happens in the cluster (like dropping very old snapshots, or rebalancing) and then doesn't go away, or maybe it happens slowly over time and I can't find it in the graphs. Looking at the graphs it seems to be the former.

I have just one suspicion and that is the "fd cache size" - we have it set to 16384, but the open fds suggest there are more open files for the osd process (over 17K fds) - it varies by some hundreds between the OSDs. Maybe some are just slightly over the limit and the cache misses cause this? Restarting the OSD clears them (~2K) and they increase over time. I increased it to 32768 yesterday and it is consistently nice now, but it might take another few days to manifest… Could this explain it? Any other tips?

Thanks
Jan
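A minimal sketch of profiling one OSD this way (the osd id 12 is a placeholder; adjust the pgrep pattern to your command line, and install debug symbols for useful ceph-osd symbol names):

    # live view of where one ceph-osd process spends CPU
    perf top -p $(pgrep -f 'ceph-osd.*-i 12' | head -n1)

    # or record for a while and inspect offline
    perf record -g -p <osd-pid> -- sleep 30
    perf report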
Re: [ceph-users] [Fwd: adding a monitor will result in cephx: verify_reply couldn't decrypt with error: error decoding block for decryption]
All Ceph-related servers have the same NTP server, and I double-checked the time and timezones - they are all correct.

-Original Message-
From: Irek Fasikhov malm...@gmail.com
To: Makkelie, R (ITCDCC) - KLM ramon.makke...@klm.com
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] [Fwd: adding a monitor will result in cephx: verify_reply couldn't decrypt with error: error decoding block for decryption]
Date: Thu, 11 Jun 2015 12:16:53 +0300

It is necessary to synchronize time.

2015-06-11 11:09 GMT+03:00 Makkelie, R (ITCDCC) - KLM ramon.makke...@klm.com:

I'm trying to add an extra monitor to my already existing cluster. I do this with ceph-deploy with the following command: ceph-deploy mon add mynewhost

ceph-deploy says it's all finished, but when I take a look at my new monitor host in the logs I see the following error:

cephx: verify_reply couldn't decrypt with error: error decoding block for decryption

and when I take a look in my existing monitor logs I see this error:

cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190

I tried gathering keys, copying keys, and reinstalling/purging the new monitor node.

greetz
Ramon

--
Best regards, Irek Fasikhov
Mob.: +79229045757
Re: [ceph-users] krbd splitting large IO's into smaller IO's
On Wed, Jun 10, 2015 at 7:07 PM, Ilya Dryomov idryo...@gmail.com wrote:
On Wed, Jun 10, 2015 at 7:04 PM, Nick Fisk n...@fisk.me.uk wrote:
-Original Message-
From: Ilya Dryomov [mailto:idryo...@gmail.com]
Sent: 10 June 2015 14:06
To: Nick Fisk
Cc: ceph-users
Subject: Re: [ceph-users] krbd splitting large IO's into smaller IO's

On Wed, Jun 10, 2015 at 2:47 PM, Nick Fisk n...@fisk.me.uk wrote:

Hi,

Using the kernel RBD client with kernel 4.0.3 (I have also tried some older kernels with the same effect), IO is being split into smaller IOs, which is having a negative impact on performance.

cat /sys/block/sdc/queue/max_hw_sectors_kb
4096
cat /sys/block/rbd0/queue/max_sectors_kb
4096

Using DD: dd if=/dev/rbd0 of=/dev/null bs=4M

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
rbd0 0.00 0.00 201.50 0.00 25792.00 0.00 256.00 1.99 10.15 10.15 0.00 4.96 100.00

Using FIO with 4M blocks:

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
rbd0 0.00 0.00 232.00 0.00 118784.00 0.00 1024.00 11.29 48.58 48.58 0.00 4.31 100.00

Any ideas why IO sizes are limited to 128k (256 blocks) in DD's case and 512k in FIO's case?

128k vs 512k is probably buffered vs direct IO - add iflag=direct to your dd invocation.

Yes, thanks for this, that was the case.

As for the 512k - I'm pretty sure it's a regression in our switch to blk-mq. I tested it around 3.18-3.19 and saw steady 4M IOs. I hope we are just missing a knob - I'll take a look.

I've tested both 4.0.3 and 3.16 and both seem to split into 512k. Let me know if you need me to test any other particular version.

With 3.16 you are going to need to adjust max_hw_sectors_kb / max_sectors_kb as discussed in Dan's thread. The patch that fixed that in the block layer went into 3.19, blk-mq into 4.0 - try 3.19.

Sorry, should have mentioned, I had adjusted both of them on the 3.16 kernel to 4096. I will try 3.19 and let you know.

Better with 3.19, but should I not be seeing around 8192, or am I getting my blocks and bytes mixed up?

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
rbd0 72.00 0.00 24.00 0.00 49152.00 0.00 4096.00 1.96 82.67 82.67 0.00 41.58 99.80

I'd expect 8192. I'm getting a box for investigation.

OK, so this is a bug in the blk-mq part of the block layer. There is no plugging going on in the single hardware queue (i.e. krbd) case - it never once plugs the queue, and that means no request merging is done for your direct sequential read test. It gets 512k bios and those same 512k requests are issued to krbd. While queue plugging may not make sense in the multi-queue case, I'm pretty sure it's supposed to plug in the single-queue case. Looks like the use_plug logic in blk_sq_make_request() is busted.

Thanks,
Ilya
Re: [ceph-users] [Fwd: adding a monitor will result in cephx: verify_reply couldn't decrypt with error: error decoding block for decryption]
Run the following command by hand: ntpdate NTPADDRESS

2015-06-11 12:36 GMT+03:00 Makkelie, R (ITCDCC) - KLM ramon.makke...@klm.com:

All Ceph-related servers have the same NTP server, and I double-checked the time and timezones - they are all correct.

-Original Message-
From: Irek Fasikhov malm...@gmail.com
To: Makkelie, R (ITCDCC) - KLM ramon.makke...@klm.com
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] [Fwd: adding a monitor will result in cephx: verify_reply couldn't decrypt with error: error decoding block for decryption]
Date: Thu, 11 Jun 2015 12:16:53 +0300

It is necessary to synchronize time.

2015-06-11 11:09 GMT+03:00 Makkelie, R (ITCDCC) - KLM ramon.makke...@klm.com:

I'm trying to add an extra monitor to my already existing cluster. I do this with ceph-deploy with the following command: ceph-deploy mon add mynewhost

ceph-deploy says it's all finished, but when I take a look at my new monitor host in the logs I see the following error:

cephx: verify_reply couldn't decrypt with error: error decoding block for decryption

and when I take a look in my existing monitor logs I see this error:

cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190

I tried gathering keys, copying keys, and reinstalling/purging the new monitor node.

greetz
Ramon

--
Best regards, Irek Fasikhov
Mob.: +79229045757
Re: [ceph-users] Restarting OSD leads to lower CPU usage
On 6/11/15 12:21, Jan Schermer wrote:

Hi, hoping someone can point me in the right direction. Some of my OSDs have higher CPU usage (and op latencies) than others. If I restart the OSD everything runs nicely for some time, then it creeps up again.

1) Most of my OSDs have ~40% CPU (core) usage (user+sys); some are closer to 80%. Restarting means the offending OSDs only use 40% again.
2) Average latencies and CPU usage on the host are the same - so it's not caused by the host that the OSD is running on.
3) I can't say exactly when or how the issue happens. I can't even say if it's the same OSDs. It seems it either happens when something heavy happens in the cluster (like dropping very old snapshots, or rebalancing) and then doesn't go away, or maybe it happens slowly over time and I can't find it in the graphs. Looking at the graphs it seems to be the former.

I have just one suspicion and that is the "fd cache size" - we have it set to 16384, but the open fds suggest there are more open files for the osd process (over 17K fds) - it varies by some hundreds between the OSDs. Maybe some are just slightly over the limit and the cache misses cause this? Restarting the OSD clears them (~2K) and they increase over time. I increased it to 32768 yesterday and it is consistently nice now, but it might take another few days to manifest… Could this explain it? Any other tips?

What about disk IO? Are the OSDs scrubbing or deep-scrubbing?

Thanks
Jan
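A quick sketch of checking whether scrubbing is in flight while the CPU is high (standard CLI, nothing cluster-specific assumed):

    ceph -s                             # health/pgmap lines list active scrubs
    ceph pg dump | grep -c scrubbing    # PGs currently scrubbing or deep-scrubbing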
Re: [ceph-users] Restarting OSD leads to lower CPU usage
On 11 Jun 2015, at 11:53, Henrik Korkuc li...@kirneh.eu wrote:
On 6/11/15 12:21, Jan Schermer wrote:

Hi, hoping someone can point me in the right direction. Some of my OSDs have higher CPU usage (and op latencies) than others. If I restart the OSD everything runs nicely for some time, then it creeps up again.

1) Most of my OSDs have ~40% CPU (core) usage (user+sys); some are closer to 80%. Restarting means the offending OSDs only use 40% again.
2) Average latencies and CPU usage on the host are the same - so it's not caused by the host that the OSD is running on.
3) I can't say exactly when or how the issue happens. I can't even say if it's the same OSDs. It seems it either happens when something heavy happens in the cluster (like dropping very old snapshots, or rebalancing) and then doesn't go away, or maybe it happens slowly over time and I can't find it in the graphs. Looking at the graphs it seems to be the former.

I have just one suspicion and that is the "fd cache size" - we have it set to 16384, but the open fds suggest there are more open files for the osd process (over 17K fds) - it varies by some hundreds between the OSDs. Maybe some are just slightly over the limit and the cache misses cause this? Restarting the OSD clears them (~2K) and they increase over time. I increased it to 32768 yesterday and it is consistently nice now, but it might take another few days to manifest… Could this explain it? Any other tips?

What about disk IO? Are the OSDs scrubbing or deep-scrubbing?

Nope, the OSDs are not scrubbing or deep-scrubbing, and I see the same amount of ops/sec on the OSD as before the restart. The things that are not yet at the previous (before-restart) level are:

a) threads: 2200 before restart, 2050 now, slowly going up
b) open files - changed with fdcache, ~17500 before restart, 31000 now
c) memory usage: rss 1.7 GiB before vs 1.1 now, vss 4.7 vs 3.5 now

The amount of work is still the same.

Jan
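For anyone wanting to track the same numbers, a sketch that iterates over every ceph-osd on a host:

    for pid in $(pidof ceph-osd); do
        echo "pid $pid: $(ls /proc/$pid/fd | wc -l) fds, $(ps -o nlwp= -p $pid) threads"
        ps -o rss=,vsz= -p $pid   # resident and virtual size, in KiB
    done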
Re: [ceph-users] krbd splitting large IO's into smaller IO's
-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Ilya Dryomov
Sent: 11 June 2015 12:33
To: Nick Fisk
Cc: ceph-users
Subject: Re: [ceph-users] krbd splitting large IO's into smaller IO's

On Thu, Jun 11, 2015 at 2:23 PM, Ilya Dryomov idryo...@gmail.com wrote:
On Wed, Jun 10, 2015 at 7:07 PM, Ilya Dryomov idryo...@gmail.com wrote:
On Wed, Jun 10, 2015 at 7:04 PM, Nick Fisk n...@fisk.me.uk wrote:
-Original Message-
From: Ilya Dryomov [mailto:idryo...@gmail.com]
Sent: 10 June 2015 14:06
To: Nick Fisk
Cc: ceph-users
Subject: Re: [ceph-users] krbd splitting large IO's into smaller IO's

On Wed, Jun 10, 2015 at 2:47 PM, Nick Fisk n...@fisk.me.uk wrote:

Hi,

Using the kernel RBD client with kernel 4.0.3 (I have also tried some older kernels with the same effect), IO is being split into smaller IOs, which is having a negative impact on performance.

cat /sys/block/sdc/queue/max_hw_sectors_kb
4096
cat /sys/block/rbd0/queue/max_sectors_kb
4096

Using DD: dd if=/dev/rbd0 of=/dev/null bs=4M

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
rbd0 0.00 0.00 201.50 0.00 25792.00 0.00 256.00 1.99 10.15 10.15 0.00 4.96 100.00

Using FIO with 4M blocks:

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
rbd0 0.00 0.00 232.00 0.00 118784.00 0.00 1024.00 11.29 48.58 48.58 0.00 4.31 100.00

Any ideas why IO sizes are limited to 128k (256 blocks) in DD's case and 512k in FIO's case?

128k vs 512k is probably buffered vs direct IO - add iflag=direct to your dd invocation.

Yes, thanks for this, that was the case.

As for the 512k - I'm pretty sure it's a regression in our switch to blk-mq. I tested it around 3.18-3.19 and saw steady 4M IOs. I hope we are just missing a knob - I'll take a look.

I've tested both 4.0.3 and 3.16 and both seem to split into 512k. Let me know if you need me to test any other particular version.

With 3.16 you are going to need to adjust max_hw_sectors_kb / max_sectors_kb as discussed in Dan's thread. The patch that fixed that in the block layer went into 3.19, blk-mq into 4.0 - try 3.19.

Sorry, should have mentioned, I had adjusted both of them on the 3.16 kernel to 4096. I will try 3.19 and let you know.

Better with 3.19, but should I not be seeing around 8192, or am I getting my blocks and bytes mixed up?

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
rbd0 72.00 0.00 24.00 0.00 49152.00 0.00 4096.00 1.96 82.67 82.67 0.00 41.58 99.80

I'd expect 8192. I'm getting a box for investigation.

OK, so this is a bug in the blk-mq part of the block layer. There is no plugging going on in the single hardware queue (i.e. krbd) case - it never once plugs the queue, and that means no request merging is done for your direct sequential read test. It gets 512k bios and those same 512k requests are issued to krbd. While queue plugging may not make sense in the multi-queue case, I'm pretty sure it's supposed to plug in the single-queue case. Looks like the use_plug logic in blk_sq_make_request() is busted.

It turns out to be a year-old regression. Before commit 07068d5b8ed8 ("blk-mq: split make request handler for multi and single queue") it used to be (reads are considered sync)

use_plug = !is_flush_fua && ((q->nr_hw_queues == 1) || !is_sync);

and now it is

use_plug = !is_flush_fua && !is_sync;

in a function that is only called if q->nr_hw_queues == 1. This is getting fixed by "blk-mq: fix plugging in blk_sq_make_request" from Jeff Moyer - http://article.gmane.org/gmane.linux.kernel/1941750. Looks like it's on its way to mainline along with some other blk-mq plugging fixes.

That's great, do you think it will make 4.2?

Thanks,
Ilya
Re: [ceph-users] Ceph giant installation fails on rhel 7.0
Have you configured and enabled the EPEL repo?

- Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Thu, Jun 11, 2015 at 6:26 AM, Shambhu Rajak wrote:

I am trying to install Ceph Giant on RHEL 7.0. While installing ceph-common-0.87.2-0.el7.x86_64.rpm I am getting the following dependency failure:

$ sudo yum install ceph-common-0.87.2-0.el7.x86_64.rpm
Loaded plugins: amazon-id, priorities, rhui-lb
Examining ceph-common-0.87.2-0.el7.x86_64.rpm: 1:ceph-common-0.87.2-0.el7.x86_64
Marking ceph-common-0.87.2-0.el7.x86_64.rpm to be installed
Resolving Dependencies
--> Running transaction check
---> Package ceph-common.x86_64 1:0.87.2-0.el7 will be installed
--> Processing Dependency: libtcmalloc.so.4()(64bit) for package: 1:ceph-common-0.87.2-0.el7.x86_64
--> Finished Dependency Resolution
Error: Package: 1:ceph-common-0.87.2-0.el7.x86_64 (/ceph-common-0.87.2-0.el7.x86_64)
       Requires: libtcmalloc.so.4()(64bit)
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest

So I am trying to install gperftools-libs to resolve the dependency, but I am unable to get the package using yum install. Can anyone help me with the complete list of dependencies to install Ceph Giant on RHEL 7.0?

Thanks,
Shambhu
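For reference, a sketch of enabling EPEL on RHEL 7, which provides gperftools-libs (the URL is the usual Fedora mirror; verify it for your environment):

    sudo rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
    sudo yum install gperftools-libs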
Re: [ceph-users] krbd splitting large IO's into smaller IO's
On Thu, Jun 11, 2015 at 5:30 PM, Nick Fisk n...@fisk.me.uk wrote:
-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Ilya Dryomov
Sent: 11 June 2015 12:33
To: Nick Fisk
Cc: ceph-users
Subject: Re: [ceph-users] krbd splitting large IO's into smaller IO's

On Thu, Jun 11, 2015 at 2:23 PM, Ilya Dryomov idryo...@gmail.com wrote:
On Wed, Jun 10, 2015 at 7:07 PM, Ilya Dryomov idryo...@gmail.com wrote:
On Wed, Jun 10, 2015 at 7:04 PM, Nick Fisk n...@fisk.me.uk wrote:

[earlier thread as above: buffered vs direct IO explains the 128k vs 512k split; with 3.19 avgrq-sz reaches 4096 instead of the expected 8192]

I'd expect 8192. I'm getting a box for investigation.

OK, so this is a bug in the blk-mq part of the block layer. There is no plugging going on in the single hardware queue (i.e. krbd) case - it never once plugs the queue, and that means no request merging is done for your direct sequential read test. It gets 512k bios and those same 512k requests are issued to krbd. While queue plugging may not make sense in the multi-queue case, I'm pretty sure it's supposed to plug in the single-queue case. Looks like the use_plug logic in blk_sq_make_request() is busted.

It turns out to be a year-old regression. Before commit 07068d5b8ed8 ("blk-mq: split make request handler for multi and single queue") it used to be (reads are considered sync)

use_plug = !is_flush_fua && ((q->nr_hw_queues == 1) || !is_sync);

and now it is

use_plug = !is_flush_fua && !is_sync;

in a function that is only called if q->nr_hw_queues == 1. This is getting fixed by "blk-mq: fix plugging in blk_sq_make_request" from Jeff Moyer - http://article.gmane.org/gmane.linux.kernel/1941750. Looks like it's on its way to mainline along with some other blk-mq plugging fixes.

That's great, do you think it will make 4.2?

Depends on Jens, but I think it will.

Thanks,
Ilya
Re: [ceph-users] Load balancing RGW and Scaleout
Hum, thanks David, I will check corosync. And maybe Consul could be a solution?

Sent from my iPhone

On 11 juin 2015, at 11:33, David Moreau Simard dmsim...@iweb.com wrote:

What I've seen work well is to set multiple A records for your RGW endpoint. Then, with something like corosync, you ensure that these multiple IP addresses are always bound somewhere. You can then have as many nodes in active-active mode as you want.
--
David Moreau Simard

On 2015-06-11 11:29 AM, Florent MONTHEL wrote:

Hi team,

Is it possible for you to share your radosgw setup, in order to use the maximum network bandwidth and have no SPOF? I have 5 servers on a 10Gb network and 3 radosgw on them. We would like to set up HAProxy on 1 node in front of the 3 RGWs, but:
- the SPOF becomes the HAProxy node
- max bandwidth will be the HAProxy node's (10Gb/s)

Thanks

Sent from my iPhone
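A minimal sketch of the multiple-A-records approach (names and addresses are placeholders):

    rgw.example.com.  60  IN  A  192.0.2.11
    rgw.example.com.  60  IN  A  192.0.2.12
    rgw.example.com.  60  IN  A  192.0.2.13

Clients round-robin across the records, and corosync (or keepalived) floats each address to a surviving node when an RGW host fails, so no single box sits in the data path.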
[ceph-users] v9.0.1 released
This development release is delayed a bit due to tooling changes in the build environment. As a result the next one (v9.0.2) will have a bit more work than is usual. Highlights here include lots of RGW Swift fixes, RBD feature work surrounding the new object map feature, more CephFS snapshot fixes, and a few important CRUSH fixes.

Notable Changes
---------------

* auth: cache/reuse crypto lib key objects, optimize msg signature check (Sage Weil)
* build: allow tcmalloc-minimal (Thorsten Behrens)
* build: do not build ceph-dencoder with tcmalloc (#10691 Boris Ranto)
* build: fix pg ref disabling (William A. Kennington III)
* build: install-deps.sh improvements (Loic Dachary)
* build: misc fixes (Boris Ranto, Ken Dreyer, Owen Synge)
* ceph-authtool: fix return code on error (Gerhard Muntingh)
* ceph-disk: fix zap sgdisk invocation (Owen Synge, Thorsten Behrens)
* ceph-disk: pass --cluster arg on prepare subcommand (Kefu Chai)
* ceph-fuse, libcephfs: drop inode when rmdir finishes (#11339 Yan, Zheng)
* ceph-fuse, libcephfs: fix uninline (#11356 Yan, Zheng)
* ceph-monstore-tool: fix store-copy (Huangjun)
* common: add perf counter descriptions (Alyona Kiseleva)
* common: fix throttle max change (Henry Chang)
* crush: fix crash from invalid 'take' argument (#11602 Shiva Rkreddy, Sage Weil)
* crush: fix divide-by-2 in straw2 (#11357 Yann Dupont, Sage Weil)
* deb: fix rest-bench-dbg and ceph-test-dbg dependencies (Ken Dreyer)
* doc: document region hostnames (Robin H. Johnson)
* doc: update release schedule docs (Loic Dachary)
* init-radosgw: run radosgw as root (#11453 Ken Dreyer)
* librados: fadvise flags per op (Jianpeng Ma)
* librbd: allow additional metadata to be stored with the image (Haomai Wang)
* librbd: better handling for dup flatten requests (#11370 Jason Dillaman)
* librbd: cancel in-flight ops on watch error (#11363 Jason Dillaman)
* librbd: default new images to format 2 (#11348 Jason Dillaman)
* librbd: fast diff implementation that leverages object map (Jason Dillaman)
* librbd: fix snapshot creation when other snap is active (#11475 Jason Dillaman)
* librbd: new diff_iterate2 API (Jason Dillaman)
* librbd: object map rebuild support (Jason Dillaman)
* logrotate.d: prefer service over invoke-rc.d (#11330 Win Hierman, Sage Weil)
* mds: avoid getting stuck in XLOCKDONE (#11254 Yan, Zheng)
* mds: fix integer truncation on large client ids (Henry Chang)
* mds: many snapshot and stray fixes (Yan, Zheng)
* mds: persist completed_requests reliably (#11048 John Spray)
* mds: separate safe_pos in Journaler (#10368 John Spray)
* mds: snapshot rename support (#3645 Yan, Zheng)
* mds: warn when clients fail to advance oldest_client_tid (#10657 Yan, Zheng)
* misc cleanups and fixes (Danny Al-Gaaf)
* mon: fix average utilization calc for 'osd df' (Mykola Golub)
* mon: fix variance calc in 'osd df' (Sage Weil)
* mon: improve callout to crushtool (Mykola Golub)
* mon: prevent bucket deletion when referenced by a crush rule (#11602 Sage Weil)
* mon: prime pg_temp when CRUSH map changes (Sage Weil)
* monclient: flush_log (John Spray)
* msgr: async: many many fixes (Haomai Wang)
* msgr: simple: fix clear_pipe (#11381 Haomai Wang)
* osd: add latency perf counters for tier operations (Xinze Chi)
* osd: avoid multiple hit set insertions (Zhiqiang Wang)
* osd: break PG removal into multiple iterations (#10198 Guang Yang)
* osd: check scrub state when handling map (Jianpeng Ma)
* osd: fix endless repair when object is unrecoverable (Jianpeng Ma, Kefu Chai)
* osd: fix pg resurrection (#11429 Samuel Just)
* osd: ignore non-existent osds in unfound calc (#10976 Mykola Golub)
* osd: increase default max open files (Owen Synge)
* osd: prepopulate needs_recovery_map when only one peer has missing (#9558 Guang Yang)
* osd: relax reply order on proxy read (#11211 Zhiqiang Wang)
* osd: skip promotion for flush/evict op (Zhiqiang Wang)
* osd: write journal header on clean shutdown (Xinze Chi)
* qa: run-make-check.sh script (Loic Dachary)
* rados bench: misc fixes (Dmitry Yatsushkevich)
* rados: fix error message on failed pool removal (Wido den Hollander)
* radosgw-admin: add 'bucket check' function to repair bucket index (Yehuda Sadeh)
* rbd: allow unmapping by spec (Ilya Dryomov)
* rbd: deprecate --new-format option (Jason Dillaman)
* rgw: do not set content-type if length is 0 (#11091 Orit Wasserman)
* rgw: don't use end_marker for namespaced object listing (#11437 Yehuda Sadeh)
* rgw: fail if parts not specified on multipart upload (#11435 Yehuda Sadeh)
* rgw: fix GET on swift account when limit == 0 (#10683 Radoslaw Zarzynski)
* rgw: fix broken stats in container listing (#11285 Radoslaw Zarzynski)
* rgw: fix bug in domain/subdomain splitting (Robin H. Johnson)
* rgw: fix civetweb max threads (#10243 Yehuda Sadeh)
* rgw: fix copy metadata, support X-Copied-From for swift (#10663 Radoslaw Zarzynski)
* rgw: fix locator for objects starting with _ (#11442 Yehuda Sadeh)
Re: [ceph-users] NFS interaction with RBD
Hi George,

Well that's strange. I wonder why our systems behave so differently. We've got: hypervisors running on Ubuntu 14.04; VMs with 9 Ceph volumes, 2 TB each; XFS instead of your ext4. Maybe the number of placement groups plays a major role as well. Jens-Christian may be able to give you the specifics of our Ceph cluster. I'm about to leave on vacation and don't have time to look that up anymore.

Best regards,
Christian

On 29 May 2015, at 14:42, Georgios Dimitrakakis gior...@acmac.uoc.gr wrote:

All,

I've tried to recreate the issue without success! My configuration is the following:

OS (Hypervisor + VM): CentOS 6.6 (2.6.32-504.1.3.el6.x86_64)
QEMU: qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64
Ceph: ceph version 0.80.9 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047), 20x4TB OSDs equally distributed on two disk nodes, 3x monitors

OpenStack Cinder has been configured to provide RBD volumes from Ceph. I created 10x 500GB volumes, which were then all attached to a single virtual machine. All volumes were formatted twice for comparison, once using mkfs.xfs and once using mkfs.ext4. I did try to issue the commands all at the same time (or as close to that as possible). In both tests I didn't notice any interruption. It may have taken longer than doing one at a time, but the system was continuously up and everything was responding without problems. During these runs there were 100 open connections with one of the OSD nodes and 111 with the other one. So I guess I am not experiencing the issue due to the low number of OSDs I am having. Is my assumption correct?

Best regards,
George

Thanks a million for the feedback Christian! I've tried to recreate the issue with 10 RBD volumes mounted on a single server without success! I've issued the mkfs.xfs commands simultaneously (or at least as fast as I could in different terminals) without noticing any problems. Can you please tell me what the size of each RBD volume was? I have a feeling that mine were too small, and if so I have to test it on our bigger cluster. I've also thought that besides the QEMU version the underlying OS might also be important, so what was your testbed?

All the best,
George

Hi George,

In order to experience the error it was enough to simply run mkfs.xfs on all the volumes. In the meantime it became clear what the problem was:

~ ; cat /proc/183016/limits
...
Max open files    1024    4096    files
...

This can be changed by setting a decent value in /etc/libvirt/qemu.conf for max_files.

Regards,
Christian

On 27 May 2015, at 16:23, Jens-Christian Fischer jens-christian.fisc...@switch.ch wrote:

George, I will let Christian provide you the details. As far as I know, it was enough to just do an 'ls' on all of the attached drives.

We are using QEMU 2.0:

$ dpkg -l | grep qemu
ii ipxe-qemu 1.0.0+git-2013.c3d1e78-2ubuntu1 all PXE boot firmware - ROM images for qemu
ii qemu-keymaps 2.0.0+dfsg-2ubuntu1.11 all QEMU keyboard maps
ii qemu-system 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries
ii qemu-system-arm 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (arm)
ii qemu-system-common 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (common files)
ii qemu-system-mips 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (mips)
ii qemu-system-misc 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (miscelaneous)
ii qemu-system-ppc 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (ppc)
ii qemu-system-sparc 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (sparc)
ii qemu-system-x86 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (x86)
ii qemu-utils 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU utilities

cheers
jc
--
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch

On 26.05.2015, at 19:12, Georgios Dimitrakakis gior...@acmac.uoc.gr wrote:

Jens-Christian, how did you test that? Did you just try to write to them simultaneously? Any other tests one can perform to verify it? In our installation we have a VM with 30 RBD volumes mounted which are all exported via NFS to other VMs. No one has complained for the moment, but the load/usage is
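A sketch of the fix mentioned above (the limit value is an example; pick one comfortably above volumes times expected fds per volume):

    # /etc/libvirt/qemu.conf
    max_files = 32768

Then restart libvirtd (e.g. 'service libvirt-bin restart' on Ubuntu 14.04) and power-cycle the guests so the new limit applies to the running qemu processes.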
Re: [ceph-users] Discuss: New default recovery config settings
hi, Jan

2015-06-01 15:43 GMT+08:00 Jan Schermer j...@schermer.cz:

We had to disable deep scrub or the cluster would be unusable - we need to turn it back on sooner or later, though. With minimal scrubbing and recovery settings, everything is mostly good. It turned out many issues we had were due to too few PGs - once we increased them from 4K to 16K everything sped up nicely (because the chunks are smaller), but during heavy activity we are still getting some "slow IOs".

How many PGs do you set? We get slow requests many times, but didn't relate them to the PG number. We follow the equation below for every pool:

Total PGs = (OSDs * 100) / pool size

Our cluster has 157 OSDs and 3 pools, and we set pg_num to 8192 for every pool, but OSD CPU utilization goes up to 300% after a restart - we think it's loading PGs during that period. We will try a different PG number when we next get slow requests.

thanks!

I believe there is an ionice knob in newer versions (we still run Dumpling), and that should do the trick no matter how much additional "load" is put on the OSDs. Everybody's bottleneck will be different - we run all flash, so disk IO is not a problem but an OSD daemon is - no ionice setting will help with that, it just needs to be faster ;-)

Jan

On 30 May 2015, at 01:17, Gregory Farnum g...@gregs42.com wrote:
On Fri, May 29, 2015 at 2:47 PM, Samuel Just sj...@redhat.com wrote:

Many people have reported that they need to lower the osd recovery config options to minimize the impact of recovery on client io. We are talking about changing the defaults as follows:

osd_max_backfills to 1 (from 10)
osd_recovery_max_active to 3 (from 15)
osd_recovery_op_priority to 1 (from 10)
osd_recovery_max_single_start to 1 (from 5)

I'm under the (possibly erroneous) impression that reducing the number of max backfills doesn't actually reduce recovery speed much (but will reduce memory use), but that dropping the op priority can. I'd rather we make users manually adjust values which can have a material impact on their data safety, even if most of them choose to do so. After all, even under our worst behavior we're still doing a lot better than a resilvering RAID array. ;)
-Greg

--
thanks
huangjun
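For anyone wanting to try the proposed values ahead of a release, a sketch - ceph.conf for daemons started later, injectargs for already-running ones:

    [osd]
    osd max backfills = 1
    osd recovery max active = 3
    osd recovery op priority = 1
    osd recovery max single start = 1

    # apply to running OSDs without a restart
    ceph tell osd.\* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 3'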
[ceph-users] Hardware cache settings recomendation
Hi,

Please help me with hardware cache settings on the controllers for best Ceph RBD performance. All Ceph hosts have one SSD drive for the journal. We are using 4 different controllers, all with BBU:

* HP Smart Array P400
* HP Smart Array P410i
* Dell PERC 6/i
* Dell PERC H700

I have to set the cache policy; on Dell the settings are:

* Read Policy
  o Read-Ahead (current)
  o No-Read-Ahead
  o Adaptive Read-Ahead
* Write Policy
  o Write-Back (current)
  o Write-Through
* Cache Policy
  o Cache I/O
  o Direct I/O (current)
* Disk Cache Policy
  o Default (current)
  o Enabled
  o Disabled

On HP controllers:

* Cache Ratio (current: 25% Read / 75% Write)
* Drive Write Cache
  o Enabled (current)
  o Disabled

And there is one more setting in the LogicalDrive option:

* Caching:
  o Enabled (current)
  o Disabled

Please verify my settings and give me some recommendations.

Best regards,
Mateusz
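For the PERC (LSI-based) controllers these settings can also be inspected and changed from the OS. A sketch using MegaCli - verify the exact syntax against your MegaCli version's help output, as option spellings vary between releases:

    # show the current cache policy of all logical drives
    MegaCli64 -LDGetProp -Cache -LAll -aAll

    # example changes: write-back, no read-ahead, direct IO
    MegaCli64 -LDSetProp WB -LAll -aAll
    MegaCli64 -LDSetProp NORA -LAll -aAll
    MegaCli64 -LDSetProp Direct -LAll -aAll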
[ceph-users] ceph mount error
Hi,

My ceph health is OK, and now I want to build a filesystem, referring to the CEPH FS QUICK START guide: http://ceph.com/docs/master/start/quick-cephfs/

However, I got an error when I used the command mount -t ceph 192.168.1.105:6789:/ /mnt/mycephfs :

mount error 22 = Invalid argument

I checked the manual and still don't know how to solve it. I am looking forward to your reply!
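Two common causes of "mount error 22" worth checking, sketched below (client.admin and default paths assumed): no MDS is running, or the kernel client was given no cephx credentials:

    # is there an active MDS?
    ceph mds stat

    # with cephx enabled the mount needs a name and secret
    mount -t ceph 192.168.1.105:6789:/ /mnt/mycephfs \
        -o name=admin,secret=$(ceph auth get-key client.admin)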
[ceph-users] query on ceph-deploy command
Hi,

I am trying to deploy ceph-hammer on 4 nodes (admin, monitor and 2 OSDs). My servers are behind a proxy server, so whenever I run apt-get update I need to export our proxy settings. When I run the command ceph-deploy install osd1 osd2 mon1, since all three nodes are behind the proxy, the command fails with the message below:

[osd1][DEBUG ] 0 upgraded, 0 newly installed, 0 to remove and 9 not upgraded.
[osd1][INFO ] Running command: sudo wget -O release.asc https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc
[osd1][WARNIN] --2015-05-20 11:07:41-- https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc
[osd1][WARNIN] Resolving ceph.com (ceph.com)... 208.113.241.137, 2607:f298:4:147::b05:fe2a
[osd1][WARNIN] Connecting to ceph.com (ceph.com)|208.113.241.137|:443... failed: Connection timed out.
[osd1][WARNIN] command returned non-zero exit status: 4
[osd1][INFO ] Running command: sudo apt-key add release.asc
[osd1][WARNIN] gpg: no valid OpenPGP data found.
[osd1][ERROR ] RuntimeError: command returned non-zero exit status: 2
[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: apt-key add release.asc

Request your help in solving the above. How do I set a proxy for the user so that I am able to connect to ceph.com to download the file?

Thanks!
--
Best Regards
B.Vivek
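A sketch of one way to handle this (proxy host and port are placeholders): export the proxy in the remote user's environment and let sudo keep it, since ceph-deploy runs wget via sudo:

    # on each node, e.g. in ~/.bashrc or /etc/environment
    export http_proxy=http://proxy.example.com:3128
    export https_proxy=http://proxy.example.com:3128

    # in /etc/sudoers (edit with visudo), so 'sudo wget' sees the variables
    Defaults env_keep += "http_proxy https_proxy"

wget also honours /etc/wgetrc (http_proxy / https_proxy lines), which covers the sudo case as well.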
[ceph-users] Error in sys.exitfunc
OS: CentOS release 6.6 (Final)
kernel: 3.10.77-1.el6.elrepo.x86_64
Installed: ceph-deploy.noarch 0:1.5.23-0
Dependency Installed: python-argparse.noarch 0:1.2.1-2.el6.centos

I installed ceph-deploy following the manual, http://ceph.com/docs/master/start/quick-start-preflight/ . However, whenever I run ceph-deploy, the error "Error in sys.exitfunc:" appears. How can I solve it? I found the same error message on the web, http://www.spinics.net/lists/ceph-devel/msg21388.html , but I cannot find a way to solve the problem. I am looking forward to your reply!

Best wishes!
zhongbo

error message:

[root@node1 ~]# ceph-deploy
usage: ceph-deploy [-h] [-v | -q] [--version] [--username USERNAME]
                   [--overwrite-conf] [--cluster NAME] [--ceph-conf CEPH_CONF]
                   COMMAND ...

Easy Ceph deployment (ceph-deploy v1.5.23 banner)

Full documentation can be found at: http://ceph.com/ceph-deploy/docs

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         be more verbose
  -q, --quiet           be less verbose
  --version             the current installed version of ceph-deploy
  --username USERNAME   the username to connect to the remote host
  --overwrite-conf      overwrite an existing conf file on remote host (if present)
  --cluster NAME        name of the cluster
  --ceph-conf CEPH_CONF use (or reuse) a given ceph.conf file

commands:
  COMMAND     description
  new         Start deploying a new cluster, and write a CLUSTER.conf and keyring for it.
  install     Install Ceph packages on remote hosts.
  rgw         Deploy ceph RGW on remote hosts.
  mds         Deploy ceph MDS on remote hosts.
  mon         Deploy ceph monitor on remote hosts.
  gatherkeys  Gather authentication keys for provisioning new nodes.
  disk        Manage disks on a remote host.
  osd         Prepare a data disk on remote host.
  admin       Push configuration and client.admin key to a remote host.
  config      Push configuration file to a remote host.
  uninstall   Remove Ceph packages from remote hosts.
  purgedata   Purge (delete, destroy, discard, shred) any Ceph data from /var/lib/ceph
  purge       Remove Ceph packages from remote hosts and purge all data.
  forgetkeys  Remove authentication keys from the local directory.
  pkg         Manage packages on remote hosts.
  calamari    Install and configure Calamari nodes

Error in sys.exitfunc:
Re: [ceph-users] [Qemu-devel] rbd cache + libvirt
On Mon, Jun 08, 2015 at 07:49:15PM +0300, Andrey Korolyov wrote: On Mon, Jun 8, 2015 at 6:50 PM, Jason Dillaman dilla...@redhat.com wrote: Hmm ... looking at the latest version of QEMU, it appears that the RBD cache settings are changed prior to reading the configuration file instead of overriding the value after the configuration file has been read [1]. Try specifying the path to a new configuration file via the conf=/path/to/my/new/ceph.conf QEMU parameter where the RBD cache is explicitly disabled [2]. [1] http://git.qemu.org/?p=qemu.git;a=blob;f=block/rbd.c;h=fbe87e035b12aab2e96093922a83a3545738b68f;hb=HEAD#l478 [2] http://ceph.com/docs/master/rbd/qemu-rbd/#usage Actually the mentioned snippet presumes *expected* behavior with cache=xxx driving the overall cache behavior. Probably the pass itself (from cache=none to proper bitmask values in the backend properties) is broken in some way. CCing qemu-devel for possible bug. CCing Josh Durgin and Jeff Cody for block/rbd.c ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
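For reference, a minimal sketch of Jason's workaround, using the rbd: drive syntax from the ceph docs linked above (pool/image name and conf path are placeholders):

# /etc/ceph/ceph-nocache.conf -- referenced only by this guest
[client]
rbd cache = false

# point QEMU at the alternate conf when attaching the image:
qemu-system-x86_64 -m 1024 \
  -drive format=raw,file=rbd:rbd/vm-disk:conf=/etc/ceph/ceph-nocache.conf,cache=none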
Re: [ceph-users] Error in sys.exitfunc
Thank you for your reply. I encountered other problems when I installed Ceph.
#1. When I run the command ceph-deploy new ceph-0, I get the ceph.conf file below. However, there is not any information about osd pool default size or public network.
[root@ceph-2 my-cluster]# more ceph.conf
[global]
auth_service_required = cephx
filestore_xattr_use_omap = true
auth_client_required = cephx
auth_cluster_required = cephx
mon_host = 192.168.72.33
mon_initial_members = ceph-0
fsid = 74d682b5-2bf2-464c-8462-740f96bcc525
#2. I ignored problem #1 and continued to set up the Ceph Storage Cluster, but encountered an error when I ran the command 'ceph-deploy osd activate ceph-2:/mnt/sda'. I am following the manual, http://ceph.com/docs/master/start/quick-ceph-deploy/
error message
[root@ceph-0 my-cluster]# ceph-deploy osd prepare ceph-2:/mnt/sda
[ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf
[ceph_deploy.cli][INFO ] Invoked (1.5.23): /usr/bin/ceph-deploy osd prepare ceph-2:/mnt/sda
[ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks ceph-2:/mnt/sda:
[ceph-2][DEBUG ] connected to host: ceph-2
[ceph-2][DEBUG ] detect platform information from remote host
[ceph-2][DEBUG ] detect machine type
[ceph_deploy.osd][INFO ] Distro info: CentOS 6.5 Final
[ceph_deploy.osd][DEBUG ] Deploying osd to ceph-2
[ceph-2][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[ceph-2][INFO ] Running command: udevadm trigger --subsystem-match=block --action=add
[ceph_deploy.osd][DEBUG ] Preparing host ceph-2 disk /mnt/sda journal None activate False
[ceph-2][INFO ] Running command: ceph-disk -v prepare --fs-type xfs --cluster ceph -- /mnt/sda
[ceph-2][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=fsid
[ceph-2][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mkfs_options_xfs
[ceph-2][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mkfs_options_xfs
[ceph-2][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mount_options_xfs
[ceph-2][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mount_options_xfs
[ceph-2][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=osd_journal_size
[ceph-2][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_cryptsetup_parameters
[ceph-2][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_dmcrypt_key_size
[ceph-2][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_dmcrypt_type
[ceph-2][WARNIN] DEBUG:ceph-disk:Preparing osd data dir /mnt/sda
[ceph-2][INFO ] checking OSD status...
[ceph-2][INFO ] Running command: ceph --cluster=ceph osd stat --format=json
[ceph_deploy.osd][DEBUG ] Host ceph-2 is now ready for osd use.
Error in sys.exitfunc:
[root@ceph-0 my-cluster]# ceph-deploy osd activate ceph-2:/mnt/sda
[ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf
[ceph_deploy.cli][INFO ] Invoked (1.5.23): /usr/bin/ceph-deploy osd activate ceph-2:/mnt/sda
[ceph_deploy.osd][DEBUG ] Activating cluster ceph disks ceph-2:/mnt/sda:
[ceph-2][DEBUG ] connected to host: ceph-2
[ceph-2][DEBUG ] detect platform information from remote host
[ceph-2][DEBUG ] detect machine type
[ceph_deploy.osd][INFO ] Distro info: CentOS 6.5 Final
[ceph_deploy.osd][DEBUG ] activating host ceph-2 disk /mnt/sda
[ceph_deploy.osd][DEBUG ] will use init type: sysvinit
[ceph-2][INFO ] Running command: ceph-disk -v activate --mark-init sysvinit --mount /mnt/sda
[ceph-2][WARNIN] DEBUG:ceph-disk:Cluster uuid is af23707d-325f-4846-bba9-b88ec953be80
[ceph-2][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=fsid
[ceph-2][WARNIN] DEBUG:ceph-disk:Cluster name is ceph
[ceph-2][WARNIN] DEBUG:ceph-disk:OSD uuid is ca9f6649-b4b8-46ce-a860-1d81eed4fd5e
[ceph-2][WARNIN] DEBUG:ceph-disk:Allocating OSD id...
[ceph-2][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd create --concise ca9f6649-b4b8-46ce-a860-1d81eed4fd5e
[ceph-2][WARNIN] 2015-05-14 17:37:10.988914 7f373bd34700 0 librados: client.bootstrap-osd authentication error (1) Operation not permitted
[ceph-2][WARNIN] Error connecting to cluster: PermissionError
[ceph-2][WARNIN] ceph-disk: Error: ceph osd create failed: Command '/usr/bin/ceph' returned non-zero exit status 1:
[ceph-2][ERROR ] RuntimeError: command returned non-zero exit status: 1
[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: ceph-disk -v activate --mark-init sysvinit --mount /mnt/sda
Error in
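Two observations on the above, sketched rather than definitive. For #1, ceph-deploy only writes a minimal conf; the quick-start has you add the pool-size and network settings by hand (the subnet below is a guess based on the mon_host shown above):

[global]
osd pool default size = 2
public network = 192.168.72.0/24

For #2, the activate step fails on cephx: client.bootstrap-osd on ceph-2 is not accepted by the monitor, which usually means the bootstrap keyring was never gathered or is stale. One way to refresh and verify it:

ceph-deploy gatherkeys ceph-0
# then compare what the cluster expects with what the OSD host has:
ceph auth get client.bootstrap-osd
cat /var/lib/ceph/bootstrap-osd/ceph.keyring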
[ceph-users] Is Ceph right for me?
Hello, Could somebody please advise me whether Ceph is suitable for our use? We are looking for a file system which is able to work over different locations which are connected by VPN. If one location were to go offline then the filesystem would stay online at both sites, and once the connection is regained the latest file version would take priority. The main use will be for website files, so the changes are most likely to be any uploaded files and cache files, as a lot of the data will be stored in a SQL database which is already replicated. With Kind Regards, Trevor Robinson CTO at Key4ce [Key4ce - IT Professionals] https://key4ce.com/ Skype: KeyMalus.Trev xmpp: t.robin...@im4ce.com Livechat: http://livechat.key4ce.com/ NL: +31 (0)40 290 3310 UK: +44 (0)1332 898 999 CN: +86 (0)7552 824 5985 The information contained in this message may be confidential and legally protected under applicable law. The message is intended solely for the addressee(s). If you are not the intended recipient, you are hereby notified that any use, forwarding, dissemination, or reproduction of this message is strictly prohibited and may be unlawful. If you are not the intended recipient, please contact the sender by return e-mail and destroy all copies of the original message. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] TR: High apply latency on OSD causes poor performance on VM
Hi, Could you take a look at my problem? It's about high latency on my OSDs on HP G8 servers (ceph01, ceph02 and ceph03). When I run a rados bench for 60 sec, the results are surprising: after a few seconds there is no traffic, then it resumes, and so on. Finally, the maximum latency is high and the VMs' disks freeze a lot.
#rados bench -p pool-test-g8 60 write
Maintaining 16 concurrent writes of 4194304 bytes for up to 60 seconds or 0 objects
Object prefix: benchmark_data_ceph02_56745
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 16 82 66 263.959 264 0.0549584 0.171148
2 16 134 118 235.97 208 0.344873 0.232103
3 16 189 173 230.639 220 0.015583 0.24581
4 16 248 232 231.973 236 0.0704699 0.252504
5 16 306 290 231.974 232 0.0229872 0.258343
6 16 371 355 236.64 260 0.27183 0.255469
7 16 419 403 230.26 192 0.0503492 0.263304
8 16 460 444 221.975 164 0.0157241 0.261779
9 16 506 490 217.754 184 0.199418 0.271501
10 16 518 502 200.778 48 0.0472324 0.269049
11 16 518 502 182.526 0 - 0.269049
12 16 556 540 179.981 76 0.100336 0.301616
13 16 607 591 181.827 204 0.173912 0.346105
14 16 655 639 182.552 192 0.0484904 0.339879
15 16 683 667 177.848 112 0.0504184 0.349929
16 16 746 730 182.481 252 0.276635 0.347231
17 16 807 791 186.098 244 0.391491 0.339275
18 16 845 829 184.203 152 0.188608 0.342021
19 16 850 834 175.561 20 0.960175 0.342717
2015-05-28 17:09:48.397376 min lat: 0.013532 max lat: 6.28387 avg lat: 0.346987
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
20 16 859 843 168.582 36 0.0182246 0.346987
21 16 863 847 161.316 16 3.18544 0.355051
22 16 897 881 160.165 136 0.0811037 0.371209
23 16 901 885 153.897 16 0.0482124 0.370793
24 16 943 927 154.484 168 0.63064 0.397204
25 15 997 982 157.104 220 0.0933448 0.392701
26 16 1058 1042 160.291 240 0.166463 0.385943
27 16 1088 1072 158.798 120 1.63882 0.388568
28 16 1125 1109 158.412 148 0.0511479 0.38419
29 16 1155 1139 157.087 120 0.162266 0.385898
30 16 1163 1147 152.917 32 0.0682181 0.383571
31 16 1190 1174 151.468 108 0.0489185 0.386665
32 16 1196 1180 147.485 24 2.95263 0.390657
33 16 1213 1197 145.076 68 0.0467788 0.389299
34 16 1265 1249 146.926 208 0.0153085 0.420687
35 16 1332 1316 150.384 268 0.0157061 0.42259
36 16 1374 1358 150.873 168 0.251626 0.417373
37 16 1402 1386 149.822 112 0.0475302 0.413886
38 16 1444 1428 150.3 168 0.0507577 0.421055
39 16 1500 1484 152.189 224 0.0489163 0.416872
2015-05-28 17:10:08.399434 min lat: 0.013532 max lat: 9.26596 avg lat: 0.415296
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
40 16 1530 1514 151.384 120 0.951713 0.415296
41 16 1551 1535 149.741 84 0.0686787 0.416571
42 16 1606 1590 151.413 220 0.0826855 0.41684
43 16 1656 1640 152.542 200 0.0706539 0.409974
44 16 1663 1647 149.712 28 0.046672 0.408476
45 16 1685 1669 148.34 88 0.0989566 0.424918
46 16 1707 1691 147.028 88 0.0490569 0.421116
47 16 1707 1691 143.9 0 - 0.421116
48 16 1707 1691 140.902 0 - 0.421116
49 16 1720 1704 139.088 17. 0.0480335 0.428997
50 16 1752 1736 138.866 128 0.053219 0.4416
51 16 1786 1770 138.809 136 0.602946 0.440357
52 16 1810 1794 137.986 96 0.0472518 0.438376
53 16 1831 1815 136.967 84 0.0148999 0.446801
54 16 1831 1815 134.43 0 - 0.446801
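When chasing this kind of stall it helps to see which OSDs carry the latency; a quick sketch using standard commands (the osd id is a placeholder, and the daemon command has to run on the host that carries that OSD):

# per-OSD commit/apply latency as the cluster sees it
ceph osd perf
# detailed counters from one suspect OSD's admin socket
ceph daemon osd.3 perf dump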
Re: [ceph-users] NFS interaction with RBD
Hi George In order to experience the error it was enough to simply run mkfs.xfs on all the volumes. In the meantime it became clear what the problem was:
~ ; cat /proc/183016/limits
...
Max open files            1024                 4096                 files
...
This can be changed by setting a decent value in /etc/libvirt/qemu.conf for max_files. Regards Christian
On 27 May 2015, at 16:23, Jens-Christian Fischer jens-christian.fisc...@switch.ch wrote: George, I will let Christian provide you the details. As far as I know, it was enough to just do a 'ls' on all of the attached drives. We are using Qemu 2.0:
$ dpkg -l | grep qemu
ii ipxe-qemu 1.0.0+git-2013.c3d1e78-2ubuntu1 all PXE boot firmware - ROM images for qemu
ii qemu-keymaps 2.0.0+dfsg-2ubuntu1.11 all QEMU keyboard maps
ii qemu-system 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries
ii qemu-system-arm 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (arm)
ii qemu-system-common 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (common files)
ii qemu-system-mips 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (mips)
ii qemu-system-misc 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (miscelaneous)
ii qemu-system-ppc 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (ppc)
ii qemu-system-sparc 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (sparc)
ii qemu-system-x86 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU full system emulation binaries (x86)
ii qemu-utils 2.0.0+dfsg-2ubuntu1.11 amd64 QEMU utilities
cheers jc -- SWITCH Jens-Christian Fischer, Peta Solutions Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland phone +41 44 268 15 15, direct +41 44 268 15 71 jens-christian.fisc...@switch.ch http://www.switch.ch http://www.switch.ch/stories
On 26.05.2015, at 19:12, Georgios Dimitrakakis gior...@acmac.uoc.gr wrote: Jens-Christian, how did you test that? Did you just try to write to them simultaneously? Any other tests that one can perform to verify that? In our installation we have a VM with 30 RBD volumes mounted which are all exported via NFS to other VMs. No one has complained for the moment but the load/usage is very minimal. If this problem really exists then very soon, once the trial phase is over, we will have millions of complaints :-( What version of QEMU are you using? We are using the one provided by Ceph in qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64.rpm Best regards, George
I think we (i.e. Christian) found the problem: We created a test VM with 9 mounted RBD volumes (no NFS server). As soon as he hit all disks, we started to experience these 120 second timeouts. We realized that the QEMU process on the hypervisor is opening a TCP connection to every OSD for every mounted volume - exceeding the 1024 FD limit. So no deep scrubbing etc, but simply too many connections…
cheers jc -- SWITCH Jens-Christian Fischer, Peta Solutions Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland phone +41 44 268 15 15, direct +41 44 268 15 71 jens-christian.fisc...@switch.ch [3] http://www.switch.ch http://www.switch.ch/stories
On 25.05.2015, at 06:02, Christian Balzer wrote: Hello, lets compare your case with John-Paul's. Different OS and Ceph versions (thus we can assume different NFS versions as well). The only common thing is that both of you added OSDs and are likely suffering from delays stemming from Ceph re-balancing or deep-scrubbing.
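A minimal sketch of that libvirt change (the value is an example; size it at roughly one fd per OSD per attached volume, plus headroom):

# /etc/libvirt/qemu.conf
max_files = 32768

# restart libvirtd (Ubuntu 14.04 service name shown), then power-cycle the guests
service libvirt-bin restart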
Ceph logs will only pipe up when things have been blocked for more than 30 seconds; NFS might take offense at lower values (or the accumulation of several distributed delays). You added 23 OSDs; tell us more about your cluster, HW, network. Were these added to the existing 16 nodes, or are these on new storage nodes (so could there be something different with those nodes?), and how busy are your network and CPU? Running something like collectd to gather all ceph perf data and other data from the storage nodes and then feeding it to graphite (or similar) can be VERY helpful to identify if something is going wrong and what it is in particular. Otherwise run atop on your storage nodes to identify if CPU, network, or specific HDDs/OSDs are bottlenecks. Deep scrubbing can be _very_ taxing; do your problems persist if you inject into your running cluster an osd_scrub_sleep value of 0.5 (lower
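For reference, a sketch of that runtime injection (same value as suggested above; it takes effect immediately and does not persist across OSD restarts):

ceph tell osd.* injectargs '--osd_scrub_sleep 0.5'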
Re: [ceph-users] xfs corruption, data disaster!
On 5/11/15 9:47 AM, Ric Wheeler wrote: On 05/05/2015 04:13 AM, Yujian Peng wrote: Emmanuel Florac eflorac@... writes: On Mon, 4 May 2015 07:00:32 + (UTC) Yujian Peng pengyujian5201314 at 126.com wrote: I'm encountering a data disaster. I have a ceph cluster with 145 osd. The data center had a power problem yesterday, and all of the ceph nodes were down. But now I find that 6 disks (xfs) in 4 nodes have data corruption. Some disks are unable to mount, and some disks have IO errors in syslog.
mount: Structure needs cleaning
xfs_log_force: error 5 returned
I tried to repair one with xfs_repair -L /dev/sdx1, but the ceph-osd reported a leveldb error: Error initializing leveldb: Corruption: checksum mismatch I cannot start the 6 osds and 22 pgs are down. This is really a tragedy for me. Can you give me some idea to recover the xfs? Thanks very much! For XFS problems, ask the XFS ML: xfs at oss.sgi.com You didn't give enough details, by far. What version of kernel and distro are you running? If there were errors, please post extensive logs. If you have IO errors on some disks, you probably MUST replace them before going any further. Why did you run xfs_repair -L? Did you try xfs_repair without options first? Were you running the very very latest version of xfs_repair (3.2.2)? The OS is ubuntu 12.04.5 with kernel 3.13.0
uname -a
Linux ceph19 3.13.0-32-generic #57~precise1-Ubuntu SMP Tue Jul 15 03:51:20 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
cat /etc/issue
Ubuntu 12.04.5 LTS \n \l
xfs_repair -V
xfs_repair version 3.1.7
I've tried xfs_repair without options, but it showed me some errors, so I used the -L option. Thanks for your reply! Responding quickly to a couple of things: * xfs_repair -L wipes out the XFS log, not normally a good thing to do And if required due to an unreplayable log, often indicates some problem with the storage system. For example a volatile write cache not synced as needed, and lost along with a power loss, leading to a corrupted and unreplayable XFS log. * replacing disks with IO errors is not a great idea if you still need that data. You might want to copy the data from that disk to a new disk (same or greater size) and then try to repair that new disk. A lot depends on the type of IO error you see - you might have cable issues, HBA issues, or fairly normal read issues (which are not worth replacing a disk for). Just a note that XFS sometimes starts saying IO error when the filesystem has shut down; this isn't the same as a block-device-level IO error, but you haven't posted logs or anything, so I'm just guessing here. http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F -Eric You should work with your vendor's support team if you have a support contract or post to the XFS devel list (copied above) for help. Good luck! Ric ___ xfs mailing list x...@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
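A sketch of the copy-before-repair approach Ric describes, assuming GNU ddrescue and a spare disk at least as large (device names are placeholders):

# clone the failing disk, keeping a map file so the copy can resume after errors
ddrescue -f /dev/sdx /dev/sdy /root/sdx.map
# run the repair against the clone, never the original
xfs_repair /dev/sdy1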
[ceph-users] umount stuck on NFS gateways switch over by using Pacemaker
Hello, I have been testing NFS over RBD recently. I am trying to build the NFS HA environment under Ubuntu 14.04 for testing, and the package version information is as follows:
- Ubuntu 14.04 : 3.13.0-32-generic (Ubuntu 14.04.2 LTS)
- ceph : 0.80.9-0ubuntu0.14.04.2
- ceph-common : 0.80.9-0ubuntu0.14.04.2
- pacemaker (git20130802-1ubuntu2.3)
- corosync (2.3.3-1ubuntu1)
PS: I also tried ceph/ceph-common (0.87.1-1trusty and 0.87.2-1trusty) on a 3.13.0-48-generic (Ubuntu 14.04.2) server and got the same situations.
The environment has 5 nodes in the Ceph cluster (3 MONs and 5 OSDs) and two NFS gateways (nfs1 and nfs2) for high availability. I issued the command 'sudo service pacemaker stop' on 'nfs1' to force these resources to stop and be transferred to 'nfs2', and vice versa. When the two nodes are up and I issue 'sudo service pacemaker stop' on one node, the other node will take over all resources. Everything looks fine. Then I wait about 30 minutes, doing nothing to the NFS gateways, and repeat the previous steps to test the fail over procedure. I found the process state of 'umount' is 'D' (uninterruptible sleep); 'ps' showed the following result:
root 21047 0.0 0.0 17412 952 ? D 16:39 0:00 umount /mnt/block1
Have any idea to solve or work around this? Because 'umount' is stuck, neither the 'reboot' nor the 'shutdown' command works well, so unless I wait 20 minutes for the 'umount' to time out, the only thing I can do is power off the server directly. Any help would be much appreciated. I attached my configurations and loggings as follows.
Pacemaker configurations:
crm configure primitive p_rbd_map_1 ocf:ceph:rbd.in \
  params user=admin pool=block_data name=data01 cephconf=/etc/ceph/ceph.conf \
  op monitor interval=10s timeout=20s
crm configure primitive p_fs_rbd_1 ocf:heartbeat:Filesystem \
  params directory=/mnt/block1 fstype=xfs device=/dev/rbd1 \
  fast_stop=no options=noatime,nodiratime,nobarrier,inode64 \
  op monitor interval=20s timeout=40s \
  op start interval=0 timeout=60s \
  op stop interval=0 timeout=60s
crm configure primitive p_export_rbd_1 ocf:heartbeat:exportfs \
  params directory=/mnt/block1 clientspec=10.35.64.0/24 options=rw,async,no_subtree_check,no_root_squash fsid=1 \
  op monitor interval=10s timeout=20s \
  op start interval=0 timeout=40s
crm configure primitive p_vip_1 ocf:heartbeat:IPaddr2 \
  params ip=10.35.64.90 cidr_netmask=24 \
  op monitor interval=5
crm configure primitive p_nfs_server lsb:nfs-kernel-server \
  op monitor interval=10s timeout=30s
crm configure primitive p_rpcbind upstart:rpcbind \
  op monitor interval=10s timeout=30s
crm configure group g_rbd_share_1 p_rbd_map_1 p_fs_rbd_1 p_export_rbd_1 p_vip_1 \
  meta target-role=Started
crm configure group g_nfs p_rpcbind p_nfs_server \
  meta target-role=Started
crm configure clone clo_nfs g_nfs \
  meta globally-unique=false target-role=Started
'crm_mon' status results for normal condition:
Online: [ nfs1 nfs2 ]
Resource Group: g_rbd_share_1
  p_rbd_map_1 (ocf::ceph:rbd.in): Started nfs1
  p_fs_rbd_1 (ocf::heartbeat:Filesystem): Started nfs1
  p_export_rbd_1 (ocf::heartbeat:exportfs): Started nfs1
  p_vip_1 (ocf::heartbeat:IPaddr2): Started nfs1
Clone Set: clo_nfs [g_nfs]
  Started: [ nfs1 nfs2 ]
'crm_mon' status results for fail over condition:
Online: [ nfs1 nfs2 ]
Resource Group: g_rbd_share_1
  p_rbd_map_1 (ocf::ceph:rbd.in): Started nfs1
  p_fs_rbd_1 (ocf::heartbeat:Filesystem): Started nfs1 (unmanaged) FAILED
  p_export_rbd_1 (ocf::heartbeat:exportfs): Stopped
  p_vip_1 (ocf::heartbeat:IPaddr2): Stopped
Clone Set: clo_nfs [g_nfs]
  Started: [ nfs2 ]
  Stopped: [ nfs1 ]
Failed
actions: p_fs_rbd_1_stop_0 (node=nfs1, call=114, rc=1, status=Timed Out, last-rc-change=Wed May 13 16:39:10 2015, queued=60002ms, exec=1ms): unknown error
'dmesg' messages:
[ 9470.284509] nfsd: last server has exited, flushing export cache
[ 9470.322893] init: rpcbind main process (4267) terminated with status 2
[ 9600.520281] INFO: task umount:2675 blocked for more than 120 seconds.
[ 9600.520445] Not tainted 3.13.0-32-generic #57-Ubuntu
[ 9600.520570] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
[ 9600.520792] umount D 88003fc13480 0 2675 1 0x
[ 9600.520800] 88003a4f9dc0 0082 880039ece000 88003a4f9fd8
[ 9600.520805] 00013480 00013480 880039ece000 880039ece000
[ 9600.520809] 88003fc141a0 0001 88003a377928
[ 9600.520814] Call Trace:
[ 9600.520830] [817251a9] schedule+0x29/0x70
[ 9600.520882] [a043b300] _xfs_log_force+0x220/0x280 [xfs]
[ 9600.520891] [8109a9b0] ? wake_up_state+0x20/0x20
[ 9600.520922] [a043b386] xfs_log_force+0x26/0x80 [xfs]
[ 9600.520947] [a03f3b6d] xfs_fs_sync_fs+0x2d/0x50 [xfs]
[ 9600.520954]
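Not from the thread, but one knob worth a try: recent resource-agents versions give ocf:heartbeat:Filesystem a force_unmount parameter ('safe' kills the processes still holding the mount before unmounting), and the stop timeout can be raised above the XFS log-force stall seen in the trace. A sketch, assuming a resource-agents version that supports it:

crm configure edit p_fs_rbd_1
# in the editor, extend the resource, e.g.:
#   params ... force_unmount=safe
#   op stop interval=0 timeout=180s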
[ceph-users] radosgw backup
Hi everyone. I'm wondering - is there a way to back up radosgw data? What I already tried: I created a backup pool and copied .rgw.buckets to it. Then I deleted an object via an S3 client, and copied the data from the backup pool back to .rgw.buckets. I still can't see the object in the S3 client, but I can get it via HTTP by its previously known URL. Questions: where does radosgw store info about objects (i.e. how do I make a restored object visible to the S3 client)? What is the best way to back up radosgw data? Thanks for any advice. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
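For reference, a sketch of the pool-level copy described above, and why it behaves this way: rados cppool copies object data only, while the bucket listing lives in separate index objects (in .rgw.buckets.index on recent releases), so an object restored behind radosgw's back stays out of the S3 listing even though its URL still works.

# whole-pool copy into a pre-created backup pool (names are placeholders)
ceph osd pool create .rgw.buckets.backup 128
rados cppool .rgw.buckets .rgw.buckets.backup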
Re: [ceph-users] Hardware cache settings recomendation
You want write cache to disk, no write cache for SSD. I assume all of your data disks are single-drive RAID 0? Tyler Bishop Chief Executive Officer 513-299-7108 x10 tyler.bis...@beyondhosting.net If you are not the intended recipient of this transmission you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited.
From: Mateusz Skała mateusz.sk...@budikom.net To: ceph-users@lists.ceph.com Sent: Saturday, June 6, 2015 4:09:59 AM Subject: [ceph-users] Hardware cache settings recomendation
Hi, Please help me with hardware cache settings on controllers for the best Ceph RBD performance. All Ceph hosts have one SSD drive for the journal. We are using 4 different controllers, all with BBU:
· HP Smart Array P400
· HP Smart Array P410i
· Dell PERC 6/i
· Dell PERC H700
I have to set the cache policy; on Dell the settings are:
· Read Policy: Read-Ahead (current), No-Read-Ahead, Adaptive Read-Ahead
· Write Policy: Write-Back (current), Write-Through
· Cache Policy: Cache I/O, Direct I/O (current)
· Disk Cache Policy: Default (current), Enabled, Disabled
On HP controllers:
· Cache Ratio (current: 25% Read / 75% Write)
· Drive Write Cache: Enabled (current), Disabled
And there is one more setting in the LogicalDrive options:
· Caching: Enabled (current), Disabled
Please verify my settings and give me some recommendations. Best regards, Mateusz ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
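A sketch of how that advice maps onto the Dell PERCs with MegaCli (logical-drive and adapter numbers are placeholders; hpssacli/hpacucli covers the HP equivalents):

# HDD data LDs: BBU-protected controller write-back on, the drive's own cache off
MegaCli -LDSetProp WB -L0 -a0
MegaCli -LDSetProp -DisDskCache -L0 -a0
# SSD journal LD: write-through, keeping the controller cache out of the data path
MegaCli -LDSetProp WT -L1 -a0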
[ceph-users] Nginx access ceph
Hi, I am trying to set up nginx to access html files in ceph buckets. I have set up https://github.com/anomalizer/ngx_aws_auth . Below is the nginx config. When I try to access http://hostname:8080/test/b.html it shows a signature mismatch; http://hostname:8080/b.html also shows a signature mismatch. I could see the request passed from nginx to ceph in the ceph logs.
server {
    listen 8080;
    server_name localhost;
    location / {
        proxy_pass http://10.84.182.80:8080;
        aws_access_key GMO31LL1LECV1RH4T71K;
        aws_secret_key aXEf9e1Aq85VTz7Q5tkXeq4qZaEtnYP04vSTIFBB;
        s3_bucket test;
        set $url_full '$1';
        chop_prefix /test;
        proxy_set_header Authorization $s3_auth_token;
        proxy_set_header x-amz-date $aws_date;
    }
}
I have set the ceph bucket as public (not private). Request to kindly help. Regards, Ram ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] v0.94.2 Hammer released
This Hammer point release fixes a few critical bugs in RGW that can prevent objects starting with underscore from behaving properly and that prevent garbage collection of deleted objects when using the Civetweb standalone mode. All v0.94.x Hammer users are strongly encouraged to upgrade, and to make note of the repair procedure below if RGW is in use.
Upgrading from previous Hammer release
--------------------------------------
Bug #11442 introduced a change that made rgw objects that start with underscore incompatible with previous versions. The fix to that bug reverts to the previous behavior. In order to be able to access objects that start with an underscore and were created in prior Hammer releases, following the upgrade it is required to run (for each affected bucket)::
$ radosgw-admin bucket check --check-head-obj-locator \
    --bucket=bucket [--fix]
You can get a list of buckets with
$ radosgw-admin bucket list
Notable changes
---------------
* build: compilation error: No high-precision counter available (armhf, powerpc..) (#11432, James Page)
* ceph-dencoder links to libtcmalloc, and shouldn't (#10691, Boris Ranto)
* ceph-disk: disk zap sgdisk invocation (#11143, Owen Synge)
* ceph-disk: use a new disk as journal disk, ceph-disk prepare fail (#10983, Loic Dachary)
* ceph-objectstore-tool should be in the ceph server package (#11376, Ken Dreyer)
* librados: can get stuck in redirect loop if osdmap epoch == last_force_op_resend (#11026, Jianpeng Ma)
* librbd: A retransmit of proxied flatten request can result in -EINVAL (Jason Dillaman)
* librbd: ImageWatcher should cancel in-flight ops on watch error (#11363, Jason Dillaman)
* librbd: Objectcacher setting max object counts too low (#7385, Jason Dillaman)
* librbd: Periodic failure of TestLibRBD.DiffIterateStress (#11369, Jason Dillaman)
* librbd: Queued AIO reference counters not properly updated (#11478, Jason Dillaman)
* librbd: deadlock in image refresh (#5488, Jason Dillaman)
* librbd: notification race condition on snap_create (#11342, Jason Dillaman)
* mds: Hammer uclient checking (#11510, John Spray)
* mds: remove caps from revoking list when caps are voluntarily released (#11482, Yan, Zheng)
* messenger: double clear of pipe in reaper (#11381, Haomai Wang)
* mon: Total size of OSDs is a maginitude less than it is supposed to be. (#11534, Zhe Zhang)
* osd: don't check order in finish_proxy_read (#11211, Zhiqiang Wang)
* osd: handle old semi-deleted pgs after upgrade (#11429, Samuel Just)
* osd: object creation by write cannot use an offset on an erasure coded pool (#11507, Jianpeng Ma)
* rgw: Improve rgw HEAD request by avoiding read the body of the first chunk (#11001, Guang Yang)
* rgw: civetweb is hitting a limit (number of threads 1024) (#10243, Yehuda Sadeh)
* rgw: civetweb should use unique request id (#10295, Orit Wasserman)
* rgw: critical fixes for hammer (#11447, #11442, Yehuda Sadeh)
* rgw: fix swift COPY headers (#10662, #10663, #11087, #10645, Radoslaw Zarzynski)
* rgw: improve performance for large object (multiple chunks) GET (#11322, Guang Yang)
* rgw: init-radosgw: run RGW as root (#11453, Ken Dreyer)
* rgw: keystone token cache does not work correctly (#11125, Yehuda Sadeh)
* rgw: make quota/gc thread configurable for starting (#11047, Guang Yang)
* rgw: make swift responses of RGW return last-modified, content-length, x-trans-id headers (#10650, Radoslaw Zarzynski)
* rgw: merge manifests correctly when there's prefix override (#11622, Yehuda Sadeh)
* rgw: quota not respected in POST object (#11323, Sergey Arkhipov)
* rgw: restore buffer of multipart upload after EEXIST (#11604, Yehuda Sadeh)
* rgw: shouldn't need to disable rgw_socket_path if frontend is configured (#11160, Yehuda Sadeh)
* rgw: swift: Response header of GET request for container does not contain X-Container-Object-Count, X-Container-Bytes-Used and x-trans-id headers (#10666, Dmytro Iurchenko)
* rgw: swift: Response header of POST request for object does not contain content-length and x-trans-id headers (#10661, Radoslaw Zarzynski)
* rgw: swift: response for GET/HEAD on container does not contain the X-Timestamp header (#10938, Radoslaw Zarzynski)
* rgw: swift: response for PUT on /container does not contain the mandatory Content-Length header when FCGI is used (#11036, #10971, Radoslaw Zarzynski)
* rgw: swift: wrong handling of empty metadata on Swift container (#11088, Radoslaw Zarzynski)
* tests: TestFlatIndex.cc races with TestLFNIndex.cc (#11217, Xinze Chi)
* tests: ceph-helpers kill_daemons fails when kill fails (#11398, Loic Dachary)
For more detailed information, see the complete changelog at http://docs.ceph.com/docs/master/_downloads/v0.94.2.txt
Getting Ceph
------------
* Git at git://github.com/ceph/ceph.git
* Tarball at http://ceph.com/download/ceph-0.94.2.tar.gz
* For packages, see
Re: [ceph-users] ceph mount error
You probably didn't turn on an MDS, as that isn't set up by default anymore. I believe the docs tell you how to do that somewhere else. If that's not it, please provide the output of ceph -s. -Greg On Sun, Jun 7, 2015 at 8:14 AM, 张忠波 zhangzhongbo2...@163.com wrote: Hi , My ceph health is OK , And now , I want to build a Filesystem , refer to the CEPH FS QUICK START guide . http://ceph.com/docs/master/start/quick-cephfs/ however , I got a error when i use the command , mount -t ceph 192.168.1.105:6789:/ /mnt/mycephfs . error : mount error 22 = Invalid argument I refer to munual , and now , I don't know how to solve it . I am looking forward to your reply ! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] radosgw backup
You may be able to use replication. Here is a site showing a good example of how to set it up. I have not tested replicating within the same datacenter, but you should just be able to define a new zone within your existing ceph cluster and replicate to it. http://cephnotes.ksperis.com/blog/2015/03/13/radosgw-simple-replication-example From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Konstantin Ivanov Sent: Thursday, May 28, 2015 1:44 AM To: ceph-users@lists.ceph.com Subject: [ceph-users] radosgw backup Hi everyone. I'm wondering - is there way to backup radosgw data? What i already tried. create backup pool - copy .rgw.buckets to backup pool. Then i delete object via s3 client. And then i copy data from backup pool to .rgw.buckets. I still can't see object in s3 client, but can get it via http by early known url. Questions: where radosgw stores info about objects - (how to make restored object visible from s3 client)? is there best way for backup data for radosgw? Thanks for any advises. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph mount error
1) set up the MDS server
ceph-deploy mds --overwrite-conf create <hostname of mds server>
2) create the filesystem
ceph osd pool create cephfs_data 128
ceph osd pool create cephfs_metadata 16
ceph fs new cephfs cephfs_metadata cephfs_data
ceph fs ls
ceph mds stat
3) mount it!
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 张忠波 Sent: Sunday, June 07, 2015 8:15 AM To: ceph-us...@ceph.com; community Cc: xuzh@gmail.com Subject: [ceph-users] ceph mount error
Hi, My ceph health is OK, and now I want to build a Filesystem, referring to the CEPH FS QUICK START guide, http://ceph.com/docs/master/start/quick-cephfs/ . However, I got an error when I used the command mount -t ceph 192.168.1.105:6789:/ /mnt/mycephfs . error: mount error 22 = Invalid argument I referred to the manual, and I still don't know how to solve it. I am looking forward to your reply!
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
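If cephx is enabled (the default), mount error 22 is commonly the missing secret; a sketch using the monitor address from this thread:

# on a node with admin credentials
ceph auth get-key client.admin > admin.secret
mount -t ceph 192.168.1.105:6789:/ /mnt/mycephfs -o name=admin,secretfile=admin.secret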
Re: [ceph-users] Is Ceph right for me?
You might be able to accomplish that with something like dropbox or owncloud
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Trevor Robinson - Key4ce Sent: Wednesday, May 20, 2015 2:35 PM To: ceph-users@lists.ceph.com Subject: [ceph-users] Is Ceph right for me?
Hello, Could somebody please advise me if Ceph is suitable for our use? We are looking for a file system which is able to work over different locations which are connected by VPN. If one location were to go offline then the filesystem will stay online at both sites and then once connection is regained the latest file version will take priority. The main use will be for website files so the changes are most likely to be any uploaded files and cache files as a lot of the data will be stored in a SQL database which is already replicated.
With Kind Regards, Trevor Robinson CTO at Key4ce [Key4ce - IT Professionals] https://key4ce.com/ Skype: KeyMalus.Trev xmpp: t.robin...@im4ce.com Livechat: http://livechat.key4ce.com/ NL: +31 (0)40 290 3310 UK: +44 (0)1332 898 999 CN: +86 (0)7552 824 5985
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph mount error
Hi, Are you using cephx? If so, does your client have the appropriate key on it? It looks like you have an mds set up and running from your screenshot. Try mounting it like so: mount -t ceph -o name=admin,secret=[your secret] 192.168.1.105:6789:/ /mnt/mycephfs --Lincoln On Jun 7, 2015, at 10:14 AM, 张忠波 wrote: Hi , My ceph health is OK , And now , I want to build a Filesystem , refer to the CEPH FS QUICK START guide . http://ceph.com/docs/master/start/quick-cephfs/ however , I got a error when i use the command , mount -t ceph 192.168.1.105:6789:/ /mnt/mycephfs . error : mount error 22 = Invalid argument I refer to munual , and now , I don't know how to solve it . I am looking forward to your reply ! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Is Ceph right for me?
On 05/20/15 23:34, Trevor Robinson - Key4ce wrote: Hello, Could somebody please advise me if Ceph is suitable for our use? We are looking for a file system which is able to work over different locations which are connected by VPN. If one location were to go offline then the filesystem will stay online at both sites and then once connection is regained the latest file version will take priority. CephFS won't work well (or at all when the connections are lost). The only part of Ceph which would work is RGW replication, but you don't get a filesystem with it, and I'm under the impression that multi-master replication might be tricky (to be confirmed). Coda's goals seem to match your needs. I'm not sure if it's still actively developed (there is a client distributed with the Linux kernel though). http://www.coda.cs.cmu.edu/ Last time I tried it (several years ago) it worked well enough for me. The main use will be for website files so the changes are most likely to be any uploaded files and cache files as a lot of the data will be stored in a SQL database which is already replicated. If your setup is not too complex, you might simply handle this with rsync or unison. Best regards, Lionel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
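For the simple case Lionel mentions, a sketch of the rsync route (host and paths are placeholders; run from cron or a deploy hook):

# push changed website files to the second site over the VPN
rsync -az --delete /var/www/ backup-site:/var/www/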
Re: [ceph-users] Ceph OSD with OCFS2
Hi, The Ceph journal works in a different way. It's a write-ahead journal: all the data will be persisted first in the journal and then written to the actual place. Journal data is encoded. The journal is a fixed-size partition/file and written sequentially. So, if you are placing the journal on HDDs, it will be overwritten; in the SSD case, it will be GC'd later. So, if you are measuring the amount of data written to the device, it will be double. But, if you are saying you have written a 500MB file to the cluster and you are seeing the actual file size is 10G, that should not be the case. How are you seeing this size BTW? Could you please tell us more about your configuration? What is the replication policy you are using? What interface did you use to store the data?
Regarding your other queries:
If I transfer 1GB of data, what will be the server size (OSD)? Will this be written in compressed format?
No, the actual data is not compressed. You don't want to fill up the OSD disk, and there are some limits you can set. Check the following link: http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/ It will stop working if the disk is 95% full by default.
Is it possible to take a backup of the server's compressed data and copy it to another machine as Server_Backup - then start a new client using Server_Backup?
For backup, check the following link if that works for you: https://ceph.com/community/blog/tag/backup/ Also, you can use an RGW federated config for backup.
Data removal is very slow.
How are you removing data? Are you removing an rbd image? If you are removing an entire pool, that should be fast; it deletes the data asynchronously, I guess.
Thanks Regards Somnath
From: gjprabu [mailto:gjpr...@zohocorp.com] Sent: Thursday, June 11, 2015 6:38 AM To: Somnath Roy Cc: ceph-users@lists.ceph.com; Kamala Subramani; Siva Sokkumuthu Subject: Re: RE: [ceph-users] Ceph OSD with OCFS2
Hi Team, Once the data transfer is completed the journal file should convert all in-memory data to its real place, but in our case it is showing double the size after the transfer completes, so everyone here is confused about the real file and folder size. Also, what will happen if I move the monitor from that OSD server to a separate one - might that solve the double size issue? We have the queries below as well.
1. An extra 2-3 mins is taken for hg / git repository operations like clone, pull, checkout and update.
2. If I transfer 1GB of data, what will be the server size (OSD)? Will this be written in compressed format?
3. Is it possible to take a backup of the server's compressed data and copy it to another machine as Server_Backup - then start a new client using Server_Backup?
4. Data removal is very slow.
Regards Prabu
On Fri, 05 Jun 2015 21:55:28 +0530 Somnath Roy somnath@sandisk.com wrote: Yes, Ceph will be writing twice, one for the journal and one for the actual data. Considering you configured the journal on the same device, this is what you end up seeing if you are monitoring the device BW. Thanks Regards Somnath
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of gjprabu Sent: Friday, June 05, 2015 3:07 AM To: ceph-users@lists.ceph.com Cc: Kamala Subramani; Siva Sokkumuthu Subject: [ceph-users] Ceph OSD with OCFS2
Dear Team, We are newly using ceph with two OSDs and two clients. Both clients are mounted with the OCFS2 file system. Suppose I transfer 500MB of data on the client: it is showing double the size (1GB) after the data transfer finishes. Is this behavior correct, or is there any solution for this?
Regards Prabu ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
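For the size question, the arithmetic is worth spelling out (a sketch, assuming the setup implied in this thread: replication size 2 and the journal co-located on the data disk):

500 MB written by the client
  x 2 replicas          = 1 GB stored across the cluster
  x 2 (journal + data)  = 2 GB of raw device-level writes

So a 500 MB transfer showing up as roughly 1 GB of used space across the OSDs is expected with 2x replication; the journal doubles device traffic but should not change the reported file or pool size.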
Re: [ceph-users] Is Ceph right for me?
You don't need a filesystem for that. I use csync2 with lsyncd and it works ok. Make sure if you use 2 or multi way sync and WinSCP to update files, first delete the old version, wait a second and then upload the new version. It will save you some head scratching... https://www.krystalmods.com/index.php?title=csync2-web-server-file-sync&more=1&c=1&tb=1&pb=1 Dan
On 6/11/2015 8:54 PM, Michael Kuriger wrote: You might be able to accomplish that with something like dropbox or owncloud
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Trevor Robinson - Key4ce Sent: Wednesday, May 20, 2015 2:35 PM To: ceph-users@lists.ceph.com Subject: [ceph-users] Is Ceph right for me?
Hello, Could somebody please advise me if Ceph is suitable for our use? We are looking for a file system which is able to work over different locations which are connected by VPN. If one location were to go offline then the filesystem will stay online at both sites and then once connection is regained the latest file version will take priority. The main use will be for website files so the changes are most likely to be any uploaded files and cache files as a lot of the data will be stored in a SQL database which is already replicated.
With Kind Regards, Trevor Robinson CTO at Key4ce [Key4ce - IT Professionals] https://key4ce.com/ Skype: KeyMalus.Trev xmpp: t.robin...@im4ce.com Livechat: http://livechat.key4ce.com/ NL: +31 (0)40 290 3310 UK: +44 (0)1332 898 999 CN: +86 (0)7552 824 5985
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] TR: High apply latency on OSD causes poor performance on VM
Turn off write cache on the controller. You're probably seeing the flush to disk. Tyler Bishop Chief Executive Officer 513-299-7108 x10 tyler.bis...@beyondhosting.net If you are not the intended recipient of this transmission you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited.
From: Franck Allouis franck.allo...@stef.com To: ceph-users ceph-us...@ceph.com Sent: Friday, May 29, 2015 8:54:41 AM Subject: [ceph-users] TR: High apply latency on OSD causes poor performance on VM
Hi, Could you take a look at my problem? It's about high latency on my OSDs on HP G8 servers (ceph01, ceph02 and ceph03). When I run a rados bench for 60 sec, the results are surprising: after a few seconds there is no traffic, then it resumes, and so on. Finally, the maximum latency is high and the VMs' disks freeze a lot.
#rados bench -p pool-test-g8 60 write
Maintaining 16 concurrent writes of 4194304 bytes for up to 60 seconds or 0 objects
Object prefix: benchmark_data_ceph02_56745
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 16 82 66 263.959 264 0.0549584 0.171148
2 16 134 118 235.97 208 0.344873 0.232103
3 16 189 173 230.639 220 0.015583 0.24581
4 16 248 232 231.973 236 0.0704699 0.252504
5 16 306 290 231.974 232 0.0229872 0.258343
6 16 371 355 236.64 260 0.27183 0.255469
7 16 419 403 230.26 192 0.0503492 0.263304
8 16 460 444 221.975 164 0.0157241 0.261779
9 16 506 490 217.754 184 0.199418 0.271501
10 16 518 502 200.778 48 0.0472324 0.269049
11 16 518 502 182.526 0 - 0.269049
12 16 556 540 179.981 76 0.100336 0.301616
13 16 607 591 181.827 204 0.173912 0.346105
14 16 655 639 182.552 192 0.0484904 0.339879
15 16 683 667 177.848 112 0.0504184 0.349929
16 16 746 730 182.481 252 0.276635 0.347231
17 16 807 791 186.098 244 0.391491 0.339275
18 16 845 829 184.203 152 0.188608 0.342021
19 16 850 834 175.561 20 0.960175 0.342717
2015-05-28 17:09:48.397376 min lat: 0.013532 max lat: 6.28387 avg lat: 0.346987
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
20 16 859 843 168.582 36 0.0182246 0.346987
21 16 863 847 161.316 16 3.18544 0.355051
22 16 897 881 160.165 136 0.0811037 0.371209
23 16 901 885 153.897 16 0.0482124 0.370793
24 16 943 927 154.484 168 0.63064 0.397204
25 15 997 982 157.104 220 0.0933448 0.392701
26 16 1058 1042 160.291 240 0.166463 0.385943
27 16 1088 1072 158.798 120 1.63882 0.388568
28 16 1125 1109 158.412 148 0.0511479 0.38419
29 16 1155 1139 157.087 120 0.162266 0.385898
30 16 1163 1147 152.917 32 0.0682181 0.383571
31 16 1190 1174 151.468 108 0.0489185 0.386665
32 16 1196 1180 147.485 24 2.95263 0.390657
33 16 1213 1197 145.076 68 0.0467788 0.389299
34 16 1265 1249 146.926 208 0.0153085 0.420687
35 16 1332 1316 150.384 268 0.0157061 0.42259
36 16 1374 1358 150.873 168 0.251626 0.417373
37 16 1402 1386 149.822 112 0.0475302 0.413886
38 16 1444 1428 150.3 168 0.0507577 0.421055
39 16 1500 1484 152.189 224 0.0489163 0.416872
2015-05-28 17:10:08.399434 min lat: 0.013532 max lat: 9.26596 avg lat: 0.415296
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
40 16 1530 1514 151.384 120 0.951713 0.415296
41 16 1551 1535 149.741 84 0.0686787 0.416571
42 16 1606 1590 151.413 220 0.0826855 0.41684
43 16 1656 1640 152.542 200 0.0706539 0.409974
44 16 1663 1647 149.712 28 0.046672 0.408476
45 16 1685 1669 148.34 88 0.0989566 0.424918
46 16 1707 1691 147.028 88 0.0490569 0.421116
47 16 1707 1691 143.9 0 - 0.421116
48 16 1707 1691 140.902 0 - 0.421116
49 16 1720 1704 139.088 17. 0.0480335 0.428997
50 16 1752 1736 138.866 128 0.053219 0.4416
51 16 1786 1770 138.809 136 0.602946 0.440357
52 16 1810 1794 137.986 96 0.0472518 0.438376
53 16 1831 1815 136.967 84 0.0148999 0.446801
54 16 1831 1815 134.43 0 - 0.446801
55 16 1853 1837 133.586 44 0.0499486 0.455561
56 16 1898 1882 134.415 180 0.0566593 0.461019
57 16 1932 1916 134.442 136 0.0162902 0.454385
58 16 1948 1932 133.227 64 0.62188 0.464403
59 16 1966 1950 132.19 72 0.563613 0.472147
2015-05-28 17:10:28.401525 min lat: 0.013532 max lat: 12.4828 avg lat: 0.472084
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
60 16 1983 1967 131.12 68 0.030789 0.472084
61 16 1984 1968 129.036 4 0.0519125 0.471871
62 16 1984 1968 126.955 0 - 0.471871
63 16 1984 1968 124.939 0 - 0.471871
64 14 1984 1970 123.112 2.7 4.20878 0.476035
Total time run: 64.823355
Total writes made: 1984
Write size: 4194304
Bandwidth (MB/sec): 122.425
Stddev Bandwidth: 85.3816
Max bandwidth (MB/sec): 268
Min bandwidth (MB/sec): 0
Average Latency: 0.520956
Stddev Latency: 1.17678
Max latency: 12.4828
Min latency: 0.013532
I have installed a new ceph06 box which has best latencies but hardware is different (RAID card, disks,
Re: [ceph-users] Restarting OSD leads to lower CPU usage
Yeah, perf top will help you a lot. Some guesses:
1. If your block size is in the small 4-16K range, most probably you are hitting the tcmalloc issue. 'perf top' will show up with a lot of tcmalloc traces in that case.
2. fdcache should save you some cpu, but I don't see it being that significant.
Thanks Regards Somnath
-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer Sent: Thursday, June 11, 2015 5:57 AM To: Dan van der Ster Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Restarting OSD leads to lower CPU usage
I have no experience with perf and the package is not installed. I will take a look at it, thanks. Jan
On 11 Jun 2015, at 13:48, Dan van der Ster d...@vanderster.com wrote: Hi Jan, Can you get perf top running? It should show you where the OSDs are spinning... Cheers, Dan On Thu, Jun 11, 2015 at 11:21 AM, Jan Schermer j...@schermer.cz wrote: Hi, hoping someone can point me in the right direction. Some of my OSDs have a larger CPU usage (and ops latencies) than others. If I restart the OSD everything runs nicely for some time, then it creeps up. 1) most of my OSDs have ~40% CPU (core) usage (user+sys), some are closer to 80%. Restarting means the offending OSDs only use 40% again. 2) average latencies and CPU usage on the host are the same - so it’s not caused by the host that the OSD is running on 3) I can’t say exactly when or how the issue happens. I can’t even say if it’s the same OSDs. It seems it either happens when something heavy happens in a cluster (like dropping very old snapshots, rebalancing) and then doesn’t come back, or maybe it happens slowly over time and I can’t find it in the graphs. Looking at the graphs it seems to be the former. I have just one suspicion and that is the “fd cache size” - we have it set to 16384 but the open fds suggest there are more open files for the osd process (over 17K fds) - it varies by some hundreds between the osds. Maybe some are just slightly over the limit and the misses cause this? Restarting the OSD clears them (~2K) and they increase over time. I increased it to 32768 yesterday and it consistently nice now, but it might take another few days to manifest… Could this explain it? Any other tips? Thanks Jan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Restarting OSD leads to lower CPU usage
Hi, I looked at it briefly before leaving; tcmalloc was at the top. I can provide a full listing tomorrow if it helps.
12.80% libtcmalloc.so.4.1.0 [.] tcmalloc::CentralFreeList::FetchFromSpans()
8.40% libtcmalloc.so.4.1.0 [.] tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)
7.40% [kernel] [k] futex_wake
6.36% libtcmalloc.so.4.1.0 [.] tcmalloc::CentralFreeList::ReleaseToSpans(void*)
6.09% [kernel] [k] futex_requeue
Not much else to see. We tried setting the variable TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES, but it only got much much worse (default 16MB, tried 8MB and up to 512MB; it was unusably slow immediately after start). We haven’t tried upgrading tcmalloc, though... We only use Ceph for RBD with OpenStack; the block size is the default (4MB). I tested different block sizes previously, and I got the best results from 8MB blocks (and I was benchmarking 4K random direct/sync writes) - strange, I think… I increased fdcache to 12 (which should be enough for all objects on the OSD), and I will compare how it behaves tomorrow. Thanks a lot Jan
On 11 Jun 2015, at 20:59, Somnath Roy somnath@sandisk.com wrote: Yeah, perf top will help you a lot.. Some guess: 1. If your block size is small 4-16K range, most probably you are hitting the tcmalloc issue. 'perf top' will show up with lot of tcmalloc traces in that case. 2. fdcache should save you some cpu but I don't see it will be that significant. Thanks Regards Somnath
-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer Sent: Thursday, June 11, 2015 5:57 AM To: Dan van der Ster Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Restarting OSD leads to lower CPU usage
I have no experience with perf and the package is not installed. I will take a look at it, thanks. Jan
On 11 Jun 2015, at 13:48, Dan van der Ster d...@vanderster.com wrote: Hi Jan, Can you get perf top running? It should show you where the OSDs are spinning... Cheers, Dan On Thu, Jun 11, 2015 at 11:21 AM, Jan Schermer j...@schermer.cz wrote: Hi, hoping someone can point me in the right direction. Some of my OSDs have a larger CPU usage (and ops latencies) than others. If I restart the OSD everything runs nicely for some time, then it creeps up. 1) most of my OSDs have ~40% CPU (core) usage (user+sys), some are closer to 80%. Restarting means the offending OSDs only use 40% again. 2) average latencies and CPU usage on the host are the same - so it’s not caused by the host that the OSD is running on 3) I can’t say exactly when or how the issue happens. I can’t even say if it’s the same OSDs. It seems it either happens when something heavy happens in a cluster (like dropping very old snapshots, rebalancing) and then doesn’t come back, or maybe it happens slowly over time and I can’t find it in the graphs. Looking at the graphs it seems to be the former. I have just one suspicion and that is the “fd cache size” - we have it set to 16384 but the open fds suggest there are more open files for the osd process (over 17K fds) - it varies by some hundreds between the osds. Maybe some are just slightly over the limit and the misses cause this? Restarting the OSD clears them (~2K) and they increase over time. I increased it to 32768 yesterday and it consistently nice now, but it might take another few days to manifest… Could this explain it? Any other tips?
Thanks Jan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies). ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
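For anyone who wants to repeat the thread-cache experiment, a sketch of how the variable is usually applied (the sysconfig path is an assumption; your init scripts may source a different environment file):

  # /etc/sysconfig/ceph (assumed location)
  TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728   # 128MB thread cache

  # restart one OSD so the daemon picks it up
  service ceph restart osd.12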
Re: [ceph-users] Restarting OSD leads to lower CPU usage
Yeah! Then it is the tcmalloc issue. If you are using the version that comes with the OS, TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES won't do anything. Try building the latest tcmalloc, set the env variable, and see if it improves things or not. Also, you can try the latest Ceph build with jemalloc enabled if you have a test cluster.
Thanks & Regards
Somnath
-----Original Message-----
From: Jan Schermer [mailto:j...@schermer.cz]
Sent: Thursday, June 11, 2015 12:10 PM
To: Somnath Roy
Cc: Dan van der Ster; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Restarting OSD leads to lower CPU usage
[snip]
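A rough sketch of both suggestions (the install prefix, library versions and the OSD id are placeholders, not a tested recipe):

  # build a recent gperftools/tcmalloc from source
  ./configure --prefix=/opt/gperftools && make && sudo make install

  # run an OSD against it, with a bigger thread cache
  LD_PRELOAD=/opt/gperftools/lib/libtcmalloc.so.4 \
  TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728 \
  ceph-osd -i 12 -f

  # or, to test jemalloc instead
  LD_PRELOAD=/usr/lib64/libjemalloc.so.1 ceph-osd -i 12 -f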
Re: [ceph-users] Is Ceph right for me?
Alternatively you could just use git (or some other form of versioning system): host your code/files/html/whatever in git, make changes to the git tree, and then trigger a git pull from your webservers to the local filesystem. This gives you the ability to use branches/versions to control your webserver content, and you can easily roll back to a previous version if you need to. You can create a dev branch, make changes to it, and host it on a test web server; once approved, push the changes to the master branch and trigger the refresh on the web servers (see the sketch after this message).
~~shane
On 6/11/15, 11:28 AM, Lionel Bouton lionel+c...@bouton.name wrote:
On 05/20/15 23:34, Trevor Robinson - Key4ce wrote:
Hello, could somebody please advise me if Ceph is suitable for our use? We are looking for a file system which is able to work over different locations which are connected by VPN. If one location were to go offline, the filesystem would stay online at both sites, and once the connection is regained the latest file version would take priority.
CephFS won't work well (or at all when the connections are lost). The only part of Ceph which would work is RGW replication, but you don't get a filesystem with it, and I'm under the impression that multi-master replication might be tricky (to be confirmed). Coda's goals seem to match your needs. I'm not sure if it's still actively developed (there is a client distributed with the Linux kernel, though). http://www.coda.cs.cmu.edu/ Last time I tried it (several years ago) it worked well enough for me.
The main use will be for website files, so the changes are most likely to be any uploaded files and cache files, as a lot of the data will be stored in a SQL database which is already replicated.
If your setup is not too complex, you might simply handle this with rsync or unison.
Best regards, Lionel
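A sketch of the git-based flow (the repository location, remote and branch names are assumptions):

  # on the test web server: try out the dev branch
  git -C /var/www/site fetch origin
  git -C /var/www/site checkout dev

  # once approved: merge to master
  git checkout master && git merge dev && git push origin master

  # on each production web server (e.g. from a cron job or deploy hook)
  git -C /var/www/site pull origin master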
Re: [ceph-users] Restarting OSD leads to lower CPU usage
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES works, or at least seems to; it just did nothing positive. This is on a CentOS 6-ish distro. I can't really upgrade anything easily because of support, and we still run 0.67.12 in production, so that's a no-go. I know upgrading to Giant is the best way to achieve more performance, but we're not ready for that yet either (but working on it :))
I'd expect the tcmalloc issue to manifest almost immediately, though? There are thousands of threads and hundreds of connections; surely it would show up sooner? People were seeing regressions with just two clients in benchmarks, so I thought we were operating with a b0rked thread cache constantly...
For the record, preloading jemalloc ends with SIGSEGV within a few minutes, if anybody wanted to know... :)
Jan
On 11 Jun 2015, at 21:14, Somnath Roy somnath@sandisk.com wrote:
[snip]
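Since the fd-cache suspicion from the start of this thread is easy to check, here is a small sketch for comparing open fds per OSD against the configured cache size:

  # count open file descriptors for each running ceph-osd
  for pid in $(pidof ceph-osd); do
    echo "osd pid $pid: $(ls /proc/$pid/fd | wc -l) open fds"
  done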
[ceph-users] Ceph giant installation fails on rhel 7.0
I am trying to install Ceph Giant on RHEL 7.0. While installing ceph-common-0.87.2-0.el7.x86_64.rpm, I am getting the following dependency error:

$ sudo yum install ceph-common-0.87.2-0.el7.x86_64.rpm
Loaded plugins: amazon-id, priorities, rhui-lb
Examining ceph-common-0.87.2-0.el7.x86_64.rpm: 1:ceph-common-0.87.2-0.el7.x86_64
Marking ceph-common-0.87.2-0.el7.x86_64.rpm to be installed
Resolving Dependencies
--> Running transaction check
---> Package ceph-common.x86_64 1:0.87.2-0.el7 will be installed
--> Processing Dependency: libtcmalloc.so.4()(64bit) for package: 1:ceph-common-0.87.2-0.el7.x86_64
--> Finished Dependency Resolution
Error: Package: 1:ceph-common-0.87.2-0.el7.x86_64 (/ceph-common-0.87.2-0.el7.x86_64)
       Requires: libtcmalloc.so.4()(64bit)
 You could try using --skip-broken to work around the problem
 You could try running: rpm -Va --nofiles --nodigest

So I am trying to install gperftools-libs to resolve the dependency, but I am unable to get the package using yum install. Can anyone help me with the complete list of dependencies needed to install Ceph Giant on RHEL 7.0?
Thanks, Shambhu
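For what it's worth, on RHEL/CentOS 7 the gperftools-libs package that provides libtcmalloc.so.4 normally lives in EPEL, so enabling that repository is the usual fix. A sketch (how you obtain epel-release depends on your subscription setup):

  # CentOS: epel-release is in the extras repo
  sudo yum install epel-release
  # RHEL: install the epel-release RPM from dl.fedoraproject.org/pub/epel/ instead

  sudo yum install gperftools-libs
  sudo yum install ceph-common-0.87.2-0.el7.x86_64.rpm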
Re: [ceph-users] Ceph OSD with OCFS2
Hi Team, once the data transfer is complete, the journal should flush the in-memory data to its real location, but in our case the usage still shows double the size after the transfer finishes, so everyone is confused about which numbers are the real file and folder sizes. Also, what will happen if I move the monitor from that OSD server to a separate one; could that solve the double-size issue? We have the queries below as well.
1. An extra 2-3 minutes is taken for hg/git repository operations like clone, pull, checkout and update.
2. If I transfer 1GB of data, what will be the size on the server (OSD)? Will this be written in a compressed format?
3. Is it possible to take a backup of the server data and copy it to another machine as Server_Backup, then start a new client using Server_Backup?
4. Data removal is very slow.
Regards
Prabu
On Fri, 05 Jun 2015 21:55:28 +0530 Somnath Roy <somnath@sandisk.com> wrote:
Yes, Ceph will be writing twice: once for the journal and once for the actual data. Considering you configured the journal on the same device, this is what you end up seeing if you are monitoring the device bandwidth.
Thanks & Regards
Somnath
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of gjprabu
Sent: Friday, June 05, 2015 3:07 AM
To: ceph-users@lists.ceph.com
Cc: Kamala Subramani; Siva Sokkumuthu
Subject: [ceph-users] Ceph OSD with OCFS2
Dear Team, we are newly using Ceph with two OSDs and two clients. Both clients are mounted with the OCFS2 file system. Suppose I transfer 500MB of data on the client; it shows double the size (1GB) after the data transfer finishes. Is this behavior correct, or is there any solution for this?
Regards
Prabu
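Since the doubling comes from journal writes landing on the same device as the data, pointing the journal at a separate device is the common mitigation. A minimal ceph.conf sketch (the partition path is an assumption):

  [osd]
  osd journal size = 10240        # journal size in MB

  [osd.0]
  osd journal = /dev/sdb1         # dedicated journal partition (assumed)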
Re: [ceph-users] Restarting OSD leads to lower CPU usage
I have no experience with perf and the package is not installed. I will take a look at it, thanks.
Jan
On 11 Jun 2015, at 13:48, Dan van der Ster d...@vanderster.com wrote:
Hi Jan, can you get perf top running? It should show you where the OSDs are spinning...
Cheers, Dan
On Thu, Jun 11, 2015 at 11:21 AM, Jan Schermer j...@schermer.cz wrote:
[snip]
Re: [ceph-users] Is Ceph right for me?
Hi Trevor, probably csync2 could work for you (a config sketch follows below).
Best, Karsten
On 11.06.2015 7:30 PM, Trevor Robinson - Key4ce t.robin...@key4ce.com wrote:
Hello, could somebody please advise me if Ceph is suitable for our use? We are looking for a file system which is able to work over different locations which are connected by VPN. If one location were to go offline, the filesystem would stay online at both sites, and once the connection is regained the latest file version would take priority. The main use will be for website files, so the changes are most likely to be any uploaded files and cache files, as a lot of the data will be stored in a SQL database which is already replicated.
With kind regards,
Trevor Robinson, CTO at Key4ce
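If csync2 fits, a minimal configuration sketch for two web hosts might look like this (host names, key path and the synced directory are all assumptions):

  # /etc/csync2.cfg
  group web {
    host web1 web2;
    key /etc/csync2.key;
    include /var/www;
    auto younger;     # on conflict, the newer file wins
  }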
[ceph-users] MONs not forming quorum
Hi folks, I'm trying to deploy 0.94.2 (Hammer) onto CentOS 7. I used to be pretty good at this on Ubuntu, but it has been a while. Anyway, my monitors are not forming quorum, and I'm not sure why. They can definitely all ping each other and such. Any thoughts on specific problems in the output below, or just general causes for monitors not forming quorum, or where to get more debug information on what is going wrong? Thanks!!

[root@bdca151 ceph]# ceph-deploy mon create-initial bdca15{0,2,3}
[ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf
[ceph_deploy.cli][INFO ] Invoked (1.5.25): /bin/ceph-deploy mon create-initial bdca150 bdca152 bdca153
[ceph_deploy.mon][DEBUG ] Deploying mon, cluster ceph hosts bdca150 bdca152 bdca153
[ceph_deploy.mon][DEBUG ] detecting platform for host bdca150 ...
[bdca150][DEBUG ] connected to host: bdca150
[bdca150][DEBUG ] detect platform information from remote host
[bdca150][DEBUG ] detect machine type
[ceph_deploy.mon][INFO ] distro info: CentOS Linux 7.1.1503 Core
[bdca150][DEBUG ] determining if provided host has same hostname in remote
[bdca150][DEBUG ] get remote short hostname
[bdca150][DEBUG ] deploying mon to bdca150
[bdca150][DEBUG ] get remote short hostname
[bdca150][DEBUG ] remote hostname: bdca150
[bdca150][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[bdca150][DEBUG ] create the mon path if it does not exist
[bdca150][DEBUG ] checking for done path: /var/lib/ceph/mon/ceph-bdca150/done
[bdca150][DEBUG ] done path does not exist: /var/lib/ceph/mon/ceph-bdca150/done
[bdca150][INFO ] creating keyring file: /var/lib/ceph/tmp/ceph-bdca150.mon.keyring
[bdca150][DEBUG ] create the monitor keyring file
[bdca150][INFO ] Running command: ceph-mon --cluster ceph --mkfs -i bdca150 --keyring /var/lib/ceph/tmp/ceph-bdca150.mon.keyring
[bdca150][DEBUG ] ceph-mon: renaming mon.noname-a 10.1.0.150:6789/0 to mon.bdca150
[bdca150][DEBUG ] ceph-mon: set fsid to 770514ba-65e6-475b-8d43-ad6ee850ead6
[bdca150][DEBUG ] ceph-mon: created monfs at /var/lib/ceph/mon/ceph-bdca150 for mon.bdca150
[bdca150][INFO ] unlinking keyring file /var/lib/ceph/tmp/ceph-bdca150.mon.keyring
[bdca150][DEBUG ] create a done file to avoid re-doing the mon deployment
[bdca150][DEBUG ] create the init path if it does not exist
[bdca150][DEBUG ] locating the `service` executable...
[bdca150][INFO ] Running command: /usr/sbin/service ceph -c /etc/ceph/ceph.conf start mon.bdca150
[bdca150][DEBUG ] === mon.bdca150 ===
[bdca150][DEBUG ] Starting Ceph mon.bdca150 on bdca150...
[bdca150][WARNIN] Running as unit run-52328.service.
[bdca150][DEBUG ] Starting ceph-create-keys on bdca150...
[bdca150][INFO ] Running command: systemctl enable ceph
[bdca150][WARNIN] ceph.service is not a native service, redirecting to /sbin/chkconfig.
[bdca150][WARNIN] Executing /sbin/chkconfig ceph on
[bdca150][WARNIN] The unit files have no [Install] section. They are not meant to be enabled
[bdca150][WARNIN] using systemctl.
[bdca150][WARNIN] Possible reasons for having this kind of units are:
[bdca150][WARNIN] 1) A unit may be statically enabled by being symlinked from another unit's
[bdca150][WARNIN]    .wants/ or .requires/ directory.
[bdca150][WARNIN] 2) A unit's purpose may be to act as a helper for some other unit which has
[bdca150][WARNIN]    a requirement dependency on it.
[bdca150][WARNIN] 3) A unit may be started when needed via activation (socket, path, timer,
[bdca150][WARNIN]    D-Bus, udev, scripted systemctl call, ...).
[bdca150][INFO ] Running command: ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.bdca150.asok mon_status
[bdca150][DEBUG ]
[bdca150][DEBUG ] status for monitor: mon.bdca150
[bdca150][DEBUG ] {
[bdca150][DEBUG ]   election_epoch: 0,
[bdca150][DEBUG ]   extra_probe_peers: [
[bdca150][DEBUG ]     10.1.0.152:6789/0,
[bdca150][DEBUG ]     10.1.0.153:6789/0
[bdca150][DEBUG ]   ],
[bdca150][DEBUG ]   monmap: {
[bdca150][DEBUG ]     created: 0.00,
[bdca150][DEBUG ]     epoch: 0,
[bdca150][DEBUG ]     fsid: 770514ba-65e6-475b-8d43-ad6ee850ead6,
[bdca150][DEBUG ]     modified: 0.00,
[bdca150][DEBUG ]     mons: [
[bdca150][DEBUG ]       { addr: 10.1.0.150:6789/0, name: bdca150, rank: 0 },
[bdca150][DEBUG ]       { addr: 0.0.0.0:0/1, name: bdca152, rank: 1 },
[bdca150][DEBUG ]       { addr: 0.0.0.0:0/2, name: bdca153, rank: 2 }
[bdca150][DEBUG ]     ]
[bdca150][DEBUG ]   },
[bdca150][DEBUG ]   name: bdca150,
[bdca150][DEBUG ]   outside_quorum: [
[bdca150][DEBUG ]     bdca150
[bdca150][DEBUG ]   ],
[bdca150][DEBUG ]   quorum: [],
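The addr: 0.0.0.0:0/1 entries for bdca152 and bdca153 in that monmap are the classic symptom of the initial monitors not knowing each other's addresses when the monmap is built. Declaring them explicitly in ceph.conf before running create-initial usually avoids this; a sketch using the addresses from the output above (the netmask is an assumption):

  [global]
  fsid = 770514ba-65e6-475b-8d43-ad6ee850ead6
  mon initial members = bdca150, bdca152, bdca153
  mon host = 10.1.0.150,10.1.0.152,10.1.0.153
  public network = 10.1.0.0/24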
[ceph-users] anyone using CephFS for HPC?
Wondering if anyone has done comparisons between CephFS and other parallel filesystems like Lustre typically used in HPC deployments either for scratch storage or persistent storage to support HPC workflows? thanks.
Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
Hi Alexandre, I agree with your rationale of one iothread per disk; the CPU consumed in iowait is pretty high in each VM. But I am not finding a way to set this on a Nova instance. I am using OpenStack Juno with QEMU+KVM.
As per the libvirt documentation for setting iothreads, I can edit domain.xml directly and achieve the same effect. However, in an OpenStack environment the domain XML is created by Nova with some additional metadata, so editing the domain XML using 'virsh edit' does not seem to work (I agree it is not a very cloud way of doing things, but a hack). Changes made there vanish after saving them, because libvirt validation fails on the result:

#virsh dumpxml instance-00c5 > vm.xml
#virt-xml-validate vm.xml
Relax-NG validity error : Extra element cpu in interleave
vm.xml:1: element domain: Relax-NG validity error : Element domain failed to validate content
vm.xml fails to validate

The second approach I took was setting QoS in volume types, but there is no option to set iothreads per volume; the parameters are related to max read/write ops/bytes. Thirdly, editing the Nova flavor and providing extra specs like hw:cpu_socket/thread/core can change the guest CPU topology, but again there is no way to set iothreads. It does accept hw_disk_iothreads (no type check in place, I believe), but cannot pass it into domain.xml. Could you suggest a way to set this?
-Pushpesh
On Wed, Jun 10, 2015 at 12:59 PM, Alexandre DERUMIER aderum...@odiso.com wrote:
I need to try out the performance on qemu soon and may come back to you if I need some qemu setting trick :-)
Sure, no problem. (BTW, I can reach around 200k iops in 1 qemu vm with 5 virtio disks with 1 iothread per disk)
----- Original Message -----
From: Somnath Roy somnath@sandisk.com
To: aderumier aderum...@odiso.com, Irek Fasikhov malm...@gmail.com
Cc: ceph-devel ceph-de...@vger.kernel.org, pushpesh sharma pushpesh@gmail.com, ceph-users ceph-users@lists.ceph.com
Sent: Wednesday 10 June 2015 09:06:32
Subject: RE: rbd_cache, limiting read on high iops around 40k
Hi Alexandre, thanks for sharing the data. I need to try out the performance on qemu soon and may come back to you if I need some qemu setting trick :-)
Regards, Somnath
-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Alexandre DERUMIER
Sent: Tuesday, June 09, 2015 10:42 PM
To: Irek Fasikhov
Cc: ceph-devel; pushpesh sharma; ceph-users
Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
Very good work! Do you have an rpm file? Thanks.
No, sorry, I compiled it manually (and I'm using Debian Jessie as the client).
----- Original Message -----
De: Irek Fasikhov malm...@gmail.com
Hi, Alexandre. Very good work! Do you have an rpm file? Thanks.
2015-06-10 7:10 GMT+03:00 Alexandre DERUMIER aderum...@odiso.com:
Hi, I have tested qemu with the latest tcmalloc 2.4, and the improvement is huge with an iothread: 50k iops (+45%)!

qemu : no-iothread : glibc           : iops=33395
qemu : no-iothread : tcmalloc (2.2.1): iops=34516 (+3%)
qemu : no-iothread : jemalloc        : iops=42226 (+26%)
qemu : no-iothread : tcmalloc (2.4)  : iops=35974 (+7%)
qemu : iothread    : glibc           : iops=34516
qemu : iothread    : tcmalloc        : iops=38676 (+12%)
qemu : iothread    : jemalloc        : iops=28023 (-19%)
qemu : iothread    : tcmalloc (2.4)  : iops=50276 (+45%)

rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [r(1)] [100.0% done] [214.7MB/0KB/0KB /s] [54.1K/0/0 iops] [eta 00m:00s]
rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=894: Wed Jun 10 05:54:24 2015
  read : io=5120.0MB, bw=201108KB/s, iops=50276, runt= 26070msec
  slat (usec): min=1, max=1136, avg= 3.54, stdev= 3.58
  clat (usec): min=128, max=6262, avg=631.41, stdev=197.71
  lat (usec): min=149, max=6265, avg=635.27, stdev=197.40
  clat percentiles (usec):
   |  1.00th=[  318],  5.00th=[  378], 10.00th=[  418], 20.00th=[  474],
   | 30.00th=[  516], 40.00th=[  564], 50.00th=[  612], 60.00th=[  652],
   | 70.00th=[  700], 80.00th=[  756], 90.00th=[  860], 95.00th=[  980],
   | 99.00th=[ 1272], 99.50th=[ 1384], 99.90th=[ 1688], 99.95th=[ 1896],
   | 99.99th=[ 3760]
  bw (KB /s): min=145608, max=249688, per=100.00%, avg=201108.00, stdev=21718.87
  lat (usec) : 250=0.04%, 500=25.84%, 750=53.00%, 1000=16.63%
  lat (msec) : 2=4.46%, 4=0.03%, 10=0.01%
  cpu : usr=9.73%, sys=24.93%, ctx=66417, majf=0, minf=38
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
  submit
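For reference, the relevant pieces of a libvirt domain XML with iothreads look roughly like this (requires a reasonably recent libvirt; the RBD source name is an assumption and the disk definition is abbreviated):

  <domain type='kvm'>
    ...
    <iothreads>4</iothreads>
    <devices>
      <disk type='network' device='disk'>
        <driver name='qemu' type='raw' cache='writeback' iothread='1'/>
        <source protocol='rbd' name='vms/instance-00c5-disk-1'/>
        ...
      </disk>
    </devices>
  </domain>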
Re: [ceph-users] anyone using CephFS for HPC?
On Thu, Jun 11, 2015 at 10:31 PM, Nigel Williams nigel.d.willi...@gmail.com wrote:
Wondering if anyone has done comparisons between CephFS and other parallel filesystems like Lustre typically used in HPC deployments either for scratch storage or persistent storage to support HPC workflows?
Oak Ridge had a paper at Supercomputing a couple of years ago about this from their perspective. I don't remember how many of its concerns are still up-to-date, and the test evaluation was on repurposed Lustre hardware so it was a bit odd, but it might give you some stuff to think about. Sage's thesis or some of the earlier papers will be happy to tell you all the ways in which Ceph > Lustre, of course, since creating a successor is how the project started. ;)
-Greg
[ceph-users] New to CEPH - VR@Sheeltron
Dear Sir, I am new to Ceph and have the following queries:
1. I have been using OpenNAS, OpenFiler, Gluster and Nexenta as storage OSes. How is Ceph different from Gluster and Nexenta?
2. I also use Lustre for our storage in an HPC environment. Can Ceph be substituted for Lustre?
3. What is the minimum capacity of storage (in TB) at which Ceph can be deployed? What is the typical hardware configuration required to support Ceph? Can we use 'commodity hardware' like TYAN servers and JBODs to stack up the HDDs? Do you need RAID controllers, or is the RAID/LUN built by the OS?
4. Do you have any doc that gives me comparisons with other software-based storage?
Thanks & Regards,
V.Ranganath
VP - SIS Division, Sheeltron Digital Systems Pvt. Ltd.
E-mail: ra...@sheeltron.com
Re: [ceph-users] calculating maximum number of disk and node failure that can be handled by cluster with out data loss
I wrote a script which calculates the data durability SLA depending on many factors like disk size, network speed, number of hosts, etc. It assumes a recovery time three times greater than strictly needed, to account for client IO having priority over recovery IO. For 2TB disks and a 10Gb network it shows a bright picture:

OSDs: 10   SLA: 100.00%
OSDs: 20   SLA: 100.00%
OSDs: 30   SLA: 100.00%
OSDs: 40   SLA: 100.00%
OSDs: 50   SLA: 100.00%
OSDs: 100  SLA: 100.00%
OSDs: 200  SLA: 100.00%
OSDs: 500  SLA: 99.99%

For a 1Gb network it shows 7-8 nines on every line. So if my estimations are correct, we are almost safe from triple-failure data loss. The script is in the attachment. Any good criticism is welcome.
Regards, Vasily.
On Thu, Jun 11, 2015 at 3:37 AM, Christian Balzer ch...@gol.com wrote:
Hello,
On Wed, 10 Jun 2015 23:53:48 +0300 Vasiliy Angapov wrote:
Hi, I also wrote a simple script which calculates the data loss probabilities for a triple disk failure. Here are some numbers:

OSDs: 10,  Pr: 138.89%
OSDs: 20,  Pr: 29.24%
OSDs: 30,  Pr: 12.32%
OSDs: 40,  Pr: 6.75%
OSDs: 50,  Pr: 4.25%
OSDs: 100, Pr: 1.03%
OSDs: 200, Pr: 0.25%
OSDs: 500, Pr: 0.04%

Nice, good to have some numbers.
Here I assumed we have 100 PGs per OSD. There is also a constraint that the 3 disks must not be in one host, because that would not lead to a failure. For a situation where all disks are evenly distributed between 10 hosts, this gives a correction coefficient of 83%, so for 50 OSDs it will be something like 3.53% instead of 4.25%. There is a further constraint for 2 disks in one host and 1 disk on another, but that just adds unneeded complexity; the numbers will not change significantly. And actually a triple simultaneous failure is itself not very likely to happen, so I believe that starting from 100 OSDs we can somewhat relax about data failure.
I mentioned the link below before; I found it to be one of the more believable RAID failure calculators, and they explain their shortcomings nicely to boot. I usually halve their DLO/year values (doubling the chance of data loss) to be on the safe side: https://www.memset.com/tools/raid-calculator/
If you plug in a 100-disk RAID6 (the equivalent of replica 3) with 2TB per disk and a recovery rate of 100MB/s, the odds are indeed pretty good. But note the expected disk failure rate of one per 10 days! Of course the biggest variable here is how fast your recovery speed will be. I picked 100MB/s because for some people that will be as fast as their network goes. For others the network could be 10-40 times as fast, but their cluster might not have enough OSDs (or fast enough ones) to remain usable at those speeds, so they'll opt for lower-priority recovery speeds.
Christian
BTW, this presentation has more math: http://www.slideshare.net/kioecn/build-an-highperformance-and-highdurable-block-storage-service-based-on-ceph
Regards, Vasily.
On Wed, Jun 10, 2015 at 12:38 PM, Dan van der Ster d...@vanderster.com wrote:
OK, I wrote a quick script to simulate triple failures and count how many would have caused data loss. The script gets your list of OSDs and PGs, then simulates failures and checks if any permutation of that failure matches a PG. Here's an example with 10000 simulations on our production cluster:

# ./simulate-failures.py
We have 1232 OSDs and 21056 PGs, hence 21056 combinations e.g. like this: (945, 910, 399)
Simulating 10000 failures
Simulated 1000 triple failures. Data loss incidents = 0
Data loss incident with failure (676, 451, 931)
Simulated 2000 triple failures. Data loss incidents = 1
Simulated 3000 triple failures. Data loss incidents = 1
Simulated 4000 triple failures. Data loss incidents = 1
Simulated 5000 triple failures. Data loss incidents = 1
Simulated 6000 triple failures. Data loss incidents = 1
Simulated 7000 triple failures. Data loss incidents = 1
Simulated 8000 triple failures. Data loss incidents = 1
Data loss incident with failure (1031, 1034, 806)
Data loss incident with failure (449, 644, 329)
Simulated 9000 triple failures. Data loss incidents = 3
Simulated 10000 triple failures. Data loss incidents = 3
End of simulation: Out of 10000 triple failures, 3 caused a data loss incident

The script is here: https://github.com/cernceph/ceph-scripts/blob/master/tools/durability/simulate-failures.py
Give it a try (on your test clusters!)
Cheers, Dan
On Wed, Jun 10, 2015 at 10:47 AM, Jan Schermer j...@schermer.cz wrote:
Yeah, I know, but I believe it was fixed so that a single copy is sufficient for recovery now (even with min_size=1)? Depends on what you want to achieve... The point is that even if we lost “just” 1% of data, that's too much (0%) when talking about customer
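As a sanity check on the triple-failure table quoted above: the listed probabilities all fit the closed form

  Pr ~= 100 / ((N - 1) * (N - 2))

for N OSDs at 100 PGs per OSD; e.g. N = 50 gives 100/(49*48) ~= 4.25% and N = 500 gives 100/(499*498) ~= 0.04%, matching every line. This is offered only as an observed fit to the quoted numbers, not as a verified derivation of the script's model.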
[ceph-users] S3 expiration
Hello, I need to store expirable objects in Ceph for housekeeping purposes. I understand that the developers are planning to implement this using the Amazon S3 API. Does anybody know the status of this, or is there another approach to housekeeping available? Thanks.
Re: [ceph-users] ceph mount error
2015-06-12 2:00 GMT+08:00 Lincoln Bryant linco...@uchicago.edu:
Hi, are you using cephx? If so, does your client have the appropriate key on it? It looks like you have an MDS set up and running from your screenshot. Try mounting it like so:
mount -t ceph -o name=admin,secret=[your secret] 192.168.1.105:6789:/ /mnt/mycephfs
This should be the solution; you can get the error details from the kernel log via dmesg.
--Lincoln
On Jun 7, 2015, at 10:14 AM, 张忠波 wrote:
Hi, my Ceph health is OK, and now I want to build a filesystem, following the CEPH FS QUICK START guide: http://ceph.com/docs/master/start/quick-cephfs/
However, I got an error when I used the command mount -t ceph 192.168.1.105:6789:/ /mnt/mycephfs
error: mount error 22 = Invalid argument
I have checked the manual and still don't know how to solve it. I am looking forward to your reply!
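A variant that avoids putting the key on the command line, assuming cephx and the standard admin keyring (the file path is an assumption):

  # extract the admin key into a file
  ceph auth get-key client.admin > /etc/ceph/admin.secret

  # mount using the secret file
  mount -t ceph 192.168.1.105:6789:/ /mnt/mycephfs \
        -o name=admin,secretfile=/etc/ceph/admin.secret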
Re: [ceph-users] v0.94.2 Hammer released
Hi,
On 11/06/2015 19:34, Sage Weil wrote:
Bug #11442 introduced a change that made rgw objects that start with underscore incompatible with previous versions. The fix to that bug reverts to the previous behavior. In order to be able to access objects that start with an underscore and were created in prior Hammer releases, following the upgrade it is required to run (for each affected bucket)::
$ radosgw-admin bucket check --check-head-obj-locator --bucket=<bucket> [--fix]
You can get a list of buckets with
$ radosgw-admin bucket list
After the upgrade of my radosgw, I can't fix the problem of rgw objects that start with an underscore. The command with the --fix option displays some errors which I don't understand. Here is a (truncated) paste of my shell below. Have I done something wrong? Thanks in advance for the help.
François Lafont
--
~# radosgw-admin --id=radosgw.gw2 bucket check --check-head-obj-locator --bucket=$bucket
{ bucket: moodles-poc-registry,
  check_objects: [
    { key: { name: _multipart_registry\/images\/1483a2ea4c3f5865d4d583fb484bbe11afe709a6f3d1baef102904d4d9127909\/layer.2~QorD8QaGiDc4HPUP7VVpx4LS-e_7f0u.meta, instance: },
      oid: default.763616.1___multipart_registry\/images\/1483a2ea4c3f5865d4d583fb484bbe11afe709a6f3d1baef102904d4d9127909\/layer.2~QorD8QaGiDc4HPUP7VVpx4LS-e_7f0u.meta,
      locator: default.763616.1__multipart_registry\/images\/1483a2ea4c3f5865d4d583fb484bbe11afe709a6f3d1baef102904d4d9127909\/layer.2~QorD8QaGiDc4HPUP7VVpx4LS-e_7f0u.meta,
      needs_fixing: true,
      status: needs_fixing },
[snip]
    { key: { name: _multipart_registry\/images\/fa4fd76b09ce9b87bfdc96515f9a5dd5121c01cc996cf5379050d8e13d4a864b\/layer.2~TSdIpafsfGXJ7kKMOVqJ-hn8Aog4ETF.meta, instance: },
      oid: default.763616.1___multipart_registry\/images\/fa4fd76b09ce9b87bfdc96515f9a5dd5121c01cc996cf5379050d8e13d4a864b\/layer.2~TSdIpafsfGXJ7kKMOVqJ-hn8Aog4ETF.meta,
      locator: default.763616.1__multipart_registry\/images\/fa4fd76b09ce9b87bfdc96515f9a5dd5121c01cc996cf5379050d8e13d4a864b\/layer.2~TSdIpafsfGXJ7kKMOVqJ-hn8Aog4ETF.meta,
      needs_fixing: true,
      status: needs_fixing }
  ]
}
~# radosgw-admin --id=radosgw.gw2 bucket check --check-head-obj-locator --bucket=$bucket --fix
2015-06-12 03:01:33.197984 7f3c9130d840 -1 ERROR: ioctx.operate(oid=default.763616.1___multipart_registry/images/1483a2ea4c3f5865d4d583fb484bbe11afe709a6f3d1baef102904d4d9127909/layer.2~QorD8QaGiDc4HPUP7VVpx4LS-e_7f0u.meta) returned ret=-2
ERROR: fix_head_object_locator() returned ret=-2
2015-06-12 03:01:33.200428 7f3c9130d840 -1 ERROR: ioctx.operate(oid=default.763616.1___multipart_registry/images/1483a2ea4c3f5865d4d583fb484bbe11afe709a6f3d1baef102904d4d9127909/layer.2~poMH-PQKCLstUWpMQpji7JuGaBT53Th.meta) returned ret=-2
ERROR: fix_head_object_locator() returned ret=-2
ERROR: fix_head_object_locator() returned ret=-2
2015-06-12 03:01:33.206875 7f3c9130d840 -1 ERROR: ioctx.operate(oid=default.763616.1___multipart_registry/images/c5a7fc74211188aabf3429539674275645b07717d003c390a943acc44f35c6d0/layer.2~Bg6bkbSOE8GCtV4Mxr0t56vSfTQTCx9.1) returned ret=-2
2015-06-12 03:01:33.209293 7f3c9130d840 -1 ERROR: ioctx.operate(oid=default.763616.1___multipart_registry/images/c5a7fc74211188aabf3429539674275645b07717d003c390a943acc44f35c6d0/layer.2~Bg6bkbSOE8GCtV4Mxr0t56vSfTQTCx9.2) returned ret=-2
ERROR: fix_head_object_locator() returned ret=-2
ERROR: fix_head_object_locator() returned ret=-2
[snip]
2015-06-12 03:01:33.301101 7f3c9130d840 -1 ERROR: ioctx.operate(oid=default.763616.1___multipart_registry/images/fa4fd76b09ce9b87bfdc96515f9a5dd5121c01cc996cf5379050d8e13d4a864b/layer.2~TSdIpafsfGXJ7kKMOVqJ-hn8Aog4ETF.meta) returned ret=-2
{ bucket: moodles-poc-registry,
  check_objects: [
    { key: { name: _multipart_registry\/images\/1483a2ea4c3f5865d4d583fb484bbe11afe709a6f3d1baef102904d4d9127909\/layer.2~QorD8QaGiDc4HPUP7VVpx4LS-e_7f0u.meta, instance: },
      oid: default.763616.1___multipart_registry\/images\/1483a2ea4c3f5865d4d583fb484bbe11afe709a6f3d1baef102904d4d9127909\/layer.2~QorD8QaGiDc4HPUP7VVpx4LS-e_7f0u.meta,
      locator: default.763616.1__multipart_registry\/images\/1483a2ea4c3f5865d4d583fb484bbe11afe709a6f3d1baef102904d4d9127909\/layer.2~QorD8QaGiDc4HPUP7VVpx4LS-e_7f0u.meta,
      needs_fixing: true,
      status: needs_fixing },
[snip]
    { key: { name: _multipart_registry\/images\/fa4fd76b09ce9b87bfdc96515f9a5dd5121c01cc996cf5379050d8e13d4a864b\/layer.2~TSdIpafsfGXJ7kKMOVqJ-hn8Aog4ETF.meta, instance: },
      oid:
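To run the check across every bucket rather than one at a time, a small loop along these lines should work (the sed-based parsing of the JSON bucket list is an assumption about the output format; verify it on your install first):

  for b in $(radosgw-admin bucket list | sed 's/[][",]//g'); do
    radosgw-admin bucket check --check-head-obj-locator --bucket="$b" --fix
  done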
[ceph-users] [Fwd: adding a monitor will result in cephx: verify_reply couldn't decrypt with error: error decoding block for decryption]
I'm trying to add an extra monitor to my already existing cluster. I do this with ceph-deploy, using the following command:
ceph-deploy mon add mynewhost
ceph-deploy says it all finished, but when I take a look at the logs on my new monitor host I see the following error:
cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
and when I take a look in my existing monitor logs I see this error:
cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
I tried gatherkeys, copying keys, and reinstalling/purging the new monitor node.
greetz
Ramon
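Those decrypt errors usually come down to either clock skew between the hosts or the new monitor running with a different mon. key than the existing quorum, so those are the first two things worth checking. A sketch (host and path names are placeholders):

  # 1) check the clock on the new host
  ntpq -p

  # 2) compare the new mon's keyring with the cluster's mon. key
  cat /var/lib/ceph/mon/ceph-mynewhost/keyring
  ceph auth get mon.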
[ceph-users] Antw: Re: clock skew detected
Hi, for us a restart of the monitor solved this.
Regards, Steffen
Andrey Korolyov and...@xdel.ru wrote on Wednesday, 10 June 2015 at 15:29:
On Wed, Jun 10, 2015 at 4:11 PM, Pavel V. Kaygorodov pa...@inasan.ru wrote:
Hi! Immediately after a reboot of the mon.3 host, its clock was unsynchronized and a "clock skew detected on mon.3" warning appeared. But now (more than 1 hour of uptime) the clock is synced and the warning is still showing. Is this OK? Or do I have to restart the monitor after clock synchronization?
Pavel.
The quorum should report OK after a five-minute interval, but there is a bug preventing the quorum from doing so, at least on the oldest supported stable versions of Ceph. I've never reported it because of its almost zero importance, but things are what they are: the theoretical behavior should be different, and the warning should disappear without a restart.
--
Klinik-Service Neubrandenburg GmbH
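In practice, then, the sequence that clears the warning looks something like this (the mon id and NTP tooling are assumptions about your setup):

  # confirm the clock really is back in sync
  ntpq -p

  # if HEALTH_WARN still shows a skew afterwards, restart the affected mon
  service ceph restart mon.3
  ceph status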
Re: [ceph-users] Load balancing RGW and Scaleout
What I've seen work well is to set multiple A records for your RGW endpoint (a DNS sketch follows below). Then, with something like corosync, you ensure that those IP addresses are always bound somewhere. You can then have as many nodes in active-active mode as you want.
--
David Moreau Simard
On 2015-06-11 11:29 AM, Florent MONTHEL wrote:
[snip]
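A sketch of the DNS side (names and addresses are made up):

  ; three A records for one RGW endpoint, low TTL
  rgw.example.com.  60  IN  A  192.0.2.11
  rgw.example.com.  60  IN  A  192.0.2.12
  rgw.example.com.  60  IN  A  192.0.2.13

Each address is a floating IP managed by corosync/pacemaker (or keepalived), so when a gateway fails its address comes up on a surviving node, and no single proxy caps your aggregate bandwidth.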
[ceph-users] Load balancing RGW and Scaleout
Hi Team, is it possible for you to share your radosgw setup, in order to use the maximum network bandwidth and have no SPOF? I have 5 servers on a 10Gb network and 3 radosgw among them. We would like to set up HAProxy on 1 node in front of the 3 RGWs, but:
- the HAProxy node becomes a SPOF
- the maximum bandwidth will be that of the HAProxy node (10Gb/s)
Thanks
[ceph-users] Can't mount btrfs volume on rbd
Hello, I'm getting an error when attempting to mount a volume on a host that was forcibly powered off:

# mount /dev/rbd4 climate-downscale-CMIP5/
mount: mount /dev/rbd4 on /mnt/climate-downscale-CMIP5 failed: Stale file handle

/var/log/messages:
Jun 10 15:31:07 node1 kernel: rbd4: unknown partition table

# parted /dev/rbd4 print
Model: Unknown (unknown)
Disk /dev/rbd4: 36.5TB
Sector size (logical/physical): 512B/512B
Partition Table: loop
Disk Flags:
Number  Start  End     Size    File system  Flags
 1      0.00B  36.5TB  36.5TB  btrfs

# btrfs check --repair /dev/rbd4
enabling repair mode
Checking filesystem on /dev/rbd4
UUID: dfe6b0c8-2866-4318-abc2-e1e75c891a5e
checking extents
cmds-check.c:2274: check_owner_ref: Assertion `rec->is_root` failed.
btrfs[0x4175cc]
btrfs[0x41b873]
btrfs[0x41c3fe]
btrfs[0x41dc1d]
btrfs[0x406922]

OS: CentOS 7.1; btrfs-progs: 3.16.2; Ceph: 0.94.1 on CentOS 7.1

I haven't found any references to 'stale file handle' on btrfs. The underlying block device is a Ceph RBD, so I've posted to both lists for any feedback. Also, once I reformatted btrfs I didn't get a mount error. The btrfs volume has been reformatted, so I won't be able to do much post-mortem, but I'm wondering if anyone has some insight.
Thanks, Steve
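For anyone who hits this before reformatting: kernels of that era have a read-only btrfs recovery mount that is worth trying before any destructive repair. Whether it would have helped here is untested, so treat this as a sketch:

  # try a read-only recovery mount before 'btrfs check --repair'
  mount -o ro,recovery /dev/rbd4 /mnt/climate-downscale-CMIP5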