Re: [ceph-users] Read Errors and OSD Flapping
Hello,

On Sat, 30 May 2015 22:23:22 +0100 Nick Fisk wrote:

> Hi All,
>
> I was noticing poor performance on my cluster and when I went to
> investigate I noticed OSD 29 was flapping up and down. On
> investigation it looks like it has 2 pending sectors, and the kernel
> log is filled with the following:
>
> end_request: critical medium error, dev sdk, sector 4483365656
> end_request: critical medium error, dev sdk, sector 4483365872
>
> I can see in the OSD logs that the OSD was crashing while trying to
> scrub the PG, probably failing when the kernel passes up the read
> error.
>
> ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
> 1: /usr/bin/ceph-osd() [0xacaf4a]
> 2: (()+0x10340) [0x7fdc43032340]
> 3: (gsignal()+0x39) [0x7fdc414d1cc9]
> 4: (abort()+0x148) [0x7fdc414d50d8]
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fdc41ddc6b5]
> 6: (()+0x5e836) [0x7fdc41dda836]
> 7: (()+0x5e863) [0x7fdc41dda863]
> 8: (()+0x5eaa2) [0x7fdc41ddaaa2]
> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0xbc2908]
> 10: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int, bool)+0xc98) [0x9168e8]
> 11: (ReplicatedBackend::be_deep_scrub(hobject_t const&, unsigned int, ScrubMap::object&, ThreadPool::TPHandle&)+0x2f9) [0xa05bf9]
> 12: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, unsigned int, ThreadPool::TPHandle&)+0x2c8) [0x8dab98]
> 13: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t&, hobject_t&, bool, unsigned int, ThreadPool::TPHandle&)+0x1fa) [0x7f099a]
> 14: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x4a2) [0x7f1132]
> 15: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0xbe) [0x6e583e]
> 16: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xbb38ae]
> 17: (ThreadPool::WorkThread::entry()+0x10) [0xbb4950]
> 18: (()+0x8182) [0x7fdc4302a182]
> 19: (clone()+0x6d) [0x7fdc4159547d]
>
> A few questions:
>
> 1. Is this the expected behaviour, or should Ceph try and do
> something to either keep the OSD down or rewrite the sector to cause
> a sector remap?

I guess what you see is what you get, but both things, especially the
rewrite, would be better. Alas, I suppose it is a bit of work for Ceph
to do the right thing there (getting the OSD to rewrite things with a
replica from another node) AND to be certain that this wasn't the last
good replica, read error or not.

> 2. I am monitoring SMART stats, but is there any other way of picking
> this up or getting Ceph to highlight it? Something like a
> flapping-OSD notification would be nice.

Lots of improvement opportunities in the Ceph status indeed, starting
with what constitutes which level (ERR, WRN, INF).

> 3. I'm assuming at this stage this disk will not be replaceable under
> warranty. Am I best to mark it as out, let it drain and then
> re-introduce it again, which should overwrite the sector and cause a
> remap? Or is there a better way?

That's the safe, easy way. You might want to add a dd zeroing of the
drive and a long SMART test afterwards for good measure before
re-adding it.

A faster way might be to determine which PG and file are affected and
just rewrite that, preferably with a good copy of the data. After that,
deep-scrub that PG, potentially doing a manual repair if this was the
acting copy.

Christian

> Many Thanks,
> Nick

-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
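For the archives, a minimal shell sketch of the drain, zero and re-add
path described above, assuming osd.29 on /dev/sdk as in this thread,
pools with more than one replica, and Ubuntu-style upstart service
management. This is an outline to adapt, not a tested procedure:

  # Drain the OSD; its PGs are re-replicated elsewhere while it is up.
  ceph osd out 29
  ceph -w                      # watch until recovery completes

  # Stop the daemon and overwrite the disk so the drive remaps the
  # pending sectors (this destroys all data on /dev/sdk).
  stop ceph-osd id=29
  dd if=/dev/zero of=/dev/sdk bs=1M oflag=direct

  # Long SMART self-test, then confirm nothing is still pending.
  smartctl -t long /dev/sdk
  smartctl -A /dev/sdk | egrep 'Reallocated_Sector|Current_Pending_Sector'

  # The zeroed disk must be re-created as an OSD (e.g. ceph-disk
  # prepare/activate) before it can take data and backfill again.

  # The faster path: once the affected PG is known and the bad file
  # has been rewritten, re-verify it (<pgid> is a placeholder):
  ceph pg deep-scrub <pgid>
  ceph pg repair <pgid>        # only if the bad copy was the acting one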
[ceph-users] Read Errors and OSD Flapping
Hi All,

I was noticing poor performance on my cluster and when I went to
investigate I noticed OSD 29 was flapping up and down. On investigation
it looks like it has 2 pending sectors, and the kernel log is filled
with the following:

end_request: critical medium error, dev sdk, sector 4483365656
end_request: critical medium error, dev sdk, sector 4483365872

I can see in the OSD logs that the OSD was crashing while trying to
scrub the PG, probably failing when the kernel passes up the read
error.

ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
1: /usr/bin/ceph-osd() [0xacaf4a]
2: (()+0x10340) [0x7fdc43032340]
3: (gsignal()+0x39) [0x7fdc414d1cc9]
4: (abort()+0x148) [0x7fdc414d50d8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fdc41ddc6b5]
6: (()+0x5e836) [0x7fdc41dda836]
7: (()+0x5e863) [0x7fdc41dda863]
8: (()+0x5eaa2) [0x7fdc41ddaaa2]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0xbc2908]
10: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int, bool)+0xc98) [0x9168e8]
11: (ReplicatedBackend::be_deep_scrub(hobject_t const&, unsigned int, ScrubMap::object&, ThreadPool::TPHandle&)+0x2f9) [0xa05bf9]
12: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, unsigned int, ThreadPool::TPHandle&)+0x2c8) [0x8dab98]
13: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t&, hobject_t&, bool, unsigned int, ThreadPool::TPHandle&)+0x1fa) [0x7f099a]
14: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x4a2) [0x7f1132]
15: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0xbe) [0x6e583e]
16: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xbb38ae]
17: (ThreadPool::WorkThread::entry()+0x10) [0xbb4950]
18: (()+0x8182) [0x7fdc4302a182]
19: (clone()+0x6d) [0x7fdc4159547d]

A few questions:

1. Is this the expected behaviour, or should Ceph try and do something
to either keep the OSD down or rewrite the sector to cause a sector
remap?

2. I am monitoring SMART stats, but is there any other way of picking
this up or getting Ceph to highlight it? Something like a flapping-OSD
notification would be nice.

3. I'm assuming at this stage this disk will not be replaceable under
warranty. Am I best to mark it as out, let it drain and then
re-introduce it again, which should overwrite the sector and cause a
remap? Or is there a better way?

Many Thanks,
Nick
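On question 2, a sketch of one way to monitor for this from cron,
checking SMART pending-sector counts and kernel medium errors. The
device glob and output format are illustrative only:

  #!/bin/sh
  # Warn on drives with pending (unreadable, not yet remapped) sectors.
  for dev in /dev/sd?; do
      pending=$(smartctl -A "$dev" | awk '/Current_Pending_Sector/ {print $10}')
      if [ -n "$pending" ] && [ "$pending" -gt 0 ]; then
          echo "WARNING: $dev reports $pending pending sectors"
      fi
  done

  # Count medium errors the kernel has logged since boot.
  dmesg | grep -c 'critical medium error'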
Re: [ceph-users] SSD disk distribution
Martin,

It all depends on your workload. For example, if you are not bothered
about write speed at all, I would say to configure the primary affinity
of your cluster so that the primary OSDs are the ones hosted on SSDs.
If you are considering 4 SSDs per node, that is a total of 56 SSDs and
14 * 12 = 168 HDDs, and I guess the numbers should work out reasonably
well (considering 1 OSD per disk). This should give your cluster
all-SSD-like read performance, but write performance won't improve (it
will remain HDD-like). In this case, making 2 or 3 all-SSD nodes with
high-performance servers makes sense, since all the read traffic will
be landing there and with SSDs you need a more powerful CPU complex.

If your workload is a read/write mix, I would say your theory of 2 SSDs
for journals and 2 for a cache pool makes sense. The journal helps only
for writes, while a cache tier can help for reads. But I must say I am
yet to evaluate cache tiering performance. In this case, as you said,
distributing the SSDs across all nodes is the correct approach.

Hope this helps,

Thanks & Regards,
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Martin Palma
Sent: Saturday, May 30, 2015 1:37 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] SSD disk distribution

Hello,

We are planning to deploy our first Ceph cluster with 14 storage nodes
and 3 monitor nodes. The storage nodes have 12 SATA disks and 4 SSDs;
2 of the SSDs we plan to use as journal disks and 2 for cache tiering.

Now the question was raised in our team whether it would be better to
put all the SSDs in, say, 2 storage nodes and consider them fast nodes,
or to distribute the SSDs for cache tiering over all 14 nodes (2 per
node).

In my opinion, if I understood the concept of Ceph right (I'm still in
the learning process ;-), distributing the SSDs across all storage
nodes would be better, since this would also distribute the network
traffic (client access) across all 14 nodes and not limit it to only 2
nodes. Right?

Any suggestion on that?

Best,
Martin
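For reference, the primary-affinity knobs mentioned above look roughly
like this on firefly/hammer; the OSD ids are placeholders:

  # Primary affinity is gated off by default; enable it on the monitors:
  #   [mon]
  #   mon osd allow primary affinity = true

  # Make an SSD-backed OSD the preferred primary, and an HDD-backed one
  # a primary only as a last resort:
  ceph osd primary-affinity osd.0 1.0
  ceph osd primary-affinity osd.1 0.0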
[ceph-users] osd crash with object store as newstore
Hi,

I built the Ceph code from wip-newstore on RHEL7 and am running
performance tests to compare with filestore. After a few hours of
running the tests the OSD daemons started to crash. Here is the stack
trace; the OSD crashes immediately after a restart, so I could not get
the OSD up and running.

ceph version eb8e22893f44979613738dfcdd40dada2b513118 (eb8e22893f44979613738dfcdd40dada2b513118)
1: /usr/bin/ceph-osd() [0xb84652]
2: (()+0xf130) [0x7f915f84f130]
3: (gsignal()+0x39) [0x7f915e2695c9]
4: (abort()+0x148) [0x7f915e26acd8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f915eb6d9d5]
6: (()+0x5e946) [0x7f915eb6b946]
7: (()+0x5e973) [0x7f915eb6b973]
8: (()+0x5eb9f) [0x7f915eb6bb9f]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xc84c5a]
10: (NewStore::collection_list_partial(coll_t, ghobject_t, int, int, snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x13c9) [0xa08639]
11: (PGBackend::objects_list_partial(hobject_t const&, int, int, snapid_t, std::vector<hobject_t, std::allocator<hobject_t> >*, hobject_t*)+0x352) [0x918a02]
12: (ReplicatedPG::do_pg_op(std::tr1::shared_ptr<OpRequest>)+0x1066) [0x8aa906]
13: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x1eb) [0x8cd06b]
14: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x85dbea]
15: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ed) [0x6c3f5d]
16: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x2e9) [0x6c4449]
17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xc746bf]
18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc767f0]
19: (()+0x7df3) [0x7f915f847df3]
20: (clone()+0x6d) [0x7f915e32a01d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Please let me know what the cause of this crash is.

Regards,
Srikanth
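As the NOTE in the trace says, the addresses are only meaningful
against the exact binary that crashed. A sketch of how they might be
resolved, assuming the binary is unstripped or the matching debuginfo
package is installed:

  # Translate a frame address into a demangled function and file:line.
  addr2line -Cfe /usr/bin/ceph-osd 0xa08639    # frame 10 above

  # Or dump a full disassembly with interleaved source for browsing.
  objdump -rdS /usr/bin/ceph-osd > ceph-osd.dump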
[ceph-users] SSD disk distribution
Hello,

We are planning to deploy our first Ceph cluster with 14 storage nodes
and 3 monitor nodes. The storage nodes have 12 SATA disks and 4 SSDs;
2 of the SSDs we plan to use as journal disks and 2 for cache tiering.

Now the question was raised in our team whether it would be better to
put all the SSDs in, say, 2 storage nodes and consider them fast nodes,
or to distribute the SSDs for cache tiering over all 14 nodes (2 per
node).

In my opinion, if I understood the concept of Ceph right (I'm still in
the learning process ;-), distributing the SSDs across all storage
nodes would be better, since this would also distribute the network
traffic (client access) across all 14 nodes and not limit it to only 2
nodes. Right?

Any suggestion on that?

Best,
Martin
Re: [ceph-users] SSD disk distribution
Hello,

See the current "Blocked requests/ops?" thread in this ML, especially
the later parts, and a number of similar threads.

In short, the CPU requirements for SSD-based pools are significantly
higher than for HDD or HDD/SSD-journal pools. So having dedicated SSD
nodes with fewer OSDs, faster CPUs and potentially a faster network
makes a lot of sense. It also helps a bit to keep you and your CRUSH
rules sane.

In your example you'd have 12 HDD-based OSDs with journals; plan for
1.5-2GHz of CPU per OSD (things will get CPU-bound with small write
IOPS). An SSD-based OSD (I'm assuming something like a DC S3700) will
eat all the CPU you can throw at it; 6-8GHz would be a pretty
conservative number. Search the archives for the latest
tests/benchmarks by others, don't take my (slightly dated) word for it.

Lastly, you may find, like others, that cache tiers currently aren't
all that great performance-wise.

Christian

On Sat, 30 May 2015 10:36:39 +0200 Martin Palma wrote:

> Hello,
>
> We are planning to deploy our first Ceph cluster with 14 storage
> nodes and 3 monitor nodes. The storage nodes have 12 SATA disks and
> 4 SSDs; 2 of the SSDs we plan to use as journal disks and 2 for cache
> tiering.
>
> Now the question was raised in our team whether it would be better to
> put all the SSDs in, say, 2 storage nodes and consider them fast
> nodes, or to distribute the SSDs for cache tiering over all 14 nodes
> (2 per node).
>
> In my opinion, if I understood the concept of Ceph right (I'm still
> in the learning process ;-), distributing the SSDs across all storage
> nodes would be better, since this would also distribute the network
> traffic (client access) across all 14 nodes and not limit it to only
> 2 nodes. Right?
>
> Any suggestion on that?
>
> Best,
> Martin

-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
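For anyone taking the dedicated-SSD-node route, the CRUSH side could
look roughly like this on hammer. The bucket, host and pool names are
made up, and the rule id at the end comes from 'ceph osd crush rule
dump':

  # Give the SSD-only hosts their own CRUSH root...
  ceph osd crush add-bucket ssd root
  ceph osd crush move node-ssd1 root=ssd
  ceph osd crush move node-ssd2 root=ssd

  # ...add a rule that places data only under that root, with host as
  # the failure domain...
  ceph osd crush rule create-simple ssd-rule ssd host

  # ...and point the cache pool at the new rule.
  ceph osd pool set cache-pool crush_ruleset 1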
Re: [ceph-users] RGW - Can't download complete object
The code has been backported and should be part of the firefly 0.80.10
release and the hammer 0.94.2 release.

Nathan

On 05/14/2015 07:30 AM, Yehuda Sadeh-Weinraub wrote:

The code is in wip-11620, and it's currently on top of the next branch.
We'll get it through the tests, then get it into hammer and firefly. I
wouldn't recommend installing it in production without proper testing
first.

Yehuda

----- Original Message -----
From: Sean Sullivan <seapasu...@uchicago.edu>
To: Yehuda Sadeh-Weinraub <yeh...@redhat.com>
Cc: ceph-users@lists.ceph.com
Sent: Wednesday, May 13, 2015 7:22:10 PM
Subject: Re: [ceph-users] RGW - Can't download complete object

Thank you so much Yehuda! I look forward to testing these. Is there a
way for me to pull this code in? Is it in master?

On May 13, 2015 7:08:44 PM Yehuda Sadeh-Weinraub <yeh...@redhat.com> wrote:

Ok, I dug a bit more, and it seems to me that the problem is with the
manifest that was created. I was able to reproduce a similar issue
(opened ceph bug #11622), for which I also have a fix. I created new
tests to cover this issue, and we'll get those recent fixes in as soon
as we can, after we test for any regressions.

Thanks,
Yehuda

----- Original Message -----
From: Yehuda Sadeh-Weinraub <yeh...@redhat.com>
To: Sean Sullivan <seapasu...@uchicago.edu>
Cc: ceph-users@lists.ceph.com
Sent: Wednesday, May 13, 2015 2:33:07 PM
Subject: Re: [ceph-users] RGW - Can't download complete object

That's another interesting issue. Note that for part 12_80 the manifest
specifies (I assume, by the messenger log) this part:

default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam.tJ8UddmcCxe0lOsgfHR9Q-ZHXdlrM14.12_80

(note the 'tJ8UddmcCxe0lOsgfHR9Q-ZHXdlrM14')

whereas it seems that you do have the original part:

default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam.2/-ztodNISNLlaNeV4kDmrQwmkECBP2mZ.12_80

(note the '2/...')

The part that the manifest specifies does not exist, which makes me
think that there is some weird upload sequence, something like:

- client uploads part, upload finishes but client does not get ack for it
- client retries (second upload)
- client gets ack for the first upload and gives up on the second one

But I'm not sure if it would explain the manifest; I'll need to take a
look at the code. Could such a sequence happen with the client that
you're using to upload?

Yehuda

----- Original Message -----
From: Sean Sullivan <seapasu...@uchicago.edu>
To: Yehuda Sadeh-Weinraub <yeh...@redhat.com>
Cc: ceph-users@lists.ceph.com
Sent: Wednesday, May 13, 2015 2:07:22 PM
Subject: Re: [ceph-users] RGW - Can't download complete object

Sorry for the delay. It took me a while to figure out how to do a range
request and append the data to a single file. The good news is that the
end file seems to be 14G in size, which matches the file's manifest
size. The bad news is that the file is completely corrupt and the
radosgw log has errors.
I am using the following code to perform the download:

https://raw.githubusercontent.com/mumrah/s3-multipart/master/s3-mp-download.py

Here is a clip of the log file:

2015-05-11 15:28:52.313742 7f570db7d700 1 -- 10.64.64.126:0/108 <== osd.11 10.64.64.101:6809/942707 5 ==== osd_op_reply(74566287 default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam.2/-ztodNISNLlaNeV4kDmrQwmkECBP2mZ.13_12 [read 0~858004] v0'0 uv41308 ondisk = 0) v6 ==== 304+0+858004 (1180387808 0 2445559038) 0x7f53d005b1a0 con 0x7f56f8119240
2015-05-11 15:28:52.313797 7f57067fc700 20 get_obj_aio_completion_cb: io completion ofs=12934184960 len=858004
2015-05-11 15:28:52.372453 7f570db7d700 1 -- 10.64.64.126:0/108 <== osd.45 10.64.64.101:6845/944590 2 ==== osd_op_reply(74566142 default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam.tJ8UddmcCxe0lOsgfHR9Q-ZHXdlrM14.12_80 [read 0~4194304] v0'0 uv0 ack = -2 ((2) No such file or directory)) v6 ==== 302+0+0 (3754425489 0 0) 0x7f53d005b1a0 con 0x7f56f81b1f30
2015-05-11 15:28:52.372494 7f57067fc700 20 get_obj_aio_completion_cb: io completion ofs=12145655808 len=4194304
2015-05-11 15:28:52.372501 7f57067fc700 0 ERROR: got unexpected error when trying to read object: -2
2015-05-11 15:28:52.426079 7f570db7d700 1 -- 10.64.64.126:0/108 <== osd.21 10.64.64.102:6856/1133473 16 ==== osd_op_reply(74566144 default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam.2/-ztodNISNLlaNeV4kDmrQwmkECBP2mZ.11_12 [read 0~3671316] v0'0 uv41395 ondisk = 0) v6 ==== 304+0+3671316 (1695485150 0 3933234139) 0x7f53d005b1a0 con 0x7f56f81e17d0
2015-05-11 15:28:52.426123 7f57067fc700 20 get_obj_aio_completion_cb: io completion ofs=10786701312 len=3671316
2015-05-11 15:28:52.504072 7f570db7d700 1 -- 10.64.64.126:0/108 <== osd.82 10.64.64.103:6857/88524 2 ==== osd_op_reply(74566283
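For completeness: the linked s3-mp-download.py fetches ranges in
parallel; a sequential shell equivalent might look like the sketch
below, with the endpoint, bucket and key as placeholders and awscli
assumed to be configured against the RGW endpoint. Note that a
multipart object's ETag is not a plain MD5, so the stitched file has to
be verified some other way, e.g. against a checksum taken before
upload.

  BUCKET=mybucket
  KEY=28357709e44fff211de63b1d2c437159.bam
  EP=http://rgw.example.com

  SIZE=$(aws --endpoint-url "$EP" s3api head-object \
          --bucket "$BUCKET" --key "$KEY" \
          --query ContentLength --output text)
  CHUNK=$((64 * 1024 * 1024))

  rm -f stitched.bam
  ofs=0
  while [ "$ofs" -lt "$SIZE" ]; do
      end=$((ofs + CHUNK - 1))
      [ "$end" -ge "$SIZE" ] && end=$((SIZE - 1))
      # Ranged GET of one chunk, appended to the output file.
      aws --endpoint-url "$EP" s3api get-object \
          --bucket "$BUCKET" --key "$KEY" \
          --range "bytes=${ofs}-${end}" part.tmp
      cat part.tmp >> stitched.bam
      ofs=$((ofs + CHUNK))
  done
  rm -f part.tmp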
Re: [ceph-users] umount stuck on NFS gateways switch over by using Pacemaker
Dear Eric:

Thanks for your information. The command 'reboot -fn' works well. I
have no idea whether anybody else has met the 'umount stuck' condition
like me. If possible, I hope I can find the reason why the fail-over
process doesn't work after 30 minutes.

WD

-----Original Message-----
From: Eric Eastman [mailto:eric.east...@keepertech.com]
Sent: Thursday, May 28, 2015 10:56 PM
To: WD Hwang/WHQ/Wistron
Cc: Ceph Users
Subject: Re: [ceph-users] umount stuck on NFS gateways switch over by using Pacemaker

On Thu, May 28, 2015 at 1:33 AM, <wd_hw...@wistron.com> wrote:

> Hello,
>
> I am testing NFS over RBD recently. I am trying to build an NFS HA
> environment under Ubuntu 14.04 for testing, with the following
> package versions:
>
> - Ubuntu 14.04: 3.13.0-32-generic (Ubuntu 14.04.2 LTS)
> - ceph: 0.80.9-0ubuntu0.14.04.2
> - ceph-common: 0.80.9-0ubuntu0.14.04.2
> - pacemaker (git20130802-1ubuntu2.3)
> - corosync (2.3.3-1ubuntu1)
>
> PS: I also tried ceph/ceph-common (0.87.1-1trusty and 0.87.2-1trusty)
> on 3.13.0-48-generic (Ubuntu 14.04.2) servers and got the same
> results.
>
> The environment has 5 nodes in the Ceph cluster (3 MONs and 5 OSDs)
> and two NFS gateways (nfs1 and nfs2) for high availability. I issued
> the command 'sudo service pacemaker stop' on 'nfs1' to force these
> resources to stop and be transferred to 'nfs2', and vice versa. When
> the two nodes are up and I issue 'sudo service pacemaker stop' on one
> node, the other node takes over all resources. Everything looks fine.
>
> Then I waited about 30 minutes, doing nothing to the NFS gateways,
> and repeated the previous steps to test the fail-over procedure. I
> found the process state of 'umount' was 'D' (uninterruptible sleep);
> 'ps' showed the following:
>
> root 21047 0.0 0.0 17412 952 ? D 16:39 0:00 umount /mnt/block1
>
> Any idea how to solve or work around this? Because of the stuck
> 'umount', neither 'reboot' nor 'shutdown' works, so unless I wait
> 20 minutes for the 'umount' to time out, the only thing I can do is
> power off the server directly.
>
> Any help would be much appreciated.

I am not sure how to get out of the stuck umount, but you can skip the
shutdown scripts that call the umount during a reboot using:

  reboot -fn

This can cause data loss, as it is like a power cycle, so it is best to
run 'sync' before running the 'reboot -fn' command to flush out
buffers. Sometimes when a system is really hung, 'reboot -fn' does not
work, but this seems to always work if run as root:

  echo 1 > /proc/sys/kernel/sysrq
  echo b > /proc/sysrq-trigger

Eric
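For the next reader who hits this: before power-cycling, it may be
worth seeing what the umount is actually blocked on. A sketch, using
the mount point from this thread; sysrq 'w' only writes blocked-task
stacks to the kernel log, unlike the 'b' (immediate reboot) used above:

  # Which processes still reference the mount? (This can itself hang
  # if the underlying RBD/filesystem is wedged; use a timeout if unsure.)
  fuser -vm /mnt/block1

  # Dump all D-state tasks with kernel stacks to the log, then read it
  # to see where umount is stuck.
  echo 1 > /proc/sys/kernel/sysrq
  echo w > /proc/sysrq-trigger
  dmesg | tail -n 60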