Re: [ceph-users] ceph-deploy jewel stopped working
Sorry about the mangled URLs in there; these are all from download.ceph.com rpm-jewel el7 x86_64.

Steve

> On Apr 21, 2016, at 1:17 PM, Stephen Lord wrote:
>
> Running this command
>
>     ceph-deploy install --stable jewel ceph00
>
> using the 1.5.32 version of ceph-deploy onto a Red Hat 7.2 system is failing today (it worked yesterday):
>
> [ceph00][DEBUG ] ================================================================================
> [ceph00][DEBUG ]  Package            Arch      Version           Repository   Size
> [ceph00][DEBUG ] ================================================================================
> [ceph00][DEBUG ] Installing:
> [ceph00][DEBUG ]  ceph-mds           x86_64    1:10.2.0-0.el7    ceph         2.8 M
> [ceph00][DEBUG ]  ceph-mon           x86_64    1:10.2.0-0.el7    ceph         2.8 M
> [ceph00][DEBUG ]  ceph-osd           x86_64    1:10.2.0-0.el7    ceph         9.0 M
> [ceph00][DEBUG ]  ceph-radosgw       x86_64    1:10.2.0-0.el7    ceph         245 k
> [ceph00][DEBUG ] Installing for dependencies:
> [ceph00][DEBUG ]  ceph-base          x86_64    1:10.2.0-0.el7    ceph         4.2 M
> [ceph00][DEBUG ]  ceph-common        x86_64    1:10.2.0-0.el7    ceph          15 M
> [ceph00][DEBUG ]  ceph-selinux       x86_64    1:10.2.0-0.el7    ceph          19 k
> [ceph00][DEBUG ] Updating for dependencies:
> [ceph00][DEBUG ]  libcephfs1         x86_64    1:10.2.0-0.el7    ceph         1.8 M
> [ceph00][DEBUG ]  librados2          x86_64    1:10.2.0-0.el7    ceph         1.9 M
> [ceph00][DEBUG ]  librados2-devel    x86_64    1:10.2.0-0.el7    ceph         474 k
> [ceph00][DEBUG ]  libradosstriper1   x86_64    1:10.2.0-0.el7    ceph         1.8 M
> [ceph00][DEBUG ]  librbd1            x86_64    1:10.2.0-0.el7    ceph         2.4 M
> [ceph00][DEBUG ]  librgw2            x86_64    1:10.2.0-0.el7    ceph         2.8 M
> [ceph00][DEBUG ]  python-cephfs      x86_64    1:10.2.0-0.el7    ceph          66 k
> [ceph00][DEBUG ]  python-rados       x86_64    1:10.2.0-0.el7    ceph         145 k
> [ceph00][DEBUG ]  python-rbd         x86_64    1:10.2.0-0.el7    ceph          61 k
> [ceph00][DEBUG ]
> [ceph00][DEBUG ] Transaction Summary
> [ceph00][DEBUG ] ================================================================================
> [ceph00][DEBUG ] Install  4 Packages (+3 Dependent packages)
> [ceph00][DEBUG ] Upgrade             ( 9 Dependent packages)
> [ceph00][DEBUG ]
> [ceph00][DEBUG ] Total download size: 45 M
> [ceph00][DEBUG ] Downloading packages:
> [ceph00][DEBUG ] Delta RPMs disabled because /usr/bin/applydeltarpm not installed.
> [ceph00][WARNIN] http://download.ceph.com/rpm-jewel/el7/x86_64/ceph-common-10.2.0-0.el7.x86_64.rpm: [Errno -1] Package does not match intended download. Suggestion: run yum --enablerepo=ceph clean metadata
> [ceph00][WARNIN] Trying other mirror.
> …
>
> I have cleaned up all the repo info on this end and it makes no difference. I
> suspect something in the last update to the site is wrong or missing; the
> repomd.xml file here:
>
> https://download.ceph.com/rpm-jewel/el7/x86_64/repodata/
>
> is a day older than all the packages, which may or may not be part of the
> issue.
>
> Steve
[ceph-users] ceph-deploy jewel stopped working
Running this command:

    ceph-deploy install --stable jewel ceph00

using the 1.5.32 version of ceph-deploy onto a Red Hat 7.2 system is failing today (it worked yesterday):

[ceph00][DEBUG ] ================================================================================
[ceph00][DEBUG ]  Package            Arch      Version           Repository   Size
[ceph00][DEBUG ] ================================================================================
[ceph00][DEBUG ] Installing:
[ceph00][DEBUG ]  ceph-mds           x86_64    1:10.2.0-0.el7    ceph         2.8 M
[ceph00][DEBUG ]  ceph-mon           x86_64    1:10.2.0-0.el7    ceph         2.8 M
[ceph00][DEBUG ]  ceph-osd           x86_64    1:10.2.0-0.el7    ceph         9.0 M
[ceph00][DEBUG ]  ceph-radosgw       x86_64    1:10.2.0-0.el7    ceph         245 k
[ceph00][DEBUG ] Installing for dependencies:
[ceph00][DEBUG ]  ceph-base          x86_64    1:10.2.0-0.el7    ceph         4.2 M
[ceph00][DEBUG ]  ceph-common        x86_64    1:10.2.0-0.el7    ceph          15 M
[ceph00][DEBUG ]  ceph-selinux       x86_64    1:10.2.0-0.el7    ceph          19 k
[ceph00][DEBUG ] Updating for dependencies:
[ceph00][DEBUG ]  libcephfs1         x86_64    1:10.2.0-0.el7    ceph         1.8 M
[ceph00][DEBUG ]  librados2          x86_64    1:10.2.0-0.el7    ceph         1.9 M
[ceph00][DEBUG ]  librados2-devel    x86_64    1:10.2.0-0.el7    ceph         474 k
[ceph00][DEBUG ]  libradosstriper1   x86_64    1:10.2.0-0.el7    ceph         1.8 M
[ceph00][DEBUG ]  librbd1            x86_64    1:10.2.0-0.el7    ceph         2.4 M
[ceph00][DEBUG ]  librgw2            x86_64    1:10.2.0-0.el7    ceph         2.8 M
[ceph00][DEBUG ]  python-cephfs      x86_64    1:10.2.0-0.el7    ceph          66 k
[ceph00][DEBUG ]  python-rados       x86_64    1:10.2.0-0.el7    ceph         145 k
[ceph00][DEBUG ]  python-rbd         x86_64    1:10.2.0-0.el7    ceph          61 k
[ceph00][DEBUG ]
[ceph00][DEBUG ] Transaction Summary
[ceph00][DEBUG ] ================================================================================
[ceph00][DEBUG ] Install  4 Packages (+3 Dependent packages)
[ceph00][DEBUG ] Upgrade             ( 9 Dependent packages)
[ceph00][DEBUG ]
[ceph00][DEBUG ] Total download size: 45 M
[ceph00][DEBUG ] Downloading packages:
[ceph00][DEBUG ] Delta RPMs disabled because /usr/bin/applydeltarpm not installed.
[ceph00][WARNIN] http://download.ceph.com/rpm-jewel/el7/x86_64/ceph-common-10.2.0-0.el7.x86_64.rpm: [Errno -1] Package does not match intended download. Suggestion: run yum --enablerepo=ceph clean metadata
[ceph00][WARNIN] Trying other mirror.
…
I have cleaned up all the repo info on this end and it makes no difference. I suspect something in the last update to the site is wrong or missing; the repomd.xml file here:

https://download.ceph.com/rpm-jewel/el7/x86_64/repodata/

is a day older than all the packages, which may or may not be part of the issue.

Steve

--
The information contained in this transmission may be confidential. Any disclosure, copying, or further distribution of confidential information is not permitted unless such privilege is explicitly granted in writing by Quantum. Quantum reserves the right to have electronic communications, including email and attachments, sent across its networks filtered through anti-virus and spam software programs and retain such messages in order to comply with applicable data security and retention requirements. Quantum is not responsible for the proper and complete transmission of the substance of this communication or for any delay in its receipt.
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
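[Editor's note: for anyone hitting the same "Package does not match intended download" error, the suggestion in the warning can be expanded into a quick consistency check. A sketch, assuming yum-utils is installed for yumdownloader; the package filename is the one from this thread:]

```shell
# Clear the stale ceph repo metadata and re-fetch it, as the yum error suggests.
yum --enablerepo=ceph clean metadata
yum makecache

# Pull the failing package directly and checksum it. The result can then be
# compared by hand against the <checksum> entry for ceph-common in the repo's
# primary.xml (under repodata/); a mismatch confirms inconsistent repodata.
yumdownloader --enablerepo=ceph ceph-common
sha256sum ceph-common-10.2.0-0.el7.x86_64.rpm
```

If the checksum in repomd.xml/primary.xml is older than the package actually being served, that would explain both the error and the repodata timestamp skew noted above.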
Re: [ceph-users] ceph cache tier clean rate too low
o, not how fast you can get things out of it :-(

I have been using readforward and that is working OK; there is sufficient read bandwidth that it does not matter if data is coming from the cache pool or the disk backing pool.

Steve

> On Apr 19, 2016, at 7:47 PM, Christian Balzer wrote:
>
> Hello,
>
> On Tue, 19 Apr 2016 20:21:39 + Stephen Lord wrote:
>
>> I have a setup using some Intel P3700 devices as a cache tier, and 33
>> SATA drives hosting the pool behind them.
>
> A bit more detail about the setup would be nice, as in how many nodes,
> interconnect, replication size of the cache tier and the backing HDD
> pool, etc.
> And "some" isn't a number: how many P3700s (which size?) in how many nodes?
> One assumes there are no further SSDs involved with those SATA HDDs?
>
>> I set up the cache tier with writeback, gave it a size and max object
>> count, etc.:
>>
>> ceph osd pool set target_max_bytes 5000
>                    ^^^
> This should have given you an error; it needs the pool name, as in your
> next line.
>
>> ceph osd pool set nvme target_max_bytes 5000
>> ceph osd pool set nvme target_max_objects 50
>> ceph osd pool set nvme cache_target_dirty_ratio 0.5
>> ceph osd pool set nvme cache_target_full_ratio 0.8
>>
>> This is all running Jewel using bluestore OSDs (I know, experimental).
>
> Make sure to report all pyrotechnics, trap doors and sharp edges. ^_-
>
>> The cache tier will write at about 900 Mbytes/sec and read at 2.2
>> Gbytes/sec; the SATA pool can take writes at about 600 Mbytes/sec in
>> aggregate.
>     ^
> Key word there.
>
> That's just 18MB/s per HDD (60MB/s with a replication of 3), a pretty
> disappointing result for the supposedly twice as fast BlueStore.
> Again, replication size and topology might explain that up to a point, but
> we don't know them (yet).
>
> Also exact methodology of your tests please, i.e. the fio command line, how
> was the RBD device (if you tested with one) mounted and where, etc...
>
>> However, it looks like the mechanism for cleaning the cache
>> down to the disk layer is being massively rate limited, and I see about
>> 47 Mbytes/sec of read activity from each SSD while this is going on.
>
> This number is meaningless w/o knowing how many NVMe's you have.
> That being said, there are 2 levels of flushing past Hammer, but if you
> push the cache tier to the 2nd limit (cache_target_dirty_high_ratio) you
> will get full speed.
>
>> This means that while I could be pushing data into the cache at high
>> speed, it cannot evict old content very fast at all, and it is very easy
>> to hit the high water mark; the application I/O drops dramatically as
>> it becomes throttled by how fast the cache can flush.
>>
>> I suspect it is operating on a placement group at a time, so it ends up
>> targeting a very limited number of objects and hence disks at any one
>> time. I can see individual disk drives going busy for very short
>> periods, but most of them are idle at any one point in time. The only
>> way to drive the disk based OSDs fast is to hit a lot of them at once,
>> which would mean issuing many cache flush operations in parallel.
>
> Yes, it is all PG based, so your observations match the expectations and
> what everybody else is seeing.
> See also the thread "Cache tier operation clarifications" by me; version 2
> is in the works.
> There are also some new knobs in Jewel that may be helpful, see:
> http://www.spinics.net/lists/ceph-users/msg25679.html
>
> If you have a use case with a clearly defined idle/low use time and a
> small enough growth in dirty objects, consider what I'm doing: dropping the
> cache_target_dirty_ratio a few percent (in my case 2-3% is enough for a
> whole day) via cron job, waiting a bit, and then raising it again to its
> normal value.
>
> That way flushes won't normally happen at all during your peak usage
> times, though in my case that's purely cosmetic; flushes are not
> problematic at any time in that cluster currently.
>
>> Are there any controls which can influence this behavior?
>
> See above (cache_target_dirty_high_ratio).
>
> Aside from that, you might want to reflect on what your use case and
> workload are going to be, and how your testing reflects on it.
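[Editor's note: the cron-driven flush window Christian describes might look roughly like this. The pool name "nvme" is from this thread; the ratio values and the one-hour window are illustrative placeholders, not his actual settings:]

```shell
# Run from cron at the start of the nightly idle window.
# Temporarily lower the dirty ratio so flushing to the backing pool
# starts now, during off-peak hours, instead of during peak load.
ceph osd pool set nvme cache_target_dirty_ratio 0.47

# Give the tiering agent time to flush the accumulated dirty objects.
sleep 3600

# Restore the normal threshold so daytime writes don't trigger flushes.
ceph osd pool set nvme cache_target_dirty_ratio 0.5
```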
[ceph-users] ceph cache tier clean rate too low
I have a setup using some Intel P3700 devices as a cache tier, and 33 SATA drives hosting the pool behind them. I set up the cache tier with writeback, gave it a size and max object count, etc.:

ceph osd pool set target_max_bytes 5000
ceph osd pool set nvme target_max_bytes 5000
ceph osd pool set nvme target_max_objects 50
ceph osd pool set nvme cache_target_dirty_ratio 0.5
ceph osd pool set nvme cache_target_full_ratio 0.8

This is all running Jewel using bluestore OSDs (I know, experimental). The cache tier will write at about 900 Mbytes/sec and read at 2.2 Gbytes/sec; the SATA pool can take writes at about 600 Mbytes/sec in aggregate.

However, it looks like the mechanism for cleaning the cache down to the disk layer is being massively rate limited, and I see about 47 Mbytes/sec of read activity from each SSD while this is going on. This means that while I could be pushing data into the cache at high speed, it cannot evict old content very fast at all, and it is very easy to hit the high water mark; the application I/O drops dramatically as it becomes throttled by how fast the cache can flush.

I suspect it is operating on a placement group at a time, so it ends up targeting a very limited number of objects and hence disks at any one time. I can see individual disk drives going busy for very short periods, but most of them are idle at any one point in time. The only way to drive the disk based OSDs fast is to hit a lot of them at once, which would mean issuing many cache flush operations in parallel.

Are there any controls which can influence this behavior?

Thanks

Steve
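[Editor's note: the flush/evict thresholds discussed in the replies to this post can all be set per cache pool. A sketch using the pool name from the post; the values are illustrative, and cache_target_dirty_high_ratio is the Jewel-era knob the replies point to:]

```shell
# Start background (slow) flushing once 40% of the cache is dirty.
ceph osd pool set nvme cache_target_dirty_ratio 0.4

# Above 60% dirty, flush at full speed -- this is the second flushing
# level added after Hammer that the replies recommend tuning.
ceph osd pool set nvme cache_target_dirty_high_ratio 0.6

# Start evicting clean objects once the cache is 80% full.
ceph osd pool set nvme cache_target_full_ratio 0.8
```

The gap between dirty_ratio and dirty_high_ratio is effectively the buffer of dirty data the cache can absorb before flushing becomes aggressive enough to throttle client I/O.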
[ceph-users] Bluestore OSD died - error (39) Directory not empty not handled on operation 21
I was experimenting with using bluestore OSDs and appear to have found a fairly consistent way to crash them… Changing the number of copies in a pool down from 3 to 1 has now twice caused the mass panic of a whole pool of OSDs. In one case it was a cache tier; in another case it was just a pool hosting rbd images.

From the log file of one of the OSDs:

2016-04-05 12:09:54.272475 7f5a58027700  0 bluestore(/var/lib/ceph/osd/ceph-43) error (39) Directory not empty not handled on operation 21 (op 1, counting from 0)
2016-04-05 12:09:54.272489 7f5a58027700  0 bluestore(/var/lib/ceph/osd/ceph-43) transaction dump:
{
    "ops": [
        {
            "op_num": 0,
            "op_name": "remove",
            "collection": "2.354_head",
            "oid": "#2:2ac0head#"
        },
        {
            "op_num": 1,
            "op_name": "rmcoll",
            "collection": "2.354_head"
        }
    ]
}
2016-04-05 12:09:54.275114 7f5a58027700 -1 os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7f5a58027700 time 2016-04-05 12:09:54.272532
os/bluestore/BlueStore.cc: 4357: FAILED assert(0 == "unexpected error")

ceph version 10.1.0 (96ae8bd25f31862dbd5302f304ebf8bf1166aba6)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f5a82e74a55]
2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x77a) [0x7f5a82b02eba]
3: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector >&, std::shared_ptr, ThreadPool::TPHandle*)+0x3a5) [0x7f5a82b056e5]
4: (ObjectStore::queue_transactions(ObjectStore::Sequencer*, std::vector >&, Context*, Context*, Context*, Context*, std::shared_ptr)+0x2a6) [0x7f5a82aad0b6]
5: (OSD::RemoveWQ::_process(std::pair, std::shared_ptr >, ThreadPool::TPHandle&)+0x6e4) [0x7f5a827debb4]
6: (ThreadPool::WorkQueueVal, std::shared_ptr >, std::pair, std::shared_ptr > >::_void_process(void*, ThreadPool::TPHandle&)+0x11a) [0x7f5a8283a15a]
7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa7e) [0x7f5a82e65a9e]
8: (ThreadPool::WorkThread::entry()+0x10) [0x7f5a82e66980]
9: (()+0x7dc5) [0x7f5a80dbedc5]
10: (clone()+0x6d) [0x7f5a7f44a28d]

In both cases a replicated pool with 3 copies was created, some content added, and then the number of copies set down to 1. Not a common thing to do, I know, but this works on FileStore OSDs.

This is a cluster deployed using Red Hat 7 Jewel (10.1) RPMs from download.ceph.com

Steve
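[Editor's note: the reproduction described above can be condensed into a short command sequence. The pool name is hypothetical; this is a sketch of the reported trigger, to be run only on a disposable test cluster:]

```shell
# Create a 3-replica pool (default size is typically 3) and add content.
ceph osd pool create testpool 64 64 replicated
rados -p testpool put obj1 /etc/hosts

# Shrinking the replica count from 3 to 1 is the step that reportedly
# crashes every bluestore OSD in the pool ("Directory not empty" on the
# rmcoll transaction), while FileStore OSDs handle it fine.
ceph osd pool set testpool size 1
```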
Re: [ceph-users] ceph-disk from jewel has issues on redhat 7
My ceph-deploy came from the download.ceph.com site and it is 1.5.31-0. This code is in ceph itself though; the deploy logic is where the code appears to do the right thing ;-)

Steve

> On Mar 15, 2016, at 2:38 PM, Vasu Kulkarni wrote:
>
> Thanks for the steps, that should be enough to test it out. I hope you got the
> latest ceph-deploy either from pip or through github.
>
> On Tue, Mar 15, 2016 at 12:29 PM, Stephen Lord wrote:
> I would have to nuke my cluster right now, and I do not have a spare one…
>
> The procedure though is literally this, given a 3 node Red Hat 7.2 cluster,
> ceph00, ceph01 and ceph02:
>
> ceph-deploy install --testing ceph00 ceph01 ceph02
> ceph-deploy new ceph00 ceph01 ceph02
> ceph-deploy mon create ceph00 ceph01 ceph02
> ceph-deploy gatherkeys ceph00
> ceph-deploy osd create ceph00:sdb:/dev/sdi
> ceph-deploy osd create ceph00:sdc:/dev/sdi
>
> All devices have their partition tables wiped before this. They are all just
> SATA devices, no special devices in the way.
>
> sdi is an SSD and it is being carved up for journals. The first osd create
> works; the second one gets stuck in a loop in the update_partition call in
> ceph_disk for the 5 iterations before it gives up. When I look in
> /sys/block/sdi the partition for the first osd is visible, the one for the
> second is not. However, looking at /proc/partitions it sees the correct thing.
> So I suspect something about partprobe is not kicking udev into doing the
> right thing when the second partition is added.
>
> If I do not use the separate journal device then it usually works, but
> occasionally I see a single retry in that same loop.
>
> There is code in ceph_deploy which uses partprobe or partx depending on which
> distro it detects; that is how I worked out what to change here.
>
> If I have to tear things down again I will reproduce and post here.
>
> Steve
>
>> On Mar 15, 2016, at 2:12 PM, Vasu Kulkarni wrote:
>>
>> Do you mind giving the full failed logs somewhere on fpaste.org along with
>> some os version details?
>> There are some known issues on RHEL. If you use 'osd prepare' and 'osd
>> activate' (specifying just the journal partition here) it might work better.
>>
>> On Tue, Mar 15, 2016 at 12:05 PM, Stephen Lord wrote:
>> Not multipath, if you mean using the multipath driver; just trying to set up
>> OSDs which use a data disk and a journal SSD. If I run just a disk based
>> OSD and only specify one device to ceph-deploy then it usually works,
>> although sometimes it has to retry. In the case where I am using it to carve
>> an SSD into several partitions for journals it fails on the second one.
>>
>> Steve
Re: [ceph-users] ceph-disk from jewel has issues on redhat 7
I would have to nuke my cluster right now, and I do not have a spare one…

The procedure though is literally this, given a 3 node Red Hat 7.2 cluster, ceph00, ceph01 and ceph02:

ceph-deploy install --testing ceph00 ceph01 ceph02
ceph-deploy new ceph00 ceph01 ceph02
ceph-deploy mon create ceph00 ceph01 ceph02
ceph-deploy gatherkeys ceph00
ceph-deploy osd create ceph00:sdb:/dev/sdi
ceph-deploy osd create ceph00:sdc:/dev/sdi

All devices have their partition tables wiped before this. They are all just SATA devices, no special devices in the way.

sdi is an SSD and it is being carved up for journals. The first osd create works; the second one gets stuck in a loop in the update_partition call in ceph_disk for the 5 iterations before it gives up. When I look in /sys/block/sdi the partition for the first osd is visible, the one for the second is not. However, looking at /proc/partitions it sees the correct thing. So I suspect something about partprobe is not kicking udev into doing the right thing when the second partition is added.

If I do not use the separate journal device then it usually works, but occasionally I see a single retry in that same loop.

There is code in ceph_deploy which uses partprobe or partx depending on which distro it detects; that is how I worked out what to change here.

If I have to tear things down again I will reproduce and post here.

Steve

> On Mar 15, 2016, at 2:12 PM, Vasu Kulkarni wrote:
>
> Do you mind giving the full failed logs somewhere on fpaste.org along with
> some os version details?
> There are some known issues on RHEL. If you use 'osd prepare' and 'osd
> activate' (specifying just the journal partition here) it might work better.
>
> On Tue, Mar 15, 2016 at 12:05 PM, Stephen Lord wrote:
> Not multipath, if you mean using the multipath driver; just trying to set up
> OSDs which use a data disk and a journal SSD.
> If I run just a disk based OSD and only specify one device to ceph-deploy
> then it usually works, although sometimes it has to retry. In the case where
> I am using it to carve an SSD into several partitions for journals it fails
> on the second one.
>
> Steve
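[Editor's note: the post says the partition tables are wiped before each run. The thread doesn't say which tool was used; one common way to do this, as an assumption, is sgdisk from gdisk, with the device names from the procedure above:]

```shell
# Destroy GPT and MBR data structures on each device before ceph-deploy
# touches them -- DESTRUCTIVE, device names are the ones from this thread.
for dev in /dev/sdb /dev/sdc /dev/sdi; do
    sgdisk --zap-all "$dev"
done
```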
Re: [ceph-users] ceph-disk from jewel has issues on redhat 7
Not multipath, if you mean using the multipath driver; just trying to set up OSDs which use a data disk and a journal SSD. If I run just a disk based OSD and only specify one device to ceph-deploy then it usually works, although sometimes it has to retry. In the case where I am using it to carve an SSD into several partitions for journals it fails on the second one.

Steve

> On Mar 15, 2016, at 1:45 PM, Vasu Kulkarni wrote:
>
> The ceph-deploy suite and also the selinux suite (which isn't merged yet)
> indirectly test ceph-disk and have been run on Jewel as well. I guess the
> issue Stephen is seeing is on a multipath device, which I believe is a
> known issue.
>
> On Tue, Mar 15, 2016 at 11:42 AM, Gregory Farnum wrote:
> There's a ceph-disk suite from last August that Loïc set up, but based
> on the qa list it wasn't running for a while and isn't in great shape.
> :/ I know there are some CentOS7 boxes in the sepia lab but it might
> not be enough for a small and infrequently-run test to reliably get
> tested against them.
> -Greg
>
> On Tue, Mar 15, 2016 at 11:04 AM, Ben Hines wrote:
> > It seems like ceph-disk is often breaking on centos/redhat systems. Does it
> > have automated tests in the ceph release structure?
> >
> > -Ben
> >
> > On Tue, Mar 15, 2016 at 8:52 AM, Stephen Lord wrote:
> >>
> >> Hi,
> >>
> >> The ceph-disk (10.0.4 version) command seems to have problems operating on
> >> a Red Hat 7 system; it uses the partprobe command unconditionally to update
> >> the partition table. I had to change this to partx -u to get past this:
> >>
> >> @@ -1321,13 +1321,13 @@
> >>          processed, i.e. the 95-ceph-osd.rules actions and mode changes,
> >>          group changes etc. are complete.
> >>          """
> >> -        LOG.debug('Calling partprobe on %s device %s', description, dev)
> >> +        LOG.debug('Calling partx on %s device %s', description, dev)
> >>          partprobe_ok = False
> >>          error = 'unknown error'
> >>          for i in (1, 2, 3, 4, 5):
> >>              command_check_call(['udevadm', 'settle', '--timeout=600'])
> >>              try:
> >> -                _check_output(['partprobe', dev])
> >> +                _check_output(['partx', '-u', dev])
> >>                  partprobe_ok = True
> >>                  break
> >>              except subprocess.CalledProcessError as e:
> >>
> >> It really needs to be doing that conditional on the operating system
> >> version.
> >>
> >> Steve
[ceph-users] ceph-disk from jewel has issues on redhat 7
Hi,

The ceph-disk (10.0.4 version) command seems to have problems operating on a Red Hat 7 system; it uses the partprobe command unconditionally to update the partition table. I had to change this to partx -u to get past this:

@@ -1321,13 +1321,13 @@
         processed, i.e. the 95-ceph-osd.rules actions and mode changes,
         group changes etc. are complete.
         """
-        LOG.debug('Calling partprobe on %s device %s', description, dev)
+        LOG.debug('Calling partx on %s device %s', description, dev)
         partprobe_ok = False
         error = 'unknown error'
         for i in (1, 2, 3, 4, 5):
             command_check_call(['udevadm', 'settle', '--timeout=600'])
             try:
-                _check_output(['partprobe', dev])
+                _check_output(['partx', '-u', dev])
                 partprobe_ok = True
                 break
             except subprocess.CalledProcessError as e:

It really needs to be doing that conditional on the operating system version.

Steve
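[Editor's note: the OS-conditional behavior the post argues for can be sketched in shell. The function name and structure are hypothetical, not from ceph-disk; it mirrors the distro detection the poster says ceph-deploy already does:]

```shell
# Hedged sketch: refresh the kernel's view of a device's partition table,
# picking partx on RHEL/CentOS (where partprobe reportedly fails to kick
# udev for a second partition) and partprobe elsewhere.
refresh_partitions() {
    dev="$1"
    # Read the distro ID from os-release in a subshell so we don't
    # pollute the caller's environment.
    distro_id=$(. /etc/os-release 2>/dev/null && echo "$ID")
    case "$distro_id" in
        rhel|centos)
            partx -u "$dev"     # update in place, per the post's patch
            ;;
        *)
            partprobe "$dev"
            ;;
    esac
}
```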
Re: [ceph-users] why is there heavy read traffic during object delete?
I saw this go by in the commit log:

commit cc2200c5e60caecf7931e546f6522b2ba364227f
Merge: f8d5807 12c083e
Author: Sage Weil
Date: Thu Feb 11 08:44:35 2016 -0500

    Merge pull request #7537 from ifed01/wip-no-promote-for-delete-fix

    osd: fix unnecessary object promotion when deleting from cache pool

    Reviewed-by: Sage Weil

Is there any chance that I was basically seeing the same thing from the filesystem standpoint?

Thanks

Steve

> On Feb 5, 2016, at 8:42 AM, Gregory Farnum wrote:
>
> On Fri, Feb 5, 2016 at 6:39 AM, Stephen Lord wrote:
>>
>> I looked at this system this morning, and it actually finished what it was
>> doing. The erasure coded pool still contains all the data and the cache
>> pool has about a million zero sized objects:
>>
>> GLOBAL:
>>     SIZE       AVAIL     RAW USED     %RAW USED     OBJECTS
>>     15090G     9001G     6080G        40.29         2127k
>> POOLS:
>>     NAME            ID     CATEGORY     USED      %USED     MAX AVAIL     OBJECTS     DIRTY     READ      WRITE
>>     cache-data      21     -            0         0         7962G         1162258     1057k     22969     3220k
>>     cephfs-data     22     -            3964G     26.27     5308G         1014840     991k      891k      1143k
>>
>> Definitely seems like a bug, since I removed all references to these from
>> the filesystem which created them.
>>
>> I originally wrote 4.5 Tbytes of data into the file system; the erasure
>> coded pool is set up as 4+2, and the cache has a size limit of 1 Tbyte.
>> Looks like not all the data made it out of the cache tier before I removed
>> content: it removed the content which was only present in the cache tier
>> and created a zero sized object in the cache for all the content. The used
>> capacity is somewhat consistent with this.
>>
>> I tried to look at the extended attributes on one of the zero size objects
>> with ceph-dencoder, but it failed:
>>
>> error: buffer::malformed_input: void
>> object_info_t::decode(ceph::buffer::list::iterator&) unknown encoding
>> version > 15
>>
>> Same error on one of the objects in the erasure coded pool.
>>
>> Looks like I am a little too bleeding edge for this, or the contents of the
>> .ceph_ attribute are not an object_info_t
>
> ghobject_info_t
>
> You can get the EC stuff actually deleted by getting the cache pool to
> flush everything. That's discussed in the docs and in various mailing
> list archives.
> -Greg
>
>> Steve
>>
>>> On Feb 4, 2016, at 7:10 PM, Gregory Farnum wrote:
>>>
>>> On Thu, Feb 4, 2016 at 5:07 PM, Stephen Lord wrote:
>>>>
>>>>> On Feb 4, 2016, at 6:51 PM, Gregory Farnum wrote:
>>>>>
>>>>> I presume we're doing reads in order to gather some object metadata
>>>>> from the cephfs-data pool; and the (small) newly-created objects in
>>>>> cache-data are definitely whiteout objects indicating the object no
>>>>> longer exists logically.
>>>>>
>>>>> What kinds of reads are you actually seeing? Does it appear to be
>>>>> transferring data, or merely doing a bunch of seeks? I thought we were
>>>>> trying to avoid doing reads-to-delete, but perhaps the way we're
>>>>> handling snapshots or something is invoking behavior that isn't
>>>>> amicable to a full-FS delete.
>>>>>
>>>>> I presume you're trying to characterize the system's behavior, but of
>>>>> course if you just want to empty it out entirely you're better off
>>>>> deleting the pools and the CephFS instance entirely and then starting
>>>>> it over again from scratch.
>>>>> -Greg
>>>>
>>>> I believe it is reading all the data, just from the volume of traffic;
>>>> the cpu load on the OSDs maybe suggests it is doing more than just that.
>>>>
>>>> iostat is showing a lot of data moving; I am seeing about the same volume
>>>> of read and write activity here. Because the OSDs underneath both pools
>>>> are the same ones (I know that's not exactly optimal), it is hard to tell
>>>> which pool is responsible for which I/O. Large reads and small writes
>>>> suggest it is reading up all the data from th
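[Editor's note: the flush Greg refers to can be driven from the command line. A sketch using the cache pool name from this thread:]

```shell
# Flush all dirty objects from the cache tier to the backing (EC) pool and
# evict the clean copies; this is what actually releases the EC-pool data
# for the deleted objects. It iterates over every object, so it can take
# a long time on a large cache pool.
rados -p cache-data cache-flush-evict-all
```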
Re: [ceph-users] why is there heavy read traffic during object delete?
I looked at this system this morning, and it actually finished what it was
doing. The erasure coded pool still contains all the data and the cache pool
has about a million zero sized objects:

GLOBAL:
    SIZE     AVAIL   RAW USED   %RAW USED   OBJECTS
    15090G   9001G   6080G      40.29       2127k
POOLS:
    NAME          ID   CATEGORY   USED    %USED   MAX AVAIL   OBJECTS   DIRTY   READ    WRITE
    cache-data    21   -          0       0       7962G       1162258   1057k   22969   3220k
    cephfs-data   22   -          3964G   26.27   5308G       1014840   991k    891k    1143k

Definitely seems like a bug since I removed all references to these from the
filesystem which created them.

I originally wrote 4.5 Tbytes of data into the file system, the erasure coded
pool is set up as 4+2, and the cache has a size limit of 1 Tbyte. Looks like
not all the data made it out of the cache tier before I removed content; it
removed the content which was only present in the cache tier and created a
zero sized object in the cache for all the content. The used capacity is
somewhat consistent with this.

I tried to look at the extended attributes on one of the zero sized objects
with ceph-dencoder, but it failed:

error: buffer::malformed_input: void
object_info_t::decode(ceph::buffer::list::iterator&) unknown encoding
version > 15

Same error on one of the objects in the erasure coded pool.

Looks like I am a little too bleeding edge for this, or the contents of the
.ceph_ attribute are not an object_info_t

Steve

> On Feb 4, 2016, at 7:10 PM, Gregory Farnum wrote:
>
> On Thu, Feb 4, 2016 at 5:07 PM, Stephen Lord wrote:
>>
>>> On Feb 4, 2016, at 6:51 PM, Gregory Farnum wrote:
>>>
>>> I presume we're doing reads in order to gather some object metadata
>>> from the cephfs-data pool; and the (small) newly-created objects in
>>> cache-data are definitely whiteout objects indicating the object no
>>> longer exists logically.
>>>
>>> What kinds of reads are you actually seeing? Does it appear to be
>>> transferring data, or merely doing a bunch of seeks? I thought we were
>>> trying to avoid doing reads-to-delete, but perhaps the way we're
>>> handling snapshots or something is invoking behavior that isn't
>>> amicable to a full-FS delete.
>>>
>>> I presume you're trying to characterize the system's behavior, but of
>>> course if you just want to empty it out entirely you're better off
>>> deleting the pools and the CephFS instance entirely and then starting
>>> it over again from scratch.
>>> -Greg
>>
>> I believe it is reading all the data, just from the volume of traffic,
>> and the cpu load on the OSDs suggests it may be doing more than just
>> that.
>>
>> iostat is showing a lot of data moving, I am seeing about the same volume
>> of read and write activity here. Because the OSDs underneath both pools
>> are the same ones, which I know is not exactly optimal, it is hard to
>> tell which pool is responsible for which I/O. Large reads and small
>> writes suggest it is reading up all the data from the objects; the write
>> traffic is, I presume, all journal activity relating to deleting objects
>> and creating the empty ones.
>>
>> The 9:1 ratio between things being deleted and created seems odd though.
>>
>> A previous version of this exercise with just a regular replicated data
>> pool did not read anything, just a lot of write activity, and eventually
>> the content disappeared. So definitely related to the pool configuration
>> here and probably not to the filesystem layer.
>
> Sam, does this make any sense to you in terms of how RADOS handles deletes?
> -Greg

--
The information contained in this transmission may be confidential. Any
disclosure, copying, or further distribution of confidential information is
not permitted unless such privilege is explicitly granted in writing by
Quantum.
Quantum reserves the right to have electronic communications, including email and attachments, sent across its networks filtered through anti virus and spam software programs and retain such messages in order to comply with applicable data security and retention requirements. Quantum is not responsible for the proper and complete transmission of the substance of this communication or for any delay in its receipt. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
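[Editorial aside: the capacity figures in the message above can be sanity-checked with a bit of arithmetic, assuming only the 4+2 erasure profile and the `ceph df` numbers quoted in the thread; nothing below comes from the cluster itself.]

```python
# Sanity check of the ceph df figures above, assuming the 4+2 erasure
# profile described in the thread: each logical byte is stored as k data
# chunks plus m coding chunks, i.e. (k + m) / k raw bytes.
k, m = 4, 2
logical_used_gb = 3964                  # USED for cephfs-data
raw_gb = logical_used_gb * (k + m) / k  # raw space the EC pool should consume

print(f"expected EC raw usage: ~{raw_gb:.0f}G")                 # ~5946G
print(f"cluster RAW USED was 6080G; difference ~{6080 - raw_gb:.0f}G")
# The ~134G remainder is plausibly journals, the metadata pool, and the
# ~1.1M whiteout objects sitting in the cache tier.
```

This is consistent with Steve's observation that "the used capacity is somewhat consistent with this": almost all of the raw usage is accounted for by the erasure coded pool still holding the data.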
Re: [ceph-users] why is there heavy read traffic during object delete?
> On Feb 4, 2016, at 6:51 PM, Gregory Farnum wrote:
>
> I presume we're doing reads in order to gather some object metadata
> from the cephfs-data pool; and the (small) newly-created objects in
> cache-data are definitely whiteout objects indicating the object no
> longer exists logically.
>
> What kinds of reads are you actually seeing? Does it appear to be
> transferring data, or merely doing a bunch of seeks? I thought we were
> trying to avoid doing reads-to-delete, but perhaps the way we're
> handling snapshots or something is invoking behavior that isn't
> amicable to a full-FS delete.
>
> I presume you're trying to characterize the system's behavior, but of
> course if you just want to empty it out entirely you're better off
> deleting the pools and the CephFS instance entirely and then starting
> it over again from scratch.
> -Greg

I believe it is reading all the data, just from the volume of traffic, and
the cpu load on the OSDs suggests it may be doing more than just that.

iostat is showing a lot of data moving, I am seeing about the same volume of
read and write activity here. Because the OSDs underneath both pools are the
same ones, which I know is not exactly optimal, it is hard to tell which pool
is responsible for which I/O. Large reads and small writes suggest it is
reading up all the data from the objects; the write traffic is, I presume,
all journal activity relating to deleting objects and creating the empty
ones.

The 9:1 ratio between things being deleted and created seems odd though.

A previous version of this exercise with just a regular replicated data pool
did not read anything, just a lot of write activity, and eventually the
content disappeared. So definitely related to the pool configuration here and
probably not to the filesystem layer.

I will eventually just put this out of its misery and wipe it.

Steve
[ceph-users] why is there heavy read traffic during object delete?
I set up a cephfs file system with a cache tier over an erasure coded tier as
an experiment:

ceph osd erasure-code-profile set raid6 k=4 m=2
ceph osd pool create cephfs-metadata 512 512
ceph osd pool set cephfs-metadata size 3
ceph osd pool create cache-data 2048 2048
ceph osd pool create cephfs-data 256 256 erasure raid6 default_erasure
ceph osd tier add cephfs-data cache-data
ceph osd tier cache-mode cache-data writeback
ceph osd tier set-overlay cephfs-data cache-data
ceph osd pool set cache-data hit_set_type bloom
ceph osd pool set cache-data target_max_bytes 1099511627776

The file system was created from the cephfs-metadata and cephfs-data pools.

After adding a lot of data to this and waiting for the pools to idle down and
stabilize, I removed the file system content with rm. I am seeing very
strange behavior: the file system remove was quick, and then it started
removing the data from the pools. However it appears to be reading the data
from the erasure coded pool and creating empty content in the cache pool.

At its peak capacity the system looked like this:

NAME          ID   CATEGORY   USED    %USED   MAX AVAIL   OBJECTS   DIRTY   READ    WRITE
cache-data    21   -          791G    5.25    6755G       256302    140k    22969   2138k
cephfs-data   22   -          4156G   27.54   4503G       1064086   1039k   51271   1046k

2 hours later it looked like this:

NAME          ID   CATEGORY   USED    %USED   MAX AVAIL   OBJECTS   DIRTY   READ    WRITE
cache-data    21   -          326G    2.17    7576G       702142    559k    22969   2689k
cephfs-data   22   -          3964G   26.27   5051G       1014842   991k    476k    1143k

The object count in the erasure coded pool has gone down a little, the count
in the cache pool has gone up a lot, and there has been a lot of read
activity in the erasure coded pool and write activity into both pools. The
used count in the cache pool is also going down. It looks like the cache pool
is gaining 9 objects for each one removed from the erasure coded pool.
Looking at the actual files being created by the OSD for this, they are
empty. What is going on here?

It looks like this will take a day or so to complete at this rate of
progress. The ceph version here is the master branch from a couple of days
ago.

Thanks

Steve Lord
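[Editorial aside: a quick cross-check of the peak pool statistics above. This is my own arithmetic and assumes the default 4 MiB CephFS object size, which the post does not state; the USED and OBJECTS figures come from the listing above.]

```python
# Object-count check for the peak figures above: 4156G used, 1064086 objects.
# Assumes the default 4 MiB CephFS file layout (not stated in the post).
object_size_mib = 4
used_gib = 4156

expected = used_gib * 1024 // object_size_mib
print(expected)                      # 1063936, vs the 1064086 objects reported

# The target_max_bytes from the setup commands is exactly 1 TiB:
print(1099511627776 == 2 ** 40)      # True
```

The near-exact match supports the reading that the erasure coded pool still holds essentially all of the original file data while the cache tier fills with zero-sized whiteout objects.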
[ceph-users] how to get even placement group distribution across OSDs - looking for hints
I have a configuration with 18 OSDs spread across 3 hosts. I am struggling
with getting an even distribution of placement groups between the OSDs for a
specific pool. All the OSDs are the same size with the same weight in the
crush map.

The fill level of the individual placement groups is very close when I put
data into them; however, I get a fairly uneven spread of placement groups
across the OSDs. This leads to the pool filling one of the OSDs well before
the others: a file system can report 70% full in aggregate, but one of the
OSDs fills so it can take no more data. In addition, this will clearly lead
to less than balanced load between the different devices, which are all of
the same physical type with the same throughput.

If I dump the placement groups and count them by OSD I typically see
something like this:

pool :    4   5   6   7   8   9   0   1  14   2  15   3 | SUM
osd.17    4   5   7   5  11   6   9   4  17  12  42   8 | 130
osd.4     7   8   8   7   4   6   1   4  12   8  23   8 |  96
osd.5     8   5  10  10   5   6   3   7  13   7  34  13 | 121
osd.6     9   6   8   2   3  10   1   4  12  10  26  10 | 101
osd.7     7  10   7   7   9  13   1   6  20   5  29   5 | 119
osd.8     6   7   4   6   6   3   7  11  20   7  28   9 | 114
osd.9     8  10   9   9   5   6   4   5  15   5  22   4 | 102
osd.10    3   2   4   5  11   9   3   4  20   7  38   8 | 114
osd.11    8  11  10   7   7  13   3   4  19   8  29   6 | 125
osd.12    7   6  10   5   8   4   2   8  18   6  37   9 | 120
osd.0     3   6  11  13   7   5   6  11  17   6  35   9 | 129
osd.13    7   8   5  10  11   8   4  13  18  11  35   5 | 135
osd.1    13   8   9   4   7   7   4   6  10  10  43   3 | 124
osd.14    8   7   4   7   8   3   3   8  16   3  28   6 | 101
osd.15    9   7   5   3   4  10   5   6  17   7  35   5 | 113
osd.2     7   9   9  11  11   8   2   8   9   6  34   9 | 123
osd.16    9   4   5   7   4   0   3   6  21   4  26   6 |  95
osd.3     5   9   3  10   7  11   3  13  14   6  32   5 | 118
SUM :   128 128 128 128 128 128  64 128 288 128 576 128 |

In this example I want to put a filesystem across pools 14 and 15. The data
pool has between 23 and 43 placement groups per OSD.

Am I just missing something here in defining the crush map? All I can find is
recommendations to get a more even balance by having more PGs per OSD.
Eventually I just get warnings about too many placement groups per OSD.
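[Editorial aside: a per-OSD count table like the one above can be produced from `ceph pg dump --format json`. The sketch below assumes the dump carries a "pg_stats" list whose entries have a "pgid" string like "15.2a" and an "acting" list of OSD ids, which matches jewel-era output but is worth checking against your release.]

```python
import json
from collections import Counter, defaultdict

def pgs_per_osd(pg_dump_json):
    """Tally placement groups per OSD, per pool, from `ceph pg dump
    --format json` output. Each PG is counted once on every OSD in its
    acting set, matching the table in the post."""
    counts = defaultdict(Counter)   # pool id -> {osd id: PG count}
    for pg in json.loads(pg_dump_json)["pg_stats"]:
        pool = pg["pgid"].split(".")[0]
        for osd in pg["acting"]:
            counts[pool][osd] += 1
    return counts

# Tiny synthetic example in the assumed format:
dump = json.dumps({"pg_stats": [
    {"pgid": "15.0", "acting": [0, 1, 2]},
    {"pgid": "15.1", "acting": [0, 2, 3]},
    {"pgid": "14.0", "acting": [1, 3, 4]},
]})
print(dict(pgs_per_osd(dump)["15"]))   # {0: 2, 1: 1, 2: 2, 3: 1}
```

Summing a pool's column over all OSDs gives pg_num times the replica count (or k+m for an EC pool), which is why pool 15 sums to 576 in the table.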
Or is the issue that there are multiple pools on this set of OSDs and
placement groups are being created in parallel for several of them? In this
case though, pool 15 was created after all the other pools existed and all
their placement groups were created, and even the first pool is unevenly
spread.

So are there any controls which influence how placement groups are allocated
to OSDs in the initial pool creation?

Thanks

Steve
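[Editorial aside on the "more PGs per OSD" advice: the spread in the table is roughly what independent pseudo-random placement predicts. With P total placements on N equal-weight OSDs, per-OSD counts fluctuate on the order of sqrt(P/N), so the relative imbalance only shrinks as the PG count grows. The sketch below uses plain uniform random choice as a stand-in for CRUSH; it is an approximation for illustration, not the real algorithm.]

```python
import random

def spread(num_pgs, num_osds, copies=3, seed=1):
    """Place `copies` replicas of each PG on distinct, uniformly chosen
    OSDs and return (min, max) PG count per OSD. Uniform random choice
    stands in for CRUSH with equal weights."""
    rng = random.Random(seed)
    counts = [0] * num_osds
    for _ in range(num_pgs):
        for osd in rng.sample(range(num_osds), copies):
            counts[osd] += 1
    return min(counts), max(counts)

# Pool 15 above contributes 576 placements to 18 OSDs (192 PGs x 3 copies),
# a mean of 32 per OSD; the observed per-OSD range in the table was 22 to 43.
lo, hi = spread(192, 18)
print(f"simulated range: {lo}..{hi} around a mean of 32")
```

The usual mitigations are raising the PG count (within the too-many-PGs warning limits the post mentions) or nudging weights with `ceph osd reweight-by-utilization`; neither changes the underlying statistics, they just compensate for them.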