Re: [ceph-users] Ceph 0.94 (and lower) performance on 1 hosts ??
Hi Gregory,

Thanks for your replies. Let's take the 2-host setup (3 MON + 3 idle MDS on the same hosts): two Dell R510 servers, CentOS 7.0.1406, dual Xeon 5620 (8 cores + hyperthreading), 16GB RAM, 2x or 1x 10Gbit/s Ethernet (same results with and without a private 10Gbit network), PERC H700 + 12x 2TB SAS disks, and PERC H800 + 11x 2TB SAS disks (plus one unused SSD...).

The EC pool is defined with k=4, m=1, and I set the failure domain to OSD for the test. The OSDs are set up with XFS and a 10GB journal on the first partition (the single doomed Dell SSD was a bottleneck for 23 disks...). All disks are configured as single-disk RAID0 because the H700/H800 do not support JBOD.

I have 5 clients (CentOS 7.1), 10Gbit/s Ethernet, all running this command:

  rados -k ceph.client.admin.keyring -p testec bench 120 write -b 4194304 -t 32 --run-name bench_`hostname -s` --no-cleanup

I'm aggregating the average bandwidth at the end of the tests, and I'm monitoring the Ceph servers' stats live with this dstat command: dstat -N p2p1,p2p2,total. The network MTU is 9000 on all nodes.

With this, the average client throughput is around 130MiB/s, i.e. 650MiB/s for the whole 2-node Ceph cluster across the 5 clients. I have since tried removing (ceph osd out / ceph osd crush reweight 0) either the H700 or the H800 disks, thus only using 11 or 12 disks per server, and I get either 550MiB/s or 590MiB/s of aggregated client bandwidth. Not much less, considering I removed half the disks!

I'm therefore starting to think I am CPU / memory-bandwidth limited...? That's not, however, what I am tempted to conclude (for the CPU at least) when I see the dstat output, as it says the CPUs still sit idle or IO-waiting:

  total-cpu-usage -dsk/total- --net/p2p1----net/p2p2-- -net/total- ---paging-- ---system--
  usr sys idl wai hiq siq| read  writ| recv  send: recv  send: recv  send|  in   out | int   csw
    1   1  97   0   0   0| 586k 1870k|   0     0 :   0     0 :   0     0 |  49B  455B|816715k
   29  17  24  27   0   3| 128k  734M| 367M  870k:   0     0 : 367M  870k|   0     0 | 61k   61k
   30  17  34  16   0   3| 432k  750M| 229M  567k: 199M  168M: 427M  168M|   0     0 | 65k   68k
   25  14  38  20   0   3|  16k  634M| 232M  654k: 162M  133M: 393M  134M|   0     0 | 56k   64k
   19  10  46  23   0   2| 232k  463M| 244M  670k: 184M  138M: 428M  139M|   0     0 | 45k   55k
   15   8  46  29   0   1| 368k  422M| 213M  623k: 149M  110M: 362M  111M|   0     0 | 35k   41k
   25  17  37  19   0   3|  48k  584M| 139M  394k: 137M   90M: 276M   91M|   0     0 | 54k   53k

Could it be the interrupts or system context switches that cause this relatively poor per-node performance? PCI-E interactions with the PERC cards? I know I can get way more disk throughput with dd (command below):

  total-cpu-usage -dsk/total- -net/total- ---paging-- ---system--
  usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
    1   1  97   0   0   0| 595k 2059k|   0     0 | 634B 2886B|797115k
    1  93   0   3   0   3|   0  1722M|  49k   78k|   0     0 | 40k   47k
    1  93   0   3   0   3|   0  1836M|  40k   69k|   0     0 | 45k   57k
    1  95   0   2   0   2|   0  1805M|  40k   69k|   0     0 | 38k   34k
    1  94   0   3   0   2|   0  1864M|  37k   38k|   0     0 | 35k   24k
  (...)

Dd command:

  # use at your own risk #
  FS_THR=64 ; FILE_MB=8 ; N_FS=`mount | grep ceph | wc -l`
  time (
    for i in `mount | grep ceph | awk '{print $3}'` ; do
      echo "writing $FS_THR times (threads) $FILE_MB MB on $i..."
      for j in `seq 1 $FS_THR` ; do
        # background each dd so the FS_THR writers actually run in parallel
        dd conv=fsync if=/dev/zero of=$i/test.zero.$j bs=4M count=$[ FILE_MB / 4 ] &
      done
    done
    wait
  )
  echo wrote $[ N_FS * FILE_MB * FS_THR ] MB on $N_FS FS with $FS_THR threads
  rm -f /var/lib/ceph/osd/*/test.zero*

Hope this gives you more insight into what I'm trying to achieve, and where I'm failing?
Regards

-----Original Message-----
From: Gregory Farnum [mailto:g...@gregs42.com]
Sent: Wednesday, July 22, 2015 16:01
To: Florent MONTHEL
Cc: SCHAER Frederic; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph 0.94 (and lower) performance on 1 hosts ??

We might also be able to help you improve or better understand your results if you can tell us exactly what tests you're conducting that are giving you these numbers.
-Greg

On Wed, Jul 22, 2015 at 4:44 AM, Florent MONTHEL fmont...@flox-arts.net wrote:
Hi Frederic,
When you have a Ceph cluster with 1 node you don't experience the network and communication overhead due to the distributed model. With 2 nodes and EC 4+1 you will have communication between the 2 nodes, but you will keep some internal communication (2 chunks on the first node and 3 chunks on the second node). On your configuration the EC pool is set up with 4+1, so for each write you will have overhead due to the write spreading over 5 nodes (for 1 customer IO, you will experience 5 Ceph IOs due to EC 4+1). It's the reason I think you're reaching performance stability with 5 nodes and more in your cluster.
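Since the question above is whether interrupts and context switches are eating the nodes, one quick check is to watch how the interrupt load is spread over the cores while a rados bench run is in flight. This is only a generic sketch: the NIC name (p2p1) is taken from the dstat command above, and the "megasas" entries assume the PERC's megaraid_sas driver naming, so adjust the patterns to what /proc/interrupts actually shows on your hosts:

  # per-CPU interrupt counters for the 10GbE queues and the RAID controllers
  watch -d -n1 "egrep 'p2p1|megasas' /proc/interrupts"
  # system-wide interrupts (in) and context switches (cs) per second
  vmstat 1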
Re: [ceph-users] PGs going inconsistent after stopping the primary
Looks like it's just a stat error. The primary appears to have the correct stats, but the replica for some reason doesn't (thinks there's an object for some reason). I bet it clears itself if you perform a write on the pg, since the primary will send over its stats. We'd need information from when the stat error originally occurred to debug further.
-Sam

- Original Message -
From: Dan van der Ster d...@vanderster.com
To: ceph-users@lists.ceph.com
Sent: Wednesday, July 22, 2015 7:49:00 AM
Subject: [ceph-users] PGs going inconsistent after stopping the primary

Hi Ceph community,

Env: hammer 0.94.2, Scientific Linux 6.6, kernel 2.6.32-431.5.1.el6.x86_64

We wanted to post here before the tracker to see if someone else has had this problem. We have a few PGs (different pools) which get marked inconsistent when we stop the primary OSD. The problem is strange because once we restart the primary, then scrub the PG, the PG is marked active+clean. But inevitably next time we stop the primary OSD, the same PG is marked inconsistent again. There is no user activity on this PG, and nothing interesting is logged in any of the 2nd/3rd OSDs (with debug_osd=20, the first line mentioning the PG already says inactive+inconsistent). We suspect this is related to garbage files left in the PG folder.

One of our PGs is acting basically like above, except it goes through this cycle: active+clean - (deep-scrub) - active+clean+inconsistent - (repair) - active+clean - (restart primary OSD) - (deep-scrub) - active+clean+inconsistent. This one at least logs:

2015-07-22 16:42:41.821326 osd.303 [INF] 55.10d deep-scrub starts
2015-07-22 16:42:41.823834 osd.303 [ERR] 55.10d deep-scrub stat mismatch, got 0/1 objects, 0/0 clones, 0/1 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 0/0 bytes,0/0 hit_set_archive bytes.
2015-07-22 16:42:41.823842 osd.303 [ERR] 55.10d deep-scrub 1 errors

and this should be debuggable because there is only one object in the pool:

tapetest 55 0 0 73575G 1

even though rados ls returns no objects:

# rados ls -p tapetest
#

Any ideas?

Cheers, Dan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
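A small sketch of how one might force a write onto the specific PG, as Sam suggests: try candidate object names, check which PG each would map to with "ceph osd map", and write a dummy object with a name that lands in the affected PG. The pool (tapetest) and pgid (55.10d) are the ones from this thread; the object name is arbitrary:

  # find an object name that maps to pg 55.10d in pool "tapetest"
  for n in $(seq 1 1000); do
      ceph osd map tapetest obj$n | grep -q '(55.10d)' && { echo obj$n; break; }
  done
  # write (and later remove) a dummy object using the name printed above
  echo x > /tmp/dummy
  rados -p tapetest put <objname> /tmp/dummy
  rados -p tapetest rm <objname>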
Re: [ceph-users] ceph-mon cpu usage
This cluster is serving RBD storage for OpenStack, and today all the I/O just stopped. After looking at the boxes, ceph-mon was using 17G of RAM - and this was on *all* the mons. Restarting the main one made it work again (I restarted the other ones because they were using a lot of ram). This has happened twice now (the first time was last Monday). As this is considered a prod cluster there is no logging enabled, and I can't reproduce it - our test/dev clusters have been working fine and show neither symptom, but they were upgraded from firefly.

What can we do to help debug the issue? Any ideas on how to identify the underlying issue?

thanks,

On Mon, Jul 20, 2015 at 1:59 PM, Luis Periquito periqu...@gmail.com wrote:
Hi all,

I have a cluster with 28 nodes (all physical, 4 cores, 32GB RAM), each node has 4 OSDs for a total of 112 OSDs. Each OSD has 106 PGs (counted including replication). There are 3 MONs on this cluster. I'm running on Ubuntu trusty with kernel 3.13.0-52-generic, with Hammer (0.94.2). This cluster was installed with Hammer (0.94.1) and has only been upgraded to the latest available version.

Of the three mons one is mostly idle, one is using ~170% CPU, and one is using ~270% CPU. They will change as I restart the process (usually the idle one is the one with the lowest uptime). Running a perf top against the ceph-mon PID on the non-idle boxes yields something like this:

  4.62%  libpthread-2.19.so    [.] pthread_mutex_unlock
  3.95%  libpthread-2.19.so    [.] pthread_mutex_lock
  3.91%  libsoftokn3.so        [.] 0x0001db26
  2.38%  [kernel]              [k] _raw_spin_lock
  2.09%  libtcmalloc.so.4.1.2  [.] operator new(unsigned long)
  1.79%  ceph-mon              [.] DispatchQueue::enqueue(Message*, int, unsigned long)
  1.62%  ceph-mon              [.] RefCountedObject::get()
  1.58%  libpthread-2.19.so    [.] pthread_mutex_trylock
  1.32%  libtcmalloc.so.4.1.2  [.] operator delete(void*)
  1.24%  libc-2.19.so          [.] 0x00097fd0
  1.20%  ceph-mon              [.] ceph::buffer::ptr::release()
  1.18%  ceph-mon              [.] RefCountedObject::put()
  1.15%  libfreebl3.so         [.] 0x000542a8
  1.05%  [kernel]              [k] update_cfs_shares
  1.00%  [kernel]              [k] tcp_sendmsg

The cluster is mostly idle, and it's healthy. The store is 69MB big, and the MONs are consuming around 700MB of RAM.

Any ideas on this situation? Is it safe to ignore?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
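When this happens again, a few admin-socket commands can help narrow down where the 17G of mon memory is going without restarting anything; ceph-mon links against tcmalloc, so the heap commands below apply. The mon id is a placeholder and the debug bump is only a temporary sketch to run while the problem is visible:

  # heap usage as seen by tcmalloc, and ask it to return free pages to the OS
  ceph tell mon.<id> heap stats
  ceph tell mon.<id> heap release
  # internal counters (message queues, etc.) via the admin socket on the mon host
  ceph daemon mon.<id> perf dump
  # temporarily raise mon/messenger debugging without a restart
  ceph tell mon.<id> injectargs '--debug-mon 10 --debug-ms 1'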
Re: [ceph-users] CephFS vs RBD
RBD can be safely mounted on multiple machines at once, but the file system has to be designed for such scenarios. File systems like ext, xfs, btrfs, etc. are only designed to be accessed by a single system. Clustered file systems like OCFS, GFS, etc. are designed to have multiple discrete machines access the file system at the same time. As long as you use a clustered file system on RBD, you will be OK. Now, whether that performs better than CephFS is a question you will have to answer through testing.

- Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Wed, Jul 22, 2015 at 1:17 PM, Lincoln Bryant wrote:
Hi Hadi,
AFAIK, you can't safely mount RBD as R/W on multiple machines. You could re-export the RBD as NFS, but that'll introduce a bottleneck and probably tank your performance gains over CephFS. For what it's worth, some of our RBDs are mapped to multiple machines, mounted read-write on one and read-only on the others. We haven't seen any strange effects from that, but I seem to recall it being ill advised.
—Lincoln

On Jul 22, 2015, at 2:05 PM, Hadi Montakhabi wrote:
Hello Cephers,
I've been experimenting with CephFS and RBD for some time now. From what I have seen so far, RBD outperforms CephFS by far. However, there is a catch! RBD could be mounted on one client at a time! Now, assuming that we have multiple clients running some MPI code (and doing some distributed I/O), all these clients need to read/write from the same location and sometimes even the same file. Is this at all possible by using RBD, and not CephFS?
Thanks, Hadi

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] CephFS vs RBD
Hello Cephers, I've been experimenting with CephFS and RBD for some time now. From what I have seen so far, RBD outperforms CephFS by far. However, there is a catch! RBD could be mounted on one client at a time! Now, assuming that we have multiple clients running some MPI code (and doing some distributed I/O), all these clients need to read/write from the same location and sometimes even the same file. Is this at all possible by using RBD, and not CephFS? Thanks, Hadi ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] PGs going inconsistent after stopping the primary
Cool, writing some objects to the affected PGs has stopped the consistent/inconsistent cycle. I'll keep an eye on them but this seems to have fixed the problem. Thanks!! Dan On Wed, Jul 22, 2015 at 6:07 PM, Samuel Just sj...@redhat.com wrote: Looks like it's just a stat error. The primary appears to have the correct stats, but the replica for some reason doesn't (thinks there's an object for some reason). I bet it clears itself it you perform a write on the pg since the primary will send over its stats. We'd need information from when the stat error originally occurred to debug further. -Sam - Original Message - From: Dan van der Ster d...@vanderster.com To: ceph-users@lists.ceph.com Sent: Wednesday, July 22, 2015 7:49:00 AM Subject: [ceph-users] PGs going inconsistent after stopping the primary Hi Ceph community, Env: hammer 0.94.2, Scientific Linux 6.6, kernel 2.6.32-431.5.1.el6.x86_64 We wanted to post here before the tracker to see if someone else has had this problem. We have a few PGs (different pools) which get marked inconsistent when we stop the primary OSD. The problem is strange because once we restart the primary, then scrub the PG, the PG is marked active+clean. But inevitably next time we stop the primary OSD, the same PG is marked inconsistent again. There is no user activity on this PG, and nothing interesting is logged in any of the 2nd/3rd OSDs (with debug_osd=20, the first line mentioning the PG already says inactive+inconsistent). We suspect this is related to garbage files left in the PG folder. One of our PGs is acting basically like above, except it goes through this cycle: active+clean - (deep-scrub) - active+clean+inconsistent - (repair) - active+clean - (restart primary OSD) - (deep-scrub) - active+clean+inconsistent. This one at least logs: 2015-07-22 16:42:41.821326 osd.303 [INF] 55.10d deep-scrub starts 2015-07-22 16:42:41.823834 osd.303 [ERR] 55.10d deep-scrub stat mismatch, got 0/1 objects, 0/0 clones, 0/1 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 0/0 bytes,0/0 hit_set_archive bytes. 2015-07-22 16:42:41.823842 osd.303 [ERR] 55.10d deep-scrub 1 errors and this should be debuggable because there is only one object in the pool: tapetest 55 0 073575G 1 even though rados ls returns no objects: # rados ls -p tapetest # Any ideas? Cheers, Dan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CephFS vs RBD
On 22/07/2015 21:17, Lincoln Bryant wrote:
Hi Hadi,
AFAIK, you can't safely mount RBD as R/W on multiple machines. You could re-export the RBD as NFS, but that'll introduce a bottleneck and probably tank your performance gains over CephFS. For what it's worth, some of our RBDs are mapped to multiple machines, mounted read-write on one and read-only on the others. We haven't seen any strange effects from that, but I seem to recall it being ill advised.

Yes it is, for several reasons. Here are two off the top of my head.

Some (many/most/all?) filesystems update on-disk data when they are mounted, even if the mount is read-only. If you map your RBD devices read-only before mounting the filesystem itself read-only, you should be safe from corruption occurring at mount time though.

The system with read-write access will keep its in-memory data in sync with the on-disk data. The others with read-only access will not, as they won't be aware of the writes being done. This means they will eventually see incoherent data and will generate fs access errors of varying severity, from a benign read error to potentially a full kernel crash, with a whole-filesystem freeze somewhere in between.

Don't do that unless you:
- carefully set up your rbd mappings read-only everywhere but the system doing the writes,
- can withstand a (simultaneous) system crash on all the systems mounting the rbd mappings read-only.

Lionel

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
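A sketch of the "read-only everywhere but the writer" setup Lionel describes: map the device read-only on the consumers and mount with options that skip journal/log recovery, so the mount itself cannot write. The pool/image names are placeholders and the mount option differs per filesystem (norecovery for XFS, noload for ext4); even then the read-only mounts can still see stale or incoherent data as the writer changes the image, as described above:

  # on the read-only consumers
  rbd map --read-only <pool>/<image>
  mount -o ro,norecovery /dev/rbd0 /mnt/image     # XFS
  # mount -o ro,noload /dev/rbd0 /mnt/image       # ext4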
[ceph-users] Clients' connection for concurrent access to ceph
Workaround...

We're now building a huge computing cluster: 140 diskless compute nodes, all pulling a lot of computing data from storage concurrently. The users that submit jobs to the cluster also need access to the same storage location (to check progress and results).

We've built a Ceph cluster:
3 mon nodes (one of them combined with the mds)
3 osd nodes (each one has 10 osds + an ssd for journaling)
switch: 24 ports x 10G
10 gigabit - for the public network
20 gigabit bonding - between osds
Ubuntu 12.04.05, Ceph 0.87.2 - giant

Clients have:
10 gigabit for the ceph connection
CentOS 6.6 with an upgraded kernel 3.19.8 (the already-running computing cluster)

All nodes, switches and clients are configured with jumbo frames.

First test: I thought to make one big rbd and share it, but RBD supports multiple clients mapping/mounting it - not parallel writes...

Second test: NFS over RBD - it's working pretty well, but:
1. The NFS gateway is a single point of failure
2. There's no performance scaling from the scale-out storage, i.e. a bottleneck (limited by the bandwidth of the NFS gateway)

Third test: We wanted to try CephFS, because our client is familiar with Lustre, which is very close to CephFS in capabilities:
1. I used one of my Ceph nodes in the client's role. I mounted CephFS on that node and ran dd with bs=1M... I got wonderful write performance, ~1.1 GBytes/s (really close to the 10Gbit network throughput).
2. I connected a CentOS client to the 10gig public network and mounted CephFS, but... it was just ~250 MBytes/s.
3. I connected an Ubuntu client (not a ceph member) to the 10gig public network and mounted CephFS, and... it was also ~260 MBytes/s.

Now I have to know: perhaps ceph member nodes have privileged access??? I'm sure you have more ceph deployment experience - have you seen these CephFS performance deviations?

Thanks, Shneur

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
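One way to tell whether the gap comes from CephFS itself or from the client's network/RADOS path is to measure each layer separately from one of the slow clients; a rough sketch, with the pool name and mount point as placeholders:

  # raw network between the client and an osd node
  iperf -c <osd-node> -P 4
  # raw RADOS throughput from the client
  rados -p <testpool> bench 60 write -t 32 --no-cleanup
  rados -p <testpool> bench 60 seq -t 32
  # CephFS throughput, same kind of test as described above
  dd if=/dev/zero of=/mnt/cephfs/ddtest bs=1M count=10000 conv=fsync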
[ceph-users] Scrubbing optymalisation
Hi Cephers,

I'm looking for ways to optimize the scrubbing process. In our environment it has a big impact on performance. We monitor the disks with monitorix: while scrubbing is running, 'Disk I/O activity (R+W)' shows 20-60 reads+writes per second; after disabling the scrub and deep-scrub processes this drops to 0-40 reads+writes. It makes a noticeable difference in performance.

Ceph config settings:

  ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok config show | grep ioprio
    osd_disk_thread_ioprio_class: idle,
    osd_disk_thread_ioprio_priority: 7,

All disks have the cfq scheduler enabled. The cluster has 6 servers, 5 monitors, and 4-6 osds per server + 1 ssd for journals in each server.

Are there other config options I can set to reduce the impact of the scrubbing process? Attached is a screenshot from monitorix (scrubbing disabled in week 27).

Best Regards, Mateusz

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
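Besides the disk-thread ioprio settings already in place, a few other OSD options throttle how much scrubbing runs at once. These options exist in current releases, but the values below are only a starting-point sketch, not recommended defaults:

  # at runtime, on all OSDs
  ceph tell osd.* injectargs '--osd_max_scrubs 1 --osd_scrub_sleep 0.1'
  # or persistently in ceph.conf, [osd] section:
  #   osd max scrubs = 1
  #   osd scrub sleep = 0.1
  #   osd deep scrub interval = 2419200    # every 4 weeks instead of the weekly default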
Re: [ceph-users] 10d
I just filed a ticket after trying ceph-objectstore-tool: http://tracker.ceph.com/issues/12428 On Fri, Jul 17, 2015 at 3:36 PM, Dan van der Ster d...@vanderster.com wrote: A bit of progress: rm'ing everything from inside current/36.10d_head/ actually let the OSD start and continue deleting other PGs. Cheers, Dan On Fri, Jul 17, 2015 at 3:26 PM, Dan van der Ster d...@vanderster.com wrote: Thanks for the quick reply. We /could/ just wipe these OSDs and start from scratch (the only other pools were 4+2 ec and recovery already brought us to 100% active+clean). But it'd be good to understand and prevent this kind of crash... Cheers, Dan On Fri, Jul 17, 2015 at 3:18 PM, Gregory Farnum g...@gregs42.com wrote: I think you'll need to use the ceph-objectstore-tool to remove the PG/data consistently, but I've not done this — David or Sam will need to chime in. -Greg On Fri, Jul 17, 2015 at 2:15 PM, Dan van der Ster d...@vanderster.com wrote: Hi Greg + list, Sorry to reply to this old'ish thread, but today one of these PGs bit us in the ass. Running hammer 0.94.2, we are deleting pool 36 and the OSDs 30, 171, and 69 all crash when trying to delete pg 36.10d. They all crash with ENOTEMPTY suggests garbage data in osd data dir (full log below). There is indeed some garbage in there: # find 36.10d_head/ 36.10d_head/ 36.10d_head/DIR_D 36.10d_head/DIR_D/DIR_0 36.10d_head/DIR_D/DIR_0/DIR_1 36.10d_head/DIR_D/DIR_0/DIR_1/__head_BD49D10D__24 36.10d_head/DIR_D/DIR_0/DIR_9 Do you have any suggestion how to get these OSDs back running? We already tried manually moving 36.10d_head to 36.10d_head.bak but then the OSD crashes for a different reason: -1 2015-07-17 15:07:42.442851 7fe11fc0b800 10 osd.69 92595 pgid 36.10d coll 36.10d_head 0 2015-07-17 15:07:42.443925 7fe11fc0b800 -1 osd/PG.cc: In function 'static epoch_t PG::peek_map_epoch(ObjectStore*, spg_t, ceph::bufferlist*)' thread 7fe11fc0b800 time 2015-07-17 15:07:42.442902 osd/PG.cc: 2839: FAILED assert(r 0) Any clues? 
Cheers, Dan 2015-07-17 14:40:54.493935 7f0ba60f4700 0 filestore(/var/lib/ceph/osd/ceph-30) error (39) Directory not empty not handled on operation 0xedd0b88 (18879615.0.1, or op 1, counting from 0) 2015-07-17 14:40:54.494019 7f0ba60f4700 0 filestore(/var/lib/ceph/osd/ceph-30) ENOTEMPTY suggests garbage data in osd data dir 2015-07-17 14:40:54.494021 7f0ba60f4700 0 filestore(/var/lib/ceph/osd/ceph-30) transaction dump: { ops: [ { op_num: 0, op_name: remove, collection: 36.10d_head, oid: 10d\/\/head\/\/36 }, { op_num: 1, op_name: rmcoll, collection: 36.10d_head } ] } 2015-07-17 14:40:54.606399 7f0ba60f4700 -1 os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction, uint64_t, int, ThreadPool::TPHandle*)' thread 7f0ba60f4700 time 2015-07-17 14:40:54.502996 os/FileStore.cc: 2757: FAILED assert(0 == unexpected error) ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3) 1: (FileStore::_do_transaction(ObjectStore::Transaction, unsigned long, int, ThreadPool::TPHandle*)+0xc16) [0x975a06] 2: (FileStore::_do_transactions(std::listObjectStore::Transaction*, std::allocatorObjectStore::Transaction* , unsigned long, ThreadPool::TPHandle*)+0x64) [0x97d794] 3: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle)+0x2a0) [0x97da50] 4: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0xaffdc6] 5: (ThreadPool::WorkThread::entry()+0x10) [0xb01a10] 6: /lib64/libpthread.so.0() [0x3fbec079d1] 7: (clone()+0x6d) [0x3fbe8e88fd] On Wed, Jun 17, 2015 at 11:09 AM, Dan van der Ster d...@vanderster.com wrote: On Wed, Jun 17, 2015 at 10:52 AM, Gregory Farnum g...@gregs42.com wrote: On Wed, Jun 17, 2015 at 8:56 AM, Dan van der Ster d...@vanderster.com wrote: Hi, After upgrading to 0.94.2 yesterday on our test cluster, we've had 3 PGs go inconsistent. First, immediately after we updated the OSDs PG 34.10d went inconsistent: 2015-06-16 13:42:19.086170 osd.52 137.138.39.211:6806/926964 2 : cluster [ERR] 34.10d scrub stat mismatch, got 4/5 objects, 0/0 clones, 0/0 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 136/136 bytes,0/0 hit_set_archive bytes. Second, an hour later 55.10d went inconsistent: 2015-06-16 14:27:58.336550 osd.303 128.142.23.56:6812/879385 10 : cluster [ERR] 55.10d deep-scrub stat mismatch, got 0/1 objects, 0/0 clones, 0/1 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 0/0 bytes,0/0 hit_set_archive bytes. Then last night 36.10d suffered the same fate: 2015-06-16 23:05:17.857433 osd.30 188.184.18.39:6800/2260103 16 : cluster [ERR] 36.10d deep-scrub stat mismatch, got 5833/5834 objects, 0/0 clones, 5758/5759 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 24126649216/24130843520 bytes,0/0 hit_set_archive bytes. In all cases, one
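For reference, the ceph-objectstore-tool step Greg refers to usually looks roughly like the following, run with the OSD stopped; the paths and pgid below are the ones from this thread and are only illustrative, and the export beforehand is just a precaution so the PG's data can be re-imported if needed:

  service ceph stop osd.69
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-69 \
      --journal-path /var/lib/ceph/osd/ceph-69/journal \
      --pgid 36.10d --op export --file /root/36.10d.export
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-69 \
      --journal-path /var/lib/ceph/osd/ceph-69/journal \
      --pgid 36.10d --op remove
  service ceph start osd.69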
Re: [ceph-users] how to recover from: 1 pgs down; 10 pgs incomplete; 10 pgs stuck inactive; 10 pgs stuck unclean
On 15/07/15 10:55, Jelle de Jong wrote: On 13/07/15 15:40, Jelle de Jong wrote: I was testing a ceph cluster with osd_pool_default_size = 2 and while rebuilding the OSD on one ceph node a disk in an other node started getting read errors and ceph kept taking the OSD down, and instead of me executing ceph osd set nodown while the other node was rebuilding I kept restarting the OSD for a while and ceph took the OSD in for a few minutes and then taking it back down. I then removed the bad OSD from the cluster and later added it back in with nodown flag set and a weight of zero, moving all the data away. Then removed the OSD again and added a new OSD with a new hard drive. However I ended up with the following cluster status and I can't seem to find how to get the cluster healthy again. I'm doing this as tests before taking this ceph configuration in further production. http://paste.debian.net/plain/281922 If I lost data, my bad, but how could I figure out in what pool the data was lost and in what rbd volume (so what kvm guest lost data). Anybody that can help? Can I somehow reweight some OSD to resolve the problems? Or should I rebuild the whole cluster and loose all data? # ceph pg 3.12 query http://paste.debian.net/284812/ I used ceph pg force_create_pg x.xx on all the incomplete pgs and I don’t have any stuck pgs any more but there are still incomplete ones. # ceph health detail http://paste.debian.net/284813/ How can I get the incomplete pgs active again? Kind regards, Jelle de Jong ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
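On the "which pool / which rbd volume lost data" question: the number before the dot in a pgid is the pool id, so it can be matched against the pool list, and the RBD images in that pool can then be listed. A rough sketch, with the pool name as a placeholder:

  # pg 3.12 belongs to pool id 3
  ceph osd lspools                      # shows "3 <poolname>, ..."
  ceph health detail | grep incomplete  # the still-incomplete pgs and their pools
  rbd ls -l <poolname>                  # images (and hence KVM guests) that may be affected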
Re: [ceph-users] Performance dégradation after upgrade to hammer
Ok, So good news that RADOS appears to be doing well. I'd say next is to follow some of the recommendations here: http://ceph.com/docs/master/radosgw/troubleshooting/ If you examine the objecter_requests and perfcounters during your cosbench write test, it might help explain where the requests are backing up. Another thing to look for (as noted in the above URL) are HTTP errors in the apache logs (if relevant). Other general thoughts: When you upgraded to hammer did you change the RGW configuration at all? Are you using civetweb now? Does the rgw.buckets pool have enough PGs? Mark On 07/21/2015 08:17 PM, Florent MONTHEL wrote: Hi Mark I've something like 600 write IOPs on EC pool and 800 write IOPs on replicated 3 pool with rados bench With Radosgw I have 30/40 write IOPs with Cosbench (1 radosgw- the same with 2) and servers are sleeping : - 0.005 core for radosgw process - 0.01 core for osd process I don't know if we can have .rgw* pool locking or something like that with Hammer (or situation specific to me) On 100% read profile, Radosgw and Ceph servers are working very well with more than 6000 IOPs on one radosgw server : - 7 cores for radosgw process - 1 core for each osd process - 0,5 core for each Apache process Thanks Sent from my iPhone On 14 juil. 2015, at 21:03, Mark Nelson mnel...@redhat.com wrote: Hi Florent, 10x degradation is definitely unusual! A couple of things to look at: Are 8K rados bench writes to the rgw.buckets pool slow? You can with something like: rados -p rgw.buckets bench 30 write -t 256 -b 8192 You may also want to try targeting a specific RGW server to make sure the RR-DNS setup isn't interfering (at least while debugging). It may also be worth creating a new replicated pool and try writes to that pool as well to see if you see much difference. Mark On 07/14/2015 07:17 PM, Florent MONTHEL wrote: Yes of course thanks Mark Infrastructure : 5 servers with 10 sata disks (50 osd at all) - 10gb connected - EC 2+1 on rgw.buckets pool - 2 radosgw RR-DNS like installed on 2 cluster servers No SSD drives used We're using Cosbench to send : - 8k object size : 100% read with 256 workers : better results with Hammer - 8k object size : 80% read - 20% write with 256 workers : real degradation between Firefly and Hammer (divided by something like 10) - 8k object size : 100% write with 256 workers : real degradation between Firefly and Hammer (divided by something like 10) Thanks Sent from my iPhone On 14 juil. 2015, at 19:57, Mark Nelson mnel...@redhat.com wrote: On 07/14/2015 06:42 PM, Florent MONTHEL wrote: Hi All, I've just upgraded Ceph cluster from Firefly 0.80.8 (Redhat Ceph 1.2.3) to Hammer (Redhat Ceph 1.3) - Usage : radosgw with Apache 2.4.19 on MPM prefork mode I'm experiencing huge write performance degradation just after upgrade (Cosbench). Do you already run performance tests between Hammer and Firefly ? No problem with read performance that was amazing Hi Florent, Can you talk a little bit about how your write tests are setup? How many concurrent IOs and what size? Also, do you see similar problems with rados bench? We have done some testing and haven't seen significant performance degradation except when switching to civetweb which appears to perform deletes more slowly than what we saw with apache+fcgi. 
Mark Sent from my iPhone ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
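The objecter_requests and perfcounters checks Mark mentions are read through the radosgw admin socket; something along these lines, with the socket name depending on how the rgw client section is configured:

  # on the radosgw host, while the cosbench write test is running
  ceph daemon /var/run/ceph/ceph-client.rgw.<name>.asok objecter_requests
  ceph daemon /var/run/ceph/ceph-client.rgw.<name>.asok perf dump | less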
Re: [ceph-users] Ceph 0.94 (and lower) performance on 1 hosts ??
We might also be able to help you improve or better understand your results if you can tell us exactly what tests you're conducting that are giving you these numbers. -Greg On Wed, Jul 22, 2015 at 4:44 AM, Florent MONTHEL fmont...@flox-arts.net wrote: Hi Frederic, When you have Ceph cluster with 1 node you don’t experienced network and communication overhead due to distributed model With 2 nodes and EC 4+1 you will have communication between 2 nodes but you will keep internal communication (2 chunks on first node and 3 chunks on second node) On your configuration EC pool is setup with 4+1 so you will have for each write overhead due to write spreading on 5 nodes (for 1 customer IO, you will experience 5 Ceph IO due to EC 4+1) It’s the reason for that I think you’re reaching performance stability with 5 nodes and more in your cluster On Jul 20, 2015, at 10:35 AM, SCHAER Frederic frederic.sch...@cea.fr wrote: Hi, As I explained in various previous threads, I’m having a hard time getting the most out of my test ceph cluster. I’m benching things with rados bench. All Ceph hosts are on the same 10GB switch. Basically, I know I can get about 1GB/s of disk write performance per host, when I bench things with dd (hundreds of dd threads) +iperf 10gbit inbound+iperf 10gbit outbound. I also can get 2GB/s or even more if I don’t bench the network at the same time, so yes, there is a bottleneck between disks and network, but I can’t identify which one, and it’s not relevant for what follows anyway (Dell R510 + MD1200 + PERC H700 + PERC H800 here, if anyone has hints about this strange bottleneck though…) My hosts each are connected though a single 10Gbits/s link for now. My problem is the following. Please note I see the same kind of poor performance with replicated pools... When testing EC pools, I ended putting a 4+1 pool on a single node in order to track down the ceph bottleneck. On that node, I can get approximately 420MB/s write performance using rados bench, but that’s fair enough since the dstat output shows that real data throughput on disks is about 800+MB/s (that’s the ceph journal effect, I presume). I tested Ceph on my other standalone nodes : I can also get around 420MB/s, since they’re identical. I’m testing things with 5 10Gbits/s clients, each running rados bench. But what I really don’t get is the following : - With 1 host : throughput is 420MB/s - With 2 hosts : I get 640MB/s. That’s surely not 2x420MB/s. - With 5 hosts : I get around 1375MB/s . That’s far from the expected 2GB/s. The network never is maxed out, nor are the disks or CPUs. The hosts throughput I see with rados bench seems to match the dstat throughput. That’s as if each additional host was only capable of adding 220MB/s of throughput. Compare this to the 1GB/s they are capable of (420MB/s with journals)… I’m therefore wondering what could possibly be so wrong with my setup ?? Why would it impact so much the performance to add hosts ? On the hardware side, I have Broadcam BCM57711 10-Gigabit PCIe cards. I know, not perfect, but not THAT bad neither… ? Any hint would be greatly appreciated ! Thanks Frederic Schaer ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] OSD crashes
We have been error free for almost 3 weeks now. The following settings on all OSD nodes were changed: vm.swappiness=1 vm.min_free_kbytes=262144 My discussion on XFS list is here: http://www.spinics.net/lists/xfs/msg33645.html Thanks, Alex On Fri, Jul 3, 2015 at 6:27 AM, Jan Schermer j...@schermer.cz wrote: What’s the value of /proc/sys/vm/min_free_kbytes on your system? Increase it to 256M (better do it if there’s lots of free memory) and see if it helps. It can also be set too high, hard to find any formula how to set it correctly... Jan On 03 Jul 2015, at 10:16, Alex Gorbachev a...@iss-integration.com wrote: Hello, we are experiencing severe OSD timeouts, OSDs are not taken out and we see the following in syslog on Ubuntu 14.04.2 with Firefly 0.80.9. Thank you for any advice. Alex Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.261899] BUG: unable to handle kernel paging request at 0019001c Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.261923] IP: [8118e476] find_get_entries+0x66/0x160 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.261941] PGD 1035954067 PUD 0 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.261955] Oops: [#1] SMP Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.261969] Modules linked in: xfs libcrc32c ipmi_ssif intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp co retemp kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd sb_edac edac_core lpc_ich joy dev mei_me mei ioatdma wmi 8021q ipmi_si garp 8250_fintek mrp ipmi_msghandler stp llc bonding mac_hid lp parport mlx4_en vxlan ip6_udp_tunnel udp_tunnel hid_ generic usbhid hid igb ahci mpt2sas mlx4_core i2c_algo_bit libahci dca raid_class ptp scsi_transport_sas pps_core arcmsr Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262182] CPU: 10 PID: 8711 Comm: ceph-osd Not tainted 4.1.0-040100-generic #201506220235 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262197] Hardware name: Supermicro X9DRD-7LN4F(-JBOD)/X9DRD-EF/X9DRD-7LN4F, BIOS 3.0a 12/05/2013 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262215] task: 8800721f1420 ti: 880fbad54000 task.ti: 880fbad54000 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262229] RIP: 0010:[8118e476] [8118e476] find_get_entries+0x66/0x160 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262248] RSP: 0018:880fbad571a8 EFLAGS: 00010246 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262258] RAX: 880004000158 RBX: 000e RCX: Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262303] RDX: 880004000158 RSI: 880fbad571c0 RDI: 0019 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262347] RBP: 880fbad57208 R08: 00c0 R09: 00ff Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262391] R10: R11: 0220 R12: 00b6 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262435] R13: 880fbad57268 R14: 000a R15: 880fbad572d8 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262479] FS: 7f98cb0e0700() GS:88103f48() knlGS: Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262524] CS: 0010 DS: ES: CR0: 80050033 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262551] CR2: 0019001c CR3: 001034f0e000 CR4: 000407e0 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262596] Stack: Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262618] 880fbad571f8 880cf6076b30 880bdde05da8 00e6 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262669] 0100 880cf6076b28 00b5 880fbad57258 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262721] 880fbad57258 880fbad572d8 880cf6076b28 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262772] Call Trace: Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262801] [8119b482] pagevec_lookup_entries+0x22/0x30 
Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262831] [8119bd84] truncate_inode_pages_range+0xf4/0x700 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262862] [8119c415] truncate_inode_pages+0x15/0x20 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262891] [8119c53f] truncate_inode_pages_final+0x5f/0xa0 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262949] [c0431c2c] xfs_fs_evict_inode+0x3c/0xe0 [xfs] Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.262981] [81220558] evict+0xb8/0x190 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263009] [81220671] dispose_list+0x41/0x50 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263037] [8122176f] prune_icache_sb+0x4f/0x60 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263067] [81208ab5] super_cache_scan+0x155/0x1a0 Jul 3 03:42:06 roc-4r-sca020 kernel: [554036.263096] [8119d26f] do_shrink_slab+0x13f/0x2c0 Jul 3 03:42:06
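For anyone hitting the same allocation stalls, the two kernel settings Alex describes at the top of this thread can be applied and persisted like this; the values are the ones quoted in the thread, so size min_free_kbytes to your own RAM:

  sysctl -w vm.swappiness=1
  sysctl -w vm.min_free_kbytes=262144
  # and persist them in /etc/sysctl.conf:
  #   vm.swappiness = 1
  #   vm.min_free_kbytes = 262144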
Re: [ceph-users] CephFS vs RBD
Hi Hadi, AFAIK, you can’t safely mount RBD as R/W on multiple machines. You could re-export the RBD as NFS, but that’ll introduce a bottleneck and probably tank your performance gains over CephFS. For what it’s worth, some of our RBDs are mapped to multiple machines, mounted read-write on one and read-only on the others. We haven’t seen any strange effects from that, but I seem to recall it being ill advised. —Lincoln On Jul 22, 2015, at 2:05 PM, Hadi Montakhabi h...@cs.uh.edu wrote: Hello Cephers, I've been experimenting with CephFS and RBD for some time now. From what I have seen so far, RBD outperforms CephFS by far. However, there is a catch! RBD could be mounted on one client at a time! Now, assuming that we have multiple clients running some MPI code (and doing some distributed I/O), all these clients need to read/write from the same location and sometimes even the same file. Is this at all possible by using RBD, and not CephFS? Thanks, Hadi ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] PGs going inconsistent after stopping the primary
Annoying that we don't know what caused the replica's stat structure to get out of sync. Let us know if you see it recur. What were those pools used for? -Sam - Original Message - From: Dan van der Ster d...@vanderster.com To: Samuel Just sj...@redhat.com Cc: ceph-users@lists.ceph.com Sent: Wednesday, July 22, 2015 12:36:53 PM Subject: Re: [ceph-users] PGs going inconsistent after stopping the primary Cool, writing some objects to the affected PGs has stopped the consistent/inconsistent cycle. I'll keep an eye on them but this seems to have fixed the problem. Thanks!! Dan On Wed, Jul 22, 2015 at 6:07 PM, Samuel Just sj...@redhat.com wrote: Looks like it's just a stat error. The primary appears to have the correct stats, but the replica for some reason doesn't (thinks there's an object for some reason). I bet it clears itself it you perform a write on the pg since the primary will send over its stats. We'd need information from when the stat error originally occurred to debug further. -Sam - Original Message - From: Dan van der Ster d...@vanderster.com To: ceph-users@lists.ceph.com Sent: Wednesday, July 22, 2015 7:49:00 AM Subject: [ceph-users] PGs going inconsistent after stopping the primary Hi Ceph community, Env: hammer 0.94.2, Scientific Linux 6.6, kernel 2.6.32-431.5.1.el6.x86_64 We wanted to post here before the tracker to see if someone else has had this problem. We have a few PGs (different pools) which get marked inconsistent when we stop the primary OSD. The problem is strange because once we restart the primary, then scrub the PG, the PG is marked active+clean. But inevitably next time we stop the primary OSD, the same PG is marked inconsistent again. There is no user activity on this PG, and nothing interesting is logged in any of the 2nd/3rd OSDs (with debug_osd=20, the first line mentioning the PG already says inactive+inconsistent). We suspect this is related to garbage files left in the PG folder. One of our PGs is acting basically like above, except it goes through this cycle: active+clean - (deep-scrub) - active+clean+inconsistent - (repair) - active+clean - (restart primary OSD) - (deep-scrub) - active+clean+inconsistent. This one at least logs: 2015-07-22 16:42:41.821326 osd.303 [INF] 55.10d deep-scrub starts 2015-07-22 16:42:41.823834 osd.303 [ERR] 55.10d deep-scrub stat mismatch, got 0/1 objects, 0/0 clones, 0/1 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 0/0 bytes,0/0 hit_set_archive bytes. 2015-07-22 16:42:41.823842 osd.303 [ERR] 55.10d deep-scrub 1 errors and this should be debuggable because there is only one object in the pool: tapetest 55 0 073575G 1 even though rados ls returns no objects: # rados ls -p tapetest # Any ideas? Cheers, Dan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Performance dégradation after upgrade to hammer
Hi Mark Yes enough PG and no error on Apache logs We identified some bottleneck on bucket index with huge IOPs on one OSD (IOPs is done on only 1 bucket) With bucket sharding (32) configured write IOPs us now 5x better (and after bucket delete/create). But we don't yet reach Firefly performance RedHat case in progress. I will share later with community Sent from my iPhone On 22 juil. 2015, at 08:20, Mark Nelson mnel...@redhat.com wrote: Ok, So good news that RADOS appears to be doing well. I'd say next is to follow some of the recommendations here: http://ceph.com/docs/master/radosgw/troubleshooting/ If you examine the objecter_requests and perfcounters during your cosbench write test, it might help explain where the requests are backing up. Another thing to look for (as noted in the above URL) are HTTP errors in the apache logs (if relevant). Other general thoughts: When you upgraded to hammer did you change the RGW configuration at all? Are you using civetweb now? Does the rgw.buckets pool have enough PGs? Mark On 07/21/2015 08:17 PM, Florent MONTHEL wrote: Hi Mark I've something like 600 write IOPs on EC pool and 800 write IOPs on replicated 3 pool with rados bench With Radosgw I have 30/40 write IOPs with Cosbench (1 radosgw- the same with 2) and servers are sleeping : - 0.005 core for radosgw process - 0.01 core for osd process I don't know if we can have .rgw* pool locking or something like that with Hammer (or situation specific to me) On 100% read profile, Radosgw and Ceph servers are working very well with more than 6000 IOPs on one radosgw server : - 7 cores for radosgw process - 1 core for each osd process - 0,5 core for each Apache process Thanks Sent from my iPhone On 14 juil. 2015, at 21:03, Mark Nelson mnel...@redhat.com wrote: Hi Florent, 10x degradation is definitely unusual! A couple of things to look at: Are 8K rados bench writes to the rgw.buckets pool slow? You can with something like: rados -p rgw.buckets bench 30 write -t 256 -b 8192 You may also want to try targeting a specific RGW server to make sure the RR-DNS setup isn't interfering (at least while debugging). It may also be worth creating a new replicated pool and try writes to that pool as well to see if you see much difference. Mark On 07/14/2015 07:17 PM, Florent MONTHEL wrote: Yes of course thanks Mark Infrastructure : 5 servers with 10 sata disks (50 osd at all) - 10gb connected - EC 2+1 on rgw.buckets pool - 2 radosgw RR-DNS like installed on 2 cluster servers No SSD drives used We're using Cosbench to send : - 8k object size : 100% read with 256 workers : better results with Hammer - 8k object size : 80% read - 20% write with 256 workers : real degradation between Firefly and Hammer (divided by something like 10) - 8k object size : 100% write with 256 workers : real degradation between Firefly and Hammer (divided by something like 10) Thanks Sent from my iPhone On 14 juil. 2015, at 19:57, Mark Nelson mnel...@redhat.com wrote: On 07/14/2015 06:42 PM, Florent MONTHEL wrote: Hi All, I've just upgraded Ceph cluster from Firefly 0.80.8 (Redhat Ceph 1.2.3) to Hammer (Redhat Ceph 1.3) - Usage : radosgw with Apache 2.4.19 on MPM prefork mode I'm experiencing huge write performance degradation just after upgrade (Cosbench). Do you already run performance tests between Hammer and Firefly ? No problem with read performance that was amazing Hi Florent, Can you talk a little bit about how your write tests are setup? How many concurrent IOs and what size? Also, do you see similar problems with rados bench? 
We have done some testing and haven't seen significant performance degradation except when switching to civetweb which appears to perform deletes more slowly than what we saw with apache+fcgi. Mark Sent from my iPhone ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
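The bucket index sharding Florent refers to is controlled by a radosgw option; roughly like this in ceph.conf, and note it only applies to buckets created after the change, which matches the bucket delete/create step mentioned above. The section name depends on how the gateway instance is configured:

  # in ceph.conf on the radosgw hosts:
  #   [client.radosgw.<instance>]
  #   rgw override bucket index max shards = 32
  service ceph-radosgw restart   # then recreate the bucket so the sharded index is used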
[ceph-users] load-gen throughput numbers
If I run rados load-gen with the following parameters: --num-objects 50 --max-ops 16 --min-object-size 4M --max-object-size 4M --min-op-len 4M --max-op-len 4M --percent 100 --target-throughput 2000 So every object is 4M in size and all the ops are reads of the entire 4M. I would assume this is equivalent to running rados bench rand on that pool if the pool has been previously filled with 50 4M objects. And I am assuming the --max-ops=16 is equivalent to having 16 concurrent threads in rados bench. And I have set the target throughput higher than is possible with my network. But when I run both rados load-gen and rados bench as described, I see that rados bench gets about twice the throughput of rados load-gen. Why would that be? I see there is a --max-backlog parameter, is there some setting of that parameter that would help the throughput? -- Tom Deneau ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] load-gen throughput numbers
Ah, I see that --max-backlog must be expressed in bytes/sec, in spite of what the --help message says. -- Tom -Original Message- From: Deneau, Tom Sent: Wednesday, July 22, 2015 5:09 PM To: 'ceph-users@lists.ceph.com' Subject: load-gen throughput numbers If I run rados load-gen with the following parameters: --num-objects 50 --max-ops 16 --min-object-size 4M --max-object-size 4M --min-op-len 4M --max-op-len 4M --percent 100 --target-throughput 2000 So every object is 4M in size and all the ops are reads of the entire 4M. I would assume this is equivalent to running rados bench rand on that pool if the pool has been previously filled with 50 4M objects. And I am assuming the --max-ops=16 is equivalent to having 16 concurrent threads in rados bench. And I have set the target throughput higher than is possible with my network. But when I run both rados load-gen and rados bench as described, I see that rados bench gets about twice the throughput of rados load-gen. Why would that be? I see there is a --max-backlog parameter, is there some setting of that parameter that would help the throughput? -- Tom Deneau ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
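In other words, for the target of 2000 MB/s the backlog has to be given in bytes per second for the generator to keep enough requests in flight; a rough sketch of the corrected invocation, with the pool name as a placeholder:

  rados -p <pool> load-gen --num-objects 50 --max-ops 16 \
      --min-object-size 4M --max-object-size 4M \
      --min-op-len 4M --max-op-len 4M --percent 100 \
      --target-throughput 2000 \
      --max-backlog $((2000 * 1024 * 1024))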
[ceph-users] Ceph KeyValueStore configuration settings
Hi, I'm rather new to ceph and I was trying to launch a test cluster with the Hammer release with the default OSD backend as KeyValueStore instead of FileStore. I am deploying my cluster using ceph-deploy. Can someone who has already done this please share the changes they have made for this? I am not able to see any documentation on the same. I apologize if this question has been asked previously. Thanks Sai ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph KeyValueStore configuration settings
I guess you only need to add "osd objectstore = keyvaluestore" and "enable experimental unrecoverable data corrupting features = keyvaluestore". And you need to know that keyvaluestore is an experimental backend now; it's not recommended to deploy it in a production env!

On Thu, Jul 23, 2015 at 7:13 AM, Sai Srinath Sundar-SSI sai.srin...@ssi.samsung.com wrote:
Hi, I'm rather new to ceph and I was trying to launch a test cluster with the Hammer release with the default OSD backend as KeyValueStore instead of FileStore. I am deploying my cluster using ceph-deploy. Can someone who has already done this please share the changes they have made for this? I am not able to see any documentation on the same. I apologize if this question has been asked previously.
Thanks, Sai

--
Best Regards,
Wheat

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
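Put together with ceph-deploy, the change would look something like this on the admin node before the OSDs are created; a sketch for a throwaway test cluster only, given the experimental status, and the host/disk names are placeholders:

  # added to ceph.conf in the ceph-deploy working directory:
  #   [osd]
  #   osd objectstore = keyvaluestore
  #   enable experimental unrecoverable data corrupting features = keyvaluestore
  ceph-deploy --overwrite-conf config push node1 node2 node3
  ceph-deploy osd create node1:sdb    # then create the OSDs as usual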
[ceph-users] PGs going inconsistent after stopping the primary
Hi Ceph community,

Env: hammer 0.94.2, Scientific Linux 6.6, kernel 2.6.32-431.5.1.el6.x86_64

We wanted to post here before the tracker to see if someone else has had this problem. We have a few PGs (different pools) which get marked inconsistent when we stop the primary OSD. The problem is strange because once we restart the primary, then scrub the PG, the PG is marked active+clean. But inevitably next time we stop the primary OSD, the same PG is marked inconsistent again. There is no user activity on this PG, and nothing interesting is logged in any of the 2nd/3rd OSDs (with debug_osd=20, the first line mentioning the PG already says inactive+inconsistent). We suspect this is related to garbage files left in the PG folder.

One of our PGs is acting basically like above, except it goes through this cycle: active+clean - (deep-scrub) - active+clean+inconsistent - (repair) - active+clean - (restart primary OSD) - (deep-scrub) - active+clean+inconsistent. This one at least logs:

2015-07-22 16:42:41.821326 osd.303 [INF] 55.10d deep-scrub starts
2015-07-22 16:42:41.823834 osd.303 [ERR] 55.10d deep-scrub stat mismatch, got 0/1 objects, 0/0 clones, 0/1 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 0/0 bytes,0/0 hit_set_archive bytes.
2015-07-22 16:42:41.823842 osd.303 [ERR] 55.10d deep-scrub 1 errors

and this should be debuggable because there is only one object in the pool:

tapetest 55 0 0 73575G 1

even though rados ls returns no objects:

# rados ls -p tapetest
#

Any ideas?

Cheers, Dan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] client io doing unrequested reads
1. Is the layout default, apart from the change to object_size? It is default. The only change I make is object_size and stripe_unit. I set both to the same value (i.e. stripe_count is 1 in all cases). 2. What version are the client and server? ceph version 0.94.1 3. Not really... are you using the fuse client? Enabling debug objecter = 10 on the client will give you a log that says what writes the client is doing. I am using the kernel module. Does this work with the kernel module? How can I set it up? 4. This is probably a client issue, so I would expect killing the client to get you out of it. You are absolutely right. It goes away when I reboot the client node. Thanks, Hadi On Tue, Jul 21, 2015 at 4:57 PM, John Spray john.sp...@redhat.com wrote: On 21/07/15 21:54, Hadi Montakhabi wrote: Hello Cephers, I am using CephFS, and running some benchmarks using fio. After increasing the object_size to 33554432, when I try to run some read and write tests with different block sizes, when I get to block size of 64m and beyond, Ceph does not finish the operation (I tried letting it run for more than a day at least three times). However, when I cancel the job and I expect to see no io operations, here is what I get: Is the layout default, apart from the change to object_size? What version are the client and server? [cephuser@node01 ~]$ ceph -s cluster b7beebf6-ea9f-4560-a916-a58e106c6e8e health HEALTH_OK monmap e3: 3 mons at {node02= 192.168.17.212:6789/0,node03=192.168.17.213:6789/0,node04=192.168.17.214:6789/0 } election epoch 8, quorum 0,1,2 node02,node03,node04 mdsmap e74: 1/1/1 up {0=node02=up:active} osdmap e324: 14 osds: 14 up, 14 in pgmap v155699: 768 pgs, 3 pools, 15285 MB data, 1772 objects 91283 MB used, 7700 GB / 7817 GB avail 768 active+clean * client io 2911 MB/s rd, 90 op/s* If I do ceph -w, it shows me that it is constantly doing reads, but I have no idea from where and when it would stop? I had to remove my CephFS file system and the associated pools and start things from scratch. 1. Any idea what is happening? Not really... are you using the fuse client? Enabling debug objecter = 10 on the client will give you a log that says what writes the client is doing. 2. When this happens, do you know a better way to get out of the situation without destroying the filesystem and the pools? This is probably a client issue, so I would expect killing the client to get you out of it. Cheers, John ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
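On the logging question: as far as I know the "debug objecter" setting John mentions only applies to the userspace client (ceph-fuse / librados); the kernel module does not read ceph.conf debug levels, so to capture this kind of log one would mount with ceph-fuse. A sketch of the client-side config, with the log path as an assumption:

  # added to /etc/ceph/ceph.conf on the client before mounting with ceph-fuse:
  #   [client]
  #   debug objecter = 10
  #   debug client = 10
  #   log file = /var/log/ceph/$name.$pid.log
  ceph-fuse -m <mon-host>:6789 /mnt/cephfs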
Re: [ceph-users] osd_agent_max_ops relating to number of OSDs in the cache pool
On Sat, Jul 18, 2015 at 10:25 PM, Nick Fisk n...@fisk.me.uk wrote: Hi All, I’m doing some testing on the new High/Low speed cache tiering flushing and I’m trying to get my head round the effect that changing these 2 settings have on the flushing speed. When setting the osd_agent_max_ops to 1, I can get up to 20% improvement before the osd_agent_max_high_ops value kicks in for high speed flushing. Which is great for bursty workloads. As I understand it, these settings loosely effect the number of concurrent operations the cache pool OSD’s will flush down to the base pool. I may have got completely the wrong idea in my head but I can’t understand how a static default setting will work with different cache/base ratios. For example if I had a relatively small number of very fast cache tier OSD’s (PCI-E SSD perhaps) and a much larger number of base tier OSD’s, would the value need to be increased to ensure sufficient utilisation of the base tier and make sure that the cache tier doesn’t fill up too fast? Alternatively where the cache tier is based on spinning disks or where the base tier is not as comparatively large, this value may need to be reduced to stop it saturating the disks. Any Thoughts? I'm not terribly familiar with these exact values, but I think you've got it right. We can't make decisions at the level of the entire cache pool (because sharing that information isn't feasible), so we let you specify it on a per-OSD basis according to what setup you have. I've no idea if anybody has gathered up a matrix of baseline good settings or not. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
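Since these are per-OSD options, they can be experimented with at runtime and made permanent once a good value for a given cache/base ratio is found; the option names are the ones discussed above, and the values are only an example to tune from:

  # runtime, on all cache-tier OSDs (or name them individually)
  ceph tell osd.* injectargs '--osd_agent_max_ops 2 --osd_agent_max_high_ops 8'
  # check what a given OSD is currently using
  ceph daemon osd.<id> config show | grep osd_agent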