[ceph-users] Low speed of write to cephfs
Hello all,
Has anybody tried to use CephFS?

I have two servers with RHEL 7.1 (latest kernel 3.10.0-229.14.1.el7.x86_64).
Each server has 15G of flash for the Ceph journal and 12 x 2TB SATA disks for data.
I have an InfiniBand (IPoIB) 56Gb/s interconnect between the nodes.

Cluster version
# ceph -v
ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)

Cluster config
# cat /etc/ceph/ceph.conf
[global]
auth service required = cephx
auth client required = cephx
auth cluster required = cephx
fsid = 0f05deaf-ee6f-4342-b589-5ecf5527aa6f
mon osd full ratio = .95
mon osd nearfull ratio = .90
osd pool default size = 2
osd pool default min size = 1
osd pool default pg num = 32
osd pool default pgp num = 32
max open files = 131072
osd crush chooseleaf type = 1
[mds]

[mds.a]
host = ak34

[mon]
mon_initial_members = a,b

[mon.a]
host = ak34
mon addr = 172.24.32.134:6789

[mon.b]
host = ak35
mon addr = 172.24.32.135:6789

[osd]
osd journal size = 1000

[osd.0]
osd uuid = b3b3cd37-8df5-4455-8104-006ddba2c443
host = ak34
public addr = 172.24.32.134
osd journal = /CEPH_JOURNAL/osd/ceph-0/journal
.

Cluster tree
# ceph osd tree
ID WEIGHT   TYPE NAME                         UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 45.75037 root default
-2 45.75037     region RU
-3 45.75037         datacenter ru-msk-ak48t
-4 22.87518             host ak34
 0  1.90627                 osd.0             up      1.0      1.0
 1  1.90627                 osd.1             up      1.0      1.0
 2  1.90627                 osd.2             up      1.0      1.0
 3  1.90627                 osd.3             up      1.0      1.0
 4  1.90627                 osd.4             up      1.0      1.0
 5  1.90627                 osd.5             up      1.0      1.0
 6  1.90627                 osd.6             up      1.0      1.0
 7  1.90627                 osd.7             up      1.0      1.0
 8  1.90627                 osd.8             up      1.0      1.0
 9  1.90627                 osd.9             up      1.0      1.0
10  1.90627                 osd.10            up      1.0      1.0
11  1.90627                 osd.11            up      1.0      1.0
-5 22.87518             host ak35
12  1.90627                 osd.12            up      1.0      1.0
13  1.90627                 osd.13            up      1.0      1.0
14  1.90627                 osd.14            up      1.0      1.0
15  1.90627                 osd.15            up      1.0      1.0
16  1.90627                 osd.16            up      1.0      1.0
17  1.90627                 osd.17            up      1.0      1.0
18  1.90627                 osd.18            up      1.0      1.0
19  1.90627                 osd.19            up      1.0      1.0
20  1.90627                 osd.20            up      1.0      1.0
21  1.90627                 osd.21            up      1.0      1.0
22  1.90627                 osd.22            up      1.0      1.0
23  1.90627                 osd.23            up      1.0      1.0

Status of cluster
# ceph -s
    cluster 0f05deaf-ee6f-4342-b589-5ecf5527aa6f
     health HEALTH_OK
     monmap e1: 2 mons at {a=172.24.32.134:6789/0,b=172.24.32.135:6789/0}
            election epoch 10, quorum 0,1 a,b
     mdsmap e14: 1/1/1 up {0=a=up:active}
     osdmap e194: 24 osds: 24 up, 24 in
      pgmap v2305: 384 pgs, 3 pools, 271 GB data, 72288 objects
            545 GB used, 44132 GB / 44678 GB avail
                 384 active+clean

Pools for cephfs
# ceph osd dump|grep pg
pool 1 'cephfs_data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 154 flags hashpspool crash_replay_interval 45 stripe_width 0
pool 2 'cephfs_metadata' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 144 flags hashpspool stripe_width 0

Rados bench
# rados bench -p cephfs_data 300 write --no-cleanup && rados bench -p cephfs_data 300 seq
 Maintaining 16 concurrent writes of 4194304 bytes for up to 300 seconds or 0 objects
 Object prefix: benchmark_data__8108
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     1      16       170       154    615.74       616
Re: [ceph-users] Low speed of write to cephfs
Hello John,

Yes, of course the write speed rises, because we are increasing the amount of data written per disk operation.
But do you know of at least one piece of software that writes data in 1MB blocks? I don't know of any, and neither do you.

Simple test: dd to an ordinary 2TB SATA disk

# dd if=/dev/zero|pv|dd oflag=direct of=/dev/sdi bs=4k count=1M
4GiB 0:00:46 [87.2MiB/s] [ <=> ]
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 46.9688 s, 91.4 MB/s

# dd if=/dev/zero|pv|dd oflag=direct of=/dev/sdi bs=32k count=10k
dd: warning: partial read (24576 bytes); suggest iflag=fullblock
319MiB 0:00:03 [ 103MiB/s] [ <=> ]
10219+21 records in
10219+21 records out
335262720 bytes (335 MB) copied, 3.15001 s, 106 MB/s

A single SATA disk achieves a better rate than CephFS, which consists of 24 of the same disks.

--
Best Regards,
Stanislav Butkeev

15.10.2015, 21:49, "John Spray" <jsp...@redhat.com>:
> On Thu, Oct 15, 2015 at 5:11 PM, Butkeev Stas <staer...@ya.ru> wrote:
>> Hello all,
>> Does anybody try to use cephfs?
>>
>> I have two servers with RHEL7.1(latest kernel 3.10.0-229.14.1.el7.x86_64).
>> Each server has 15G flash for ceph journal and 12*2Tb SATA disk for data.
>> I have Infiniband(ipoib) 56Gb/s interconnect between nodes.
Re: [ceph-users] Low speed of write to cephfs
Hello Max,

It is a 15G SCSI disk that is exported from a flash array to the server.

# multipath -ll
X dm-3 XX
size=15G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 5:0:0:2 sdp 8:240 active ready running
  |- 4:0:0:2 sdq 65:0  active ready running
  |- 6:0:0:2 sds 65:32 active ready running
  `- 7:0:0:2 sdu 65:64 active ready running

In the config you can see the option "osd journal size = 1000". I use 12G on each node for the Ceph journals.
For example:

# ls -l /CEPH_JOURNAL/*/*
/CEPH_JOURNAL/osd/ceph-0:
total 1024000
-rw-r--r-- 1 root root 1048576000 Oct 15 19:03 journal

/CEPH_JOURNAL/osd/ceph-1:
total 1024000
-rw-r--r-- 1 root root 1048576000 Oct 15 19:03 journal

/CEPH_JOURNAL/osd/ceph-10:
total 1024000
-rw-r--r-- 1 root root 1048576000 Oct 15 19:04 journal

/CEPH_JOURNAL/osd/ceph-11:
total 1024000
-rw-r--r-- 1 root root 1048576000 Oct 15 19:03 journal

/CEPH_JOURNAL/osd/ceph-2:
total 1024000
-rw-r--r-- 1 root root 1048576000 Oct 15 19:03 journal

/CEPH_JOURNAL/osd/ceph-3:
total 1024000
-rw-r--r-- 1 root root 1048576000 Oct 15 19:03 journal
...

--
Best Regards,
Stanislav Butkeev

15.10.2015, 23:26, "Max Yehorov" <myeho...@skytap.com>:
> Stas,
>
> as you said: "Each server has 15G flash for ceph journal and 12*2Tb
> SATA disk for"
>
> What is this 15G flash and is it used for all 12 SATA drives?
>
> On Thu, Oct 15, 2015 at 1:05 PM, John Spray <jsp...@redhat.com> wrote:
>> On Thu, Oct 15, 2015 at 8:46 PM, Butkeev Stas <staer...@ya.ru> wrote:
>>> Thank you for your comment. I know what does mean option oflag=direct and
>>> other things about stress testing.
>>> Unfortunately speed is very slow for this cluster FS.
>>>
>>> The same test on another cluster FS(GPFS) which consist of 4 disks
>>>
>>> # dd if=/dev/zero|pv|dd oflag=direct of=9 bs=4k count=10k
>>> 40.1MB 0:00:05 [7.57MB/s] [ <=> ]
>>> 10240+0 records in
>>> 10240+0 records out
>>> 41943040 bytes (42 MB) copied, 5.27816 s, 7.9 MB/s
>>>
>>> I hope that I miss some options during configuration or something else.
>>
>> I don't know much about GPFS internals, since it's proprietary, but a
>> quick google brings us here:
>>
>> http://www-01.ibm.com/support/knowledgecenter/SSFKCN_4.1.0.4/com.ibm.cluster.gpfs.v4r104.gpfs100.doc/bl1adm_considerations_direct_io.htm
>>
>> It appears that GPFS only respects O_DIRECT in certain circumstances,
>> and in some circumstances will use their "pagepool" cache even when
>> direct IO is requested. You would probably need to check with IBM to
>> work out exactly whether true direct IO is happening when you run on
>> GPFS.
>>
>> John
>>
>>> --
>>> Best Regards,
>>> Stanislav Butkeev
>>>
>>> 15.10.2015, 22:36, "John Spray" <jsp...@redhat.com>:
>>>> On Thu, Oct 15, 2015 at 8:17 PM, Butkeev Stas <staer...@ya.ru> wrote:
>>>>> Hello John
>>>>>
>>>>> Yes, of course, write speed is rising, because we are increasing amount
>>>>> of data per one operation by disk.
>>>>> But, do you know at least one software which write data by 1Mb blocks?
>>>>> I don't know, you too.
>>>>
>>>> Plenty of applications do large writes, especially if they're intended
>>>> for use on network filesystems.
>>>>
>>>> When you pass oflag=direct, you are asking the kernel to send these
>>>> writes individually instead of aggregating them in the page cache.
>>>> What you're measuring here is effectively the issue rate of small
>>>> messages, rather than the speed at which data can be written to ceph.
>>>>
>>>> Try the same benchmark with NFS, you'll get a similar scaling with block
>>>> size.
>>>>
>>>> Cheers,
>>>> John
>>>>
>>>> If you want to aggregate these writes in the page cache before sending
>>>> them over the network, I imagine you probably need to disable direct
>>>> IO.
>>>>
>>>>> Simple test: dd to common 2Tb SATA disk
>>>>>
>>>>> # dd if=/dev/zero|pv|dd oflag=direct of=/dev/sdi bs=4k count=1M
>>>>> 4GiB 0:00:46 [87.2MiB/s] [ <=> ]
>>>>> 1048576+0 records in
>>>>> 1048576+0 records out
>>>>> 4294967296 bytes (4.3 GB) copied, 46.9688 s, 91.4 MB/s
>>>>>
>>>>> # dd if=/dev/zero|pv|dd ofl
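
The journal sizing mentioned in this message ("osd journal size = 1000", 12G per node, on a 15G flash device) can be checked with a little arithmetic. The sketch below is only an illustration in Python: the constants are copied from the `ls -l` and `multipath -ll` output in the message, and treating "15G" as GiB is an assumption.

```python
# Figures copied from the message; the GiB interpretation of "15G" is an assumption.
JOURNAL_FILE_BYTES = 1_048_576_000  # size of each journal file in the ls -l output
OSDS_PER_NODE = 12                  # 12 SATA disks, one OSD journal each
FLASH_DEVICE_GIB = 15               # "size=15G" reported by multipath -ll


def journal_usage(file_bytes: int, osd_count: int) -> tuple[int, float]:
    """Return the total journal space on one node in bytes and in GiB."""
    total = file_bytes * osd_count
    return total, total / 2**30


total_bytes, total_gib = journal_usage(JOURNAL_FILE_BYTES, OSDS_PER_NODE)
per_journal_mib = JOURNAL_FILE_BYTES / 2**20

print(f"each journal: {per_journal_mib:.0f} MiB")
print(f"all journals: {total_bytes} bytes ({total_gib:.2f} GiB)")
print(f"left on the flash device: {FLASH_DEVICE_GIB - total_gib:.2f} GiB")
```

Each journal file is exactly 1000 MiB, which matches the configured journal size, and twelve of them fit on the 15G device with a few GiB to spare.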
Re: [ceph-users] Low speed of write to cephfs
Yes, GPFS uses the "pagepool" option for caching I/O. But my cluster currently uses only a tiny piece of memory for caching, so we can consider that this cluster doesn't use a cache.

# mmlsconfig
Configuration data for cluster XX:
-
myNodeConfigNumber 3
clusterName ebs.ak315t.c2
clusterId 1764239962949993
autoload no
pagepool 1k
dmapiFileHandleSize 32
minReleaseLevel 3.5.0.11
verbsPorts mlx4_0/1 mlx4_0/2
verbsRdma enable
adminMode central

File systems in cluster XX:
--
/dev/X

--
Best Regards,
Stanislav Butkeev

15.10.2015, 23:05, "John Spray" <jsp...@redhat.com>:
> On Thu, Oct 15, 2015 at 8:46 PM, Butkeev Stas <staer...@ya.ru> wrote:
>> Thank you for your comment. I know what does mean option oflag=direct and
>> other things about stress testing.
>> Unfortunately speed is very slow for this cluster FS.
>>
>> The same test on another cluster FS(GPFS) which consist of 4 disks
>>
>> # dd if=/dev/zero|pv|dd oflag=direct of=9 bs=4k count=10k
>> 40.1MB 0:00:05 [7.57MB/s] [ <=> ]
>> 10240+0 records in
>> 10240+0 records out
>> 41943040 bytes (42 MB) copied, 5.27816 s, 7.9 MB/s
>>
>> I hope that I miss some options during configuration or something else.
>
> I don't know much about GPFS internals, since it's proprietary, but a
> quick google brings us here:
> http://www-01.ibm.com/support/knowledgecenter/SSFKCN_4.1.0.4/com.ibm.cluster.gpfs.v4r104.gpfs100.doc/bl1adm_considerations_direct_io.htm
>
> It appears that GPFS only respects O_DIRECT in certain circumstances,
> and in some circumstances will use their "pagepool" cache even when
> direct IO is requested. You would probably need to check with IBM to
> work out exactly whether true direct IO is happening when you run on
> GPFS.
Re: [ceph-users] Low speed of write to cephfs
Thank you for your comment. I know what the oflag=direct option means, as well as other things about stress testing.
Unfortunately, the speed is very slow for this cluster FS.

The same test on another cluster FS (GPFS), which consists of 4 disks:

# dd if=/dev/zero|pv|dd oflag=direct of=9 bs=4k count=10k
40.1MB 0:00:05 [7.57MB/s] [ <=> ]
10240+0 records in
10240+0 records out
41943040 bytes (42 MB) copied, 5.27816 s, 7.9 MB/s

I hope that I have missed some option during configuration, or something else.

--
Best Regards,
Stanislav Butkeev

15.10.2015, 22:36, "John Spray" <jsp...@redhat.com>:
> On Thu, Oct 15, 2015 at 8:17 PM, Butkeev Stas <staer...@ya.ru> wrote:
>> Hello John
>>
>> Yes, of course, write speed is rising, because we are increasing amount of
>> data per one operation by disk.
>> But, do you know at least one software which write data by 1Mb blocks? I
>> don't know, you too.
>
> Plenty of applications do large writes, especially if they're intended
> for use on network filesystems.
>
> When you pass oflag=direct, you are asking the kernel to send these
> writes individually instead of aggregating them in the page cache.
> What you're measuring here is effectively the issue rate of small
> messages, rather than the speed at which data can be written to ceph.
>
> Try the same benchmark with NFS, you'll get a similar scaling with block size.
>
> Cheers,
> John
>
> If you want to aggregate these writes in the page cache before sending
> them over the network, I imagine you probably need to disable direct
> IO.
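
John's point in this thread, that oflag=direct measures the issue rate of small writes rather than raw bandwidth, can be made concrete with the dd figures quoted in these messages. The Python sketch below is only an illustration: it assumes dd issues one write at a time, so that elapsed time divided by the number of writes approximates the latency of a single write. The function names and the 1 ms latency used at the end are hypothetical, not measurements from this cluster.

```python
def per_write_stats(total_bytes: int, seconds: float, writes: int) -> tuple[float, float, float]:
    """Return (latency per write in ms, writes per second, MB/s) for a serial dd run."""
    latency_ms = seconds / writes * 1000
    writes_per_s = writes / seconds
    mb_per_s = total_bytes / seconds / 1e6
    return latency_ms, writes_per_s, mb_per_s


def serial_throughput_mb_s(block_bytes: int, latency_s: float) -> float:
    """Throughput when each write must complete before the next is issued."""
    return block_bytes / latency_s / 1e6


# Figures from the dd runs quoted in the thread (bs=4k in both cases).
gpfs = per_write_stats(41_943_040, 5.27816, 10_240)
sata = per_write_stats(4_294_967_296, 46.9688, 1_048_576)

print("GPFS: %.3f ms per write, %.0f writes/s, %.1f MB/s" % gpfs)
print("SATA: %.3f ms per write, %.0f writes/s, %.1f MB/s" % sata)

# Hypothetical: with a fixed 1 ms round trip per write, throughput grows
# linearly with block size, which is the scaling John describes.
for block in (4 * 1024, 32 * 1024, 1024 * 1024):
    print(block, "bytes ->", round(serial_throughput_mb_s(block, 0.001), 1), "MB/s")
```

Under this simplification the 7.9 MB/s GPFS result corresponds to roughly half a millisecond per 4k write, so the number says more about per-operation latency than about how fast the disks can absorb data.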
[ceph-users] problem with RGW
Hello everybody,

We have a Ceph cluster that consists of 8 hosts with 12 OSDs per host. The disks are 2T SATA.

[13:23]:[root@se087 ~]# ceph osd tree
ID  WEIGHT    TYPE NAME                          UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -1 182.99203 root default
 -2 182.99203     region RU
 -3  91.49487         datacenter ru-msk-comp1p
 -9  22.87500             host 1
 48   1.90599                 osd.48             up      1.0      1.0
 49   1.90599                 osd.49             up      1.0      1.0
 50   1.90599                 osd.50             up      1.0      1.0
 51   1.90599                 osd.51             up      1.0      1.0
 52   1.90599                 osd.52             up      1.0      1.0
 53   1.90599                 osd.53             up      1.0      1.0
 54   1.90599                 osd.54             up      1.0      1.0
 55   1.90599                 osd.55             up      1.0      1.0
 56   1.90599                 osd.56             up      1.0      1.0
 57   1.90599                 osd.57             up      1.0      1.0
 58   1.90599                 osd.58             up      1.0      1.0
 59   1.90599                 osd.59             up      1.0      1.0
-10  22.87216             host 2
 60   1.90599                 osd.60             up      1.0      1.0
 61   1.90599                 osd.61             up      1.0      1.0
 62   1.90599                 osd.62             up      1.0      1.0
 63   1.90599                 osd.63             up      1.0      1.0
 64   1.90599                 osd.64             up      1.0      1.0
 65   1.90599                 osd.65             up      1.0      1.0
 66   1.90599                 osd.66             up      1.0      1.0
 67   1.90599                 osd.67             up      1.0      1.0
 69   1.90599                 osd.69             up      1.0      1.0
 70   1.90599                 osd.70             up      1.0      1.0
 71   1.90599                 osd.71             up      1.0      1.0
 68   1.90627                 osd.68             up      1.0      1.0
-11  22.87500             host 3
 72   1.90599                 osd.72             up      1.0      1.0
 73   1.90599                 osd.73             up      1.0      1.0
 74   1.90599                 osd.74             up      1.0      1.0
 75   1.90599                 osd.75             up      1.0      1.0
 76   1.90599                 osd.76             up      1.0      1.0
 77   1.90599                 osd.77             up      1.0      1.0
 78   1.90599                 osd.78             up      1.0      1.0
 79   1.90599                 osd.79             up      1.0      1.0
 80   1.90599                 osd.80             up      1.0      1.0
 81   1.90599                 osd.81             up      1.0      1.0
 82   1.90599                 osd.82             up      1.0      1.0
 83   1.90599                 osd.83             up      1.0      1.0
-12  22.87271             host 4
 84   1.90599                 osd.84             up      1.0      1.0
 86   1.90599                 osd.86             up      1.0      1.0
 89   1.90599                 osd.89             up      1.0      1.0
 90   1.90599                 osd.90             up      1.0      1.0
 91   1.90599                 osd.91             up      1.0      1.0
 92   1.90599                 osd.92             up      1.0      1.0
 93   1.90599                 osd.93             up      1.0      1.0
 94   1.90599                 osd.94             up      1.0      1.0
 95   1.90599                 osd.95             up      1.0      1.0
 85   1.90627                 osd.85             up      1.0      1.0
 88   1.90627                 osd.88             up      1.0      1.0
 87   1.90627                 osd.87             up      1.0      1.0
 -4  91.49716         datacenter ru-msk-vol51
 -5  22.87216             host 5
  1   1.90599                 osd.1              up
[ceph-users] Problems with shadow objects
Hello all,

I have a Ceph + RGW installation and have some problems with shadow objects.

For example:
# rados ls -p .rgw.buckets|grep default.4507.1
.
default.4507.1__shadow_test_s3.2/2vO4WskQNBGMnC8MGaYPSLfGkhQY76U.1_5
default.4507.1__shadow_test_s3.2/2vO4WskQNBGMnC8MGaYPSLfGkhQY76U.2_2
default.4507.1__shadow_test_s3.2/2vO4WskQNBGMnC8MGaYPSLfGkhQY76U.6_4
default.4507.1__shadow_test_s3.2/2vO4WskQNBGMnC8MGaYPSLfGkhQY76U.4_2
default.4507.1__shadow_test_s3.2/2vO4WskQNBGMnC8MGaYPSLfGkhQY76U.3_5
.

Please give me some advice and answer my questions:
1) How can I remove these shadow files?
2) What do the names of these shadow files mean?

Example with a normal object:
# radosgw-admin object stat --bucket=dev --object=RegExp_tutorial.png
and I receive information about this object.

With a shadow object (default.4507.1_ is the bucket id):
# radosgw-admin object stat --bucket=dev --object=_shadow_test_s3.2/2vO4WskQNBGMnC8MGaYPSLfGkhQY76U.2_7
ERROR: failed to stat object, returned error: (2) No such file or directory

How can I determine the name of this object?

--
Best Regards,
Stanislav Butkeev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
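
The way the message splits a RADOS object name into a bucket id and an object name can be written down as a small helper. This Python sketch only mirrors the pattern visible in the `rados ls` listing (bucket id, one underscore, then the rest); it is an assumption drawn from that listing, not a description of how RGW itself interprets these names, and `strip_bucket_prefix` and `looks_like_shadow` are hypothetical helper names.

```python
from typing import Optional


def strip_bucket_prefix(rados_name: str, bucket_id: str) -> Optional[str]:
    """Return the part of a RADOS object name after '<bucket_id>_', or None if it does not match."""
    prefix = bucket_id + "_"
    if not rados_name.startswith(prefix):
        return None
    return rados_name[len(prefix):]


def looks_like_shadow(remainder: str) -> bool:
    """True when the remainder carries the '_shadow_' marker seen in the listing."""
    return remainder.startswith("_shadow_")


# Object name copied from the listing in the message.
name = "default.4507.1__shadow_test_s3.2/2vO4WskQNBGMnC8MGaYPSLfGkhQY76U.1_5"
remainder = strip_bucket_prefix(name, "default.4507.1")
print(remainder)
print(looks_like_shadow(remainder))
```

The remainder produced here has the same form as the string passed to `--object` in the failing `radosgw-admin object stat` call above.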
[ceph-users] Problems with pgs incomplete
Hi all,

I have a Ceph cluster + RGW. I now have a problem with one of the OSDs: it is down.
I checked the Ceph status and see this information:

[root@node-1 ceph-0]# ceph -s
    cluster fc8c3ecc-ccb8-4065-876c-dc9fc992d62d
     health HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean
     monmap e1: 3 mons at {a=10.29.226.39:6789/0,b=10.29.226.29:6789/0,c=10.29.226.40:6789/0}, election epoch 294, quorum 0,1,2 b,a,c
     osdmap e418: 6 osds: 5 up, 5 in
      pgmap v23588: 312 pgs, 16 pools, 141 kB data, 594 objects
            5241 MB used, 494 GB / 499 GB avail
                 308 active+clean
                   4 incomplete

Why do I have 4 incomplete pgs in the pool .rgw.buckets if it has replicated size 2 and min_size 2?

My osd tree:
[root@node-1 ceph-0]# ceph osd tree
# id    weight  type name                       up/down reweight
-1      4       root croc
-2      4           region ru
-4      3               datacenter vol-5
-5      1                   host node-1
0       1                       osd.0           down    0
-6      1                   host node-2
1       1                       osd.1           up      1
-7      1                   host node-3
2       1                       osd.2           up      1
-3      1               datacenter comp
-8      1                   host node-4
3       1                       osd.3           up      1
-9      1                   host node-5
4       1                       osd.4           up      1
-10     1                   host node-6
5       1                       osd.5           up      1

Additional information:

[root@node-1 ceph-0]# ceph health detail
HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean
pg 13.6 is stuck inactive for 1547.665758, current state incomplete, last acting [1,3]
pg 13.4 is stuck inactive for 1547.652111, current state incomplete, last acting [1,2]
pg 13.5 is stuck inactive for 4502.009928, current state incomplete, last acting [1,3]
pg 13.2 is stuck inactive for 4501.979770, current state incomplete, last acting [1,3]
pg 13.6 is stuck unclean for 4501.969914, current state incomplete, last acting [1,3]
pg 13.4 is stuck unclean for 4502.001114, current state incomplete, last acting [1,2]
pg 13.5 is stuck unclean for 4502.009942, current state incomplete, last acting [1,3]
pg 13.2 is stuck unclean for 4501.979784, current state incomplete, last acting [1,3]
pg 13.2 is incomplete, acting [1,3] (reducing pool .rgw.buckets min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 13.6 is incomplete, acting [1,3] (reducing pool .rgw.buckets min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 13.4 is incomplete, acting [1,2] (reducing pool .rgw.buckets min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 13.5 is incomplete, acting [1,3] (reducing pool .rgw.buckets min_size from 2 may help; search ceph.com/docs for 'incomplete')

[root@node-1 ceph-0]# ceph osd dump | grep 'pool'
pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 1 '.rgw.root' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 34 owner 18446744073709551615 flags hashpspool stripe_width 0
pool 2 '.rgw.control' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 36 owner 18446744073709551615 flags hashpspool stripe_width 0
pool 3 '.rgw' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 38 owner 18446744073709551615 flags hashpspool stripe_width 0
pool 4 '.rgw.gc' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 39 flags hashpspool stripe_width 0
pool 5 '.users.uid' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 40 owner 18446744073709551615 flags hashpspool stripe_width 0
pool 6 '.log' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 42 owner 18446744073709551615 flags hashpspool stripe_width 0
pool 7 '.users' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 44 flags hashpspool stripe_width 0
pool 8 '.users.swift' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 46 flags hashpspool stripe_width 0
pool 9 '.usage' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 48 flags hashpspool stripe_width 0
pool 10 'test' replicated size 2 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 136 pgp_num 136 last_change 68 flags hashpspool stripe_width 0
pool 11 '.rgw.buckets.index' replicated size 3 min_size 2 crush_ruleset 0
Re: [ceph-users] Problems with pgs incomplete
Thank you Lionel,

Indeed, I had forgotten about size and min_size. I set min_size to 1 and my cluster is up now. I have deleted the crashed OSD and set size to 3 and min_size to 2.

---
With regards,
Stanislav

01.12.2014, 19:15, Lionel Bouton lionel-subscript...@bouton.name:

On 01/12/2014 17:08, Lionel Bouton wrote:

I may be wrong here (I'm surprised you only have 4 incomplete pgs, I'd expect ~1/3rd of your pgs to be incomplete given your ceph osd tree output) but reducing min_size to 1 should be harmless and should unfreeze the recovering process.

Ignore this part: I wasn't paying enough attention to the osd tree output and mixed osd/host levels. Others have pointed out that you have size = 3 for some pools. In this case you might have lost an OSD before a previous recovering process finished, which would explain your current state (in this case my earlier advice still applies).

Best regards,
Lionel
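
The interaction between size and min_size behind this fix can be illustrated with a toy model. The Python sketch below is a deliberate simplification and not Ceph code: it assumes a placement group serves I/O only while the number of replicas still available is at least min_size, and it ignores peering and recovery details, which also play a role in the "incomplete" state. The function names are hypothetical.

```python
def replicas_available(size: int, failed_osds: int) -> int:
    """Replicas left for a PG after some of the OSDs holding it have failed."""
    return max(size - failed_osds, 0)


def pg_accepts_io(available: int, min_size: int) -> bool:
    """Toy rule: a PG serves I/O only with at least min_size replicas available."""
    return available >= min_size


# Pool with size 2 and min_size 2, one OSD down (the situation in the thread).
print(pg_accepts_io(replicas_available(2, 1), min_size=2))  # False

# Same pool after lowering min_size to 1.
print(pg_accepts_io(replicas_available(2, 1), min_size=1))  # True

# The final setting from the thread: size 3, min_size 2 tolerates one failure.
print(pg_accepts_io(replicas_available(3, 1), min_size=2))  # True
```

With size equal to min_size, losing a single replica is enough to block the affected PGs, which is why the size 3 / min_size 2 setting chosen at the end of the thread leaves room for one OSD failure.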