[ceph-users] Low speed of write to cephfs

2015-10-15 Thread Butkeev Stas
Hello all,
Has anybody here tried using CephFS?

I have two servers running RHEL 7.1 (latest kernel 3.10.0-229.14.1.el7.x86_64). Each server has 15G of flash for the Ceph journals and 12 * 2Tb SATA disks for data.
The nodes are interconnected with 56Gb/s InfiniBand (IPoIB).


Cluster version
# ceph -v
ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)

Cluster config
# cat /etc/ceph/ceph.conf 
[global]
auth service required = cephx
auth client required = cephx
auth cluster required = cephx
fsid = 0f05deaf-ee6f-4342-b589-5ecf5527aa6f
mon osd full ratio = .95
mon osd nearfull ratio = .90
osd pool default size = 2
osd pool default min size = 1
osd pool default pg num = 32
osd pool default pgp num = 32
max open files = 131072
osd crush chooseleaf type = 1
[mds]

[mds.a]
host = ak34

[mon]
mon_initial_members = a,b

[mon.a]
host = ak34
mon addr  = 172.24.32.134:6789

[mon.b]
host = ak35
mon addr  = 172.24.32.135:6789

[osd]
osd journal size = 1000

[osd.0]
osd uuid = b3b3cd37-8df5-4455-8104-006ddba2c443
host = ak34
public addr  = 172.24.32.134
osd journal = /CEPH_JOURNAL/osd/ceph-0/journal
.


Below is the cluster tree
# ceph osd tree
ID WEIGHT   TYPE NAME                   UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 45.75037 root default
-2 45.75037     region RU
-3 45.75037         datacenter ru-msk-ak48t
-4 22.87518             host ak34
 0  1.90627                 osd.0       up      1.0      1.0
 1  1.90627                 osd.1       up      1.0      1.0
 2  1.90627                 osd.2       up      1.0      1.0
 3  1.90627                 osd.3       up      1.0      1.0
 4  1.90627                 osd.4       up      1.0      1.0
 5  1.90627                 osd.5       up      1.0      1.0
 6  1.90627                 osd.6       up      1.0      1.0
 7  1.90627                 osd.7       up      1.0      1.0
 8  1.90627                 osd.8       up      1.0      1.0
 9  1.90627                 osd.9       up      1.0      1.0
10  1.90627                 osd.10      up      1.0      1.0
11  1.90627                 osd.11      up      1.0      1.0
-5 22.87518             host ak35
12  1.90627                 osd.12      up      1.0      1.0
13  1.90627                 osd.13      up      1.0      1.0
14  1.90627                 osd.14      up      1.0      1.0
15  1.90627                 osd.15      up      1.0      1.0
16  1.90627                 osd.16      up      1.0      1.0
17  1.90627                 osd.17      up      1.0      1.0
18  1.90627                 osd.18      up      1.0      1.0
19  1.90627                 osd.19      up      1.0      1.0
20  1.90627                 osd.20      up      1.0      1.0
21  1.90627                 osd.21      up      1.0      1.0
22  1.90627                 osd.22      up      1.0      1.0
23  1.90627                 osd.23      up      1.0      1.0

Status of cluster
# ceph -s
cluster 0f05deaf-ee6f-4342-b589-5ecf5527aa6f
 health HEALTH_OK
 monmap e1: 2 mons at {a=172.24.32.134:6789/0,b=172.24.32.135:6789/0}
election epoch 10, quorum 0,1 a,b
 mdsmap e14: 1/1/1 up {0=a=up:active}
 osdmap e194: 24 osds: 24 up, 24 in
  pgmap v2305: 384 pgs, 3 pools, 271 GB data, 72288 objects
545 GB used, 44132 GB / 44678 GB avail
 384 active+clean


Pools for cephfs
# ceph osd dump|grep pg
pool 1 'cephfs_data' replicated size 2 min_size 1 crush_ruleset 0 object_hash 
rjenkins pg_num 256 pgp_num 256 last_change 154 flags hashpspool 
crash_replay_interval 45 stripe_width 0
pool 2 'cephfs_metadata' replicated size 2 min_size 1 crush_ruleset 0 
object_hash rjenkins pg_num 64 pgp_num 64 last_change 144 flags hashpspool 
stripe_width 0

Rados bench
# rados bench -p cephfs_data 300 write --no-cleanup && rados bench -p cephfs_data 300 seq
 Maintaining 16 concurrent writes of 4194304 bytes for up to 300 seconds or 0 objects
 Object prefix: benchmark_data__8108
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     1      16       170       154    615.74       616
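
(A hedged aside: to compare more directly with the 4 KB dd tests in the replies below, rados bench can also be run with a matching write size; -b sets the per-write size in bytes and -t the number of concurrent operations.)

# rados bench -p cephfs_data 60 write -t 16 -b 4096 --no-cleanup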

Re: [ceph-users] Low speed of write to cephfs

2015-10-15 Thread Butkeev Stas
Hello John

Yes, of course the write speed rises, because we are increasing the amount of data per disk operation.
But do you know of even one piece of software that writes data in 1Mb blocks? I don't, and I suspect you don't either.

A simple test: dd to an ordinary 2Tb SATA disk

#  dd if=/dev/zero|pv|dd oflag=direct of=/dev/sdi bs=4k count=1M
   4GiB 0:00:46 [87.2MiB/s] [<=>]
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 46.9688 s, 91.4 MB/s

#  dd if=/dev/zero|pv|dd oflag=direct of=/dev/sdi bs=32k count=10k
dd: warning: partial read (24576 bytes); suggest iflag=fullblock
 319MiB 0:00:03 [ 103MiB/s] [<=>]
10219+21 records in
10219+21 records out
335262720 bytes (335 MB) copied, 3.15001 s, 106 MB/s

A single SATA disk gets a better rate than a CephFS cluster built from 24 of the same disks.
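
(A hedged aside: to measure aggregated throughput rather than the per-operation round-trip cost, the same write can also be done through the page cache against the CephFS mount; /mnt/cephfs below is only an assumed mount point, and conv=fdatasync makes dd flush the data before it reports a rate.)

# dd if=/dev/zero of=/mnt/cephfs/testfile bs=4k count=1M conv=fdatasync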

-- 
Best Regards,
Stanislav Butkeev


15.10.2015, 21:49, "John Spray" <jsp...@redhat.com>:
> On Thu, Oct 15, 2015 at 5:11 PM, Butkeev Stas <staer...@ya.ru> wrote:
>>  Hello all,
>>  Does anybody try to use cephfs?
>>
>>  I have two servers with RHEL7.1(latest kernel 3.10.0-229.14.1.el7.x86_64). 
>> Each server has 15G flash for ceph journal and 12*2Tb SATA disk for data.
>>  I have Infiniband(ipoib) 56Gb/s interconnect between nodes.
>>
>>  Cluster version
>>  # ceph -v
>>  ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
>>
>>  Cluster config
>>  # cat /etc/ceph/ceph.conf
>>  [global]
>>  auth service required = cephx
>>  auth client required = cephx
>>  auth cluster required = cephx
>>  fsid = 0f05deaf-ee6f-4342-b589-5ecf5527aa6f
>>  mon osd full ratio = .95
>>  mon osd nearfull ratio = .90
>>  osd pool default size = 2
>>  osd pool default min size = 1
>>  osd pool default pg num = 32
>>  osd pool default pgp num = 32
>>  max open files = 131072
>>  osd crush chooseleaf type = 1
>>  [mds]
>>
>>  [mds.a]
>>  host = ak34
>>
>>  [mon]
>>  mon_initial_members = a,b
>>
>>  [mon.a]
>>  host = ak34
>>  mon addr = 172.24.32.134:6789
>>
>>  [mon.b]
>>  host = ak35
>>  mon addr = 172.24.32.135:6789
>>
>>  [osd]
>>  osd journal size = 1000
>>
>>  [osd.0]
>>  osd uuid = b3b3cd37-8df5-4455-8104-006ddba2c443
>>  host = ak34
>>  public addr = 172.24.32.134
>>  osd journal = /CEPH_JOURNAL/osd/ceph-0/journal
>>  .
>>
>>  Below tree of cluster
>>  # ceph osd tree
>>  ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
>>  -1 45.75037 root default
>>  -2 45.75037 region RU
>>  -3 45.75037 datacenter ru-msk-ak48t
>>  -4 22.87518 host ak34
>>   0 1.90627 osd.0 up 1.0 1.0
>>   1 1.90627 osd.1 up 1.0 1.0
>>   2 1.90627 osd.2 up 1.0 1.0
>>   3 1.90627 osd.3 up 1.0 1.0
>>   4 1.90627 osd.4 up 1.0 1.0
>>   5 1.90627 osd.5 up 1.0 1.0
>>   6 1.90627 osd.6 up 1.0 1.0
>>   7 1.90627 osd.7 up 1.0 1.0
>>   8 1.90627 osd.8 up 1.0 1.0
>>   9 1.90627 osd.9 up 1.0 1.0
>>  10 1.90627 osd.10 up 1.0 1.0
>>  11 1.90627 osd.11 up 1.0 1.0
>>  -5 22.87518 host ak35
>>  12 1.90627 osd.12 up 1.0 1.0
>>  13 1.90627 osd.13 up 1.0 1.0
>>  14 1.90627 osd.14 up 1.0 1.0
>>  15 1.90627 osd.15 up 1.0 1.0
>>  16 1.90627 osd.16 up 1.0 1.0
>>  17 1.90627 osd.17 up 1.0 1.0
>>  18 1.90627 osd.18 up 1.0 1.0
>>  19 1.90627 osd.19 up 1.0 1.0
>>  20 1.90627 osd.20 up 1.0 1.0
>>  21 1.90627 osd.21 up 1.0 1.0
>>  22 1.90627 osd.22 up 1.0 1.0
>>  23 1.90627 osd.23 up 1.0 1.0
>>
>>  Status of cluster
>>  # ceph -s
>>  cluster 0f05deaf-ee6f-4342-b589-5ecf5527aa6f
>>   health HEALTH_OK
>>   monmap e1: 2 mons at {a=172.24.32.134:6789/0,b=172.24.32.135:6789/0}
>>  election epoch 10, quorum 0,1 a,b
>>   mdsmap e14: 1/1/1 up {0=a=up:active}
>>   osdmap e194: 24 osds: 24 up, 24 in
>>    pgmap v2305: 384 pgs, 3 pools, 

Re: [ceph-users] Low speed of write to cephfs

2015-10-15 Thread Butkeev Stas
Hello Max,

It is a 15G SCSI disk exported from a flash array to the server.
# multipath -ll
X dm-3 XX
size=15G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 5:0:0:2 sdp 8:240 active ready running
  |- 4:0:0:2 sdq 65:0  active ready running
  |- 6:0:0:2 sds 65:32 active ready running
  `- 7:0:0:2 sdu 65:64 active ready running

In the config you can see the option "osd journal size = 1000"; in total I use about 12G on each node for the Ceph journals (1000 MB per OSD, 12 OSDs).

For example

# ls -l /CEPH_JOURNAL/*/*
/CEPH_JOURNAL/osd/ceph-0:
total 1024000
-rw-r--r-- 1 root root 1048576000 Oct 15 19:03 journal

/CEPH_JOURNAL/osd/ceph-1:
total 1024000
-rw-r--r-- 1 root root 1048576000 Oct 15 19:03 journal

/CEPH_JOURNAL/osd/ceph-10:
total 1024000
-rw-r--r-- 1 root root 1048576000 Oct 15 19:04 journal

/CEPH_JOURNAL/osd/ceph-11:
total 1024000
-rw-r--r-- 1 root root 1048576000 Oct 15 19:03 journal

/CEPH_JOURNAL/osd/ceph-2:
total 1024000
-rw-r--r-- 1 root root 1048576000 Oct 15 19:03 journal

/CEPH_JOURNAL/osd/ceph-3:
total 1024000
-rw-r--r-- 1 root root 1048576000 Oct 15 19:03 journal
...
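
(For reference, and hedged because it is only the upstream rule of thumb rather than anything specific to this cluster: a filestore journal is usually sized as at least 2 * expected disk throughput * filestore max sync interval, so with ~100 MB/s SATA disks and the default 5 s sync interval that comes to roughly 1000 MB per OSD, which matches the setting above.)

[osd]
# rule of thumb: 2 * 100 MB/s * 5 s ~= 1000 MB per OSD journal
osd journal size = 1000
filestore max sync interval = 5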
-- 
Best Regards,
Stanislav Butkeev


15.10.2015, 23:26, "Max Yehorov" <myeho...@skytap.com>:
> Stas,
>
> as you said: "Each server has 15G flash for ceph journal and 12*2Tb
> SATA disk for"
>
> What is this 15G flash and is it used for all 12 SATA drives?
>
> On Thu, Oct 15, 2015 at 1:05 PM, John Spray <jsp...@redhat.com> wrote:
>>  On Thu, Oct 15, 2015 at 8:46 PM, Butkeev Stas <staer...@ya.ru> wrote:
>>>  Thank you for your comment. I know what does mean option oflag=direct and 
>>> other things about stress testing.
>>>  Unfortunately speed is very slow for this cluster FS.
>>>
>>>  The same test on another cluster FS(GPFS) which consist of 4 disks
>>>
>>>  # dd if=/dev/zero|pv|dd oflag=direct of=9 bs=4k count=10k
>>>  40.1MB 0:00:05 [7.57MB/s] [ <=> ]
>>>  10240+0 records in
>>>  10240+0 records out
>>>  41943040 bytes (42 MB) copied, 5.27816 s, 7.9 MB/s
>>>
>>>  I hope that I miss some options during configuration or something else.
>>
>>  I don't know much about GPFS internals, since it's proprietary, but a
>>  quick google brings us here:
>>  
>> http://www-01.ibm.com/support/knowledgecenter/SSFKCN_4.1.0.4/com.ibm.cluster.gpfs.v4r104.gpfs100.doc/bl1adm_considerations_direct_io.htm
>>
>>  It appears that GPFS only respects O_DIRECT in certain circumstances,
>>  and in some circumstances will use their "pagepool" cache even when
>>  direct IO is requested. You would probably need to check with IBM to
>>  work out exactly whether true direct IO is happening when you run on
>>  GPFS.
>>
>>  John
>>
>>>  --
>>>  Best Regards,
>>>  Stanislav Butkeev
>>>
>>>  15.10.2015, 22:36, "John Spray" <jsp...@redhat.com>:
>>>>  On Thu, Oct 15, 2015 at 8:17 PM, Butkeev Stas <staer...@ya.ru> wrote:
>>>>>   Hello John
>>>>>
>>>>>   Yes, of course, write speed is rising, because we are increasing amount 
>>>>> of data per one operation by disk.
>>>>>   But, do you know at least one software which write data by 1Mb blocks? 
>>>>> I don't know, you too.
>>>>
>>>>  Plenty of applications do large writes, especially if they're intended
>>>>  for use on network filesystems.
>>>>
>>>>  When you pass oflag=direct, you are asking the kernel to send these
>>>>  writes individually instead of aggregating them in the page cache.
>>>>  What you're measuring here is effectively the issue rate of small
>>>>  messages, rather than the speed at which data can be written to ceph.
>>>>
>>>>  Try the same benchmark with NFS, you'll get a similar scaling with block 
>>>> size.
>>>>
>>>>  Cheers,
>>>>  John
>>>>
>>>>  If you want to aggregate these writes in the page cache before sending
>>>>  them over the network, I imagine you probably need to disable direct
>>>>  IO.
>>>>
>>>>>   Simple test: dd to common 2Tb SATA disk
>>>>>
>>>>>   # dd if=/dev/zero|pv|dd oflag=direct of=/dev/sdi bs=4k count=1M
>>>>>      4GiB 0:00:46 [87.2MiB/s] [ <=> ]
>>>>>   1048576+0 records in
>>>>>   1048576+0 records out
>>>>>   4294967296 bytes (4.3 GB) copied, 46.9688 s, 91.4 MB/s
>>>>>
>>>>>   # dd if=/dev/zero|pv|dd ofl

Re: [ceph-users] Low speed of write to cephfs

2015-10-15 Thread Butkeev Stas
Yes, GPFS uses the "pagepool" option for caching I/O.
But my cluster currently uses only a tiny piece of memory for caching (pagepool is set to 1k below), so we can consider that this cluster effectively does not use a cache.
# mmlsconfig
Configuration data for cluster XX:
-
myNodeConfigNumber 3
clusterName ebs.ak315t.c2
clusterId 1764239962949993
autoload no
pagepool 1k
dmapiFileHandleSize 32
minReleaseLevel 3.5.0.11
verbsPorts mlx4_0/1 mlx4_0/2
verbsRdma enable
adminMode central

File systems in cluster XX:
--
/dev/X

-- 
Best Regards,
Stanislav Butkeev


15.10.2015, 23:05, "John Spray" <jsp...@redhat.com>:
> On Thu, Oct 15, 2015 at 8:46 PM, Butkeev Stas <staer...@ya.ru> wrote:
>>  Thank you for your comment. I know what does mean option oflag=direct and 
>> other things about stress testing.
>>  Unfortunately speed is very slow for this cluster FS.
>>
>>  The same test on another cluster FS(GPFS) which consist of 4 disks
>>
>>  # dd if=/dev/zero|pv|dd oflag=direct of=9 bs=4k count=10k
>>  40.1MB 0:00:05 [7.57MB/s] [ <=> ]
>>  10240+0 records in
>>  10240+0 records out
>>  41943040 bytes (42 MB) copied, 5.27816 s, 7.9 MB/s
>>
>>  I hope that I miss some options during configuration or something else.
>
> I don't know much about GPFS internals, since it's proprietary, but a
> quick google brings us here:
> http://www-01.ibm.com/support/knowledgecenter/SSFKCN_4.1.0.4/com.ibm.cluster.gpfs.v4r104.gpfs100.doc/bl1adm_considerations_direct_io.htm
>
> It appears that GPFS only respects O_DIRECT in certain circumstances,
> and in some circumstances will use their "pagepool" cache even when
> direct IO is requested. You would probably need to check with IBM to
> work out exactly whether true direct IO is happening when you run on
> GPFS.
>
> John
>
>>  --
>>  Best Regards,
>>  Stanislav Butkeev
>>
>>  15.10.2015, 22:36, "John Spray" <jsp...@redhat.com>:
>>>  On Thu, Oct 15, 2015 at 8:17 PM, Butkeev Stas <staer...@ya.ru> wrote:
>>>>   Hello John
>>>>
>>>>   Yes, of course, write speed is rising, because we are increasing amount 
>>>> of data per one operation by disk.
>>>>   But, do you know at least one software which write data by 1Mb blocks? I 
>>>> don't know, you too.
>>>
>>>  Plenty of applications do large writes, especially if they're intended
>>>  for use on network filesystems.
>>>
>>>  When you pass oflag=direct, you are asking the kernel to send these
>>>  writes individually instead of aggregating them in the page cache.
>>>  What you're measuring here is effectively the issue rate of small
>>>  messages, rather than the speed at which data can be written to ceph.
>>>
>>>  Try the same benchmark with NFS, you'll get a similar scaling with block 
>>> size.
>>>
>>>  Cheers,
>>>  John
>>>
>>>  If you want to aggregate these writes in the page cache before sending
>>>  them over the network, I imagine you probably need to disable direct
>>>  IO.
>>>
>>>>   Simple test: dd to common 2Tb SATA disk
>>>>
>>>>   # dd if=/dev/zero|pv|dd oflag=direct of=/dev/sdi bs=4k count=1M
>>>>  4GiB 0:00:46 [87.2MiB/s] [ <=> ]
>>>>   1048576+0 records in
>>>>   1048576+0 records out
>>>>   4294967296 bytes (4.3 GB) copied, 46.9688 s, 91.4 MB/s
>>>>
>>>>   # dd if=/dev/zero|pv|dd oflag=direct of=/dev/sdi bs=32k count=10k
>>>>   dd: warning: partial read (24576 bytes); suggest iflag=fullblock
>>>>    319MiB 0:00:03 [ 103MiB/s] [ <=> ]
>>>>   10219+21 records in
>>>>   10219+21 records out
>>>>   335262720 bytes (335 MB) copied, 3.15001 s, 106 MB/s
>>>>
>>>>   One SATA disk has better rate than cephfs which consist of 24 the same 
>>>> disks.
>>>>
>>>>   --
>>>>   Best Regards,
>>>>   Stanislav Butkeev
>>>>
>>>>   15.10.2015, 21:49, "John Spray" <jsp...@redhat.com>:
>>>>>   On Thu, Oct 15, 2015 at 5:11 PM, Butkeev Stas <staer...@ya.ru> wrote:
>>>>>>    Hello all,
>>>>>>    Does anybody try to use cephfs?
>>>>>>
>>>>>>    I have two servers with RHEL7.1(latest kernel 
>>>>>> 3.10.0-229.14.1.el7.x86_64). Each server has 15G flash for ceph journal 
>>>>>> and 12*2Tb SATA disk

Re: [ceph-users] Low speed of write to cephfs

2015-10-15 Thread Butkeev Stas
Thank you for your comment. I know what the oflag=direct option means, and the other points about stress testing.
Unfortunately, the write speed of this cluster FS is still very slow.

The same test on another cluster FS (GPFS), which consists of 4 disks:

# dd if=/dev/zero|pv|dd oflag=direct of=9 bs=4k count=10k
40.1MB 0:00:05 [7.57MB/s] [<=>]
10240+0 records in
10240+0 records out
41943040 bytes (42 MB) copied, 5.27816 s, 7.9 MB/s

I hope that I have simply missed some option during configuration, or something like that.

-- 
Best Regards,
Stanislav Butkeev


15.10.2015, 22:36, "John Spray" <jsp...@redhat.com>:
> On Thu, Oct 15, 2015 at 8:17 PM, Butkeev Stas <staer...@ya.ru> wrote:
>>  Hello John
>>
>>  Yes, of course, write speed is rising, because we are increasing amount of 
>> data per one operation by disk.
>>  But, do you know at least one software which write data by 1Mb blocks? I 
>> don't know, you too.
>
> Plenty of applications do large writes, especially if they're intended
> for use on network filesystems.
>
> When you pass oflag=direct, you are asking the kernel to send these
> writes individually instead of aggregating them in the page cache.
> What you're measuring here is effectively the issue rate of small
> messages, rather than the speed at which data can be written to ceph.
>
> Try the same benchmark with NFS, you'll get a similar scaling with block size.
>
> Cheers,
> John
>
> If you want to aggregate these writes in the page cache before sending
> them over the network, I imagine you probably need to disable direct
> IO.
>
>>  Simple test: dd to common 2Tb SATA disk
>>
>>  # dd if=/dev/zero|pv|dd oflag=direct of=/dev/sdi bs=4k count=1M
>> 4GiB 0:00:46 [87.2MiB/s] [ <=> ]
>>  1048576+0 records in
>>  1048576+0 records out
>>  4294967296 bytes (4.3 GB) copied, 46.9688 s, 91.4 MB/s
>>
>>  # dd if=/dev/zero|pv|dd oflag=direct of=/dev/sdi bs=32k count=10k
>>  dd: warning: partial read (24576 bytes); suggest iflag=fullblock
>>   319MiB 0:00:03 [ 103MiB/s] [ <=> ]
>>  10219+21 records in
>>  10219+21 records out
>>  335262720 bytes (335 MB) copied, 3.15001 s, 106 MB/s
>>
>>  One SATA disk has better rate than cephfs which consist of 24 the same 
>> disks.
>>
>>  --
>>  Best Regards,
>>  Stanislav Butkeev
>>
>>  15.10.2015, 21:49, "John Spray" <jsp...@redhat.com>:
>>>  On Thu, Oct 15, 2015 at 5:11 PM, Butkeev Stas <staer...@ya.ru> wrote:
>>>>   Hello all,
>>>>   Does anybody try to use cephfs?
>>>>
>>>>   I have two servers with RHEL7.1(latest kernel 
>>>> 3.10.0-229.14.1.el7.x86_64). Each server has 15G flash for ceph journal 
>>>> and 12*2Tb SATA disk for data.
>>>>   I have Infiniband(ipoib) 56Gb/s interconnect between nodes.
>>>>
>>>>   Cluster version
>>>>   # ceph -v
>>>>   ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
>>>>
>>>>   Cluster config
>>>>   # cat /etc/ceph/ceph.conf
>>>>   [global]
>>>>   auth service required = cephx
>>>>   auth client required = cephx
>>>>   auth cluster required = cephx
>>>>   fsid = 0f05deaf-ee6f-4342-b589-5ecf5527aa6f
>>>>   mon osd full ratio = .95
>>>>   mon osd nearfull ratio = .90
>>>>   osd pool default size = 2
>>>>   osd pool default min size = 1
>>>>   osd pool default pg num = 32
>>>>   osd pool default pgp num = 32
>>>>   max open files = 131072
>>>>   osd crush chooseleaf type = 1
>>>>   [mds]
>>>>
>>>>   [mds.a]
>>>>   host = ak34
>>>>
>>>>   [mon]
>>>>   mon_initial_members = a,b
>>>>
>>>>   [mon.a]
>>>>   host = ak34
>>>>   mon addr = 172.24.32.134:6789
>>>>
>>>>   [mon.b]
>>>>   host = ak35
>>>>   mon addr = 172.24.32.135:6789
>>>>
>>>>   [osd]
>>>>   osd journal size = 1000
>>>>
>>>>   [osd.0]
>>>>   osd uuid = b3b3cd37-8df5-4455-8104-006ddba2c443
>>>>   host = ak34
>>>>   public addr = 172.24.32.134
>>>>

[ceph-users] problem with RGW

2015-07-31 Thread Butkeev Stas
Hello everybody

We have a Ceph cluster that consists of 8 hosts with 12 OSDs per host. The OSDs are 2T SATA disks.

[13:23]:[root@se087  ~]# ceph osd tree
ID  WEIGHT    TYPE NAME                     UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -1 182.99203 root default
 -2 182.99203     region RU
 -3  91.49487         datacenter ru-msk-comp1p
 -9  22.87500             host 1
 48   1.90599                 osd.48      up      1.0      1.0
 49   1.90599                 osd.49      up      1.0      1.0
 50   1.90599                 osd.50      up      1.0      1.0
 51   1.90599                 osd.51      up      1.0      1.0
 52   1.90599                 osd.52      up      1.0      1.0
 53   1.90599                 osd.53      up      1.0      1.0
 54   1.90599                 osd.54      up      1.0      1.0
 55   1.90599                 osd.55      up      1.0      1.0
 56   1.90599                 osd.56      up      1.0      1.0
 57   1.90599                 osd.57      up      1.0      1.0
 58   1.90599                 osd.58      up      1.0      1.0
 59   1.90599                 osd.59      up      1.0      1.0
-10  22.87216             host 2
 60   1.90599                 osd.60      up      1.0      1.0
 61   1.90599                 osd.61      up      1.0      1.0
 62   1.90599                 osd.62      up      1.0      1.0
 63   1.90599                 osd.63      up      1.0      1.0
 64   1.90599                 osd.64      up      1.0      1.0
 65   1.90599                 osd.65      up      1.0      1.0
 66   1.90599                 osd.66      up      1.0      1.0
 67   1.90599                 osd.67      up      1.0      1.0
 69   1.90599                 osd.69      up      1.0      1.0
 70   1.90599                 osd.70      up      1.0      1.0
 71   1.90599                 osd.71      up      1.0      1.0
 68   1.90627                 osd.68      up      1.0      1.0
-11  22.87500             host 3
 72   1.90599                 osd.72      up      1.0      1.0
 73   1.90599                 osd.73      up      1.0      1.0
 74   1.90599                 osd.74      up      1.0      1.0
 75   1.90599                 osd.75      up      1.0      1.0
 76   1.90599                 osd.76      up      1.0      1.0
 77   1.90599                 osd.77      up      1.0      1.0
 78   1.90599                 osd.78      up      1.0      1.0
 79   1.90599                 osd.79      up      1.0      1.0
 80   1.90599                 osd.80      up      1.0      1.0
 81   1.90599                 osd.81      up      1.0      1.0
 82   1.90599                 osd.82      up      1.0      1.0
 83   1.90599                 osd.83      up      1.0      1.0
-12  22.87271             host 4
 84   1.90599                 osd.84      up      1.0      1.0
 86   1.90599                 osd.86      up      1.0      1.0
 89   1.90599                 osd.89      up      1.0      1.0
 90   1.90599                 osd.90      up      1.0      1.0
 91   1.90599                 osd.91      up      1.0      1.0
 92   1.90599                 osd.92      up      1.0      1.0
 93   1.90599                 osd.93      up      1.0      1.0
 94   1.90599                 osd.94      up      1.0      1.0
 95   1.90599                 osd.95      up      1.0      1.0
 85   1.90627                 osd.85      up      1.0      1.0
 88   1.90627                 osd.88      up      1.0      1.0
 87   1.90627                 osd.87      up      1.0      1.0
 -4  91.49716         datacenter ru-msk-vol51
 -5  22.87216             host 5
  1   1.90599                 osd.1       up

[ceph-users] Problems with shadow objects

2015-03-03 Thread Butkeev Stas
Hello, all

I have a Ceph+RGW installation and I have some problems with shadow objects.

For example:
#rados ls -p .rgw.buckets|grep default.4507.1

.
default.4507.1__shadow_test_s3.2/2vO4WskQNBGMnC8MGaYPSLfGkhQY76U.1_5
default.4507.1__shadow_test_s3.2/2vO4WskQNBGMnC8MGaYPSLfGkhQY76U.2_2
default.4507.1__shadow_test_s3.2/2vO4WskQNBGMnC8MGaYPSLfGkhQY76U.6_4
default.4507.1__shadow_test_s3.2/2vO4WskQNBGMnC8MGaYPSLfGkhQY76U.4_2
default.4507.1__shadow_test_s3.2/2vO4WskQNBGMnC8MGaYPSLfGkhQY76U.3_5
.

Please advise me on the following questions:
1) How can I remove these shadow files?
2) What does the name of these shadow files mean?
For example, with a normal object:
# radosgw-admin object stat --bucket=dev --object=RegExp_tutorial.png
I receive information about this object.

With a shadow object (default.4507.1_ is the bucket ID):
radosgw-admin object stat --bucket=dev --object=_shadow_test_s3.2/2vO4WskQNBGMnC8MGaYPSLfGkhQY76U.2_7
ERROR: failed to stat object, returned error: (2) No such file or directory
How can I determine the name of this object?
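
(A hedged note rather than a definitive answer: the __shadow_ objects are the RADOS objects that hold the tail and multipart parts of large uploads, and they are normally reclaimed by RGW's garbage collector rather than removed by hand. If they really are orphaned, they can be inspected and, as a last resort, removed directly with rados; the commands below are generic examples using the object name from the listing above.)

# radosgw-admin gc list --include-all    # objects currently queued for garbage collection
# radosgw-admin gc process               # run the garbage collector now
# rados -p .rgw.buckets rm 'default.4507.1__shadow_test_s3.2/2vO4WskQNBGMnC8MGaYPSLfGkhQY76U.1_5'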

-- 
Best Regards,
Stanislav Butkeev


[ceph-users] Problems with pgs incomplete

2014-12-01 Thread Butkeev Stas
Hi all,
I have a Ceph cluster plus RGW. Now I have a problem with one of the OSDs: it is down. I checked the ceph status and see the following:

[root@node-1 ceph-0]# ceph -s
cluster fc8c3ecc-ccb8-4065-876c-dc9fc992d62d
 health HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck 
unclean
 monmap e1: 3 mons at 
{a=10.29.226.39:6789/0,b=10.29.226.29:6789/0,c=10.29.226.40:6789/0}, election 
epoch 294, quorum 0,1,2 b,a,c
 osdmap e418: 6 osds: 5 up, 5 in
  pgmap v23588: 312 pgs, 16 pools, 141 kB data, 594 objects
5241 MB used, 494 GB / 499 GB avail
 308 active+clean
   4 incomplete

Why do I have 4 incomplete pgs in the pool .rgw.buckets if I have replicated size 2 and min_size 2?
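
(A hedged aside: one way to see why a particular pg is stuck incomplete is to query it directly; 13.2 below is one of the pgs reported further down.)

# ceph pg 13.2 query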

My osd tree
[root@node-1 ceph-0]# ceph osd tree
# id    weight  type name       up/down reweight
-1  4   root croc
-2  4   region ru
-4  3   datacenter vol-5
-5  1   host node-1
0   1   osd.0   down0
-6  1   host node-2
1   1   osd.1   up  1
-7  1   host node-3
2   1   osd.2   up  1
-3  1   datacenter comp
-8  1   host node-4
3   1   osd.3   up  1
-9  1   host node-5
4   1   osd.4   up  1
-10 1   host node-6
5   1   osd.5   up  1

Additional information:

[root@node-1 ceph-0]# ceph health detail
HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean
pg 13.6 is stuck inactive for 1547.665758, current state incomplete, last 
acting [1,3]
pg 13.4 is stuck inactive for 1547.652111, current state incomplete, last 
acting [1,2]
pg 13.5 is stuck inactive for 4502.009928, current state incomplete, last 
acting [1,3]
pg 13.2 is stuck inactive for 4501.979770, current state incomplete, last 
acting [1,3]
pg 13.6 is stuck unclean for 4501.969914, current state incomplete, last acting 
[1,3]
pg 13.4 is stuck unclean for 4502.001114, current state incomplete, last acting 
[1,2]
pg 13.5 is stuck unclean for 4502.009942, current state incomplete, last acting 
[1,3]
pg 13.2 is stuck unclean for 4501.979784, current state incomplete, last acting 
[1,3]
pg 13.2 is incomplete, acting [1,3] (reducing pool .rgw.buckets min_size from 2 
may help; search ceph.com/docs for 'incomplete')
pg 13.6 is incomplete, acting [1,3] (reducing pool .rgw.buckets min_size from 2 
may help; search ceph.com/docs for 'incomplete')
pg 13.4 is incomplete, acting [1,2] (reducing pool .rgw.buckets min_size from 2 
may help; search ceph.com/docs for 'incomplete')
pg 13.5 is incomplete, acting [1,3] (reducing pool .rgw.buckets min_size from 2 
may help; search ceph.com/docs for 'incomplete')

[root@node-1 ceph-0]# ceph osd dump | grep 'pool'
pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins 
pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 1 '.rgw.root' replicated size 3 min_size 2 crush_ruleset 0 object_hash 
rjenkins pg_num 8 pgp_num 8 last_change 34 owner 18446744073709551615 flags 
hashpspool stripe_width 0
pool 2 '.rgw.control' replicated size 3 min_size 2 crush_ruleset 0 object_hash 
rjenkins pg_num 8 pgp_num 8 last_change 36 owner 18446744073709551615 flags 
hashpspool stripe_width 0
pool 3 '.rgw' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins 
pg_num 8 pgp_num 8 last_change 38 owner 18446744073709551615 flags hashpspool 
stripe_width 0
pool 4 '.rgw.gc' replicated size 3 min_size 2 crush_ruleset 0 object_hash 
rjenkins pg_num 8 pgp_num 8 last_change 39 flags hashpspool stripe_width 0
pool 5 '.users.uid' replicated size 3 min_size 2 crush_ruleset 0 object_hash 
rjenkins pg_num 8 pgp_num 8 last_change 40 owner 18446744073709551615 flags 
hashpspool stripe_width 0
pool 6 '.log' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins 
pg_num 8 pgp_num 8 last_change 42 owner 18446744073709551615 flags hashpspool 
stripe_width 0
pool 7 '.users' replicated size 3 min_size 2 crush_ruleset 0 object_hash 
rjenkins pg_num 8 pgp_num 8 last_change 44 flags hashpspool stripe_width 0
pool 8 '.users.swift' replicated size 3 min_size 2 crush_ruleset 0 object_hash 
rjenkins pg_num 8 pgp_num 8 last_change 46 flags hashpspool stripe_width 0
pool 9 '.usage' replicated size 3 min_size 2 crush_ruleset 0 object_hash 
rjenkins pg_num 8 pgp_num 8 last_change 48 flags hashpspool stripe_width 0
pool 10 'test' replicated size 2 min_size 2 crush_ruleset 0 object_hash 
rjenkins pg_num 136 pgp_num 136 last_change 68 flags hashpspool stripe_width 0
pool 11 '.rgw.buckets.index' replicated size 3 min_size 2 crush_ruleset 0 

Re: [ceph-users] Problems with pgs incomplete

2014-12-01 Thread Butkeev Stas
Thank you Lionel,
Indeed, I had forgotten about size vs. min_size. I set min_size to 1 and my cluster is UP now. I then removed the crashed OSD and set size back to 3 and min_size to 2.
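
(For reference, and hedged because the pool name and OSD id are simply taken from this thread, the commands for those steps would typically be:)

# ceph osd pool set .rgw.buckets min_size 1    # let the incomplete pgs peer and recover
# ceph osd out osd.0 && ceph osd crush remove osd.0 && ceph auth del osd.0 && ceph osd rm 0
# ceph osd pool set .rgw.buckets size 3
# ceph osd pool set .rgw.buckets min_size 2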

---
With regards,
Stanislav 


01.12.2014, 19:15, Lionel Bouton lionel-subscript...@bouton.name:
 On 01/12/2014 17:08, Lionel Bouton wrote:
  I may be wrong here (I'm surprised you only have 4 incomplete pgs, I'd
  expect ~1/3rd of your pgs to be incomplete given your ceph osd tree
  output) but reducing min_size to 1 should be harmless and should
  unfreeze the recovering process.

 Ignore this part : I wasn't paying enough attention to the osd tree
 output and mixed osd/host levels.

 Others have pointed out that you have size = 3 for some pools. In this
 case you might have lost an OSD before a previous recovering process
 finished which would explain your current state (in this case my earlier
 advice still applies).

 Best regards,

 Lionel