Maybe I found the problem:
smartctl -a /dev/sda | grep Media_Wearout_Indicator
233 Media_Wearout_Indicator 0x0032 001 001 000 Old_age Always
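On Intel drives this attribute is normalized from 100 (new) down to 1, so a value of 001 means the rated write endurance is exhausted. A quick cross-check of wear against total host writes (a sketch; the attribute names are the ones smartctl's drive database uses for Intel SSDs, so adjust for your model):

smartctl -A /dev/sda | egrep 'Media_Wearout_Indicator|Host_Writes_32MiB|Available_Reservd_Space'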
root@node-65:~# fio -direct=1 -bs=4k -ramp_time=40 -runtime=100 -size=20g -filename=./testfio.file -ioengine=libaio -iodepth=8 -norandommap -randrepeat=0 -time_based -rw=randwrite -name "osd.0 4k randwrite test"
osd.0 4k randwrite test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=8
fio-2.1.10
Starting 1 process
osd.0 4k randwrite test: Laying out IO file(s) (1 file(s) / 20480MB)
Jobs: 1 (f=1): [w] [100.0% done] [0KB/252KB/0KB /s] [0/63/0 iops] [eta 00m:00s]
osd.0 4k randwrite test: (groupid=0, jobs=1): err= 0: pid=30071: Wed Mar 30 13:38:27 2016
write: io=79528KB, bw=814106B/s, iops=198, runt=100032msec
slat (usec): min=5, max=1031.3K, avg=363.76, stdev=13260.26
clat (usec): min=109, max=1325.7K, avg=39755.66, stdev=81798.27
lat (msec): min=3, max=1325, avg=40.25, stdev=83.48
clat percentiles (msec):
| 1.00th=[ 30], 5.00th=[ 30], 10.00th=[ 30], 20.00th=[ 31],
| 30.00th=[ 31], 40.00th=[ 31], 50.00th=[ 31], 60.00th=[ 36],
| 70.00th=[ 36], 80.00th=[ 36], 90.00th=[ 36], 95.00th=[ 36],
| 99.00th=[ 165], 99.50th=[ 799], 99.90th=[ 1221], 99.95th=[ 1237],
| 99.99th=[ 1319]
bw (KB /s): min= 0, max= 1047, per=100.00%, avg=844.21, stdev=291.89
lat (usec) : 250=0.01%
lat (msec) : 4=0.02%, 10=0.14%, 20=0.31%, 50=98.13%, 100=0.25%
lat (msec) : 250=0.23%, 500=0.16%, 750=0.24%, 1000=0.22%, 2000=0.34%
cpu : usr=0.18%, sys=1.27%, ctx=22838, majf=0, minf=27
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=143.7%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=19875/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=8
Run status group 0 (all jobs):
WRITE: io=79528KB, aggrb=795KB/s, minb=795KB/s, maxb=795KB/s, mint=100032msec, maxt=100032msec
Disk stats (read/write):
sda: ios=864/28988, merge=0/5738, ticks=31932/1061860, in_queue=1093892, util=99.99%
root@node-65:~#
The lifetime of this SSD is over.
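For reference, the usual way to qualify a journal SSD is a single-threaded O_DSYNC write test; here is a sketch (testfio2.file is a placeholder name, and the flags are the common journal-test ones, not the exact run above):

fio --name=journal-test --filename=./testfio2.file --size=1g --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based

A healthy DC-class SSD sustains thousands of 4k IOPS in that test, so the ~198 IOPS and second-long tail latencies above point squarely at the drive.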
Thanks so much, Christian.
2016-03-30 12:19 GMT+08:00 lin zhou <[email protected]>:
> 2016-03-29 14:50 GMT+08:00 Christian Balzer <[email protected]>:
>>
>> Hello,
>>
>> On Tue, 29 Mar 2016 14:00:44 +0800 lin zhou wrote:
>>
>>> Hi, Christian.
>>> When I re-add these OSDs (0, 3, 9, 12, 15), the high latency occurs again.
>>> The default reweight of these OSDs is 0.0.
>>>
>> That makes no sense; at a crush weight (not reweight) of 0 they should not
>> get used at all.
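>> (To actually put them back into service you would restore the crush weight,
>> e.g. ceph osd crush reweight osd.0 2.73, where 2.73 is just an example value
>> matching your other 3TB OSDs.)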
>>
>> When you deleted the other OSD (6?) because of high latency, was your only
>> reason/data point the "ceph osd perf" output?
>
> Because this is a near-production environment, when the OSD latency and
> the system latency were high I deleted these OSDs to keep things working first.
>
>>> root@node-65:~# ceph osd tree
>>> # id  weight  type name       up/down  reweight
>>> -1    103.7   root default
>>> -2    8.19        host node-65
>>> 18    2.73            osd.18   up       1
>>> 21    0               osd.21   up       1
>>> 24    2.73            osd.24   up       1
>>> 27    2.73            osd.27   up       1
>>> 30    0               osd.30   up       1
>>> 33    0               osd.33   up       1
>>> 0     0               osd.0    up       1
>>> 3     0               osd.3    up       1
>>> 6     0               osd.6    down     0
>>> 9     0               osd.9    up       1
>>> 12    0               osd.12   up       1
>>> 15    0               osd.15   up       1
>>>
>>> ceph osd perf:
>>> osd  fs_commit_latency(ms)  fs_apply_latency(ms)
>>>   0                   9825                 10211
>>>   3                   9398                  9775
>>>   9                  35852                 36904
>>>  12                  24716                 25626
>>>  15                  18893                 19633
>>>
>> This could very well be old, stale data.
>> Still, these are some seriously bad numbers, if they are real.
>>
>> Do these perf numbers change at all? My guess would be no.
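>> (For example, watch -n 5 'ceph osd perf' for a couple of minutes would show
>> whether they move at all.)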
>
> Yes, they never change.
>
>>
>>> but iostat of these devices shows nothing.
>> Unsurprising, as they should not be used by Ceph with a weight of 0.
>> atop gives you an even better, complete view.
>>
>>> smartctl reports no errors on these OSD devices.
>>>
>> What exactly are these devices (model please)? 3TB SATA drives, I assume?
>> How are they connected (which controller)?
>
> Yes, 3TB SATA, Model Number: WDC WD3000FYYZ-01UL1B1.
> Today I tried setting osd.0's reweight to 0.1 and then checking; some
> useful data turned up.
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 1.63 0.00 0.48 16.15 0.00 81.75
>
> Device:  rrqm/s  wrqm/s   r/s   w/s  rMB/s  wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda        0.00    0.00  0.00  2.00   0.00   1.00  1024.00    39.85 1134.00    0.00 1134.00 500.00 100.00
> sda1       0.00    0.00  0.00  0.00   0.00   0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sda2       0.00    0.00  0.00  0.00   0.00   0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sda3       0.00    0.00  0.00  0.00   0.00   0.00     0.00     1.00    0.00    0.00    0.00   0.00 100.40
> sda4       0.00    0.00  0.00  0.00   0.00   0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sda5       0.00    0.00  0.00  2.00   0.00   1.00  1024.00    32.32 1134.00    0.00 1134.00 502.00 100.40
> sda6       0.00    0.00  0.00  0.00   0.00   0.00     0.00     0.66    0.00    0.00    0.00   0.00  66.40
> sda7       0.00    0.00  0.00  0.00   0.00   0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sda8       0.00    0.00  0.00  0.00   0.00   0.00     0.00     1.00    0.00    0.00    0.00   0.00 100.00
> sda9       0.00    0.00  0.00  0.00   0.00   0.00     0.00     1.00    0.00    0.00    0.00   0.00 100.00
> sda10      0.00    0.00  0.00  0.00   0.00   0.00     0.00     1.00    0.00    0.00    0.00   0.00 100.00
>
> ^C^C^C^C^C^C^C^C^C
> root@node-65:~# ls -l /var/lib/ceph/osd/ceph-0
> total 62924048
> -rw-r--r-- 1 root root 487 Oct 12 16:49 activate.monmap
> -rw-r--r-- 1 root root 3 Oct 12 16:49 active
> -rw-r--r-- 1 root root 37 Oct 12 16:49 ceph_fsid
> drwxr-xr-x 280 root root 8192 Mar 30 11:58 current
> -rw-r--r-- 1 root root 37 Oct 12 16:49 fsid
> lrwxrwxrwx 1 root root 9 Oct 12 16:49 journal -> /dev/sda5
> -rw------- 1 root root 56 Oct 12 16:49 keyring
> -rw-r--r-- 1 root root 21 Oct 12 16:49 magic
> -rw-r--r-- 1 root root 6 Oct 12 16:49 ready
> -rw-r--r-- 1 root root 4 Oct 12 16:49 store_version
> -rw-r--r-- 1 root root 42 Oct 12 16:49 superblock
> -rw-r--r-- 1 root root 64424509440 Mar 30 10:20 testfio.file
> -rw-r--r-- 1 root root 0 Mar 28 09:54 upstart
> -rw-r--r-- 1 root root 2 Oct 12 16:49 whoami
>
> The journal of osd.0 is on sda5, and it is completely saturated; CPU iowait
> in top is 30% and the system is slow.
>
> So maybe the problem is sda5? It is an Intel SSDSC2BB120G4.
> We use two SSDs for the journals and the system.
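> (To confirm which journal partition each OSD uses, a quick sketch assuming
> the default layout under /var/lib/ceph/osd:
> for d in /var/lib/ceph/osd/ceph-*; do echo -n "$d -> "; readlink -f "$d"/journal; done
> which should show osd.0 pointing at /dev/sda5, matching the ls output above.)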
>
> root@node-65:~# lsblk
> NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
> sda 8:0 0 111.8G 0 disk
> ├─sda1 8:1 0 22M 0 part
> ├─sda2 8:2 0 191M 0 part /boot
> ├─sda3 8:3 0 43.9G 0 part /
> ├─sda4 8:4 0 3.8G 0 part [SWAP]
> ├─sda5 8:5 0 10.2G 0 part
> ├─sda6 8:6 0 10.2G 0 part
> ├─sda7 8:7 0 10.2G 0 part
> ├─sda8 8:8 0 10.2G 0 part
> ├─sda9 8:9 0 10.2G 0 part
> └─sda10 8:10 0 10.2G 0 part
> sdb 8:16 0 111.8G 0 disk
> ├─sdb1 8:17 0 24M 0 part
> ├─sdb2 8:18 0 10.2G 0 part
> ├─sdb3 8:19 0 10.2G 0 part
> ├─sdb4 8:20 0 10.2G 0 part
> ├─sdb5 8:21 0 10.2G 0 part
> ├─sdb6 8:22 0 10.2G 0 part
> ├─sdb7 8:23 0 10.2G 0 part
> └─sdb8 8:24 0 50.1G 0 part
>
>> Christian
>>> 2016-03-29 13:22 GMT+08:00 lin zhou <[email protected]>:
>>> > Thanks. I tried this method just as the Ceph documentation says.
>>> > But when I tested osd.6 this way, the leveldb of osd.6 was
>>> > broken, so it could not start.
>>> >
>>> > When I tried this for the other OSDs, it worked.
>>> >
>>> > 2016-03-29 8:22 GMT+08:00 Christian Balzer <[email protected]>:
>>> >> On Mon, 28 Mar 2016 18:36:14 +0800 lin zhou wrote:
>>> >>
>>> >>> > Hello,
>>> >>> >
>>> >>> > On Sun, 27 Mar 2016 13:41:57 +0800 lin zhou wrote:
>>> >>> >
>>> >>> > > Hi, guys.
>>> >>> > > Some days ago, one OSD showed a large latency in ceph osd
>>> >>> > > perf, and this device gave the node a high CPU iowait.
>>> >>> > The thing to do at that point would have been to look at things with
>>> >>> > atop or iostat to verify that it was the device itself that was
>>> >>> > slow, and not genuinely busy due to perhaps uneven activity,
>>> >>> > as well as a quick glance at SMART, of course.
>>> >>>
>>> >>> Thanks. I will follow this advice when I face this problem next time.
>>> >>>
>>> >>> > > So I deleted this OSD and then checked the device.
>>> >>> > If that device (HDD, SSD, which model?) slowed down your cluster,
>>> >>> > you should not have deleted it.
>>> >>> > The best method would have been to set your cluster to noout and
>>> >>> > stop that specific OSD.
>>> >>> >
>>> >>> > When you say "delete", what exact steps did you take?
>>> >>> > Did this include removing it from the crush map?
>>> >>>
>>> >>> Yes, I deleted it from the crush map, deleted its auth, and removed the OSD.
>>> >>>
>>> >>
>>> >> Google is your friend; if you deleted it like in the link below, you
>>> >> should be able to re-add it the same way:
>>> >> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-June/002345.html
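>>> >> In short, something like this (a sketch; the id, weight, and host are
>>> >> examples taken from your tree):
>>> >> ceph osd create    # reuses the lowest free id, here 6
>>> >> ceph auth add osd.6 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-6/keyring
>>> >> ceph osd crush add osd.6 2.73 host=node-65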
>>> >>
>>> >> Christian
>>> >>
>>> >>> > > But no errors were found.
>>> >>> > >
>>> >>> > > And now I want to re-add this device to the cluster with its data.
>>> >>> > >
>>> >>> > All the data was already replicated elsewhere if you
>>> >>> > deleted/removed the OSD; you're likely not going to save much, if
>>> >>> > any, data movement by re-adding it.
>>> >>>
>>> >>> Yes, the cluster finished rebalancing, but I face a problem with one
>>> >>> unfound object. The recovery_state section of the pg query output
>>> >>> says this OSD is down, but the other OSDs are OK.
>>> >>> So I want to recover this OSD to recover this unfound object.
>>> >>>
>>> >>> and mark_unfound_lost revert/delete do not work:
>>> >>> Error EINVAL: pg has 1 unfound objects but we haven't probed all
>>> >>> sources,
>>> >>>
>>> >>> For details, see:
>>> >>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008452.html
>>> >>>
>>> >>> Thanks again.
>>> >>>
>>> >>> > >
>>> >>> > > I tried using ceph-osd to add it, but it cannot start. Logs are
>>> >>> > > pasted at:
>>> >>> > > https://gist.github.com/hnuzhoulin/836f9e633b90041e89ad
>>> >>> > >
>>> >>> > > So what are the recommended steps?
>>> >>> > That depends on how you deleted it, but at this point your data is
>>> >>> > likely to be mostly stale anyway, so I'd start from scratch.
>>> >>>
>>> >>> > Christian
>>> >>> > --
>>> >>> > Christian Balzer Network/Systems Engineer
>>> >>> > [email protected] Global OnLine Japan/Rakuten Communications
>>> >>> > http://www.gol.com/
>>> >>> >
>>> >>>
>>> >>
>>> >>
>>> >> --
>>> >> Christian Balzer Network/Systems Engineer
>>> >> [email protected] Global OnLine Japan/Rakuten Communications
>>> >> http://www.gol.com/
>>>
>>
>>
>> --
>> Christian Balzer Network/Systems Engineer
>> [email protected] Global OnLine Japan/Rakuten Communications
>> http://www.gol.com/
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com