[ceph-users] CephFS damaged and cannot recover

2019-06-19 Thread Wei Jin
There is a lot of data in this cluster (2 PB), so please help us, thanks.
Before performing these dangerous operations
(http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#disaster-recovery-experts),
are there any suggestions?

Ceph version: 12.2.12

ceph fs status:

cephfs - 1057 clients
==
+--+-+-+--+---+---+
| Rank |  State  | MDS | Activity |  dns  |  inos |
+--+-+-+--+---+---+
|  0   |  failed | |  |   |   |
|  1   | resolve | n31-023-214 |  |0  |0  |
|  2   | resolve | n31-023-215 |  |0  |0  |
|  3   | resolve | n31-023-218 |  |0  |0  |
|  4   | resolve | n31-023-220 |  |0  |0  |
|  5   | resolve | n31-023-217 |  |0  |0  |
|  6   | resolve | n31-023-222 |  |0  |0  |
|  7   | resolve | n31-023-216 |  |0  |0  |
|  8   | resolve | n31-023-221 |  |0  |0  |
|  9   | resolve | n31-023-223 |  |0  |0  |
|  10  | resolve | n31-023-225 |  |0  |0  |
|  11  | resolve | n31-023-224 |  |0  |0  |
|  12  | resolve | n31-023-219 |  |0  |0  |
|  13  | resolve | n31-023-229 |  |0  |0  |
+--+-+-+--+---+---+
+-+--+---+---+
|   Pool  |   type   |  used | avail |
+-+--+---+---+
| cephfs_metadata | metadata | 2843M | 34.9T |
|   cephfs_data   |   data   | 2580T |  731T |
+-+--+---+---+

+-+
| Standby MDS |
+-+
| n31-023-227 |
| n31-023-226 |
| n31-023-228 |
+-+



ceph fs dump:

dumped fsmap epoch 22712
e22712
enable_multiple, ever_enabled_multiple: 0,0
compat: compat={},rocompat={},incompat={1=base v0.20,2=client
writeable ranges,3=default file layouts on dirs,4=dir inode in
separate object,5=mds uses versioned encoding,6=dirfrag is stored in
omap,8=no anchor table,9=file layout v2}
legacy client fscid: 1

Filesystem 'cephfs' (1)
fs_name cephfs
epoch 22711
flags 4
created 2018-11-30 10:05:06.015325
modified 2019-06-19 23:37:41.400961
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
last_failure 0
last_failure_osd_epoch 22246
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable
ranges,3=default file layouts on dirs,4=dir inode in separate
object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no
anchor table,9=file layout v2}
max_mds 14
in 0,1,2,3,4,5,6,7,8,9,10,11,12,13
up 
{1=31684663,2=31684674,3=31684576,4=31684673,5=31684678,6=31684612,7=31684688,8=31684683,9=31684698,10=31684695,11=31684693,12=31684586,13=31684617}
failed
damaged 0
stopped
data_pools [2]
metadata_pool 1
inline_data disabled
balancer
standby_count_wanted 1
31684663: 10.31.23.214:6800/829459839 'n31-023-214' mds.1.22682 up:resolve seq 6
31684674: 10.31.23.215:6800/2483123757 'n31-023-215' mds.2.22683
up:resolve seq 3
31684576: 10.31.23.218:6800/3381299029 'n31-023-218' mds.3.22683
up:resolve seq 3
31684673: 10.31.23.220:6800/3540255817 'n31-023-220' mds.4.22685
up:resolve seq 3
31684678: 10.31.23.217:6800/4004537495 'n31-023-217' mds.5.22689
up:resolve seq 3
31684612: 10.31.23.222:6800/1482899141 'n31-023-222' mds.6.22691
up:resolve seq 3
31684688: 10.31.23.216:6800/820115186 'n31-023-216' mds.7.22693 up:resolve seq 3
31684683: 10.31.23.221:6800/1996416037 'n31-023-221' mds.8.22693
up:resolve seq 3
31684698: 10.31.23.223:6800/2807778042 'n31-023-223' mds.9.22695
up:resolve seq 3
31684695: 10.31.23.225:6800/101451176 'n31-023-225' mds.10.22702
up:resolve seq 3
31684693: 10.31.23.224:6800/1597373084 'n31-023-224' mds.11.22695
up:resolve seq 3
31684586: 10.31.23.219:6800/3640206080 'n31-023-219' mds.12.22695
up:resolve seq 3
31684617: 10.31.23.229:6800/3511814011 'n31-023-229' mds.13.22697
up:resolve seq 3


Standby daemons:

31684637: 10.31.23.227:6800/1987867930 'n31-023-227' mds.-1.0 up:standby seq 2
31684690: 10.31.23.226:6800/3695913629 'n31-023-226' mds.-1.0 up:standby seq 2
31689991: 10.31.23.228:6800/2624666750 'n31-023-228' mds.-1.0 up:standby seq 2
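
(Before reaching for the disaster-recovery tools, a minimal read-only checklist,
using the fs/daemon names shown above; corrections welcome:)

ceph -s
ceph fs dump > fsmap.backup.txt            # keep a copy of the current fsmap
ceph daemon mds.n31-023-214 status         # run on each MDS host stuck in up:resolve
cephfs-journal-tool journal inspect        # rank 0 by default; add --rank=cephfs:0 if your build supports/requires it
cephfs-journal-tool journal export rank0-journal.bin   # back up the journal before any recovery action

If rank 0 really is marked damaged (the 'damaged 0' line in the fs dump suggests
so), 'ceph mds repaired cephfs:0' clears that flag so a standby can try to take
the rank, but only after the MDS logs explain why it was marked damaged.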


Re: [ceph-users] rbd mirror journal data

2018-11-06 Thread Wei Jin
Yes, we do one-way replication and the 'remote' cluster is the secondary
cluster, so the rbd-mirror daemon runs there.
We can confirm the daemon is working because we observed IO workload. The
remote cluster is actually bigger than the 'local' cluster, so it should be able
to keep up with the IO workload. So it is confusing why there is so much
journal data that cannot be trimmed promptly. (The local cluster also has the
capacity for more IO workload, including trimming operations.)
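
(For reference, the read-only checks that show whether the replay side is making
progress; pool/image names below are placeholders:)

rbd mirror pool status --verbose <pool>    # on the secondary: per-image mirroring state and description
rbd journal status --image <image>         # on the primary: watch the mirror client's commit_position advance

As far as I understand, minimum_set can only advance as fast as the slowest
registered client's commit_position, so the journal cannot be trimmed past
whatever the 89024ad3... client has committed.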


> On Nov 6, 2018, at 9:25 PM, Jason Dillaman  wrote:
> 
> On Tue, Nov 6, 2018 at 1:12 AM Wei Jin <wjin...@gmail.com> wrote:
>> 
>> Thanks.
>> I found that both minimum and active set are very large in my cluster, is it 
>> expected?
>> By the way, I do snapshot for each image half an hour,and keep snapshots for 
>> two days.
>> 
>> Journal status:
>> 
>> minimum_set: 671839
>> active_set: 1197917
>> registered clients:
>>[id=, commit_position=[positions=[[object_number=4791670, tag_tid=3, 
>> entry_tid=4146742458], [object_number=4791669, tag_tid=3, 
>> entry_tid=4146742457], [object_number=4791668, tag_tid=3, 
>> entry_tid=4146742456], [object_number=4791671, tag_tid=3, 
>> entry_tid=4146742455]]], state=connected]
>>[id=89024ad3-57a7-42cc-99d4-67f33b093704, 
>> commit_position=[positions=[[object_number=2687357, tag_tid=3, 
>> entry_tid=1188516421], [object_number=2687356, tag_tid=3, 
>> entry_tid=1188516420], [object_number=2687359, tag_tid=3, 
>> entry_tid=1188516419], [object_number=2687358, tag_tid=3, 
>> entry_tid=1188516418]]], state=connected]
>> 
> 
> Are you attempting to run "rbd-mirror" daemon on a remote cluster? It
> just appears like either the daemon is not running or that it's so far
> behind that it's just not able to keep up with the IO workload of the
> image. You can run "rbd journal disconnect --image 
> --client-id=89024ad3-57a7-42cc-99d4-67f33b093704" to force-disconnect
> the remote client and start the journal trimming process.
> 
>>> On Nov 6, 2018, at 3:39 AM, Jason Dillaman  wrote:
>>> 
>>> On Sun, Nov 4, 2018 at 11:59 PM Wei Jin  wrote:
>>>> 
>>>> Hi, Jason,
>>>> 
>>>> I have a question about rbd mirroring. When enable mirroring, we observed 
>>>> that there are a lot of objects prefix with journal_data, thus it consumes 
>>>> a lot of disk space.
>>>> 
>>>> When will these journal objects be deleted? And are there any parameters 
>>>> to accelerate it?
>>>> Thanks.
>>>> 
>>> 
>>> Journal data objects should be automatically deleted when the journal
>>> is trimmed beyond the position of the object. If you run "rbd journal
>>> status --image ", you should be able to see the minimum
>>> in-use set and the current active set for new journal entries:
>>> 
>>> $ rbd --cluster cluster1 journal status --image image1
>>> minimum_set: 7
>>> active_set: 8
>>> registered clients:
>>> [id=, commit_position=[positions=[[object_number=33, tag_tid=2,
>>> entry_tid=49153], [object_number=32, tag_tid=2, entry_tid=49152],
>>> [object_number=35, tag_tid=2, entry_tid=49151], [object_number=34,
>>> tag_tid=2, entry_tid=49150]]], state=connected]
>>> [id=81672c30-d735-46d4-a30a-53c221954d0e,
>>> commit_position=[positions=[[object_number=30, tag_tid=2,
>>> entry_tid=48034], [object_number=29, tag_tid=2, entry_tid=48033],
>>> [object_number=28, tag_tid=2, entry_tid=48032], [object_number=31,
>>> tag_tid=2, entry_tid=48031]]], state=connected]
>>> 
>>> $ rados --cluster cluster1 --pool rbd ls | grep journal_data | sort
>>> journal_data.1.1029b4577f90.28
>>> journal_data.1.1029b4577f90.29
>>> journal_data.1.1029b4577f90.30
>>> journal_data.1.1029b4577f90.31
>>> journal_data.1.1029b4577f90.32
>>> journal_data.1.1029b4577f90.33
>>> journal_data.1.1029b4577f90.34
>>> journal_data.1.1029b4577f90.35
>>> <..>
>>> 
>>> $ rbd --cluster cluster1 journal status --image image1
>>> minimum_set: 8
>>> active_set: 8
>>> registered clients:
>>> [id=, commit_position=[positions=[[object_number=33, tag_tid=2,
>>> entry_tid=49153], [object_number=32, tag_tid=2, entry_tid=49152],
>>> [object_number=35, tag_tid=2, entry_tid=49151], [object_number=34,
>>> tag_tid=2, entry_tid=49150]]], state=connected]
>>> [id=81672c30-d735-46d4-a30a-53c221954d0e,
>>> commit_position=[positions=[[object_number=33, tag_tid=2,
>>> entry_tid=49153], [object_number=32, tag_tid=2, entry_tid=49152],
>>> [object_number=35, tag_tid=2, entry_tid=49151], [object_number=34,
>>> tag_tid=2, entry_tid=49150]]], state=connected]
>>> 
>>> $ rados --cluster cluster1 --pool rbd ls | grep journal_data | sort
>>> journal_data.1.1029b4577f90.32
>>> journal_data.1.1029b4577f90.33
>>> journal_data.1.1029b4577f90.34
>>> journal_data.1.1029b4577f90.35
>>> 
>>> --
>>> Jason
>> 
> 
> 
> -- 
> Jason



Re: [ceph-users] rbd mirror journal data

2018-11-05 Thread Wei Jin
Thanks.
I found that both the minimum and active sets are very large in my cluster; is that
expected?
By the way, I take a snapshot of each image every half an hour, and keep snapshots for
two days.

Journal status:

minimum_set: 671839
active_set: 1197917
registered clients:
[id=, commit_position=[positions=[[object_number=4791670, tag_tid=3, 
entry_tid=4146742458], [object_number=4791669, tag_tid=3, 
entry_tid=4146742457], [object_number=4791668, tag_tid=3, 
entry_tid=4146742456], [object_number=4791671, tag_tid=3, 
entry_tid=4146742455]]], state=connected]
[id=89024ad3-57a7-42cc-99d4-67f33b093704, 
commit_position=[positions=[[object_number=2687357, tag_tid=3, 
entry_tid=1188516421], [object_number=2687356, tag_tid=3, 
entry_tid=1188516420], [object_number=2687359, tag_tid=3, 
entry_tid=1188516419], [object_number=2687358, tag_tid=3, 
entry_tid=1188516418]]], state=connected]
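
(For a rough sense of scale, assuming the journaling defaults of
rbd_journal_splay_width = 4 and rbd_journal_order = 24, i.e. journal objects of
up to 16 MiB each:)

(1197917 - 671839) sets x 4 objects/set     ~= 2.1 million journal_data objects
2.1 million objects x 16 MiB (upper bound)  ~= up to ~32 TiB of untrimmed journal

So the gap between minimum_set and active_set really does translate into a lot
of disk space.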



> On Nov 6, 2018, at 3:39 AM, Jason Dillaman  wrote:
> 
> On Sun, Nov 4, 2018 at 11:59 PM Wei Jin  wrote:
>> 
>> Hi, Jason,
>> 
>> I have a question about rbd mirroring. When enable mirroring, we observed 
>> that there are a lot of objects prefix with journal_data, thus it consumes a 
>> lot of disk space.
>> 
>> When will these journal objects be deleted? And are there any parameters to 
>> accelerate it?
>> Thanks.
>> 
> 
> Journal data objects should be automatically deleted when the journal
> is trimmed beyond the position of the object. If you run "rbd journal
> status --image ", you should be able to see the minimum
> in-use set and the current active set for new journal entries:
> 
> $ rbd --cluster cluster1 journal status --image image1
> minimum_set: 7
> active_set: 8
> registered clients:
> [id=, commit_position=[positions=[[object_number=33, tag_tid=2,
> entry_tid=49153], [object_number=32, tag_tid=2, entry_tid=49152],
> [object_number=35, tag_tid=2, entry_tid=49151], [object_number=34,
> tag_tid=2, entry_tid=49150]]], state=connected]
> [id=81672c30-d735-46d4-a30a-53c221954d0e,
> commit_position=[positions=[[object_number=30, tag_tid=2,
> entry_tid=48034], [object_number=29, tag_tid=2, entry_tid=48033],
> [object_number=28, tag_tid=2, entry_tid=48032], [object_number=31,
> tag_tid=2, entry_tid=48031]]], state=connected]
> 
> $ rados --cluster cluster1 --pool rbd ls | grep journal_data | sort
> journal_data.1.1029b4577f90.28
> journal_data.1.1029b4577f90.29
> journal_data.1.1029b4577f90.30
> journal_data.1.1029b4577f90.31
> journal_data.1.1029b4577f90.32
> journal_data.1.1029b4577f90.33
> journal_data.1.1029b4577f90.34
> journal_data.1.1029b4577f90.35
> <..>
> 
> $ rbd --cluster cluster1 journal status --image image1
> minimum_set: 8
> active_set: 8
> registered clients:
> [id=, commit_position=[positions=[[object_number=33, tag_tid=2,
> entry_tid=49153], [object_number=32, tag_tid=2, entry_tid=49152],
> [object_number=35, tag_tid=2, entry_tid=49151], [object_number=34,
> tag_tid=2, entry_tid=49150]]], state=connected]
> [id=81672c30-d735-46d4-a30a-53c221954d0e,
> commit_position=[positions=[[object_number=33, tag_tid=2,
> entry_tid=49153], [object_number=32, tag_tid=2, entry_tid=49152],
> [object_number=35, tag_tid=2, entry_tid=49151], [object_number=34,
> tag_tid=2, entry_tid=49150]]], state=connected]
> 
> $ rados --cluster cluster1 --pool rbd ls | grep journal_data | sort
> journal_data.1.1029b4577f90.32
> journal_data.1.1029b4577f90.33
> journal_data.1.1029b4577f90.34
> journal_data.1.1029b4577f90.35
> 
> -- 
> Jason



Re: [ceph-users] No more Luminous packages for Debian Jessie ??

2018-03-07 Thread Wei Jin
Same issue here.
Will the Ceph community support Debian Jessie in the future?

On Mon, Mar 5, 2018 at 6:33 PM, Florent B  wrote:
> Jessie is no more supported ??
> https://download.ceph.com/debian-luminous/dists/jessie/main/binary-amd64/Packages
> only contains ceph-deploy package !
>
>
> On 28/02/2018 10:24, Florent B wrote:
>> Hi,
>>
>> Since yesterday, the "ceph-luminous" repository does not contain any
>> package for Debian Jessie.
>>
>> Is it expected ?
>>
>> Thank you.
>>
>> Florent
>>


Re: [ceph-users] cephfs miss data for 15s when master mds rebooting

2017-12-17 Thread Wei Jin
On Fri, Dec 15, 2017 at 6:08 PM, John Spray  wrote:
> On Fri, Dec 15, 2017 at 1:45 AM, 13605702...@163.com
> <13605702...@163.com> wrote:
>> hi
>>
>> i used 3 nodes to deploy mds (each node also has mon on it)
>>
>> my config:
>> [mds.ceph-node-10-101-4-17]
>> mds_standby_replay = true
>> mds_standby_for_rank = 0
>>
>> [mds.ceph-node-10-101-4-21]
>> mds_standby_replay = true
>> mds_standby_for_rank = 0
>>
>> [mds.ceph-node-10-101-4-22]
>> mds_standby_replay = true
>> mds_standby_for_rank = 0
>>
>> the mds stat:
>> e29: 1/1/1 up {0=ceph-node-10-101-4-22=up:active}, 1 up:standby-replay, 1
>> up:standby
>>
>> i mount the cephfs on the ceph client, and run the test script to write data
>> into file under the cephfs dir,
>> when i reboot the master mds, and i found the data is not written into the
>> file.
>> after 15 seconds, data can be written into the file again
>>
>> so my question is:
>> is this normal when reboot the master mds?
>> when will the up:standby-replay mds take over the the cephfs?
>
> The standby takes over after the active daemon has not reported to the
> monitors for `mds_beacon_grace` seconds, which as you have noticed is
> 15s by default.
>
> If you know you are rebooting something, you can pre-empt the timeout
> mechanism by using "ceph mds fail" on the active daemon, to cause
> another to take over right away.

Why must a rebooting MDS wait for the grace time?
Is it possible (or reasonable) for the daemon itself to tell the monitor that it
is going down during a reboot?
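
(To illustrate John's suggestion above, for a planned reboot one would run
something like the following first, using the daemon name from the example
config:)

ceph mds fail ceph-node-10-101-4-22   # the standby-replay daemon takes over right away

and then reboot the node, instead of waiting out the full mds_beacon_grace
window of 15s.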


Re: [ceph-users] cephfs mds millions of caps

2017-12-14 Thread Wei Jin
>
> So, questions: does that really matter? What are possible impacts? What
> could have caused this 2 hosts to hold so many capabilities?
> 1 of the hosts are for tests purposes, traffic is close to zero. The other
> host wasn't using cephfs at all. All services stopped.
>

The reason might be the updatedb program; you can forbid it from scanning your
mount point.
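
(For example, on mlocate-based systems, something like the following in
/etc/updatedb.conf, assuming a hypothetical kernel-client mount at /mnt/cephfs:)

PRUNEFS = "... ceph fuse.ceph-fuse"    # skip cephfs mounts by filesystem type
PRUNEPATHS = "... /mnt/cephfs"         # or skip the mount point explicitly

Otherwise a nightly updatedb walk of the whole tree can pin a huge number of
caps on an otherwise idle client.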


Re: [ceph-users] ceph-deploy failed to deploy osd randomly

2017-11-15 Thread Wei Jin
I tried purge/purgedata and then redid the deploy command a few times, and it
still fails to start the OSD.
And there is no error log; does anyone know what the problem is?
BTW, my OS is Debian with a 4.4 kernel.
Thanks.
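
(For reference, ceph-disk activation errors usually end up somewhere other than
the ceph-deploy output; device names below are examples:)

ceph-disk list                              # does the data partition show as prepared but not active?
journalctl --since today | grep -iE 'ceph-disk|ceph-osd'   # activation attempts and their errors
grep sdb /var/log/syslog                    # udev/ceph-disk messages often land in syslog rather than ceph logs
ceph-disk -v activate /dev/sdb1             # retry activation by hand to get the full error on the console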


On Wed, Nov 15, 2017 at 8:24 PM, Wei Jin <wjin...@gmail.com> wrote:
> Hi, List,
>
> My machine has 12 SSDs disk, and I use ceph-deploy to deploy them. But for
> some machine/disks,it failed to start osd.
> I tried many times, some success but others failed. But there is no error
> info.
> Following is ceph-deploy log for one disk:
>
>
> root@n10-075-012:~# ceph-deploy osd create --zap-disk n10-075-094:sdb:sdb
> [ceph_deploy.conf][DEBUG ] found configuration file at:
> /root/.cephdeploy.conf
> [ceph_deploy.cli][INFO  ] Invoked (1.5.39): /usr/bin/ceph-deploy osd create
> --zap-disk n10-075-094:sdb:sdb
> [ceph_deploy.cli][INFO  ] ceph-deploy options:
> [ceph_deploy.cli][INFO  ]  username  : None
> [ceph_deploy.cli][INFO  ]  block_db  : None
> [ceph_deploy.cli][INFO  ]  disk  : [('n10-075-094',
> '/dev/sdb', '/dev/sdb')]
> [ceph_deploy.cli][INFO  ]  dmcrypt   : False
> [ceph_deploy.cli][INFO  ]  verbose   : False
> [ceph_deploy.cli][INFO  ]  bluestore : None
> [ceph_deploy.cli][INFO  ]  block_wal : None
> [ceph_deploy.cli][INFO  ]  overwrite_conf: False
> [ceph_deploy.cli][INFO  ]  subcommand: create
> [ceph_deploy.cli][INFO  ]  dmcrypt_key_dir   :
> /etc/ceph/dmcrypt-keys
> [ceph_deploy.cli][INFO  ]  quiet : False
> [ceph_deploy.cli][INFO  ]  cd_conf   :
> 
> [ceph_deploy.cli][INFO  ]  cluster   : ceph
> [ceph_deploy.cli][INFO  ]  fs_type   : xfs
> [ceph_deploy.cli][INFO  ]  filestore : None
> [ceph_deploy.cli][INFO  ]  func  : <function osd at 0x7f566ae9a938>
> [ceph_deploy.cli][INFO  ]  ceph_conf : None
> [ceph_deploy.cli][INFO  ]  default_release   : False
> [ceph_deploy.cli][INFO  ]  zap_disk  : True
> [ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks
> n10-075-094:/dev/sdb:/dev/sdb
> [n10-075-094][DEBUG ] connected to host: n10-075-094
> [n10-075-094][DEBUG ] detect platform information from remote host
> [n10-075-094][DEBUG ] detect machine type
> [n10-075-094][DEBUG ] find the location of an executable
> [ceph_deploy.osd][INFO  ] Distro info: debian 8.9 jessie
> [ceph_deploy.osd][DEBUG ] Deploying osd to n10-075-094
> [n10-075-094][DEBUG ] write cluster configuration to
> /etc/ceph/{cluster}.conf
> [ceph_deploy.osd][DEBUG ] Preparing host n10-075-094 disk /dev/sdb journal
> /dev/sdb activate True
> [n10-075-094][DEBUG ] find the location of an executable
> [n10-075-094][INFO  ] Running command: /usr/sbin/ceph-disk -v prepare
> --zap-disk --cluster ceph --fs-type xfs -- /dev/sdb /dev/sdb
> [n10-075-094][WARNIN] command: Running command: /usr/bin/ceph-osd
> --cluster=ceph --show-config-value=fsid
> [n10-075-094][WARNIN] command: Running command: /usr/bin/ceph-osd
> --check-allows-journal -i 0 --log-file $run_dir/$cluster-osd-check.log
> --cluster ceph --setuser ceph --setgroup ceph
> [n10-075-094][WARNIN] command: Running command: /usr/bin/ceph-osd
> --check-wants-journal -i 0 --log-file $run_dir/$cluster-osd-check.log
> --cluster ceph --setuser ceph --setgroup ceph
> [n10-075-094][WARNIN] command: Running command: /usr/bin/ceph-osd
> --check-needs-journal -i 0 --log-file $run_dir/$cluster-osd-check.log
> --cluster ceph --setuser ceph --setgroup ceph
> [n10-075-094][WARNIN] get_dm_uuid: get_dm_uuid /dev/sdb uuid path is
> /sys/dev/block/8:16/dm/uuid
> [n10-075-094][WARNIN] command: Running command: /usr/bin/ceph-osd
> --cluster=ceph --show-config-value=osd_journal_size
> [n10-075-094][WARNIN] get_dm_uuid: get_dm_uuid /dev/sdb uuid path is
> /sys/dev/block/8:16/dm/uuid
> [n10-075-094][WARNIN] get_dm_uuid: get_dm_uuid /dev/sdb uuid path is
> /sys/dev/block/8:16/dm/uuid
> [n10-075-094][WARNIN] get_dm_uuid: get_dm_uuid /dev/sdb uuid path is
> /sys/dev/block/8:16/dm/uuid
> [n10-075-094][WARNIN] get_dm_uuid: get_dm_uuid /dev/sdb1 uuid path is
> /sys/dev/block/8:17/dm/uuid
> [n10-075-094][WARNIN] get_dm_uuid: get_dm_uuid /dev/sdb2 uuid path is
> /sys/dev/block/8:18/dm/uuid
> [n10-075-094][WARNIN] command: Running command: /usr/bin/ceph-conf
> --cluster=ceph --name=osd. --lookup osd_mkfs_options_xfs
> [n10-075-094][WARNIN] command: Running command: /usr/bin/ceph-conf
> --cluster=ceph --name=osd. --lookup osd_fs_mkfs_options_xfs
> [n10-075-094][WARNIN] command: Running command: /usr/bin/cep

[ceph-users] ceph-deploy failed to deploy osd randomly

2017-11-15 Thread Wei Jin
Hi, List,

My machine has 12 SSDs.
There are some errors from ceph-deploy.
It fails randomly.

root@n10-075-012:~# ceph-deploy osd create --zap-disk n10-075-094:sdb:sdb
[ceph_deploy.conf][DEBUG ] found configuration file at:
/root/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.39): /usr/bin/ceph-deploy osd create
--zap-disk n10-075-094:sdb:sdb
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  username  : None
[ceph_deploy.cli][INFO  ]  block_db  : None
[ceph_deploy.cli][INFO  ]  disk  : [('n10-075-094',
'/dev/sdb', '/dev/sdb')]
[ceph_deploy.cli][INFO  ]  dmcrypt   : False
[ceph_deploy.cli][INFO  ]  verbose   : False
[ceph_deploy.cli][INFO  ]  bluestore : None
[ceph_deploy.cli][INFO  ]  block_wal : None
[ceph_deploy.cli][INFO  ]  overwrite_conf: False
[ceph_deploy.cli][INFO  ]  subcommand: create
[ceph_deploy.cli][INFO  ]  dmcrypt_key_dir   :
/etc/ceph/dmcrypt-keys
[ceph_deploy.cli][INFO  ]  quiet : False
[ceph_deploy.cli][INFO  ]  cd_conf   :

[ceph_deploy.cli][INFO  ]  cluster   : ceph
[ceph_deploy.cli][INFO  ]  fs_type   : xfs
[ceph_deploy.cli][INFO  ]  filestore : None
[ceph_deploy.cli][INFO  ]  func  : 
[ceph_deploy.cli][INFO  ]  ceph_conf : None
[ceph_deploy.cli][INFO  ]  default_release   : False
[ceph_deploy.cli][INFO  ]  zap_disk  : True
[ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks
n10-075-094:/dev/sdb:/dev/sdb
[n10-075-094][DEBUG ] connected to host: n10-075-094
[n10-075-094][DEBUG ] detect platform information from remote host
[n10-075-094][DEBUG ] detect machine type
[n10-075-094][DEBUG ] find the location of an executable
[ceph_deploy.osd][INFO  ] Distro info: debian 8.9 jessie
[ceph_deploy.osd][DEBUG ] Deploying osd to n10-075-094
[n10-075-094][DEBUG ] write cluster configuration to
/etc/ceph/{cluster}.conf
[ceph_deploy.osd][DEBUG ] Preparing host n10-075-094 disk /dev/sdb journal
/dev/sdb activate True
[n10-075-094][DEBUG ] find the location of an executable
[n10-075-094][INFO  ] Running command: /usr/sbin/ceph-disk -v prepare
--zap-disk --cluster ceph --fs-type xfs -- /dev/sdb /dev/sdb
[n10-075-094][WARNIN] command: Running command: /usr/bin/ceph-osd
--cluster=ceph --show-config-value=fsid
[n10-075-094][WARNIN] command: Running command: /usr/bin/ceph-osd
--check-allows-journal -i 0 --log-file $run_dir/$cluster-osd-check.log
--cluster ceph --setuser ceph --setgroup ceph
[n10-075-094][WARNIN] command: Running command: /usr/bin/ceph-osd
--check-wants-journal -i 0 --log-file $run_dir/$cluster-osd-check.log
--cluster ceph --setuser ceph --setgroup ceph
[n10-075-094][WARNIN] command: Running command: /usr/bin/ceph-osd
--check-needs-journal -i 0 --log-file $run_dir/$cluster-osd-check.log
--cluster ceph --setuser ceph --setgroup ceph
[n10-075-094][WARNIN] get_dm_uuid: get_dm_uuid /dev/sdb uuid path is
/sys/dev/block/8:16/dm/uuid
[n10-075-094][WARNIN] command: Running command: /usr/bin/ceph-osd
--cluster=ceph --show-config-value=osd_journal_size
[n10-075-094][WARNIN] get_dm_uuid: get_dm_uuid /dev/sdb uuid path is
/sys/dev/block/8:16/dm/uuid
[n10-075-094][WARNIN] get_dm_uuid: get_dm_uuid /dev/sdb uuid path is
/sys/dev/block/8:16/dm/uuid
[n10-075-094][WARNIN] get_dm_uuid: get_dm_uuid /dev/sdb uuid path is
/sys/dev/block/8:16/dm/uuid
[n10-075-094][WARNIN] get_dm_uuid: get_dm_uuid /dev/sdb1 uuid path is
/sys/dev/block/8:17/dm/uuid
[n10-075-094][WARNIN] get_dm_uuid: get_dm_uuid /dev/sdb2 uuid path is
/sys/dev/block/8:18/dm/uuid
[n10-075-094][WARNIN] command: Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_mkfs_options_xfs
[n10-075-094][WARNIN] command: Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_fs_mkfs_options_xfs
[n10-075-094][WARNIN] command: Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_mount_options_xfs
[n10-075-094][WARNIN] get_dm_uuid: get_dm_uuid /dev/sdb uuid path is
/sys/dev/block/8:16/dm/uuid
[n10-075-094][WARNIN] zap: Zapping partition table on /dev/sdb
[n10-075-094][WARNIN] command_check_call: Running command: /sbin/sgdisk
--zap-all -- /dev/sdb
[n10-075-094][WARNIN] Caution: invalid backup GPT header, but valid main
header; regenerating
[n10-075-094][WARNIN] backup header from main header.
[n10-075-094][WARNIN]
[n10-075-094][WARNIN] Warning! Main and backup partition tables differ! Use
the 'c' and 'e' options
[n10-075-094][WARNIN] on the recovery & transformation menu to examine the
two tables.
[n10-075-094][WARNIN]
[n10-075-094][WARNIN] Warning! One or more CRCs don't match. You should
repair the disk!
[n10-075-094][WARNIN]
[n10-075-094][DEBUG ] **

[ceph-users] ceph-deploy failed to deploy osd randomly

2017-11-15 Thread Wei Jin
Hi, List,

My machine has 12 SSD disks, and I use ceph-deploy to deploy them. But for some
machines/disks, it fails to start the OSD.
I tried many times; some succeeded but others failed, and there is no error info.
The following is the ceph-deploy log for one disk:


root@n10-075-012:~# ceph-deploy osd create --zap-disk n10-075-094:sdb:sdb
[ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.39): /usr/bin/ceph-deploy osd create 
--zap-disk n10-075-094:sdb:sdb
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  username  : None
[ceph_deploy.cli][INFO  ]  block_db  : None
[ceph_deploy.cli][INFO  ]  disk  : [('n10-075-094', 
'/dev/sdb', '/dev/sdb')]
[ceph_deploy.cli][INFO  ]  dmcrypt   : False
[ceph_deploy.cli][INFO  ]  verbose   : False
[ceph_deploy.cli][INFO  ]  bluestore : None
[ceph_deploy.cli][INFO  ]  block_wal : None
[ceph_deploy.cli][INFO  ]  overwrite_conf: False
[ceph_deploy.cli][INFO  ]  subcommand: create
[ceph_deploy.cli][INFO  ]  dmcrypt_key_dir   : 
/etc/ceph/dmcrypt-keys
[ceph_deploy.cli][INFO  ]  quiet : False
[ceph_deploy.cli][INFO  ]  cd_conf   : 

[ceph_deploy.cli][INFO  ]  cluster   : ceph
[ceph_deploy.cli][INFO  ]  fs_type   : xfs
[ceph_deploy.cli][INFO  ]  filestore : None
[ceph_deploy.cli][INFO  ]  func  : 
[ceph_deploy.cli][INFO  ]  ceph_conf : None
[ceph_deploy.cli][INFO  ]  default_release   : False
[ceph_deploy.cli][INFO  ]  zap_disk  : True
[ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks 
n10-075-094:/dev/sdb:/dev/sdb
[n10-075-094][DEBUG ] connected to host: n10-075-094
[n10-075-094][DEBUG ] detect platform information from remote host
[n10-075-094][DEBUG ] detect machine type
[n10-075-094][DEBUG ] find the location of an executable
[ceph_deploy.osd][INFO  ] Distro info: debian 8.9 jessie
[ceph_deploy.osd][DEBUG ] Deploying osd to n10-075-094
[n10-075-094][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[ceph_deploy.osd][DEBUG ] Preparing host n10-075-094 disk /dev/sdb journal 
/dev/sdb activate True
[n10-075-094][DEBUG ] find the location of an executable
[n10-075-094][INFO  ] Running command: /usr/sbin/ceph-disk -v prepare 
--zap-disk --cluster ceph --fs-type xfs -- /dev/sdb /dev/sdb
[n10-075-094][WARNIN] command: Running command: /usr/bin/ceph-osd 
--cluster=ceph --show-config-value=fsid
[n10-075-094][WARNIN] command: Running command: /usr/bin/ceph-osd 
--check-allows-journal -i 0 --log-file $run_dir/$cluster-osd-check.log 
--cluster ceph --setuser ceph --setgroup ceph
[n10-075-094][WARNIN] command: Running command: /usr/bin/ceph-osd 
--check-wants-journal -i 0 --log-file $run_dir/$cluster-osd-check.log --cluster 
ceph --setuser ceph --setgroup ceph
[n10-075-094][WARNIN] command: Running command: /usr/bin/ceph-osd 
--check-needs-journal -i 0 --log-file $run_dir/$cluster-osd-check.log --cluster 
ceph --setuser ceph --setgroup ceph
[n10-075-094][WARNIN] get_dm_uuid: get_dm_uuid /dev/sdb uuid path is 
/sys/dev/block/8:16/dm/uuid
[n10-075-094][WARNIN] command: Running command: /usr/bin/ceph-osd 
--cluster=ceph --show-config-value=osd_journal_size
[n10-075-094][WARNIN] get_dm_uuid: get_dm_uuid /dev/sdb uuid path is 
/sys/dev/block/8:16/dm/uuid
[n10-075-094][WARNIN] get_dm_uuid: get_dm_uuid /dev/sdb uuid path is 
/sys/dev/block/8:16/dm/uuid
[n10-075-094][WARNIN] get_dm_uuid: get_dm_uuid /dev/sdb uuid path is 
/sys/dev/block/8:16/dm/uuid
[n10-075-094][WARNIN] get_dm_uuid: get_dm_uuid /dev/sdb1 uuid path is 
/sys/dev/block/8:17/dm/uuid
[n10-075-094][WARNIN] get_dm_uuid: get_dm_uuid /dev/sdb2 uuid path is 
/sys/dev/block/8:18/dm/uuid
[n10-075-094][WARNIN] command: Running command: /usr/bin/ceph-conf 
--cluster=ceph --name=osd. --lookup osd_mkfs_options_xfs
[n10-075-094][WARNIN] command: Running command: /usr/bin/ceph-conf 
--cluster=ceph --name=osd. --lookup osd_fs_mkfs_options_xfs
[n10-075-094][WARNIN] command: Running command: /usr/bin/ceph-conf 
--cluster=ceph --name=osd. --lookup osd_mount_options_xfs
[n10-075-094][WARNIN] get_dm_uuid: get_dm_uuid /dev/sdb uuid path is 
/sys/dev/block/8:16/dm/uuid
[n10-075-094][WARNIN] zap: Zapping partition table on /dev/sdb
[n10-075-094][WARNIN] command_check_call: Running command: /sbin/sgdisk 
--zap-all -- /dev/sdb
[n10-075-094][WARNIN] Caution: invalid backup GPT header, but valid main 
header; regenerating
[n10-075-094][WARNIN] backup header from main header.
[n10-075-094][WARNIN]
[n10-075-094][WARNIN] Warning! Main and backup partition tables differ! Use the 
'c' and 'e' options
[n10-075-094][WARNIN] on the recovery & transformation menu to examine the two 
tables.
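
(Whether or not these GPT warnings are the actual cause of the failed
activations, a more thorough wipe before retrying sometimes helps; illustrative
only, and destructive to /dev/sdb:)

ceph-disk zap /dev/sdb                                   # or, by hand:
sgdisk --zap-all -- /dev/sdb
dd if=/dev/zero of=/dev/sdb bs=1M count=10 oflag=direct  # clear any leftover metadata at the start of the disk
partprobe /dev/sdb                                       # make the kernel re-read the (now empty) partition table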

Re: [ceph-users] pg inconsistent and repair doesn't work

2017-10-25 Thread Wei Jin
I found it is similar to this bug: http://tracker.ceph.com/issues/21388,
and fixed it with a rados command.

The pg inconsistent info looks like the following; I wish it could be fixed in the future.

root@n10-075-019:/var/lib/ceph/osd/ceph-27/current/1.fcd_head# rados
list-inconsistent-obj 1.fcd --format=json-pretty
{
"epoch": 2373,
"inconsistents": [
{
"object": {
"name": "103528d.0058",
"nspace": "fsvolumens_87c46348-9869-11e7-8525-3497f65a8415",
"locator": "",
"snap": "head",
"version": 147490
},
"errors": [],
"union_shard_errors": [
"size_mismatch_oi"
],
"selected_object_info":
"1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head(2401'147490
client.901549.1:33749 dirty|omap_digest s 3461120 uv 147490 od
 alloc_hint [0 0])",
"shards": [
{
"osd": 27,
"errors": [
"size_mismatch_oi"
],
"size": 0,
"omap_digest": "0x",
"data_digest": "0x"
},
{
"osd": 62,
"errors": [
"size_mismatch_oi"
],
"size": 0,
"omap_digest": "0x",
"data_digest": "0x"
},
{
            "osd": 133,
"errors": [
"size_mismatch_oi"
],
"size": 0,
"omap_digest": "0x",
"data_digest": "0x"
}
]
}
]
}
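
(The exact rados command isn't shown above, so purely as an illustration of one
possible way to clear this size_mismatch_oi case, assuming the object data is
already lost (all replicas are 0 bytes) and using the namespace/object names
from the listing, with <data-pool> standing in for pool id 1:)

rados -p <data-pool> -N fsvolumens_87c46348-9869-11e7-8525-3497f65a8415 get 103528d.0058 /tmp/obj
rados -p <data-pool> -N fsvolumens_87c46348-9869-11e7-8525-3497f65a8415 put 103528d.0058 /tmp/obj   # rewrite so the object_info size matches what is on disk
ceph pg repair 1.fcd
ceph pg deep-scrub 1.fcd   # confirm the inconsistency is gone

This only clears the inconsistency; the ~3.4 MB recorded in the object info is
not recovered.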

On Wed, Oct 25, 2017 at 12:05 PM, Wei Jin <wjin...@gmail.com> wrote:
> Hi, list,
>
> We ran into pg deep scrub error. And we tried to repair it by `ceph pg
> repair pgid`. But it didn't work. We also verified object files,  and
> found both 3 replicas were zero size. What's the problem, whether it
> is a bug? And how to fix the inconsistent? I haven't restarted the
> osds so far as I am not sure whether it works.
>
> ceph version: 10.2.9
> user case: cephfs
> kernel client: 4.4/4.9
>
> Error info from primary osd:
>
> root@n10-075-019:~# grep -Hn 'ERR' /var/log/ceph/ceph-osd.27.log.1
> /var/log/ceph/ceph-osd.27.log.1:3038:2017-10-25 04:47:34.460536
> 7f39c4829700 -1 log_channel(cluster) log [ERR] : 1.fcd shard 27: soid
> 1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head
> size 0 != size 3461120 from auth oi
> 1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head(2401'147490
> client.901549.1:33749 dirty|omap_digest s 3461120 uv 147490 od
>  alloc_hint [0 0])
> /var/log/ceph/ceph-osd.27.log.1:3039:2017-10-25 04:47:34.460722
> 7f39c4829700 -1 log_channel(cluster) log [ERR] : 1.fcd shard 62: soid
> 1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head
> size 0 != size 3461120 from auth oi
> 1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head(2401'147490
> client.901549.1:33749 dirty|omap_digest s 3461120 uv 147490 od
>  alloc_hint [0 0])
> /var/log/ceph/ceph-osd.27.log.1:3040:2017-10-25 04:47:34.460725
> 7f39c4829700 -1 log_channel(cluster) log [ERR] : 1.fcd shard 133: soid
> 1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head
> size 0 != size 3461120 from auth oi
> 1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head(2401'147490
> client.901549.1:33749 dirty|omap_digest s 3461120 uv 147490 od
>  alloc_hint [0 0])
> /var/log/ceph/ceph-osd.27.log.1:3041:2017-10-25 04:47:34.460800
> 7f39c4829700 -1 log_channel(cluster) log [ERR] : 1.fcd soid
> 1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head:
> failed to pick suitable auth object
> /var/log/ceph/ceph-osd.27.log.1:3042:2017-10-25 04:47:34.461458
> 7f39c4829700 -1 log_channel(cluster) log [ERR] : deep-scrub 1.fcd
> 1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head
> on disk size (0) does not match object info size (3461120) adjusted
> for ondisk to (3461120)
> /v

[ceph-users] pg inconsistent and repair doesn't work

2017-10-24 Thread Wei Jin
Hi, list,

We ran into a pg deep-scrub error and tried to repair it with `ceph pg
repair <pgid>`, but it didn't work. We also verified the object files and
found that all 3 replicas were zero size. What is the problem? Is it a bug?
And how can we fix the inconsistency? I haven't restarted the OSDs so far,
as I am not sure whether that would help.

ceph version: 10.2.9
user case: cephfs
kernel client: 4.4/4.9

Error info from primary osd:

root@n10-075-019:~# grep -Hn 'ERR' /var/log/ceph/ceph-osd.27.log.1
/var/log/ceph/ceph-osd.27.log.1:3038:2017-10-25 04:47:34.460536
7f39c4829700 -1 log_channel(cluster) log [ERR] : 1.fcd shard 27: soid
1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head
size 0 != size 3461120 from auth oi
1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head(2401'147490
client.901549.1:33749 dirty|omap_digest s 3461120 uv 147490 od
 alloc_hint [0 0])
/var/log/ceph/ceph-osd.27.log.1:3039:2017-10-25 04:47:34.460722
7f39c4829700 -1 log_channel(cluster) log [ERR] : 1.fcd shard 62: soid
1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head
size 0 != size 3461120 from auth oi
1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head(2401'147490
client.901549.1:33749 dirty|omap_digest s 3461120 uv 147490 od
 alloc_hint [0 0])
/var/log/ceph/ceph-osd.27.log.1:3040:2017-10-25 04:47:34.460725
7f39c4829700 -1 log_channel(cluster) log [ERR] : 1.fcd shard 133: soid
1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head
size 0 != size 3461120 from auth oi
1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head(2401'147490
client.901549.1:33749 dirty|omap_digest s 3461120 uv 147490 od
 alloc_hint [0 0])
/var/log/ceph/ceph-osd.27.log.1:3041:2017-10-25 04:47:34.460800
7f39c4829700 -1 log_channel(cluster) log [ERR] : 1.fcd soid
1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head:
failed to pick suitable auth object
/var/log/ceph/ceph-osd.27.log.1:3042:2017-10-25 04:47:34.461458
7f39c4829700 -1 log_channel(cluster) log [ERR] : deep-scrub 1.fcd
1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head
on disk size (0) does not match object info size (3461120) adjusted
for ondisk to (3461120)
/var/log/ceph/ceph-osd.27.log.1:3043:2017-10-25 04:47:44.645934
7f39c4829700 -1 log_channel(cluster) log [ERR] : 1.fcd deep-scrub 4
errors


Object file info:

root@n10-075-019:/var/lib/ceph/osd/ceph-27/current/1.fcd_head# find .
-name "103528d.0058__head_12086FCD*"
./DIR_D/DIR_C/DIR_F/DIR_6/DIR_8/103528d.0058__head_12086FCD_fsvolumens\u87c46348-9869-11e7-8525-3497f65a8415_1
root@n10-075-019:/var/lib/ceph/osd/ceph-27/current/1.fcd_head# ls -al
./DIR_D/DIR_C/DIR_F/DIR_6/DIR_8/103528d.0058__head_12086FCD*
-rw-r--r-- 1 ceph ceph 0 Oct 24 22:04
./DIR_D/DIR_C/DIR_F/DIR_6/DIR_8/103528d.0058__head_12086FCD_fsvolumens\u87c46348-9869-11e7-8525-3497f65a8415_1
root@n10-075-019:/var/lib/ceph/osd/ceph-27/current/1.fcd_head#


root@n10-075-028:/var/lib/ceph/osd/ceph-62/current/1.fcd_head# find .
-name "103528d.0058__head_12086FCD*"
./DIR_D/DIR_C/DIR_F/DIR_6/DIR_8/103528d.0058__head_12086FCD_fsvolumens\u87c46348-9869-11e7-8525-3497f65a8415_1
root@n10-075-028:/var/lib/ceph/osd/ceph-62/current/1.fcd_head# ls -al
./DIR_D/DIR_C/DIR_F/DIR_6/DIR_8/103528d.0058__head_12086FCD*
-rw-r--r-- 1 ceph ceph 0 Oct 24 22:04
./DIR_D/DIR_C/DIR_F/DIR_6/DIR_8/103528d.0058__head_12086FCD_fsvolumens\u87c46348-9869-11e7-8525-3497f65a8415_1
root@n10-075-028:/var/lib/ceph/osd/ceph-62/current/1.fcd_head#


root@n10-075-040:/var/lib/ceph/osd/ceph-133/current/1.fcd_head# find .
-name "103528d.0058__head_12086FCD*"
./DIR_D/DIR_C/DIR_F/DIR_6/DIR_8/103528d.0058__head_12086FCD_fsvolumens\u87c46348-9869-11e7-8525-3497f65a8415_1
root@n10-075-040:/var/lib/ceph/osd/ceph-133/current/1.fcd_head# ls -al
./DIR_D/DIR_C/DIR_F/DIR_6/DIR_8/103528d.0058__head_12086FCD*
-rw-r--r-- 1 ceph ceph 0 Oct 24 22:04
./DIR_D/DIR_C/DIR_F/DIR_6/DIR_8/103528d.0058__head_12086FCD_fsvolumens\u87c46348-9869-11e7-8525-3497f65a8415_1
root@n10-075-040:/var/lib/ceph/osd/ceph-133/current/1.fcd_head#


Re: [ceph-users] clock skew

2017-04-01 Thread Wei Jin
On Sat, Apr 1, 2017 at 5:17 PM, mj  wrote:
> Hi,
>
> Despite ntp, we keep getting clock skews that auto disappear again after a
> few minutes.
>
> To prevent the unneccerasy HEALTH_WARNs, I have increased the
> mon_clock_drift_allowed to 0.2, as can be seen below:
>
>> root@ceph1:~# ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config
>> show | grep clock
>> "mon_clock_drift_allowed": "0.2",
>> "mon_clock_drift_warn_backoff": "5",
>> "clock_offset": "0",
>> root@ceph1:~#

mon_clock_drift_allowed is read by the monitor process, not the OSD. What's the
output of `ceph daemon mon.foo config show | grep clock`?

And how did you change the value: on the command line or in the config file?
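
(To rule out a config mismatch, a quick sketch, assuming the monitor id matches
the hostname ceph1:)

ceph daemon mon.ceph1 config show | grep clock               # run on the monitor host itself
ceph tell mon.* injectargs '--mon_clock_drift_allowed 0.2'   # runtime change on all monitors

To make it persistent, set it under the [mon] or [global] section of ceph.conf
and restart the monitors:

[mon]
mon clock drift allowed = 0.2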

>
>
> Despite this setting, I keep receiving HEALTH_WARNs like below:
>
>> ceph cluster node ceph1 health status became HEALTH_WARN clock skew
>> detected on mon.1; Monitor clock skew detected mon.1 addr 10.10.89.2:6789/0
>> clock skew 0.113709s > max 0.1s (latency 0.000523111s)
>
>
> Can anyone explain why the running config shows "mon_clock_drift_allowed":
> "0.2" and the HEALTH_WARN says "max 0.1s (latency 0.000523111s)"?
>
> How come there's a difference between the two?
>
> MJ


Re: [ceph-users] OSDs are flapping and marked down wrongly

2016-10-17 Thread Wei Jin
On Mon, Oct 17, 2016 at 3:16 PM, Somnath Roy  wrote:
> Hi Sage et. al,
>
> I know this issue is reported number of times in community and attributed to 
> either network issue or unresponsive OSDs.
> Recently, we are seeing this issue when our all SSD cluster (Jewel based)  is 
> stressed with large block size and very high QD. Lowering QD it is working 
> just fine.
> We are seeing the lossy connection message like below and followed by the osd 
> marked down by monitor.
>
> 2016-10-15 14:30:13.957534 7f6297bff700  0 -- 10.10.10.94:6810/2461767 
> submit_message osd_op_reply(1463 rbd_data.55246b8b4567.d633 
> [set-alloc-hint object_size 4194304 write_size 4194304,write 3932160~262144] 
> v222'95890 uv95890 ondisk = 0) v7 remote, 10.10.10.98:0/1174431362, dropping 
> message
>
> In the monitor log, I am seeing the osd is reported down by peers and 
> subsequently monitor is marking it down.
> OSDs is rejoining the cluster after detecting it is marked down wrongly and 
> rebalancing started. This is hurting performance very badly.

I think you need to tune the threads' timeout values, since heartbeat messages
are dropped while a thread is timed out or about to suicide (the internal
health check fails). That's why you observe the 'wrongly marked me down'
message while the osd process is still alive. See the function
OSD::handle_osd_ping().

Also, you could backport this PR (https://github.com/ceph/ceph/pull/8808) to
speed up the handling of heartbeat messages.

After that, you may consider tuning the grace time.
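
(Illustrative values only, not recommendations; the relevant knobs, typically
set in ceph.conf, e.g. under [global] so that OSDs and monitors agree on the
grace, or via injectargs:)

osd op thread timeout = 60            # a stuck op worker fails the internal heartbeat check after this many seconds
osd op thread suicide timeout = 300   # hard limit before the OSD asserts and exits
osd heartbeat grace = 30              # how long peers/monitors wait for heartbeats before reporting the OSD down

The first two control when a busy-but-alive OSD stops answering health checks;
the last is the grace period you already mentioned.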


>
> My question is the following.
>
> 1. I have 40Gb network and I am seeing network is not utilized beyond 
> 10-12Gb/s , no network error is reported. So, why this lossy connection 
> message is coming ? what could go wrong here ? Is it network prioritization 
> issue of smaller ping packets ? I tried to gaze ping round time during this 
> and nothing seems abnormal.
>
> 2. Nothing is saturated on the OSD side , plenty of network/memory/cpu/disk 
> is left. So, I doubt my osds are unresponsive but yes it is really busy on IO 
> path. Heartbeat is going through separate messenger and threads as well, so, 
> busy op threads should not be making heartbeat delayed. Increasing osd 
> heartbeat grace is only delaying this phenomenon , but, eventually happens 
> after several hours. Anything else we can tune here ?
>
> 3. What could be the side effect of big grace period ? I understand that 
> detecting a faulty osd will be delayed, anything else ?
>
> 4. I saw if an OSD is crashed, monitor will detect the down osd almost 
> instantaneously and it is not waiting till this grace period. How it is 
> distinguishing between unresponsive and crashed osds ? In which scenario this 
> heartbeat grace is coming into picture ?
>
> Any help on clarifying this would be very helpful.
>
> Thanks & Regards
> Somnath