debug_osd that is... :)

On Tue, Mar 6, 2018 at 7:10 PM, Brad Hubbard <bhubb...@redhat.com> wrote:
> On Tue, Mar 6, 2018 at 5:26 PM, Marco Baldini - H.S. Amiata <mbald...@hsamiata.it> wrote:
>> Hi
>>
>> I monitor dmesg on each of the 3 nodes, and no hardware issue is
>> reported. The problem happens with various different OSDs on different
>> nodes, so for me it is clear it's not a hardware problem.
>
> If you have osd_debug set to 25 or greater when you run the deep scrub
> you should get more information about the nature of the read error in
> the ReplicatedBackend::be_deep_scrub() function (assuming this is a
> replicated pool).
>
> This may create large logs, so watch that they don't exhaust storage.
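A minimal sketch of the workflow Brad suggests, assuming a replicated pool; the PG and OSD IDs below are placeholders taken from examples later in this thread:

    # Raise the debug level on the primary OSD of the suspect PG
    ceph tell osd.4 injectargs '--debug-osd 25/25'
    # Trigger a deep scrub of the inconsistent PG and let it run
    ceph pg deep-scrub 13.65
    # Look for the detailed read-error lines from be_deep_scrub()
    grep 'read error' /var/log/ceph/ceph-osd.4.log
    # Lower the debug level again so the logs don't exhaust storage
    ceph tell osd.4 injectargs '--debug-osd 0/0'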
>> Thanks for reply
>>
>> On 05/03/2018 21:45, Vladimir Prokofev wrote:
>>
>> > always solved by ceph pg repair <PG>
>> That doesn't necessarily mean that there's no hardware issue. In my
>> case repair also worked fine and returned the cluster to the OK state
>> every time, but in time the faulty disk failed another scrub operation,
>> and this repeated multiple times before we replaced that disk.
>> One last thing to look into is dmesg on your OSD nodes. If there's a
>> hardware read error it will be logged in dmesg.
>>
>> 2018-03-05 18:26 GMT+03:00 Marco Baldini - H.S. Amiata <mbald...@hsamiata.it>:
>>
>>> Hi and thanks for reply
>>>
>>> The OSDs are all healthy; in fact after a ceph pg repair <PG> the ceph
>>> health is back to OK and in the OSD log I see <PG> repair ok, 0 fixed.
>>>
>>> The SMART data of the 3 OSDs seems fine.
>>>
>>> OSD.5
>>>
>>> # ceph-disk list | grep osd.5
>>> /dev/sdd1 ceph data, active, cluster ceph, osd.5, block /dev/sdd2
>>>
>>> # smartctl -a /dev/sdd
>>> smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.13.13-6-pve] (local build)
>>> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
>>>
>>> === START OF INFORMATION SECTION ===
>>> Model Family:     Seagate Barracuda 7200.14 (AF)
>>> Device Model:     ST1000DM003-1SB10C
>>> Serial Number:    Z9A1MA1V
>>> LU WWN Device Id: 5 000c50 090c7028b
>>> Firmware Version: CC43
>>> User Capacity:    1,000,204,886,016 bytes [1.00 TB]
>>> Sector Sizes:     512 bytes logical, 4096 bytes physical
>>> Rotation Rate:    7200 rpm
>>> Form Factor:      3.5 inches
>>> Device is:        In smartctl database [for details use: -P show]
>>> ATA Version is:   ATA8-ACS T13/1699-D revision 4
>>> SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
>>> Local Time is:    Mon Mar 5 16:17:22 2018 CET
>>> SMART support is: Available - device has SMART capability.
>>> SMART support is: Enabled
>>>
>>> === START OF READ SMART DATA SECTION ===
>>> SMART overall-health self-assessment test result: PASSED
>>>
>>> General SMART Values:
>>> Offline data collection status:  (0x82) Offline data collection activity
>>>                                         was completed without error.
>>>                                         Auto Offline Data Collection: Enabled.
>>> Self-test execution status:      (   0) The previous self-test routine completed
>>>                                         without error or no self-test has ever
>>>                                         been run.
>>> Total time to complete Offline
>>> data collection:                 (   0) seconds.
>>> Offline data collection
>>> capabilities:                    (0x7b) SMART execute Offline immediate.
>>>                                         Auto Offline data collection on/off support.
>>>                                         Suspend Offline collection upon new command.
>>>                                         Offline surface scan supported.
>>>                                         Self-test supported.
>>>                                         Conveyance Self-test supported.
>>>                                         Selective Self-test supported.
>>> SMART capabilities:            (0x0003) Saves SMART data before entering
>>>                                         power-saving mode.
>>>                                         Supports SMART auto save timer.
>>> Error logging capability:        (0x01) Error logging supported.
>>>                                         General Purpose Logging supported.
>>> Short self-test routine
>>> recommended polling time:        (   1) minutes.
>>> Extended self-test routine
>>> recommended polling time:        ( 109) minutes.
>>> Conveyance self-test routine
>>> recommended polling time:        (   2) minutes.
>>> SCT capabilities:              (0x1085) SCT Status supported.
>>>
>>> SMART Attributes Data Structure revision number: 10
>>> Vendor Specific SMART Attributes with Thresholds:
>>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
>>>   1 Raw_Read_Error_Rate     0x000f   082   063   006    Pre-fail Always  -           193297722
>>>   3 Spin_Up_Time            0x0003   097   097   000    Pre-fail Always  -           0
>>>   4 Start_Stop_Count        0x0032   100   100   020    Old_age  Always  -           60
>>>   5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail Always  -           0
>>>   7 Seek_Error_Rate         0x000f   091   060   045    Pre-fail Always  -           1451132477
>>>   9 Power_On_Hours          0x0032   085   085   000    Old_age  Always  -           13283
>>>  10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail Always  -           0
>>>  12 Power_Cycle_Count       0x0032   100   100   020    Old_age  Always  -           61
>>> 183 Runtime_Bad_Block       0x0032   100   100   000    Old_age  Always  -           0
>>> 184 End-to-End_Error        0x0032   100   100   099    Old_age  Always  -           0
>>> 187 Reported_Uncorrect      0x0032   100   100   000    Old_age  Always  -           0
>>> 188 Command_Timeout         0x0032   100   100   000    Old_age  Always  -           0 0 0
>>> 189 High_Fly_Writes         0x003a   086   086   000    Old_age  Always  -           14
>>> 190 Airflow_Temperature_Cel 0x0022   071   055   040    Old_age  Always  -           29 (Min/Max 23/32)
>>> 193 Load_Cycle_Count        0x0032   100   100   000    Old_age  Always  -           607
>>> 194 Temperature_Celsius     0x0022   029   014   000    Old_age  Always  -           29 (0 14 0 0 0)
>>> 195 Hardware_ECC_Recovered  0x001a   004   001   000    Old_age  Always  -           193297722
>>> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age  Always  -           0
>>> 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age  Offline -           0
>>> 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age  Always  -           0
>>> 240 Head_Flying_Hours       0x0000   100   253   000    Old_age  Offline -           13211h+23m+08.363s
>>> 241 Total_LBAs_Written      0x0000   100   253   000    Old_age  Offline -           53042120064
>>> 242 Total_LBAs_Read         0x0000   100   253   000    Old_age  Offline -           170788993187
>>>
>>> OSD.4
>>>
>>> # ceph-disk list | grep osd.4
>>> /dev/sdc1 ceph data, active, cluster ceph, osd.4, block /dev/sdc2
>>>
>>> # smartctl -a /dev/sdc
>>> smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.13.13-6-pve] (local build)
>>> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
>>>
>>> === START OF INFORMATION SECTION ===
>>> Model Family:     Seagate Barracuda 7200.14 (AF)
>>> Device Model:     ST1000DM003-1SB10C
>>> Serial Number:    Z9A1M1BW
>>> LU WWN Device Id: 5 000c50 090c78d27
>>> Firmware Version: CC43
>>> User Capacity:    1,000,204,886,016 bytes [1.00 TB]
>>> Sector Sizes:     512 bytes logical, 4096 bytes physical
>>> Rotation Rate:    7200 rpm
>>> Form Factor:      3.5 inches
>>> Device is:        In smartctl database [for details use: -P show]
>>> ATA Version is:   ATA8-ACS T13/1699-D revision 4
>>> SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
>>> Local Time is:    Mon Mar 5 16:20:46 2018 CET
>>> SMART support is: Available - device has SMART capability.
>>> SMART support is: Enabled
>>>
>>> === START OF READ SMART DATA SECTION ===
>>> SMART overall-health self-assessment test result: PASSED
>>>
>>> General SMART Values:
>>> Offline data collection status:  (0x82) Offline data collection activity
>>>                                         was completed without error.
>>>                                         Auto Offline Data Collection: Enabled.
>>> Self-test execution status:      (   0) The previous self-test routine completed
>>>                                         without error or no self-test has ever
>>>                                         been run.
>>> Total time to complete Offline
>>> data collection:                 (   0) seconds.
>>> Offline data collection
>>> capabilities:                    (0x7b) SMART execute Offline immediate.
>>>                                         Auto Offline data collection on/off support.
>>>                                         Suspend Offline collection upon new command.
>>>                                         Offline surface scan supported.
>>>                                         Self-test supported.
>>>                                         Conveyance Self-test supported.
>>>                                         Selective Self-test supported.
>>> SMART capabilities:            (0x0003) Saves SMART data before entering
>>>                                         power-saving mode.
>>>                                         Supports SMART auto save timer.
>>> Error logging capability:        (0x01) Error logging supported.
>>>                                         General Purpose Logging supported.
>>> Short self-test routine
>>> recommended polling time:        (   1) minutes.
>>> Extended self-test routine
>>> recommended polling time:        ( 109) minutes.
>>> Conveyance self-test routine
>>> recommended polling time:        (   2) minutes.
>>> SCT capabilities:              (0x1085) SCT Status supported.
>>>
>>> SMART Attributes Data Structure revision number: 10
>>> Vendor Specific SMART Attributes with Thresholds:
>>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
>>>   1 Raw_Read_Error_Rate     0x000f   082   063   006    Pre-fail Always  -           194906537
>>>   3 Spin_Up_Time            0x0003   097   097   000    Pre-fail Always  -           0
>>>   4 Start_Stop_Count        0x0032   100   100   020    Old_age  Always  -           64
>>>   5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail Always  -           0
>>>   7 Seek_Error_Rate         0x000f   091   060   045    Pre-fail Always  -           1485899434
>>>   9 Power_On_Hours          0x0032   085   085   000    Old_age  Always  -           13390
>>>  10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail Always  -           0
>>>  12 Power_Cycle_Count       0x0032   100   100   020    Old_age  Always  -           65
>>> 183 Runtime_Bad_Block       0x0032   100   100   000    Old_age  Always  -           0
>>> 184 End-to-End_Error        0x0032   100   100   099    Old_age  Always  -           0
>>> 187 Reported_Uncorrect      0x0032   100   100   000    Old_age  Always  -           0
>>> 188 Command_Timeout         0x0032   100   100   000    Old_age  Always  -           0 0 0
>>> 189 High_Fly_Writes         0x003a   095   095   000    Old_age  Always  -           5
>>> 190 Airflow_Temperature_Cel 0x0022   074   051   040    Old_age  Always  -           26 (Min/Max 19/29)
>>> 193 Load_Cycle_Count        0x0032   100   100   000    Old_age  Always  -           616
>>> 194 Temperature_Celsius     0x0022   026   014   000    Old_age  Always  -           26 (0 14 0 0 0)
>>> 195 Hardware_ECC_Recovered  0x001a   004   001   000    Old_age  Always  -           194906537
>>> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age  Always  -           0
>>> 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age  Offline -           0
>>> 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age  Always  -           0
>>> 240 Head_Flying_Hours       0x0000   100   253   000    Old_age  Offline -           13315h+20m+30.974s
>>> 241 Total_LBAs_Written      0x0000   100   253   000    Old_age  Offline -           52137467719
>>> 242 Total_LBAs_Read         0x0000   100   253   000    Old_age  Offline -           177227508503
>>>
>>> OSD.8
>>>
>>> # ceph-disk list | grep osd.8
>>> /dev/sda1 ceph data, active, cluster ceph, osd.8, block /dev/sda2
>>>
>>> # smartctl -a /dev/sda
>>> smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.13.13-6-pve] (local build)
>>> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
>>>
>>> === START OF INFORMATION SECTION ===
>>> Model Family:     Seagate Barracuda 7200.14 (AF)
>>> Device Model:     ST1000DM003-1SB10C
>>> Serial Number:    Z9A2BEF2
>>> LU WWN Device Id: 5 000c50 0910f5427
>>> Firmware Version: CC43
>>> User Capacity:    1,000,203,804,160 bytes [1.00 TB]
>>> Sector Sizes:     512 bytes logical, 4096 bytes physical
>>> Rotation Rate:    7200 rpm
>>> Form Factor:      3.5 inches
>>> Device is:        In smartctl database [for details use: -P show]
>>> ATA Version is:   ATA8-ACS T13/1699-D revision 4
>>> SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
>>> Local Time is:    Mon Mar 5 16:22:47 2018 CET
>>> SMART support is: Available - device has SMART capability.
>>> SMART support is: Enabled
>>>
>>> === START OF READ SMART DATA SECTION ===
>>> SMART overall-health self-assessment test result: PASSED
>>>
>>> General SMART Values:
>>> Offline data collection status:  (0x82) Offline data collection activity
>>>                                         was completed without error.
>>>                                         Auto Offline Data Collection: Enabled.
>>> Self-test execution status:      (   0) The previous self-test routine completed
>>>                                         without error or no self-test has ever
>>>                                         been run.
>>> Total time to complete Offline
>>> data collection:                 (   0) seconds.
>>> Offline data collection
>>> capabilities:                    (0x7b) SMART execute Offline immediate.
>>>                                         Auto Offline data collection on/off support.
>>>                                         Suspend Offline collection upon new command.
>>>                                         Offline surface scan supported.
>>>                                         Self-test supported.
>>>                                         Conveyance Self-test supported.
>>>                                         Selective Self-test supported.
>>> SMART capabilities:            (0x0003) Saves SMART data before entering
>>>                                         power-saving mode.
>>>                                         Supports SMART auto save timer.
>>> Error logging capability:        (0x01) Error logging supported.
>>>                                         General Purpose Logging supported.
>>> Short self-test routine
>>> recommended polling time:        (   1) minutes.
>>> Extended self-test routine
>>> recommended polling time:        ( 110) minutes.
>>> Conveyance self-test routine
>>> recommended polling time:        (   2) minutes.
>>> SCT capabilities:              (0x1085) SCT Status supported.
>>>
>>> SMART Attributes Data Structure revision number: 10
>>> Vendor Specific SMART Attributes with Thresholds:
>>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
>>>   1 Raw_Read_Error_Rate     0x000f   083   063   006    Pre-fail Always  -           224621855
>>>   3 Spin_Up_Time            0x0003   097   097   000    Pre-fail Always  -           0
>>>   4 Start_Stop_Count        0x0032   100   100   020    Old_age  Always  -           275
>>>   5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail Always  -           0
>>>   7 Seek_Error_Rate         0x000f   081   060   045    Pre-fail Always  -           149383284
>>>   9 Power_On_Hours          0x0032   093   093   000    Old_age  Always  -           6210
>>>  10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail Always  -           0
>>>  12 Power_Cycle_Count       0x0032   100   100   020    Old_age  Always  -           265
>>> 183 Runtime_Bad_Block       0x0032   100   100   000    Old_age  Always  -           0
>>> 184 End-to-End_Error        0x0032   100   100   099    Old_age  Always  -           0
>>> 187 Reported_Uncorrect      0x0032   100   100   000    Old_age  Always  -           0
>>> 188 Command_Timeout         0x0032   100   100   000    Old_age  Always  -           0 0 0
>>> 189 High_Fly_Writes         0x003a   098   098   000    Old_age  Always  -           2
>>> 190 Airflow_Temperature_Cel 0x0022   069   058   040    Old_age  Always  -           31 (Min/Max 21/35)
>>> 193 Load_Cycle_Count        0x0032   100   100   000    Old_age  Always  -           516
>>> 194 Temperature_Celsius     0x0022   031   017   000    Old_age  Always  -           31 (0 17 0 0 0)
>>> 195 Hardware_ECC_Recovered  0x001a   005   001   000    Old_age  Always  -           224621855
>>> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age  Always  -           0
>>> 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age  Offline -           0
>>> 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age  Always  -           0
>>> 240 Head_Flying_Hours       0x0000   100   253   000    Old_age  Offline -           6154h+03m+35.126s
>>> 241 Total_LBAs_Written      0x0000   100   253   000    Old_age  Offline -           24333847321
>>> 242 Total_LBAs_Read         0x0000   100   253   000    Old_age  Offline -           50261005553
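As an aside, the same overall-health check can be pulled for every ceph-disk data device in one pass; a rough sketch, assuming the /dev/sdX naming shown above:

    # Query SMART health for each disk backing a ceph data partition
    for part in $(ceph-disk list 2>/dev/null | awk '/ceph data/ {print $1}'); do
        disk=${part%[0-9]*}        # /dev/sdd1 -> /dev/sdd (sdX naming assumed)
        echo "== $disk =="
        smartctl -H "$disk" | grep -i 'test result'
    done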
>>>
>>> However it's not only these 3 OSDs that have PGs with errors; these
>>> are only the most recent. In the last 3 months I have often had
>>> OSD_SCRUB_ERRORS on various OSDs, always solved by ceph pg repair <PG>.
>>> I don't think it's a hardware issue.
>>>
>>> On 05/03/2018 13:40, Vladimir Prokofev wrote:
>>>
>>> > candidate had a read error
>>> speaks for itself - while scrubbing it couldn't read data.
>>> I had a similar issue, and it was just an OSD dying - errors and
>>> relocated sectors in SMART, so I just replaced the disk. But in your
>>> case it seems that the errors are on different OSDs? Are your OSDs all
>>> healthy?
>>> You can use this command to see some details:
>>> rados list-inconsistent-obj <pg.id> --format=json-pretty
>>> pg.id is the PG that's reporting as inconsistent. My guess is that
>>> you'll see read errors in this output, with the number of the OSD that
>>> encountered the error. After that you have to check that OSD's health -
>>> SMART details, etc.
>>> It's not always the disk itself causing problems - for example, we had
>>> read errors because of a faulty backplane interface in a server;
>>> changing the chassis resolved the issue.
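To make Vladimir's suggestion concrete, a hedged example against one of the PGs reported later in this thread (the PG ID is a placeholder):

    # Identify the inconsistent PG, then ask which object/shard failed
    ceph health detail | grep inconsistent
    rados list-inconsistent-obj 13.65 --format=json-pretty

On Luminous, the per-shard "errors" arrays in the JSON should name the OSD that hit the problem, e.g. with a "read_error" entry.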
>>> 2018-03-05 14:21 GMT+03:00 Marco Baldini - H.S. Amiata <mbald...@hsamiata.it>:
>>>
>>>> Hi
>>>>
>>>> After some days with debug_osd 5/5 I found [ERR] on different days,
>>>> different PGs, different OSDs, different hosts. This is what I get in
>>>> the OSD logs:
>>>>
>>>> OSD.5 (host 3)
>>>> 2018-03-01 20:30:02.702269 7fdf4d515700  2 osd.5 pg_epoch: 16486 pg[9.1c( v 16486'51798 (16431'50251,16486'51798] local-lis/les=16474/16475 n=3629 ec=1477/1477 lis/c 16474/16474 les/c/f 16475/16477/0 16474/16474/16474) [5,6] r=0 lpr=16474 crt=16486'51798 lcod 16486'51797 mlcod 16486'51797 active+clean+scrubbing+deep] 9.1c shard 6: soid 9:3b157c56:::rbd_data.1526386b8b4567.0000000000001761:head candidate had a read error
>>>> 2018-03-01 20:30:02.702278 7fdf4d515700 -1 log_channel(cluster) log [ERR] : 9.1c shard 6: soid 9:3b157c56:::rbd_data.1526386b8b4567.0000000000001761:head candidate had a read error
>>>>
>>>> OSD.4 (host 3)
>>>> 2018-02-28 00:03:33.458558 7f112cf76700 -1 log_channel(cluster) log [ERR] : 13.65 shard 2: soid 13:a719ecdf:::rbd_data.5f65056b8b4567.000000000000f8eb:head candidate had a read error
>>>>
>>>> OSD.8 (host 2)
>>>> 2018-02-27 23:55:15.100084 7f4dd0816700 -1 log_channel(cluster) log [ERR] : 14.31 shard 1: soid 14:8cc6cd37:::rbd_data.30b15b6b8b4567.00000000000081a1:head candidate had a read error
>>>>
>>>> I don't know what this error means, and as always a ceph pg repair
>>>> fixes it. I don't think this is normal.
>>>>
>>>> Ideas?
>>>>
>>>> Thanks
>>>>
>>>> On 28/02/2018 14:48, Marco Baldini - H.S. Amiata wrote:
>>>>
>>>> Hi
>>>>
>>>> I read the bugtracker issue and it seems a lot like my problem, even
>>>> if I can't check the reported checksum because I don't have it in my
>>>> logs, perhaps because of debug osd = 0/0 in ceph.conf.
>>>>
>>>> I just raised the OSD log level:
>>>>
>>>> ceph tell osd.* injectargs --debug-osd 5/5
>>>>
>>>> I'll check the OSD logs in the next days...
>>>>
>>>> Thanks
>>>>
>>>> On 28/02/2018 11:59, Paul Emmerich wrote:
>>>>
>>>> Hi,
>>>>
>>>> might be http://tracker.ceph.com/issues/22464
>>>>
>>>> Can you check the OSD log file to see if the reported checksum
>>>> is 0x6706be76?
>>>>
>>>> Paul
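One quick way to run the check Paul asks for across all OSDs on a node; a sketch, assuming default log locations (the checksum line only appears if the OSD debug level was high enough when the scrub ran):

    # List any local OSD logs that contain the checksum from tracker issue 22464
    grep -l '0x6706be76' /var/log/ceph/ceph-osd.*.log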
>>>> On 28.02.2018 at 11:43, Marco Baldini - H.S. Amiata <mbald...@hsamiata.it> wrote:
>>>>
>>>> Hello
>>>>
>>>> I have a little ceph cluster with 3 nodes, each with 3x1TB HDD and
>>>> 1x240GB SSD. I created this cluster after the Luminous release, so all
>>>> OSDs are Bluestore. In my crush map I have two rules, one targeting
>>>> the SSDs and one targeting the HDDs. I have 4 pools, one using the SSD
>>>> rule and the others using the HDD rule; three pools are size=3
>>>> min_size=2, one is size=2 min_size=1 (this one has content that is OK
>>>> to lose).
>>>>
>>>> In the last 3 months I have been having a strange random problem. I
>>>> planned my osd scrubs during the night (osd scrub begin hour = 20,
>>>> osd scrub end hour = 7) when the office is closed so there is low
>>>> impact on the users. Some mornings, when I check the cluster health,
>>>> I find:
>>>>
>>>> HEALTH_ERR X scrub errors; Possible data damage: Y pgs inconsistent
>>>> OSD_SCRUB_ERRORS X scrub errors
>>>> PG_DAMAGED Possible data damage: Y pgs inconsistent
>>>>
>>>> X and Y are sometimes 1, sometimes 2.
>>>>
>>>> I issue a ceph health detail, check the damaged PGs, and run a ceph
>>>> pg repair for the damaged PGs; I get:
>>>>
>>>> instructing pg PG on osd.N to repair
>>>>
>>>> The PGs are different, the OSD that has to repair the PG is different,
>>>> even the node hosting the OSD is different; I made a list of all PGs
>>>> and OSDs. This morning is the most recent case:
>>>>
>>>> > ceph health detail
>>>> HEALTH_ERR 2 scrub errors; Possible data damage: 2 pgs inconsistent
>>>> OSD_SCRUB_ERRORS 2 scrub errors
>>>> PG_DAMAGED Possible data damage: 2 pgs inconsistent
>>>>     pg 13.65 is active+clean+inconsistent, acting [4,2,6]
>>>>     pg 14.31 is active+clean+inconsistent, acting [8,3,1]
>>>>
>>>> > ceph pg repair 13.65
>>>> instructing pg 13.65 on osd.4 to repair
>>>>
>>>> (node-2)> tail /var/log/ceph/ceph-osd.4.log
>>>> 2018-02-28 08:38:47.593447 7f112cf76700  0 log_channel(cluster) log [DBG] : 13.65 repair starts
>>>> 2018-02-28 08:39:37.573342 7f112cf76700  0 log_channel(cluster) log [DBG] : 13.65 repair ok, 0 fixed
>>>>
>>>> > ceph pg repair 14.31
>>>> instructing pg 14.31 on osd.8 to repair
>>>>
>>>> (node-3)> tail /var/log/ceph/ceph-osd.8.log
>>>> 2018-02-28 08:52:37.297490 7f4dd0816700  0 log_channel(cluster) log [DBG] : 14.31 repair starts
>>>> 2018-02-28 08:53:00.704020 7f4dd0816700  0 log_channel(cluster) log [DBG] : 14.31 repair ok, 0 fixed
>>>>
>>>> I made a list of when I got OSD_SCRUB_ERRORS, which PG was affected
>>>> and which OSD had to repair it. Dates are dd/mm/yyyy:
>>>>
>>>> 21/12/2017 -- pg 14.29 is active+clean+inconsistent, acting [6,2,4]
>>>>
>>>> 18/01/2018 -- pg 14.5a is active+clean+inconsistent, acting [6,4,1]
>>>>
>>>> 22/01/2018 -- pg 9.3a is active+clean+inconsistent, acting [2,7]
>>>>
>>>> 29/01/2018 -- pg 13.3e is active+clean+inconsistent, acting [4,6,1]
>>>>               instructing pg 13.3e on osd.4 to repair
>>>>
>>>> 07/02/2018 -- pg 13.7e is active+clean+inconsistent, acting [8,2,5]
>>>>               instructing pg 13.7e on osd.8 to repair
>>>>
>>>> 09/02/2018 -- pg 13.30 is active+clean+inconsistent, acting [7,3,2]
>>>>               instructing pg 13.30 on osd.7 to repair
>>>>
>>>> 15/02/2018 -- pg 9.35 is active+clean+inconsistent, acting [1,8]
>>>>               instructing pg 9.35 on osd.1 to repair
>>>>               pg 13.3e is active+clean+inconsistent, acting [4,6,1]
>>>>               instructing pg 13.3e on osd.4 to repair
>>>>
>>>> 17/02/2018 -- pg 9.2d is active+clean+inconsistent, acting [7,5]
>>>>               instructing pg 9.2d on osd.7 to repair
>>>>
>>>> 22/02/2018 -- pg 9.24 is active+clean+inconsistent, acting [5,8]
>>>>               instructing pg 9.24 on osd.5 to repair
>>>>
>>>> 28/02/2018 -- pg 13.65 is active+clean+inconsistent, acting [4,2,6]
>>>>               instructing pg 13.65 on osd.4 to repair
>>>>               pg 14.31 is active+clean+inconsistent, acting [8,3,1]
>>>>               instructing pg 14.31 on osd.8 to repair
>>>>
>>>> If it can be useful, my ceph.conf is here:
>>>>
>>>> [global]
>>>>      auth client required = none
>>>>      auth cluster required = none
>>>>      auth service required = none
>>>>      fsid = 24d5d6bc-0943-4345-b44e-46c19099004b
>>>>      cluster network = 10.10.10.0/24
>>>>      public network = 10.10.10.0/24
>>>>      keyring = /etc/pve/priv/$cluster.$name.keyring
>>>>      mon allow pool delete = true
>>>>      osd journal size = 5120
>>>>      osd pool default min size = 2
>>>>      osd pool default size = 3
>>>>      bluestore_block_db_size = 64424509440
>>>>
>>>>      debug asok = 0/0
>>>>      debug auth = 0/0
>>>>      debug buffer = 0/0
>>>>      debug client = 0/0
>>>>      debug context = 0/0
>>>>      debug crush = 0/0
>>>>      debug filer = 0/0
>>>>      debug filestore = 0/0
>>>>      debug finisher = 0/0
>>>>      debug heartbeatmap = 0/0
>>>>      debug journal = 0/0
>>>>      debug journaler = 0/0
>>>>      debug lockdep = 0/0
>>>>      debug mds = 0/0
>>>>      debug mds balancer = 0/0
>>>>      debug mds locker = 0/0
>>>>      debug mds log = 0/0
>>>>      debug mds log expire = 0/0
>>>>      debug mds migrator = 0/0
>>>>      debug mon = 0/0
>>>>      debug monc = 0/0
>>>>      debug ms = 0/0
>>>>      debug objclass = 0/0
>>>>      debug objectcacher = 0/0
>>>>      debug objecter = 0/0
>>>>      debug optracker = 0/0
>>>>      debug osd = 0/0
>>>>      debug paxos = 0/0
>>>>      debug perfcounter = 0/0
>>>>      debug rados = 0/0
>>>>      debug rbd = 0/0
>>>>      debug rgw = 0/0
>>>>      debug throttle = 0/0
>>>>      debug timer = 0/0
>>>>      debug tp = 0/0
>>>>
>>>> [osd]
>>>>      keyring = /var/lib/ceph/osd/ceph-$id/keyring
>>>>      osd max backfills = 1
>>>>      osd recovery max active = 1
>>>>
>>>>      osd scrub begin hour = 20
>>>>      osd scrub end hour = 7
>>>>      osd scrub during recovery = false
>>>>      osd scrub load threshold = 0.3
>>>>
>>>> [client]
>>>>      rbd cache = true
>>>>      rbd cache size = 268435456        # 256MB
>>>>      rbd cache max dirty = 201326592   # 192MB
>>>>      rbd cache max dirty age = 2
>>>>      rbd cache target dirty = 33554432 # 32MB
>>>>      rbd cache writethrough until flush = true
>>>>
>>>> #[mgr]
>>>> #debug_mgr = 20
>>>>
>>>> [mon.pve-hs-main]
>>>>      host = pve-hs-main
>>>>      mon addr = 10.10.10.251:6789
>>>>
>>>> [mon.pve-hs-2]
>>>>      host = pve-hs-2
>>>>      mon addr = 10.10.10.252:6789
>>>>
>>>> [mon.pve-hs-3]
>>>>      host = pve-hs-3
>>>>      mon addr = 10.10.10.253:6789
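Given the nightly scrub window in this conf, a small morning check could list any inconsistent PGs per pool before repairing them one by one; a sketch using the pool names from this thread:

    # Report overall health, then any inconsistent PGs in each pool
    ceph health detail
    for pool in cephbackup cephwin cephnix cephssd; do
        echo "== $pool =="
        rados list-inconsistent-pg "$pool"
    done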
>>>>
>>>> My ceph versions:
>>>>
>>>> {
>>>>     "mon": {
>>>>         "ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)": 3
>>>>     },
>>>>     "mgr": {
>>>>         "ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)": 3
>>>>     },
>>>>     "osd": {
>>>>         "ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)": 12
>>>>     },
>>>>     "mds": {},
>>>>     "overall": {
>>>>         "ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)": 18
>>>>     }
>>>> }
>>>>
>>>> My ceph osd tree:
>>>>
>>>> ID CLASS WEIGHT  TYPE NAME            STATUS REWEIGHT PRI-AFF
>>>> -1       8.93686 root default
>>>> -6       2.94696     host pve-hs-2
>>>>  3   hdd 0.90959         osd.3            up  1.00000 1.00000
>>>>  4   hdd 0.90959         osd.4            up  1.00000 1.00000
>>>>  5   hdd 0.90959         osd.5            up  1.00000 1.00000
>>>> 10   ssd 0.21819         osd.10           up  1.00000 1.00000
>>>> -3       2.86716     host pve-hs-3
>>>>  6   hdd 0.85599         osd.6            up  1.00000 1.00000
>>>>  7   hdd 0.85599         osd.7            up  1.00000 1.00000
>>>>  8   hdd 0.93700         osd.8            up  1.00000 1.00000
>>>> 11   ssd 0.21819         osd.11           up  1.00000 1.00000
>>>> -7       3.12274     host pve-hs-main
>>>>  0   hdd 0.96819         osd.0            up  1.00000 1.00000
>>>>  1   hdd 0.96819         osd.1            up  1.00000 1.00000
>>>>  2   hdd 0.96819         osd.2            up  1.00000 1.00000
>>>>  9   ssd 0.21819         osd.9            up  1.00000 1.00000
>>>>
>>>> My pools:
>>>>
>>>> pool 9 'cephbackup' replicated size 2 min_size 1 crush_rule 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 5665 flags hashpspool stripe_width 0 application rbd
>>>>     removed_snaps [1~3]
>>>> pool 13 'cephwin' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 128 last_change 16454 flags hashpspool stripe_width 0 application rbd
>>>>     removed_snaps [1~5]
>>>> pool 14 'cephnix' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 128 last_change 16482 flags hashpspool stripe_width 0 application rbd
>>>>     removed_snaps [1~227]
>>>> pool 17 'cephssd' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 8601 flags hashpspool stripe_width 0 application rbd
>>>>     removed_snaps [1~3]
>>>>
>>>> I can't understand where the problem comes from. I don't think it's
>>>> hardware: if I had a failed disk, I should have problems always on
>>>> the same OSD. Any ideas?
>>>>
>>>> Thanks
>>>>
>>>> --
>>>> Marco Baldini
>>>> H.S. Amiata Srl
>>>> Ufficio:   0577-779396
>>>> Cellulare: 335-8765169
>>>> WEB:       www.hsamiata.it
>>>> EMAIL:     mbald...@hsamiata.it
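Since the inconsistencies land on a different acting set each time, it may also be worth dumping the current mapping of every previously affected PG in one go; a sketch using the PG IDs listed above:

    # Show the up/acting OSD sets for each PG that has been inconsistent so far
    for pg in 14.29 14.5a 9.3a 13.3e 13.7e 13.30 9.35 9.2d 9.24 13.65 14.31; do
        ceph pg map "$pg"
    done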
>>>> --
>>>> Best Regards
>>>> Paul Emmerich
>>>>
>>>> croit GmbH
>>>> Freseniusstr. 31h
>>>> 81247 München
>>>> www.croit.io
>>>> Tel: +49 89 1896585 90
>>>>
>>>> Geschäftsführer: Martin Verges
>>>> Handelsregister: Amtsgericht München
>>>> USt-IdNr: DE310638492

--
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com