Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"
Finally, good news! I applied the PR and ran repair on an OSD left unmodified after the initial failure. It went through without any errors, and now I am able to fuse-mount the OSD and export PGs off it using ceph-objectstore-tool. Just so I don't mess it up, I am not starting ceph-osd until I have the PGs backed up. Cheers Igor, you're the best!

> On 3.10.2018, at 14:39, Igor Fedotov wrote:
>
> To fix this specific issue please apply the following PR:
> https://github.com/ceph/ceph/pull/24339
>
> This wouldn't fix the original issue, but just in case please try to run repair
> again. I will need a log if the error is different from the ENOSPC one in your
> latest email.
>
> Thanks,
> Igor
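For anyone hitting the same situation, the export step looks roughly like the following. This is only a sketch - the OSD data path, mountpoint and PG id below are made-up examples, and the OSD daemon must be stopped first:

    # list the PGs held by the damaged OSD
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 --op list-pgs
    # export one PG to a file for safekeeping
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 --pgid 2.7f \
        --op export --file /backup/osd2-pg-2.7f.export
    # alternatively, fuse-mount the object store for inspection
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 --op fuse \
        --mountpoint /mnt/osd2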
Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"
To fix this specific issue please apply the following PR:
https://github.com/ceph/ceph/pull/24339

This wouldn't fix the original issue, but just in case please try to run repair again. I will need a log if the error is different from the ENOSPC one in your latest email.

Thanks,
Igor

> On 10/3/2018 1:58 PM, Sergey Malinin wrote:
>
> Repair has gone farther but failed on something different - this time it appears
> to be related to store inconsistency rather than to lack of free space. Emailed
> the log to you; beware: over 2 GB uncompressed.
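In case it helps others following along, pulling a pending PR into a local tree for a rebuild can be done along these lines. This is a rough sketch only - the branch name is arbitrary, and backporting the change onto a mimic tree may require cherry-picking and manual conflict resolution:

    git fetch https://github.com/ceph/ceph.git pull/24339/head:pr-24339
    git merge pr-24339      # or cherry-pick the individual commits onto your branch
    # then rebuild ceph-osd / ceph-bluestore-tool as usual for your packaging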
Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"
Update: I rebuilt ceph-osd with the latest PR and it started, worked for a few minutes and eventually failed on enospc. After that, ceph-bluestore-tool repair started to fail on enospc again. I was unable to collect the ceph-osd log, so I emailed you the most recent repair log.

> On 3.10.2018, at 13:58, Sergey Malinin wrote:
>
> Repair has gone farther but failed on something different - this time it appears
> to be related to store inconsistency rather than to lack of free space. Emailed
> the log to you; beware: over 2 GB uncompressed.
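For reference, one way to capture a startup log when the daemon dies too quickly for the usual log rotation is to run the OSD in the foreground with elevated debug levels and tee the output. Purely illustrative - the OSD id, debug levels and file name are placeholders:

    ceph-osd -i 0 -f --debug-bluestore 20 --debug-bluefs 20 2>&1 | tee osd.0-startup.log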
Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"
Repair has gone farther but failed on something different - this time it appears to be related to store inconsistency rather than to lack of free space. Emailed the log to you; beware: over 2 GB uncompressed.

> On 3.10.2018, at 13:15, Igor Fedotov wrote:
>
> You may want to try new updates from the PR along with disabling flush on
> recovery for rocksdb (the avoid_flush_during_recovery parameter).
>
> The full cmd line might look like:
>
> CEPH_ARGS="--bluestore_rocksdb_options avoid_flush_during_recovery=1" bin/ceph-bluestore-tool --path repair
>
> To be applied for the "non-expanded" OSDs where repair didn't pass.
>
> Please collect a log during repair...
>
> Thanks,
> Igor
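As a side note, when repair itself starts failing, the read-only counterpart of repair can be used to reproduce the error without modifying anything on disk; roughly (the OSD path is a placeholder):

    ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-1 fsck --deep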
Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"
Alex,

upstream recommendations for DB sizing are probably good enough, but like most fixed allocations they aren't optimal for every use case. Usually one either wastes space or runs out of it some day in such configs. So I think we should have the means for more freedom in volume management (change sizes, migrate, coalesce and split). LVM usage is a big step toward that, but it is still insufficient and sometimes lacks additional helpers.

To avoid the issue Sergey is experiencing, IMO it's better to have a standalone DB volume with some extra spare space. Even if the physical media is the same, it helps to avoid the lazy rebalancing procedure which is the root cause of this issue. But this wouldn't eliminate it totally - if spillover to the main device takes place, one might face it again.

The same improvement can probably be achieved with a single-device configuration by proper rebalance tuning (bluestore_bluefs_min and other params), but that's more complicated to debug and set up properly IMO. Anyway, I think the issue is hit very rarely.

Sorry, given all that I can't comment on whether 30 GB fits your scenario or not. I don't know :)

Thanks,
Igor

> On 10/2/2018 5:23 PM, Alex Litvak wrote:
>
> Igor,
>
> Thank you for your reply. So what you are saying is that there are really no
> sensible space requirements for a collocated device? Even if I set up 30 GB for
> the DB (which I really wouldn't like to do due to space waste considerations),
> there is a chance that if this space fills up I will be in the same trouble
> under some heavy-load scenario?
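To make the "standalone DB volume" option concrete, on a freshly provisioned OSD it looks roughly like this with ceph-volume. A sketch only - the device, VG/LV names and the size are made up and should be adapted to the hardware:

    # carve out a dedicated DB logical volume with some headroom
    lvcreate -L 64G -n db-osd0 ceph-db-vg
    # create the OSD with data and block.db on separate volumes
    ceph-volume lvm create --bluestore --data /dev/sdb --block.db ceph-db-vg/db-osd0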
Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"
You may want to try new updates from the PR along with disabling flush on recovery for rocksdb (the avoid_flush_during_recovery parameter).

The full cmd line might look like:

CEPH_ARGS="--bluestore_rocksdb_options avoid_flush_during_recovery=1" bin/ceph-bluestore-tool --path repair

To be applied for the "non-expanded" OSDs where repair didn't pass.

Please collect a log during repair...

Thanks,
Igor

> On 10/2/2018 4:32 PM, Sergey Malinin wrote:
>
> Repair goes through only when the LVM volume has been expanded; otherwise it
> fails with enospc, as does any other operation. However, expanding the volume
> immediately renders bluefs unmountable with an IO error.
> 2 of 3 OSDs got their bluefs log corrupted (the bluestore tool segfaults at the
> very end of bluefs-log-dump); I'm not sure whether the corruption occurred
> before or after the volume expansion.
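Since a debug-level repair log keeps coming up in this thread, one way to capture it might be the following. Illustrative only - the OSD path and log file name are placeholders, and the exact debug option spelling may need adjusting:

    CEPH_ARGS="--bluestore_rocksdb_options avoid_flush_during_recovery=1 --debug_bluestore 20" \
        ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-1 repair 2>&1 | tee repair-osd-1.log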
Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"
Sent the download link by email. Verbosity=10, over 900 MB uncompressed.

> On 2.10.2018, at 16:52, Igor Fedotov wrote:
>
> May I have a repair log for that "already expanded" OSD?
Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"
On Tue, Oct 2, 2018 at 10:23 AM Alex Litvak wrote:
>
> Igor,
>
> Thank you for your reply. So what you are saying is that there are really no
> sensible space requirements for a collocated device? Even if I set up 30 GB for
> the DB (which I really wouldn't like to do due to space waste considerations),
> there is a chance that if this space fills up I will be in the same trouble
> under some heavy-load scenario?

We do have good sizing recommendations for a separate block.db partition. Roughly, it shouldn't be less than 4% of the size of the data device.
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing
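As a quick worked example of that guideline (numbers are just an illustration):

    # block.db >= 4% of the data device
    # e.g. for a 2 TB (2000 GB) data device:
    echo $((2000 * 4 / 100))    # => 80, i.e. at least an 80 GB block.db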
Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"
Igor,

Thank you for your reply. So what you are saying is that there are really no sensible space requirements for a collocated device? Even if I set up 30 GB for the DB (which I really wouldn't like to do due to space waste considerations), there is a chance that if this space fills up I will be in the same trouble under some heavy-load scenario?

> On 10/2/2018 9:15 AM, Igor Fedotov wrote:
>
> Even with a single device, bluestore has a sort of implicit "BlueFS partition"
> where the DB is stored, and it dynamically adjusts (rebalances) the space for
> that partition in the background. Unfortunately it might perform that too
> lazily, and hence under some heavy load it might end up lacking space for that
> partition while the main device still has plenty of free space.
>
> I'm planning to refactor this rebalancing procedure in the future to eliminate
> the root cause.
>
> Thanks,
> Igor
Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"
Even with a single device, bluestore has a sort of implicit "BlueFS partition" where the DB is stored, and it dynamically adjusts (rebalances) the space for that partition in the background. Unfortunately it might perform that too lazily, and hence under some heavy load it might end up lacking space for that partition while the main device still has plenty of free space.

I'm planning to refactor this rebalancing procedure in the future to eliminate the root cause.

Thanks,
Igor

> On 10/2/2018 5:04 PM, Alex Litvak wrote:
>
> I am sorry for interrupting the thread, but my understanding has always been
> that bluestore on a single device should not care about the DB size, i.e. it
> would use the data part for all operations if the DB is full. And if that is
> not true, what would be sensible defaults on an 800 GB SSD? I used ceph-ansible
> to build my cluster with system defaults, and what I am reading in this thread
> doesn't give me a good feeling at all. Documentation on the topic is very
> sketchy and online posts sometimes contradict each other.
>
> Thank you in advance,
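For anyone who wants to watch how much space that implicit partition currently has on a running OSD, the bluefs perf counters expose it via the admin socket; something like the following (the OSD id is a placeholder, and counter names may differ slightly between releases):

    ceph daemon osd.0 perf dump bluefs
    # look at db_total_bytes / db_used_bytes, and at slow_used_bytes for
    # spillover onto the main device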
Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"
I am sorry for interrupting the thread, but my understanding has always been that bluestore on a single device should not care about the DB size, i.e. it would use the data part for all operations if the DB is full. And if that is not true, what would be sensible defaults on an 800 GB SSD? I used ceph-ansible to build my cluster with system defaults, and what I am reading in this thread doesn't give me a good feeling at all. Documentation on the topic is very sketchy and online posts sometimes contradict each other.

Thank you in advance,

> On 10/2/2018 8:52 AM, Igor Fedotov wrote:
>
> May I have a repair log for that "already expanded" OSD?
Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"
May I have a repair log for that "already expanded" OSD?

On 10/2/2018 4:32 PM, Sergey Malinin wrote:
Repair goes through only when the LVM volume has been expanded; otherwise it fails with enospc, as does any other operation. However, expanding the volume immediately renders bluefs unmountable with an IO error.
2 of 3 OSDs got their bluefs log corrupted (the bluestore tool segfaults at the very end of bluefs-log-dump); I'm not sure whether the corruption occurred before or after the volume expansion.

On 2.10.2018, at 16:07, Igor Fedotov wrote:
You mentioned repair had worked before - is that correct? What's the difference now except the applied patch? Different OSD? Anything else?

On 10/2/2018 3:52 PM, Sergey Malinin wrote:
It didn't work, emailed logs to you.

On 2.10.2018, at 14:43, Igor Fedotov wrote:
The major change is in the get_bluefs_rebalance_txn function; it lacked the bluefs_rebalance_txn assignment.

On 10/2/2018 2:40 PM, Sergey Malinin wrote:
The PR doesn't seem to have changed since yesterday. Am I missing something?

On 2.10.2018, at 14:15, Igor Fedotov wrote:
Please update the patch from the PR - it didn't update the bluefs extents list before. Also please set debug bluestore to 20 when re-running repair and collect the log. If repair doesn't help, would you send the repair and startup logs directly to me, as I have some issues accessing ceph-post-file uploads.
Thanks, Igor

On 10/2/2018 11:39 AM, Sergey Malinin wrote:
Yes, I did repair all OSDs and it finished with 'repair success'. I backed up the OSDs, so now I have more room to play.
I posted log files using ceph-post-file with the following IDs:
4af9cc4d-9c73-41c9-9c38-eb6c551047a0
20df7df5-f0c9-4186-aa21-4e5c0172cd93

On 2.10.2018, at 11:26, Igor Fedotov wrote:
You did run repair for some of these OSDs, didn't you? For all of them? Would you please provide logs for both types of failing OSDs (failed on mount and failed with enospc)? Prior to collecting, please remove the existing logs and set debug bluestore to 20.

On 10/2/2018 2:16 AM, Sergey Malinin wrote:
I was able to apply the patches to mimic, but nothing changed. The one OSD whose space I expanded fails with a bluefs mount IO error; the others keep failing with enospc.

On 1.10.2018, at 19:26, Igor Fedotov wrote:
So you should call repair, which rebalances (i.e. allocates additional) BlueFS space, hence allowing the OSD to start.
Thanks, Igor

On 10/1/2018 7:22 PM, Igor Fedotov wrote:
Not exactly. The rebalancing from this kv_sync_thread still might be deferred due to the nature of this thread (not 100% sure though).
Here is my PR showing the idea (still untested and perhaps unfinished!!!)
https://github.com/ceph/ceph/pull/24353
Igor

On 10/1/2018 7:07 PM, Sergey Malinin wrote:
Can you please confirm whether I got this right:

--- BlueStore.cc.bak    2018-10-01 18:54:45.096836419 +0300
+++ BlueStore.cc        2018-10-01 19:01:35.937623861 +0300
@@ -9049,22 +9049,17 @@
        throttle_bytes.put(costs);
       PExtentVector bluefs_gift_extents;
-      if (bluefs &&
-          after_flush - bluefs_last_balance >
-          cct->_conf->bluestore_bluefs_balance_interval) {
-        bluefs_last_balance = after_flush;
-        int r = _balance_bluefs_freespace(&bluefs_gift_extents);
-        assert(r >= 0);
-        if (r > 0) {
-          for (auto& p : bluefs_gift_extents) {
-            bluefs_extents.insert(p.offset, p.length);
-          }
-          bufferlist bl;
-          encode(bluefs_extents, bl);
-          dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
-                   << bluefs_extents << std::dec << dendl;
-          synct->set(PREFIX_SUPER, "bluefs_extents", bl);
+      int r = _balance_bluefs_freespace(&bluefs_gift_extents);
+      ceph_assert(r >= 0);
+      if (r > 0) {
+        for (auto& p : bluefs_gift_extents) {
+          bluefs_extents.insert(p.offset, p.length);
         }
+        bufferlist bl;
+        encode(bluefs_extents, bl);
+        dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
+                 << bluefs_extents << std::dec << dendl;
+        synct->set(PREFIX_SUPER, "bluefs_extents", bl);
       }
       // cleanup sync deferred keys

On 1.10.2018, at 18:39, Igor Fedotov wrote:
So you have just a single main device per OSD. Then bluestore-tool wouldn't help; it's unable to expand the BlueFS partition on the main device - standalone devices are supported only.
Given that you're able to rebuild the code, I can suggest making a patch that triggers BlueFS rebalance (see the code snippet below) on repairing.

PExtentVector bluefs_gift_extents;
int r = _balance_bluefs_freespace(&bluefs_gift_extents);
ceph_assert(r >= 0);
if (r > 0) {
  for (auto& p : bluefs_gift_extents) {
    bluefs_extents.insert(p.offset, p.length);
  }
  bufferlist bl;
  encode(bluefs_extents, bl);
  dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
           << bluefs_extents << std::dec << dendl;
  synct->set(PREFIX_SUPER, "bluefs_extents", bl);
}

If it waits
Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"
Repair goes through only when LVM volume has been expanded, otherwise it fails with enospc as well as any other operation. However, expanding the volume immediately renders bluefs unmountable with IO error. 2 of 3 OSDs got bluefs log currupted (bluestore tool segfaults at the very end of bluefs-log-dump), I'm not sure whether corruption occurred before or after volume expansion. > On 2.10.2018, at 16:07, Igor Fedotov wrote: > > You mentioned repair had worked before, is that correct? What's the > difference now except the applied patch? Different OSD? Anything else? > > > On 10/2/2018 3:52 PM, Sergey Malinin wrote: > >> It didn't work, emailed logs to you. >> >> >>> On 2.10.2018, at 14:43, Igor Fedotov wrote: >>> >>> The major change is in get_bluefs_rebalance_txn function, it lacked >>> bluefs_rebalance_txn assignment.. >>> >>> >>> >>> On 10/2/2018 2:40 PM, Sergey Malinin wrote: PR doesn't seem to have changed since yesterday. Am I missing something? > On 2.10.2018, at 14:15, Igor Fedotov wrote: > > Please update the patch from the PR - it didn't update bluefs extents > list before. > > Also please set debug bluestore 20 when re-running repair and collect the > log. > > If repair doesn't help - would you send repair and startup logs directly > to me as I have some issues accessing ceph-post-file uploads. > > > Thanks, > > Igor > > > On 10/2/2018 11:39 AM, Sergey Malinin wrote: >> Yes, I did repair all OSDs and it finished with 'repair success'. I >> backed up OSDs so now I have more room to play. >> I posted log files using ceph-post-file with the following IDs: >> 4af9cc4d-9c73-41c9-9c38-eb6c551047a0 >> 20df7df5-f0c9-4186-aa21-4e5c0172cd93 >> >> >>> On 2.10.2018, at 11:26, Igor Fedotov wrote: >>> >>> You did repair for any of this OSDs, didn't you? For all of them? >>> >>> >>> Would you please provide the log for both types (failed on mount and >>> failed with enospc) of failing OSDs. Prior to collecting please remove >>> existing ones prior and set debug bluestore to 20. >>> >>> >>> >>> On 10/2/2018 2:16 AM, Sergey Malinin wrote: I was able to apply patches to mimic, but nothing changed. One osd that I had space expanded on fails with bluefs mount IO error, others keep failing with enospc. > On 1.10.2018, at 19:26, Igor Fedotov wrote: > > So you should call repair which rebalances (i.e. allocates additional > space) BlueFS space. Hence allowing OSD to start. > > Thanks, > > Igor > > > On 10/1/2018 7:22 PM, Igor Fedotov wrote: >> Not exactly. The rebalancing from this kv_sync_thread still might be >> deferred due to the nature of this thread (haven't 100% sure though). >> >> Here is my PR showing the idea (still untested and perhaps >> unfinished!!!) 
Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"
You mentioned repair had worked before, is that correct? What's the difference now except the applied patch? Different OSD? Anything else?

On 10/2/2018 3:52 PM, Sergey Malinin wrote:
> It didn't work, emailed logs to you.
Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"
It didn't work, emailed logs to you.

> On 2.10.2018, at 14:43, Igor Fedotov wrote:
>
> The major change is in get_bluefs_rebalance_txn function, it lacked bluefs_rebalance_txn assignment..
Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"
The major change is in get_bluefs_rebalance_txn function, it lacked bluefs_rebalance_txn assignment..

On 10/2/2018 2:40 PM, Sergey Malinin wrote:
> PR doesn't seem to have changed since yesterday. Am I missing something?
Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"
PR doesn't seem to have changed since yesterday. Am I missing something?

> On 2.10.2018, at 14:15, Igor Fedotov wrote:
>
> Please update the patch from the PR - it didn't update bluefs extents list before.
>
> Also please set debug bluestore 20 when re-running repair and collect the log.
>
> If repair doesn't help - would you send repair and startup logs directly to me as I have some issues accessing ceph-post-file uploads.
>
> Thanks,
> Igor
Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"
Please update the patch from the PR - it didn't update bluefs extents list before.

Also please set debug bluestore 20 when re-running repair and collect the log.

If repair doesn't help - would you send repair and startup logs directly to me as I have some issues accessing ceph-post-file uploads.

Thanks,
Igor

On 10/2/2018 11:39 AM, Sergey Malinin wrote:
> Yes, I did repair all OSDs and it finished with 'repair success'. I backed up OSDs so now I have more room to play.
> I posted log files using ceph-post-file with the following IDs:
> 4af9cc4d-9c73-41c9-9c38-eb6c551047a0
> 20df7df5-f0c9-4186-aa21-4e5c0172cd93
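A minimal sketch of one way to capture such a log; the OSD path and log destination are illustrative, and the CEPH_ARGS overrides are just one way of raising the debug level for the tool:

CEPH_ARGS="--debug_bluestore 20 --log_file /root/osd-1-repair.log" \
  ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-1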
Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"
Yes, I did repair all OSDs and it finished with 'repair success'. I backed up OSDs so now I have more room to play.
I posted log files using ceph-post-file with the following IDs:
4af9cc4d-9c73-41c9-9c38-eb6c551047a0
20df7df5-f0c9-4186-aa21-4e5c0172cd93

> On 2.10.2018, at 11:26, Igor Fedotov wrote:
>
> You did repair for any of this OSDs, didn't you? For all of them?
>
> Would you please provide the log for both types (failed on mount and failed with enospc) of failing OSDs. Prior to collecting please remove existing ones prior and set debug bluestore to 20.
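For anyone following along, ceph-post-file is the helper that uploads files for the developers and prints IDs like the ones above; the description and file path below are illustrative:

ceph-post-file -d "mimic bluefs enospc: osd.1 repair log" /var/log/ceph/ceph-osd.1.log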
Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"
You did repair for any of these OSDs, didn't you? For all of them?

Would you please provide the log for both types (failed on mount and failed with enospc) of failing OSDs. Prior to collecting, please remove the existing logs and set debug bluestore to 20.

On 10/2/2018 2:16 AM, Sergey Malinin wrote:
> I was able to apply patches to mimic, but nothing changed. One osd that I had space expanded on fails with bluefs mount IO error, others keep failing with enospc.
Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"
I was able to apply patches to mimic, but nothing changed. The one OSD that I expanded space on fails with a bluefs mount IO error; the others keep failing with enospc.

> On 1.10.2018, at 19:26, Igor Fedotov wrote:
>
> So you should call repair which rebalances (i.e. allocates additional space) BlueFS space. Hence allowing OSD to start.
>
> Thanks,
> Igor
Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"
So you should call repair which rebalances (i.e. allocates additional space) BlueFS space. Hence allowing OSD to start.

Thanks,
Igor

On 10/1/2018 7:22 PM, Igor Fedotov wrote:
> Not exactly. The rebalancing from this kv_sync_thread still might be deferred due to the nature of this thread (haven't 100% sure though).
>
> Here is my PR showing the idea (still untested and perhaps unfinished!!!)
>
> https://github.com/ceph/ceph/pull/24353
Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"
Not exactly. The rebalancing from this kv_sync_thread still might be deferred due to the nature of this thread (I'm not 100% sure though).

Here is my PR showing the idea (still untested and perhaps unfinished!!!)

https://github.com/ceph/ceph/pull/24353

Igor

On 10/1/2018 7:07 PM, Sergey Malinin wrote:
> Can you please confirm whether I got this right:
Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"
Can you please confirm whether I got this right:

--- BlueStore.cc.bak    2018-10-01 18:54:45.096836419 +0300
+++ BlueStore.cc        2018-10-01 19:01:35.937623861 +0300
@@ -9049,22 +9049,17 @@
       throttle_bytes.put(costs);

       PExtentVector bluefs_gift_extents;
-      if (bluefs &&
-          after_flush - bluefs_last_balance >
-          cct->_conf->bluestore_bluefs_balance_interval) {
-        bluefs_last_balance = after_flush;
-        int r = _balance_bluefs_freespace(&bluefs_gift_extents);
-        assert(r >= 0);
-        if (r > 0) {
-          for (auto& p : bluefs_gift_extents) {
-            bluefs_extents.insert(p.offset, p.length);
-          }
-          bufferlist bl;
-          encode(bluefs_extents, bl);
-          dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
-                   << bluefs_extents << std::dec << dendl;
-          synct->set(PREFIX_SUPER, "bluefs_extents", bl);
+      int r = _balance_bluefs_freespace(&bluefs_gift_extents);
+      ceph_assert(r >= 0);
+      if (r > 0) {
+        for (auto& p : bluefs_gift_extents) {
+          bluefs_extents.insert(p.offset, p.length);
         }
+        bufferlist bl;
+        encode(bluefs_extents, bl);
+        dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
+                 << bluefs_extents << std::dec << dendl;
+        synct->set(PREFIX_SUPER, "bluefs_extents", bl);
       }

       // cleanup sync deferred keys

> On 1.10.2018, at 18:39, Igor Fedotov wrote:
>
> So you have just a single main device per OSD
>
> Then bluestore-tool wouldn't help, it's unable to expand BlueFS partition at main device, standalone devices are supported only.
Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"
So you have just a single main device per OSD

Then bluestore-tool wouldn't help, it's unable to expand BlueFS partition at main device, standalone devices are supported only.

Given that you're able to rebuild the code I can suggest to make a patch that triggers BlueFS rebalance (see code snippet below) on repairing.

  PExtentVector bluefs_gift_extents;
  int r = _balance_bluefs_freespace(&bluefs_gift_extents);
  ceph_assert(r >= 0);
  if (r > 0) {
    for (auto& p : bluefs_gift_extents) {
      bluefs_extents.insert(p.offset, p.length);
    }
    bufferlist bl;
    encode(bluefs_extents, bl);
    dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
             << bluefs_extents << std::dec << dendl;
    synct->set(PREFIX_SUPER, "bluefs_extents", bl);
  }

If it waits I can probably make a corresponding PR tomorrow.

Thanks,
Igor

On 10/1/2018 6:16 PM, Sergey Malinin wrote:
> I have rebuilt the tool, but none of my OSDs no matter dead or alive have any symlinks other than 'block' pointing to LVM.
> I adjusted main device size but it looks like it needs even more space for db compaction. After executing bluefs-bdev-expand OSD fails to start, however 'fsck' and 'repair' commands finished successfully.
Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"
I have rebuilt the tool, but none of my OSDs no matter dead or alive have any symlinks other than 'block' pointing to LVM.
I adjusted main device size but it looks like it needs even more space for db compaction. After executing bluefs-bdev-expand OSD fails to start, however 'fsck' and 'repair' commands finished successfully.

2018-10-01 18:02:39.755 7fc9226c6240  1 freelist init
2018-10-01 18:02:39.763 7fc9226c6240  1 bluestore(/var/lib/ceph/osd/ceph-1) _open_alloc opening allocation metadata
2018-10-01 18:02:40.907 7fc9226c6240  1 bluestore(/var/lib/ceph/osd/ceph-1) _open_alloc loaded 285 GiB in 2249899 extents
2018-10-01 18:02:40.951 7fc9226c6240 -1 bluestore(/var/lib/ceph/osd/ceph-1) _reconcile_bluefs_freespace bluefs extra 0x[6d6f00~50c80]
2018-10-01 18:02:40.951 7fc9226c6240  1 stupidalloc 0x0x55d053fb9180 shutdown
2018-10-01 18:02:40.963 7fc9226c6240  1 freelist shutdown
2018-10-01 18:02:40.963 7fc9226c6240  4 rocksdb: [/build/ceph-13.2.2/src/rocksdb/db/db_impl.cc:252] Shutdown: canceling all background work
2018-10-01 18:02:40.967 7fc9226c6240  4 rocksdb: [/build/ceph-13.2.2/src/rocksdb/db/db_impl.cc:397] Shutdown complete
2018-10-01 18:02:40.971 7fc9226c6240  1 bluefs umount
2018-10-01 18:02:40.975 7fc9226c6240  1 stupidalloc 0x0x55d053883800 shutdown
2018-10-01 18:02:40.975 7fc9226c6240  1 bdev(0x55d053c32e00 /var/lib/ceph/osd/ceph-1/block) close
2018-10-01 18:02:41.267 7fc9226c6240  1 bdev(0x55d053c32a80 /var/lib/ceph/osd/ceph-1/block) close
2018-10-01 18:02:41.443 7fc9226c6240 -1 osd.1 0 OSD:init: unable to mount object store
2018-10-01 18:02:41.443 7fc9226c6240 -1 ** ERROR: osd init failed: (5) Input/output error

> On 1.10.2018, at 18:09, Igor Fedotov wrote:
>
> Well, actually you can avoid bluestore-tool rebuild.
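Roughly, the expansion path described above looks like the following; VG/LV names and sizes are illustrative, and as noted elsewhere in the thread this alone was not enough for a main-device-only OSD:

# grow the LV backing the OSD's 'block' device (name and size are illustrative)
lvextend -L +20G /dev/ceph-vg/osd-1-block

# let BlueFS/BlueStore see the larger device, then sanity-check before starting the OSD
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-1
ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-1
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-1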
>> >>> What exactly is the label-key that needs to be updated, as I couldn't find >>> which one is related to DB: >>> >>> # ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-1 >>> inferring bluefs devices from bluestore path >>> { >>> "/var/lib/ceph/osd/ceph-1/block": { >>> "osd_uuid": "f8f122ee-70a6-4c54-8eb0-9b42205b1ecc", >>> "size": 471305551872, >>> "btime": "2018-07-31 03:06:43.751243", >>> "description": "main", >>> "bluefs": "1", >>> "ceph_fsid": "7d320499-5b3f-453e-831f-60d4db9a4533", >>> "kv_backend": "rocksdb", >>> "magic": "ceph osd volume v026", >>> "mkfs_done": "yes", >>> "osd_key": "XXX", >>> "ready": "ready", >>> "whoami": "1" >>> } >>> } >> 'size' label but your output is for block(aka slow) device. >> >> It should return labels for db/wal devices as well (block.db and block.wal >> symlinks respectively). It works for me in master, can't verify with mimic >> at the moment though.. >> Here is output for master: >> >> # bin/ceph-bluestore-tool show-label --path dev/osd0 >> inferring bluefs devices from bluestore path >> { >> "dev/osd0/block": { >> "osd_uuid": "404dcbe9-3f8d-4ef5-ac59-2582454a9a75", >> "size": 21474836480, >> "btime": "2018-09-10 15:55:09.044039", >> "description": "main", >> "bluefs": "1", >> "ceph_fsid": "56eddc15-11b9-4e0b-9192-e391fbae551c", >> "kv_backend": "rocksdb", >> "magic": "ceph osd volume v026", >> "mkfs_done": "yes", >> "osd_key": "AQCsaZZbYTxXJBAAe3jJI4p6WbMjvA8CBBUJbA==", >> "ready": "ready", >> "whoami": "0" >> }, >> "dev/osd0/block.wal": { >> "osd_uuid": "404dcbe9-3f8d-4ef5-ac59-2582454a9a75", >> "size": 1048576000, >> "btime": "2018-09-10 15:55:09.044985", >> "description": "bluefs wal" >> }, >>
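For reference, the fsck and repair runs mentioned at the top of this message are the standard ceph-bluestore-tool invocations; a minimal sketch, with the OSD path as a placeholder and the OSD stopped:

ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-1
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-1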
Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"
Well, actually you can avoid the bluestore-tool rebuild. You'll need to edit the first chunk of block.db where labels are stored. (Please make a backup first!!!) The size label is stored at offset 0x52 and is 8 bytes long - little-endian 64-bit integer encoding. (Please verify that the old value at this offset exactly corresponds to your original volume size and/or the 'size' label reported by ceph-bluestore-tool.) So you have to put the new DB volume size there. Or you can send the first 4K chunk (e.g. extracted with dd) along with the new DB volume size (in bytes) to me and I'll do that for you. Thanks, Igor On 10/1/2018 5:32 PM, Igor Fedotov wrote: On 10/1/2018 5:03 PM, Sergey Malinin wrote: Before I received your response, I had already added 20GB to the OSD (by expanding the LV followed by bluefs-bdev-expand) and ran "ceph-kvstore-tool bluestore-kv compact", however it still needs more space. Is that because I didn't update DB size with set-label-key? In mimic you need to run both "bluefs-bdev-expand" and "set-label-key" commands to commit bluefs volume expansion. Unfortunately the last command doesn't handle the "size" label properly. That's why you might need to backport and rebuild with the mentioned commits. What exactly is the label-key that needs to be updated? I couldn't find which one is related to the DB: # ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-1 inferring bluefs devices from bluestore path { "/var/lib/ceph/osd/ceph-1/block": { "osd_uuid": "f8f122ee-70a6-4c54-8eb0-9b42205b1ecc", "size": 471305551872, "btime": "2018-07-31 03:06:43.751243", "description": "main", "bluefs": "1", "ceph_fsid": "7d320499-5b3f-453e-831f-60d4db9a4533", "kv_backend": "rocksdb", "magic": "ceph osd volume v026", "mkfs_done": "yes", "osd_key": "XXX", "ready": "ready", "whoami": "1" } } The 'size' label, but your output is for the block (aka slow) device. It should return labels for db/wal devices as well (block.db and block.wal symlinks respectively). It works for me in master, can't verify with mimic at the moment though. Here is output for master: # bin/ceph-bluestore-tool show-label --path dev/osd0 inferring bluefs devices from bluestore path { "dev/osd0/block": { "osd_uuid": "404dcbe9-3f8d-4ef5-ac59-2582454a9a75", "size": 21474836480, "btime": "2018-09-10 15:55:09.044039", "description": "main", "bluefs": "1", "ceph_fsid": "56eddc15-11b9-4e0b-9192-e391fbae551c", "kv_backend": "rocksdb", "magic": "ceph osd volume v026", "mkfs_done": "yes", "osd_key": "AQCsaZZbYTxXJBAAe3jJI4p6WbMjvA8CBBUJbA==", "ready": "ready", "whoami": "0" }, "dev/osd0/block.wal": { "osd_uuid": "404dcbe9-3f8d-4ef5-ac59-2582454a9a75", "size": 1048576000, "btime": "2018-09-10 15:55:09.044985", "description": "bluefs wal" }, "dev/osd0/block.db": { "osd_uuid": "404dcbe9-3f8d-4ef5-ac59-2582454a9a75", "size": 1048576000, "btime": "2018-09-10 15:55:09.044469", "description": "bluefs db" } } You can try the --dev option instead of --path, e.g. ceph-bluestore-tool show-label --dev On 1.10.2018, at 16:48, Igor Fedotov wrote: This looks like a sort of deadlock when BlueFS needs some additional space to replay the log left after the crash. Which happens during BlueFS open. But such a space (at slow device as DB is full) is gifted in background during bluefs rebalance procedure which will occur after the open. Hence OSDs stuck in permanent crashing.. The only way to recover I can suggest for now is to expand DB volumes. You can do that with lvm tools if you have any spare space for that.
Once resized you'll need ceph-bluestore-tool to indicate volume expansion to BlueFS (bluefs-bdev-expand command ) and finally update DB volume size label with set-label-key command. The latter is a bit tricky for mimic - you might need to backport https://github.com/ceph/ceph/pull/22085/commits/ffac450da5d6e09cf14b8363b35f21819b48f38b and rebuild ceph-bluestore-tool. Alternatively you can backport https://github.com/ceph/ceph/pull/22085/commits/71c3b58da4e7ced3422bce2b1da0e3fa9331530b then bluefs expansion and label updates will occur in a single step. I'll do these backports in upstream but this will take some time to pass all the procedures and get into official mimic release. Will fire a ticket to fix the original issue as well. Thanks, Igor On 10/1/2018 3:28 PM, Sergey Malinin wrote: These are LVM bluestore NVMe SSDs created with "ceph-volume --lvm prepare --bluestore /dev/nvme0n1p3" i.e. without specifying wal/db devices. OSDs were created with bluestore_min_alloc_size_ssd=4096, another modified setting is bluestore_cache_kv_max=1073741824 DB/block usage collected by prometheus module for 3 failed and 1 survived
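For anyone doing the label edit by hand instead of rebuilding the tool, here is a rough sketch of the procedure described above. All paths and the size value are placeholders (substitute whichever device actually carries the label on your OSD), and the offset/encoding (8 bytes, little-endian, at 0x52) are as Igor states; back up the chunk and verify the old value before writing anything.

# Placeholders only - adjust the device path and new size (in bytes) to your setup.
DEV=/var/lib/ceph/osd/ceph-1/block.db
NEW_SIZE=68719476736

# 1. Back up the first 4K label chunk (the same chunk Igor asks for).
dd if="$DEV" of=/root/label-chunk.bak bs=4096 count=1

# 2. Inspect the current 'size' label: 8 bytes, little-endian, at offset 0x52 (82).
xxd -s 82 -l 8 "$DEV"

# 3. Write the new size as a little-endian 64-bit integer at the same offset.
printf '%016x' "$NEW_SIZE" | fold -w2 | tac | tr -d '\n' | xxd -r -p \
  | dd of="$DEV" bs=1 seek=82 count=8 conv=notrunc

This is, in effect, what a fixed set-label-key would do for the 'size' key; the manual route just avoids the rebuild.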
Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"
On 10/1/2018 5:03 PM, Sergey Malinin wrote: Before I received your response, I had already added 20GB to the OSD (by epanding LV followed by bluefs-bdev-expand) and ran "ceph-kvstore-tool bluestore-kv compact", however it still needs more space. Is that because I didn't update DB size with set-label-key? In mimic you need to run both "bluefs-bdev-expand" and "set-label-key" command to commit bluefs volume expansion. Unfortunately the last command doesn't handle "size" label properly. That's why you might need to backport and rebuild with the mentioned commits. What exactly is the label-key that needs to be updated, as I couldn't find which one is related to DB: # ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-1 inferring bluefs devices from bluestore path { "/var/lib/ceph/osd/ceph-1/block": { "osd_uuid": "f8f122ee-70a6-4c54-8eb0-9b42205b1ecc", "size": 471305551872, "btime": "2018-07-31 03:06:43.751243", "description": "main", "bluefs": "1", "ceph_fsid": "7d320499-5b3f-453e-831f-60d4db9a4533", "kv_backend": "rocksdb", "magic": "ceph osd volume v026", "mkfs_done": "yes", "osd_key": "XXX", "ready": "ready", "whoami": "1" } } 'size' label but your output is for block(aka slow) device. It should return labels for db/wal devices as well (block.db and block.wal symlinks respectively). It works for me in master, can't verify with mimic at the moment though.. Here is output for master: # bin/ceph-bluestore-tool show-label --path dev/osd0 inferring bluefs devices from bluestore path { "dev/osd0/block": { "osd_uuid": "404dcbe9-3f8d-4ef5-ac59-2582454a9a75", "size": 21474836480, "btime": "2018-09-10 15:55:09.044039", "description": "main", "bluefs": "1", "ceph_fsid": "56eddc15-11b9-4e0b-9192-e391fbae551c", "kv_backend": "rocksdb", "magic": "ceph osd volume v026", "mkfs_done": "yes", "osd_key": "AQCsaZZbYTxXJBAAe3jJI4p6WbMjvA8CBBUJbA==", "ready": "ready", "whoami": "0" }, "dev/osd0/block.wal": { "osd_uuid": "404dcbe9-3f8d-4ef5-ac59-2582454a9a75", "size": 1048576000, "btime": "2018-09-10 15:55:09.044985", "description": "bluefs wal" }, "dev/osd0/block.db": { "osd_uuid": "404dcbe9-3f8d-4ef5-ac59-2582454a9a75", "size": 1048576000, "btime": "2018-09-10 15:55:09.044469", "description": "bluefs db" } } You can try --dev option instead of --path, e.g. ceph-bluestore-tool show-label --dev On 1.10.2018, at 16:48, Igor Fedotov wrote: This looks like a sort of deadlock when BlueFS needs some additional space to replay the log left after the crash. Which happens during BlueFS open. But such a space (at slow device as DB is full) is gifted in background during bluefs rebalance procedure which will occur after the open. Hence OSDs stuck in permanent crashing.. The only way to recover I can suggest for now is to expand DB volumes. You can do that with lvm tools if you have any spare space for that. Once resized you'll need ceph-bluestore-tool to indicate volume expansion to BlueFS (bluefs-bdev-expand command ) and finally update DB volume size label with set-label-key command. The latter is a bit tricky for mimic - you might need to backport https://github.com/ceph/ceph/pull/22085/commits/ffac450da5d6e09cf14b8363b35f21819b48f38b and rebuild ceph-bluestore-tool. Alternatively you can backport https://github.com/ceph/ceph/pull/22085/commits/71c3b58da4e7ced3422bce2b1da0e3fa9331530b then bluefs expansion and label updates will occur in a single step. I'll do these backports in upstream but this will take some time to pass all the procedures and get into official mimic release. 
Will fire a ticket to fix the original issue as well. Thanks, Igor On 10/1/2018 3:28 PM, Sergey Malinin wrote: These are LVM bluestore NVMe SSDs created with "ceph-volume --lvm prepare --bluestore /dev/nvme0n1p3" i.e. without specifying wal/db devices. OSDs were created with bluestore_min_alloc_size_ssd=4096, another modified setting is bluestore_cache_kv_max=1073741824 DB/block usage collected by prometheus module for 3 failed and 1 survived OSDs: ceph_bluefs_db_total_bytes{ceph_daemon="osd.0"} 65493008384.0 ceph_bluefs_db_total_bytes{ceph_daemon="osd.1"} 49013587968.0 ceph_bluefs_db_total_bytes{ceph_daemon="osd.2"} 76834406400.0 --> this one has survived ceph_bluefs_db_total_bytes{ceph_daemon="osd.3"} 63726157824.0 ceph_bluefs_db_used_bytes{ceph_daemon="osd.0"} 65217232896.0 ceph_bluefs_db_used_bytes{ceph_daemon="osd.1"} 48944381952.0 ceph_bluefs_db_used_bytes{ceph_daemon="osd.2"} 68093476864.0 ceph_bluefs_db_used_bytes{ceph_daemon="osd.3"} 63632834560.0 ceph_osd_stat_bytes{ceph_daemon="osd.0"} 471305551872.0 ceph_osd_stat_bytes{ceph_daemon="osd.1"} 471305551872.0
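To make the --dev variant mentioned above concrete, a minimal invocation would look like this (device path is a placeholder; on an all-in-one OSD like these, the main block device is the only one to point it at):

# Read labels from a single device instead of inferring devices from the OSD dir.
ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-1/block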
Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"
Before I received your response, I had already added 20GB to the OSD (by epanding LV followed by bluefs-bdev-expand) and ran "ceph-kvstore-tool bluestore-kv compact", however it still needs more space. Is that because I didn't update DB size with set-label-key? What exactly is the label-key that needs to be updated, as I couldn't find which one is related to DB: # ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-1 inferring bluefs devices from bluestore path { "/var/lib/ceph/osd/ceph-1/block": { "osd_uuid": "f8f122ee-70a6-4c54-8eb0-9b42205b1ecc", "size": 471305551872, "btime": "2018-07-31 03:06:43.751243", "description": "main", "bluefs": "1", "ceph_fsid": "7d320499-5b3f-453e-831f-60d4db9a4533", "kv_backend": "rocksdb", "magic": "ceph osd volume v026", "mkfs_done": "yes", "osd_key": "XXX", "ready": "ready", "whoami": "1" } } > On 1.10.2018, at 16:48, Igor Fedotov wrote: > > This looks like a sort of deadlock when BlueFS needs some additional space to > replay the log left after the crash. Which happens during BlueFS open. > > But such a space (at slow device as DB is full) is gifted in background > during bluefs rebalance procedure which will occur after the open. > > Hence OSDs stuck in permanent crashing.. > > The only way to recover I can suggest for now is to expand DB volumes. You > can do that with lvm tools if you have any spare space for that. > > Once resized you'll need ceph-bluestore-tool to indicate volume expansion to > BlueFS (bluefs-bdev-expand command ) and finally update DB volume size label > with set-label-key command. > > The latter is a bit tricky for mimic - you might need to backport > https://github.com/ceph/ceph/pull/22085/commits/ffac450da5d6e09cf14b8363b35f21819b48f38b > > and rebuild ceph-bluestore-tool. Alternatively you can backport > https://github.com/ceph/ceph/pull/22085/commits/71c3b58da4e7ced3422bce2b1da0e3fa9331530b > > then bluefs expansion and label updates will occur in a single step. > > I'll do these backports in upstream but this will take some time to pass all > the procedures and get into official mimic release. > > Will fire a ticket to fix the original issue as well. > > > Thanks, > > Igor > > > On 10/1/2018 3:28 PM, Sergey Malinin wrote: >> These are LVM bluestore NVMe SSDs created with "ceph-volume --lvm prepare >> --bluestore /dev/nvme0n1p3" i.e. without specifying wal/db devices. 
>> OSDs were created with bluestore_min_alloc_size_ssd=4096, another modified >> setting is bluestore_cache_kv_max=1073741824 >> >> DB/block usage collected by prometheus module for 3 failed and 1 survived >> OSDs: >> >> ceph_bluefs_db_total_bytes{ceph_daemon="osd.0"} 65493008384.0 >> ceph_bluefs_db_total_bytes{ceph_daemon="osd.1"} 49013587968.0 >> ceph_bluefs_db_total_bytes{ceph_daemon="osd.2"} 76834406400.0 --> this one >> has survived >> ceph_bluefs_db_total_bytes{ceph_daemon="osd.3"} 63726157824.0 >> >> ceph_bluefs_db_used_bytes{ceph_daemon="osd.0"} 65217232896.0 >> ceph_bluefs_db_used_bytes{ceph_daemon="osd.1"} 48944381952.0 >> ceph_bluefs_db_used_bytes{ceph_daemon="osd.2"} 68093476864.0 >> ceph_bluefs_db_used_bytes{ceph_daemon="osd.3"} 63632834560.0 >> >> ceph_osd_stat_bytes{ceph_daemon="osd.0"} 471305551872.0 >> ceph_osd_stat_bytes{ceph_daemon="osd.1"} 471305551872.0 >> ceph_osd_stat_bytes{ceph_daemon="osd.2"} 471305551872.0 >> ceph_osd_stat_bytes{ceph_daemon="osd.3"} 471305551872.0 >> >> ceph_osd_stat_bytes_used{ceph_daemon="osd.0"} 222328213504.0 >> ceph_osd_stat_bytes_used{ceph_daemon="osd.1"} 214472544256.0 >> ceph_osd_stat_bytes_used{ceph_daemon="osd.2"} 163603996672.0 >> ceph_osd_stat_bytes_used{ceph_daemon="osd.3"} 212806815744.0 >> >> >> First crashed OSD was doing DB compaction, others crashed shortly after >> during backfilling. Workload was "ceph-data-scan scan_inodes" filling >> metadata pool located on these OSDs at the rate close to 10k objects/second. >> Here is the log excerpt of the first crash occurrence: >> >> 2018-10-01 03:27:12.762 7fbf16dd6700 0 bluestore(/var/lib/ceph/osd/ceph-1) >> _balance_bluefs_freespace no allocate on 0x8000 min_alloc_size 0x1000 >> 2018-10-01 03:27:12.886 7fbf1e5e5700 4 rocksdb: >> [/build/ceph-13.2.2/src/rocksdb/db/compaction_job.cc:1166] [default] [JOB >> 24] Generated table #89741: 106356 keys, 68110589 bytes >> 2018-10-01 03:27:12.886 7fbf1e5e5700 4 rocksdb: EVENT_LOG_v1 >> {"time_micros": 1538353632892744, "cf_name": "default", "job": 24, "event": >> "table_file_creation", "file_number": 89741, "file_size": 68110589, >> "table_properties": {"data_size": 67112903, "index_size": 579319, >> "filter_size": 417316, "raw_key_size": 6733561, "raw_average_key_size": 63, >> "raw_value_size": 60994583, "raw_average_value_size": 573, >> "num_data_blocks": 16336, "num_entries": 106356, "filter_policy_name": >> "rocksdb.BuiltinBloomFilter", "kDeletedKeys":
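Putting the steps from this exchange together, the expansion sequence would look roughly like the sketch below. The LV name, size increment and paths are placeholders, and the set-label-key step assumes a ceph-bluestore-tool carrying the backported label fix:

# Placeholders throughout - adjust the LV, size and paths to your layout.
lvextend -L +20G /dev/ceph-db-vg/osd-1-db                                # 1. grow the DB LV
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-1   # 2. let BlueFS see the new size
ceph-bluestore-tool set-label-key --dev /var/lib/ceph/osd/ceph-1/block.db \
    -k size -v 68719476736                                               # 3. update the 'size' label (bytes)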
Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"
This looks like a sort of deadlock when BlueFS needs some additional space to replay the log left after the crash. Which happens during BlueFS open. But such a space (at slow device as DB is full) is gifted in background during bluefs rebalance procedure which will occur after the open. Hence OSDs stuck in permanent crashing.. The only way to recover I can suggest for now is to expand DB volumes. You can do that with lvm tools if you have any spare space for that. Once resized you'll need ceph-bluestore-tool to indicate volume expansion to BlueFS (bluefs-bdev-expand command ) and finally update DB volume size label with set-label-key command. The latter is a bit tricky for mimic - you might need to backport https://github.com/ceph/ceph/pull/22085/commits/ffac450da5d6e09cf14b8363b35f21819b48f38b and rebuild ceph-bluestore-tool. Alternatively you can backport https://github.com/ceph/ceph/pull/22085/commits/71c3b58da4e7ced3422bce2b1da0e3fa9331530b then bluefs expansion and label updates will occur in a single step. I'll do these backports in upstream but this will take some time to pass all the procedures and get into official mimic release. Will fire a ticket to fix the original issue as well. Thanks, Igor On 10/1/2018 3:28 PM, Sergey Malinin wrote: These are LVM bluestore NVMe SSDs created with "ceph-volume --lvm prepare --bluestore /dev/nvme0n1p3" i.e. without specifying wal/db devices. OSDs were created with bluestore_min_alloc_size_ssd=4096, another modified setting is bluestore_cache_kv_max=1073741824 DB/block usage collected by prometheus module for 3 failed and 1 survived OSDs: ceph_bluefs_db_total_bytes{ceph_daemon="osd.0"} 65493008384.0 ceph_bluefs_db_total_bytes{ceph_daemon="osd.1"} 49013587968.0 ceph_bluefs_db_total_bytes{ceph_daemon="osd.2"} 76834406400.0 --> this one has survived ceph_bluefs_db_total_bytes{ceph_daemon="osd.3"} 63726157824.0 ceph_bluefs_db_used_bytes{ceph_daemon="osd.0"} 65217232896.0 ceph_bluefs_db_used_bytes{ceph_daemon="osd.1"} 48944381952.0 ceph_bluefs_db_used_bytes{ceph_daemon="osd.2"} 68093476864.0 ceph_bluefs_db_used_bytes{ceph_daemon="osd.3"} 63632834560.0 ceph_osd_stat_bytes{ceph_daemon="osd.0"} 471305551872.0 ceph_osd_stat_bytes{ceph_daemon="osd.1"} 471305551872.0 ceph_osd_stat_bytes{ceph_daemon="osd.2"} 471305551872.0 ceph_osd_stat_bytes{ceph_daemon="osd.3"} 471305551872.0 ceph_osd_stat_bytes_used{ceph_daemon="osd.0"} 222328213504.0 ceph_osd_stat_bytes_used{ceph_daemon="osd.1"} 214472544256.0 ceph_osd_stat_bytes_used{ceph_daemon="osd.2"} 163603996672.0 ceph_osd_stat_bytes_used{ceph_daemon="osd.3"} 212806815744.0 First crashed OSD was doing DB compaction, others crashed shortly after during backfilling. Workload was "ceph-data-scan scan_inodes" filling metadata pool located on these OSDs at the rate close to 10k objects/second. 
Here is the log excerpt of the first crash occurrence: 2018-10-01 03:27:12.762 7fbf16dd6700 0 bluestore(/var/lib/ceph/osd/ceph-1) _balance_bluefs_freespace no allocate on 0x8000 min_alloc_size 0x1000 2018-10-01 03:27:12.886 7fbf1e5e5700 4 rocksdb: [/build/ceph-13.2.2/src/rocksdb/db/compaction_job.cc:1166] [default] [JOB 24] Generated table #89741: 106356 keys, 68110589 bytes 2018-10-01 03:27:12.886 7fbf1e5e5700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1538353632892744, "cf_name": "default", "job": 24, "event": "table_file_creation", "file_number": 89741, "file_size": 68110589, "table_properties": {"data_size": 67112903, "index_size": 579319, "filter_size": 417316, "raw_key_size": 6733561, "raw_average_key_size": 63, "raw_value_size": 60994583, "raw_average_value_size": 573, "num_data_blocks": 16336, "num_entries": 106356, "filter_policy_name": "rocksdb.BuiltinBloomFilter", "kDeletedKeys": "1", "kMergeOperands": "0"}} 2018-10-01 03:27:12.934 7fbf1e5e5700 4 rocksdb: [/build/ceph-13.2.2/src/rocksdb/db/compaction_job.cc:1166] [default] [JOB 24] Generated table #89742: 23214 keys, 16352315 bytes 2018-10-01 03:27:12.934 7fbf1e5e5700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1538353632938670, "cf_name": "default", "job": 24, "event": "table_file_creation", "file_number": 89742, "file_size": 16352315, "table_properties": {"data_size": 16116986, "index_size": 139894, "filter_size": 94386, "raw_key_size": 1470883, "raw_average_key_size": 63, "raw_value_size": 14775006, "raw_average_value_size": 636, "num_data_blocks": 3928, "num_entries": 23214, "filter_policy_name": "rocksdb.BuiltinBloomFilter", "kDeletedKeys": "90", "kMergeOperands": "0"}} 2018-10-01 03:27:13.042 7fbf1e5e5700 1 bluefs _allocate failed to allocate 0x410 on bdev 1, free 0x1a0; fallback to bdev 2 2018-10-01 03:27:13.042 7fbf1e5e5700 -1 bluefs _allocate failed to allocate 0x410 on bdev 2, dne 2018-10-01 03:27:13.042 7fbf1e5e5700 -1 bluefs _flush_range allocated: 0x0 offset: 0x0 length: 0x40ea9f1 2018-10-01 03:27:13.046 7fbf1e5e5700 -1 /build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: In function 'int
Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"
These are LVM bluestore NVMe SSDs created with "ceph-volume --lvm prepare --bluestore /dev/nvme0n1p3" i.e. without specifying wal/db devices. OSDs were created with bluestore_min_alloc_size_ssd=4096, another modified setting is bluestore_cache_kv_max=1073741824 DB/block usage collected by prometheus module for 3 failed and 1 survived OSDs: ceph_bluefs_db_total_bytes{ceph_daemon="osd.0"} 65493008384.0 ceph_bluefs_db_total_bytes{ceph_daemon="osd.1"} 49013587968.0 ceph_bluefs_db_total_bytes{ceph_daemon="osd.2"} 76834406400.0 --> this one has survived ceph_bluefs_db_total_bytes{ceph_daemon="osd.3"} 63726157824.0 ceph_bluefs_db_used_bytes{ceph_daemon="osd.0"} 65217232896.0 ceph_bluefs_db_used_bytes{ceph_daemon="osd.1"} 48944381952.0 ceph_bluefs_db_used_bytes{ceph_daemon="osd.2"} 68093476864.0 ceph_bluefs_db_used_bytes{ceph_daemon="osd.3"} 63632834560.0 ceph_osd_stat_bytes{ceph_daemon="osd.0"} 471305551872.0 ceph_osd_stat_bytes{ceph_daemon="osd.1"} 471305551872.0 ceph_osd_stat_bytes{ceph_daemon="osd.2"} 471305551872.0 ceph_osd_stat_bytes{ceph_daemon="osd.3"} 471305551872.0 ceph_osd_stat_bytes_used{ceph_daemon="osd.0"} 222328213504.0 ceph_osd_stat_bytes_used{ceph_daemon="osd.1"} 214472544256.0 ceph_osd_stat_bytes_used{ceph_daemon="osd.2"} 163603996672.0 ceph_osd_stat_bytes_used{ceph_daemon="osd.3"} 212806815744.0 First crashed OSD was doing DB compaction, others crashed shortly after during backfilling. Workload was "ceph-data-scan scan_inodes" filling metadata pool located on these OSDs at the rate close to 10k objects/second. Here is the log excerpt of the first crash occurrence: 2018-10-01 03:27:12.762 7fbf16dd6700 0 bluestore(/var/lib/ceph/osd/ceph-1) _balance_bluefs_freespace no allocate on 0x8000 min_alloc_size 0x1000 2018-10-01 03:27:12.886 7fbf1e5e5700 4 rocksdb: [/build/ceph-13.2.2/src/rocksdb/db/compaction_job.cc:1166] [default] [JOB 24] Generated table #89741: 106356 keys, 68110589 bytes 2018-10-01 03:27:12.886 7fbf1e5e5700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1538353632892744, "cf_name": "default", "job": 24, "event": "table_file_creation", "file_number": 89741, "file_size": 68110589, "table_properties": {"data_size": 67112903, "index_size": 579319, "filter_size": 417316, "raw_key_size": 6733561, "raw_average_key_size": 63, "raw_value_size": 60994583, "raw_average_value_size": 573, "num_data_blocks": 16336, "num_entries": 106356, "filter_policy_name": "rocksdb.BuiltinBloomFilter", "kDeletedKeys": "1", "kMergeOperands": "0"}} 2018-10-01 03:27:12.934 7fbf1e5e5700 4 rocksdb: [/build/ceph-13.2.2/src/rocksdb/db/compaction_job.cc:1166] [default] [JOB 24] Generated table #89742: 23214 keys, 16352315 bytes 2018-10-01 03:27:12.934 7fbf1e5e5700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1538353632938670, "cf_name": "default", "job": 24, "event": "table_file_creation", "file_number": 89742, "file_size": 16352315, "table_properties": {"data_size": 16116986, "index_size": 139894, "filter_size": 94386, "raw_key_size": 1470883, "raw_average_key_size": 63, "raw_value_size": 14775006, "raw_average_value_size": 636, "num_data_blocks": 3928, "num_entries": 23214, "filter_policy_name": "rocksdb.BuiltinBloomFilter", "kDeletedKeys": "90", "kMergeOperands": "0"}} 2018-10-01 03:27:13.042 7fbf1e5e5700 1 bluefs _allocate failed to allocate 0x410 on bdev 1, free 0x1a0; fallback to bdev 2 2018-10-01 03:27:13.042 7fbf1e5e5700 -1 bluefs _allocate failed to allocate 0x410 on bdev 2, dne 2018-10-01 03:27:13.042 7fbf1e5e5700 -1 bluefs _flush_range allocated: 0x0 offset: 0x0 length: 0x40ea9f1 2018-10-01 03:27:13.046 
7fbf1e5e5700 -1 /build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7fbf1e5e5700 time 2018-10-01 03:27:13.048298 /build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: 1663: FAILED assert(0 == "bluefs enospc") ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7fbf2d4fe5c2] 2: (()+0x26c787) [0x7fbf2d4fe787] 3: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1ab4) [0x5619325114b4] 4: (BlueRocksWritableFile::Flush()+0x3d) [0x561932527c1d] 5: (rocksdb::WritableFileWriter::Flush()+0x1b9) [0x56193271c399] 6: (rocksdb::WritableFileWriter::Sync(bool)+0x3b) [0x56193271d42b] 7: (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status const&, rocksdb::CompactionJob::SubcompactionState*, rocksdb::RangeDelAggregator*, CompactionIterationStats*, rocksdb::Slice const*)+0x3db) [0x56193276098b] 8: (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0x7d9) [0x561932763da9] 9: (rocksdb::CompactionJob::Run()+0x314) [0x561932765504] 10: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*)+0xc54) [0x5619325b5c44] 11:
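As an aside, the ceph_bluefs_* figures quoted above come from the mgr prometheus module exporting the OSDs' perf counters; roughly the same numbers can be pulled from a running OSD's admin socket (daemon id is a placeholder):

# db_total_bytes / db_used_bytes straight from the OSD's bluefs perf counters
ceph daemon osd.1 perf dump | grep -E '"db_(total|used)_bytes"'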
Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"
Hi Sergey, could you please provide more details on your OSDs ? What are sizes for DB/block devices? Do you have any modifications in BlueStore config settings? Can you share stats you're referring to? Thanks, Igor On 10/1/2018 12:29 PM, Sergey Malinin wrote: Hello, 3 of 4 NVME OSDs crashed at the same time on assert(0 == "bluefs enospc") and no longer start. Stats collected just before crash show that ceph_bluefs_db_used_bytes is 100% used. Although OSDs have over 50% of free space, it is not reallocated for DB usage. 2018-10-01 12:18:06.744 7f1d6a04d240 1 bluefs _allocate failed to allocate 0x10 on bdev 1, free 0x0; fallback to bdev 2 2018-10-01 12:18:06.744 7f1d6a04d240 -1 bluefs _allocate failed to allocate 0x10 on bdev 2, dne 2018-10-01 12:18:06.744 7f1d6a04d240 -1 bluefs _flush_range allocated: 0x0 offset: 0x0 length: 0xa8700 2018-10-01 12:18:06.748 7f1d6a04d240 -1 /build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7f1d6a04d240 time 2018-10-01 12:18:06.746800 /build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: 1663: FAILED assert(0 == "bluefs enospc") ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f1d6146f5c2] 2: (()+0x26c787) [0x7f1d6146f787] 3: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1ab4) [0x5586b22684b4] 4: (BlueRocksWritableFile::Flush()+0x3d) [0x5586b227ec1d] 5: (rocksdb::WritableFileWriter::Flush()+0x1b9) [0x5586b2473399] 6: (rocksdb::WritableFileWriter::Sync(bool)+0x3b) [0x5586b247442b] 7: (rocksdb::BuildTable(std::__cxx11::basic_string, std::allocator > const&, rocksdb::Env*, rocksdb::ImmutableCFOptions const&, rocksdb::MutableCFOptions const&, rocksdb::EnvOptions const&, rock sdb::TableCache*, rocksdb::InternalIterator*, std::unique_ptr >, rocksdb::FileMetaData*, rocksdb::InternalKeyComparator const&, std::vector >, std::allocator > > > co nst*, unsigned int, std::__cxx11::basic_string, std::allocator > const&, std::vector >, unsigned long, rocksdb::SnapshotChecker*, rocksdb::Compression Type, rocksdb::CompressionOptions const&, bool, rocksdb::InternalStats*, rocksdb::TableFileCreationReason, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*, int, unsigned long, unsigned long, rocksdb ::Env::WriteLifeTimeHint)+0x1e24) [0x5586b249ef94] 8: (rocksdb::DBImpl::WriteLevel0TableForRecovery(int, rocksdb::ColumnFamilyData*, rocksdb::MemTable*, rocksdb::VersionEdit*)+0xcb7) [0x5586b2321457] 9: (rocksdb::DBImpl::RecoverLogFiles(std::vector > const&, unsigned long*, bool)+0x19de) [0x5586b232373e] 10: (rocksdb::DBImpl::Recover(std::vector > const&, bool, bool, bool)+0x5d4) [0x5586b23242f4] 11: (rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string, std::allocator > const&, std::vector > const&, std::vector >*, rocksdb::DB**, bool)+0x68b) [0x5586b232559b] 12: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string, std::allocator > const&, std::vector const&, std::vector >*, rocksdb::DB**)+0x22) [0x5586b2326e72] 13: (RocksDBStore::do_open(std::ostream&, bool, std::vector > const*)+0x170c) [0x5586b220219c] 14: (BlueStore::_open_db(bool, bool)+0xd8e) [0x5586b218ee1e] 15: (BlueStore::_mount(bool, bool)+0x4b7) [0x5586b21bf807] 16: (OSD::init()+0x295) [0x5586b1d673c5] 17: (main()+0x268d) [0x5586b1c554ed] 18: (__libc_start_main()+0xe7) [0x7f1d5ea2db97] 19: (_start()+0x2a) [0x5586b1d1d7fa] NOTE: a copy 
of the executable, or `objdump -rdS ` is needed to interpret this.