Re: [ceph-users] OSD log being spammed with BlueStore stupidallocator dump

2018-10-16 Thread Wido den Hollander


On 10/16/18 11:32 AM, Igor Fedotov wrote:
> 
> 
> On 10/16/2018 6:57 AM, Wido den Hollander wrote:
>>
>> On 10/16/2018 12:04 AM, Igor Fedotov wrote:
>>> On 10/15/2018 11:47 PM, Wido den Hollander wrote:
 Hi,

 On 10/15/2018 10:43 PM, Igor Fedotov wrote:
> Hi Wido,
>
> once you apply the PR you'll probably see the initial error in the log
> that triggers the dump, which is most probably a lack of space
> reported by the _balance_bluefs_freespace() function. If so, this means
> that the BlueFS rebalance is unable to allocate a contiguous 1M chunk
> on the main device to gift to BlueFS, i.e. your main device's free
> space is very fragmented.
>
> Unfortunately I don't know of any way to recover from this state other
> than OSD redeployment or data removal.
>
 We are moving data away from these OSDs. Luckily we have HDD OSDs
 in there as well, and we are moving a lot of data there.

 How would re-deployment work? Just wipe the OSDs and bring them back
 into the cluster again? That would be a very painful task.. :-(
>>> There's a good chance that you'll face the same issue again one day.
>>> Maybe allocate some SSDs to serve as DB devices?
>> Maybe, but this is a very common use-case where people run WAL+DB+DATA
>> on a single SSD.
> Yeah, but I'd consider that a workaround until a better solution is
> provided.
> Before your case this high-fragmentation issue had been rather a
> theoretical concern with a very small probability.
> 

Understood. What I see now is that after offloading the RGW bucket
indexes from these OSDs there is free space.

The OSDs are now randomly going to 100% disk utilization, reading a
lot from their disks.

Cranking up the logging shows me:

2018-10-16 11:50:20.325674 7fa52b9a4700  5 bdev(0x5627cc185200
/var/lib/ceph/osd/ceph-118/block) read_random 0xe7547cd93c~f0b
2018-10-16 11:50:20.325836 7fa52b9a4700  5 bdev(0x5627cc185200
/var/lib/ceph/osd/ceph-118/block) read_random 0xe7547ce847~ef9
2018-10-16 11:50:20.325997 7fa52b9a4700  5 bdev(0x5627cc185200
/var/lib/ceph/osd/ceph-118/block) read_random 0xe7547cf740

BlueFS / bdev seem to be reading a great deal. From what I can tell they
are discarding OMAP data from RocksDB, which causes a lot of reads on BlueFS.
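
Side note: assuming these read_random lines use the usual "0xOFFSET~LENGTH"
extent notation seen elsewhere in these logs, each of those reads is only a
few KiB. A tiny standalone decoder, purely for illustration:

// Decode a "0xOFFSET~LENGTH" extent (both values in hex) from a bdev
// read_random log line. Illustrative only; the extent string below is
// taken from the log excerpt above.
#include <cstdint>
#include <iostream>
#include <string>

int main() {
  const std::string extent = "0xe7547cd93c~f0b";
  const auto tilde = extent.find('~');
  const uint64_t off = std::stoull(extent.substr(2, tilde - 2), nullptr, 16);
  const uint64_t len = std::stoull(extent.substr(tilde + 1), nullptr, 16);
  std::cout << "offset " << off << " bytes (~" << (off >> 30)
            << " GiB into the device), length " << len << " bytes\n";  // 0xf0b = 3851 bytes
}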

>> Now we are running into it, but aren't the chances high that other
>> people will run into it as well?
>>
> The upcoming PR that adds the ability to manipulate BlueFS volumes
> offline (https://github.com/ceph/ceph/pull/23103) will probably help
> to recover from this issue in the future by migrating BlueFS data to a
> new, larger DB volume (targeted for Nautilus; not sure about backporting
> to Mimic or Luminous).
>
> For now the only preventive measure I can suggest is to have a
> standalone DB volume that is large enough, so that the main device
> isn't used for the DB at all, or as little as possible. Then no
> rebalance is needed and no fragmentation occurs.
>
 I see, but these are SSD-only OSDs.

> BTW wondering if you have one for your OSDs? How large if so?
>
 The cluster consists of 96 OSDs, all Samsung PM863a 1.92TB SSDs.

 The fullest OSD currently is 78% full, which is 348GiB free on the
 1.75TiB device.

 Does this information help?
>>> Yeah, thanks for sharing.
>> Let me know if you need more!
> Additionally, I'm wondering if you know how much data (on average) has
> been written to these OSDs in total. I mean an aggregation over all
> writes (even ones that have already been removed), not the current usage.
> Usage patterns, object sizes etc. would be interesting as well.
>

These OSDs were migrated from FileStore to BlueStore in March 2018.

Using blkdiscard the SSDs (Samsung PM863a 1.92TB) were wiped and then
deployed with BlueStore.

They run:

- RBD (couple of pools)
- RGW indexes

The actual RGW data is on HDD OSDs in the same cluster.

The whole cluster does a steady 20k IOPS (R+W) during the day, I would
say: roughly 200MB/sec write and 100MB/sec read.

Suddenly OSDs started to grind to a halt last week and have been
flapping and showing slow requests ever since.

The RGW indexes have been offloaded to the HDD nodes in the meantime.

Wido

>>>
>> Wido
>>
 Thanks!

 Wido

> Everything above is IMO; there's some chance that I missed something.
>
>
> Thanks,
>
> Igor
>
>
> On 10/15/2018 10:12 PM, Wido den Hollander wrote:
>> On 10/15/2018 08:23 PM, Gregory Farnum wrote:
>>> I don't know anything about the BlueStore code, but given the
>>> snippets
>>> you've posted this appears to be a debug thing that doesn't expect
>>> to be
>>> invoked (or perhaps only in an unexpected case that it's trying
>>> hard to
>>> recover from). Have you checked where the dump() function is invoked
>>> from? I'd imagine it's something about having to try extra-hard to
>>> allocate free space or something.

Re: [ceph-users] OSD log being spammed with BlueStore stupidallocator dump

2018-10-16 Thread Igor Fedotov



On 10/16/2018 6:57 AM, Wido den Hollander wrote:


On 10/16/2018 12:04 AM, Igor Fedotov wrote:

On 10/15/2018 11:47 PM, Wido den Hollander wrote:

Hi,

On 10/15/2018 10:43 PM, Igor Fedotov wrote:

Hi Wido,

once you apply the PR you'll probably see the initial error in the log
that triggers the dump, which is most probably a lack of space
reported by the _balance_bluefs_freespace() function. If so, this means that
the BlueFS rebalance is unable to allocate a contiguous 1M chunk on the main
device to gift to BlueFS, i.e. your main device's free space is very
fragmented.

Unfortunately I don't know of any way to recover from this state other
than OSD redeployment or data removal.


We are moving data away from these OSDs. Luckily we have HDD OSDs
in there as well, and we are moving a lot of data there.

How would re-deployment work? Just wipe the OSDs and bring them back
into the cluster again? That would be a very painful task.. :-(

There's a good chance that you'll face the same issue again one day.
Maybe allocate some SSDs to serve as DB devices?

Maybe, but this is a very common use-case where people run WAL+DB+DATA
on a single SSD.
Yeah, but I'd consider that a workaround until a better solution is
provided.
Before your case this high-fragmentation issue had been rather a
theoretical concern with a very small probability.



Now we are running into it, but aren't the chances high that other people
will run into it as well?


The upcoming PR that adds the ability to manipulate BlueFS volumes
offline (https://github.com/ceph/ceph/pull/23103) will probably help
to recover from this issue in the future by migrating BlueFS data to a
new, larger DB volume (targeted for Nautilus; not sure about backporting
to Mimic or Luminous).

For now the only preventive measure I can suggest is to have a standalone
DB volume that is large enough, so that the main device isn't used for
the DB at all, or as little as possible. Then no rebalance is needed and
no fragmentation occurs.


I see, but these are SSD-only OSDs.


BTW wondering if you have one for your OSDs? How large if so?


The cluster consists of 96 OSDs, all Samsung PM863a 1.92TB SSDs.

The fullest OSD currently is 78% full, which is 348GiB free on the
1.75TiB device.

Does this information help?

Yeah, thanks for sharing.

Let me know if you need more!
Additionally, I'm wondering if you know how much data (on average) has
been written to these OSDs in total. I mean an aggregation over all
writes (even ones that have already been removed), not the current usage.

Usage patterns, object sizes etc. would be interesting as well.




Wido


Thanks!

Wido


Everything above is IMO; there's some chance that I missed something.


Thanks,

Igor


On 10/15/2018 10:12 PM, Wido den Hollander wrote:

On 10/15/2018 08:23 PM, Gregory Farnum wrote:

I don't know anything about the BlueStore code, but given the snippets
you've posted this appears to be a debug thing that doesn't expect
to be
invoked (or perhaps only in an unexpected case that it's trying
hard to
recover from). Have you checked where the dump() function is invoked
from? I'd imagine it's something about having to try extra-hard to
allocate free space or something.

It seems it is BlueFS that is having a hard time finding free space.

I'm trying this PR now: https://github.com/ceph/ceph/pull/24543

It will stop the spamming, but that's not the root cause. The OSDs in
this case are at max 80% full and they do have a lot of OMAP (RGW
indexes) in them, but that's all.

I'm however not sure why this is happening suddenly in this cluster.

Wido


-Greg

On Mon, Oct 15, 2018 at 10:02 AM Wido den Hollander <w...@42on.com> wrote:



   On 10/11/2018 12:08 AM, Wido den Hollander wrote:
   > Hi,
   >
   > On a Luminous cluster running a mix of 12.2.4, 12.2.5 and
12.2.8 I'm
   > seeing OSDs writing heavily to their logfiles spitting out
these
   lines:
   >
   >
   > 2018-10-10 21:52:04.019037 7f90c2f0f700  0 stupidalloc
   0x0x55828ae047d0
   > dump  0x15cd2078000~34000
   > 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc
   0x0x55828ae047d0
   > dump  0x15cd22cc000~24000
   > 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc
   0x0x55828ae047d0
   > dump  0x15cd230~2
   > 2018-10-10 21:52:04.019039 7f90c2f0f700  0 stupidalloc
   0x0x55828ae047d0
   > dump  0x15cd2324000~24000
   > 2018-10-10 21:52:04.019040 7f90c2f0f700  0 stupidalloc
   0x0x55828ae047d0
   > dump  0x15cd26c~24000
   > 2018-10-10 21:52:04.019041 7f90c2f0f700  0 stupidalloc
   0x0x55828ae047d0
   > dump  0x15cd2704000~3
   >
   > It goes so fast that the OS-disk in this case can't keep up
and become
   > 100% util.
   >
   > This causes the OSD to slow down and cause slow requests and
   starts to flap.
   >

   I've set 'log_file' to /dev/null for now, but that doesn't
solve it
   either. Randomly OSDs just start spitting out 

Re: [ceph-users] OSD log being spammed with BlueStore stupidallocator dump

2018-10-15 Thread Wido den Hollander


On 10/16/2018 12:04 AM, Igor Fedotov wrote:
> 
> On 10/15/2018 11:47 PM, Wido den Hollander wrote:
>> Hi,
>>
>> On 10/15/2018 10:43 PM, Igor Fedotov wrote:
>>> Hi Wido,
>>>
>>> once you apply the PR you'll probably see the initial error in the log
>>> that triggers the dump. Which is most probably the lack of space
>>> reported by _balance_bluefs_freespace() function. If so this means that
>>> BlueFS rebalance is unable to allocate contiguous 1M chunk at main
>>> device to gift to BlueFS. I.e. your main device space is very
>>> fragmented.
>>>
>>> Unfortunately I don't know any ways to recover from this state but OSD
>>> redeployment or data removal.
>>>
>> We are moving data away from these OSDs. Lucky us that we have HDD OSDs
>> in there as well, moving a lot of data there.
>>
>> How would re-deployment work? Just wipe the OSDs and bring them back
>> into the cluster again? That would be a very painful task.. :-(
> 
> Good chances that you'll face the same issue again one day.
> May be allocate some SSDs to serve as DB devices?

Maybe, but this is a very common use-case where people run WAL+DB+DATA
on a single SSD.

Now we are running into it, but aren't the chances high that other people
will run into it as well?

>>
>>> Upcoming PR that brings an ability for offline BlueFS volume
>>> manipulation (https://github.com/ceph/ceph/pull/23103) will probably
>>> help to recover from this issue in future by migrating BlueFS data to a
>>> new larger DB volume. (targeted for Nautilus, not sure about backporting
>>> to Mimic or Luminous).
>>>
>>> For now I can suggest the only preventive mean to avoid the case - have
>>> large enough space at your standalone DB volume. So that master device
>>> isn't used for DB at all or as minimum as possible. Hence no rebalance
>>> is needed and no fragmentation is present.
>>>
>> I see, but these are SSD-only OSDs.
>>
>>> BTW wondering if you have one for your OSDs? How large if so?
>>>
>> The cluster consists out of 96 OSDs with Samsung PM863a 1.92TB OSDs.
>>
>> The fullest OSD currently is 78% full, which is 348GiB free on the
>> 1.75TiB device.
>>
>> Does this information help?
> Yeah, thanks for sharing.

Let me know if you need more!

Wido

>>
>> Thanks!
>>
>> Wido
>>
>>> Everything above is "IMO", some chances that I missed something..
>>>
>>>
>>> Thanks,
>>>
>>> Igor
>>>
>>>
>>> On 10/15/2018 10:12 PM, Wido den Hollander wrote:
 On 10/15/2018 08:23 PM, Gregory Farnum wrote:
> I don't know anything about the BlueStore code, but given the snippets
> you've posted this appears to be a debug thing that doesn't expect
> to be
> invoked (or perhaps only in an unexpected case that it's trying
> hard to
> recover from). Have you checked where the dump() function is invoked
> from? I'd imagine it's something about having to try extra-hard to
> allocate free space or something.
 It seems BlueFS that is having a hard time finding free space.

 I'm trying this PR now: https://github.com/ceph/ceph/pull/24543

 It will stop the spamming, but that's not the root cause. The OSDs in
 this case are at max 80% full and they do have a lot of OMAP (RGW
 indexes) in them, but that's all.

 I'm however not sure why this is happening suddenly in this cluster.

 Wido

> -Greg
>
> On Mon, Oct 15, 2018 at 10:02 AM Wido den Hollander <w...@42on.com> wrote:
>
>
>
>   On 10/11/2018 12:08 AM, Wido den Hollander wrote:
>   > Hi,
>   >
>   > On a Luminous cluster running a mix of 12.2.4, 12.2.5 and
> 12.2.8 I'm
>   > seeing OSDs writing heavily to their logfiles spitting out
> these
>   lines:
>   >
>   >
>   > 2018-10-10 21:52:04.019037 7f90c2f0f700  0 stupidalloc
>   0x0x55828ae047d0
>   > dump  0x15cd2078000~34000
>   > 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc
>   0x0x55828ae047d0
>   > dump  0x15cd22cc000~24000
>   > 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc
>   0x0x55828ae047d0
>   > dump  0x15cd230~2
>   > 2018-10-10 21:52:04.019039 7f90c2f0f700  0 stupidalloc
>   0x0x55828ae047d0
>   > dump  0x15cd2324000~24000
>   > 2018-10-10 21:52:04.019040 7f90c2f0f700  0 stupidalloc
>   0x0x55828ae047d0
>   > dump  0x15cd26c~24000
>   > 2018-10-10 21:52:04.019041 7f90c2f0f700  0 stupidalloc
>   0x0x55828ae047d0
>   > dump  0x15cd2704000~3
>   >
>   > It goes so fast that the OS-disk in this case can't keep up
> and become
>   > 100% util.
>   >
>   > This causes the OSD to slow down and cause slow requests and
>   starts to flap.
>   >
>
>   I've set 'log_file' to /dev/null for now, but that doesn't
> solve it
>   either. Randomly OSDs just start 

Re: [ceph-users] OSD log being spammed with BlueStore stupidallocator dump

2018-10-15 Thread Igor Fedotov


On 10/15/2018 11:47 PM, Wido den Hollander wrote:

Hi,

On 10/15/2018 10:43 PM, Igor Fedotov wrote:

Hi Wido,

once you apply the PR you'll probably see the initial error in the log
that triggers the dump. Which is most probably the lack of space
reported by _balance_bluefs_freespace() function. If so this means that
BlueFS rebalance is unable to allocate contiguous 1M chunk at main
device to gift to BlueFS. I.e. your main device space is very fragmented.

Unfortunately I don't know any ways to recover from this state but OSD
redeployment or data removal.


We are moving data away from these OSDs. Lucky us that we have HDD OSDs
in there as well, moving a lot of data there.

How would re-deployment work? Just wipe the OSDs and bring them back
into the cluster again? That would be a very painful task.. :-(


There's a good chance that you'll face the same issue again one day.
Maybe allocate some SSDs to serve as DB devices?



Upcoming PR that brings an ability for offline BlueFS volume
manipulation (https://github.com/ceph/ceph/pull/23103) will probably
help to recover from this issue in future by migrating BlueFS data to a
new larger DB volume. (targeted for Nautilus, not sure about backporting
to Mimic or Luminous).

For now I can suggest the only preventive mean to avoid the case - have
large enough space at your standalone DB volume. So that master device
isn't used for DB at all or as minimum as possible. Hence no rebalance
is needed and no fragmentation is present.


I see, but these are SSD-only OSDs.


BTW wondering if you have one for your OSDs? How large if so?


The cluster consists out of 96 OSDs with Samsung PM863a 1.92TB OSDs.

The fullest OSD currently is 78% full, which is 348GiB free on the
1.75TiB device.

Does this information help?

Yeah, thanks for sharing.


Thanks!

Wido


Everything above is "IMO", some chances that I missed something..


Thanks,

Igor


On 10/15/2018 10:12 PM, Wido den Hollander wrote:

On 10/15/2018 08:23 PM, Gregory Farnum wrote:

I don't know anything about the BlueStore code, but given the snippets
you've posted this appears to be a debug thing that doesn't expect to be
invoked (or perhaps only in an unexpected case that it's trying hard to
recover from). Have you checked where the dump() function is invoked
from? I'd imagine it's something about having to try extra-hard to
allocate free space or something.

It seems BlueFS that is having a hard time finding free space.

I'm trying this PR now: https://github.com/ceph/ceph/pull/24543

It will stop the spamming, but that's not the root cause. The OSDs in
this case are at max 80% full and they do have a lot of OMAP (RGW
indexes) in them, but that's all.

I'm however not sure why this is happening suddenly in this cluster.

Wido


-Greg

On Mon, Oct 15, 2018 at 10:02 AM Wido den Hollander <w...@42on.com> wrote:



  On 10/11/2018 12:08 AM, Wido den Hollander wrote:
  > Hi,
  >
  > On a Luminous cluster running a mix of 12.2.4, 12.2.5 and
12.2.8 I'm
  > seeing OSDs writing heavily to their logfiles spitting out these
  lines:
  >
  >
  > 2018-10-10 21:52:04.019037 7f90c2f0f700  0 stupidalloc
  0x0x55828ae047d0
  > dump  0x15cd2078000~34000
  > 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc
  0x0x55828ae047d0
  > dump  0x15cd22cc000~24000
  > 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc
  0x0x55828ae047d0
  > dump  0x15cd230~2
  > 2018-10-10 21:52:04.019039 7f90c2f0f700  0 stupidalloc
  0x0x55828ae047d0
  > dump  0x15cd2324000~24000
  > 2018-10-10 21:52:04.019040 7f90c2f0f700  0 stupidalloc
  0x0x55828ae047d0
  > dump  0x15cd26c~24000
  > 2018-10-10 21:52:04.019041 7f90c2f0f700  0 stupidalloc
  0x0x55828ae047d0
  > dump  0x15cd2704000~3
  >
  > It goes so fast that the OS-disk in this case can't keep up
and become
  > 100% util.
  >
  > This causes the OSD to slow down and cause slow requests and
  starts to flap.
  >

  I've set 'log_file' to /dev/null for now, but that doesn't solve it
  either. Randomly OSDs just start spitting out slow requests and
have
  these issues.

  Any suggestions on how to fix this?

  Wido

  > It seems that this is *only* happening on OSDs which are the
fullest
  > (~85%) on this cluster and they have about ~400 PGs each (Yes,
I know,
  > that's high).
  >
  > Looking at StupidAllocator.cc I see this piece of code:
  >
  > void StupidAllocator::dump()
  > {
  >   std::lock_guard l(lock);
  >   for (unsigned bin = 0; bin < free.size(); ++bin) {
  >     ldout(cct, 0) << __func__ << " free bin " << bin << ": "
  >                   << free[bin].num_intervals() << " extents"
<< dendl;
  >     for (auto p = free[bin].begin();
  >          p != free[bin].end();
  >          ++p) {
  >       ldout(cct, 0) << __func__ << "  0x" << std::hex <<

Re: [ceph-users] OSD log being spammed with BlueStore stupidallocator dump

2018-10-15 Thread Wido den Hollander
Hi,

On 10/15/2018 10:43 PM, Igor Fedotov wrote:
> Hi Wido,
> 
> once you apply the PR you'll probably see the initial error in the log
> that triggers the dump. Which is most probably the lack of space
> reported by _balance_bluefs_freespace() function. If so this means that
> BlueFS rebalance is unable to allocate contiguous 1M chunk at main
> device to gift to BlueFS. I.e. your main device space is very fragmented.
> 
> Unfortunately I don't know any ways to recover from this state but OSD
> redeployment or data removal.
> 

We are moving data away from these OSDs. Luckily we have HDD OSDs
in there as well, and we are moving a lot of data there.

How would re-deployment work? Just wipe the OSDs and bring them back
into the cluster again? That would be a very painful task.. :-(

> Upcoming PR that brings an ability for offline BlueFS volume
> manipulation (https://github.com/ceph/ceph/pull/23103) will probably
> help to recover from this issue in future by migrating BlueFS data to a
> new larger DB volume. (targeted for Nautilus, not sure about backporting
> to Mimic or Luminous).
> 
> For now I can suggest the only preventive mean to avoid the case - have
> large enough space at your standalone DB volume. So that master device
> isn't used for DB at all or as minimum as possible. Hence no rebalance
> is needed and no fragmentation is present.
> 

I see, but these are SSD-only OSDs.

> BTW wondering if you have one for your OSDs? How large if so?
> 

The cluster consists of 96 OSDs, all Samsung PM863a 1.92TB SSDs.

The fullest OSD currently is 78% full, which is 348GiB free on the
1.75TiB device.

Does this information help?

Thanks!

Wido

> Everything above is "IMO", some chances that I missed something..
> 
> 
> Thanks,
> 
> Igor
> 
> 
> On 10/15/2018 10:12 PM, Wido den Hollander wrote:
>>
>> On 10/15/2018 08:23 PM, Gregory Farnum wrote:
>>> I don't know anything about the BlueStore code, but given the snippets
>>> you've posted this appears to be a debug thing that doesn't expect to be
>>> invoked (or perhaps only in an unexpected case that it's trying hard to
>>> recover from). Have you checked where the dump() function is invoked
>>> from? I'd imagine it's something about having to try extra-hard to
>>> allocate free space or something.
>> It seems BlueFS that is having a hard time finding free space.
>>
>> I'm trying this PR now: https://github.com/ceph/ceph/pull/24543
>>
>> It will stop the spamming, but that's not the root cause. The OSDs in
>> this case are at max 80% full and they do have a lot of OMAP (RGW
>> indexes) in them, but that's all.
>>
>> I'm however not sure why this is happening suddenly in this cluster.
>>
>> Wido
>>
>>> -Greg
>>>
>>> On Mon, Oct 15, 2018 at 10:02 AM Wido den Hollander <w...@42on.com> wrote:
>>>
>>>
>>>
>>>  On 10/11/2018 12:08 AM, Wido den Hollander wrote:
>>>  > Hi,
>>>  >
>>>  > On a Luminous cluster running a mix of 12.2.4, 12.2.5 and
>>> 12.2.8 I'm
>>>  > seeing OSDs writing heavily to their logfiles spitting out these
>>>  lines:
>>>  >
>>>  >
>>>  > 2018-10-10 21:52:04.019037 7f90c2f0f700  0 stupidalloc
>>>  0x0x55828ae047d0
>>>  > dump  0x15cd2078000~34000
>>>  > 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc
>>>  0x0x55828ae047d0
>>>  > dump  0x15cd22cc000~24000
>>>  > 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc
>>>  0x0x55828ae047d0
>>>  > dump  0x15cd230~2
>>>  > 2018-10-10 21:52:04.019039 7f90c2f0f700  0 stupidalloc
>>>  0x0x55828ae047d0
>>>  > dump  0x15cd2324000~24000
>>>  > 2018-10-10 21:52:04.019040 7f90c2f0f700  0 stupidalloc
>>>  0x0x55828ae047d0
>>>  > dump  0x15cd26c~24000
>>>  > 2018-10-10 21:52:04.019041 7f90c2f0f700  0 stupidalloc
>>>  0x0x55828ae047d0
>>>  > dump  0x15cd2704000~3
>>>  >
>>>  > It goes so fast that the OS-disk in this case can't keep up
>>> and become
>>>  > 100% util.
>>>  >
>>>  > This causes the OSD to slow down and cause slow requests and
>>>  starts to flap.
>>>  >
>>>
>>>  I've set 'log_file' to /dev/null for now, but that doesn't solve it
>>>  either. Randomly OSDs just start spitting out slow requests and
>>> have
>>>  these issues.
>>>
>>>  Any suggestions on how to fix this?
>>>
>>>  Wido
>>>
>>>  > It seems that this is *only* happening on OSDs which are the
>>> fullest
>>>  > (~85%) on this cluster and they have about ~400 PGs each (Yes,
>>> I know,
>>>  > that's high).
>>>  >
>>>  > Looking at StupidAllocator.cc I see this piece of code:
>>>  >
>>>  > void StupidAllocator::dump()
>>>  > {
>>>  >   std::lock_guard l(lock);
>>>  >   for (unsigned bin = 0; bin < free.size(); ++bin) {
>>>  >     ldout(cct, 0) << __func__ << " free bin " << bin << ": "
>>>  >                   << free[bin].num_intervals() << " extents"
>>> << dendl;
>>>  >     for (auto p = 

Re: [ceph-users] OSD log being spammed with BlueStore stupidallocator dump

2018-10-15 Thread Igor Fedotov

Hi Wido,

once you apply the PR you'll probably see the initial error in the log
that triggers the dump, which is most probably a lack of space
reported by the _balance_bluefs_freespace() function. If so, this means that
the BlueFS rebalance is unable to allocate a contiguous 1M chunk on the main
device to gift to BlueFS, i.e. your main device's free space is very fragmented.
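
To make that failure mode concrete, here is a small standalone sketch (not
Ceph code; the extent sizes are invented for illustration) of a device with
plenty of free space in total but no single free extent large enough for a
contiguous 1M gift to BlueFS:

// Illustrative only: lots of total free space, yet the largest contiguous
// free extent is smaller than the 1M chunk BlueFS asks for.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <map>

int main() {
  const uint64_t want = 1ull << 20;            // 1M contiguous chunk for BlueFS
  std::map<uint64_t, uint64_t> free_extents;   // offset -> length

  // Assume a heavily fragmented main device: a 64K free hole every 1M.
  for (uint64_t off = 0; off < (100ull << 20); off += (1ull << 20))
    free_extents[off] = 64 * 1024;

  uint64_t total = 0, largest = 0;
  for (const auto& [off, len] : free_extents) {
    total += len;
    largest = std::max(largest, len);
  }

  std::cout << "total free: " << (total >> 20) << " MiB, largest extent: "
            << (largest >> 10) << " KiB\n";
  if (largest < want)
    std::cout << "no contiguous 1M extent available to gift to BlueFS\n";
}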


Unfortunately I don't know of any way to recover from this state other
than OSD redeployment or data removal.


The upcoming PR that adds the ability to manipulate BlueFS volumes
offline (https://github.com/ceph/ceph/pull/23103) will probably help
to recover from this issue in the future by migrating BlueFS data to a
new, larger DB volume (targeted for Nautilus; not sure about backporting
to Mimic or Luminous).


For now the only preventive measure I can suggest is to have a standalone
DB volume that is large enough, so that the main device isn't used for
the DB at all, or as little as possible. Then no rebalance is needed and
no fragmentation occurs.


BTW, I'm wondering if you have one (a standalone DB volume) for your OSDs? How large is it, if so?

Everything above is IMO; there's some chance that I missed something.


Thanks,

Igor


On 10/15/2018 10:12 PM, Wido den Hollander wrote:


On 10/15/2018 08:23 PM, Gregory Farnum wrote:

I don't know anything about the BlueStore code, but given the snippets
you've posted this appears to be a debug thing that doesn't expect to be
invoked (or perhaps only in an unexpected case that it's trying hard to
recover from). Have you checked where the dump() function is invoked
from? I'd imagine it's something about having to try extra-hard to
allocate free space or something.

It seems BlueFS that is having a hard time finding free space.

I'm trying this PR now: https://github.com/ceph/ceph/pull/24543

It will stop the spamming, but that's not the root cause. The OSDs in
this case are at max 80% full and they do have a lot of OMAP (RGW
indexes) in them, but that's all.

I'm however not sure why this is happening suddenly in this cluster.

Wido


-Greg

On Mon, Oct 15, 2018 at 10:02 AM Wido den Hollander <w...@42on.com> wrote:



 On 10/11/2018 12:08 AM, Wido den Hollander wrote:
 > Hi,
 >
 > On a Luminous cluster running a mix of 12.2.4, 12.2.5 and 12.2.8 I'm
 > seeing OSDs writing heavily to their logfiles spitting out these
 lines:
 >
 >
 > 2018-10-10 21:52:04.019037 7f90c2f0f700  0 stupidalloc
 0x0x55828ae047d0
 > dump  0x15cd2078000~34000
 > 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc
 0x0x55828ae047d0
 > dump  0x15cd22cc000~24000
 > 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc
 0x0x55828ae047d0
 > dump  0x15cd230~2
 > 2018-10-10 21:52:04.019039 7f90c2f0f700  0 stupidalloc
 0x0x55828ae047d0
 > dump  0x15cd2324000~24000
 > 2018-10-10 21:52:04.019040 7f90c2f0f700  0 stupidalloc
 0x0x55828ae047d0
 > dump  0x15cd26c~24000
 > 2018-10-10 21:52:04.019041 7f90c2f0f700  0 stupidalloc
 0x0x55828ae047d0
 > dump  0x15cd2704000~3
 >
 > It goes so fast that the OS-disk in this case can't keep up and become
 > 100% util.
 >
 > This causes the OSD to slow down and cause slow requests and
 starts to flap.
 >

 I've set 'log_file' to /dev/null for now, but that doesn't solve it
 either. Randomly OSDs just start spitting out slow requests and have
 these issues.

 Any suggestions on how to fix this?

 Wido

 > It seems that this is *only* happening on OSDs which are the fullest
 > (~85%) on this cluster and they have about ~400 PGs each (Yes, I know,
 > that's high).
 >
 > Looking at StupidAllocator.cc I see this piece of code:
 >
 > void StupidAllocator::dump()
 > {
 >   std::lock_guard l(lock);
 >   for (unsigned bin = 0; bin < free.size(); ++bin) {
 >     ldout(cct, 0) << __func__ << " free bin " << bin << ": "
 >                   << free[bin].num_intervals() << " extents" << dendl;
 >     for (auto p = free[bin].begin();
 >          p != free[bin].end();
 >          ++p) {
 >       ldout(cct, 0) << __func__ << "  0x" << std::hex << p.get_start()
 > << "~"
 >                     << p.get_len() << std::dec << dendl;
 >     }
 >   }
 > }
 >
 > I'm just wondering why it would spit out these lines and what's
 causing it.
 >
 > Has anybody seen this before?
 >
 > Wido
 > ___
 > ceph-users mailing list
 > ceph-users@lists.ceph.com 
 > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 >
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com 
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users 

Re: [ceph-users] OSD log being spammed with BlueStore stupidallocator dump

2018-10-15 Thread Wido den Hollander


On 10/15/2018 08:23 PM, Gregory Farnum wrote:
> I don't know anything about the BlueStore code, but given the snippets
> you've posted this appears to be a debug thing that doesn't expect to be
> invoked (or perhaps only in an unexpected case that it's trying hard to
> recover from). Have you checked where the dump() function is invoked
> from? I'd imagine it's something about having to try extra-hard to
> allocate free space or something.

It seems it is BlueFS that is having a hard time finding free space.

I'm trying this PR now: https://github.com/ceph/ceph/pull/24543

It will stop the spamming, but that's not the root cause. The OSDs in
this case are at most 80% full, and they do have a lot of OMAP (RGW
indexes) in them, but that's all.
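
As an aside, one generic way to stop this kind of log spam is to rate-limit
the dump call itself. The sketch below only illustrates that idea under
assumed parameters; it is not a description of what PR 24543 actually changes:

// Generic rate-limiting guard: invoke an expensive/noisy dump at most once
// per interval, no matter how often the failure path is hit.
#include <chrono>
#include <iostream>

struct RateLimitedDump {
  std::chrono::steady_clock::time_point last{};
  std::chrono::seconds min_interval{600};   // assumed: at most one dump per 10 minutes

  template <typename Fn>
  void maybe_dump(Fn&& dump_fn) {
    const auto now = std::chrono::steady_clock::now();
    if (last == std::chrono::steady_clock::time_point{} || now - last >= min_interval) {
      dump_fn();
      last = now;
    }
  }
};

int main() {
  RateLimitedDump guard;
  for (int i = 0; i < 5; ++i)
    guard.maybe_dump([] { std::cout << "allocator dump\n"; });  // prints only once
}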

I'm however not sure why this is happening suddenly in this cluster.

Wido

> -Greg
> 
> On Mon, Oct 15, 2018 at 10:02 AM Wido den Hollander <w...@42on.com> wrote:
> 
> 
> 
> On 10/11/2018 12:08 AM, Wido den Hollander wrote:
> > Hi,
> >
> > On a Luminous cluster running a mix of 12.2.4, 12.2.5 and 12.2.8 I'm
> > seeing OSDs writing heavily to their logfiles spitting out these
> lines:
> >
> >
> > 2018-10-10 21:52:04.019037 7f90c2f0f700  0 stupidalloc
> 0x0x55828ae047d0
> > dump  0x15cd2078000~34000
> > 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc
> 0x0x55828ae047d0
> > dump  0x15cd22cc000~24000
> > 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc
> 0x0x55828ae047d0
> > dump  0x15cd230~2
> > 2018-10-10 21:52:04.019039 7f90c2f0f700  0 stupidalloc
> 0x0x55828ae047d0
> > dump  0x15cd2324000~24000
> > 2018-10-10 21:52:04.019040 7f90c2f0f700  0 stupidalloc
> 0x0x55828ae047d0
> > dump  0x15cd26c~24000
> > 2018-10-10 21:52:04.019041 7f90c2f0f700  0 stupidalloc
> 0x0x55828ae047d0
> > dump  0x15cd2704000~3
> >
> > It goes so fast that the OS-disk in this case can't keep up and become
> > 100% util.
> >
> > This causes the OSD to slow down and cause slow requests and
> starts to flap.
> >
> 
> I've set 'log_file' to /dev/null for now, but that doesn't solve it
> either. Randomly OSDs just start spitting out slow requests and have
> these issues.
> 
> Any suggestions on how to fix this?
> 
> Wido
> 
> > It seems that this is *only* happening on OSDs which are the fullest
> > (~85%) on this cluster and they have about ~400 PGs each (Yes, I know,
> > that's high).
> >
> > Looking at StupidAllocator.cc I see this piece of code:
> >
> > void StupidAllocator::dump()
> > {
> >   std::lock_guard l(lock);
> >   for (unsigned bin = 0; bin < free.size(); ++bin) {
> >     ldout(cct, 0) << __func__ << " free bin " << bin << ": "
> >                   << free[bin].num_intervals() << " extents" << dendl;
> >     for (auto p = free[bin].begin();
> >          p != free[bin].end();
> >          ++p) {
> >       ldout(cct, 0) << __func__ << "  0x" << std::hex << p.get_start()
> > << "~"
> >                     << p.get_len() << std::dec << dendl;
> >     }
> >   }
> > }
> >
> > I'm just wondering why it would spit out these lines and what's
> causing it.
> >
> > Has anybody seen this before?
> >
> > Wido
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD log being spammed with BlueStore stupidallocator dump

2018-10-15 Thread Gregory Farnum
I don't know anything about the BlueStore code, but given the snippets
you've posted this appears to be a debug thing that doesn't expect to be
invoked (or perhaps only in an unexpected case that it's trying hard to
recover from). Have you checked where the dump() function is invoked from?
I'd imagine it's something about having to try extra-hard to allocate free
space or something.
-Greg

On Mon, Oct 15, 2018 at 10:02 AM Wido den Hollander  wrote:

>
>
> On 10/11/2018 12:08 AM, Wido den Hollander wrote:
> > Hi,
> >
> > On a Luminous cluster running a mix of 12.2.4, 12.2.5 and 12.2.8 I'm
> > seeing OSDs writing heavily to their logfiles spitting out these lines:
> >
> >
> > 2018-10-10 21:52:04.019037 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
> > dump  0x15cd2078000~34000
> > 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
> > dump  0x15cd22cc000~24000
> > 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
> > dump  0x15cd230~2
> > 2018-10-10 21:52:04.019039 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
> > dump  0x15cd2324000~24000
> > 2018-10-10 21:52:04.019040 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
> > dump  0x15cd26c~24000
> > 2018-10-10 21:52:04.019041 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
> > dump  0x15cd2704000~3
> >
> > It goes so fast that the OS-disk in this case can't keep up and become
> > 100% util.
> >
> > This causes the OSD to slow down and cause slow requests and starts to
> flap.
> >
>
> I've set 'log_file' to /dev/null for now, but that doesn't solve it
> either. Randomly OSDs just start spitting out slow requests and have
> these issues.
>
> Any suggestions on how to fix this?
>
> Wido
>
> > It seems that this is *only* happening on OSDs which are the fullest
> > (~85%) on this cluster and they have about ~400 PGs each (Yes, I know,
> > that's high).
> >
> > Looking at StupidAllocator.cc I see this piece of code:
> >
> > void StupidAllocator::dump()
> > {
> >   std::lock_guard l(lock);
> >   for (unsigned bin = 0; bin < free.size(); ++bin) {
> > ldout(cct, 0) << __func__ << " free bin " << bin << ": "
> >   << free[bin].num_intervals() << " extents" << dendl;
> > for (auto p = free[bin].begin();
> >  p != free[bin].end();
> >  ++p) {
> >   ldout(cct, 0) << __func__ << "  0x" << std::hex << p.get_start()
> > << "~"
> > << p.get_len() << std::dec << dendl;
> > }
> >   }
> > }
> >
> > I'm just wondering why it would spit out these lines and what's causing
> it.
> >
> > Has anybody seen this before?
> >
> > Wido
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD log being spammed with BlueStore stupidallocator dump

2018-10-15 Thread Wido den Hollander



On 10/11/2018 12:08 AM, Wido den Hollander wrote:
> Hi,
> 
> On a Luminous cluster running a mix of 12.2.4, 12.2.5 and 12.2.8 I'm
> seeing OSDs writing heavily to their logfiles spitting out these lines:
> 
> 
> 2018-10-10 21:52:04.019037 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
> dump  0x15cd2078000~34000
> 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
> dump  0x15cd22cc000~24000
> 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
> dump  0x15cd230~2
> 2018-10-10 21:52:04.019039 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
> dump  0x15cd2324000~24000
> 2018-10-10 21:52:04.019040 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
> dump  0x15cd26c~24000
> 2018-10-10 21:52:04.019041 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
> dump  0x15cd2704000~3
> 
> It goes so fast that the OS-disk in this case can't keep up and become
> 100% util.
> 
> This causes the OSD to slow down and cause slow requests and starts to flap.
> 

I've set 'log_file' to /dev/null for now, but that doesn't solve it
either. Randomly OSDs just start spitting out slow requests and have
these issues.

Any suggestions on how to fix this?

Wido

> It seems that this is *only* happening on OSDs which are the fullest
> (~85%) on this cluster and they have about ~400 PGs each (Yes, I know,
> that's high).
> 
> Looking at StupidAllocator.cc I see this piece of code:
> 
> void StupidAllocator::dump()
> {
>   std::lock_guard l(lock);
>   for (unsigned bin = 0; bin < free.size(); ++bin) {
> ldout(cct, 0) << __func__ << " free bin " << bin << ": "
>   << free[bin].num_intervals() << " extents" << dendl;
> for (auto p = free[bin].begin();
>  p != free[bin].end();
>  ++p) {
>   ldout(cct, 0) << __func__ << "  0x" << std::hex << p.get_start()
> << "~"
> << p.get_len() << std::dec << dendl;
> }
>   }
> }
> 
> I'm just wondering why it would spit out these lines and what's causing it.
> 
> Has anybody seen this before?
> 
> Wido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD log being spammed with BlueStore stupidallocator dump

2018-10-11 Thread Wido den Hollander


On 10/11/2018 03:12 AM, David Turner wrote:
> Not a resolution, but an idea that you've probably thought of. 
> Disabling logging on any affected OSDs (possibly just all of them) seems
> like a needed step to be able to keep working with this cluster to
> finish the upgrade and get it healthier.
> 

Thanks for the tip! But I wouldn't know how to silence the stupidalloc.

debug_osd = 0/0
debug_bluefs = 0/0
debug_bluestore = 0/0
debug_bdev = 0/0

It's all set, but the logs keep coming.
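
A plausible explanation (an inference from the dump() code quoted further
down in the thread, not something confirmed here) is that the allocator dump
is emitted via ldout(cct, 0), i.e. at log level 0, and level-0 messages
survive even a 0/0 setting. A simplified sketch of that gather rule:

// Simplified sketch (an assumption about the general logging behaviour, not
// the exact Ceph implementation): a message is kept when its level is <=
// the configured debug level, so a level-0 dump survives "0/0".
#include <iostream>

bool should_log(int configured_level, int msg_level) {
  return msg_level <= configured_level;
}

int main() {
  const int debug_bluestore = 0;                        // "debug_bluestore = 0/0"
  std::cout << std::boolalpha
            << should_log(debug_bluestore, 0)  << "\n"  // true: ldout(cct, 0) dump still printed
            << should_log(debug_bluestore, 10) << "\n"; // false: verbose messages suppressed
}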

Right now we have found a workaround by offloading some data from these
OSDs. The cluster is a mix of SSDs and HDDs, and the problem is with the
SSD OSDs, so we moved a pool from SSD to HDD and that seems to have fixed
the problem for now.

But it will probably come back as soon as some OSDs go above 80%.

Wido

> On Wed, Oct 10, 2018 at 6:37 PM Wido den Hollander <w...@42on.com> wrote:
> 
> 
> 
> On 10/11/2018 12:08 AM, Wido den Hollander wrote:
> > Hi,
> >
> > On a Luminous cluster running a mix of 12.2.4, 12.2.5 and 12.2.8 I'm
> > seeing OSDs writing heavily to their logfiles spitting out these
> lines:
> >
> >
> > 2018-10-10 21:52:04.019037 7f90c2f0f700  0 stupidalloc
> 0x0x55828ae047d0
> > dump  0x15cd2078000~34000
> > 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc
> 0x0x55828ae047d0
> > dump  0x15cd22cc000~24000
> > 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc
> 0x0x55828ae047d0
> > dump  0x15cd230~2
> > 2018-10-10 21:52:04.019039 7f90c2f0f700  0 stupidalloc
> 0x0x55828ae047d0
> > dump  0x15cd2324000~24000
> > 2018-10-10 21:52:04.019040 7f90c2f0f700  0 stupidalloc
> 0x0x55828ae047d0
> > dump  0x15cd26c~24000
> > 2018-10-10 21:52:04.019041 7f90c2f0f700  0 stupidalloc
> 0x0x55828ae047d0
> > dump  0x15cd2704000~3
> >
> > It goes so fast that the OS-disk in this case can't keep up and become
> > 100% util.
> >
> > This causes the OSD to slow down and cause slow requests and
> starts to flap.
> >
> > It seems that this is *only* happening on OSDs which are the fullest
> > (~85%) on this cluster and they have about ~400 PGs each (Yes, I know,
> > that's high).
> >
> 
> After some searching I stumbled upon this Bugzilla report:
> https://bugzilla.redhat.com/show_bug.cgi?id=1600138
> 
> That seems to be the same issue, although I'm not 100% sure.
> 
> Wido
> 
> > Looking at StupidAllocator.cc I see this piece of code:
> >
> > void StupidAllocator::dump()
> > {
> >   std::lock_guard l(lock);
> >   for (unsigned bin = 0; bin < free.size(); ++bin) {
> >     ldout(cct, 0) << __func__ << " free bin " << bin << ": "
> >                   << free[bin].num_intervals() << " extents" << dendl;
> >     for (auto p = free[bin].begin();
> >          p != free[bin].end();
> >          ++p) {
> >       ldout(cct, 0) << __func__ << "  0x" << std::hex << p.get_start()
> > << "~"
> >                     << p.get_len() << std::dec << dendl;
> >     }
> >   }
> > }
> >
> > I'm just wondering why it would spit out these lines and what's
> causing it.
> >
> > Has anybody seen this before?
> >
> > Wido
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD log being spammed with BlueStore stupidallocator dump

2018-10-10 Thread David Turner
Not a resolution, but an idea that you've probably thought of.  Disabling
logging on any affected OSDs (possibly just all of them) seems like a
needed step to be able to keep working with this cluster to finish the
upgrade and get it healthier.

On Wed, Oct 10, 2018 at 6:37 PM Wido den Hollander  wrote:

>
>
> On 10/11/2018 12:08 AM, Wido den Hollander wrote:
> > Hi,
> >
> > On a Luminous cluster running a mix of 12.2.4, 12.2.5 and 12.2.8 I'm
> > seeing OSDs writing heavily to their logfiles spitting out these lines:
> >
> >
> > 2018-10-10 21:52:04.019037 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
> > dump  0x15cd2078000~34000
> > 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
> > dump  0x15cd22cc000~24000
> > 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
> > dump  0x15cd230~2
> > 2018-10-10 21:52:04.019039 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
> > dump  0x15cd2324000~24000
> > 2018-10-10 21:52:04.019040 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
> > dump  0x15cd26c~24000
> > 2018-10-10 21:52:04.019041 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
> > dump  0x15cd2704000~3
> >
> > It goes so fast that the OS-disk in this case can't keep up and become
> > 100% util.
> >
> > This causes the OSD to slow down and cause slow requests and starts to
> flap.
> >
> > It seems that this is *only* happening on OSDs which are the fullest
> > (~85%) on this cluster and they have about ~400 PGs each (Yes, I know,
> > that's high).
> >
>
> After some searching I stumbled upon this Bugzilla report:
> https://bugzilla.redhat.com/show_bug.cgi?id=1600138
>
> That seems to be the same issue, although I'm not 100% sure.
>
> Wido
>
> > Looking at StupidAllocator.cc I see this piece of code:
> >
> > void StupidAllocator::dump()
> > {
> >   std::lock_guard l(lock);
> >   for (unsigned bin = 0; bin < free.size(); ++bin) {
> > ldout(cct, 0) << __func__ << " free bin " << bin << ": "
> >   << free[bin].num_intervals() << " extents" << dendl;
> > for (auto p = free[bin].begin();
> >  p != free[bin].end();
> >  ++p) {
> >   ldout(cct, 0) << __func__ << "  0x" << std::hex << p.get_start()
> > << "~"
> > << p.get_len() << std::dec << dendl;
> > }
> >   }
> > }
> >
> > I'm just wondering why it would spit out these lines and what's causing
> it.
> >
> > Has anybody seen this before?
> >
> > Wido
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD log being spammed with BlueStore stupidallocator dump

2018-10-10 Thread Wido den Hollander



On 10/11/2018 12:08 AM, Wido den Hollander wrote:
> Hi,
> 
> On a Luminous cluster running a mix of 12.2.4, 12.2.5 and 12.2.8 I'm
> seeing OSDs writing heavily to their logfiles spitting out these lines:
> 
> 
> 2018-10-10 21:52:04.019037 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
> dump  0x15cd2078000~34000
> 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
> dump  0x15cd22cc000~24000
> 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
> dump  0x15cd230~2
> 2018-10-10 21:52:04.019039 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
> dump  0x15cd2324000~24000
> 2018-10-10 21:52:04.019040 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
> dump  0x15cd26c~24000
> 2018-10-10 21:52:04.019041 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
> dump  0x15cd2704000~3
> 
> It goes so fast that the OS-disk in this case can't keep up and become
> 100% util.
> 
> This causes the OSD to slow down and cause slow requests and starts to flap.
> 
> It seems that this is *only* happening on OSDs which are the fullest
> (~85%) on this cluster and they have about ~400 PGs each (Yes, I know,
> that's high).
> 

After some searching I stumbled upon this Bugzilla report:
https://bugzilla.redhat.com/show_bug.cgi?id=1600138

That seems to be the same issue, although I'm not 100% sure.

Wido

> Looking at StupidAllocator.cc I see this piece of code:
> 
> void StupidAllocator::dump()
> {
>   std::lock_guard l(lock);
>   for (unsigned bin = 0; bin < free.size(); ++bin) {
> ldout(cct, 0) << __func__ << " free bin " << bin << ": "
>   << free[bin].num_intervals() << " extents" << dendl;
> for (auto p = free[bin].begin();
>  p != free[bin].end();
>  ++p) {
>   ldout(cct, 0) << __func__ << "  0x" << std::hex << p.get_start()
> << "~"
> << p.get_len() << std::dec << dendl;
> }
>   }
> }
> 
> I'm just wondering why it would spit out these lines and what's causing it.
> 
> Has anybody seen this before?
> 
> Wido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD log being spammed with BlueStore stupidallocator dump

2018-10-10 Thread Wido den Hollander
Hi,

On a Luminous cluster running a mix of 12.2.4, 12.2.5 and 12.2.8 I'm
seeing OSDs writing heavily to their logfiles spitting out these lines:


2018-10-10 21:52:04.019037 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
dump  0x15cd2078000~34000
2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
dump  0x15cd22cc000~24000
2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
dump  0x15cd230~2
2018-10-10 21:52:04.019039 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
dump  0x15cd2324000~24000
2018-10-10 21:52:04.019040 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
dump  0x15cd26c~24000
2018-10-10 21:52:04.019041 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
dump  0x15cd2704000~3

It goes so fast that the OS disk in this case can't keep up and becomes
100% utilized.

This causes the OSD to slow down, produce slow requests, and start to flap.

It seems that this is *only* happening on the OSDs which are the fullest
(~85%) in this cluster, and they have about ~400 PGs each (yes, I know,
that's high).

Looking at StupidAllocator.cc I see this piece of code:

void StupidAllocator::dump()
{
  std::lock_guard<std::mutex> l(lock);
  for (unsigned bin = 0; bin < free.size(); ++bin) {
ldout(cct, 0) << __func__ << " free bin " << bin << ": "
  << free[bin].num_intervals() << " extents" << dendl;
for (auto p = free[bin].begin();
 p != free[bin].end();
 ++p) {
  ldout(cct, 0) << __func__ << "  0x" << std::hex << p.get_start()
<< "~"
<< p.get_len() << std::dec << dendl;
}
  }
}

I'm just wondering why it would spit out these lines and what's causing it.

Has anybody seen this before?
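
(As the replies above explain, the dump appears to fire when an allocation
for BlueFS falls short. Purely as a toy illustration of that pattern, and not
the actual Ceph call site, such a path could look like this:)

// Toy illustration (not Ceph source): an allocator that, when it cannot
// satisfy a contiguous request, dumps its whole free-extent map, which is
// what produces the "stupidalloc ... dump 0xOFFSET~LENGTH" lines above.
#include <cstdint>
#include <iostream>
#include <utility>
#include <vector>

struct ToyAllocator {
  std::vector<std::pair<uint64_t, uint64_t>> free_extents;  // offset, length

  // Return the length allocated contiguously (0 if no extent is big enough).
  uint64_t allocate(uint64_t want) {
    for (auto& ext : free_extents)
      if (ext.second >= want) { ext.second -= want; return want; }
    return 0;
  }

  void dump() const {
    for (const auto& ext : free_extents)
      std::cout << "dump  0x" << std::hex << ext.first << "~" << ext.second
                << std::dec << "\n";
  }
};

int main() {
  // A fragmented free list: several small extents, none reaching 1M.
  ToyAllocator alloc{{{0x15cd2078000, 0x34000}, {0x15cd22cc000, 0x24000}}};
  const uint64_t gift = 1ull << 20;  // the contiguous 1M chunk BlueFS wants
  if (alloc.allocate(gift) < gift) {
    std::cout << "allocation fell short; dumping free extents:\n";
    alloc.dump();  // the real OSD logs this at level 0, hence the spam
  }
}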

Wido
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com