Re: [ceph-users] OSDs crash after deleting unfound object in Luminous 12.2.8

2018-10-18 Thread Mike Lovell
re-adding the list.

i'm glad to hear you got things back to a working state. one thing you
might want to check is the hit_set_history in the pg data. if the missing
hit sets are no longer in the history, then it is probably safe to go back
to the normal builds. that is until you have to mark another hit set
missing. :)  i think the code that removes the hit set from the pg data is
before that assert so its possible it still removed it from the history.
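
a quick way to check that, assuming jq is installed (the pg id here is just
an example):

  ceph pg 19.5d4 query | jq '.info.hit_set_history.history'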

mike

On Thu, Oct 18, 2018 at 9:11 AM Lawrence Smith <
lawrence.sm...@uni-muenster.de> wrote:

> Hi Mike,
>
> Thanks a bunch for your writeup, that was exactly the problem and
> solution! All i did was comment out the assert and add an if(obc){ } after
> to make sure i don't run into a segfault, and now the cluster is healthy
> once again. I am not sure if ceph will register a mismatch in a byte count
> while scrubbing due to the missing object, but I don't think so.
>
> Anyway, I just wanted to thank you for your help!
>
> Best wishes,
>
> Lawrence
>
> On 10/13/2018 02:00 AM, Mike Lovell wrote:
>
> what was the object name that you marked lost? was it one of the cache
> tier hit_sets?
>
> the trace you have does seem to be failing when the OSD is trying to
> remove a hit set that is no longer needed. i ran into a similar problem
> which might have been why that bug you listed was created. maybe providing
> what i have since discovered about hit sets might help.
>
> the hit sets are what the cache tier uses to know which objects have been
> accessed in a given period of time. these hit sets are then stored in the
> object store using an object name that is generated. for the version you're
> running, the code for that generation is at
> https://github.com/ceph/ceph/blob/v12.2.8/src/osd/PrimaryLogPG.cc#L12667.
> its basically "hit_set_<pgid>_archive_<start>_<end>" where the
> times are recorded in the hit set history. that hit set history is stored
> as part of the PG metadata. you can get a list of all of the hit sets the
> PG has by looking at 'ceph pg <pgid> query' and looking at the
> ['info']['hit_set_history']['history'] array. each entry in that array has
> the information on each hit set for the PG and the times are what is used
> in generation of the object name. there should be one ceph object for each
> hit set listed in that array.
>
> if you told the cluster to mark one of the hit set objects as lost, its
> possible the OSD cannot get that object and is hitting the assert(obc) near
> the end of PrimaryLogPG::hit_set_trim in the same source file referenced
> above. you can potentially verify this by a couple methods. i think if you
> set debug_osd at 20 that it should log a line saying something like
> "hit_set_trim removing hit_set__archive." if that name matches
> one of the ones you marked lost, then is this almost certainly the cause.
> you can also do a find on the OSD directory, if you're using file store,
> and look for the right file name. something like 'find
> /var/lib/ceph/osd/ceph-<id>/current/<pgid>_head -name
> hit\*set\*<pgid>\*archive\*' should work. include the \ to escape the * so
> bash doesn't interpret it. if you're using bluestore, i think you can use
> the ceph-objectstore-tool while the osd is stopped to get a list of
> objects. you'll probably want to only look in the .ceph-internal namespace
> since the hit sets are stored in that namespace.
>
> there are a couple potential ways to get around this. what we did when we
> had the problem was run a custom build of the ceph-osd where we commented
> out the assert(obc); line in hit_set_trim. that build was only run for long
> enough to get the cluster back online and then to flush and evict the
> entire cache, remove the cache, restart using the normal ceph builds, and
> then recreate the cache.
>
> the other options are things that i don't know for sure if they'll work.
> if you're using file store, you might be able to just copy another hit set
> to the file name of the missing hit set object. this should be pretty
> benign and its just going to remove the object in a moment anyways. also,
> i'm not entirely sure how to come up with what directory to put the object
> in if the osd has done any directory splitting. maybe someone on the list
> will know how to do this. there might be a way with the
> ceph-objectstore-tool to write in the object but i couldn't find one in my
> testing on hammer.
>
> the last option i can think of, is that if you can completely stop any
> traffic to the pools in question, its possible the OSDs wont crash.
> hit_set_trim doesn't appear to get called if there is no client traffic
> reaching the osds and the hit sets aren't being updated. if you can stop
> anything from using the pools in question and guarantee nothing will come
>

Re: [ceph-users] OSDs crash after deleting unfound object in Luminous 12.2.8

2018-10-12 Thread Mike Lovell
what was the object name that you marked lost? was it one of the cache tier
hit_sets?

the trace you have does seem to be failing when the OSD is trying to remove
a hit set that is no longer needed. i ran into a similar problem which
might have been why that bug you listed was created. maybe providing what i
have since discovered about hit sets might help.

the hit sets are what the cache tier uses to know which objects have been
accessed in a given period of time. these hit sets are then stored in the
object store using an object name that is generated. for the version you're
running, the code for that generation is at
https://github.com/ceph/ceph/blob/v12.2.8/src/osd/PrimaryLogPG.cc#L12667.
its basically "hit_set_<pgid>_archive_<start>_<end>" where the
times are recorded in the hit set history. that hit set history is stored
as part of the PG metadata. you can get a list of all of the hit sets the
PG has by looking at 'ceph pg <pgid> query' and looking at the
['info']['hit_set_history']['history'] array. each entry in that array has
the information on each hit set for the PG and the times are what is used
in generation of the object name. there should be one ceph object for each
hit set listed in that array.
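
as a rough sketch, something like this should print the object name each
history entry maps to (the pg id is just an example, jq is assumed, and i'm
assuming the entries expose "begin"/"end" fields; the exact timestamp
formatting in the real object names may differ slightly):

  pgid=19.5d4
  ceph pg "$pgid" query | jq -r --arg pg "$pgid" \
    '.info.hit_set_history.history[] | "hit_set_\($pg)_archive_\(.begin)_\(.end)"'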

if you told the cluster to mark one of the hit set objects as lost, its
possible the OSD cannot get that object and is hitting the assert(obc) near
the end of PrimaryLogPG::hit_set_trim in the same source file referenced
above. you can potentially verify this by a couple methods. i think if you
set debug_osd at 20 that it should log a line saying something like
"hit_set_trim removing hit_set__archive." if that name matches
one of the ones you marked lost, then is this almost certainly the cause.
you can also do a find on the OSD directory, if you're using file store,
and look for the right file name. something like 'find
/var/lib/ceph/osd/ceph-<id>/current/<pgid>_head -name
hit\*set\*<pgid>\*archive\*' should work. include the \ to escape the * so
bash doesn't interpret it. if you're using bluestore, i think you can use
the ceph-objectstore-tool while the osd is stopped to get a list of
objects. you'll probably want to only look in the .ceph-internal namespace
since the hit sets are stored in that namespace.
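
a rough sketch of those checks, with the osd id and pg id as placeholders:

  # bump logging on a running osd to catch the hit_set_trim message
  ceph tell osd.12 injectargs '--debug_osd 20/20'

  # filestore: look for the hit set archive files on disk
  find /var/lib/ceph/osd/ceph-12/current/19.5d4_head -name 'hit*set*archive*'

  # bluestore: with the osd stopped, list objects and grep for the hit sets
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --op list | grep hit_set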

there are a couple potential ways to get around this. what we did when we
had the problem was run a custom build of the ceph-osd where we commented
out the assert(obc); line in hit_set_trim. that build was only run for long
enough to get the cluster back online and then to flush and evict the
entire cache, remove the cache, restart using the normal ceph builds, and
then recreate the cache.
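
for reference, the flush/evict and tier removal was basically the standard
sequence, roughly like this (pool names are placeholders):

  ceph osd tier cache-mode cache-pool forward --yes-i-really-mean-it
  rados -p cache-pool cache-flush-evict-all
  ceph osd tier remove-overlay base-pool
  ceph osd tier remove base-pool cache-pool

  # and to recreate it afterwards (plus re-applying the hit_set and target
  # settings on the new cache pool)
  ceph osd tier add base-pool cache-pool
  ceph osd tier cache-mode cache-pool writeback
  ceph osd tier set-overlay base-pool cache-pool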

the other options are things that i don't know for sure if they'll work. if
you're using file store, you might be able to just copy another hit set to
the file name of the missing hit set object. this should be pretty benign
and its just going to remove the object in a moment anyways. also, i'm not
entirely sure how to come up with what directory to put the object in if
the osd has done any directory splitting. maybe someone on the list will
know how to do this. there might be a way with the ceph-objectstore-tool to
write in the object but i couldn't find one in my testing on hammer.

the last option i can think of, is that if you can completely stop any
traffic to the pools in question, its possible the OSDs wont crash.
hit_set_trim doesn't appear to get called if there is no client traffic
reaching the osds and the hit sets aren't being updated. if you can stop
anything from using the pools in question and guarantee nothing will come
in, then it might be possible to keep the OSDs up long enough to flush
everything from the cache tier, remove it, and recreate it. this option
seems like a long shot and i don't know for sure it'll work. it just seemed
to me like the OSDs would stay up in a similar scenario on my hammer test
cluster. its possible things have changed in luminous and hit_set_trim gets
called more often. i also didn't test whether the process of flushing and
evicting the objects in the cache caused hit_set_trim to get called.

hopefully that gives you some more info on what might be going on and ways
around it. i'm not entirely sure why the assert(obc); is still in
hit_set_trim. there was a little bit of discussion about removing it
since it means the object its trying to remove is gone anyways. i think
that just happened for a little bit in irc. i guess it didn't happen cause
no one followed up on it.

good luck and hopefully you don't blame me if things get worse. :)
mike

On Fri, Oct 12, 2018 at 7:34 AM Lawrence Smith <
lawrence.sm...@uni-muenster.de> wrote:

> Hi all,
>
> we are running a luminous 12.2.8 cluster with a 3 fold replicated cache
> pool with a min_size of 2. We recently encountered an "object unfound"
> error in one of our pgs in this pool. After marking this object lost,
> two of the acting osds crashed and were unable to start up again, with
> only the primary osd staying up. Hoping the cluster might 

Re: [ceph-users] All pools full after one OSD got OSD_FULL state

2018-03-29 Thread Mike Lovell
On Thu, Mar 29, 2018 at 1:17 AM, Jakub Jaszewski 
wrote:

> Many thanks Mike, that explains the stopped IOs. I've just finished adding
> new disks to the cluster and am now trying to evenly reweight OSDs by PG.
>
> May I ask you two more questions?
> 1. As I was in a hurry I did not check if only write ops were blocked or
> reads from the pools as well, do you know that maybe ?
>

i don't know for certain but i think reads are still processed normally.


> 2. All OSDs are shared with all pools in our cluster. What in the case
> when each pool has its dedicated OSDs, does one FULL OSD block only one
> pool or whole cluster ?
>

i've not tested this one for sure but my understanding of this is that
pools that use separate sets of osds would still be able to do writes.
assuming they haven't filled up as well. if you have multiple pools that
use the full osd, then at least those pools would be blocked. this should
be visible from 'ceph df' since it would show different amounts of MAX
AVAIL for the pools using different sets of osds.
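
a quick way to see that from the command line:

  ceph df       # per-pool MAX AVAIL
  ceph osd df   # per-osd utilization, shows which osds are near or at full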

Thanks!
>

np

mike


Re: [ceph-users] PG mapped to OSDs on same host although 'chooseleaf type host'

2018-02-22 Thread Mike Lovell
was the pg-upmap feature used to force a pg to get mapped to a particular
osd?
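
a quick way to check, in case it helps: any upmap exceptions show up in the
osdmap dump, so something like

  ceph osd dump | grep pg_upmap

should print the pg_upmap / pg_upmap_items entries if any were set.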

mike

On Thu, Feb 22, 2018 at 10:28 AM, Wido den Hollander  wrote:

> Hi,
>
> I have a situation with a cluster which was recently upgraded to Luminous
> and has a PG mapped to OSDs on the same host.
>
> root@man:~# ceph pg map 1.41
> osdmap e21543 pg 1.41 (1.41) -> up [15,7,4] acting [15,7,4]
> root@man:~#
>
> root@man:~# ceph osd find 15|jq -r '.crush_location.host'
> n02
> root@man:~# ceph osd find 7|jq -r '.crush_location.host'
> n01
> root@man:~# ceph osd find 4|jq -r '.crush_location.host'
> n02
> root@man:~#
>
> As you can see, OSD 15 and 4 are both on the host 'n02'.
>
> This PG went inactive when the machine hosting both OSDs went down for
> maintenance.
>
> My first suspect was the CRUSHMap and the rules, but those are fine:
>
> rule replicated_ruleset {
> id 0
> type replicated
> min_size 1
> max_size 10
> step take default
> step chooseleaf firstn 0 type host
> step emit
> }
>
> This is the only rule in the CRUSHMap.
>
> ID CLASS WEIGHT   TYPE NAME  STATUS REWEIGHT PRI-AFF
> -1   19.50325 root default
> -22.78618 host n01
>  5   ssd  0.92999 osd.5  up  1.0 1.0
>  7   ssd  0.92619 osd.7  up  1.0 1.0
> 14   ssd  0.92999 osd.14 up  1.0 1.0
> -32.78618 host n02
>  4   ssd  0.92999 osd.4  up  1.0 1.0
>  8   ssd  0.92619 osd.8  up  1.0 1.0
> 15   ssd  0.92999 osd.15 up  1.0 1.0
> -42.78618 host n03
>  3   ssd  0.92999 osd.3  up  0.94577 1.0
>  9   ssd  0.92619 osd.9  up  0.82001 1.0
> 16   ssd  0.92999 osd.16 up  0.84885 1.0
> -52.78618 host n04
>  2   ssd  0.92999 osd.2  up  0.93501 1.0
> 10   ssd  0.92619 osd.10 up  0.76031 1.0
> 17   ssd  0.92999 osd.17 up  0.82883 1.0
> -62.78618 host n05
>  6   ssd  0.92999 osd.6  up  0.84470 1.0
> 11   ssd  0.92619 osd.11 up  0.80530 1.0
> 18   ssd  0.92999 osd.18 up  0.86501 1.0
> -72.78618 host n06
>  1   ssd  0.92999 osd.1  up  0.88353 1.0
> 12   ssd  0.92619 osd.12 up  0.79602 1.0
> 19   ssd  0.92999 osd.19 up  0.83171 1.0
> -82.78618 host n07
>  0   ssd  0.92999 osd.0  up  1.0 1.0
> 13   ssd  0.92619 osd.13 up  0.86043 1.0
> 20   ssd  0.92999 osd.20 up  0.77153 1.0
>
> Here you see osd.15 and osd.4 on the same host 'n02'.
>
> This cluster was upgraded from Hammer to Jewel and now Luminous and it
> doesn't have the latest tunables yet, but should that matter? I never
> encountered this before.
>
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 50
> tunable chooseleaf_descend_once 1
> tunable chooseleaf_vary_r 1
> tunable chooseleaf_stable 1
> tunable straw_calc_version 1
> tunable allowed_bucket_algs 54
>
> I don't want to touch this yet in the case this is a bug or glitch in the
> matrix somewhere.
>
> I hope it's just an admin mistake, but so far I'm not able to find a clue
> pointing to that.
>
> root@man:~# ceph osd dump|head -n 12
> epoch 21545
> fsid 0b6fb388-6233-4eeb-a55c-476ed12bdf0a
> created 2015-04-28 14:43:53.950159
> modified 2018-02-22 17:56:42.497849
> flags sortbitwise,recovery_deletes,purged_snapdirs
> crush_version 22
> full_ratio 0.95
> backfillfull_ratio 0.9
> nearfull_ratio 0.85
> require_min_compat_client luminous
> min_compat_client luminous
> require_osd_release luminous
> root@man:~#
>
> I also downloaded the CRUSHmap and ran crushtool with --test and
> --show-mappings, but that didn't show any PG mapped to the same host.
>
> Any ideas on what might be going on here?
>
> Wido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] Many concurrent drive failures - How do I activate pgs?

2018-02-22 Thread Mike Lovell
adding ceph-users back on.

it sounds like the enterprise samsungs and hitachis have been mentioned on
the list as alternatives. i have 2 micron 5200 (pro i think) that i'm
beginning testing on and have some micron 9100 nvme drives to use as
journals. so the enterprise micron might be good. i did try some micron
m600s a couple years ago and was disappointed by them so i'm avoiding the
"prosumer" ones from micron if i can. my use case has been the 1TB range
ssds and am using them mainly as a cache tier and filestore. my needs might
not line up closely with yours though.

mike

On Thu, Feb 22, 2018 at 3:58 PM, Hans Chris Jones <
chris.jo...@lambdastack.io> wrote:

> Interesting. This does not inspire confidence. What SSDs (2TB or 4TB) do
> people have good success with in high use production systems with bluestore?
>
> Thanks
>
> On Thu, Feb 22, 2018 at 5:32 PM, Mike Lovell <mike.lov...@endurance.com>
> wrote:
>
>> hrm. intel has, until a year ago, been very good with ssds. the
>> description of your experience definitely doesn't inspire confidence. intel
>> also dropping the entire s3xxx and p3xxx series last year before having a
>> viable replacement has been driving me nuts.
>>
>> i don't know that i have the luxury of being able to return all of the
>> ones i have or just buying replacements. i'm going to need to at least try
>> them in production. it'll probably happen with the s4600 limited to a
>> particular fault domain. these are also going to be filestore osds so maybe
>> that will result in a different behavior. i'll try to post updates as i
>> have them.
>>
>> mike
>>
>> On Thu, Feb 22, 2018 at 2:33 PM, David Herselman <d...@syrex.co> wrote:
>>
>>> Hi Mike,
>>>
>>>
>>>
>>> I eventually got hold of a customer relations manager at Intel but his
>>> attitude was lack luster and Intel never officially responded to any
>>> correspondence we sent them. The Intel s4600 drives all passed our standard
>>> burn-in tests, they exclusively appear to fail once they handle production
>>> BlueStore usage, generally after a couple days use.
>>>
>>>
>>>
>>> Intel really didn’t seem interested, even after explaining that the
>>> drives were in different physical systems in different data centres and
>>> that I had been in contact with another Intel customer who had experienced
>>> similar failures in Dell equipment (our servers are pure Intel).
>>>
>>>
>>>
>>>
>>>
>>> Perhaps there’s interest in a Lawyer picking up the issue and their
>>> attitude. Not advising customers of a known issue which leads to data loss
>>> is simply negligent, especially on a product that they tout as being more
>>> reliable than spinners and has their Data Centre reliability stamp.
>>>
>>>
>>>
>>> I returned the lot and am done with Intel SSDs, will advise as many
>>> customers and peers to do the same…
>>>
>>>
>>>
>>>
>>>
>>> Regards
>>>
>>> David Herselman
>>>
>>>
>>>
>>


Re: [ceph-users] Many concurrent drive failures - How do I activate pgs?

2018-02-22 Thread Mike Lovell
hrm. intel has, until a year ago, been very good with ssds. the description
of your experience definitely doesn't inspire confidence. intel also
dropping the entire s3xxx and p3xxx series last year before having a viable
replacement has been driving me nuts.

i don't know that i have the luxury of being able to return all of the ones
i have or just buying replacements. i'm going to need to at least try them
in production. it'll probably happen with the s4600 limited to a particular
fault domain. these are also going to be filestore osds so maybe that will
result in a different behavior. i'll try to post updates as i have them.

mike

On Thu, Feb 22, 2018 at 2:33 PM, David Herselman <d...@syrex.co> wrote:

> Hi Mike,
>
>
>
> I eventually got hold of a customer relations manager at Intel but his
> attitude was lack luster and Intel never officially responded to any
> correspondence we sent them. The Intel s4600 drives all passed our standard
> burn-in tests, they exclusively appear to fail once they handle production
> BlueStore usage, generally after a couple days use.
>
>
>
> Intel really didn’t seem interested, even after explaining that the drives
> were in different physical systems in different data centres and that I had
> been in contact with another Intel customer who had experienced similar
> failures in Dell equipment (our servers are pure Intel).
>
>
>
>
>
> Perhaps there’s interest in a Lawyer picking up the issue and their
> attitude. Not advising customers of a known issue which leads to data loss
> is simply negligent, especially on a product that they tout as being more
> reliable than spinners and has their Data Centre reliability stamp.
>
>
>
> I returned the lot and am done with Intel SSDs, will advise as many
> customers and peers to do the same…
>
>
>
>
>
> Regards
>
> David Herselman
>
>
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *Mike Lovell
> *Sent:* Thursday, 22 February 2018 11:19 PM
> *To:* ceph-users@lists.ceph.com
>
> *Subject:* Re: [ceph-users] Many concurrent drive failures - How do I
> activate pgs?
>
>
>
> has anyone tried with the most recent firmwares from intel? i've had a
> number of s4600 960gb drives that have been waiting for me to get around to
> adding them to a ceph cluster. this as well as having 2 die almost
> simultaneously in a different storage box is giving me pause. i noticed
> that David listed some output showing his ssds were running firmware
> version SCV10100. the drives i have came with the same one. it looks
> like SCV10111 is available through the latest isdct package. i'm working
> through upgrading mine and attempting some burn in testing. just curious if
> anyone has had any luck there.
>
>
>
> mike
>
>
>
> On Thu, Feb 22, 2018 at 9:49 AM, Chris Sarginson <csarg...@gmail.com>
> wrote:
>
> Hi Caspar,
>
>
>
> Sean and I replaced the problematic DC S4600 disks (after all but one had
> failed) in our cluster with Samsung SM863a disks.
>
> There was an NDA for new Intel firmware (as mentioned earlier in the
> thread by David) but given the problems we were experiencing we moved all
> Intel disks to a single failure domain but were unable to get to deploy
> additional firmware to test.
>
>
> The Samsung should fit your requirements.
>
>
>
> http://www.samsung.com/semiconductor/minisite/ssd/
> product/enterprise/sm863a/
>
>
>
> Regards
>
> Chris
>
>
>
> On Thu, 22 Feb 2018 at 12:50 Caspar Smit <caspars...@supernas.eu> wrote:
>
> Hi Sean and David,
>
>
>
> Do you have any follow ups / news on the Intel DC S4600 case? We are
> looking into this drives to use as DB/WAL devices for a new to be build
> cluster.
>
>
>
> Did Intel provide anything (like new firmware) which should fix the issues
> you were having or are these drives still unreliable?
>
>
>
> At the moment we are also looking into the Intel DC S3610 as an
> alternative which are a step back in performance but should be very
> reliable.
>
>
>
> Maybe any other recommendations for a ~200GB 2,5" SATA SSD to use as
> DB/WAL? (Aiming for ~3 DWPD should be sufficient for DB/WAL?)
>
>
>
> Kind regards,
>
> Caspar
>
>
>
> 2018-01-12 15:45 GMT+01:00 Sean Redmond <sean.redmo...@gmail.com>:
>
> Hi David,
>
>
>
> To follow up on this I had a 4th drive fail (out of 12) and have opted to
> order the below disks as a replacement, I have an ongoing case with Intel
> via the supplier - Will report back anything useful - But I am going to
> avoid the Intel s4600 2TB SSD's for the moment.
>
>
>

Re: [ceph-users] Many concurrent drive failures - How do I activate pgs?

2018-02-22 Thread Mike Lovell
has anyone tried with the most recent firmwares from intel? i've had a
number of s4600 960gb drives that have been waiting for me to get around to
adding them to a ceph cluster. this as well as having 2 die almost
simultaneously in a different storage box is giving me pause. i noticed
that David listed some output showing his ssds were running firmware
version SCV10100. the drives i have came with the same one. it looks
like SCV10111 is available through the latest isdct package. i'm working
through upgrading mine and attempting some burn in testing. just curious if
anyone has had any luck there.
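
in case it helps anyone else, the steps i'm going through are roughly the
following (assuming the isdct syntax hasn't changed; the drive index is just
an example):

  isdct show -intelssd     # list drives and their current firmware
  isdct load -intelssd 0   # push the newest available firmware to drive 0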

mike

On Thu, Feb 22, 2018 at 9:49 AM, Chris Sarginson  wrote:

> Hi Caspar,
>
> Sean and I replaced the problematic DC S4600 disks (after all but one had
> failed) in our cluster with Samsung SM863a disks.
> There was an NDA for new Intel firmware (as mentioned earlier in the
> thread by David) but given the problems we were experiencing we moved all
> Intel disks to a single failure domain but were unable to get to deploy
> additional firmware to test.
>
> The Samsung should fit your requirements.
>
> http://www.samsung.com/semiconductor/minisite/ssd/
> product/enterprise/sm863a/
>
> Regards
> Chris
>
> On Thu, 22 Feb 2018 at 12:50 Caspar Smit  wrote:
>
>> Hi Sean and David,
>>
>> Do you have any follow ups / news on the Intel DC S4600 case? We are
>> looking into this drives to use as DB/WAL devices for a new to be build
>> cluster.
>>
>> Did Intel provide anything (like new firmware) which should fix the
>> issues you were having or are these drives still unreliable?
>>
>> At the moment we are also looking into the Intel DC S3610 as an
>> alternative which are a step back in performance but should be very
>> reliable.
>>
>> Maybe any other recommendations for a ~200GB 2,5" SATA SSD to use as
>> DB/WAL? (Aiming for ~3 DWPD should be sufficient for DB/WAL?)
>>
>> Kind regards,
>> Caspar
>>
>> 2018-01-12 15:45 GMT+01:00 Sean Redmond :
>>
>>> Hi David,
>>>
>>> To follow up on this I had a 4th drive fail (out of 12) and have opted
>>> to order the below disks as a replacement, I have an ongoing case with
>>> Intel via the supplier - Will report back anything useful - But I am going
>>> to avoid the Intel s4600 2TB SSD's for the moment.
>>>
>>> 1.92TB Samsung SM863a 2.5" Enterprise SSD, SATA3 6Gb/s, 2-bit MLC V-NAND
>>>
>>> Regards
>>> Sean Redmond
>>>
>>> On Wed, Jan 10, 2018 at 11:08 PM, Sean Redmond 
>>> wrote:
>>>
 Hi David,

 Thanks for your email, they are connected inside Dell R730XD (2.5 inch
 24 disk model) in None RAID mode via a perc RAID card.

 The version of ceph is Jewel with kernel 4.13.X and ubuntu 16.04.

 Thanks for your feedback on the HGST disks.

 Thanks

 On Wed, Jan 10, 2018 at 10:55 PM, David Herselman  wrote:

> Hi Sean,
>
>
>
> No, Intel’s feedback has been… Pathetic… I have yet to receive
> anything more than a request to ‘sign’ a non-disclosure agreement, to
> obtain beta firmware. No official answer as to whether or not one can
> logically unlock the drives, no answer to my question whether or not Intel
> publish serial numbers anywhere pertaining to recalled batches and no
> information pertaining to whether or not firmware updates would address 
> any
> known issues.
>
>
>
> This with us being an accredited Intel Gold partner…
>
>
>
>
>
> We’ve returned the lot and ended up with 9/12 of the drives failing in
> the same manner. The replaced drives, which had different serial number
> ranges, also failed. Very frustrating is that the drives fail in a way 
> that
> result in unbootable servers, unless one adds ‘rootdelay=240’ to the 
> kernel.
>
>
>
>
>
> I would be interested to know what platform your drives were in and
> whether or not they were connected to a RAID module/card.
>
>
>
> PS: After much searching we’ve decided to order the NVMe conversion
> kit and have ordered HGST UltraStar SN200 2.5 inch SFF drives with a 3 
> DWPD
> rating.
>
>
>
>
>
> Regards
>
> David Herselman
>
>
>
> *From:* Sean Redmond [mailto:sean.redmo...@gmail.com]
> *Sent:* Thursday, 11 January 2018 12:45 AM
> *To:* David Herselman 
> *Cc:* Christian Balzer ; ceph-users@lists.ceph.com
>
> *Subject:* Re: [ceph-users] Many concurrent drive failures - How do I
> activate pgs?
>
>
>
> Hi,
>
>
>
> I have a case where 3 out to 12 of these Intel S4600 2TB model failed
> within a matter of days after being burn-in tested then placed into
> production.
>
>
>
> I am interested to know, did you every get any further feedback from
> the 

Re: [ceph-users] Removing cache tier for RBD pool

2018-01-19 Thread Mike Lovell
On Tue, Jan 16, 2018 at 9:25 AM, Jens-U. Mozdzen <jmozd...@nde.ag> wrote:

> Hello Mike,
>
> Zitat von Mike Lovell <mike.lov...@endurance.com>:
>
>> On Mon, Jan 8, 2018 at 6:08 AM, Jens-U. Mozdzen <jmozd...@nde.ag> wrote:
>>
>>> Hi *,
>>> [...]
>>> 1. Does setting the cache mode to "forward" lead to above situation of
>>> remaining locks on hot-storage pool objects? Maybe the clients' unlock
>>> requests are forwarded to the cold-storage pool, leaving the hot-storage
>>> objects locked? If so, this should be documented and it'd seem impossible
>>> to cleanly remove a cache tier during live operations.
>>>
>>> 2. What is the significant difference between "rados
>>> cache-flush-evict-all" and separate "cache-flush" and "cache-evict"
>>> cycles?
>>> Or is it some implementation error that leads to those "file not found"
>>> errors with "cache-flush-evict-all", while the manual cycles work
>>> successfully?
>>>
>>> Thank you for any insight you might be able to share.
>>>
>>> Regards,
>>> Jens
>>>
>>>
>> i've removed a cache tier in environments a few times. the only locked
>> files i ran into were the rbd_directory and rbd_header objects for each
>> volume. the rbd_headers for each rbd volume are locked as long as the vm
>> is
>> running. every time i've tried to remove a cache tier, i shutdown all of
>> the vms before starting the procedure and there wasn't any problem getting
>> things flushed+evicted. so i can't really give any further insight into
>> what might have happened other than it worked for me. i set the cache-mode
>> to forward everytime before flushing and evicting objects.
>>
>
> while your report doesn't confirm my suspicion expressed in my question 1,
> it at least is another example where removing the cache worked *after
> stopping all instances*, rather than "live". If, OTOH, this limitation is
> confirmed, it should be added to the docs.
>
> Out of curiosity: Do you have any other users for the pool? After stopping
> all VMs (and the image-related services on our Openstack control nodes), my
> pool was without access, so I saw no need to put the caching tier to
> "forward" mode.


the pools were exclusively for rbd and vms. due to other constraints of our
system, i had to shut everything down. i think that if you were to have the
pool in forward mode, shut down a single vm, flush the objects related to
its rbd volumes, then start it that it wont promote objects from those rbd
volumes again. you could, in theory, then power cycle the vms one at a time
and be able to remove the cache tier. thats probably not a practical case
though. the cold tier in our system didn't have the iops capacity to run
all of the vms without the tier so we just shut them all down.
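
if someone wanted to try the per-vm approach, a rough sketch for a single
volume would be something like this (pool and image names are placeholders,
jq is assumed, and the rbd_header object for the volume is separate from the
data objects):

  prefix=$(rbd -p base-pool info vm01-disk --format json | jq -r .block_name_prefix)
  rados -p cache-pool ls | grep "^${prefix}" | while read -r obj; do
    rados -p cache-pool cache-flush "$obj"
    rados -p cache-pool cache-evict "$obj"
  done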

i don't think there really is a significant technical difference between
>> the cache-flush-evict-all command and doing separate cache-flush and
>> cache-evict on individual objects. my understanding is
>> cache-flush-evict-all is just a short cut to getting everything in the
>> cache flushed and evicted. did the cache-flush-evict-all error on some
>> objects where the separate operations succeeded? you're description
>> doesn't
>> say if there was but then you say you used both styles during your second
>> attempt.
>>
>
> It was actually that every run of "cache-flush-evict-all" did report
> errors on all remaining objects, while running the loop manually (issue
> flush for every objects, then issue evict for every remaining object) did
> work flawlessly. That's why my question 2 came up.
>
> The objects I saw seemed related to the images stored in the pool, not any
> "management data" (like the suggested hitset persistence).


 hrm. i don't think i've seen that then. it has been several months since
i've done it though so i might be forgetting.

on a different note, you say that your cluster is on 12.2 but the cache
>> tiers were created on an earlier version. which version was the cache tier
>> created on? how well did the upgrade process work? i am curious since the
>> production clusters i have using a cache tier are still on 10.2 and i'm
>> about to begin the process of testing the upgrade to 12.2. any info on
>> that
>> experience you can share would be helpful.
>>
>
> I *believe* the cache was created on 10.2, but cannot recall for sure. I
> remember having had similar problems in those earlier days with a previous
> instance of that caching tier, but many root causes were "

Re: [ceph-users] Removing cache tier for RBD pool

2018-01-15 Thread Mike Lovell
On Mon, Jan 8, 2018 at 6:08 AM, Jens-U. Mozdzen  wrote:

> Hi *,
>
> trying to remove a caching tier from a pool used for RBD / Openstack, we
> followed the procedure from http://docs.ceph.com/docs/mast
> er/rados/operations/cache-tiering/#removing-a-writeback-cache and ran
> into problems.
>
> The cluster is currently running Ceph 12.2.2, the caching tier was created
> with an earlier release of Ceph.
>
> First of all, setting the cache-mode to "forward" is reported to be
> unsafe, which is not mentioned in the documentation - if it's really meant
> to be used in this case, the need for "--yes-i-really-mean-it" should be
> documented.
>
> Unfortunately, using "rados -p hot-storage cache-flush-evict-all" not only
> reported errors ("file not found") for many objects, but left us with quite
> a number of objects in the pool and new ones being created, despite the
> "forward" mode. Even after stopping all Openstack instances ("VMs"), we
> could also see that the remaining objects in the pool were still locked.
> Manually unlocking these via rados commands worked, but
> "cache-flush-evict-all" then still reported those "file not found" errors
> and 1070 objects remained in the pool, like before. We checked the
> remaining objects via "rados stat" both in the hot-storage and the
> cold-storage pool and could see that every hot-storage object had a
> counter-part in cold-storage with identical stat info. We also compared
> some of the objects (with size > 0) and found the hot-storage and
> cold-storage entities to be identical.
>
> We aborted that attempt, reverted the mode to "writeback" and restarted
> the Openstack cluster - everything was working fine again, of course still
> using the cache tier.
>
> During a recent maintenance window, the Openstack cluster was shut down
> again and we re-tried the procedure. As there were no active users of the
> images pool, we skipped the step of forcing the cache mode to forward and
> immediately issued the "cache-flush-evict-all" command. Again 1070 objects
> remained in the hot-storage pool (and gave "file not found" errors), but
> unlike last time, none were locked.
>
> Out of curiosity we then issued loops of "rados -p hot-storage cache-flush
> " and "rados -p hot-storage cache-evict " for all
> objects in the hot-storage pool and surprisingly not only received no error
> messages at all, but were left with an empty hot-storage pool! We then
> proceeded with the further steps from the docs and were able to
> successfully remove the cache tier.
>
> This leaves us with two questions:
>
> 1. Does setting the cache mode to "forward" lead to above situation of
> remaining locks on hot-storage pool objects? Maybe the clients' unlock
> requests are forwarded to the cold-storage pool, leaving the hot-storage
> objects locked? If so, this should be documented and it'd seem impossible
> to cleanly remove a cache tier during live operations.
>
> 2. What is the significant difference between "rados
> cache-flush-evict-all" and separate "cache-flush" and "cache-evict" cycles?
> Or is it some implementation error that leads to those "file not found"
> errors with "cache-flush-evict-all", while the manual cycles work
> successfully?
>
> Thank you for any insight you might be able to share.
>
> Regards,
> Jens
>

i've removed a cache tier in environments a few times. the only locked
files i ran into were the rbd_directory and rbd_header objects for each
volume. the rbd_headers for each rbd volume are locked as long as the vm is
running. every time i've tried to remove a cache tier, i shutdown all of
the vms before starting the procedure and there wasn't any problem getting
things flushed+evicted. so i can't really give any further insight into
what might have happened other than it worked for me. i set the cache-mode
to forward every time before flushing and evicting objects.

i don't think there really is a significant technical difference between
the cache-flush-evict-all command and doing separate cache-flush and
cache-evict on individual objects. my understanding is
cache-flush-evict-all is just a short cut to getting everything in the
cache flushed and evicted. did the cache-flush-evict-all error on some
objects where the separate operations succeeded? your description doesn't
say if there was but then you say you used both styles during your second
attempt.
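
for reference, the manual cycle you describe sounds basically like this
(using the hot-storage pool name from your message):

  rados -p hot-storage ls | while read -r obj; do
    rados -p hot-storage cache-flush "$obj"
  done
  rados -p hot-storage ls | while read -r obj; do
    rados -p hot-storage cache-evict "$obj"
  done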

there being objects left in the hot storage pool is something i've seen,
even after it looks like everything has been flushed. when i dug deeper, it
looked like all of the objects left in the pool were the hitset objects
that the cache tier uses for tracking how frequently objects are used.
those hitsets need to be persisted in case an osd restarts or the pg is
migrated to another osd. the method it uses for that is just storing the
hitset as another object but one that is internal to ceph. since they're
internal, the objects are hidden from some commands like "rados ls" but
still get counted as 

Re: [ceph-users] cephfs cache tiering - hitset

2017-03-20 Thread Mike Lovell
On Mon, Mar 20, 2017 at 4:20 PM, Nick Fisk <n...@fisk.me.uk> wrote:

> Just a few corrections, hope you don't mind
>
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> > Mike Lovell
> > Sent: 20 March 2017 20:30
> > To: Webert de Souza Lima <webert.b...@gmail.com>
> > Cc: ceph-users <ceph-users@lists.ceph.com>
> > Subject: Re: [ceph-users] cephfs cache tiering - hitset
> >
> > i'm not an expert but here is my understanding of it. a hit_set keeps
> track of
> > whether or not an object was accessed during the timespan of the hit_set.
> > for example, if you have a hit_set_period of 600, then the hit_set
> covers a
> > period of 10 minutes. the hit_set_count defines how many of the hit_sets
> to
> > keep a record of. setting this to a value of 12 with the 10 minute
> > hit_set_period would mean that there is a record of objects accessed
> over a
> > 2 hour period. the min_read_recency_for_promote, and its newer
> > min_write_recency_for_promote sibling, define how many of these hit_sets
> > an object must be in before an object is promoted from the storage pool
> > into the cache pool. if this were set to 6 with the previous examples,
> it means
> > that the cache tier will promote an object if that object has been
> accessed at
> > least once in 6 of the 12 10-minute periods. it doesn't matter how many
> > times the object was used in each period and so 6 requests in one
> 10-minute
> > hit_set will not cause a promotion. it would be any number of access in 6
> > separate 10-minute periods over the 2 hours.
>
> Sort of, the recency looks at the last N most recent hitsets. So if set to
> 6, then the object would have to be in all of the last 6 hitsets. Because of
> this, during testing I found setting recency above 2 or 3 made the behavior
> quite binary. If an object was hot enough, it would probably be in every
> hitset; if it was only warm it would never be in enough hitsets in a row. I did
> experiment with X out of N promotion logic, ie must be in 3 hitsets out of
> 10 non sequential. If you could find the right number to configure, you
> could get improved cache behavior, but if not, then there was a large
> chance it would be worse.
>
> For promotion I think having more hitsets probably doesn't add much value,
> but they may help when it comes to determining what to flush.
>

that's good to know. i just made an assumption without actually digging into
the code. do you recommend keeping the number of hitsets equal to the larger
of min_read_recency_for_promote and
min_write_recency_for_promote? how are the hitsets checked during flush
and/or eviction?

mike


Re: [ceph-users] cephfs cache tiering - hitset

2017-03-20 Thread Mike Lovell
i'm not an expert but here is my understanding of it. a hit_set keeps track
of whether or not an object was accessed during the timespan of the
hit_set. for example, if you have a hit_set_period of 600, then the hit_set
covers a period of 10 minutes. the hit_set_count defines how many of the
hit_sets to keep a record of. setting this to a value of 12 with the 10
minute hit_set_period would mean that there is a record of objects accessed
over a 2 hour period. the min_read_recency_for_promote, and its newer
min_write_recency_for_promote sibling, define how many of these hit_sets
an object must be in before an object is promoted from the storage pool
into the cache pool. if this were set to 6 with the previous examples, it
means that the cache tier will promote an object if that object has been
accessed at least once in 6 of the 12 10-minute periods. it doesn't matter
how many times the object was used in each period and so 6 requests in one
10-minute hit_set will not cause a promotion. it would be any number of
access in 6 separate 10-minute periods over the 2 hours.
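
to make that concrete, the example above corresponds to pool settings along
these lines (the cache pool name is a placeholder):

  ceph osd pool set cache-pool hit_set_type bloom
  ceph osd pool set cache-pool hit_set_period 600   # 10 minute hit sets
  ceph osd pool set cache-pool hit_set_count 12     # keep 2 hours of them
  ceph osd pool set cache-pool min_read_recency_for_promote 6
  ceph osd pool set cache-pool min_write_recency_for_promote 6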

this is just an example and might not fit well for your use case. the
systems i run have a lower hit_set_period, higher hit_set_count, and higher
recency options. that means that the osds use some more memory (each
hit_set takes space but i think they use the same amount of space
regardless of period) but hit_set covers a smaller amount of time. the
longer the period, the more likely a given object is in the hit_set.
without knowing your access patterns, it would be hard to recommend
settings. the overhead of a promotion can be substantial and so i'd
probably go with settings that only promote after many requests to an
object.

one thing to note is that the recency options only seemed to work for me in
jewel. i haven't tried infernalis. the older versions of hammer didn't seem
to use the min_read_recency_for_promote properly and 0.94.6 definitely had
a bug that could corrupt data when min_read_recency_for_promote was more
than 1. even though that was fixed in 0.94.7, i was hesitant to increase it
while still on hammer. min_write_recency_for_promote wasn't added till after
hammer.

hopefully that helps.
mike

On Fri, Mar 17, 2017 at 2:02 PM, Webert de Souza Lima  wrote:

> Hello everyone,
>
> I`m deploying a ceph cluster with cephfs and I`d like to tune ceph cache
> tiering, and I`m
> a little bit confused of the settings hit_set_count, hit_set_period and
> min_read_recency_for_promote. The docs are very lean and I can`f find any
> more detailed explanation anywhere.
>
> Could someone provide me a better understandment of this?
>
> Thanks in advance!
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


[ceph-users] hammer to jewel upgrade experiences? cache tier experience?

2017-03-06 Thread Mike Lovell
has anyone on the list done an upgrade from hammer (something later than
0.94.6) to jewel with a cache tier configured? i tried doing one last week
and had a hiccup with it. i'm curious if others have been able to
successfully do the upgrade and, if so, did they take any extra steps
related to the cache tier?

there has also been some talk here about how rare or unsupported cache
tiering is. i've heard some say that its use is uncommon but i am not sure
if that is the cases. are there many on the list using cache tiering? in
particular, with rbd volumes and clients? what are some of the the
communities' experiences there?

thanks
mike


[ceph-users] osds crashing during hit_set_trim and hit_set_remove_all

2017-03-03 Thread Mike Lovell
i started an upgrade process to go from 0.94.7 to 10.2.5 on a production
cluster that is using cache tiering. this cluster has 3 monitors, 28
storage nodes, around 370 osds. the upgrade of the monitors completed
without issue. i then upgraded 2 of the storage nodes, and after the
restarts, the osds started crashing during hit_set_trim. here is some of
the output from the log.

2017-03-02 22:41:32.338290 7f8bfd6d7700 -1 osd/ReplicatedPG.cc: In function
'void ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)'
thread 7f8bfd6d7700 time 2017-03-02 22:41:32.335020
osd/ReplicatedPG.cc: 10514: FAILED assert(obc)

 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x85) [0xbddac5]
 2: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned
int)+0x75f) [0x87e48f]
 3: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f4ab]
 4: (ReplicatedPG::do_op(std::tr1::shared_ptr&)+0xe3a) [0x8a0d1a]
 5: (ReplicatedPG::do_request(std::tr1::shared_ptr&,
ThreadPool::TPHandle&)+0x68a) [0x83be4a]
 6: (OSD::dequeue_op(boost::intrusive_ptr,
std::tr1::shared_ptr, ThreadPool::TPHandle&)+0x405) [0x69a5c5]
 7: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x333) [0x69ab33]
 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f)
[0xbcd1cf]
 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcf300]
 10: (()+0x7dc5) [0x7f8c1c209dc5]
 11: (clone()+0x6d) [0x7f8c1aceaced]

it started on just one osd and then spread to others until most of the osds
that are part of the cache tier were crashing. that was happening on both
the osds that were running jewel and on the ones running hammer. in the
process of trying to sort this out, the use_gmt_hitset option was set to
true and all of the osds were upgraded to jewel. we still have not been
able to determine a cause or a fix.
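
for anyone following along, the gmt switch mentioned above is a pool option
on the cache pool, set with something like this (the pool name is a
placeholder, and i'm going from memory on the exact syntax):

  ceph osd pool set cache-pool use_gmt_hitset true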

it looks like when hit_set_trim and hit_set_remove_all are being called,
they are calling hit_set_archive_object() to generate a name based on a
timestamp and then calling get_object_context() which then returns nothing
and triggers an assert.

i raised the debug_osd to 10/10 and then analyzed the logs after the crash.
i found the following in the ceph osd log afterwards.

2017-03-03 03:10:31.918470 7f218c842700 10 osd.146 pg_epoch: 266043
pg[19.5d4( v 264786'61233923 (262173'61230715,264786'61233923]
local-les=266043 n=393 ec=83762 les/c/f 266043/264767/0
266042/266042/266042) [146,116,179] r=0 lpr=266042
 pi=264766-266041/431 crt=262323'61233250 lcod 0'0 mlcod 0'0
active+degraded NIBBLEWISE] get_object_context: no obc for soid
19:2ba0:.ceph-internal::hit_set_19.5d4_archive_2017-03-03
05%3a55%3a58.459084Z_2017-03-03 05%3a56%3a58.981016Z:head and !can_create
2017-03-03 03:10:31.921064 7f2194051700 10 osd.146 266043 do_waiters --
start
2017-03-03 03:10:31.921072 7f2194051700 10 osd.146 266043 do_waiters --
finish
2017-03-03 03:10:31.921076 7f2194051700  7 osd.146 266043 handle_pg_notify
from osd.255
2017-03-03 03:10:31.921096 7f2194051700 10 osd.146 266043 do_waiters --
start
2017-03-03 03:10:31.921099 7f2194051700 10 osd.146 266043 do_waiters --
finish
2017-03-03 03:10:31.925858 7f218c041700 -1 osd/ReplicatedPG.cc: In function
'void ReplicatedPG::hit_set_remove_all()' thread 7f218c041700 time
2017-03-03 03:10:31.918201
osd/ReplicatedPG.cc: 11494: FAILED assert(obc)

 ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x85) [0x7f21acee9425]
 2: (ReplicatedPG::hit_set_remove_all()+0x412) [0x7f21ac9cba92]
 3: (ReplicatedPG::on_activate()+0x6dd) [0x7f21ac9f73fd]
 4: (PG::RecoveryState::Active::react(PG::AllReplicasActivated
const&)+0xac) [0x7f21ac916adc]
 5: (boost::statechart::simple_state::react_impl(boost::statechart::event_base
const&, void const*)+0x179) [0x7f21a
c974909]
 6: (boost::statechart::simple_state,
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
const&, void const*)+0xcd) [0x7f21ac977ccd]
 7: (boost::statechart::state_machine::send_event(boost::statechart::event_base
const&)+0x6b) [0x7f21ac95
d9cb]
 8: (PG::handle_peering_event(std::shared_ptr,
PG::RecoveryCtx*)+0x1f4) [0x7f21ac924d24]
 9: (OSD::process_peering_events(std::list >
const&, ThreadPool::TPHandle&)+0x259) [0x7f21ac87de99]
 10: (OSD::PeeringWQ::_process(std::list > const&,

Re: [ceph-users] Issue with upgrade from 0.94.9 to 10.2.5

2017-01-23 Thread Mike Lovell
i was just testing an upgrade of some monitors in a test cluster from
hammer (0.94.7) to jewel (10.2.5). after upgrade each of the first two
monitors, i stopped and restarted a single osd to cause changes in the
maps. the same error messages showed up in ceph -w. i haven't dug into it
much but just wanted to second that i've seen this happen on a recent
hammer to recent jewel upgrade.

mike

On Wed, Jan 18, 2017 at 4:25 AM, Piotr Dałek 
wrote:

> On 01/17/2017 12:52 PM, Piotr Dałek wrote:
>
>> During our testing we found out that during upgrade from 0.94.9 to 10.2.5
>> we're hitting issue http://tracker.ceph.com/issues/17386 ("Upgrading
>> 0.94.6
>> -> 0.94.9 saturating mon node networking"). Apparently, there's a few
>> commits for both hammer and jewel which are supposed to fix this issue for
>> upgrades from 0.94.6 to 0.94.9 (and possibly for others), but we're still
>> seeing this upgrading to Jewel, and symptoms are exactly same - after
>> upgrading MONs, each not yet upgraded OSD takes full OSDMap from monitors
>> after failing the CRC check. Anyone else encountered this?
>>
>
> http://tracker.ceph.com/issues/18582
>
> --
> Piotr Dałek
> piotr.da...@corp.ovh.com
> https://www.ovh.com/us/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] Crashing OSDs (suicide timeout, following a single pool)

2016-06-01 Thread Mike Lovell
On Wed, Jun 1, 2016 at 9:13 AM, Adam Tygart  wrote:
> Hello all,
>
> I'm running into an issue with ceph osds crashing over the last 4
> days. I'm running Jewel (10.2.1) on CentOS 7.2.1511.
>
> A little setup information:
> 26 hosts
> 2x 400GB Intel DC P3700 SSDs
> 12x6TB spinning disks
> 4x4TB spinning disks.
>
> The SSDs are used for both journals and as an OSD (for the cephfs
> metadata pool).
>
> We were running Ceph with some success in this configuration
> (upgrading ceph from hammer to infernalis to jewel) for the past 8-10
> months.
>
> Up through Friday, we were healthy.
>
> Until Saturday. On Saturday, the OSDs on the SSDs started flapping and
> then finally dying off, hitting their suicide timeout due to missing
> heartbeats. At the time, we were running Infernalis, getting ready to
> upgrade to Jewel.
>
> I spent the weekend and Monday, attempting to stabilize those OSDs,
> unfortunately failing. As part of the stabilzation attempts, I check
> iostat -x, the SSDs were seeing 1000 iops each. I checked wear levels,
> and overall SMART health of the SSDs, everything looks normal. I
> checked to make sure the time was in sync between all hosts.
>
> I also tried to move the metadata pool to the spinning disks (to
> remove some dependence on the SSDs, just in case). The suicide timeout
> issues followed the pool migration. The spinning disks started timing
> out. This was at a time when *all* of client the IOPs to the ceph
> cluster were in the low 100's as reported to by ceph -s. I was
> restarting failed OSDs as fast as they were dying and I couldn't keep
> up. I checked the switches and NICs for errors and drops. No changes
> in the frequency of them. We're talking an error every 20-25 minutes.
> I would expect network issues to affect other OSDs (and pools) in the
> system, too.
>
> On Tuesday, I got together with my coworker, and we tried together to
> stabilize the cluster. We finally went into emergency maintenance
> mode, as we could not get the metadata pool healthy. We stopped the
> MDS, we tried again to let things stabilize, with no client IO to the
> pool. Again more suicide timeouts.
>
> Then, we rebooted the ceph nodes, figuring there *might* be something
> stuck in a hardware IO queue or cache somewhere. Again more crashes
> when the machines came back up.
>
> We figured at this point, there was nothing to lose by performing the
> update to Jewel, and, who knows, maybe we were hitting a bug that had
> been fixed. Reboots were involved again (kernel updates, too).
>
> More crashes.
>
> I finally decided, that there *could* be an unlikely chance that jumbo
> frames might suddenly be an issue (after years of using them with
> these switches). I turned down the MTUs on the ceph nodes to the
> standard 1500.
>
> More crashes.
>
> We decided to try and let things settle out overnight, with no IO.
> That brings us to today:
>
> We have 51 Intel P3700 SSDs driving this pool, and now 26 of them have
> crashed due to the suicide timeout. I've tried starting them one at a
> time, they're still dying off with suicide timeouts.
>
> I've gathered the logs I could think to:
> A crashing OSD: http://people.cs.ksu.edu/~mozes/osd.16.log
> CRUSH Tree: http://people.cs.ksu.edu/~mozes/crushtree.txt
> OSD Tree: http://people.cs.ksu.edu/~mozes/osdtree.txt
> Pool Definitions: http://people.cs.ksu.edu/~mozes/pools.txt
>
> At the moment, we're dead in the water. I would appreciate any
> pointers to getting this fixed.
>
> --
> Adam Tygart
> Beocat Sysadmin

i spent some time trying to figure out what is happening from your
osd.16 log but i've run out of time i can spend right now on this.
here is what i think is happening.

if you grep for heartbeat_map in the log, you see that there are
heartbeat timeouts starting at failures for 'OSD::recovery_tp thread
0x7f34c5e41700' which ultimately leads to the osd committing suicide
after a period of time. specifically, the
osd_recovery_thread_suicide_timeout which defaults to 300 seconds.
looking for the thread id of 7f34c5e41700 in the log shows attempts to
do recovery on the PG 32.10c. this is the last log line that seems to
indicate any kind of progress on that recovery.

2016-06-01 09:26:54.683922 7f34c5e41700  7 osd.16 pg_epoch: 497663
pg[32.10c( v 477010'1607561 (459778'1604561,477010'1607561]
local-les=493771 n=3917 ec=44014 les/c/f 493771/486667/0
497332/497662/497662) [214,143,448]/[16] r=0 lpr=497662
pi=483321-497661/190 rops=1 bft=143,214,448 crt=0'0 lcod 0'0 mlcod 0'0
undersized+degraded+remapped+backfilling+peered] send_push_op
32:30966cd6:::100042c76a0.:head v 250315'1040233 size 0
recovery_info: 
ObjectRecoveryInfo(32:30966cd6:::100042c76a0.:head@250315'1040233,
size: 0, copy_subset: [], clone_subset: {})

the other osds in this PG are 143, 214, and 448. looking for osd.143
in the log shows that operations for another PG seem to be completing
until around the 09:26:55.537902 mark where osd.16 says that it and
osd.143 have 

Re: [ceph-users] help troubleshooting some osd communication problems

2016-04-29 Thread Mike Lovell
On Fri, Apr 29, 2016 at 9:34 AM, Mike Lovell <mike.lov...@endurance.com>
wrote:

> On Fri, Apr 29, 2016 at 5:54 AM, Alexey Sheplyakov <
> asheplya...@mirantis.com> wrote:
>
>> Hi,
>>
>> > i also wonder if just taking 148 out of the cluster (probably just
>> marking it out) would help
>>
>> As far as I understand this can only harm your data. The acting set of PG
>> 17.73 is  [41, 148],
>> so after stopping/taking out OSD 148  OSD 41 will store the only copy of
>> objects in PG 17.73
>> (so it won't accept writes any more).
>>
>> > since there are other osds in the up set (140 and 5)
>>
>> These OSDs are not in the acting set, they have no (at least some of the)
>> objects from PG 17.73,
>> and are copying the missing objects from OSDs 41 and 148. Naturally this
>> slows down or even
>> blocks writes to PG 17.73.
>>
>
> k. i didn't know if it could just use the members of the up set that are
> not in the acting set for completing writes. when thinking through it in my
> head it seemed reasonable but i could also see pitfalls with doing it.
> thats why i was asking if it was possible.
>
>
> > the only thing holding things together right now is a while loop doing
>> an 'ceph osd down 41' every minute
>>
>> As far as I understand this disturbs the backfilling and further delays
>> writes to that poor PG.
>>
>
> it definitely does seem to have an impact similar to that. the only upside
> is that it clears the slow io messages though i don't know if it actually
> lets the client io complete. recovery doesn't make any progress though in
> between the down commands. its not making any progress on its own anyways.
>

i went to check things this morning and noticed that the number of objects
misplaced had dropped from what i was expecting, and i was occasionally
seeing lines from ceph -w saying a number of objects were recovering. the
only PG in a state other than active+clean was the one that 41 and 148 were
bickering about, so it looks like they were now passing traffic. it appeared
to start just after one of the osd down events from the loop i had running.
a little while after the backfill started making progress, it completed. so
it's fine now. i would still like to try and find out the cause since this
has happened twice now, but at least it's not an emergency for me at the
moment.
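
a quick way to see which pgs are not active+clean while watching something
like this is one of the following (nothing fancy, just the stock commands):

# list any pgs that are stuck in a not-clean state
$ ceph pg dump_stuck unclean
# or just watch the cluster log scroll by
$ ceph -w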

one other thing that was odd was that i saw the misplaced objects go
negative during the backfill. this is one of the lines from ceph -w.

2016-04-29 10:38:15.011241 mon.0 [INF] pgmap v27055697: 6144 pgs: 6143
active+clean, 1 active+undersized+degraded+remapped+backfilling; 123 TB
data, 372 TB used, 304 TB / 691 TB avail; 130 MB/s rd, 135 MB/s wr, 11210
op/s; 14547/93845634 objects degraded (0.016%); -13959/93845634 objects
misplaced (-0.015%); 27358 kB/s, 7 objects/s recovering

it seemed to complete around the point where it got to -14.5k misplaced.
i'm guessing this is just a reporting error but i immediately started a
deep-scrub on the pg just to make sure things are consistent.
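
for completeness, kicking off the scrub and checking when it finished was
just the following; the pg id is the one from earlier in the thread:

$ ceph pg deep-scrub 17.73
# the deep scrub stamp in the pg info updates once it completes
$ ceph pg 17.73 query | grep deep_scrub_stamp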

mike


Re: [ceph-users] Backfilling caused RBD corruption on Hammer?

2016-04-29 Thread Mike Lovell
are the new osds running 0.94.5 or did they get the latest .6 packages? are
you also using cache tiering? we ran into a problem with individual rbd
objects getting corrupted when using 0.94.6 with a cache tier
where min_read_recency_for_promote was > 1. our only solution to the
corruption that happened was to restore from backup.
either setting min_read_recency_for_promote to 1 or making sure the osds
were running .5 was sufficient to prevent it from happening, though we
currently do both.
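
if you are on .6 with a cache tier, the workaround is just lowering the
recency setting on the cache pool. the pool name below is only an example:

# see what the cache pool is currently set to
$ ceph osd pool get ssd-cache min_read_recency_for_promote
# drop it to 1 until you are running a build with the fix
$ ceph osd pool set ssd-cache min_read_recency_for_promote 1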

mike

On Fri, Apr 29, 2016 at 9:41 AM, Robert Sander  wrote:

> Hi,
>
> yesterday we ran into a strange bug / mysterious issue with a Hammer
> 0.94.5 storage cluster.
>
> We added OSDs and the cluster started the backfilling. Suddenly one of
> the running VMs complained that it lost a partition in a 2TB RBD.
>
> After resetting the VM it could not boot any more as the RBD has no
> partition info at the start. :(
>
> It looks like the data in the objects has been changed somehow.
>
> How is that possible? Any ideas?
>
> The VM was restored from a backup but we would still like to know how
> this happened and maybe restore some data that was not backed up before
> the crash.
>
> Regards
> --
> Robert Sander
> Heinlein Support GmbH
> Schwedter Str. 8/9b, 10119 Berlin
>
> http://www.heinlein-support.de
>
> Tel: 030 / 405051-43
> Fax: 030 / 405051-19
>
> Mandatory disclosures per §35a GmbHG:
> HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
> Managing director: Peer Heinlein -- Registered office: Berlin
>
>
>
>


Re: [ceph-users] help troubleshooting some osd communication problems

2016-04-29 Thread Mike Lovell
On Fri, Apr 29, 2016 at 5:54 AM, Alexey Sheplyakov  wrote:

> Hi,
>
> > i also wonder if just taking 148 out of the cluster (probably just
> marking it out) would help
>
> As far as I understand this can only harm your data. The acting set of PG
> 17.73 is  [41, 148],
> so after stopping/taking out OSD 148  OSD 41 will store the only copy of
> objects in PG 17.73
> (so it won't accept writes any more).
>
> > since there are other osds in the up set (140 and 5)
>
> These OSDs are not in the acting set, they have no (at least some of the)
> objects from PG 17.73,
> and are copying the missing objects from OSDs 41 and 148. Naturally this
> slows down or even
> blocks writes to PG 17.73.
>

k. i didn't know if it could just use the members of the up set that are
not in the acting set for completing writes. when thinking through it in my
head it seemed reasonable but i could also see pitfalls with doing it.
thats why i was asking if it was possible.


> the only thing holding things together right now is a while loop doing an
> 'ceph osd down 41' every minute
>
> As far as I understand this disturbs the backfilling and further delays
> writes to that poor PG.
>

it definitely does seem to have an impact similar to that. the only upside
is that it clears the slow io messages though i don't know if it actually
lets the client io complete. recovery doesn't make any progress though in
between the down commands. its not making any progress on its own anyways.

mike


Re: [ceph-users] help troubleshooting some osd communication problems

2016-04-29 Thread Mike Lovell
i attempted to grab some logs from the two osds in question with debug_ms
and debug_osd at 20. i have looked through them a little bit but digging
through the logs at this verbosity is something i don't have much
experience with. hopefully someone on the list can help make sense of it.
the logs are at these urls.

http://stuff.dev-zero.net/ceph-osd.148.debug.log.gz
http://stuff.dev-zero.net/ceph-osd.41.debug.log.gz
http://stuff.dev-zero.net/ceph.mon.log.gz

the last one is a trimmed portion of the ceph.log from one of the monitors
for the time frame the osd logs cover. to make these, i moved the existing
log file, set the increased verbosity, had the osds reopen their log files,
gave it a few minutes, moved the log files again, and had the osds reopen
their logs a second time. this resulted in something that is hopefully just
enough context to see whats going on.
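
roughly, per osd, that procedure amounted to something like the following.
the osd id and paths are the ones from this thread, and i believe the
daemons reopen their logs on SIGHUP, which is what the stock logrotate
script relies on:

$ mv /var/log/ceph/ceph-osd.41.log /var/log/ceph/ceph-osd.41.log.orig
$ ceph tell osd.41 injectargs '--debug-ms 20 --debug-osd 20'
# HUP reopens the log files (this hits every osd on the host, which is
# harmless for a log reopen)
$ kill -HUP $(pidof ceph-osd)
# ... wait a few minutes for the problem to show up in the new log ...
$ mv /var/log/ceph/ceph-osd.41.log /var/log/ceph/ceph-osd.41.debug.log
$ kill -HUP $(pidof ceph-osd)
$ ceph tell osd.41 injectargs '--debug-ms 1 --debug-osd 1'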

i did a 'ceph osd down 41' at about the 20:40:06 mark and the cluster seemed
to report normal data for the next 30 seconds. after that, the slow io
messages from both osds about ops from each other started appearing. i tried
tracing a few ops in both logs but couldn't make sense of it. can anyone
help by taking a look and/or offer pointers on how to understand what's
going on?

oh. this is 0.94.5. the basic cluster layout is two racks with 9 nodes in
each rack with either 12 or 14 osds per node. ssd cache tiering is being
used. the pools are just replicated ones with a size of 3. here is the data
from pg dump for the pg that isn't making progress on recovery, which i'm
guessing is a result of the same problem. the workload is a bunch of vms
with rbd.

pg_stat: 17.73
objects: 14545
mip: 0
degr: 14545
misp: 14547
unf: 0
bytes: 62650182662
log: 10023
disklog: 10023
state: active+undersized+degraded+remapped+backfilling
state_stamp: 2016-04-29 01:59:42.148644
v: 161768'26740604
reported: 161768:37459478
up: [140,5,41]
up_primary: 140
acting: [41,148]
acting_primary: 41
last_scrub: 55547'11246156
scrub_stamp: 2015-09-15 08:53:32.724322
last_deep_scrub: 53282'7470580
deep_scrub_stamp: 2015-09-01 07:19:45.054261
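
that came straight out of pg dump. to pull the row for a single pg, or to
get the full detail for it, something like this works:

# grab just this pg's row from the full dump (stderr has a status line)
$ ceph pg dump 2>/dev/null | grep '^17\.73 '
# or dump everything the osds know about the pg
$ ceph pg 17.73 query > pg-17.73-query.json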

i also wonder if just taking 148 out of the cluster (probably just marking
it out) would help. the min size is 2 but, since there are other osds in
the up set (140 and 5), will the cluster keep working? or will it block
until the PG has finished with recovery to the new osds?
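
for reference, the replication settings in question can be checked per
pool; the pool name here is just a placeholder for whatever pool 17 is:

$ ceph osd pool get rbd-volumes size
$ ceph osd pool get rbd-volumes min_size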

thanks in advance. hopefully someone can help soon because right now the
only thing holding things together is a while loop doing a 'ceph
osd down 41' every minute. :(
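
the loop is nothing fancy, basically just:

$ while true; do ceph osd down 41; sleep 60; done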

mike

On Thu, Apr 28, 2016 at 5:49 PM, Samuel Just <sj...@redhat.com> wrote:

> I'd guess that to make any progress we'll need debug ms = 20 on both
> sides of the connection when a message is lost.
> -Sam
>
> On Thu, Apr 28, 2016 at 2:38 PM, Mike Lovell <mike.lov...@endurance.com>
> wrote:
> > there was a problem on one of the clusters i manage a couple weeks ago
> where
> > pairs of OSDs would wait indefinitely on subops from the other OSD in the
> > pair. we used a liberal dose of "ceph osd down ##" on the osds and
> > eventually things just sorted them out a couple days later.
> >
> > it seems to have come back today and co-workers and i are stuck on
> trying to
> > figure out why this is happening. here are the details that i know.
> > currently 2 OSDs, 41 and 148, keep waiting on subops from each other
> > resulting in lines such as the following in ceph.log.
> >
> > 2016-04-28 13:29:26.875797 osd.41 10.217.72.22:6802/3769 56283 : cluster
> > [WRN] slow request 30.642736 seconds old, received at 2016-04-28
> > 13:28:56.233001: osd_op(client.11172360.0:516946146
> > rbd_data.36bfe359c4998.0d08 [set-alloc-hint object_size
> 4194304
> > write_size 4194304,write 1835008~143360] 17.3df49873 RETRY=1
> > ack+ondisk+retry+write+redirected+known_if_redirected e159001) currently
> > waiting for subops from 5,140,148
> >
> > 2016-04-28 13:29:28.031452 osd.148 10.217.72.11:6820/6487 25324 :
> cluster
> > [WRN] slow request 30.960922 seconds old, received at 2016-04-28
> > 13:28:57.070471: osd_op(client.24127500.0:2960618
> > rbd_data.38178d8adeb4d.10f8 [set-alloc-hint object_size
> 8388608
> > write_size 8388608,write 3194880~4096] 17.fb41a37c RETRY=1
> > ack+ondisk+retry+write+redirected+known_if_redirected e159001) currently
> > waiting for subops from 41,115
> >
> > from digging in the logs, it appears like some messages are being lost
> > between the OSDs. this is what osd.41 sees:
> > -
> > 2016-04-28 13:28:56.233702 7f3b171e0700  1 -- 10.217.72.22:6802/3769 <==
> > client.11172360 10.217.72.41:0/6031968 6 
> > osd_op(client.11172360.0:516946146
> rbd_data.36bfe359c4998.0d08
> > [

[ceph-users] help troubleshooting some osd communication problems

2016-04-28 Thread Mike Lovell
there was a problem on one of the clusters i manage a couple weeks ago
where pairs of OSDs would wait indefinitely on subops from the other OSD in
the pair. we used a liberal dose of "ceph osd down ##" on the osds and
eventually things just sorted themselves out a couple days later.

it seems to have come back today and co-workers and i are stuck on trying
to figure out why this is happening. here are the details that i know.
currently 2 OSDs, 41 and 148, keep waiting on subops from each other
resulting in lines such as the following in ceph.log.

2016-04-28 13:29:26.875797 osd.41 10.217.72.22:6802/3769 56283 : cluster
[WRN] slow request 30.642736 seconds old, received at 2016-04-28
13:28:56.233001: osd_op(client.11172360.0:516946146
rbd_data.36bfe359c4998.0d08 [set-alloc-hint object_size 4194304
write_size 4194304,write 1835008~143360] 17.3df49873 RETRY=1
ack+ondisk+retry+write+redirected+known_if_redirected e159001) currently
waiting for subops from 5,140,148

2016-04-28 13:29:28.031452 osd.148 10.217.72.11:6820/6487 25324 : cluster
[WRN] slow request 30.960922 seconds old, received at 2016-04-28
13:28:57.070471: osd_op(client.24127500.0:2960618
rbd_data.38178d8adeb4d.10f8 [set-alloc-hint object_size 8388608
write_size 8388608,write 3194880~4096] 17.fb41a37c RETRY=1
ack+ondisk+retry+write+redirected+known_if_redirected e159001) currently
waiting for subops from 41,115

from digging in the logs, it appears like some messages are being lost
between the OSDs. this is what osd.41 sees:
-
2016-04-28 13:28:56.233702 7f3b171e0700  1 -- 10.217.72.22:6802/3769 <==
client.11172360 10.217.72.41:0/6031968 6 
osd_op(client.11172360.0:516946146 rbd_data.36bfe359c4998.0d08
[set-alloc-hint object_size 4194304 write_size 4194304,write
1835008~143360] 17.3df49873 RETRY=1
ack+ondisk+retry+write+redirected+known_if_redirected e159001) v5 
236+0+143360 (781016428 0 3953649960) 0x1d551c00 con 0x1a78d9c0
2016-04-28 13:28:56.233983 7f3b49020700  1 -- 10.217.89.22:6825/313003769
--> 10.217.89.18:6806/1010441 -- osd_repop(client.11172360.0:516946146
17.73 3df49873/rbd_data.36bfe359c4998.0d08/head//17 v
159001'26722799) v1 -- ?+46 0x1d6db200 con 0x21add440
2016-04-28 13:28:56.234017 7f3b49020700  1 -- 10.217.89.22:6825/313003769
--> 10.217.89.11:6810/4543 -- osd_repop(client.11172360.0:516946146 17.73
3df49873/rbd_data.36bfe359c4998.0d08/head//17 v
159001'26722799) v1 -- ?+46 0x1d6dd000 con 0x21ada000
2016-04-28 13:28:56.234046 7f3b49020700  1 -- 10.217.89.22:6825/313003769
--> 10.217.89.11:6812/43006487 -- osd_repop(client.11172360.0:516946146
17.73 3df49873/rbd_data.36bfe359c4998.0d08/head//17 v
159001'26722799) v1 -- ?+144137 0x14becc00 con 0xf2cd4a0
2016-04-28 13:28:56.243555 7f3b35976700  1 -- 10.217.89.22:6825/313003769
<== osd.140 10.217.89.11:6810/4543 23 
osd_repop_reply(client.11172360.0:516946146 17.73 ondisk, result = 0) v1
 83+0+0 (494696391 0 0) 0x28ea7b00 con 0x21ada000
2016-04-28 13:28:56.257816 7f3b27d9b700  1 -- 10.217.89.22:6825/313003769
<== osd.5 10.217.89.18:6806/1010441 35 
osd_repop_reply(client.11172360.0:516946146 17.73 ondisk, result = 0) v1
 83+0+0 (2393425574 0 0) 0xfe82fc0 con 0x21add440


this, however is what osd.148 sees:
-
[ulhglive-root@ceph1 ~]# grep :516946146 /var/log/ceph/ceph-osd.148.log
2016-04-28 13:29:33.470156 7f195fcfc700  1 -- 10.217.72.11:6820/6487 <==
client.11172360 10.217.72.41:0/6031968 460 
osd_op(client.11172360.0:516946146 rbd_data.36bfe359c4998.0d08
[set-alloc-hint object_size 4194304 write_size 4194304,write
1835008~143360] 17.3df49873 RETRY=2
ack+ondisk+retry+write+redirected+known_if_redirected e159002) v5 
236+0+143360 (129493315 0 3953649960) 0x1edf2300 con 0x24dc0d60

also, due to the ceph osd down commands, there is recovery that needs to
happen for a PG shared between these OSDs that is never making any
progress. it's probably due to whatever is causing the repops to fail.

i did some tcpdump on both sides, limiting things to the ip addresses and
ports being used by these two OSDs, and saw packets flowing between the two
osds. i attempted to have wireshark decode the actual ceph traffic but it
was only able to make out bits and pieces of the ceph protocol; at least
for the moment i'm blaming that on the ceph dissector for wireshark. there
aren't any dropped or error packets on any of the network interfaces
involved.
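
in case it matters, the captures were done with something along these
lines; the interface name is specific to our hosts and the addresses and
ports are the ones from the log excerpts above:

# capture the cluster-network traffic between osd.41 and osd.148
$ tcpdump -i bond0 -s 0 -w osd41-osd148.pcap \
    'host 10.217.89.22 and host 10.217.89.11 and (port 6825 or port 6812)'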

does anyone have any ideas of where to look next or other tips for this?
we've put debug_ms and debug_osd at 1/1 to get the bits of info mentioned.
putting them at 20 probably isn't going to be helpful, so does anyone have
a suggestion on another level that might be useful? go figure that this
would happen while i'm at the openstack summit, keeping me from paying
attention to some interesting presentations.

thanks in advance for any help.

mike

Re: [ceph-users] data corruption with hammer

2016-03-19 Thread Mike Lovell
just got done with a test against a build of 0.94.6 minus the two commits
that were backported in PR 7207. everything worked as it should with the
cache-mode set to writeback and the min_read_recency_for_promote set to 2.
assuming it works properly on master, there must be a commit that we're
missing on the backport to support this properly.

sage,
i'm adding you to the recipients on this so hopefully you see it. the tl;dr
version is that the backport of the cache recency fix to hammer doesn't
work right and potentially corrupts data when
the min_read_recency_for_promote is set to greater than 1.

mike

On Wed, Mar 16, 2016 at 4:41 PM, Mike Lovell <mike.lov...@endurance.com>
wrote:

> robert and i have done some further investigation the past couple days on
> this. we have a test environment with a hard drive tier and an ssd tier as
> a cache. several vms were created with volumes from the ceph cluster. i did
> a test in each guest where i un-tarred the linux kernel source multiple
> times and then did a md5sum check against all of the files in the resulting
> source tree. i started off with the monitors and osds running 0.94.5 and
> never saw any problems.
>
> a single node was then upgraded to 0.94.6 which has osds in both the ssd
> and hard drive tier. i then proceeded to run the same test and, while the
> untar and md5sum operations were running, i changed the ssd tier cache-mode
> from forward to writeback. almost immediately the vms started reporting io
> errors and odd data corruption. the remainder of the cluster was updated to
> 0.94.6, including the monitors, and the same thing happened.
>
> things were cleaned up and reset and then a test was run
> where min_read_recency_for_promote for the ssd cache pool was set to 1. we
> previously had it set to 6. there was never an error with the recency
> setting set to 1. i then tested with it set to 2 and it immediately caused
> failures. we are currently thinking that it is related to the backport of
> the fix for the recency promotion and are in progress of making a .6 build
> without that backport to see if we can cause corruption. is anyone using a
> version from after the original recency fix (PR 6702) with a cache tier in
> writeback mode? anyone have a similar problem?
>
> mike
>
> On Mon, Mar 14, 2016 at 8:51 PM, Mike Lovell <mike.lov...@endurance.com>
> wrote:
>
>> something weird happened on one of the ceph clusters that i administer
>> tonight which resulted in virtual machines using rbd volumes seeing
>> corruption in multiple forms.
>>
>> when everything was fine earlier in the day, the cluster was a number of
>> storage nodes spread across 3 different roots in the crush map. the first
>> bunch of storage nodes have both hard drives and ssds in them with the hard
>> drives in one root and the ssds in another. there is a pool for each and
>> the pool for the ssds is a cache tier for the hard drives. the last set of
>> storage nodes were in a separate root with their own pool that is being
>> used for burn in testing.
>>
>> these nodes had run for a while with test traffic and we decided to move
>> them to the main root and pools. the main cluster is running 0.94.5 and the
>> new nodes got 0.94.6 due to them getting configured after that was
>> released. i removed the test pool and did a ceph osd crush move to move the
>> first node into the main cluster, the hard drives into the root for that
>> tier of storage and the ssds into the root and pool for the cache tier.
>> each set was done about 45 minutes apart and they ran for a couple hours
>> while performing backfill without any issue other than high load on the
>> cluster.
>>
>> we normally run the ssd tier in the forward cache-mode due to the ssds we
>> have not being able to keep up with the io of writeback. this results in io
>> on the hard drives slowly going up and performance of the cluster starting
>> to suffer. about once a week, i change the cache-mode between writeback and
>> forward for short periods of time to promote actively used data to the
>> cache tier. this moves io load from the hard drive tier to the ssd tier and
>> has been done multiple times without issue. i normally don't do this while
>> there are backfills or recoveries happening on the cluster but decided to
>> go ahead while backfill was happening due to the high load.
>>
>> i tried this procedure to change the ssd cache-tier between writeback and
>> forward cache-mode and things seemed okay from the ceph cluster. about 10
>> minutes after the first attempt a changing the mode, vms using the ceph
>> cluster for their storage started seeing corruption in multiple forms. the
>

Re: [ceph-users] data corruption with hammer

2016-03-15 Thread Mike Lovell
there are not any monitors running on the new nodes. the monitors are on
separate nodes and running the 0.94.5 release. i spent some time thinking
about this last night as well and my thoughts went to the recency patches.
i wouldn't think that caused this but its the only thing that seems close.

mike

On Mon, Mar 14, 2016 at 9:35 PM, Christian Balzer <ch...@gol.com> wrote:

>
> Hello,
>
> On Mon, 14 Mar 2016 20:51:04 -0600 Mike Lovell wrote:
>
> > something weird happened on one of the ceph clusters that i administer
> > tonight which resulted in virtual machines using rbd volumes seeing
> > corruption in multiple forms.
> >
> > when everything was fine earlier in the day, the cluster was a number of
> > storage nodes spread across 3 different roots in the crush map. the first
> > bunch of storage nodes have both hard drives and ssds in them with the
> > hard drives in one root and the ssds in another. there is a pool for
> > each and the pool for the ssds is a cache tier for the hard drives. the
> > last set of storage nodes were in a separate root with their own pool
> > that is being used for burn in testing.
> >
> > these nodes had run for a while with test traffic and we decided to move
> > them to the main root and pools. the main cluster is running 0.94.5 and
> > the new nodes got 0.94.6 due to them getting configured after that was
> > released. i removed the test pool and did a ceph osd crush move to move
> > the first node into the main cluster, the hard drives into the root for
> > that tier of storage and the ssds into the root and pool for the cache
> > tier. each set was done about 45 minutes apart and they ran for a couple
> > hours while performing backfill without any issue other than high load
> > on the cluster.
> >
> Since I glanced what your setup looks like from Robert's posts and yours I
> won't say the obvious thing, as you aren't using EC pools.
>
> > we normally run the ssd tier in the forward cache-mode due to the ssds we
> > have not being able to keep up with the io of writeback. this results in
> > io on the hard drives slowly going up and performance of the cluster
> > starting to suffer. about once a week, i change the cache-mode between
> > writeback and forward for short periods of time to promote actively used
> > data to the cache tier. this moves io load from the hard drive tier to
> > the ssd tier and has been done multiple times without issue. i normally
> > don't do this while there are backfills or recoveries happening on the
> > cluster but decided to go ahead while backfill was happening due to the
> > high load.
> >
> As you might recall, I managed to have "rados bench" break (I/O error) when
> doing these switches with Firefly on my crappy test cluster, but not with
> Hammer.
> However I haven't done any such switches on my production cluster with a
> cache tier, both because the cache pool hasn't even reached 50% capacity
> after 2 weeks of pounding and because I'm sure that everything will hold
> up when it comes to the first flushing.
>
> Maybe the extreme load (as opposed to normal VM ops) of your cluster
> during the backfilling triggered the same or a similar bug.
>
> > i tried this procedure to change the ssd cache-tier between writeback and
> > forward cache-mode and things seemed okay from the ceph cluster. about 10
> > minutes after the first attempt a changing the mode, vms using the ceph
> > cluster for their storage started seeing corruption in multiple forms.
> > the mode was flipped back and forth multiple times in that time frame
> > and its unknown if the corruption was noticed with the first change or
> > subsequent changes. the vms were having issues of filesystems having
> > errors and getting remounted RO and mysql databases seeing corruption
> > (both myisam and innodb). some of this was recoverable but on some
> > filesystems there was corruption that led to things like lots of data
> > ending up in the lost+found and some of the databases were
> > un-recoverable (backups are helping there).
> >
> > i'm not sure what would have happened to cause this corruption. the
> > libvirt logs for the qemu processes for the vms did not provide any
> > output of problems from the ceph client code. it doesn't look like any
> > of the qemu processes had crashed. also, it has now been several hours
> > since this happened with no additional corruption noticed by the vms. it
> > doesn't appear that we had any corruption happen before i attempted the
> > flipping of the ssd tier cache-mode.
> >
> > the only think i can think of that is different b

[ceph-users] data corruption with hammer

2016-03-14 Thread Mike Lovell
something weird happened on one of the ceph clusters that i administer
tonight which resulted in virtual machines using rbd volumes seeing
corruption in multiple forms.

when everything was fine earlier in the day, the cluster was a number of
storage nodes spread across 3 different roots in the crush map. the first
bunch of storage nodes have both hard drives and ssds in them with the hard
drives in one root and the ssds in another. there is a pool for each and
the pool for the ssds is a cache tier for the hard drives. the last set of
storage nodes were in a separate root with their own pool that is being
used for burn in testing.

these nodes had run for a while with test traffic and we decided to move
them to the main root and pools. the main cluster is running 0.94.5 and the
new nodes got 0.94.6 due to them getting configured after that was
released. i removed the test pool and did a ceph osd crush move to move the
first node into the main cluster, the hard drives into the root for that
tier of storage and the ssds into the root and pool for the cache tier.
each set was done about 45 minutes apart and they ran for a couple hours
while performing backfill without any issue other than high load on the
cluster.

we normally run the ssd tier in the forward cache-mode due to the ssds we
have not being able to keep up with the io of writeback. this results in io
on the hard drives slowly going up and performance of the cluster starting
to suffer. about once a week, i change the cache-mode between writeback and
forward for short periods of time to promote actively used data to the
cache tier. this moves io load from the hard drive tier to the ssd tier and
has been done multiple times without issue. i normally don't do this while
there are backfills or recoveries happening on the cluster but decided to
go ahead while backfill was happening due to the high load.
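
for anyone not familiar with the knob, the flip itself is just the tier
cache-mode command. the pool name below is only an example, and some
releases want an extra confirmation flag when going to forward:

# let hot objects get promoted for a while
$ ceph osd tier cache-mode ssd-cache writeback
# ... some time later, switch back ...
$ ceph osd tier cache-mode ssd-cache forward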

i tried this procedure to change the ssd cache-tier between writeback and
forward cache-mode and things seemed okay from the ceph cluster. about 10
minutes after the first attempt a changing the mode, vms using the ceph
cluster for their storage started seeing corruption in multiple forms. the
mode was flipped back and forth multiple times in that time frame and its
unknown if the corruption was noticed with the first change or subsequent
changes. the vms were having issues of filesystems having errors and
getting remounted RO and mysql databases seeing corruption (both myisam and
innodb). some of this was recoverable but on some filesystems there was
corruption that led to things like lots of data ending up in the
lost+found and some of the databases were un-recoverable (backups are
helping there).

i'm not sure what would have happened to cause this corruption. the libvirt
logs for the qemu processes for the vms did not provide any output of
problems from the ceph client code. it doesn't look like any of the qemu
processes had crashed. also, it has now been several hours since this
happened with no additional corruption noticed by the vms. it doesn't
appear that we had any corruption happen before i attempted the flipping of
the ssd tier cache-mode.

the only thing i can think of that is different between this time doing
this procedure vs previous attempts was that there was one storage node
running 0.94.6 where the remainder were running 0.94.5. is it possible that
something changed between these two releases that would have caused
problems with data consistency related to the cache tier? or otherwise? any
other thoughts or suggestions?

thanks in advance for any help you can provide.

mike


Re: [ceph-users] osds crashing on Thread::create

2016-03-07 Thread Mike Lovell
i just checked several of the osds running in the environment and the hard
and soft limits for the number of processes are set to 257486. if it's
exceeding that, then it seems like there would still be a bug somewhere. i
can't imagine it needing that many.

$ for N in `pidof ceph-osd`; do echo ${N}; sudo grep processes /proc/${N}/limits; done
8761
Max processes 257486   257486   processes
7744
Max processes 257486   257486   processes
5536
Max processes 257486   257486   processes
4717
Max processes 257486   257486   processes

i did go looking through the ceph init script and didn't find where that
was getting set, and there's no reference to setrlimit in the code, so i'm
not sure how that gets set.

this did lead me into looking at how many threads were getting created per
process and how many there were total on the system. it looks like there
are a total of just over 30k tasks (pids and threads) on the systems.
i just set kernel.pid_max to 64k and will keep an eye on it. it would make
sense that this is the problem. i'm a little surprised to see it get this
close with only 12 osds running. it looks like they're creating over 2500
threads each. i don't know the internals of the code but that seems like a
lot. oh well. hopefully this fixes it.
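
the checks and the change amounted to roughly this; the 64k value is just
the number i picked:

# threads per osd process and a rough total task count for the box
$ for N in $(pidof ceph-osd); do echo -n "${N}: "; ls /proc/${N}/task | wc -l; done
$ ps -eLf | wc -l
# raise the kernel-wide limit and make it persistent (as root)
$ sysctl -w kernel.pid_max=65536
$ echo 'kernel.pid_max = 65536' >> /etc/sysctl.conf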

mike

On Mon, Mar 7, 2016 at 1:55 PM, Gregory Farnum <gfar...@redhat.com> wrote:

> On Mon, Mar 7, 2016 at 11:04 AM, Mike Lovell <mike.lov...@endurance.com>
> wrote:
> > first off, hello all. this is my first time posting to the list.
> >
> > i have seen a recurring problem that has started in the past week or so
> on
> > one of my ceph clusters. osds will crash and it seems to happen whenever
> > backfill or recovery is started. looking at the logs it appears that the
> the
> > osd is asserting in src/common/Thread.cc when it tries to create a new
> > thread. these osds are running 0.94.5 and i believe
> > https://github.com/ceph/ceph/blob/v0.94.5/src/common/Thread.cc#L129 is
> the
> > assert that is being hit. i looked back through the code for a couple
> > minutes and it looks like its asserting on pthread_create returning
> > something besides 0. i'm not sure why pthread_create would be failing
> and it
> > looks like it just writes what the return code is to stderr. i also
> wasn't
> > able to determine where the output of stderr ended up from my osds. it
> looks
> > like from looking at /proc//fd/{0,1,2} and lsof that stderr is a
> unix
> > socket but i don't see where it goes after that. the osds are started by
> > ceph-disk activate.
> >
> > do any of you have any ideas as to what might be causing this? or how i
> > might further troubleshoot this? i'm attaching a trimmed version of the
> osd
> > log. i removed some extraneous bits from after the osds was restarted
> and a
> > large amount of 'recent events' that were from well before the crash.
>
> Usually you just need to increase the ulimits for thread/process
> counts, on the ceph user account or on the system as a whole. Check
> the docs and the startup scripts.
> -Greg
>


[ceph-users] osds crashing on Thread::create

2016-03-07 Thread Mike Lovell
first off, hello all. this is my first time posting to the list.

i have seen a recurring problem that started in the past week or so on
one of my ceph clusters. osds will crash and it seems to happen whenever
backfill or recovery is started. looking at the logs, it appears that the
osd is asserting in src/common/Thread.cc when it tries to create a new
thread. these osds are running 0.94.5 and i believe
https://github.com/ceph/ceph/blob/v0.94.5/src/common/Thread.cc#L129 is the
assert that is being hit. i looked back through the code for a couple
minutes and it looks like its asserting on pthread_create returning
something besides 0. i'm not sure why pthread_create would be failing and
it looks like it just writes what the return code is to stderr. i also
wasn't able to determine where the output of stderr ended up from my osds.
from looking at /proc//fd/{0,1,2} and lsof, it looks like stderr
is a unix socket, but i don't see where it goes after that. the osds are
started by ceph-disk activate.

do any of you have any ideas as to what might be causing this? or how i
might further troubleshoot this? i'm attaching a trimmed version of the osd
log. i removed some extraneous bits from after the osd was restarted and a
large amount of 'recent events' that were from well before the crash.

thanks

mike
2016-03-07 10:51:08.739907 7fb0c56a1700  0 -- 10.208.16.26:6802/7034 >> 
10.208.16.42:0/3019478 pipe(0x47aab000 sd=1360 :6802 s=0 pgs=0 cs=0 l=1 
c=0x1c96f080).accept replacing existing (lossy) channel (new one lossy=1)
2016-03-07 12:09:27.465221 7fb0d5a7f700  0 -- 10.208.12.26:6802/7034 >> 
10.208.12.26:6800/2245 pipe(0x2b1ce000 sd=1228 :6802 s=0 pgs=0 cs=0 l=0 
c=0x55dd020).accept connect_seq 10 vs existing 10 state standby
2016-03-07 12:09:27.465290 7fb0c9f05700  0 -- 10.208.12.26:6802/7034 >> 
10.208.12.26:6807/1020674 pipe(0x3bce sd=1259 :6802 s=0 pgs=0 cs=0 l=0 
c=0x55da2c0).accept connect_seq 13 vs existing 13 state standby
2016-03-07 12:09:27.466053 7fb0c9f05700  0 -- 10.208.12.26:6802/7034 >> 
10.208.12.26:6807/1020674 pipe(0x3bce sd=1259 :6802 s=0 pgs=0 cs=0 l=0 
c=0x55da2c0).accept connect_seq 14 vs existing 13 state standby
2016-03-07 12:09:27.466106 7fb0d5a7f700  0 -- 10.208.12.26:6802/7034 >> 
10.208.12.26:6800/2245 pipe(0x2b1ce000 sd=1228 :6802 s=0 pgs=0 cs=0 l=0 
c=0x55dd020).accept connect_seq 11 vs existing 10 state standby
2016-03-07 12:09:28.358657 7fb135fd8700  0 -- 10.208.12.26:6802/7034 >> 
10.208.12.24:6800/26784 pipe(0x534c6000 sd=1311 :56029 s=2 pgs=35 cs=1 l=0 
c=0x359f75a0).fault with nothing to send, going to standby
2016-03-07 12:09:28.359955 7fb0b0e36700  0 -- 10.208.12.26:0/7034 >> 
10.208.12.24:6801/26784 pipe(0x3b466000 sd=1103 :0 s=1 pgs=0 cs=0 l=1 
c=0x535351e0).fault
2016-03-07 12:09:28.360759 7fb0b0d35700  0 -- 10.208.12.26:0/7034 >> 
10.208.16.24:6801/26784 pipe(0x5ddc2000 sd=1157 :0 s=1 pgs=0 cs=0 l=1 
c=0x53535340).fault
2016-03-07 12:09:28.469563 7fb160395700  0 log_channel(cluster) log [INF] : 
13.671 restarting backfill on osd.166 from (116308'3515965,116359'3519022] MAX 
to 117741'3716137
2016-03-07 12:09:28.469613 7fb15fb94700  0 log_channel(cluster) log [INF] : 
13.5e3 restarting backfill on osd.166 from (116308'3293585,116359'329] MAX 
to 117741'3476172
2016-03-07 12:09:28.478230 7fb160395700  0 log_channel(cluster) log [INF] : 
13.42d restarting backfill on osd.166 from (116308'5692257,116359'5695353] MAX 
to 117741'5986232
2016-03-07 12:09:28.479461 7fb15fb94700  0 log_channel(cluster) log [INF] : 
13.6da restarting backfill on osd.166 from (116308'3186858,116359'3189912] MAX 
to 117741'3327862
2016-03-07 12:09:28.493689 7fb15fb94700  0 log_channel(cluster) log [INF] : 
13.6da restarting backfill on osd.190 from (0'0,0'0] MAX to 117741'3327862
2016-03-07 12:09:28.508933 7fb160395700  0 log_channel(cluster) log [INF] : 
13.791 restarting backfill on osd.166 from (116308'4423278,116359'4426295] MAX 
to 117741'4593697
2016-03-07 12:09:28.603153 7fb1338b1700  0 -- 10.208.12.26:6802/7034 >> 
10.208.12.24:6800/26784 pipe(0x534c6000 sd=1157 :56029 s=1 pgs=35 cs=1 l=0 
c=0x359f75a0).fault
2016-03-07 12:09:29.482123 7fb15fb94700  0 log_channel(cluster) log [INF] : 
13.650 restarting backfill on osd.166 from (116308'2664728,116359'2667752] MAX 
to 117741'2770454
2016-03-07 12:09:45.637558 7fb170d22700 -1 osd.175 117745 heartbeat_check: no 
reply from osd.141 since back 2016-03-07 12:09:25.523264 front 2016-03-07 
12:09:25.523264 (cutoff 2016-03-07 12:09:25.637555)
2016-03-07 12:09:46.638025 7fb170d22700 -1 osd.175 117745 heartbeat_check: no 
reply from osd.141 since back 2016-03-07 12:09:25.523264 front 2016-03-07 
12:09:25.523264 (cutoff 2016-03-07 12:09:26.638022)
2016-03-07 12:09:47.638159 7fb170d22700 -1 osd.175 117745 heartbeat_check: no 
reply from osd.141 since back 2016-03-07 12:09:25.523264 front 2016-03-07 
12:09:25.523264 (cutoff 2016-03-07 12:09:27.638154)
2016-03-07 12:09:48.231986 7fb158b86700 -1 osd.175 117745 heartbeat_check: no 
reply