I have a basic related question about Firefly operation and would appreciate
any insights:
With three replicas, if checksum inconsistencies across replicas are found
during deep-scrub:
a. Does the majority win, or is the primary always the winner and used
to overwrite the secondaries?
b. Is this reconciliation done automatically during deep-scrub,
or does each reconciliation have to be executed manually by the administrator?
With two replicas, how are things different (if at all)?
a. Is the primary declared the winner?
b. Is this reconciliation done automatically during deep-scrub,
or does it have to be done manually because there is no majority?
Thanks,
-Sudip
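
For what it's worth, a minimal sketch of how one might pull the inconsistent
PG ids out of `ceph health detail` output for closer inspection. The sample
text is hard-coded here to mirror the output format quoted later in this
thread; on a live cluster you would pipe the command's output in instead.

```shell
# Sketch: extract the ids of inconsistent PGs from `ceph health detail`
# output. The sample below is hard-coded for illustration; on a live
# cluster, replace it with the real command's output.
health_output='HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 3.c6 is active+clean+inconsistent, acting [2,5]
1 scrub errors'

# Print the id (second field) of every PG line flagged inconsistent.
printf '%s\n' "$health_output" | awk '/ is .*inconsistent/ {print $2}'
```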
-----Original Message-----
From: ceph-users [mailto:[email protected]] On Behalf Of Samuel
Just
Sent: Thursday, July 10, 2014 10:16 AM
To: Christian Eichelmann
Cc: [email protected]
Subject: Re: [ceph-users] scrub error on firefly
Can you attach your ceph.conf for your osds?
-Sam
On Thu, Jul 10, 2014 at 8:01 AM, Christian Eichelmann
<[email protected]> wrote:
> I can also confirm that after upgrading to Firefly, both of our
> clusters (test and live) went from 0 scrub errors each for about six
> months to about 9-12 per week...
> This also makes me kind of nervous, since as far as I know all that
> "ceph pg repair" does is copy the primary object to all replicas,
> no matter which copy is the correct one.
> Of course the described method of manual checking works (for pools
> with more than 2 replicas), but doing this in a large cluster nearly
> every week is horribly time-consuming and error-prone.
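
The manual majority check Christian describes can at least be scripted: hash
each replica of a suspect object and flag the odd one out. A rough sketch,
with placeholder file arguments; on a real cluster each copy would sit under
its OSD's data directory (e.g. under /var/lib/ceph/osd/).

```shell
# Sketch: given three replica files of the same object, report which one
# disagrees with the majority. Arguments are placeholder paths.
check_majority() {
    a=$(md5sum "$1" | cut -d' ' -f1)
    b=$(md5sum "$2" | cut -d' ' -f1)
    c=$(md5sum "$3" | cut -d' ' -f1)
    if [ "$a" = "$b" ] && [ "$b" = "$c" ]; then
        echo "all replicas match"
    elif [ "$a" = "$b" ]; then
        echo "odd one out: $3"
    elif [ "$a" = "$c" ]; then
        echo "odd one out: $2"
    elif [ "$b" = "$c" ]; then
        echo "odd one out: $1"
    else
        echo "all three differ: no majority"
    fi
}
```

With only two replicas there is no majority to appeal to, which is exactly
why two-replica pools make this so uncomfortable.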
> It would be great to get an explanation for the increased number of
> scrub errors since Firefly. Were they just not detected correctly in
> previous versions? Or is there maybe something wrong with the new code?
>
> Actually, our company is currently preventing our projects from moving
> to Ceph because of this problem.
>
> Regards,
> Christian
> ________________________________
> From: ceph-users [[email protected]] On Behalf Of
> Travis Rhoden [[email protected]]
> Sent: Thursday, July 10, 2014 4:24 PM
> To: Gregory Farnum
> Cc: [email protected]
> Subject: Re: [ceph-users] scrub error on firefly
>
> And actually, just to follow up, it does seem like there are some
> additional smarts beyond just using the primary to overwrite the
> secondaries... Since I captured MD5 sums before and after the repair,
> I can say that in this particular instance, the secondary copy was used to
> overwrite the primary.
> So I'm just trusting Ceph to do the right thing, and so far it seems to,
> but the comments here about needing to determine the correct object
> and place it on the primary make me wonder if I've been missing something.
>
> - Travis
>
>
> On Thu, Jul 10, 2014 at 10:19 AM, Travis Rhoden <[email protected]> wrote:
>>
>> I can also say that after a recent upgrade to Firefly, I have
>> experienced a massive uptick in scrub errors. The cluster was on
>> Cuttlefish for about a year and had maybe one or two scrub errors.
>> After upgrading to Firefly, we've probably seen 3 to 4 dozen in the
>> last month or so (we were getting 2-3 a day for a few weeks until the
>> whole cluster was rescrubbed, it seemed).
>>
>> What I cannot determine, however, is how to know which object is busted.
>> For example, just today I ran into a scrub error. The object has two
>> copies and is an 8MB piece of an RBD image; both copies have identical
>> timestamps and identical xattr names and values, but they definitely
>> have different MD5 sums. How do I know which one is correct?
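
In the two-copy case there is no majority, so about all a script can do is
confirm the copies differ and gather evidence (size, checksum) for a human
decision. A rough sketch, with placeholder file arguments:

```shell
# Sketch: compare two replica files of the same object. With only two
# copies there is no majority vote; this just reports the facts.
compare_copies() {
    for f in "$1" "$2"; do
        printf '%s  size=%s  md5=%s\n' "$f" \
            "$(wc -c < "$f" | tr -d ' ')" \
            "$(md5sum "$f" | cut -d' ' -f1)"
    done
    if [ "$(md5sum "$1" | cut -d' ' -f1)" = "$(md5sum "$2" | cut -d' ' -f1)" ]; then
        echo "copies match"
    else
        echo "copies differ"
    fi
}
```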
>>
>> I've just been kicking off pg repair each time, which seems to just
>> use the primary copy to overwrite the others. I haven't run into any
>> issues with that so far, but it does make me nervous.
>>
>> - Travis
>>
>>
>> On Tue, Jul 8, 2014 at 1:06 AM, Gregory Farnum <[email protected]> wrote:
>>>
>>> It's not very intuitive or easy to look at right now (there are
>>> plans from the recent developer summit to improve things), but the
>>> central log should have output about exactly what objects are
>>> busted. You'll then want to compare the copies manually to determine
>>> which ones are good or bad, get the good copy on the primary (make
>>> sure you preserve xattrs), and run repair.
>>> -Greg
>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
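
Greg's steps might be sketched roughly as below. This assumes GNU cp, whose
`--preserve` flag covers the extended attributes he warns must survive the
copy; the paths and PG id are invented placeholders, and the OSD holding the
bad copy would normally be stopped before its files are touched directly.

```shell
# Sketch of the manual fix Greg describes: put the known-good copy where
# the bad one lives, preserving mode, timestamps, and (crucially) xattrs.
copy_preserving() {
    cp --preserve=mode,timestamps,xattr "$1" "$2"
}

# Placeholder usage (paths invented for illustration):
#   copy_preserving /path/to/good/replica /path/to/bad/replica
#   ceph pg repair 3.c6
```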
>>>
>>>
>>> On Mon, Jul 7, 2014 at 6:48 PM, Randy Smith <[email protected]> wrote:
>>> > Greetings,
>>> >
>>> > I upgraded to firefly last week and I suddenly received this error:
>>> >
>>> > health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
>>> >
>>> > ceph health detail shows the following:
>>> >
>>> > HEALTH_ERR 1 pgs inconsistent; 1 scrub errors pg 3.c6 is
>>> > active+clean+inconsistent, acting [2,5]
>>> > 1 scrub errors
>>> >
>>> > The docs say that I can run `ceph pg repair 3.c6` to fix this.
>>> > What I want to know is what are the risks of data loss if I run
>>> > that command in this state and how can I mitigate them?
>>> >
>>> > --
>>> > Randall Smith
>>> > Computing Services
>>> > Adams State University
>>> > http://www.adams.edu/
>>> > 719-587-7741
>>> >
>>> > _______________________________________________
>>> > ceph-users mailing list
>>> > [email protected]
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >
>>
>>
>
>