On Wed, Aug 9, 2017 at 3:08 PM, David Turner <drakonst...@gmail.com> wrote:
> What exactly is the timeline of when the IO error happened?
The timeline was included in the email, at hour:min:sec resolution. I
spared the milliseconds since they don't really change anything.
> If the primary
> osd was dead, but not marked down in the cluster yet,
The email showed when the osd went up, so before that it was supposed
to be down, as far as I can tell from the logs, unless there was an
up-down somewhere I have missed. I believe a boot-failed osd won't be
marked up.
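To nail down that timeline, one could correlate the mon cluster log for up/down transitions of the suspect osd. A minimal sketch follows; the log line format used here is illustrative, not the exact hammer (0.94.x) format, and the sample lines are invented:

```python
import re
from datetime import datetime

# Illustrative parser for osd up/down events in a ceph cluster log.
# The line format below is an assumption for this sketch only.
EVENT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\S* .*?"
    r"(?P<osd>osd\.\d+).*?\b(?P<event>boot|marked down)"
)

def osd_timeline(lines):
    """Return (timestamp, osd, event) tuples sorted by time."""
    events = []
    for line in lines:
        m = EVENT_RE.search(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
            events.append((ts, m.group("osd"), m.group("event")))
    return sorted(events)

# Invented sample lines, just to show the shape of the result.
sample = [
    "2017-08-08 10:01:02.123 mon.0 10.0.0.1:6789/0 : osd.12 marked down",
    "2017-08-08 10:05:40.456 mon.0 10.0.0.1:6789/0 : osd.12 boot",
]
for ts, osd, event in osd_timeline(sample):
    print(ts, osd, event)
```

Sorting merged events from several hosts this way makes it easier to spot an up-down flap that a per-file read might miss.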
> then the cluster would
> sit there and expect that osd to respond.
Suppose the osd had been up and in (which I believe it wasn't), and it
failed to respond: what is supposed to happen then? I thought librados
would see the failure or timeout and would try to contact the
secondaries, and definitely not send an IO error upwards unless all
replicas had failed.
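For what it's worth, my understanding of RADOS is slightly different from the paragraph above: clients speak only to the PG's primary, block while a dead-but-not-yet-marked-down primary is still up in the osdmap, and retry against the new primary only once the monitors publish an updated map. A toy model of that behavior (not librados code; all names are hypothetical):

```python
# Simplified model of RADOS client behavior under primary failure.
# This is NOT the librados API; it just encodes the state machine
# described in the lead-in so the cases can be enumerated.

def client_io(acting_set, osd_up, primary_marked_down):
    """Return what a client op would do under a given cluster state.

    acting_set: ordered osd list for the PG (primary first)
    osd_up: set of osds that are actually alive
    primary_marked_down: True once the mons have marked the primary down
    """
    if not acting_set:
        return "EIO"  # no replicas left at all
    primary = acting_set[0]
    if primary in osd_up:
        return "op served by %s" % primary
    if not primary_marked_down:
        # Clients do not fail over to secondaries on their own.
        return "op blocks (primary dead but still up in the osdmap)"
    # New map: the next live osd in the set becomes primary.
    survivors = [o for o in acting_set[1:] if o in osd_up]
    if survivors:
        return "op retried, served by %s" % survivors[0]
    return "EIO"

print(client_io(["osd.3", "osd.7"], {"osd.7"}, primary_marked_down=False))
```

In this model an EIO reaching the client while live replicas exist would indeed be surprising, which is why the exact down/up timeline matters.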
> If this definitely happened after
> the primary osd was marked down, then it's a different story.
Seems so, based on the logs I was able to correlate, but I cannot be
sure.
> I'm confused about you saying 1 osd was down/out and 2 other osds were down
> but not out.
Okay, that may have been a mistake on my part: there are 2 failed
osds, and one was about to be replaced first; since that replacement
failed, we kind of hesitated to replace the other one. :-/ The email
was heavily trimmed to remove fluff, so this info may have been lost.
Sorry.
> Was this on the same host while you were replacing the disk?
The logs were gathered from many hosts, osds, and mons, since the
events happened simultaneously. The replacement happened on the same
host; I believe this is expected.
> Is your failure domain host or osd?
Host (and datacenter).
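The failure domain can be double-checked by parsing the JSON from `ceph osd crush rule dump` and looking at the bucket type in the choose/chooseleaf steps. A short sketch, using an abridged sample of that output:

```python
import json

# Abridged sample of `ceph osd crush rule dump` output, trimmed to the
# fields this sketch needs.
sample = json.loads("""
[{"rule_name": "replicated_ruleset",
  "steps": [
    {"op": "take", "item_name": "default"},
    {"op": "chooseleaf_firstn", "num": 0, "type": "host"},
    {"op": "emit"}]}]
""")

def failure_domains(rules):
    """Map rule name -> bucket types its choose steps select across."""
    out = {}
    for rule in rules:
        out[rule["rule_name"]] = [
            step["type"] for step in rule["steps"]
            if step["op"].startswith("choose")
        ]
    return out

print(failure_domains(sample))  # {'replicated_ruleset': ['host']}
```

With a host failure domain and 3 replicas, losing several osds on one host should still leave live copies elsewhere, which is the crux of why the IO error is puzzling.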
> What version of ceph are you running?
See the first line of my mail: version 0.94.10 (hammer)
ceph-users mailing list