On Wed, Aug 9, 2017 at 3:08 PM, David Turner <drakonst...@gmail.com> wrote:
> What exactly is the timeline of when the IO error happened?
The timeline was included in the email, at hour:min:sec resolution. I
spared the milliseconds since they don't really change anything.
> If the primary
> osd was dead, but not marked down in the cluster yet,
The email showed when the osd went up, so before that it was supposed
to be down, as far as I can tell from the logs, unless there was an
up-down somewhere I have missed. I believe a boot-failed osd won't be
marked up.
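To nail down that timeline, one could correlate the mon cluster log for up/down transitions of the suspect osd. A minimal sketch follows; the log line format used here is illustrative, not the exact hammer (0.94.x) format, and the sample lines are invented:

```python
import re
from datetime import datetime

# Illustrative parser for osd up/down events in a ceph cluster log.
# The line format below is an assumption for this sketch only.
EVENT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\S* .*?"
    r"(?P<osd>osd\.\d+).*?\b(?P<event>boot|marked down)"
)

def osd_timeline(lines):
    """Return (timestamp, osd, event) tuples sorted by time."""
    events = []
    for line in lines:
        m = EVENT_RE.search(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
            events.append((ts, m.group("osd"), m.group("event")))
    return sorted(events)

# Invented sample lines, just to show the shape of the result.
sample = [
    "2017-08-08 10:01:02.123 mon.0 10.0.0.1:6789/0 : osd.12 marked down",
    "2017-08-08 10:05:40.456 mon.0 10.0.0.1:6789/0 : osd.12 boot",
]
for ts, osd, event in osd_timeline(sample):
    print(ts, osd, event)
```

Sorting merged events from several hosts this way makes it easier to spot an up-down flap that a per-file read might miss.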
> then the cluster would
> sit there and expect that osd to respond.
Suppose the osd had been up and in (which I believe it wasn't), and it
failed to respond: what is supposed to happen then? I thought librados
would see the failure or timeout and would try to contact the
secondaries, and definitely not send an IO error upwards unless all
replicas had failed.
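For what it's worth, my understanding of RADOS is slightly different from the paragraph above: clients speak only to the PG's primary, block while a dead-but-not-yet-marked-down primary is still up in the osdmap, and retry against the new primary only once the monitors publish an updated map. A toy model of that behavior (not librados code; all names are hypothetical):

```python
# Simplified model of RADOS client behavior under primary failure.
# This is NOT the librados API; it just encodes the state machine
# described in the lead-in so the cases can be enumerated.

def client_io(acting_set, osd_up, primary_marked_down):
    """Return what a client op would do under a given cluster state.

    acting_set: ordered osd list for the PG (primary first)
    osd_up: set of osds that are actually alive
    primary_marked_down: True once the mons have marked the primary down
    """
    if not acting_set:
        return "EIO"  # no replicas left at all
    primary = acting_set[0]
    if primary in osd_up:
        return "op served by %s" % primary
    if not primary_marked_down:
        # Clients do not fail over to secondaries on their own.
        return "op blocks (primary dead but still up in the osdmap)"
    # New map: the next live osd in the set becomes primary.
    survivors = [o for o in acting_set[1:] if o in osd_up]
    if survivors:
        return "op retried, served by %s" % survivors[0]
    return "EIO"

print(client_io(["osd.3", "osd.7"], {"osd.7"}, primary_marked_down=False))
```

In this model an EIO reaching the client while live replicas exist would indeed be surprising, which is why the exact down/up timeline matters.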
> If this definitely happened after
> the primary osd was marked down, then it's a different story.
Seems so, based on the logs I was able to correlate, but I cannot be
sure.
> I'm confused about you saying 1 osd was down/out and 2 other osds were down
> but not out.
Okay, that may have been a mistake on my part: there are 2 failed
osds, and one was about to be replaced first; since that replacement
failed, we kind of hesitated to replace the other one. :-/ The email
was heavily trimmed to remove fluff, so this info may have been lost.
Sorry.
> Was this on the same host while you were replacing the disk?
The logs were gathered from many hosts, osds, and mons, since the
events happened simultaneously. The replacement happened on the same
host; I believe this is expected.
> Is your failure domain host or osd?
Host (and datacenter).
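The failure domain can be double-checked by parsing the JSON from `ceph osd crush rule dump` and looking at the bucket type in the choose/chooseleaf steps. A short sketch, using an abridged sample of that output:

```python
import json

# Abridged sample of `ceph osd crush rule dump` output, trimmed to the
# fields this sketch needs.
sample = json.loads("""
[{"rule_name": "replicated_ruleset",
  "steps": [
    {"op": "take", "item_name": "default"},
    {"op": "chooseleaf_firstn", "num": 0, "type": "host"},
    {"op": "emit"}]}]
""")

def failure_domains(rules):
    """Map rule name -> bucket types its choose steps select across."""
    out = {}
    for rule in rules:
        out[rule["rule_name"]] = [
            step["type"] for step in rule["steps"]
            if step["op"].startswith("choose")
        ]
    return out

print(failure_domains(sample))  # {'replicated_ruleset': ['host']}
```

With a host failure domain and 3 replicas, losing several osds on one host should still leave live copies elsewhere, which is the crux of why the IO error is puzzling.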
> What version of ceph are you running?
See the first line of my mail: version 0.94.10 (hammer)
ceph-users mailing list