I think I just solved at least part of the problem.

Because of the somewhat peculiar way that I have Docker configured, Docker
containers on another system were being assigned my OSD's IP address,
running for a couple of seconds, and then failing (for unrelated reasons).
Effectively, there was something sitting on the network throwing random
RSTs at my TCP connections and then vanishing.
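
For anyone wanting to check for this kind of address collision, one rough
approach is to watch the ARP table on a third machine on the same segment and
flag any IP that shows up with more than one MAC address over time. The
following is a minimal Python sketch under stated assumptions, not something
taken from this setup: the function names are hypothetical, it assumes a Linux
host exposing /proc/net/arp, and it only catches a conflict if the kernel's
neighbour cache happens to be refreshed while the short-lived claimant is up.

```python
import time
from collections import defaultdict


def parse_arp_table(text):
    """Yield (ip, mac) pairs from text in the /proc/net/arp layout."""
    for row in text.splitlines()[1:]:  # skip the header row
        fields = row.split()
        # Column 4 is the hardware address; all-zero means "incomplete".
        if len(fields) >= 4 and fields[3] != "00:00:00:00:00:00":
            yield fields[0], fields[3].lower()


def record_observations(seen, pairs):
    """Remember every MAC address observed for each IP."""
    for ip, mac in pairs:
        seen[ip].add(mac)


def find_conflicts(seen):
    """Return the IPs that have been observed with more than one MAC."""
    return {ip: macs for ip, macs in seen.items() if len(macs) > 1}


def watch(interval=0.5, rounds=120, path="/proc/net/arp"):
    """Poll the ARP table (hypothetical helper; Linux only)."""
    seen = defaultdict(set)
    for _ in range(rounds):
        with open(path) as f:
            record_observations(seen, parse_arp_table(f.read()))
        time.sleep(interval)
    return find_conflicts(seen)
```

Calling watch() while the cluster is misbehaving returns a dict mapping each
suspect IP to the set of MACs seen for it; an empty dict does not prove there
is no conflict, only that none was observed.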

Amazingly, Ceph seems to have been able to handle it *just* well enough to
make it non-obvious that the problem was external and network related.

That doesn't quite explain the issues with local OSDs acting up, though.
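
To gauge how much of the log noise is same-host versus remote, the "replacing
existing (lossy) channel" lines can be tallied by peer address. This is a
minimal Python sketch, assuming each log entry sits on a single line in the
format quoted further down in this thread; the function name and the regex are
illustrative and not part of Ceph.

```python
import re
from collections import Counter

# Captures the "local >> remote" address pair from a messenger log line, e.g.
#   ... 0 -- 10.2.0.36:6819/1 >> 10.2.0.36:0/1 pipe(...).accept replacing ...
PEER_RE = re.compile(r"-- (\S+) >> (\S+) pipe\(")


def tally_replacements(lines):
    """Count 'replacing existing' events, keyed by (kind, remote IP)."""
    counts = Counter()
    for line in lines:
        if "replacing existing" not in line:
            continue
        match = PEER_RE.search(line)
        if not match:
            continue
        # Addresses look like ip:port/nonce; keep only the IP part.
        local_ip = match.group(1).split(":")[0]
        remote_ip = match.group(2).split(":")[0]
        kind = "same-host" if local_ip == remote_ip else "remote"
        counts[(kind, remote_ip)] += 1
    return counts
```

Feeding it an open log file (for example, tally_replacements(open(path)) with
path pointing at an OSD log) and printing counts.most_common() shows which
peers dominate and whether they are local to the host.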

For now, I've moved all of my OSDs back to Ubuntu; it's more work to
manage, but on the other hand it's actually working.


Scott

On Tue Nov 18 2014 at 3:14:54 PM Gregory Farnum <g...@gregs42.com> wrote:

> It's a little strange, but with just the one-sided log it looks as
> though the OSD is setting up a bunch of connections and then
> deliberately tearing them down again within a second or two (i.e., this
> is not a direct messenger bug, but it might be an OSD one, or it might
> be something else).
> Is it possible that you have some firewalls set up that are allowing
> through some traffic but not others? The OSDs use a bunch of ports and
> it looks like maybe there are at least intermittent issues with them
> heartbeating.
> -Greg
>
> On Wed, Nov 12, 2014 at 11:32 AM, Scott Laird <sc...@sigkill.org> wrote:
> > Here are the first 33k lines or so:
> > https://dl.dropboxusercontent.com/u/104949139/ceph-osd-log.txt
> >
> > This is a different (but more or less identical) machine from the past set
> > of logs.  This system doesn't have quite as many drives in it, so I couldn't
> > spot a same-host error burst, but it's logging tons of the same errors while
> > trying to talk to 10.2.0.34.
> >
> > On Wed Nov 12 2014 at 10:47:30 AM Gregory Farnum <g...@gregs42.com> wrote:
> >>
> >> On Tue, Nov 11, 2014 at 6:28 PM, Scott Laird <sc...@sigkill.org> wrote:
> >> > I'm having a problem with my cluster.  It's running 0.87 right now, but I
> >> > saw the same behavior with 0.80.5 and 0.80.7.
> >> >
> >> > The problem is that my logs are filling up with "replacing existing
> >> > (lossy) channel" log lines (see below), to the point where I'm filling
> >> > drives to 100% almost daily just with logs.
> >> >
> >> > It doesn't appear to be network related, because it happens even when
> >> > talking to other OSDs on the same host.
> >>
> >> Well, that means it's probably not physical network related, but there
> >> can still be plenty wrong with the networking stack... ;)
> >>
> >> > The logs pretty much all point to port 0 on the remote end.  Is this
> >> > an indicator that it's failing to resolve port numbers somehow, or is
> >> > this normal at this point in connection setup?
> >>
> >> That's definitely unusual, but I'd need to see a little more to be
> >> sure if it's bad. My guess is that these pipes are connections from
> >> the other OSD's Objecter, which is treated as a regular client and
> >> doesn't bind to a socket for incoming connections.
> >>
> >> The repetitive channel replacements are concerning, though — they can
> >> be harmless in some circumstances but this looks more like the
> >> connection is simply failing to establish and so it's retrying over
> >> and over again. Can you restart the OSDs with "debug ms = 10" in their
> >> config file and post the logs somewhere? (There is not really any
> >> documentation available on what they mean, but the deeper detail ones
> >> might also be more understandable to you.)
> >> -Greg
> >>
> >> >
> >> > The systems that are causing this problem are somewhat unusual; they're
> >> > running OSDs in Docker containers, but they *should* be configured to
> >> > run as root and have full access to the host's network stack.  They
> >> > manage to work, mostly, but things are still really flaky.
> >> >
> >> > Also, is there documentation on what the various fields mean, short of
> >> > digging through the source?  And how does Ceph resolve OSD numbers into
> >> > host/port addresses?
> >> >
> >> >
> >> > 2014-11-12 01:50:40.802604 7f7828db8700  0 -- 10.2.0.36:6819/1 >>
> >> > 10.2.0.36:0/1 pipe(0x1ce31c80 sd=135 :6819 s=0 pgs=0 cs=0 l=1
> >> > c=0x1e070580).accept replacing existing (lossy) channel (new one
> >> > lossy=1)
> >> >
> >> > 2014-11-12 01:50:40.802708 7f7816538700  0 -- 10.2.0.36:6830/1 >>
> >> > 10.2.0.36:0/1 pipe(0x1ff61080 sd=120 :6830 s=0 pgs=0 cs=0 l=1
> >> > c=0x1f3db2e0).accept replacing existing (lossy) channel (new one
> >> > lossy=1)
> >> >
> >> > 2014-11-12 01:50:40.803346 7f781ba8d700  0 -- 10.2.0.36:6819/1 >>
> >> > 10.2.0.36:0/1 pipe(0x1ce31180 sd=125 :6819 s=0 pgs=0 cs=0 l=1
> >> > c=0x1e070420).accept replacing existing (lossy) channel (new one
> >> > lossy=1)
> >> >
> >> > 2014-11-12 01:50:40.803944 7f781996c700  0 -- 10.2.0.36:6830/1 >>
> >> > 10.2.0.36:0/1 pipe(0x1ff618c0 sd=107 :6830 s=0 pgs=0 cs=0 l=1
> >> > c=0x1f3d8420).accept replacing existing (lossy) channel (new one
> >> > lossy=1)
> >> >
> >> > 2014-11-12 01:50:40.804185 7f7816538700  0 -- 10.2.0.36:6819/1 >>
> >> > 10.2.0.36:0/1 pipe(0x1ffd1e40 sd=20 :6819 s=0 pgs=0 cs=0 l=1
> >> > c=0x1e070840).accept replacing existing (lossy) channel (new one
> >> > lossy=1)
> >> >
> >> > 2014-11-12 01:50:40.805235 7f7813407700  0 -- 10.2.0.36:6819/1 >>
> >> > 10.2.0.36:0/1 pipe(0x1ffd1340 sd=60 :6819 s=0 pgs=0 cs=0 l=1
> >> > c=0x1b2d6260).accept replacing existing (lossy) channel (new one
> >> > lossy=1)
> >> >
> >> > 2014-11-12 01:50:40.806364 7f781bc8f700  0 -- 10.2.0.36:6819/1 >>
> >> > 10.2.0.36:0/1 pipe(0x1ffd0b00 sd=162 :6819 s=0 pgs=0 cs=0 l=1
> >> > c=0x675c580).accept replacing existing (lossy) channel (new one
> >> > lossy=1)
> >> >
> >> > 2014-11-12 01:50:40.806425 7f781aa7d700  0 -- 10.2.0.36:6830/1 >>
> >> > 10.2.0.36:0/1 pipe(0x1db29600 sd=143 :6830 s=0 pgs=0 cs=0 l=1
> >> > c=0x1f3d9600).accept replacing existing (lossy) channel (new one
> >> > lossy=1)
> >> >
> >> >
> >> >
> >> > _______________________________________________
> >> > ceph-users mailing list
> >> > ceph-users@lists.ceph.com
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >
>