On 04/19/2011 07:02 AM, Sage Weil wrote:
>> The relation between OSD partitions (/dev/mapper/sda6 in the example above)
>> is another interesting factor. As long as the load is under 100%, the
>> partitions on both nodes grow in almost perfect sync. When the load exceeds
>> 100%, one node starts lagging behind the other. If that continues long
>> enough,
>> the lagging node falls out completely while the other node keeps growing.
> This is really interesting. This is on the partitions that have _just_
> the OSD data?
Yes, with a couple of extra layers. node01 keeps its OSD data on an ext4
filesystem on top of a dm-crypt encrypted native disk partition. node02,
on the other hand, has an mdadm RAID0 of two partitions on separate disks,
with dm-crypt and ext4 on top of that. This layering - the encryption in
particular - consumes CPU and can slow things down, but otherwise it's
rock-solid; I've been running systems with these setups for years and have
never had a problem with them even once.
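In case it helps to see the exact stacking, the two setups are roughly
equivalent to the following (device names, member partitions and key handling
are illustrative placeholders, not the literal commands I ran):

node01 - ext4 on dm-crypt on a raw partition:
  cryptsetup luksFormat /dev/sda6
  cryptsetup luksOpen /dev/sda6 sda6        # yields /dev/mapper/sda6
  mkfs.ext4 /dev/mapper/sda6
  mount /dev/mapper/sda6 /mnt/osd

node02 - ext4 on dm-crypt on an mdadm RAID0 of two partitions on separate disks
(the member partitions below are made up):
  mdadm --create /dev/md4 --level=0 --raid-devices=2 /dev/sda4 /dev/sdb4
  cryptsetup luksFormat /dev/md4
  cryptsetup luksOpen /dev/md4 md4          # yields /dev/mapper/md4
  mkfs.ext4 /dev/mapper/md4
  mount /dev/mapper/md4 /mnt/osd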
Here's an example from this morning:
node01:
/dev/mapper/sda6 232003 5914 212830 3% /mnt/osd
node02:
/dev/mapper/md4 225716 5704 207112 3% /mnt/osd
client:
192.168.178.100:6789:/
232002 5913 212829 3% /mnt/n01
You can see that the total space on the client corresponds to that of node01,
so the OSD on node02 has gone belly up. The load on node01 is creeping upwards
of 200% while rsync on the client keeps smiling and pushing data.
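If I want to double-check which cosd has dropped out, asking the monitor
should confirm it; something like this (the exact output format varies
between versions):
  ceph -s             # overall cluster state, including how many osds are up/in
  ceph osd dump -o -  # per-osd up/down and in/out state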
node01 top:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
24793 root 20 0 1679m 510m 1756 S 9.2 25.7 29:23.11 cosd
30235 root 20 0 0 0 0 S 1.0 0.0 0:01.10 kworker/0:1
637 root 20 0 0 0 0 S 0.7 0.0 4:56.26 jbd2/sda2-8
30468 root 20 0 14988 1152 864 R 0.7 0.1 0:00.14 top
21748 root 20 0 104m 796 504 S 0.3 0.0 1:04.27 watch
29418 root 20 0 0 0 0 S 0.3 0.0 0:02.12 kworker/0:2
node01 iotop:
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
24933 be/4 root 109.97 K/s 7.12 K/s 0.00 % 95.49 % cosd -i 0 -c ~ceph/ceph.conf
24934 be/4 root 94.15 K/s 7.12 K/s 0.00 % 92.45 % cosd -i 0 -c ~ceph/ceph.conf
24830 be/4 root 0.00 B/s 36.39 K/s 0.00 % 81.10 % cosd -i 0 -c ~ceph/ceph.conf
  637 be/3 root 0.00 B/s 0.00 B/s 0.00 % 80.27 % [jbd2/sda2-8]
  256 be/3 root 0.00 B/s 2.37 K/s 0.00 % 72.93 % [jbd2/sda1-8]
24831 be/4 root 0.00 B/s 0.00 B/s 0.00 % 27.85 % cosd -i 0 -c ~ceph/ceph.conf
24826 be/4 root 0.00 B/s 272.94 K/s 0.00 % 19.28 % cosd -i 0 -c ~ceph/ceph.conf
24829 be/4 root 0.00 B/s 45.89 K/s 0.00 % 18.03 % cosd -i 0 -c ~ceph/ceph.conf
24632 be/4 root 0.00 B/s 26.90 K/s 0.00 % 5.99 % cmon -i 0 -c ~ceph/ceph.conf
24556 be/3 root 0.00 B/s 5.54 K/s 0.00 % 2.95 % [jbd2/dm-0-8]
  639 be/3 root 0.00 B/s 0.00 B/s 0.00 % 2.32 % [jbd2/sda5-8]
24833 be/4 root 0.00 B/s 10.28 K/s 0.00 % 0.00 % cosd -i 0 -c ~ceph/ceph.conf
At this point I unmounted ceph on the client and restarted ceph. A few minutes
later I see this:
node01:
/dev/mapper/sda6 232003 5907 212837 3% /mnt/osd
node02:
/dev/mapper/md4 225716 5626 207190 3% /mnt/osd
Note how disk usage went down on both nodes, considerably so on node02.
Then they start exchanging data, and an hour or so later they're back in sync:
node01:
/dev/mapper/sda6 232003 5906 212838 3% /mnt/osd
node02:
/dev/mapper/md4 225716 5906 206910 3% /mnt/osd
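For completeness, the unmount/restart step above was nothing more involved
than roughly this (the init script location matches my install):

  # on the client
  umount /mnt/n01

  # on node01 and node02 (mon, mds and osd are all handled by the same script here)
  /etc/init.d/ceph restart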
> Do you see any OSD flapping (down/up cycles) during this
> period?
I've been running without logs since yesterday, but my experience is that
they don't flap; once an OSD goes down it stays down until ceph is restarted.
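When I do run with logging again, the easiest way I know to catch flapping
would be to leave a watcher on the cluster log and look for osd state changes,
something along these lines:

  ceph -w | grep --line-buffered -i osd   # stream the cluster log, keep only lines mentioning osds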
> It's possible that the MDS is getting ahead of the OSDs, as there isn't
> currently any throttling of metadata request processing when the
> journaling is slow. (We should fix this.) I don't see how that would
> explain the variance in disk usage, though, unless you are also seeing the
> difference in disk usage reflected in the cosd memory usage on the
> less-disk-used node?
I didn't pay attention to memory usage, but I think I can rule this out
anyway. node01 has 2 GB RAM and 2 GB swap, node02 has 4 GB RAM and no
swap. Since I saw 11 GB on the node02 OSD the other day and 4 GB on the
node01 OSD, the difference could not have been held in memory.
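Next time I'll log cosd memory alongside disk usage, though; something as
simple as this on each node should settle it:

  # sample the cosd resident/virtual size once a minute
  watch -n 60 'ps -o pid,rss,vsz,cmd -C cosd'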
Z