Hello all,
I have seen this for a long time but never investigated it further. After
several days of stable test runs, this is our last known showstopper before
using Ceph in production. We are running 0.47.2 on 32-bit.
If we restart the MDS (or all Ceph daemons) on all nodes, whether one after
another or all together, they first recover. Then the active one starts to spin
at full CPU and stops responding. After a while the next one takes over and
starts to spin too, and so on, until the whole cluster is unusable. This is
completely reproducible and happens even without any active client.
As expected, ceph -w shows lots of
"2012-06-15 11:35:28.588775 mds e959: 1/1/1 up {0=3=up:active(laggy or crashed)}"
Stopping all services on all nodes for several minutes or longer and then
restarting them does not help: the MDS starts spinning again. However, if we
reboot the whole cluster, everything works again.
Today's MDS log is available at
https://download.m-privacy.de/homeuser-mds.0.log.gz
Is this a known problem? It has been with us for a looong time now, but since
rebooting always helped, we never tracked it down.
Amon Ott
--
Dr. Amon Ott
m-privacy GmbH Tel: +49 30 24342334
Am Köllnischen Park 1 Fax: +49 30 24342336
10179 Berlin http://www.m-privacy.de
Amtsgericht Charlottenburg, HRB 84946
Geschäftsführer:
Dipl.-Kfm. Holger Maczkowsky,
Roman Maczkowsky
GnuPG-Key-ID: 0x2DD3A649