All,
We have not been using robinhood for months now, because each time I
attempt to activate changelogs, the MDT begins to disconnect clients.
I came back to robinhood yesterday, and managed to complete a filesystem scan on
our backup Lustre filesystem without any complaints from the MDT. Then today I
activated changelogs, and I saw--instantly--the behavior we were seeing
before. The client that immediately loses its connection is the
robinhood machine, which is set to consume the changelogs.
Are there any settings that anyone knows of that could cause this
behavior? While I appreciate the importance of tuning, before we
upgraded to Lustre 2.5.5, I never had to tweak any Lustre client
settings... robinhood 'just worked'.
Here are my current settings for several parameters listed as important
in the robinhood docs:
llite.lard-ffff880fe9359c00.statahead_max=4
mdc.lard-MDT0000-mdc-ffff880fe9359c00.max_rpcs_in_flight=8
ldlm.namespaces.lard*.lru_size=100
ldlm.namespaces.lard*.lru_max_age=1200
I was tempted to increase the max_rpcs_in_flight to the suggested 64,
but I am uncertain how to interpret the accompanying advice "Make sure
ko2iblnd peer_credits is enough to handle the specified
max_rpcs_in_flight, and that the same parameter on MDS is set
accordingly." What parameter on the MDS, exactly? And also, how do you
set the peer_credits? I couldn't find any info on this, so I left that
alone.
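To make the comparison concrete, here is a minimal Python sketch that parses the settings listed above and flags any that sit below a suggested value. The only suggested value used is the 64 for max_rpcs_in_flight quoted from the robinhood docs. The function names (parse_params, below_suggested) are hypothetical helpers for illustration, and the sketch only inspects values; it does not change anything on the client or the MDS.

```python
# Settings exactly as listed in this message (name=value, one per line).
CURRENT_SETTINGS = """\
llite.lard-ffff880fe9359c00.statahead_max=4
mdc.lard-MDT0000-mdc-ffff880fe9359c00.max_rpcs_in_flight=8
ldlm.namespaces.lard*.lru_size=100
ldlm.namespaces.lard*.lru_max_age=1200
"""

# The only suggested value quoted in this message (from the robinhood docs).
SUGGESTED = {"max_rpcs_in_flight": 64}


def parse_params(text):
    """Turn 'name=value' lines into a dict of {full parameter name: int}."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        name, _, value = line.partition("=")
        params[name] = int(value)
    return params


def below_suggested(params, suggested):
    """Return (name, current, suggested) for parameters under the suggestion.

    Matching is done on the last dotted component of the parameter name,
    e.g. 'max_rpcs_in_flight'.
    """
    findings = []
    for name, value in params.items():
        short = name.rsplit(".", 1)[-1]
        if short in suggested and value < suggested[short]:
            findings.append((name, value, suggested[short]))
    return findings


if __name__ == "__main__":
    found = below_suggested(parse_params(CURRENT_SETTINGS), SUGGESTED)
    for name, current, wanted in found:
        print(f"{name}: current {current}, suggested {wanted}")
```

With the values above, the only parameter flagged is max_rpcs_in_flight (8 against the suggested 64).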
Again, while this could be a tuning issue, the timing of it feels
suspicious...
Any thoughts/suggestions are appreciated. Thanks.
Jessica
Aug 1 13:52:34 kernel: Lustre: lard-MDD0000: changelog on
Aug 1 13:52:34 kernel: Lustre: Modifying parameter general.mdd.lard-MDT*.changelog_mask in log params
Aug 1 13:52:34 kernel: Lustre: Skipped 1 previous similar message
Aug 1 13:58:42 kernel: Lustre: lard-MDT0000: Client 53a9c0ec-83d6-88b9-5b8f-68274ceaebe8 (at 10.7.17.122@o2ib) reconnecting
Aug 1 13:58:52 kernel: Lustre: lard-MDT0000: Client 53a9c0ec-83d6-88b9-5b8f-68274ceaebe8 (at 10.7.17.122@o2ib) reconnecting
Aug 1 13:58:52 kernel: Lustre: Skipped 128731 previous similar messages
Aug 1 13:58:57 kernel: Lustre: lard-MDT0000: Client 53a9c0ec-83d6-88b9-5b8f-68274ceaebe8 (at 10.7.17.122@o2ib) refused reconnection, still busy with 1 active RPCs
Aug 1 13:58:57 kernel: Lustre: Skipped 5 previous similar messages
Aug 1 13:59:19 kernel: Lustre: lard-MDT0000: Client 53a9c0ec-83d6-88b9-5b8f-68274ceaebe8 (at 10.7.17.122@o2ib) reconnecting
Aug 1 13:59:19 kernel: Lustre: Skipped 73857 previous similar messages
Aug 1 13:59:56 kernel: Lustre: lard-MDT0000: Client 53a9c0ec-83d6-88b9-5b8f-68274ceaebe8 (at 10.7.17.122@o2ib) reconnecting
Aug 1 13:59:56 kernel: Lustre: Skipped 513349 previous similar messages
Aug 1 14:00:22 kernel: Lustre: lard-MDT0000: Client 53a9c0ec-83d6-88b9-5b8f-68274ceaebe8 (at 10.7.17.122@o2ib) refused reconnection, still busy with 1 active RPCs
Aug 1 14:01:24 kernel: Lustre: lard-MDT0000: Client 53a9c0ec-83d6-88b9-5b8f-68274ceaebe8 (at 10.7.17.122@o2ib) reconnecting
Aug 1 14:01:24 kernel: Lustre: Skipped 349199 previous similar messages
Aug 1 14:01:41 kernel: Lustre: lard-MDD0000: changelog off
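To put numbers on the excerpt above, here is a rough Python sketch that tallies the reconnect and refusal messages, sums the "Skipped N previous similar messages" counts, and measures how long changelogs were on. It assumes the excerpt stays within a single day (only the clock time is parsed), and the helper names (unwrap, summarize) are hypothetical. It is a log-reading aid, not a diagnosis.

```python
import re
from datetime import datetime

# The MDT log excerpt from this message, including mail line wrapping.
LOG = """\
Aug 1 13:52:34 kernel: Lustre: lard-MDD0000: changelog on
Aug 1 13:52:34 kernel: Lustre: Modifying parameter
general.mdd.lard-MDT*.changelog_mask in log params
Aug 1 13:52:34 kernel: Lustre: Skipped 1 previous similar message
Aug 1 13:58:42 kernel: Lustre: lard-MDT0000: Client
53a9c0ec-83d6-88b9-5b8f-68274ceaebe8 (at 10.7.17.122@o2ib) reconnecting
Aug 1 13:58:52 kernel: Lustre: lard-MDT0000: Client
53a9c0ec-83d6-88b9-5b8f-68274ceaebe8 (at 10.7.17.122@o2ib) reconnecting
Aug 1 13:58:52 kernel: Lustre: Skipped 128731 previous similar messages
Aug 1 13:58:57 kernel: Lustre: lard-MDT0000: Client
53a9c0ec-83d6-88b9-5b8f-68274ceaebe8 (at 10.7.17.122@o2ib) refused
reconnection, still busy with 1 active RPCs
Aug 1 13:58:57 kernel: Lustre: Skipped 5 previous similar messages
Aug 1 13:59:19 kernel: Lustre: lard-MDT0000: Client
53a9c0ec-83d6-88b9-5b8f-68274ceaebe8 (at 10.7.17.122@o2ib) reconnecting
Aug 1 13:59:19 kernel: Lustre: Skipped 73857 previous similar messages
Aug 1 13:59:56 kernel: Lustre: lard-MDT0000: Client
53a9c0ec-83d6-88b9-5b8f-68274ceaebe8 (at 10.7.17.122@o2ib) reconnecting
Aug 1 13:59:56 kernel: Lustre: Skipped 513349 previous similar messages
Aug 1 14:00:22 kernel: Lustre: lard-MDT0000: Client
53a9c0ec-83d6-88b9-5b8f-68274ceaebe8 (at 10.7.17.122@o2ib) refused
reconnection, still busy with 1 active RPCs
Aug 1 14:01:24 kernel: Lustre: lard-MDT0000: Client
53a9c0ec-83d6-88b9-5b8f-68274ceaebe8 (at 10.7.17.122@o2ib) reconnecting
Aug 1 14:01:24 kernel: Lustre: Skipped 349199 previous similar messages
Aug 1 14:01:41 kernel: Lustre: lard-MDD0000: changelog off
"""

# A new log entry starts with a syslog-style timestamp such as "Aug 1 13:52:34 ".
TIMESTAMP = re.compile(r"^[A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d ")


def unwrap(text):
    """Join mail-wrapped continuation lines back onto their log entry."""
    entries = []
    for line in text.splitlines():
        if TIMESTAMP.match(line) or not entries:
            entries.append(line.strip())
        else:
            entries[-1] += " " + line.strip()
    return entries


def summarize(text):
    """Count message types and measure the changelog-on window in seconds."""
    entries = unwrap(text)

    def clock(entry):
        # Assumption: all entries fall on the same day, so time alone suffices.
        return datetime.strptime(entry.split()[2], "%H:%M:%S")

    on = next(e for e in entries if e.endswith("changelog on"))
    off = next(e for e in entries if e.endswith("changelog off"))
    skipped = [int(m.group(1)) for e in entries
               if (m := re.search(r"Skipped (\d+) previous similar message", e))]
    return {
        "reconnecting": sum(e.endswith("reconnecting") for e in entries),
        "refused": sum("refused reconnection" in e for e in entries),
        "suppressed": sum(skipped),
        "changelog_on_seconds": int((clock(off) - clock(on)).total_seconds()),
    }


if __name__ == "__main__":
    for key, value in summarize(LOG).items():
        print(f"{key}: {value}")
```

On this excerpt the sketch counts 5 logged reconnect messages, 2 refused reconnections, and 1065142 suppressed messages in total, over the 547 seconds that changelogs were on. The suppressed total includes every "Skipped" line, without trying to attribute each one to a particular message type.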
--
Jessica Otey
System Administrator II
North American ALMA Science Center (NAASC)
National Radio Astronomy Observatory (NRAO)
Charlottesville, Virginia (USA)