All,

We have not used robinhood for months now, because every time I try to activate changelogs, the MDT begins disconnecting clients.

I came back to robinhood yesterday and managed to complete an fs scan on our backup Lustre file system without any complaints from the MDT. Then today I activated changelogs and instantly saw the same behavior as before. The client that immediately loses its connection is the robinhood machine, which is set up to consume the changelogs.

Does anyone know of any settings that could cause this behavior? I appreciate the importance of tuning, but before we upgraded to Lustre 2.5.5 I never had to tweak any Lustre client settings... robinhood 'just worked'.

Here are my current settings for several parameters listed as important in the robinhood docs:

llite.lard-ffff880fe9359c00.statahead_max=4
mdc.lard-MDT0000-mdc-ffff880fe9359c00.max_rpcs_in_flight=8
ldlm.namespaces.lard*.lru_size=100
ldlm.namespaces.lard*.lru_max_age=1200
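
For anyone who wants to compare these values against their own client, here is a minimal Python sketch that parses `name=value` lines like the ones above and flags where max_rpcs_in_flight differs from the 64 suggested in the robinhood docs. The function name `parse_params` and the `SUGGESTED` table are hypothetical helpers for illustration, not part of robinhood or Lustre.

```python
def parse_params(text):
    """Parse `name=value` lines into a dict, converting integer values."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        name, _, value = line.partition("=")
        value = value.strip()
        params[name.strip()] = int(value) if value.isdigit() else value
    return params


# Settings copied from this message.
SETTINGS = """
llite.lard-ffff880fe9359c00.statahead_max=4
mdc.lard-MDT0000-mdc-ffff880fe9359c00.max_rpcs_in_flight=8
ldlm.namespaces.lard*.lru_size=100
ldlm.namespaces.lard*.lru_max_age=1200
"""

# Assumption: only the one suggested value quoted in this message is listed.
SUGGESTED = {"max_rpcs_in_flight": 64}

current = parse_params(SETTINGS)
for name, value in current.items():
    short = name.rsplit(".", 1)[-1]  # last component, e.g. max_rpcs_in_flight
    if short in SUGGESTED and value != SUGGESTED[short]:
        print(f"{short}: current={value}, suggested={SUGGESTED[short]}")
```

The same function should work on text saved from the client, as long as each line has the `name=value` shape shown above.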

I was tempted to increase the max_rpcs_in_flight to the suggested 64, but I am uncertain how to interpret the accompanying advice "Make sure ko2iblnd peer_credits is enough to handle the specified max_rpcs_in_flight, and that the same parameter on MDS is set accordingly." What parameter on the MDS, exactly? And also, how do you set the peer_credits? I couldn't find any info on this, so I left that alone.

Again, while this could be a tuning issue, the timing of it feels suspicious...

Any thoughts/suggestions are appreciated. Thanks.

Jessica

Aug 1 13:52:34 kernel: Lustre: lard-MDD0000: changelog on
Aug 1 13:52:34 kernel: Lustre: Modifying parameter general.mdd.lard-MDT*.changelog_mask in log params
Aug 1 13:52:34 kernel: Lustre: Skipped 1 previous similar message
Aug 1 13:58:42 kernel: Lustre: lard-MDT0000: Client 53a9c0ec-83d6-88b9-5b8f-68274ceaebe8 (at 10.7.17.122@o2ib) reconnecting
Aug 1 13:58:52 kernel: Lustre: lard-MDT0000: Client 53a9c0ec-83d6-88b9-5b8f-68274ceaebe8 (at 10.7.17.122@o2ib) reconnecting
Aug 1 13:58:52 kernel: Lustre: Skipped 128731 previous similar messages
Aug 1 13:58:57 kernel: Lustre: lard-MDT0000: Client 53a9c0ec-83d6-88b9-5b8f-68274ceaebe8 (at 10.7.17.122@o2ib) refused reconnection, still busy with 1 active RPCs
Aug 1 13:58:57 kernel: Lustre: Skipped 5 previous similar messages
Aug 1 13:59:19 kernel: Lustre: lard-MDT0000: Client 53a9c0ec-83d6-88b9-5b8f-68274ceaebe8 (at 10.7.17.122@o2ib) reconnecting
Aug 1 13:59:19 kernel: Lustre: Skipped 73857 previous similar messages
Aug 1 13:59:56 kernel: Lustre: lard-MDT0000: Client 53a9c0ec-83d6-88b9-5b8f-68274ceaebe8 (at 10.7.17.122@o2ib) reconnecting
Aug 1 13:59:56 kernel: Lustre: Skipped 513349 previous similar messages
Aug 1 14:00:22 kernel: Lustre: lard-MDT0000: Client 53a9c0ec-83d6-88b9-5b8f-68274ceaebe8 (at 10.7.17.122@o2ib) refused reconnection, still busy with 1 active RPCs
Aug 1 14:01:24 kernel: Lustre: lard-MDT0000: Client 53a9c0ec-83d6-88b9-5b8f-68274ceaebe8 (at 10.7.17.122@o2ib) reconnecting
Aug 1 14:01:24 kernel: Lustre: Skipped 349199 previous similar messages
Aug 1 14:01:41 kernel: Lustre: lard-MDD0000: changelog off
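
To give a sense of scale, the suppressed-message counters in the excerpt above can be added up. This is a minimal Python sketch that only sums the "Skipped N previous similar message(s)" lines copied from the log; the variable names are illustrative, and it does not attempt to attribute each counter to a particular message type.

```python
import re

# "Skipped ..." lines copied from the MDS log excerpt in this message.
LOG = """
Aug 1 13:52:34 kernel: Lustre: Skipped 1 previous similar message
Aug 1 13:58:52 kernel: Lustre: Skipped 128731 previous similar messages
Aug 1 13:58:57 kernel: Lustre: Skipped 5 previous similar messages
Aug 1 13:59:19 kernel: Lustre: Skipped 73857 previous similar messages
Aug 1 13:59:56 kernel: Lustre: Skipped 513349 previous similar messages
Aug 1 14:01:24 kernel: Lustre: Skipped 349199 previous similar messages
"""

# Matches both the singular and the plural form of the counter line.
SKIPPED = re.compile(r"Skipped (\d+) previous similar messages?")

counts = [int(n) for n in SKIPPED.findall(LOG)]
total_skipped = sum(counts)

print(f"{len(counts)} counter lines, {total_skipped} suppressed messages in total")
```

Nearly all of the total comes from the four counters logged between 13:58:52 and 14:01:24, while changelogs were on.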



--
Jessica Otey
System Administrator II
North American ALMA Science Center (NAASC)
National Radio Astronomy Observatory (NRAO)
Charlottesville, Virginia (USA)
