On 04/24/15 14:03, Carmelo Ponti (CSCS) wrote:
> Now, after about 2 hours, the migration processes appear again and
> GET_INFO_FS is slowly increasing (5.38 ms/op at the moment).
This may be due to the way the Lustre client manages its inode cache (the 
more populated it is, the slower it gets).

You can limit this by setting the following on the client:
     lctl set_param ldlm.namespaces.*.lru_size=400

It is also good to run this regularly (we run it on our Lustre clients after 
each compute job ends):
     lctl set_param ldlm.namespaces.*.lru_size=clear
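 
For example, a minimal sketch of doing that from a job epilog (the epilog
mechanism and path are scheduler-specific, this is only an illustration):

     #!/bin/sh
     # hypothetical per-job epilog run on each Lustre client:
     # drop the cached LDLM locks so the client's lock/inode cache
     # does not keep growing between jobs
     lctl set_param ldlm.namespaces.*.lru_size=clear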

Regards.

>
>
>
> On Thu, 2015-04-23 at 15:47 +0200, LEIBOVICI Thomas wrote:
>> top - 12:25:30 up 8 days, 21:57,  6 users,  load average: 16.77, 17.89,
>> 15.97
>> Tasks:   2 total,   0 running,   2 sleeping,   0 stopped,   0 zombie
>> Cpu(s):  0.0%us, 12.9%sy,  0.0%ni, 87.0%id,  0.0%wa,  0.0%hi,  0.0%si,
>> 0.0%st
>> Mem:  131999964k total, 125632212k used,  6367752k free,   207536k
>> buffers
>> Swap:  6291448k total,    16352k used,  6275096k free,  6655152k cached
>>
>>     PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+
>> COMMAND
>>    6915 mysql     20   0 34.2g 1.5g 5372 S 59.7  1.2   0:55.34
>> mysqld
>>    7012 root      20   0 3189m 1.3g 1468 S 15.1  1.1   2:40.51
>> robinhood
>>
>> I really think there is something wrong on your system related to these
>> migration threads.
>> You still have a load of 17 with the CPU 87% idle... Strange.
>> And even if they are more active now, mysql and robinhood only produce a
>> load of 0.7.
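
A load that high with the CPU almost idle usually means tasks stuck in
uninterruptible sleep (D state), e.g. blocked on I/O or on a Lustre RPC;
the load average counts them but the CPU columns do not. A quick way to
spot them (just an example one-liner):

     # list tasks currently in uninterruptible sleep
     ps -eo state,pid,comm | awk '$1=="D"'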
>>
>> It sounds more like a driver or hardware issue, or an RT kernel mode...
>> Do you run a specific kernel, or one with realtime options?
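
For instance, a quick way to check (generic commands, nothing
Lustre-specific):

     uname -r            # kernel version / flavour (e.g. an -rt build)
     cat /proc/cmdline   # boot options (isolcpus, nohz_full, ...)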
>>
>> If you have a spare node, it would be worthwhile to run robinhood on it
>> and see if you have the same strange load.
>>
>> Regards
>> Thomas.
>>
>> On 04/23/15 12:49, Carmelo Ponti (CSCS) wrote:
>>> I divided the two processes between the two sockets and now I can see
>>> them using some CPU from time to time:
>>>
>>> # top -p 7012,6915 -b
>>>    
>>> top - 12:25:27 up 8 days, 21:57,  6 users,  load average: 16.77, 17.89,
>>> 15.97
>>> Tasks:   2 total,   1 running,   1 sleeping,   0 stopped,   0 zombie
>>> Cpu(s):  0.0%us, 12.7%sy,  0.0%ni, 87.2%id,  0.0%wa,  0.0%hi,  0.0%si,
>>> 0.0%st
>>> Mem:  131999964k total, 125626996k used,  6372968k free,   207532k
>>> buffers
>>> Swap:  6291448k total,    16352k used,  6275096k free,  6655084k cached
>>>
>>>     PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+
>>> COMMAND
>>>    7012 root      20   0 3189m 1.3g 1468 S  1.6  1.1   2:40.02
>>> robinhood
>>>    6915 mysql     20   0 34.2g 1.5g 5372 R  0.0  1.2   0:53.40
>>> mysqld
>>>
>>> top - 12:25:30 up 8 days, 21:57,  6 users,  load average: 16.77, 17.89,
>>> 15.97
>>> Tasks:   2 total,   0 running,   2 sleeping,   0 stopped,   0 zombie
>>> Cpu(s):  0.0%us, 12.9%sy,  0.0%ni, 87.0%id,  0.0%wa,  0.0%hi,  0.0%si,
>>> 0.0%st
>>> Mem:  131999964k total, 125632212k used,  6367752k free,   207536k
>>> buffers
>>> Swap:  6291448k total,    16352k used,  6275096k free,  6655152k cached
>>>
>>>     PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+
>>> COMMAND
>>>    6915 mysql     20   0 34.2g 1.5g 5372 S 59.7  1.2   0:55.34
>>> mysqld
>>>    7012 root      20   0 3189m 1.3g 1468 S 15.1  1.1   2:40.51
>>> robinhood
>>>
>>> top - 12:25:33 up 8 days, 21:57,  6 users,  load average: 16.39, 17.79,
>>> 15.94
>>> Tasks:   2 total,   0 running,   2 sleeping,   0 stopped,   0 zombie
>>> Cpu(s):  0.0%us, 13.7%sy,  0.0%ni, 86.2%id,  0.0%wa,  0.0%hi,  0.0%si,
>>> 0.0%st
>>> Mem:  131999964k total, 125631972k used,  6367992k free,   207540k
>>> buffers
>>> Swap:  6291448k total,    16352k used,  6275096k free,  6655116k cached
>>>
>>>     PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+
>>> COMMAND
>>>    7012 root      20   0 3189m 1.3g 1468 S 21.3  1.1   2:41.17
>>> robinhood
>>>    6915 mysql     20   0 34.2g 1.5g 5372 S  0.0  1.2   0:55.34
>>> mysqld
>>>
>>> At the moment we have 24 million changelog lines, so I guess we need
>>> some time to see if there is an improvement. For sure, the load average
>>> decreased a lot.
>>>
>>> Today I also noticed many messages like the following in dmesg and
>>> in /var/log/messages:
>>>
>>> Lustre: 24416:0:(kernel_user_comm.c:201:libcfs_kkuc_msg_put()) message
>>> send failed (-32)
>>> Lustre: 24416:0:(kernel_user_comm.c:201:libcfs_kkuc_msg_put()) Skipped 1
>>> previous similar message
>>>
>>> I searched on Google and found an old request on robinhood-support
>>> (http://sourceforge.net/p/robinhood/mailman/message/31162194/) which
>>> explains the messages and how to fix them. Could these messages explain
>>> part of the problem we have, or are they a consequence of the problem?
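
These messages mean a kernel-to-userspace (KUC) send failed with -32,
i.e. EPIPE: a registered reader, typically a changelog or HSM consumer
such as a previous robinhood instance that was killed without cleaning
up, is no longer reading its pipe. So they are most likely a leftover
symptom rather than the cause of the load. As a sanity check you can
list the registered changelog consumers on the MDS with something like
this (names in the output are site-dependent):

     # on the MDS: show registered changelog readers (cl1, cl2, ...)
     lctl get_param mdd.*.changelog_users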
>>>
>>> Carmelo
>>>
>>> On Thu, 2015-04-23 at 10:00 +0200, LEIBOVICI Thomas wrote:
>>>> On 04/22/15 16:26, Carmelo Ponti (CSCS) wrote:
>>>>> I will wait until tomorrow to see if the situation improves, but I
>>>>> immediately noticed that the CPU usage of robinhood is now between
>>>>> 40% and 100%. The load of mysql didn't change:
>>>>>
>>>>> 2847 root      20   0 3867m 1.7g 1528 S 63.9  1.4  47:41.73 robinhood
>>>>> 3217 mysql     20   0 37.6g 1.3g 4500 S  0.0  1.0   1603:03 mysqld
>>>> It may be a good sign that robinhood is now doing something :)
>>>>
>>>> However, mysqld should be much more active.
>>>> What about pinning it too?
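
For the record, pinning the already-running daemons can be done with
something like this (the CPU lists are only an example; check your NUMA
layout with 'numactl --hardware' first):

     # bind robinhood and mysqld to different sockets (example CPU ranges)
     taskset -pc 0-7  $(pidof robinhood)
     taskset -pc 8-15 $(pidof mysqld)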
>>>>
>>>> Thomas.

