On 04/24/15 14:03, Carmelo Ponti (CSCS) wrote:
> Now, after ca. 2 h, the migration processes appear again and
> GET_INFO_FS is increasing slowly (5.38 ms/op at the moment).
This may be due to the way the Lustre client manages its inode cache (the
more populated it is, the slower it gets).
You can limit this by setting this on the client:
lctl set_param ldlm.namespaces.*.lru_size=400
It is also good to run the following regularly (we run it on our Lustre
clients after each compute job ends):
lctl set_param ldlm.namespaces.*.lru_size=clear
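For example, a minimal per-job epilog sketch; the two lctl calls are exactly
the ones above, but wrapping them in an epilog script (and the cap value of
400) is only illustrative, to be adapted to your scheduler and workload:

    #!/bin/sh
    # Illustrative epilog run on the Lustre client after each compute job.
    # Quoting the parameter avoids accidental shell globbing of the '*'.
    lctl set_param 'ldlm.namespaces.*.lru_size=clear'   # drop cached locks now
    lctl set_param 'ldlm.namespaces.*.lru_size=400'     # cap the lock LRU size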
Regards.
>
>
>
> On Thu, 2015-04-23 at 15:47 +0200, LEIBOVICI Thomas wrote:
>> top - 12:25:30 up 8 days, 21:57, 6 users, load average: 16.77, 17.89, 15.97
>> Tasks: 2 total, 0 running, 2 sleeping, 0 stopped, 0 zombie
>> Cpu(s): 0.0%us, 12.9%sy, 0.0%ni, 87.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
>> Mem: 131999964k total, 125632212k used, 6367752k free, 207536k buffers
>> Swap: 6291448k total, 16352k used, 6275096k free, 6655152k cached
>>
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 6915 mysql 20 0 34.2g 1.5g 5372 S 59.7 1.2 0:55.34 mysqld
>> 7012 root 20 0 3189m 1.3g 1468 S 15.1 1.1 2:40.51 robinhood
>>
>> I really think there is something wrong on your system related to these
>> migration threads.
>> You still have a load of 17 with the CPU 87% idle... Strange.
>> And even though they are more active now, mysql and robinhood only produce
>> a load of 0.7.
>>
>> It sounds more like a driver or hardware issue, or an RT kernel mode...
>> Do you run a specific kernel, or one with realtime options?
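>> For instance (assuming the kernel config is installed under /boot, as on
>> most RHEL/SLES systems), something like:
>>
>>     uname -r
>>     grep -i preempt /boot/config-$(uname -r)
>>
>> would show whether a realtime/preempt kernel is in use.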
>>
>> If you have a spare node, it would be worthwhile to run robinhood on it
>> and see if you have the same strange load.
>>
>> Regards
>> Thomas.
>>
>> On 04/23/15 12:49, Carmelo Ponti (CSCS) wrote:
>>> I divided the two processes between the two sockets and now I can see
>>> them using some CPU from time to time:
>>>
>>> # top -p 7012,6915 -b
>>>
>>> top - 12:25:27 up 8 days, 21:57, 6 users, load average: 16.77, 17.89, 15.97
>>> Tasks: 2 total, 1 running, 1 sleeping, 0 stopped, 0 zombie
>>> Cpu(s): 0.0%us, 12.7%sy, 0.0%ni, 87.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
>>> Mem: 131999964k total, 125626996k used, 6372968k free, 207532k buffers
>>> Swap: 6291448k total, 16352k used, 6275096k free, 6655084k cached
>>>
>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>>> 7012 root 20 0 3189m 1.3g 1468 S 1.6 1.1 2:40.02 robinhood
>>> 6915 mysql 20 0 34.2g 1.5g 5372 R 0.0 1.2 0:53.40 mysqld
>>>
>>> top - 12:25:30 up 8 days, 21:57, 6 users, load average: 16.77, 17.89, 15.97
>>> Tasks: 2 total, 0 running, 2 sleeping, 0 stopped, 0 zombie
>>> Cpu(s): 0.0%us, 12.9%sy, 0.0%ni, 87.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
>>> Mem: 131999964k total, 125632212k used, 6367752k free, 207536k buffers
>>> Swap: 6291448k total, 16352k used, 6275096k free, 6655152k cached
>>>
>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>>> 6915 mysql 20 0 34.2g 1.5g 5372 S 59.7 1.2 0:55.34 mysqld
>>> 7012 root 20 0 3189m 1.3g 1468 S 15.1 1.1 2:40.51 robinhood
>>>
>>> top - 12:25:33 up 8 days, 21:57, 6 users, load average: 16.39, 17.79, 15.94
>>> Tasks: 2 total, 0 running, 2 sleeping, 0 stopped, 0 zombie
>>> Cpu(s): 0.0%us, 13.7%sy, 0.0%ni, 86.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
>>> Mem: 131999964k total, 125631972k used, 6367992k free, 207540k buffers
>>> Swap: 6291448k total, 16352k used, 6275096k free, 6655116k cached
>>>
>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>>> 7012 root 20 0 3189m 1.3g 1468 S 21.3 1.1 2:41.17 robinhood
>>> 6915 mysql 20 0 34.2g 1.5g 5372 S 0.0 1.2 0:55.34 mysqld
>>>
>>> At the moment we have 24 million changelog lines, so I guess we need
>>> some time to see if there is an improvement. For sure, the load average
>>> has decreased a lot.
>>>
>>> Today I also noticed many messages like the following in dmesg and
>>> in /var/log/messages:
>>>
>>> Lustre: 24416:0:(kernel_user_comm.c:201:libcfs_kkuc_msg_put()) message
>>> send failed (-32)
>>> Lustre: 24416:0:(kernel_user_comm.c:201:libcfs_kkuc_msg_put()) Skipped 1
>>> previous similar message
>>>
>>> I searched on Google and found an old thread on robinhood-support
>>> (http://sourceforge.net/p/robinhood/mailman/message/31162194/) which
>>> explains the messages and how to fix them. Could these messages explain
>>> part of the problem we have, or are they a consequence of it?
>>>
>>> Carmelo
>>>
>>> On Thu, 2015-04-23 at 10:00 +0200, LEIBOVICI Thomas wrote:
>>>> On 04/22/15 16:26, Carmelo Ponti (CSCS) wrote:
>>>>> I will wait until tomorrow to see if the situation gets better, but I
>>>>> immediately noticed that the CPU usage of robinhood is now between
>>>>> 40% and 100%. The load of mysql didn't change:
>>>>>
>>>>> 2847 root 20 0 3867m 1.7g 1528 S 63.9 1.4 47:41.73 robinhood
>>>>> 3217 mysql 20 0 37.6g 1.3g 4500 S 0.0 1.0 1603:03 mysqld
>>>> It may be a good sign that robinhood now does something :)
>>>>
>>>> mysqld should be much more active however.
>>>> What about pinning it too?
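>>>> For example (the CPU lists below are only illustrative and depend on your
>>>> socket/core layout, which "lscpu" or "numactl --hardware" will show):
>>>>
>>>>     # pin the running mysqld to the cores of one socket...
>>>>     taskset -pc 0-7 $(pidof mysqld)
>>>>     # ...and robinhood to the cores of the other
>>>>     taskset -pc 8-15 $(pidof robinhood)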
>>>>
>>>> Thomas.