Hi Thomas

Based on your last comment I found the root cause of the problem. A
couple of weeks ago the server crashed a couple of times because of a
kernel panic. Analyzing vmcore-dmesg.txt, I found a possible issue with
the installed Lustre client 2.1, so I decided to compile and install
Lustre client 2.5.3, but I didn't think about the dependency with
robinhood. Today I tried to downgrade the Lustre client to version 2.1
(the same as the server), but I ran into the dependency problem and
realized where the problem was. I'm sorry I didn't think of this
earlier.

I kept Lustre client 2.5.3 and compiled and installed robinhood again.
After the restart it worked very well at first (GET_INFO_FS decreased
from 1500 to 1.5 ms/op) and the load from the migration processes
disappeared.

Now, after about 2 hours, the migration processes have appeared again
and GET_INFO_FS is slowly increasing (5.38 ms/op at the moment). I'm
guessing that this is caused by the number of Changelog entries (about
50 million), and for this reason we are planning to recreate the DB and
perform a new scan during the weekend. I'm also wondering whether to
keep the fix I made with numactl, or to remove it and increase the
number of threads before the new scan. What do you think?
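
As a rough sanity check on the size of the backlog, here is a
back-of-the-envelope sketch in Python. It assumes (an assumption, not
something measured) that each Changelog entry costs about one
GET_INFO_FS operation and that the work scales perfectly across threads.
The function name and the thread counts are only illustrative.

```python
def backlog_hours(entries, ms_per_op, threads=1):
    """Estimate hours needed to work through a changelog backlog.

    Assumes one operation per entry and perfect scaling across threads;
    both are simplifications, not measured robinhood behaviour.
    """
    if threads < 1:
        raise ValueError("threads must be >= 1")
    total_seconds = entries * (ms_per_op / 1000.0) / threads
    return total_seconds / 3600.0


if __name__ == "__main__":
    entries = 50_000_000  # approximate backlog quoted above
    for latency in (1.5, 5.38):  # ms/op values seen after the restart
        for threads in (1, 8):  # 8 is an arbitrary illustrative value
            hours = backlog_hours(entries, latency, threads)
            print(f"{latency} ms/op, {threads} thread(s): {hours:.1f} h")
```

Under these assumptions the estimate grows linearly with the latency,
so a slowly rising GET_INFO_FS matters more the larger the backlog is.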

Carmelo



On Thu, 2015-04-23 at 15:47 +0200, LEIBOVICI Thomas wrote:
> top - 12:25:30 up 8 days, 21:57,  6 users,  load average: 16.77, 17.89,
> 15.97
> Tasks:   2 total,   0 running,   2 sleeping,   0 stopped,   0 zombie
> Cpu(s):  0.0%us, 12.9%sy,  0.0%ni, 87.0%id,  0.0%wa,  0.0%hi,  0.0%si,
> 0.0%st
> Mem:  131999964k total, 125632212k used,  6367752k free,   207536k
> buffers
> Swap:  6291448k total,    16352k used,  6275096k free,  6655152k cached
> 
>    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+
> COMMAND
>   6915 mysql     20   0 34.2g 1.5g 5372 S 59.7  1.2   0:55.34
> mysqld
>   7012 root      20   0 3189m 1.3g 1468 S 15.1  1.1   2:40.51
> 
> 
> I really think there is something wrong on your system related to these
> migration threads.
> You still have a load of 17 with CPU 87% idle... Strange.
> And even if they are more active now, mysql and robinhood only produce a 
> load of 0.7.
> 
> It sounds more like a driver or hardware issue, or a RT kernel mode...
> Do you run a specific kernel? or with realtime options?
> 
> If you have a spare node, it would be worthwhile to run robinhood on it 
> and see if you have the same strange load.
> 
> Regards
> Thomas.
> 
> On 04/23/15 12:49, Carmelo Ponti (CSCS) wrote:
> > I divided the two processes between the two sockets and now I can see
> > them using some CPU from time to time:
> >
> > # top -p 7012,6915 -b
> >   
> > top - 12:25:27 up 8 days, 21:57,  6 users,  load average: 16.77, 17.89,
> > 15.97
> > Tasks:   2 total,   1 running,   1 sleeping,   0 stopped,   0 zombie
> > Cpu(s):  0.0%us, 12.7%sy,  0.0%ni, 87.2%id,  0.0%wa,  0.0%hi,  0.0%si,
> > 0.0%st
> > Mem:  131999964k total, 125626996k used,  6372968k free,   207532k
> > buffers
> > Swap:  6291448k total,    16352k used,  6275096k free,  6655084k cached
> >
> >    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+
> > COMMAND
> >   7012 root      20   0 3189m 1.3g 1468 S  1.6  1.1   2:40.02
> > robinhood
> >   6915 mysql     20   0 34.2g 1.5g 5372 R  0.0  1.2   0:53.40
> > mysqld
> >
> > top - 12:25:30 up 8 days, 21:57,  6 users,  load average: 16.77, 17.89,
> > 15.97
> > Tasks:   2 total,   0 running,   2 sleeping,   0 stopped,   0 zombie
> > Cpu(s):  0.0%us, 12.9%sy,  0.0%ni, 87.0%id,  0.0%wa,  0.0%hi,  0.0%si,
> > 0.0%st
> > Mem:  131999964k total, 125632212k used,  6367752k free,   207536k
> > buffers
> > Swap:  6291448k total,    16352k used,  6275096k free,  6655152k cached
> >
> >    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+
> > COMMAND
> >   6915 mysql     20   0 34.2g 1.5g 5372 S 59.7  1.2   0:55.34
> > mysqld
> >   7012 root      20   0 3189m 1.3g 1468 S 15.1  1.1   2:40.51
> > robinhood
> >
> > top - 12:25:33 up 8 days, 21:57,  6 users,  load average: 16.39, 17.79,
> > 15.94
> > Tasks:   2 total,   0 running,   2 sleeping,   0 stopped,   0 zombie
> > Cpu(s):  0.0%us, 13.7%sy,  0.0%ni, 86.2%id,  0.0%wa,  0.0%hi,  0.0%si,
> > 0.0%st
> > Mem:  131999964k total, 125631972k used,  6367992k free,   207540k
> > buffers
> > Swap:  6291448k total,    16352k used,  6275096k free,  6655116k cached
> >
> >    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+
> > COMMAND
> >   7012 root      20   0 3189m 1.3g 1468 S 21.3  1.1   2:41.17
> > robinhood
> >   6915 mysql     20   0 34.2g 1.5g 5372 S  0.0  1.2   0:55.34
> > mysqld
> >
> > At the moment we have 24 million Changelog lines, so I guess we need
> > some time to see if there is an improvement. The load average has
> > certainly decreased a lot.
> >
> > Today I also noticed many messages like the following in dmesg and
> > in /var/log/messages:
> >
> > Lustre: 24416:0:(kernel_user_comm.c:201:libcfs_kkuc_msg_put()) message
> > send failed (-32)
> > Lustre: 24416:0:(kernel_user_comm.c:201:libcfs_kkuc_msg_put()) Skipped 1
> > previous similar message
> >
> > I searched on Google and found an old request on robinhood-support
> > (http://sourceforge.net/p/robinhood/mailman/message/31162194/) which
> > explains the messages and how to fix them. Could these messages partly
> > explain the problem we have, or are they a consequence of the problem?
> >
> > Carmelo
> >
> > On Thu, 2015-04-23 at 10:00 +0200, LEIBOVICI Thomas wrote:
> >> On 04/22/15 16:26, Carmelo Ponti (CSCS) wrote:
> >>> I will wait until tomorrow to see if the situation improves, but I
> >>> can already see that the CPU usage of robinhood is now between
> >>> 40% and 100%. The load of mysql didn't change:
> >>>
> >>> 2847 root      20   0 3867m 1.7g 1528 S 63.9  1.4  47:41.73 robinhood
> >>> 3217 mysql     20   0 37.6g 1.3g 4500 S  0.0  1.0   1603:03 mysqld
> >> It may be a good sign that robinhood now does something :)
> >>
> >> mysqld should be much more active however.
> >> What about pinning it too?
> >>
> >> Thomas.
> 

-- 
----------------------------------------------------------------------
Carmelo Ponti           System Engineer                             
CSCS                    Swiss Center for Scientific Computing 
Via Trevano 131         Email: [email protected]                  
CH-6900 Lugano          http://www.cscs.ch              
                        Phone: +41 91 610 82 15/Fax: +41 91 610 82 82
----------------------------------------------------------------------


_______________________________________________
robinhood-support mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/robinhood-support
