Re: [STABILITY] Large KVM Infrastructure with ACS

2015-11-20 Thread Rafael Weingärtner
At first I thought the problem could be a consequence of over-commitment,
but from your answers I do not see problems with over-provisioning of
resources. I would check the MS and KVM agent code, but I believe the
problem is the number of MSs; imagine a single MS having to manage,
monitor, and orchestrate 250+ physical servers, networking devices, and
probably a huge number of VMs. In the environment we have in our lab, we
use 2 MSs to manage a pretty small environment, and sometimes they get slow.

On Fri, Nov 20, 2015 at 6:25 AM, Daan Hoogland 
wrote:

> On Thu, Nov 19, 2015 at 9:31 PM, ilya 
> wrote:
>
> > Maybe I need to ping LeaseWeb and ExtremePC folks..
> >
>
We have many smaller installations, nothing near those limits. (we being
> LeaseWeb)
>
>
>
> --
> Daan
>



-- 
Rafael Weingärtner


Re: [STABILITY] Large KVM Infrastructure with ACS

2015-11-20 Thread ilya
2 MSs is typical.

There is also an issue with starting stopped MSs. For example, if you
start one MS and within the next 30 seconds start the second MS (while
the first MS is still loading all hosts), there is a chance they will
both come up - but not really be functional.

Specifically, sync jobs would fail, and status updates from
CloudStack-managed resources might not come through.

The solution was to stagger the start of the second MS, but that is just
a workaround.
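For anyone hitting the same race, the workaround can be scripted. A minimal sketch - the hostnames, service name, and 120-second stagger are placeholder assumptions, and the `RUN=echo` default only prints the commands (drop it for a real run):

```shell
#!/bin/sh
# Staggered start of two CloudStack management servers, so the second MS
# does not come up while the first is still loading all hosts.
# Hostnames and the stagger value are assumptions; adjust for your site.
# RUN defaults to 'echo' (dry run); set RUN= to actually execute.
RUN=${RUN:-echo}
STAGGER=${STAGGER:-120}   # seconds; comfortably past the ~30 s race window

staggered_start() {
    # Start the first MS, give it time to load all hosts, then start the second.
    $RUN ssh "$1" systemctl start cloudstack-management
    $RUN sleep "$STAGGER"
    $RUN ssh "$2" systemctl start cloudstack-management
}

staggered_start ms1.example.com ms2.example.com
```

In dry-run mode this just prints the three commands in order, which makes the intended sequencing easy to review before wiring it into real startup tooling.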




On 11/20/15 2:19 AM, Rafael Weingärtner wrote:
> At first I thought the problem could be a consequence of over-commitment,
> but from your answers I do not see problems with over-provisioning of
> resources. I would check the MS and KVM agent code, but I believe the
> problem is the number of MSs; imagine a single MS having to manage,
> monitor, and orchestrate 250+ physical servers, networking devices, and
> probably a huge number of VMs. In the environment we have in our lab, we
> use 2 MSs to manage a pretty small environment, and sometimes they get slow.
> 
> On Fri, Nov 20, 2015 at 6:25 AM, Daan Hoogland 
> wrote:
> 
>> On Thu, Nov 19, 2015 at 9:31 PM, ilya 
>> wrote:
>>
>>> Maybe I need to ping LeaseWeb and ExtremePC folks..
>>>
>>
> We have many smaller installations, nothing near those limits. (we being
>> LeaseWeb)
>>
>>
>>
>> --
>> Daan
>>
> 
> 
> 


RE: [STABILITY] Large KVM Infrastructure with ACS

2015-11-20 Thread Paul Angus
Hi Rafael,

These are clients of our support or consulting services, so I can only ever
talk in very broad terms about their architectures.

One client had a single management server; the other had a pair that were
load-balanced behind a physical appliance.

Regards,

Paul Angus
VP Technology/Cloud Architect
S: +44 20 3603 0540 | M: +447711418784 | T: CloudyAngus
paul.an...@shapeblue.com

-Original Message-
From: Rafael Weingärtner [mailto:rafaelweingart...@gmail.com]
Sent: 19 November 2015 22:32
To: dev@cloudstack.apache.org
Subject: Re: [STABILITY] Large KVM Infrastructure with ACS

How many MSs do you have in your environment?

On Thu, Nov 19, 2015 at 7:56 PM, Paul Angus <paul.an...@shapeblue.com>
wrote:

> Hi,
>
> In the past, a couple of clients of ours have had issues with indirect
> agents (KVM hosts and system VMs) connecting over port 8250,
> particularly if connectivity was lost to the management server(s).
> They both had 300+ indirect agents active.
>
> In these circumstances we have found that running netstat on the mgmt
> server(s) revealed many open but unused connections to port 8250.
>
> I recall at one time we found the agent connection code had been
> altered to attempt a reconnect if the connection didn't complete within
> 10 secs. However, the failed connection would take 60 seconds to time out.
>
> Another time we found that the management server and the MySQL DB were
> both being starved of connections to the MySQL DB, so they could not
> process the reconnections fast enough. The default from the mgmt server
> is 100 connections and the documented setting for MySQL is 350
> connections. However, external connections (and additional mgmt
> servers) require these to be adjusted.
>
> -- just some ideas...
>
>
> Regards,
>
> Paul Angus
> VP Technology/Cloud Architect
> S: +44 20 3603 0540 | M: +447711418784 | T: CloudyAngus
> paul.an...@shapeblue.com
>
> -Original Message-
> From: ilya [mailto:ilya.mailing.li...@gmail.com]
> Sent: 19 November 2015 20:32
> To: dev@cloudstack.apache.org
> Subject: Re: [STABILITY] Large KVM Infrastructure with ACS
>
> Rafael,
>
> Please see response in-line:
>
> On 11/18/15 4:16 PM, Rafael Weingärtner wrote:
> > When you say 250+, you mean 250+ hosts spread across lots of
> > clusters, right? If I am not mistaken, ACS limits the number of KVM
> > hosts in a cluster - something like 50? I do not remember now whether
> > that value can be configured; maybe it can be.
>
> Yes, lots of clusters; way fewer than 50 per cluster.
>
> > I recall reading something in a Red Hat doc saying that KVM itself
> > does not limit the number of hosts in a cluster. Actually, it does
> > not seem to have the notion of a cluster at all; that is created
> > solely in ACS, to facilitate management.
> >
> > To debug the problem, I would start with the following questions:
> >
> > Is every single cluster in your environment presenting that problem?
>
> No - a few clusters, and only some nodes within each cluster, not all.
>
> > What is the size of the physical hosts in your environment?
> > Do all of them have the same configuration?
> Yes, all hosts have the same configuration. Can't go into details, but
> it's rather large.
>
> > Do you know the load (resources allocated and used) being imposed on
> > the hosts that have shown those problems?
> > What over-commitment/provisioning factor are you using?
> Servers are not heavily taxed; we don't over-commit memory, and other
> components may be over-committed by a factor of 2 or less. Overall, we
> still have capacity to accommodate more VMs if needed; we just don't
> max it out.
>
> 
>
> Both Marcus and I are looking into this; it could be just our specific
> implementation - hence I wanted to see if anyone else in the community
> with heavy KVM usage has come across this issue.
>
> Maybe I need to ping LeaseWeb and ExtremePC folks..
>
> Thanks,
> ilya
> >
> > On Wed, Nov 18, 2015 at 8:19 PM, Daan Hoogland
> > <daan.hoogl...@gmail.com>
> > wrote:
> >
> >> Sounds like a bad limit, Ilya; I'll keep an eye out.
> >>
> >> On Wed, Nov 18, 2015 at 10:10 PM, ilya
> >> <ilya.mailing.li...@gmail.com>
> >> wrote:
> >>
> >>> I'm curious if anyone runs ACS with 250+ KVM hosts.
> >>>
> >>> We've been noticing weird issues with KVM where occasionally lots
> >>> of KVM agents get an NIO connection-closed error, followed by a
> >>> barrage of alerts.
> >>>
> >>> In some instances the agent reconnects right away and 

Re: [STABILITY] Large KVM Infrastructure with ACS

2015-11-20 Thread Daan Hoogland
On Thu, Nov 19, 2015 at 9:31 PM, ilya  wrote:

> Maybe I need to ping LeaseWeb and ExtremePC folks..
>

We have many smaller installations, nothing near those limits. (we being
LeaseWeb)



-- 
Daan


Re: [STABILITY] Large KVM Infrastructure with ACS

2015-11-19 Thread ilya
Rafael,

Please see response in-line:

On 11/18/15 4:16 PM, Rafael Weingärtner wrote:
> When you say 250+, you mean 250+ hosts spread across lots of clusters,
> right? If I am not mistaken, ACS limits the number of KVM hosts in a
> cluster - something like 50? I do not remember now whether that value
> can be configured; maybe it can be.

Yes, lots of clusters; way fewer than 50 per cluster.

> I recall reading something in a Red Hat doc saying that KVM itself does
> not limit the number of hosts in a cluster. Actually, it does not seem
> to have the notion of a cluster at all; that is created solely in ACS,
> to facilitate management.
>
> To debug the problem, I would start with the following questions:
>
> Is every single cluster in your environment presenting that problem?

No - a few clusters, and only some nodes within each cluster, not all.

> What is the size of the physical hosts in your environment? Do
> all of them have the same configuration?
Yes, all hosts have the same configuration. Can't go into details, but
it's rather large.

> Do you know the load (resources allocated and used) being imposed on
> the hosts that have shown those problems?
> What over-commitment/provisioning factor are you using?
Servers are not heavily taxed; we don't over-commit memory, and other
components may be over-committed by a factor of 2 or less. Overall, we
still have capacity to accommodate more VMs if needed; we just don't max
it out.



Both Marcus and I are looking into this; it could be just our specific
implementation - hence I wanted to see if anyone else in the community
with heavy KVM usage has come across this issue.

Maybe I need to ping LeaseWeb and ExtremePC folks..

Thanks,
ilya
> 
> On Wed, Nov 18, 2015 at 8:19 PM, Daan Hoogland 
> wrote:
> 
>> Sounds like a bad limit, Ilya; I'll keep an eye out.
>>
>> On Wed, Nov 18, 2015 at 10:10 PM, ilya 
>> wrote:
>>
>>> I'm curious if anyone runs ACS with 250+ KVM hosts.
>>>
>>> We've been noticing weird issues with KVM where occasionally lots of KVM
>>> agents get an NIO connection-closed error, followed by a barrage of alerts.
>>>
>>> In some instances the agent reconnects right away, and in others it
>>> attempts to reconnect but never receives an ACK from the MS.
>>>
>>> Please let me know if you have noticed anything like it and whether
>>> you found a solution.
>>>
>>> Also, it would help to know what global settings have been tuned to make
>>> things work better (aside from direct.agent.*) and how the MSs are running.
>>>
>>> Thanks
>>> ilya
>>>
>>
>>
>>
>> --
>> Daan
>>
> 
> 
> 


RE: [STABILITY] Large KVM Infrastructure with ACS

2015-11-19 Thread Paul Angus
Hi,

In the past, a couple of clients of ours have had issues with indirect agents
(KVM hosts and system VMs) connecting over port 8250, particularly if
connectivity was lost to the management server(s). They both had 300+ indirect
agents active.

In these circumstances we have found that running netstat on the mgmt
server(s) revealed many open but unused connections to port 8250.
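A quick way to tally those connections by TCP state is an awk filter over netstat-style output. In this sketch, sample lines are piped in so the filter is self-contained and its behavior is visible; on a live MS you would feed it real output (e.g. `netstat -ant | awk ...` - flags vary by distro):

```shell
# Count connections to the agent port (8250) by TCP state.
# The three sample lines below stand in for real `netstat -ant` output.
printf '%s\n' \
  'tcp 0 0 10.0.0.1:8250 10.0.1.5:44321 ESTABLISHED' \
  'tcp 0 0 10.0.0.1:8250 10.0.1.6:44322 ESTABLISHED' \
  'tcp 0 0 10.0.0.1:8250 10.0.1.7:44323 CLOSE_WAIT' |
awk '$4 ~ /:8250$/ {count[$6]++} END {for (s in count) print s, count[s]}' |
sort
```

A large and growing CLOSE_WAIT or ESTABLISHED count with no matching active agents is the "open but unused" symptom described above.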

I recall at one time we found the agent connection code had been altered to
attempt a reconnect if the connection didn't complete within 10 secs. However,
the failed connection would take 60 seconds to time out.

Another time we found that the management server and the MySQL DB were both
being starved of connections to the MySQL DB, so they could not process the
reconnections fast enough. The default from the mgmt server is 100 connections
and the documented setting for MySQL is 350 connections. However, external
connections (and additional mgmt servers) require these to be adjusted.
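As a sketch of where those two knobs live - file paths, the property name, and the values shown are assumptions from contemporary ACS/MySQL setups, so verify against your version before changing anything:

```properties
# /etc/cloudstack/management/db.properties -- per-MS connection pool.
# The 100-connection default cited above is the value to raise here.
db.cloud.maxActive=250

# /etc/my.cnf on the database host -- must cover (pool size x number of
# MSs) plus external clients, so well above the documented 350:
# [mysqld]
# max_connections=700
```

The rule of thumb in the paragraph above is the point: MySQL's max_connections has to exceed the sum of every management server's pool plus any external clients, or reconnection storms stall on DB access.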

-- just some ideas...


Regards,

Paul Angus
VP Technology/Cloud Architect
S: +44 20 3603 0540 | M: +447711418784 | T: CloudyAngus
paul.an...@shapeblue.com

-Original Message-
From: ilya [mailto:ilya.mailing.li...@gmail.com]
Sent: 19 November 2015 20:32
To: dev@cloudstack.apache.org
Subject: Re: [STABILITY] Large KVM Infrastructure with ACS

Rafael,

Please see response in-line:

On 11/18/15 4:16 PM, Rafael Weingärtner wrote:
> When you say 250+, you mean 250+ hosts spread across lots of clusters,
> right? If I am not mistaken, ACS limits the number of KVM hosts in a
> cluster - something like 50? I do not remember now whether that value
> can be configured; maybe it can be.

Yes, lots of clusters; way fewer than 50 per cluster.

> I recall reading something in a Red Hat doc saying that KVM itself does
> not limit the number of hosts in a cluster. Actually, it does not seem
> to have the notion of a cluster at all; that is created solely in ACS,
> to facilitate management.
>
> To debug the problem, I would start with the following questions:
>
> Is every single cluster in your environment presenting that problem?

No - a few clusters, and only some nodes within each cluster, not all.

> What is the size of the physical hosts in your environment?
> Do all of them have the same configuration?
Yes, all hosts have the same configuration. Can't go into details, but
it's rather large.

> Do you know the load (resources allocated and used) being imposed on
> the hosts that have shown those problems?
> What over-commitment/provisioning factor are you using?
Servers are not heavily taxed; we don't over-commit memory, and other
components may be over-committed by a factor of 2 or less. Overall, we
still have capacity to accommodate more VMs if needed; we just don't max
it out.



Both Marcus and I are looking into this; it could be just our specific
implementation - hence I wanted to see if anyone else in the community
with heavy KVM usage has come across this issue.

Maybe I need to ping LeaseWeb and ExtremePC folks..

Thanks,
ilya
>
> On Wed, Nov 18, 2015 at 8:19 PM, Daan Hoogland
> <daan.hoogl...@gmail.com>
> wrote:
>
>> Sounds like a bad limit, Ilya; I'll keep an eye out.
>>
>> On Wed, Nov 18, 2015 at 10:10 PM, ilya <ilya.mailing.li...@gmail.com>
>> wrote:
>>
>>> I'm curious if anyone runs ACS with 250+ KVM hosts.
>>>
>>> We've been noticing weird issues with KVM where occasionally lots of
>>> KVM agents get an NIO connection-closed error, followed by a barrage
>>> of alerts.
>>>
>>> In some instances the agent reconnects right away, and in others it
>>> attempts to reconnect but never receives an ACK from the MS.
>>>
>>> Please let me know if you have noticed anything like it and whether
>>> you found a solution.
>>>
>>> Also, it would help to know what global settings have been tuned to
>>> make things work better (aside from direct.agent.*) and how the MSs
>>> are running.
>>>
>>> Thanks
>>> ilya
>>>
>>
>>
>>
>> --
>> Daan
>>
>
>
>

Re: [STABILITY] Large KVM Infrastructure with ACS

2015-11-19 Thread Rafael Weingärtner
How many MSs do you have in your environment?

On Thu, Nov 19, 2015 at 7:56 PM, Paul Angus <paul.an...@shapeblue.com>
wrote:

> Hi,
>
> In the past, a couple of clients of ours have had issues with indirect
> agents (KVM hosts and system VMs) connecting over port 8250, particularly
> if connectivity was lost to the management server(s). They both had 300+
> indirect agents active.
>
> In these circumstances we have found that running netstat on the mgmt
> server(s) revealed many open but unused connections to port 8250.
>
> I recall at one time we found the agent connection code had been altered
> to attempt a reconnect if the connection didn't complete within 10 secs.
> However, the failed connection would take 60 seconds to time out.
>
> Another time we found that the management server and the MySQL DB were
> both being starved of connections to the MySQL DB, so they could not
> process the reconnections fast enough. The default from the mgmt server
> is 100 connections and the documented setting for MySQL is 350
> connections. However, external connections (and additional mgmt servers)
> require these to be adjusted.
>
> -- just some ideas...
>
>
> Regards,
>
> Paul Angus
> VP Technology/Cloud Architect
> S: +44 20 3603 0540 | M: +447711418784 | T: CloudyAngus
> paul.an...@shapeblue.com
>
> -Original Message-
> From: ilya [mailto:ilya.mailing.li...@gmail.com]
> Sent: 19 November 2015 20:32
> To: dev@cloudstack.apache.org
> Subject: Re: [STABILITY] Large KVM Infrastructure with ACS
>
> Rafael,
>
> Please see response in-line:
>
> On 11/18/15 4:16 PM, Rafael Weingärtner wrote:
> > When you say 250+, you mean 250+ hosts spread across lots of
> > clusters, right? If I am not mistaken, ACS limits the number of KVM
> > hosts in a cluster - something like 50? I do not remember now whether
> > that value can be configured; maybe it can be.
>
> Yes, lots of clusters; way fewer than 50 per cluster.
>
> > I recall reading something in a Red Hat doc saying that KVM itself
> > does not limit the number of hosts in a cluster. Actually, it does
> > not seem to have the notion of a cluster at all; that is created
> > solely in ACS, to facilitate management.
> >
> > To debug the problem, I would start with the following questions:
> >
> > Is every single cluster in your environment presenting that problem?
>
> No - a few clusters, and only some nodes within each cluster, not all.
>
> > What is the size of the physical hosts in your environment?
> > Do all of them have the same configuration?
> Yes, all hosts have the same configuration. Can't go into details, but
> it's rather large.
>
> > Do you know the load (resources allocated and used) being imposed on
> > the hosts that have shown those problems?
> > What over-commitment/provisioning factor are you using?
> Servers are not heavily taxed; we don't over-commit memory, and other
> components may be over-committed by a factor of 2 or less. Overall, we
> still have capacity to accommodate more VMs if needed; we just don't
> max it out.
>
> 
>
> Both Marcus and I are looking into this; it could be just our specific
> implementation - hence I wanted to see if anyone else in the community
> with heavy KVM usage has come across this issue.
>
> Maybe I need to ping LeaseWeb and ExtremePC folks..
>
> Thanks,
> ilya
> >
> > On Wed, Nov 18, 2015 at 8:19 PM, Daan Hoogland
> > <daan.hoogl...@gmail.com>
> > wrote:
> >
> >> Sounds like a bad limit, Ilya; I'll keep an eye out.
> >>
> >> On Wed, Nov 18, 2015 at 10:10 PM, ilya <ilya.mailing.li...@gmail.com>
> >> wrote:
> >>
> >>> I'm curious if anyone runs ACS with 250+ KVM hosts.
> >>>
> >>> We've been noticing weird issues with KVM where occasionally lots of
> >>> KVM agents get an NIO connection-closed error, followed by a barrage
> >>> of alerts.
> >>>
> >>> In some instances the agent reconnects right away, and in others it
> >>> attempts to reconnect but never receives an ACK from the MS.
> >>>
> >>> Please let me know if you have noticed anything like it and whether
> >>> you found a solution.
> >>>
> >>> Also, it would help to know what global settings have been tuned to
> >>> make things work better (aside from direct.agent.*) and how the MSs
> >>> are running.
> >>>
> >>> Thanks
> >>> ilya
> >>>
> >>
> >>
> >>
> >> --
> >> Daan
> >>
> >
> >
> >

Re: [STABILITY] Large KVM Infrastructure with ACS

2015-11-18 Thread Rafael Weingärtner
When you say 250+, you mean 250+ hosts spread across lots of clusters,
right? If I am not mistaken, ACS limits the number of KVM hosts in a
cluster - something like 50? I do not remember now whether that value can
be configured; maybe it can be.

I recall reading something in a Red Hat doc saying that KVM itself does
not limit the number of hosts in a cluster. Actually, it does not seem to
have the notion of a cluster at all; that is created solely in ACS, to
facilitate management.

To debug the problem, I would start with the following questions:

Is every single cluster in your environment presenting that problem?
What is the size of the physical hosts in your environment? Do all of
them have the same configuration?
Do you know the load (resources allocated and used) being imposed on the
hosts that have shown those problems?
What over-commitment/provisioning factor are you using?


On Wed, Nov 18, 2015 at 8:19 PM, Daan Hoogland 
wrote:

> Sounds like a bad limit, Ilya; I'll keep an eye out.
>
> On Wed, Nov 18, 2015 at 10:10 PM, ilya 
> wrote:
>
> > I'm curious if anyone runs ACS with 250+ KVM hosts.
> >
> > We've been noticing weird issues with KVM where occasionally lots of KVM
> > agents get an NIO connection-closed error, followed by a barrage of alerts.
> >
> > In some instances the agent reconnects right away, and in others it
> > attempts to reconnect but never receives an ACK from the MS.
> >
> > Please let me know if you have noticed anything like it and whether
> > you found a solution.
> >
> > Also, it would help to know what global settings have been tuned to make
> > things work better (aside from direct.agent.*) and how the MSs are running.
> >
> > Thanks
> > ilya
> >
>
>
>
> --
> Daan
>



-- 
Rafael Weingärtner


Re: [STABILITY] Large KVM Infrastructure with ACS

2015-11-18 Thread Daan Hoogland
Sounds like a bad limit, Ilya; I'll keep an eye out.

On Wed, Nov 18, 2015 at 10:10 PM, ilya  wrote:

> I'm curious if anyone runs ACS with 250+ KVM hosts.
>
> We've been noticing weird issues with KVM where occasionally lots of KVM
> agents get an NIO connection-closed error, followed by a barrage of alerts.
>
> In some instances the agent reconnects right away, and in others it
> attempts to reconnect but never receives an ACK from the MS.
>
> Please let me know if you have noticed anything like it and whether you
> found a solution.
>
> Also, it would help to know what global settings have been tuned to make
> things work better (aside from direct.agent.*) and how the MSs are running.
>
> Thanks
> ilya
>



-- 
Daan