Re: [DISCUSS] CloudStack graceful shutdown

2018-04-20 Thread Rafael Weingärtner
Is that management server load balancing feature using static
configurations? I heard about it on the mailing list, but I did not follow
the implementation.

I do not see many problems with agents reconnecting. We can implement in
agents (not just KVM, but also system VMs) logic that, instead of using a
static pool of management servers configured in a properties file,
dynamically requests a list of available management servers via that list
management servers API method. This would require us to configure agents
with a load balancer URL that executes the balancing between multiple
management servers.

I am +1 to remove the need for that VIP, which executes the load balance
for connecting agents to management servers.
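Rafael's idea could be sketched like this (a Python sketch, not the actual Java agent code; fetch_ms_list() and its fields are hypothetical stand-ins for the proposed list-management-servers API):

```python
import random

def fetch_ms_list():
    # Hypothetical stand-in for the proposed "list management servers"
    # API call; a real agent would query a reachable management server.
    return [
        {"host": "ms1.example.com", "state": "Up"},
        {"host": "ms2.example.com", "state": "PrepareForMaintenance"},
    ]

def pick_management_server():
    # Connect only to servers in the "Up" state, so agents naturally
    # avoid servers that are draining for maintenance.
    candidates = [ms["host"] for ms in fetch_ms_list() if ms["state"] == "Up"]
    if not candidates:
        raise RuntimeError("no management server available")
    return random.choice(candidates)

print(pick_management_server())  # ms1.example.com (the only "Up" server)
```

An agent bootstrapped this way needs only one reachable address to fetch the list, rather than a VIP that load-balances port 8250.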

On Fri, Apr 20, 2018 at 4:41 PM, ilya musayev 
wrote:

> Rafael and Community
>
> All is well and good and I think we are thinking along similar lines -
> the only issue that I see right now with any approach is KVM Agents (or
> direct agents) and using a LoadBalancer on 8250.
>
> Here is a scenario:
>
> You have a 2 Management Server setup fronted with a VIP on port 8250.
> The LB algorithm used is either Round Robin or Least Connections.
> You initiate a maintenance mode operation on one of the MS servers (call it
> MS1) - assume you have a long-running migration job that needs 60 minutes
> to complete.
> We attempt to evacuate the agents by telling them to disconnect and
> reconnect again.
> If we are using LB on 8250 with:
> 1) Least Connections - then all agents will continuously try to connect
> to the MS1 node that is attempting to go down for maintenance. Essentially,
> with this LB configuration the operation will never complete.
> 2) Round Robin - this will take a while, but eventually you will get all
> nodes connected to MS2.
>
> The current limitation is the usage of an external LB on 8250. For this
> operation to work without issues, agents must connect to the MS servers
> without an LB. This is a recent feature we've developed with ShapeBlue -
> where we maintain the list of CloudStack Management Servers in the
> agent.properties file.
>
> Unless you can think of another solution - it appears we may be forced to
> bypass the 8250 VIP LB and use the new feature to maintain the list of
> management servers within agent.properties.
>
>
> I need to run now, let me know what your thoughts are.
>
> Regards
> ilya
>
>
>
> On Tue, Apr 17, 2018 at 8:27 AM, Rafael Weingärtner <
> rafaelweingart...@gmail.com> wrote:
>
> > Ilya and others,
> >
> > We have been discussing this idea of graceful/nicely shutdown.  Our
> feeling
> > is that we (in CloudStack community) might have been trying to solve this
> > problem with too much scripting. What if we developed a more integrated
> > (native) solution?
> >
> > Let me explain our idea.
> >
> > ACS has a table called “mshost”, which is used to store management server
> > information. During balancing and when jobs are dispatched to other
> > management servers this table is consulted/queried.  Therefore, we have
> > been discussing the idea of creating a management API for management
> > servers.  We could have an API method that changes the state of
> > management servers to “prepare for maintenance” and then “maintenance”
> > (as soon as all of the tasks/jobs it is managing finish). The idea is
> > that during rebalancing we would remove the hosts of servers that are
> > not in “Up” state (of course we would also ignore hosts in the
> > aforementioned state to receive hosts to manage).  Moreover, when we
> > send/dispatch jobs to other management servers, we could ignore the
> > ones that are not in “Up” state (which is something already done).
> >
> > By doing this, the graceful shutdown could be executed in a few steps.
> >
> > 1 – issue the maintenance method for the management server you desire
> > 2 – wait until the MS goes into maintenance mode; while there are still
> > running jobs, it (the management server) will remain in the prepare for
> > maintenance state
> > 3 – execute the Linux shutdown command
> >
> > We would need other API methods to manage MSs then. An (i) API method to
> > list MSs, and we could even create an (ii) API to remove old/de-activated
> > management servers, which we currently do not have (forcing users to
> > apply changes directly in the database).
> >
> > Moreover, in this model, we would not kill hanging jobs; we would wait
> > until they expire and ACS expunges them. Of course, it is possible to
> > develop a forceful maintenance method as well. Then, when the “prepare
> > for maintenance” phase takes longer than a configured threshold, we
> > could kill hanging jobs.
> >
> > All of this would allow the MS to be kept up and receiving requests until
> > it can be safely shut down. What do you guys think about this approach?
> >
> > On Tue, Apr 10, 2018 at 6:52 PM, Yiping Zhang 
> wrote:
> >
> > > As a cloud admin, I would love to have this feature.
> > >
> > > It so happens that I just 

Re: [DISCUSS] CloudStack graceful shutdown

2018-04-20 Thread ilya musayev
Rafael and Community

All is well and good and I think we are thinking along similar lines -
the only issue that I see right now with any approach is KVM Agents (or
direct agents) and using a LoadBalancer on 8250.

Here is a scenario:

You have a 2 Management Server setup fronted with a VIP on port 8250.
The LB algorithm used is either Round Robin or Least Connections.
You initiate a maintenance mode operation on one of the MS servers (call it
MS1) - assume you have a long-running migration job that needs 60 minutes
to complete.
We attempt to evacuate the agents by telling them to disconnect and
reconnect again.
If we are using LB on 8250 with:
1) Least Connections - then all agents will continuously try to connect
to the MS1 node that is attempting to go down for maintenance. Essentially,
with this LB configuration the operation will never complete.
2) Round Robin - this will take a while, but eventually you will get all
nodes connected to MS2.
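The two behaviours above can be illustrated with a toy simulation (not CloudStack code; assumes 8 agents per server and that the VIP re-balances each reconnect round):

```python
def drain_ms1(agents, algorithm, max_rounds=100):
    # Both servers start with `agents` connections each; MS1 is draining.
    conns = {"MS1": agents, "MS2": agents}
    rr_counter = 0
    for round_no in range(1, max_rounds + 1):
        # MS1 disconnects its agents; each one reconnects through the VIP.
        reconnecting, conns["MS1"] = conns["MS1"], 0
        for _ in range(reconnecting):
            if algorithm == "least_connections":
                # MS1 just dropped to zero connections, so the LB keeps
                # sending reconnecting agents straight back to it.
                target = min(conns, key=conns.get)
            else:  # round robin
                target = ["MS1", "MS2"][rr_counter % 2]
                rr_counter += 1
            conns[target] += 1
        if conns["MS1"] == 0:
            return round_no  # drained in this many rounds
    return None  # never drained

print(drain_ms1(8, "round_robin"))        # drains after a few rounds
print(drain_ms1(8, "least_connections"))  # None: MS1 never drains
```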

The current limitation is the usage of an external LB on 8250. For this
operation to work without issues, agents must connect to the MS servers
without an LB. This is a recent feature we've developed with ShapeBlue -
where we maintain the list of CloudStack Management Servers in the
agent.properties file.

Unless you can think of another solution - it appears we may be forced to
bypass the 8250 VIP LB and use the new feature to maintain the list of
management servers within agent.properties.


I need to run now, let me know what your thoughts are.

Regards
ilya



On Tue, Apr 17, 2018 at 8:27 AM, Rafael Weingärtner <
rafaelweingart...@gmail.com> wrote:

> Ilya and others,
>
> We have been discussing this idea of graceful/nicely shutdown.  Our feeling
> is that we (in CloudStack community) might have been trying to solve this
> problem with too much scripting. What if we developed a more integrated
> (native) solution?
>
> Let me explain our idea.
>
> ACS has a table called “mshost”, which is used to store management server
> information. During balancing and when jobs are dispatched to other
> management servers this table is consulted/queried.  Therefore, we have
> been discussing the idea of creating a management API for management
> servers.  We could have an API method that changes the state of management
> servers to “prepare for maintenance” and then “maintenance” (as soon as all
> of the tasks/jobs it is managing finish). The idea is that during
> rebalancing we would remove the hosts of servers that are not in “Up” state
> (of course we would also ignore hosts in the aforementioned state to
> receive hosts to manage).  Moreover, when we send/dispatch jobs to other
> management servers, we could ignore the ones that are not in “Up” state
> (which is something already done).
>
> By doing this, the graceful shutdown could be executed in a few steps.
>
> 1 – issue the maintenance method for the management server you desire
> 2 – wait until the MS goes into maintenance mode; while there are still
> running jobs, it (the management server) will remain in the prepare for
> maintenance state
> 3 – execute the Linux shutdown command
>
> We would need other API methods to manage MSs then. An (i) API method to
> list MSs, and we could even create an (ii) API to remove old/de-activated
> management servers, which we currently do not have (forcing users to apply
> changes directly in the database).
>
> Moreover, in this model, we would not kill hanging jobs; we would wait
> until they expire and ACS expunges them. Of course, it is possible to
> develop a forceful maintenance method as well. Then, when the “prepare for
> maintenance” phase takes longer than a configured threshold, we could kill
> hanging jobs.
>
> All of this would allow the MS to be kept up and receiving requests until
> it can be safely shut down. What do you guys think about this approach?
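The state transitions Rafael describes could be modeled roughly like this (the state names and fields are assumptions for illustration, not the actual mshost schema):

```python
class ManagementServer:
    # Assumed states: Up -> PrepareForMaintenance -> Maintenance.
    def __init__(self, name, running_jobs):
        self.name = name
        self.state = "Up"
        self.running_jobs = running_jobs

    def prepare_for_maintenance(self):
        # Step 1: the operator issues the maintenance API method.
        self.state = "PrepareForMaintenance"

    def job_finished(self):
        # Step 2: remain in PrepareForMaintenance until every job this
        # server manages has finished, then enter Maintenance.
        self.running_jobs -= 1
        if self.state == "PrepareForMaintenance" and self.running_jobs == 0:
            self.state = "Maintenance"

ms = ManagementServer("MS1", running_jobs=2)
ms.prepare_for_maintenance()
ms.job_finished()
print(ms.state)  # PrepareForMaintenance: one job still running
ms.job_finished()
print(ms.state)  # Maintenance: now safe to run the Linux shutdown command
```

Rebalancing and job dispatch would simply skip any server whose state is not "Up" (step 3 is the ordinary OS shutdown once the server reaches Maintenance).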
>
> On Tue, Apr 10, 2018 at 6:52 PM, Yiping Zhang  wrote:
>
> > As a cloud admin, I would love to have this feature.
> >
> > It so happens that I just accidentally restarted my ACS management server
> > while two instances were migrating to another Xen cluster (via storage
> > migration, not live migration).  As a result, both instances ended up
> > with corrupted data disks which can't be reattached or migrated.
> >
> > Any feature which prevents this from happening would be great.  A
> > low-hanging fruit is simply checking whether there are any async jobs
> > running, especially any kind of migration or other known long-running
> > jobs, and warning the operator so that he has a chance to abort the
> > server shutdown.
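Yiping's check could be sketched roughly as follows (the job list and command names are stubbed assumptions; a real check would query the async job table or API):

```python
LONG_RUNNING = {"MigrateVM", "MigrateVolume", "CopyTemplate"}  # assumed names

def shutdown_warnings(running_jobs):
    # Warn about any still-running job that is known to be long-running,
    # giving the operator a chance to abort the shutdown.
    return [
        f"job {job['id']} ({job['cmd']}) is still running; "
        "shutting down now may corrupt it"
        for job in running_jobs
        if job["cmd"] in LONG_RUNNING
    ]

jobs = [{"id": 42, "cmd": "MigrateVolume"}, {"id": 43, "cmd": "ListVMs"}]
for warning in shutdown_warnings(jobs):
    print("WARNING:", warning)
```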
> >
> > Yiping
> >
> > On 4/5/18, 3:13 PM, "ilya musayev" 
> wrote:
> >
> > Andrija
> >
> > This is a tough scenario.
> >
> > As an admin, the way I would have handled this situation is to advertise
> > the upcoming outage and then take away specific API commands from a user
> > a day before - so he does not cause any long-running async jobs. Once
> > 

Re: [DISCUSS] Why we MARK packets?

2018-04-20 Thread Ron Wheeler

https://markandruth.co.uk/2016/08/08/testing-the-performance-of-the-linux-firewall
Does not directly address marking but does benchmark a number of 
iptables filtering tasks which may give some insight into the 
performance implications of using iptables for routing and filtering.


https://www.digitalocean.com/community/tutorials/a-deep-dive-into-iptables-and-netfilter-architecture
A nice article on iptables. No performance info but a good read for 
anyone who needs to tackle routing and marking.


 https://www.markusschade.com/papers/firewall.pdf
Has some performance info and a discussion about the host limit with MARK.

I am a bit skeptical that MARK adds a lot of overhead, but this is only 
based on the belief that CPUs are orders of magnitude faster than networks.
The decision processes on entry and exit are both pretty simple and, 
depending on the implementation of the internal table, should not take a 
lot of memory or CPU to do the table lookups.


I did not read all 60,400 links returned by Google, but the ones that I 
did read did not warn of any big problems with performance when using MARK.


I use a simple static MARK configuration in my firewall to solve a 
routing problem, but my performance requirements are not relevant to a 
datacentre with hundreds of CPUs and hundreds of entries in the MARK 
routing map.


I hope that this helps.

Ron


On 20/04/2018 9:04 AM, Rohit Yadav wrote:

Thanks Jayapal. I don't have any comparative study yet, but I'll explore 
in future whether we can get away without marking (mangling) packets, which 
is generally an expensive task.


- Rohit






From: Jayapal Uradi 
Sent: Thursday, April 19, 2018 10:33:25 AM
To: dev@cloudstack.apache.org
Cc: us...@cloudstack.apache.org
Subject: Re: [DISCUSS] Why we MARK packets?

Rohit,

My comments inline.


rohit.ya...@shapeblue.com
www.shapeblue.com
53 Chandos Place, Covent Garden, London  WC2N 4HSUK
@shapeblue
   
  


On Apr 19, 2018, at 1:52 AM, Rohit Yadav wrote:

Nevermind, found the use of custom routing tables. In case someone want to 
refer, hints are here:

https://github.com/apache/cloudstack/pull/2514#issuecomment-382510915


Jayapal and others - I've another one, is there a way to do routing without 
marking packets at all, even in case of VRs with additional public interfaces?

AFAIK marking is the only way to do it.
Do you have any performance numbers with and without mark rules.

- Rohit





From: Rohit Yadav
Sent: Wednesday, April 18, 2018 10:39:02 PM
To: dev@cloudstack.apache.org; 
us...@cloudstack.apache.org
Subject: [DISCUSS] Why we MARK packets?

All,


I could not find any history around 'why' we MARK or CONNMARK packets in 
the mangle table in VRs. I found an issue in case of VPCs where `MARK` 
iptables rules broke hairpin NAT (as described in this PR: 
https://github.com/apache/cloudstack/pull/2514)


The one valid usage I found was w.r.t. VPN_STATS; however, that usage is not 
exported at all, it is commented out:

https://github.com/apache/cloudstack/blob/master/systemvm/debian/opt/cloud/bin/vpc_netusage.sh#L141


Other than for debugging purposes in the VR, I could not find any valid use 
for marking packets and connections. Please do share if you're using marked 
packets (such as VPN ones etc.) outside of the VR scope.


I propose we remove MARK on packets, which is CPU intensive and slows the 
traffic (a bit); instead, CONNMARK can still be used to mark connections and 
debug VRs without actually changing the packet marking permanently. Thoughts?


- Rohit






DISCLAIMER
==
This e-mail may contain privileged and confidential information which is the 
property of Accelerite, a Persistent Systems business. It is intended only for 
the use of the individual or entity to which it is addressed. If you are not 
the intended recipient, you are not authorized to read, retain, copy, print, 
distribute or use this message. If you have received this communication in 
error, please notify the sender and delete all copies of this message. 
Accelerite, a Persistent Systems business does not accept any liability for 
virus infected mails.



--
Ron Wheeler
President
Artifact Software Inc
email: rwhee...@artifact-software.com
skype: 

[DISCUSS] 4.11.1.0 release

2018-04-20 Thread Rohit Yadav
All,


I would like to kick off a discussion around the next 4.11 LTS minor/bugfix 
release - 4.11.1.0.


Many of us have tried to discuss, triage and fix issues that you've reported 
on both the users and dev MLs. It is possible that some of those discussions 
have not reached a conclusion/solution yet. Therefore, first check whether your 
reported issue has been reported or fixed towards the 4.11.1.0 milestone here:

https://github.com/apache/cloudstack/milestone/5


In case it is not reported/fixed, please report your issue with details here 
(perhaps giving a link to an existing email thread):

https://github.com/apache/cloudstack/issues


When you do report an issue (or send a PR), especially an upgrade or 
regression related one, please do use the milestone 4.11.1.0 along with 
suitable labels, which will make it easier to triage them.


I still don't have an exact timeline to share yet but I'll try to get that 
shared as soon as next week. Thanks.


- Rohit




[ASK][PARTICIPATE] Test new cloudmonkey v6.0.0-alpha1

2018-04-20 Thread Rohit Yadav
All,


The go-based port is functionally complete! I would like to invite interested 
users to help test it.


You can build yourself the binaries from here:

github.com/apache/cloudstack-cloudmonkey


Alternatively, you can test using pre-built binaries from here:

https://lab.yadav.cloud/testing/cmk/

Please report issues here:
https://github.com/apache/cloudstack-cloudmonkey/issues

You can use the tool in an environment with the (python based) cloudmonkey 
installed. The new port will create its directories and store config at 
~/.cmk/. The tool is intended to be a drop-in replacement; most of the old 
docs are still applicable:

Known issues and limitations:
- No in-built API cache; run sync on first run.
- Only JSON output is supported now, without any filtering.
- Parameter completion with a single tab may not show/display properly sometimes.
- API parameter completion causes a new network request every time; no caching 
is done.
- Windows port (exe) will fail at unicode printing and parameter completion.
- Command line flags are not implemented.

Removed features:
- shell execution using shell keyword or !
- api keyword based raw API execution
- Several set options removed, defaults to be used

Things to be done on roadmap:
- Add support for csv, table, text and xml output, with filtering support.
- Add parameter completion caching
- Add history support and execution
- Add command line flags support
- Refactor and add tests, dockerfile
- Update README and docs


- Rohit








Re: [DISCUSS] Why we MARK packets?

2018-04-20 Thread Rohit Yadav
Thanks Jayapal. I don't have any comparative study yet, but I'll explore 
in future whether we can get away without marking (mangling) packets, which 
is generally an expensive task.


- Rohit






From: Jayapal Uradi 
Sent: Thursday, April 19, 2018 10:33:25 AM
To: dev@cloudstack.apache.org
Cc: us...@cloudstack.apache.org
Subject: Re: [DISCUSS] Why we MARK packets?

Rohit,

My comments inline.



On Apr 19, 2018, at 1:52 AM, Rohit Yadav wrote:

Nevermind, found the use of custom routing tables. In case someone want to 
refer, hints are here:

https://github.com/apache/cloudstack/pull/2514#issuecomment-382510915


Jayapal and others - I've another one, is there a way to do routing without 
marking packets at all, even in case of VRs with additional public interfaces?

AFAIK marking is the only way to do it.
Do you have any performance numbers with and without mark rules.

- Rohit





From: Rohit Yadav
Sent: Wednesday, April 18, 2018 10:39:02 PM
To: dev@cloudstack.apache.org; 
us...@cloudstack.apache.org
Subject: [DISCUSS] Why we MARK packets?

All,


I could not find any history around 'why' we MARK or CONNMARK packets in 
the mangle table in VRs. I found an issue in case of VPCs where `MARK` 
iptables rules broke hairpin NAT (as described in this PR: 
https://github.com/apache/cloudstack/pull/2514)


The one valid usage I found was w.r.t. VPN_STATS; however, that usage is not 
exported at all, it is commented out:

https://github.com/apache/cloudstack/blob/master/systemvm/debian/opt/cloud/bin/vpc_netusage.sh#L141


Other than for debugging purposes in the VR, I could not find any valid use 
for marking packets and connections. Please do share if you're using marked 
packets (such as VPN ones etc.) outside of the VR scope.


I propose we remove MARK on packets, which is CPU intensive and slows the 
traffic (a bit); instead, CONNMARK can still be used to mark connections and 
debug VRs without actually changing the packet marking permanently. Thoughts?


- Rohit







Re: Upgrade from ACS 4.9.X to 4.11.0 broke VPC source NAT

2018-04-20 Thread Andrei Mikhailovsky
Thanks



- Original Message -
> From: "Rohit Yadav" 
> To: "dev" , "dev" 
> Sent: Friday, 20 April, 2018 10:35:55
> Subject: Re: Upgrade from ACS 4.9.X to 4.11.0 broke VPC source NAT

> Hi Andrei,
> 
> I've fixed this recently, please see
> https://github.com/apache/cloudstack/pull/2579
> 
> As a workaround you can add routing rules manually. On the PR, there is a link
> to a comment that explains the issue and suggests a manual workaround. Let me
> know if that works for you.
> 
> Regards.
> 
> 
> From: Andrei Mikhailovsky
> Sent: Friday, 20 April, 2:21 PM
> Subject: Upgrade from ACS 4.9.X to 4.11.0 broke VPC source NAT
> To: dev
> 
> 

Re: Upgrade from ACS 4.9.X to 4.11.0 broke VPC source NAT

2018-04-20 Thread Rohit Yadav
Hi Andrei,

I've fixed this recently, please see
https://github.com/apache/cloudstack/pull/2579

As a workaround you can add routing rules manually. On the PR, there is a link 
to a comment that explains the issue and suggests a manual workaround. Let me 
know if that works for you.

Regards.


From: Andrei Mikhailovsky
Sent: Friday, 20 April, 2:21 PM
Subject: Upgrade from ACS 4.9.X to 4.11.0 broke VPC source NAT
To: dev



Upgrade from ACS 4.9.X to 4.11.0 broke VPC source NAT

2018-04-20 Thread Andrei Mikhailovsky
Hello, 

I have been posting to the users thread about this issue. Here is a quick 
summary in case people contributing to the source NAT code on the VPC side 
would like to fix this issue. 


Problem summary: no connectivity between virtual machines behind two Static NAT 
networks. 

Problem case: When one virtual machine sends a packet to the external address 
of another virtual machine, where both machines are handled by the same router 
and are behind Static NAT, the traffic does not work. 



10.1.10.100 10.1.10.1:eth2 eth3:10.1.20.1 10.1.20.100 
virt1 <---> router <---> virt2 
178.248.108.77:eth1:178.248.108.113 


a single packet is sent from virt1 to virt2. 


stage1: it arrives at the router on eth2 and enters "nat_PREROUTING" 
IN=eth2 OUT= SRC=10.1.10.100 DST=178.248.108.113 

goes through the "10 1K DNAT all -- * * 0.0.0.0/0 178.248.108.113 
to:10.1.20.100 
" rule and has the DST DNATED to the internal IP of the virt2 


stage2: Enters the FORWARDING chain and is being DROPPED by the default policy. 
DROPPED:IN=eth2 OUT=eth1 SRC=10.1.10.100 DST=10.1.20.100 

The reason is that the OUT interface is not correctly changed from eth1 
to eth3 during nat_PREROUTING, 
so the packet is not intercepted by the FORWARD rule and thus not accepted. 
"24 14K ACL_INBOUND_eth3 all -- * eth3 0.0.0.0/0 10.1.20.0/24" 


stage3: manually inserted rule to accept this packet for FORWARDING. 
the packet enters the "nat_POSTROUTING" chain 
IN= OUT=eth1 SRC=10.1.10.100 DST=10.1.20.100 

and has the SRC changed to the external IP 
16 1320 SNAT all -- * eth1 10.1.10.100 0.0.0.0/0 to:178.248.108.77 

and is sent to the external network on eth1. 
13:37:44.834341 IP 178.248.108.77 > 10.1.20.100: ICMP echo request, id 2644, 
seq 2, length 64 


For some reason, during the nat_PREROUTING stage the DST_IP is changed, but the 
OUT interface still reflects the interface associated with the old DST_IP. 

Here is the routing table 
# ip route list 
default via 178.248.108.1 dev eth1 
10.1.10.0/24 dev eth2 proto kernel scope link src 10.1.10.1 
10.1.20.0/24 dev eth3 proto kernel scope link src 10.1.20.1 
169.254.0.0/16 dev eth0 proto kernel scope link src 169.254.0.5 
178.248.108.0/25 dev eth1 proto kernel scope link src 178.248.108.101 

# ip rule list 
0: from all lookup local 
32761: from all fwmark 0x3 lookup Table_eth3 
32762: from all fwmark 0x2 lookup Table_eth2 
32763: from all fwmark 0x1 lookup Table_eth1 
32764: from 10.1.0.0/16 lookup static_route_back 
32765: from 10.1.0.0/16 lookup static_route 
32766: from all lookup main 
32767: from all lookup default 


Further into the investigation, the problem was pinned down to these rules. 
All the traffic from internal IPs on the static NATed connections was forced to 
go to the outside interface (eth1), by setting the mark 0x1 and then using the 
matching ip rule to direct it. 

#iptables -t mangle -L PREROUTING -vn 
Chain PREROUTING (policy ACCEPT 97 packets, 11395 bytes) 
pkts bytes target prot opt in out source destination 
49 3644 CONNMARK all -- * * 10.1.10.100 0.0.0.0/0 state NEW CONNMARK save 
37 2720 MARK all -- * * 10.1.20.100 0.0.0.0/0 state NEW MARK set 0x1 
37 2720 CONNMARK all -- * * 10.1.20.100 0.0.0.0/0 state NEW CONNMARK save 
114 8472 MARK all -- * * 10.1.10.100 0.0.0.0/0 state NEW MARK set 0x1 
114 8472 CONNMARK all -- * * 10.1.10.100 0.0.0.0/0 state NEW CONNMARK save 


# ip rule 
0: from all lookup local 
32761: from all fwmark 0x3 lookup Table_eth3 
32762: from all fwmark 0x2 lookup Table_eth2 
32763: from all fwmark 0x1 lookup Table_eth1 
32764: from 10.1.0.0/16 lookup static_route_back 
32765: from 10.1.0.0/16 lookup static_route 
32766: from all lookup main 
32767: from all lookup default 


Is the acceptable solution to delete those rules altogether? 

The problem with such an approach is that the inter-VPC traffic will use the 
internal IP addresses, 
so the packets going from 178.248.108.77 to 178.248.108.113 
would be seen as communication between 10.1.10.100 and 10.1.20.100 

thus we need to apply two further rules: 
# iptables -t nat -I POSTROUTING -o eth3 -s 10.1.10.0/24 -d 10.1.20.0/24 -j 
SNAT --to-source 178.248.108.77 
# iptables -t nat -I POSTROUTING -o eth2 -s 10.1.20.0/24 -d 10.1.10.0/24 -j 
SNAT --to-source 178.248.108.113 

in order to make sure that the packets leaving the router would have correct 
source IP. 

This way it is possible to have static NAT on all of the IPs within the VPC and 
ensure successful communication between them. 


So, for a quick and dirty fix, we ran this command on the VR: 

for i in $(iptables -t mangle -L PREROUTING -vn | awk '/0x1/ && !/eth1/ {print $8}'); do 
    iptables -t mangle -D PREROUTING -s "$i" -m state --state NEW -j MARK --set-mark 0x1 
done 



The issue was introduced around the early 4.9.x releases, I believe. 


Thanks 

Andrei 





- Original Message - 
> From: "Andrei Mikhailovsky"  
> To: "users"  
> Sent: Monday, 16