I think having a test that would give us a baseline of what is expected is 
great. We can use the exact same test to validate whatever proxy we want to put 
in the middle.
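
For example, something like this rough sketch (stdlib-only Python; the URL, 
concurrency, and request counts are placeholders, not our real endpoint or 
targets) would give us comparable latency/error numbers for each proxy we try:

# Baseline load sketch - placeholders only, not our real endpoint or numbers.
import time
import statistics
import urllib.request
from concurrent.futures import ThreadPoolExecutor

TARGET_URL = "https://example.test/api/v1/web/guest/default/hello"  # placeholder
CONCURRENCY = 50       # placeholder
TOTAL_REQUESTS = 5000  # placeholder

def one_request(_):
    start = time.monotonic()
    try:
        with urllib.request.urlopen(TARGET_URL, timeout=30) as resp:
            resp.read()
            ok = 200 <= resp.status < 300
    except Exception:
        ok = False
    return ok, time.monotonic() - start

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(one_request, range(TOTAL_REQUESTS)))

latencies = sorted(d for _, d in results)
errors = sum(1 for ok, _ in results if not ok)
print("errors=%d median=%.3fs p95=%.3fs" % (
    errors, statistics.median(latencies), latencies[int(len(latencies) * 0.95)]))
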
Personally I wouldn’t ditch nginx just yet; it’s used with much bigger loads 
than our 4M requests a day, and if our configurations are to blame, we should 
try to fix them. I could take on the task of testing it by deploying a vanilla 
version, slowly adding the other components, and seeing where it breaks.

Thx,
Cosmin

From: Tyson Norris <tnor...@adobe.com>
Date: Thursday, November 7, 2019 at 5:48 PM
To: Grp-bladerunner-dev <grp-bladerunner-...@adobe.com>
Subject: Re: Load testing ethos k8s

Hi –
I’ve had some good success with load testing k8s today without nginx (not 
testing capacity issues, just lots of users, lots of activations).
I’ve been wondering how to deal with nginx – so here is some brainstorming.

  *   Is fixing nginx an option? I’m not sure how to comment, so I will ask: 
does anyone think this is worth trying (beyond all the things we have already 
tried)?
  *   If not, what does caching and /apis look like?
     *   I think /apis can potentially look like one of these (non-exhaustive 
list):
        *   Contour IngressRoutes + a service for CRUD operations (sync?) – 
that is, use the ethos contour ingress directly for routing /apis, by creating 
an API that does the CRUD, which is called from the apimgmt actions (see the 
first sketch after this list)
        *   Custom envoy instance for /apis, where we store the user’s API 
data in a db, expose the db to envoy RDS, and configure the route to send to 
the controller web action endpoint
        *   Other?
     *   Caching
        *   Since caching is currently enabled inside actions per-response, 
it is not easy to do anything except route everything (all /api/v1/web and 
/apis) to a caching service – one question is whether it would be better to be 
explicit about this at the action config level instead of forcing devs to code 
it into their action, e.g. wsk action create --annotation cache 30s
        *   I googled and didn’t find a caching filter for envoy – what about 
using a “standalone” cache like varnish? E.g. /api/v1/web -> varnish -> 
controller (see the second sketch after this list). There is some downside 
here in that all requests route through extra hops, even if only a small 
portion use caching – but this is similar to what happens today (all web 
requests hit redis to check cache), afaik (correct me?)
        *   There is an http cache filter “in progress” for envoy here: 
https://github.com/envoyproxy/envoy/pull/7198
        *   It would be interesting to know what the apigateway team is doing 
here, if anything.
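
To make the first /apis option a bit more concrete, here is a rough sketch of 
what the CRUD service could do (Python, using the kubernetes client; the 
namespace, fqdn, and field values are assumptions for illustration, not a real 
design):

# Rough sketch: materialize a user's /apis route as a Contour IngressRoute.
# Namespace, fqdn, and field values below are made-up placeholders.
from kubernetes import client, config

def create_api_route(name, fqdn, path, service, port, namespace="apis"):
    config.load_kube_config()  # or config.load_incluster_config() in-cluster
    body = {
        "apiVersion": "contour.heptio.com/v1beta1",
        "kind": "IngressRoute",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "virtualhost": {"fqdn": fqdn},
            "routes": [
                {"match": path, "services": [{"name": service, "port": port}]}
            ],
        },
    }
    # The CRUD API called from the apimgmt actions would create/update/delete these.
    return client.CustomObjectsApi().create_namespaced_custom_object(
        group="contour.heptio.com", version="v1beta1",
        namespace=namespace, plural="ingressroutes", body=body)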
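
And for the caching side, a toy sketch of the “standalone cache in front of 
the controller” flow – this is obviously not varnish or envoy; the upstream 
URL and fixed TTL are placeholders, and a real version would take the TTL from 
the action annotation or response headers rather than a constant:

# Toy cache-in-front sketch: key by path, honor a TTL, forward misses upstream.
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = "http://controller.internal:8080"  # placeholder
DEFAULT_TTL = 30  # seconds, placeholder (cf. --annotation cache 30s)

cache = {}  # path -> (expires_at, body)

class CachingProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        entry = cache.get(self.path)
        if entry and entry[0] > time.time():
            body, hit = entry[1], True
        else:
            with urllib.request.urlopen(UPSTREAM + self.path, timeout=30) as resp:
                body = resp.read()
            cache[self.path] = (time.time() + DEFAULT_TTL, body)
            hit = False
        self.send_response(200)
        self.send_header("X-Cache", "HIT" if hit else "MISS")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8081), CachingProxy).serve_forever()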



Thanks

Tyson



From: Tyson Norris <tnor...@adobe.com>
Date: Wednesday, November 6, 2019 at 2:52 PM
To: Grp-bladerunner-dev <grp-bladerunner-...@adobe.com>
Subject: Re: Load testing ethos k8s

I updated the EON issue with details as simplified as possible, to communicate 
the requirements and questions. Let me know if there are questions from the 
runtime side (or comment in the issue).

Thanks
Tyson

From: Tyson Norris <tnor...@adobe.com>
Date: Wednesday, November 6, 2019 at 9:44 AM
To: Grp-bladerunner-dev <grp-bladerunner-...@adobe.com>
Subject: Load testing ethos k8s

Hi –
I’ve been able to get some reasonable load tests going against ethos k8s (with 
nginx removed, so no /apis and no caching support), where we end up with 
resourcequota exceeded errors.
There are still some issues to fix on our side (e.g. to make sure failed 
activations are always properly retried), but we can effectively push on the 
cluster node scaling issue now.

I asked about resourcequota updates in this issue 
https://git.corp.adobe.com/adobe-platform/k8s-infrastructure/issues/1859#issuecomment-2048596 
and Dharma mentioned inviting someone (thanks Misha!) to the Azure capacity 
planning meeting. In the meantime, until resourcequota is fixed, I think we 
will only be able to test load that reaches scaling on a dedicated cluster.

I think we should update https://jira.corp.adobe.com/browse/EON-5854 to 
indicate the need for an overprovisioning + scaling type of setup, as opposed 
to a fixed “20 nodes” size, so that we can begin to exercise cluster scaling 
issues.

In the meantime, we can continue to work on invoker issues related to these 
cases.
Thanks
Tyson
