If we are going to have a dynamic signing key, how is that shared between all of the services?
Whatever mechanism that is must also be a secured API, which introduces another point of attack for a potential hacker. I have a very strong preference for something like TLS client auth over hardcoding usernames/passwords into config files.

—Eric

On May 11, 2017, at 12:23 PM, Chris Lemmons <[email protected]> wrote:

> invalidate ALL tokens by changing the token signing key

Interesting idea. That does mean that the signing key has to be retrieved every time from the authentication authority, or it'd be subject to the exact same set of attacks. But a nearly-constant, rarely changing key could be communicated very efficiently, I suspect. And if the authentication system is a web API, it can even use If-Modified-Since to return 304 99% of the time for maximum efficiency.

It does have the downside that key-invalidation events are fairly significant. You'd need to invalidate the keys whenever someone's access was reduced or removed. As the number of accounts in the system increases, that might not wind up being as infrequent as one might hope. It's easy to implement, though.

On Thu, May 11, 2017 at 10:12 AM Jeremy Mitchell <[email protected]> wrote:

Regarding the TTL on the JWT token: a 5-minute TTL seems silly. What's the point? Unless we get into refresh tokens, but that sounds like OAuth... blah.

What about this (and maybe I'm oversimplifying): the TTL on the JWT is 24 hours. If we become aware that a token has been compromised, invalidate ALL tokens by changing the token signing key. Maybe this is a good idea or maybe this is a terrible idea. I have no idea. Just a thought.

jeremy

On Wed, May 10, 2017 at 12:23 PM, Chris Lemmons <[email protected]> wrote:

Responding to a few people:

> Often times every auth action must be accompanied by DB writes for audit logs or callback functions.

True.
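To make the "rotate the signing key to invalidate everything" idea concrete, here is a minimal sketch of why it works, assuming HMAC-signed (HS256-style) tokens. The `sign`/`verify` helpers below are illustrative, not any actual Traffic Control code; a real deployment would use a vetted JWT library.

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> bytes:
    # JWT uses unpadded base64url encoding.
    return base64.urlsafe_b64encode(data).rstrip(b"=")

def sign(claims: dict, key: bytes) -> str:
    """Produce a minimal HS256-style token: header.payload.signature."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    mac = hmac.new(key, header + b"." + payload, hashlib.sha256).digest()
    return (header + b"." + payload + b"." + b64url(mac)).decode()

def verify(token: str, key: bytes) -> bool:
    """A token only verifies under the key that signed it."""
    try:
        header, payload, sig = token.encode().split(b".")
    except ValueError:
        return False
    expected = b64url(hmac.new(key, header + b"." + payload, hashlib.sha256).digest())
    return hmac.compare_digest(expected, sig)
```

Because the signature binds the token to one key, replacing `key-v1` with `key-v2` at the verifier instantly rejects every outstanding token, which is exactly the mass-invalidation event discussed above.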
But a) if logging is too expensive, it should probably be made cheaper, and b) the answer to "audits are too expensive" probably isn't "let's just do less authentication". If the audit log is genuinely the bottleneck, it would still be better to re-auth without the audit log.

> The API gateway can poll for the latest list of tokens at a regular interval

Yeah, datastore replication for local performance is great. Though if you can reasonably query for a list of all valid tokens every second, it's probably cheaper to just query for the token you need every time you need it. If there are massive batches of queries coming through, it's probably not unreasonable to choose not to re-validate a token that's been validated in the last second.

> Regarding maliciously delayed messages - I don't fully understand the point; if an attacker has such capabilities she can simply prevent/delay devops users from updating the auth database itself, thus enabling the attack.

In a typical attack, an attacker might gain control of a box on the local network, but not necessarily the Gateway, Traffic Ops, or Auth Server. Those are probably better hardened. But lots of networks have a squishy test box that everyone forgot was there or something. The bad guy wants to use the CDN to DOS someone, or redirect traffic to somewhere malicious, or just cause mayhem. The longer he can keep control, the better for him.

So this attacker uses the local box to sniff the token off the network. If the communication with the Gateway is encrypted, he might have to do some ARP poisoning or something else to trick a host into talking to the local box instead. (Properly implemented TLS also mitigates this angle.) He knows that as soon as he starts his nefarious deed, alarms are going to go off, so he also uses this local box to DOS the Auth Server. It's a lot easier to take a box down from the outside than to actually gain control.
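The "don't re-validate a token validated in the last second" compromise above could be sketched as a tiny positive-result cache in front of the datastore check. This is an illustrative sketch, not existing code; `check_db` is a hypothetical stand-in for the real "is this token still valid?" query.

```python
import time

class ValidationCache:
    """Skip re-validating a token validated within the last `ttl` seconds."""

    def __init__(self, check_db, ttl=1.0, clock=time.monotonic):
        self.check_db = check_db  # callable: token -> bool (the real datastore hit)
        self.ttl = ttl
        self.clock = clock
        self._seen = {}  # token -> timestamp of last successful validation

    def is_valid(self, token):
        now = self.clock()
        last = self._seen.get(token)
        if last is not None and now - last < self.ttl:
            return True  # validated recently; skip the datastore hit
        if self.check_db(token):
            self._seen[token] = now
            return True
        self._seen.pop(token, None)  # never cache a negative result
        return False
```

Note the trade-off Chris describes: the `ttl` here is exactly the window during which a just-revoked token is still accepted, which is why it has to stay very small.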
If the Gateway "fails open" when it can't contact the Auth server, the attacker remains in control. If it "fails closed", the attacker has to actually compromise the auth server (which is harder) to remain in control.

> Do we block all API calls if the auth service is temporarily down (being upgraded, container restarting, etc…)?

Yes, I think we have to. Authentication is integral to reliable operation. We've been talking in some fairly wild hypotheticals, though. Is there a specific auth service you're envisioning?

On Wed, May 10, 2017 at 12:50 AM Shmulik Asafi <[email protected]> wrote:

Regarding the communication issue Chris raised - there is more than one possible pattern to this, e.g.:

- Blacklisted tokens can be communicated via a pub-sub mechanism
- The API gateway can poll for the latest list of tokens at a regular interval (which can be very short, ~1 sec, much shorter than the time it takes devops to detect and react to malign tokens)

Regarding hitting the blacklist datastore - this only sounds similar to hitting the auth database; but the simplicity of a blacklist function allows you to employ more efficient datastores, e.g. Redis or just a hashmap in the API gateway process memory.

Regarding maliciously delayed messages - I don't fully understand the point; if an attacker has such capabilities, she can simply prevent/delay devops users from updating the auth database itself, thus enabling the attack.

On Wed, May 10, 2017 at 4:25 AM, Eric Friedrich (efriedri) <[email protected]> wrote:

Our current management wrapper around Traffic Control (called OMD Director, demo'd at the last TC summit) uses a very similar approach to authentication. We have an auth service that issues a JWT. The JWT is then provided along with all API calls. A few comments on our practical experience:

- I am a supporter of validating tokens both in the API gateway and in the service.
We have several examples of services (Grafana, for example) that require external authentication. Similarly, we have other services that need finer-grained authentication than API Gateway policy can handle. Specifically, a given user may have permissions to view/modify some delivery services but not others. The API gateway presumably would not understand the semantics of the payload, so this decision would need to be made by auth within the service.

- As brought up earlier, auth in the gateway is both a strength and a risk. An additional layer of security is positive, but for my Grafana case above it can present an opportunity to bypass authentication. This is a risk, but it can be mitigated by adding auth to the service where needed.

- Verifying tokens on every access may be a little more expensive than discussed. Often, every auth action must be accompanied by DB writes for audit logs or callback functions. Not the straw to break the camel's back, but something to keep in mind.

- There is also the problem of what to do if the underlying auth service is temporarily unavailable. Do we block all API calls if the auth service is temporarily down (being upgraded, container restarting, etc.)?

- I'd like to see what we can do to use a pre-existing package as an API Gateway. As we decompose TO into microservices, something like nginx can provide additional benefits like TLS termination and load balancing between service endpoints. I'd hate to see us have to reimplement these functions later.

- I'd also like to see us give some consideration to how an API gateway is deployed. We raised the bar for new users by unbundling Traffic Ops from the database, and it could further complicate the installation if we don't provide enough guidance on how to deploy the API gateway in a lab trial, if not best practices for production deployment. Should we recommend deploying it as a new RPM/systemd service, an immutable container, or as part of the existing TO RPM?
—Eric

On May 9, 2017, at 5:05 PM, Chris Lemmons <[email protected]> wrote:

Blacklisting requires proactive communication between the authentication system and the gateway. Furthermore, the client can't be sure that something hasn't been blacklisted recently (and the message lost or perhaps maliciously delayed) unless it checks whatever system it is that does the blacklisting. And if you're checking a datastore of some sort for the validity of the token every time, you might as well just check each time and skip the blacklisting step.

On Tue, May 9, 2017 at 1:27 PM Shmulik Asafi <[email protected]> wrote:

Hi,

Maybe a missing link here is another component in a JWT stateless architecture, which is *blacklisting* malign tokens when necessary. This is obviously a sort of state which needs to be handled in a datastore; but it's quite different, easy to scale, and has less performance impact (I guess especially under DDOS) than doing full auth queries. I believe this should be the approach on the API Gateway roadmap.

Thanks

On 9 May 2017 21:14, "Chris Lemmons" <[email protected]> wrote:

I'll second the principle behind "start with security, optimize when there's a problem". It seems to me that in order to maintain security, basically everyone would need to dial the revalidate time so close to zero that it does very little good as a cache on the credentials. Otherwise, as Rob has pointed out, the TTL on your credential cache is effectively "how long am I OK with hackers in control after I find them". Practically, it also means that much lag on adding or removing permissions. That effectively means a database hit for every query, or near enough to every query as not to matter.

That said, you can get the best of multiple worlds, I think. The only DB query that really has to be done is "give me the last update time for this user".
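Shmulik's "hashmap in the API gateway process memory" variant of blacklisting could look something like the sketch below. This is purely illustrative (the `jti` token-ID claim and TTL bookkeeping are assumptions, not anything in the current code); one useful property is that a blacklist entry only needs to live as long as the token itself, since expiry handles the rest.

```python
import time

class Blacklist:
    """In-process token blacklist keyed by token ID (e.g. the JWT 'jti' claim)."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self._revoked = {}  # jti -> time after which the entry can be dropped

    def revoke(self, jti, token_ttl):
        # No need to remember a revocation longer than the token could live.
        self._revoked[jti] = self.clock() + token_ttl

    def is_revoked(self, jti):
        expiry = self._revoked.get(jti)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            del self._revoked[jti]  # token has expired on its own; prune
            return False
        return True
```

The same interface maps directly onto Redis (`SET jti 1 EX token_ttl` / `EXISTS jti`) if the gateway is replicated, which is the scaling argument Shmulik makes above.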
Compare that to the generation time in the token, and 99% of the time, it's the only query you need. With that check, you can even use fairly long-lived tokens. If anything about the user has changed, reject the token, generate a new one, send that to the user, and use it. The regenerate step is somewhat expensive, but still well inside reasonable, I think.

On Tue, May 9, 2017 at 11:31 AM Robert Butts <[email protected]> wrote:

> The TO service (and any other service that requires auth) MUST hit the database (or the auth service, which itself hits the database) to verify valid tokens' users still have the permissions they did when the token was created. Otherwise, it's impossible to revoke tokens, e.g. if an employee quits, or an attacker gains a token, or a user changes their password.

I'm elaborating on this, and moving a discussion from a PR review here.

From the code submissions to the repo, it appears the current plan is for the API Gateway to create a JWT, and then for that JWT to be accepted by all Traffic Ops microservices, with no database authentication.

It's a common misconception that JWT allows you to authenticate without hitting the database. This is an exceedingly dangerous misconception. If you don't check the database when every authenticated route is requested, it's impossible to revoke access. In practice, this means the JWT TTL becomes the length of time _after you discover an attacker is manipulating your production system_ before it's _possible_ to evict them.

How long do you feel is acceptable to have a hacker in and manipulating your system after you discover them? A day? An hour? Five minutes? Whatever your TTL, that's the length of time you're willing to allow a hacker to steal and destroy your data and your customers'. Worse, because this is a CDN, it's the length of time you're willing to allow your CDN to be used to DDOS a target.
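Chris's "last update time" check reduces to a single comparison per request. A minimal sketch, assuming the token carries the standard `iat` (issued-at) claim and that `get_last_update` is a hypothetical stand-in for the one cheap DB query:

```python
def validate(claims, get_last_update):
    """Accept a token only if the user's record hasn't changed since issuance.

    claims: decoded JWT claims, expected to carry 'sub' (user) and 'iat'
            (issued-at, seconds since epoch).
    get_last_update: callable mapping a username to the timestamp of the
            user's last permission/password/etc. change.
    """
    user = claims.get("sub")
    iat = claims.get("iat")
    if user is None or iat is None:
        return False
    # Any change to the user after issuance invalidates the token, which
    # is when you'd regenerate and re-send a fresh one.
    return get_last_update(user) <= iat
```

This is why long-lived tokens become tolerable under this scheme: revocation latency is bounded by this per-request query, not by the token TTL.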
Are you going to explain in court that the DDOS your system executed lasted 24 hours, or 1 hour, or 10 minutes after you discovered it, because that's the TTL you hard-coded? Are you going to explain to a judge and prosecuting attorney exactly which sensitive data was stolen in the ten minutes after you discovered the attacker in your system, before their JWT expired? If you're willing to accept the legal consequences, that's your business. Apache Traffic Control should not require users to accept those consequences, and ideally shouldn't make it possible, as many users won't understand the security risks.

The argument has been made that "authorization does not check the database to avoid congestion". Has anyone tested this in practice? The database query itself is 50ms. Assuming your database and service are 2500km apart, that's another 50ms of network latency. Traffic Ops has endpoints that take 10s to generate. Worst-case scenario, this will double the time of tiny endpoints to 200ms, and increase large endpoints inconsequentially. It's highly unlikely performance is an issue in practice.

As Jan said, we can still have the services check the auth as well after the proxy auth. Moreover, the services don't even have to know about the auth service; they can hit a mapped route on the API Gateway, which gives us better modularisation and separation of concerns. It's not difficult: it can be a trivial endpoint on the auth service, remapped in the API Gateway, which takes the JWT and returns true if it's still authorized in the database.

To be clear, this is not a problem today. Traffic Ops still uses the Mojolicious cookie today, so this would only need to be done if and when we remove that, or if we move authorized endpoints out of Traffic Ops into their own microservices.
Considering the significant security and legal risks, we should always hit the database to validate requests to authorized endpoints, and reconsider if and when someone observes performance issues in practice.

On Tue, May 9, 2017 at 6:56 AM, Dewayne Richardson <[email protected]> wrote:

If only the API GW authenticates/authorizes, we also have a single point of entry to test for security, instead of having it sprinkled across services in different ways. It also simplifies the code on the service side and makes the services easier to test with automation.

-Dew

On Mon, May 8, 2017 at 8:42 AM, Robert Butts <[email protected]> wrote:

> couldn't make nginx or http do what we need.

I was suggesting a different architecture. Not making the proxy do auth, only standard proxying.

> We can still have the services check the auth as well after the proxy auth

+1

On Mon, May 8, 2017 at 3:36 AM, Amir Yeshurun <[email protected]> wrote:

Hi,

Let me elaborate some more on the purpose of the API GW. I will put up a wiki page following our discussions here.

The main purpose is to allow innovation by creating new services that handle TO functionality, not as part of the monolithic Mojo app. The long-term vision is to decompose TO into multiple microservices, allowing new functionality to be added easily. Indeed, the goal is to eventually deprecate the current AAA model and replace it with the new AAA model currently under work (user-roles, role-capabilities).

I think that handling authorization in the API layer is a valid approach. Security-wise, I don't see much difference between that and having each module access the auth service, as long as the auth service is deployed in the backend. Having another proxy (nginx?) fronting the world and forwarding all requests to the backend GW mitigates the risk of compromising the authorization service.
However, as mentioned above, we can still have the services check the auth as well after the proxy auth.

It is a standalone process, completely optional at this point. One can choose to deploy it in order to allow integration with additional services. Deployment and management are still TBD, and feedback on this is most welcome.

Regarding token validation and revocation: tokens have an expiration time, and expired tokens do not pass token validation. In production, expiration should be set to a relatively short time, say 5 minutes. This way revocation is automatic. Re-authentication is handled via refresh tokens (not implemented yet). Hitting the DB upon every API call causes congestion on the users DB. To avoid that, we chose to have all user information self-contained inside the JWT.

Thanks
/amiry

On Mon, May 8, 2017 at 5:42 AM Jan van Doorn <[email protected]> wrote:

It's the reverse proxy we've discussed for the "microservices" version for a while now (as in https://cwiki.apache.org/confluence/display/TC/Design+Overview+v3.0).

On Sun, May 7, 2017 at 7:22 PM Eric Friedrich (efriedri) <[email protected]> wrote:

From a higher level: what is the purpose of the API Gateway? It seems like there may have been some previous discussions about the API Gateway. Are there any notes or a description that I can catch up on?

How will it be deployed? (Is it a standalone service or something that runs inside the experimental Traffic Ops?) Is this new component required or optional?

—Eric

On May 7, 2017, at 8:28 PM, Jan van Doorn <[email protected]> wrote:

I looked into this a year or so ago, and I couldn't make nginx or http do what we need. We can still have the services check the auth as well after the proxy auth, and make things better than today, where we have the same problem that if the TO mojo app is compromised, everything is compromised.
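Amir's short-TTL scheme rests on the `exp` (expiry) claim: validation rejects anything past its expiry, so revocation happens automatically within minutes. A minimal sketch, assuming a 5-minute TTL as suggested above (the helper names are illustrative, not the API GW's actual code):

```python
import time

TTL_SECONDS = 5 * 60  # the ~5-minute production expiry suggested in the thread

def issue(user, now=None):
    """Mint the claims for a short-lived token."""
    now = time.time() if now is None else now
    return {"sub": user, "iat": now, "exp": now + TTL_SECONDS}

def is_expired(claims, now=None):
    """Expired (or exp-less) tokens fail validation; no DB hit required."""
    now = time.time() if now is None else now
    return claims.get("exp", 0) <= now
```

The trade-off debated throughout this thread is visible here: until `exp` passes, nothing in this check can revoke a compromised token, which is what the refresh-token and database-check proposals are trying to address.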
If we always route to TO, we don't untangle the mess of being dependent on the monolithic TO for everything. Many services today, and more in the future, really just need a check to see if the user is authorized, and nothing more.

On Sun, May 7, 2017 at 11:55 AM Robert Butts <[email protected]> wrote:

What are the advantages of these config files over an existing reverse proxy, like Nginx or httpd? It's just as much work as configuring and deploying an existing product, but more code we have to write and maintain. I'm having trouble seeing the advantage.

-1 on auth rules as a part of the proxy. Making a proxy care about auth violates the Single Responsibility Principle, and further, is a security risk. It creates unnecessary attack surface. If your proxy app or server is compromised, the entire framework is now compromised. An attacker could simply rewrite the proxy config to make all routes no-auth.

The simple alternative is for the proxy to always route to TO, and TO checks the token against the auth service (which may also be proxied), and redirects unauthorized requests to a login endpoint (which may also be proxied).

The TO service (and any other service that requires auth) MUST hit the database (or the auth service, which itself hits the database) to verify valid tokens' users still have the permissions they did when the token was created. Otherwise, it's impossible to revoke tokens, e.g. if an employee quits, or an attacker gains a token, or a user changes their password.

On Sun, May 7, 2017 at 4:35 AM, Amir Yeshurun <[email protected]> wrote:

Seems that attachments are stripped on this list.
Examples pasted below.

*rules.json*

[
  { "host": "localhost", "path": "/login", "forward": "localhost:9004", "scheme": "https", "auth": false },
  { "host": "localhost", "path": "/api/1.2/innovation/", "forward": "localhost:8004", "scheme": "http", "auth": true, "routes-file": "innovation.json" },
  { "host": "localhost", "path": "/api/1.2/", "forward": "localhost:3000", "scheme": "http", "auth": true, "routes-file": "traffic-ops-routes.json" },
  { "host": "localhost", "path": "/internal/api/1.2/", "forward": "localhost:3000", "scheme": "http", "auth": true, "routes-file": "internal-routes.json" }
]

*traffic-ops-routes.json (partial)*

. . .
{ "match": "/cdns/health", "auth": { "GET": ["cdn-health-read"] }},
{ "match": "/cdns/capacity", "auth": { "GET": ["cdn-health-read"] }},
{ "match": "/cdns/usage/overview", "auth": { "GET": ["cdn-stats-read"] }},
{ "match": "/cdns/name/dnsseckeys/generate", "auth": { "GET": ["cdn-security-keys-read"] }},
{ "match": "/cdns/name/[^\/]+/?", "auth": { "GET": ["cdn-read"] }},
{ "match": "/cdns/name/[^\/]+/sslkeys", "auth": { "GET": ["cdn-security-keys-read"] }},
{ "match": "/cdns/name/[^\/]+/dnsseckeys", "auth": { "GET": ["cdn-security-keys-read"] }},
{ "match": "/cdns/name/[^\/]+/dnsseckeys/delete", "auth": { "GET": ["cdn-security-keys-write"] }},
{ "match": "/cdns/[^\/]+/queue_update", "auth": { "POST": ["queue-updates-write"] }},
{ "match": "/cdns/[^\/]+/snapshot", "auth": { "PUT": ["cdn-config-snapshot-write"] }},
{ "match": "/cdns/[^\/]+/health", "auth": { "GET": ["cdn-health-read"] }},
{ "match": "/cdns/[^\/]+/?", "auth": { "GET": ["cdn-read"], "PUT": ["cdn-write"], "PATCH": ["cdn-write"], "DELETE": ["cdn-write"] }},
{ "match": "/cdns", "auth": { "GET": ["cdn-read"], "POST": ["cdn-write"] }},
. . .
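To make the two rule files concrete, here is a rough sketch of how a gateway might evaluate them: prefix-match the path against the forwarding rules, then regex-match against that rule's authorization routes and compare required capabilities with the user's JWT capabilities. This is an illustration of the described design, not the actual API GW implementation; the rule subset and longest-prefix tie-break are assumptions.

```python
import re

# A trimmed-down, hypothetical version of rules.json with its
# routes-file contents inlined for the example.
FORWARDING_RULES = [
    {"path": "/login", "forward": "localhost:9004", "auth": False},
    {"path": "/api/1.2/", "forward": "localhost:3000", "auth": True,
     "routes": [
         {"match": r"/cdns/health", "auth": {"GET": ["cdn-health-read"]}},
         {"match": r"/cdns/[^/]+/snapshot", "auth": {"PUT": ["cdn-config-snapshot-write"]}},
     ]},
]

def route(method, path, user_capabilities):
    """Return the forward target, or None if unmatched or unauthorized."""
    # Step 1: prefix match against forwarding rules; longest prefix wins.
    candidates = [r for r in FORWARDING_RULES if path.startswith(r["path"])]
    if not candidates:
        return None
    rule = max(candidates, key=lambda r: len(r["path"]))
    if not rule["auth"]:
        return rule["forward"]
    # Step 2: regex match the remainder against authorization routes and
    # check the HTTP method's required capabilities against the JWT's.
    subpath = path[len(rule["path"]) - 1:]  # keep the leading "/"
    for r in rule["routes"]:
        if re.fullmatch(r["match"], subpath) and method in r["auth"]:
            if all(c in user_capabilities for c in r["auth"][method]):
                return rule["forward"]
    return None
```

For example, a GET of /api/1.2/cdns/health forwards to localhost:3000 only if the caller's JWT carries the cdn-health-read capability, while /login forwards with no auth at all, matching the `"auth": false` rule.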
On Sun, May 7, 2017 at 12:39 PM Amir Yeshurun <[email protected]> wrote:

Attached please find examples of the forwarding rules file (rules.json) and the authorization rules file (traffic-ops-routes.json).

On Sun, May 7, 2017 at 10:39 AM Amir Yeshurun <[email protected]> wrote:

Hi all,

I am about to submit a PR with a first operational version of the API GW, to the "experimental" code base. The API GW forwarding logic is as follows:

1. Find the host to forward the request to: prefix match on the request path against a list of forwarding rules. The matched forwarding rule defines the target host and the target's *authorization rules*.
2. Authorization: regex match on the request path against a list of *authorization rules*. The matched rule defines the capabilities required to perform the HTTP method on the route. These capabilities are compared against the user's capabilities in the user's JWT.

At this moment, the two sets of rules are hard-coded in JSON files. The files are provided with the API GW distribution and contain definitions for the TC 2.0 API routes. I have tested parts of the API; however, there might be mistakes in some of the routes. Please be warned.

Considering manageability and high availability, I am aware that using local files for storing the set of authorization rules is inferior to centralized configuration. We are considering different approaches for centralized configuration, with the following points in mind:

- Microservice world: the API GW will front multiple services, not only Mojo. It can also front other TC components like Traffic Stats and Traffic Monitor. Each service defines its own routes and capabilities. Here comes the question of what the "source of truth" is for the route definitions.
- Handling private routes. The API GW may front non-TC services.
- User changes to the AAA scheme.
The ability for an admin user to make changes to the required capabilities of a route, maybe even to define new capability names, was raised in the past as a use case that should be supported.
- Easy development and deployment of new services.
- Using the TO DB for expediency.

I would appreciate any feedback and views on your approach to managing route definitions.

Thanks
/amiry

--
*Shmulik Asafi*
Qwilt | Work: +972-72-2221692 | Mobile: +972-54-6581595 | [email protected]
