Re: Backup Cache Group Selection
backupList is not planned because the coordinates approach was sufficient. —Eric

On May 9, 2017, at 6:57 AM, Ori Finkelman wrote: Hi again, I understand now that the "backupList" feature does not exist yet. What is the status of this feature? Is it planned? Thanks, Ori

On Mon, May 8, 2017 at 4:18 PM, Ori Finkelman wrote: Hi, Following up on this one, it seems that both czf attributes described in this thread, the "coordinates" and the "backupList", are not documented in the official docs at http://trafficcontrol.incubator.apache.org/docs/latest/admin/traffic_ops_using.html#the-coverage-zone-file-and-asn-table Is there a plan to update the documentation? Should I open a JIRA for it? Thanks, Ori

On Thu, Mar 30, 2017 at 8:45 PM, Jeff Elsloo wrote: Yes, that's correct. -- Thanks, Jeff

On Thu, Mar 30, 2017 at 11:20 AM, Eric Friedrich (efriedri) wrote: Thanks Jeff- Could I think of it as the following? Echoing back to be sure I understand... If there is a lat/long for a cache group in the CZF file, any client hit to that CG should use the CZF lat/long as the client’s lat/long instead of using geolocation. For the purposes of finding the closest cache group, the client’s location (from the CZF as above, or from the Geolocation provider) will be compared against the location of the caches as configured in Traffic Ops’ CG record? —Eric

On Mar 30, 2017, at 1:07 PM, Jeff Elsloo wrote: It could now be considered the "average" of the location of the clients within that section of the CZF; however, it should be noted that the addition of the geo coordinates to the CZF is relatively new. Previously we never had the ability to specify lat/long on those cachegroups, and we relied solely on those specified in edgeLocations, meaning that the matches had to be 1:1. Adding the coordinates allowed us to cover edge cases and miss scenarios and stick to the CZF whenever possible.
Previously, when we had no coordinates and we had a hit in the CZF but no corresponding hit within the edgeLocations (health, assignments, etc.), we would fall back to the Geolocation provider. -- Thanks, Jeff

On Thu, Mar 30, 2017 at 5:29 AM, John Shen (weifensh) wrote: Thanks Jeff and Oren for the discussion. I agree now that the lat/long from the CZF is the “average” location of clients, and the lat/long from Ops is the location of a certain Cache Group. So it appears to be reasonable to use them as source and dest to calculate the distance. Thanks, John

On 30/03/2017, 6:55 PM, "Oren Shemesh" wrote: Jeff, having read this conversation more than once, I believe there is a misunderstanding regarding the ability to provide coordinates for cache groups both in the CZF and in the TO DB. Here is a description which may help in understanding the current behaviour: The coordinates specified in the CZF for a cache group are not supposed to be exactly the same as the coordinates in the TO DB for the same cache group. This is because they do not represent the location of the caches of the group. They represent the (average) location of clients found in the subnets specified for this cache group. This, I believe, explains both the behaviour of the code (why the coordinates from the CZF are used for the source, but the coordinates from the TO DB are used for the various candidate cache groups) and the fact that there is a 'duplication'. Is this description true?

On Wed, Mar 29, 2017 at 7:02 PM, Jeff Elsloo wrote: The cachegroup settings in the Traffic Ops GUI end up in the `edgeLocations` section of the CRConfig. This is the source of truth for where caches are deployed, logically or physically. We do not provide a means to generate a CZF in Traffic Ops, so it's up to the end user to craft one to match what is in Traffic Ops.
There are several cases that need to be accounted for where a hit in the CZF does match what's in `edgeLocations` but cannot be served there due to cache health, delivery service health, or delivery service assignments. The other edge case is a hit where no `edgeLocation` exists, which, again, must be accounted for. Presumably we have higher fidelity data in our CZF than we would in our Geolocation provider, and we should use it whenever possible.

Think about this: what if you use the same CZF for two configured CDNs, but one of the two CDNs only has caches deployed to 50% of the cache groups defined in the CZF? Would we want to use the Geolocation provider in the event that our source address matches a cachegroup that does not have any assigned caches? We would ideally have as much granularity as possible in the CZF, then use that to inform the decision about which cachegroup should service the request instead of falling back to a lower fidelity datasource. This is
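The selection logic discussed in this thread amounts to a nearest-neighbor search: the source coordinates come from the CZF match (the "average" client location), with the Geolocation provider as fallback, and the candidates are the cache group coordinates from `edgeLocations`. A minimal sketch of that distance comparison, with hypothetical names — the real Traffic Router logic additionally filters candidates on cache health and delivery service assignment:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/long points, in kilometers."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def closest_cachegroup(client_loc, edge_locations):
    """client_loc: (lat, lon) from the CZF match, or from the Geolocation
    provider on a CZF miss; edge_locations: {cachegroup: (lat, lon)} as
    configured in Traffic Ops and published in the CRConfig."""
    return min(edge_locations,
               key=lambda cg: haversine_km(*client_loc, *edge_locations[cg]))

# Client subnet matched a CZF entry whose (averaged) coordinates are Denver-ish:
client = (39.7, -105.0)
edges = {"cg-east": (40.7, -74.0), "cg-west": (37.8, -122.4), "cg-central": (39.9, -105.1)}
print(closest_cachegroup(client, edges))  # cg-central
```

Note that the client side of the comparison deliberately uses the CZF coordinates, not the cache coordinates, which is exactly the asymmetry Oren points out above.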
Re: API GW route configuration
Blacklisting requires proactive communication between the authentication system and the gateway. Furthermore, the client can't be sure that something hasn't been blacklisted recently (and the message lost or perhaps maliciously delayed) unless it checks whatever system it is that does the blacklisting. And if you're checking a datastore of some sort for the validity of the token every time, you might as well just check each time and skip the blacklisting step.

On Tue, May 9, 2017 at 1:27 PM Shmulik Asafi wrote:
> Hi, Maybe a missing link here is another component in a JWT stateless architecture, which is *blacklisting* malign tokens when necessary. This is obviously a sort of state which needs to be handled in a datastore; but it's quite different, easy to scale, and has less performance impact (I guess especially under DDoS) than doing full auth queries. I believe this should be the approach on the API Gateway roadmap. Thanks
>
> On 9 May 2017 21:14, "Chris Lemmons" wrote:
> > I'll second the principle behind "start with security, optimize when there's a problem".
> >
> > It seems to me that in order to maintain security, basically everyone would need to dial the revalidate time so close to zero that it does very little good as a cache on the credentials. Otherwise, as Rob has pointed out, the TTL on your credential cache is effectively "how long am I ok with hackers in control after I find them". Practically, it also means that much lag on adding or removing permissions. That effectively means a database hit for every query, or near enough to every query as not to matter.
> >
> > That said, you can get the best of multiple worlds, I think. The only DB query that really has to be done is "give me the last update time for this user". Compare that to the generation time in the token and 99% of the time, it's the only query you need. With that check, you can even use fairly long-lived tokens.
> > If anything about the user has changed, reject the token, generate a new one, send that to the user and use it. The regenerate step is somewhat expensive, but still well inside reasonable, I think.
> >
> > On Tue, May 9, 2017 at 11:31 AM Robert Butts wrote:
> > > The TO service (and any other service that requires auth) MUST hit the database (or the auth service, which itself hits the database) to verify that valid tokens' users still have the permissions they did when the token was created. Otherwise, it's impossible to revoke tokens, e.g. if an employee quits, or an attacker gains a token, or a user changes their password.
> > >
> > > I'm elaborating on this, and moving a discussion from a PR review here.
> > >
> > > From the code submissions to the repo, it appears the current plan is for the API Gateway to create a JWT, and then for that JWT to be accepted by all Traffic Ops microservices, with no database authentication.
> > >
> > > It's a common misconception that JWT allows you to authenticate without hitting the database. This is an exceedingly dangerous misconception. If you don't check the database when every authenticated route is requested, it's impossible to revoke access. In practice, this means the JWT TTL becomes the length of time _after you discover an attacker is manipulating your production system_ before it's _possible_ to evict them.
> > >
> > > How long do you feel it is acceptable to have a hacker in and manipulating your system after you discover them? A day? An hour? Five minutes? Whatever your TTL, that's the length of time you're willing to allow a hacker to steal and destroy your and your customers' data. Worse, because this is a CDN, it's the length of time you're willing to allow your CDN to be used to DDoS a target.
> > > Are you going to explain in court that the DDoS your system executed lasted 24 hours, or 1 hour, or 10 minutes after you discovered it, because that's the TTL you hard-coded? Are you going to explain to a judge and prosecuting attorney exactly which sensitive data was stolen in the ten minutes after you discovered the attacker in your system, before their JWT expired?
> > >
> > > If you're willing to accept the legal consequences, that's your business. Apache Traffic Control should not require users to accept those consequences, and ideally shouldn't make it possible, as many users won't understand the security risks.
> > >
> > > The argument has been made that "authorization does not check the database to avoid congestion" -- has anyone tested this in practice? The database query itself is 50ms. Assuming your database and service are 2500km apart, that's another 50ms network latency. Traffic Ops has endpoints
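Chris's "last update time" check quoted above can be sketched roughly as follows (a hypothetical illustration, not ATC code: the user store, field names, and token layout are all assumed, and a real JWT would be signed and encoded rather than a plain dict). The token carries its generation time; one cheap query fetches the user's last permissions change, and the token is rejected if it predates that change:

```python
import time

# Hypothetical user store standing in for the real users database:
# user -> epoch seconds of the last permissions/password change.
USERS_LAST_UPDATE = {"alice": 1_494_300_000}

def is_token_valid(token, now=None):
    """Reject a token if it has expired, or if anything about the user
    changed after the token was generated (the single DB query Chris
    describes: "give me the last update time for this user")."""
    now = time.time() if now is None else now
    if now >= token["exp"]:
        return False
    last_update = USERS_LAST_UPDATE.get(token["sub"])
    if last_update is None:
        return False  # unknown (e.g. deleted) user
    # Token older than the user's last change? Reject, then reissue.
    return token["iat"] >= last_update

fresh = {"sub": "alice", "iat": 1_494_300_100, "exp": 1_494_303_700}
stale = {"sub": "alice", "iat": 1_494_299_000, "exp": 1_494_303_700}
print(is_token_valid(fresh, now=1_494_300_200))  # True: issued after last change
print(is_token_valid(stale, now=1_494_300_200))  # False: reject and regenerate
```

With this check, revoking a user is just bumping their last-update timestamp: every outstanding token issued before that moment fails on its next use, regardless of its TTL.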
Re: API GW route configuration
I'll second the principle behind "start with security, optimize when there's a problem".

It seems to me that in order to maintain security, basically everyone would need to dial the revalidate time so close to zero that it does very little good as a cache on the credentials. Otherwise, as Rob has pointed out, the TTL on your credential cache is effectively "how long am I ok with hackers in control after I find them". Practically, it also means that much lag on adding or removing permissions. That effectively means a database hit for every query, or near enough to every query as not to matter.

That said, you can get the best of multiple worlds, I think. The only DB query that really has to be done is "give me the last update time for this user". Compare that to the generation time in the token and 99% of the time, it's the only query you need. With that check, you can even use fairly long-lived tokens. If anything about the user has changed, reject the token, generate a new one, send that to the user and use it. The regenerate step is somewhat expensive, but still well inside reasonable, I think.

On Tue, May 9, 2017 at 11:31 AM Robert Butts wrote:
> The TO service (and any other service that requires auth) MUST hit the database (or the auth service, which itself hits the database) to verify that valid tokens' users still have the permissions they did when the token was created. Otherwise, it's impossible to revoke tokens, e.g. if an employee quits, or an attacker gains a token, or a user changes their password.
>
> I'm elaborating on this, and moving a discussion from a PR review here.
>
> From the code submissions to the repo, it appears the current plan is for the API Gateway to create a JWT, and then for that JWT to be accepted by all Traffic Ops microservices, with no database authentication.
>
> It's a common misconception that JWT allows you to authenticate without hitting the database. This is an exceedingly dangerous misconception.
> If you don't check the database when every authenticated route is requested, it's impossible to revoke access. In practice, this means the JWT TTL becomes the length of time _after you discover an attacker is manipulating your production system_ before it's _possible_ to evict them.
>
> How long do you feel it is acceptable to have a hacker in and manipulating your system after you discover them? A day? An hour? Five minutes? Whatever your TTL, that's the length of time you're willing to allow a hacker to steal and destroy your and your customers' data. Worse, because this is a CDN, it's the length of time you're willing to allow your CDN to be used to DDoS a target.
>
> Are you going to explain in court that the DDoS your system executed lasted 24 hours, or 1 hour, or 10 minutes after you discovered it, because that's the TTL you hard-coded? Are you going to explain to a judge and prosecuting attorney exactly which sensitive data was stolen in the ten minutes after you discovered the attacker in your system, before their JWT expired?
>
> If you're willing to accept the legal consequences, that's your business. Apache Traffic Control should not require users to accept those consequences, and ideally shouldn't make it possible, as many users won't understand the security risks.
>
> The argument has been made that "authorization does not check the database to avoid congestion" -- has anyone tested this in practice? The database query itself is 50ms. Assuming your database and service are 2500km apart, that's another 50ms network latency. Traffic Ops has endpoints that take 10s to generate. Worst-case scenario, this will double the time of tiny endpoints to 200ms, and increase large endpoints inconsequentially. It's highly unlikely performance is an issue in practice.
>
> As Jan said, we can still have the services check the auth as well after the proxy auth.
> Moreover, the services don't even have to know about the auth service; they can hit a mapped route on the API Gateway, which gives us better modularisation and separation of concerns.
>
> It's not difficult; it can be a trivial endpoint on the auth service, remapped in the API Gateway, which takes the JWT token and returns true if it's still authorized in the database. To be clear, this is not a problem today. Traffic Ops still uses the Mojolicious cookie today, so this would only need to be done if and when we remove that, or if we move authorized endpoints out of Traffic Ops into their own microservices.
>
> Considering the significant security and legal risks, we should always hit the database to validate requests of authorized endpoints, and reconsider if and when someone observes performance issues in practice.
>
> On Tue, May 9, 2017 at 6:56 AM, Dewayne Richardson wrote:
> > If only the API GW authenticates/authorizes, we also have a single point of entry to test for security instead of having
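For comparison, the blacklisting approach Shmulik proposes earlier in the thread could look roughly like this (a hypothetical sketch, not ATC code: an in-memory dict stands in for the shared datastore, and `jti` is the standard JWT token-ID claim). A revocation entry only needs to live until the token it blocks would have expired anyway:

```python
import time

class TokenBlacklist:
    """Datastore of revoked token IDs. A real deployment would use a
    shared store (e.g. a key-value store with per-key expiry) so the
    auth system and the gateway see the same state."""

    def __init__(self):
        self._revoked = {}  # jti -> token exp (epoch seconds)

    def revoke(self, jti, exp):
        self._revoked[jti] = exp

    def is_revoked(self, jti, now=None):
        now = time.time() if now is None else now
        self.prune(now)
        return jti in self._revoked

    def prune(self, now):
        # A token past its exp fails validation on its own; forget it.
        self._revoked = {j: e for j, e in self._revoked.items() if e > now}

bl = TokenBlacklist()
bl.revoke("token-123", exp=1_494_303_700)
print(bl.is_revoked("token-123", now=1_494_300_000))  # True: still blocked
print(bl.is_revoked("token-123", now=1_494_303_701))  # False: expired anyway
```

Note this illustrates exactly the trade-off argued over above: every request still needs a datastore lookup, so the saving relative to a full auth query is the size of that lookup, not its existence.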
Re: API GW route configuration
If only the API GW authenticates/authorizes, we also have a single point of entry to test for security instead of having it sprinkled across services in different ways. It also simplifies the code on the service side and makes them easier to test with automation. -Dew

On Mon, May 8, 2017 at 8:42 AM, Robert Butts wrote:
> > couldn't make nginx or http do what we need.
>
> I was suggesting a different architecture. Not making the proxy do auth, only standard proxying.
>
> > We can still have the services check the auth as well after the proxy auth
>
> +1
>
> On Mon, May 8, 2017 at 3:36 AM, Amir Yeshurun wrote:
> > Hi,
> >
> > Let me elaborate some more on the purpose of the API GW. I will put up a wiki page following our discussions here.
> >
> > The main purpose is to allow innovation by creating new services that handle TO functionality, not as part of the monolithic Mojo app. The long term vision is to de-compose TO into multiple microservices, allowing new functionality to be easily added. Indeed, the goal is to eventually deprecate the current AAA model and replace it with the new AAA model currently under work (user-roles, role-capabilities).
> >
> > I think that handling authorization in the API layer is a valid approach. Security wise, I don't see much difference between that and having each module access the auth service, as long as the auth service is deployed in the backend. Having another proxy (nginx?) fronting the world and forwarding all requests to the backend GW mitigates the risk of compromising the authorization service. However, as mentioned above, we can still have the services check the auth as well after the proxy auth.
> >
> > It is a standalone process, completely optional at this point. One can choose to deploy it in order to allow integration with additional services. Deployment and management are still T.B.D, and feedback on this is most welcome.
> > Regarding token validation and revocation: tokens have an expiration time, and expired tokens do not pass token validation. In production, expiration should be set to a relatively short time, say 5 minutes. This way revocation is automatic. Re-authentication is handled via refresh tokens (not implemented yet). Hitting the DB upon every API call causes congestion on the users DB. To avoid that, we chose to have all user information self-contained inside the JWT.
> >
> > Thanks
> > /amiry
> >
> > On Mon, May 8, 2017 at 5:42 AM Jan van Doorn wrote:
> > > It's the reverse proxy we've discussed for the "micro services" version for a while now (as in https://cwiki.apache.org/confluence/display/TC/Design+Overview+v3.0).
> > >
> > > On Sun, May 7, 2017 at 7:22 PM Eric Friedrich (efriedri) <efrie...@cisco.com> wrote:
> > > > From a higher level- what is the purpose of the API Gateway? It seems like there may have been some previous discussions about the API Gateway. Are there any notes or a description that I can catch up on?
> > > >
> > > > How will it be deployed? (Is it a standalone service or something that runs inside the experimental Traffic Ops?)
> > > >
> > > > Is this new component required or optional?
> > > >
> > > > —Eric
> > > >
> > > > > On May 7, 2017, at 8:28 PM, Jan van Doorn wrote:
> > > > > I looked into this a year or so ago, and I couldn't make nginx or http do what we need.
> > > > >
> > > > > We can still have the services check the auth as well after the proxy auth, and make things better than today, where we have the same problem that if the TO mojo app is compromised, everything is compromised.
> > > > >
> > > > > If we always route to TO, we don't untangle the mess of being dependent on the monolithic TO for everything.
> > > > > Many services today, and more in the future, really just need a check to see if the user is authorized, and nothing more.
> > > > >
> > > > > On Sun, May 7, 2017 at 11:55 AM Robert Butts <robert.o.bu...@gmail.com> wrote:
> > > > > > What are the advantages of these config files over an existing reverse proxy, like Nginx or httpd? It's just as much work as configuring and deploying an existing product, but more code we have to write and maintain. I'm having trouble seeing the advantage.
> > > > > >
> > > > > > -1 on auth rules as a part of the proxy. Making a proxy care about auth violates the Single Responsibility Principle, and further, is a security risk. It creates unnecessary attack surface. If your proxy app or server is compromised, the entire framework is now compromised. An
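Amir's scheme above (all user information self-contained in the token, short expiry, re-authentication via refresh tokens) can be sketched as follows. This is a hypothetical illustration: the field names are assumed, and a real JWT would be signed and base64-encoded rather than a plain dict. It also makes the trade-off concrete: validation is purely local, so the only "revocation" is the token dying at its TTL.

```python
import time

TOKEN_TTL = 5 * 60  # seconds; Amir suggests ~5 minutes in production

def issue_token(user, roles, now=None):
    """All user information is self-contained in the token: no DB hit on use."""
    now = time.time() if now is None else now
    return {"sub": user, "roles": roles, "iat": now, "exp": now + TOKEN_TTL}

def validate(token, now=None):
    """Purely local check, no datastore. Revocation is 'automatic' only in
    the sense that the token dies within TOKEN_TTL; nothing can be revoked
    sooner, which is the window Rob's message above objects to."""
    now = time.time() if now is None else now
    return now < token["exp"]

t = issue_token("alice", ["read-ds"], now=1_494_300_000)
print(validate(t, now=1_494_300_200))  # True: within the 5-minute window
print(validate(t, now=1_494_300_400))  # False: expired; refresh flow takes over
```

Shrinking TOKEN_TTL shrinks the revocation window but pushes more load onto the refresh-token path, which is the tension the whole thread turns on.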