If we are going to have a dynamic signing key, how is that shared between all of the services?
Whatever mechanism that is must also be a secured API, which introduces another point of attack for a potential hacker. I have a very strong preference for something like TLS client auth over hardcoding usernames/passwords into config files.

—Eric

On May 11, 2017, at 12:23 PM, Chris Lemmons <[email protected]> wrote:

> invalidate ALL tokens by changing the token signing key

Interesting idea. That does mean that the signing key has to be retrieved every time from the authentication authority, or it'd be subject to the exact same set of attacks. But a nearly-constant, rarely changing key could be communicated very efficiently, I suspect. And if the authentication system is a web API, it can even use If-Modified-Since to return 304 99% of the time for maximum efficiency.

It does have the downside that key-invalidation events are fairly significant. You'd need to invalidate the keys whenever someone's access was reduced or removed. As the number of accounts in the system increases, that might not wind up being as infrequent as one might hope. It's easy to implement, though.

On Thu, May 11, 2017 at 10:12 AM Jeremy Mitchell <[email protected]> wrote:

Regarding the TTL on the JWT token: a 5-minute TTL seems silly. What's the point? Unless we get into refresh tokens, but that sounds like OAuth... blah.

What about this (and maybe I'm oversimplifying): the TTL on the JWT is 24 hours. If we become aware that a token has been compromised, invalidate ALL tokens by changing the token signing key. Maybe this is a good idea or maybe this is a terrible idea. I have no idea. Just a thought.

jeremy

On Wed, May 10, 2017 at 12:23 PM, Chris Lemmons <[email protected]> wrote:

Responding to a few people:

> Often times every auth action must be accompanied by DB writes for audit logs or callback functions.

True.
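To make the "rotate the signing key to invalidate everything" idea concrete, here is a minimal sketch of why it works, assuming HMAC-signed (HS256-style) tokens. The `sign`/`verify` helpers below are illustrative, not any actual Traffic Control code; a real deployment would use a vetted JWT library.

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> bytes:
    # JWT uses unpadded base64url encoding.
    return base64.urlsafe_b64encode(data).rstrip(b"=")

def sign(claims: dict, key: bytes) -> str:
    """Produce a minimal HS256-style token: header.payload.signature."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    mac = hmac.new(key, header + b"." + payload, hashlib.sha256).digest()
    return (header + b"." + payload + b"." + b64url(mac)).decode()

def verify(token: str, key: bytes) -> bool:
    """A token only verifies under the key that signed it."""
    try:
        header, payload, sig = token.encode().split(b".")
    except ValueError:
        return False
    expected = b64url(hmac.new(key, header + b"." + payload, hashlib.sha256).digest())
    return hmac.compare_digest(expected, sig)
```

Because the signature binds the token to one key, replacing `key-v1` with `key-v2` at the verifier instantly rejects every outstanding token, which is exactly the mass-invalidation event discussed above.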
But a) if logging is too expensive, it should probably be made cheaper, and b) the answer to "audits are too expensive" probably isn't "let's just do less authentication". If the audit log is genuinely the bottleneck, it would still be better to re-auth without the audit log.

> The API gateway can poll for the latest list of tokens at a regular interval

Yeah, datastore replication for local performance is great. Though if you can reasonably query for a list of all valid tokens every second, it's probably cheaper to just query for the token you need every time you need it. If there are massive batches of queries coming through, it's probably not unreasonable to choose not to re-validate a token that's been validated in the last second.

> Regarding maliciously delayed messages - I don't fully understand the point; if an attacker has such capabilities she can simply prevent/delay devops users from updating the auth database itself, thus enabling the attack.

In a typical attack, an attacker might gain control of a box on the local network, but not necessarily the Gateway, Traffic Ops, or Auth Server. Those are probably better hardened. But lots of networks have a squishy test box that everyone forgot was there or something. The bad guy wants to use the CDN to DOS someone, or redirect traffic to somewhere malicious, or just cause mayhem. The longer he can keep control, the better for him.

So this attacker uses the local box to sniff the token off the network. If the communication with the Gateway is encrypted, he might have to do some ARP poisoning or something else to trick a host into talking to the local box instead. (Properly implemented TLS also mitigates this angle.) He knows that as soon as he starts his nefarious deed, alarms are going to go off, so he also uses this local box to DOS the Auth Server. It's a lot easier to take a box down from the outside than to actually gain control.
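The "don't re-validate a token validated in the last second" compromise above could be sketched as a tiny positive-result cache in front of the datastore check. This is an illustrative sketch, not existing code; `check_db` is a hypothetical stand-in for the real "is this token still valid?" query.

```python
import time

class ValidationCache:
    """Skip re-validating a token validated within the last `ttl` seconds."""

    def __init__(self, check_db, ttl=1.0, clock=time.monotonic):
        self.check_db = check_db  # callable: token -> bool (the real datastore hit)
        self.ttl = ttl
        self.clock = clock
        self._seen = {}  # token -> timestamp of last successful validation

    def is_valid(self, token):
        now = self.clock()
        last = self._seen.get(token)
        if last is not None and now - last < self.ttl:
            return True  # validated recently; skip the datastore hit
        if self.check_db(token):
            self._seen[token] = now
            return True
        self._seen.pop(token, None)  # never cache a negative result
        return False
```

Note the trade-off Chris describes: the `ttl` here is exactly the window during which a just-revoked token is still accepted, which is why it has to stay very small.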
If the Gateway "fails open" when it can't contact the Auth server, the attacker remains in control. If it "fails closed", the attacker has to actually compromise the auth server (which is harder) to remain in control.

> Do we block all API calls if the auth service is temporarily down (being upgraded, container restarting, etc…)?

Yes, I think we have to. Authentication is integral to reliable operation. We've been talking in some fairly wild hypotheticals, though. Is there a specific auth service you're envisioning?

On Wed, May 10, 2017 at 12:50 AM Shmulik Asafi <[email protected]> wrote:

Regarding the communication issue Chris raised - there is more than one possible pattern to this, e.g.:

- Blacklisted tokens can be communicated via a pub-sub mechanism
- The API gateway can poll for the latest list of tokens at a regular interval (which can be very short, ~1 sec, much shorter than the time it takes devops to detect and react to malign tokens)

Regarding hitting the blacklist datastore - this only sounds similar to hitting the auth database; but the simplicity of a blacklist function allows you to employ more efficient datastores, e.g. Redis or just a hashmap in the API gateway process memory.

Regarding maliciously delayed messages - I don't fully understand the point; if an attacker has such capabilities, she can simply prevent/delay devops users from updating the auth database itself, thus enabling the attack.

On Wed, May 10, 2017 at 4:25 AM, Eric Friedrich (efriedri) <[email protected]> wrote:

Our current management wrapper around Traffic Control (called OMD Director, demo'd at the last TC summit) uses a very similar approach to authentication. We have an auth service that issues a JWT. The JWT is then provided along with all API calls. A few comments on our practical experience:

- I am a supporter of validating tokens both in the API gateway and in the service.
We have several examples of services (Grafana, for example) that require external authentication. Similarly, we have other services that need finer-grained authentication than API Gateway policy can handle. Specifically, a given user may have permissions to view/modify some delivery services but not others. The API gateway presumably would not understand the semantics of the payload, so this decision would need to be made by auth within the service.

- As brought up earlier, auth in the gateway is both a strength and a risk. An additional layer of security is positive, but for my Grafana case above it can present an opportunity to bypass authentication. This is a risk, but it can be mitigated by adding auth to the service where needed.

- Verifying tokens on every access may be a little more expensive than discussed. Often, every auth action must be accompanied by DB writes for audit logs or callback functions. Not the straw to break the camel's back, but something to keep in mind.

- There is also the problem of what to do if the underlying auth service is temporarily unavailable. Do we block all API calls if the auth service is temporarily down (being upgraded, container restarting, etc.)?

- I'd like to see what we can do to use a pre-existing package as an API Gateway. As we decompose TO into microservices, something like nginx can provide additional benefits like TLS termination and load balancing between service endpoints. I'd hate to see us have to reimplement these functions later.

- I'd also like to see us give some consideration to how an API gateway is deployed. We raised the bar for new users by unbundling Traffic Ops from the database, and it could further complicate the installation if we don't provide enough guidance on how to deploy the API gateway in a lab trial, if not best practices for production deployment. Should we recommend deploying it as a new RPM/systemd service, an immutable container, or as part of the existing TO RPM?
—Eric

On May 9, 2017, at 5:05 PM, Chris Lemmons <[email protected]> wrote:

Blacklisting requires proactive communication between the authentication system and the gateway. Furthermore, the client can't be sure that something hasn't been blacklisted recently (and the message lost or perhaps maliciously delayed) unless it checks whatever system it is that does the blacklisting. And if you're checking a datastore of some sort for the validity of the token every time, you might as well just check each time and skip the blacklisting step.

On Tue, May 9, 2017 at 1:27 PM Shmulik Asafi <[email protected]> wrote:

Hi,

Maybe a missing link here is another component in a JWT stateless architecture, which is *blacklisting* malign tokens when necessary. This is obviously a sort of state which needs to be handled in a datastore; but it's quite different, easy to scale, and has less performance impact (I guess especially under DDOS) than doing full auth queries. I believe this should be the approach on the API Gateway roadmap.

Thanks

On 9 May 2017 21:14, "Chris Lemmons" <[email protected]> wrote:

I'll second the principle behind "start with security, optimize when there's a problem". It seems to me that in order to maintain security, basically everyone would need to dial the revalidate time so close to zero that it does very little good as a cache on the credentials. Otherwise, as Rob has pointed out, the TTL on your credential cache is effectively "how long am I OK with hackers in control after I find them". Practically, it also means that much lag on adding or removing permissions. That effectively means a database hit for every query, or near enough to every query as not to matter.

That said, you can get the best of multiple worlds, I think. The only DB query that really has to be done is "give me the last update time for this user".
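Shmulik's "hashmap in the API gateway process memory" variant of blacklisting could look something like the sketch below. This is purely illustrative (the `jti` token-ID claim and TTL bookkeeping are assumptions, not anything in the current code); one useful property is that a blacklist entry only needs to live as long as the token itself, since expiry handles the rest.

```python
import time

class Blacklist:
    """In-process token blacklist keyed by token ID (e.g. the JWT 'jti' claim)."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self._revoked = {}  # jti -> time after which the entry can be dropped

    def revoke(self, jti, token_ttl):
        # No need to remember a revocation longer than the token could live.
        self._revoked[jti] = self.clock() + token_ttl

    def is_revoked(self, jti):
        expiry = self._revoked.get(jti)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            del self._revoked[jti]  # token has expired on its own; prune
            return False
        return True
```

The same interface maps directly onto Redis (`SET jti 1 EX token_ttl` / `EXISTS jti`) if the gateway is replicated, which is the scaling argument Shmulik makes above.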
Compare that to the generation time in the token, and 99% of the time, it's the only query you need. With that check, you can even use fairly long-lived tokens. If anything about the user has changed, reject the token, generate a new one, send that to the user, and use it. The regenerate step is somewhat expensive, but still well inside reasonable, I think.

On Tue, May 9, 2017 at 11:31 AM Robert Butts <[email protected]> wrote:

> The TO service (and any other service that requires auth) MUST hit the database (or the auth service, which itself hits the database) to verify valid tokens' users still have the permissions they did when the token was created. Otherwise, it's impossible to revoke tokens, e.g. if an employee quits, or an attacker gains a token, or a user changes their password.

I'm elaborating on this, and moving a discussion from a PR review here.

From the code submissions to the repo, it appears the current plan is for the API Gateway to create a JWT, and then for that JWT to be accepted by all Traffic Ops microservices, with no database authentication.

It's a common misconception that JWT allows you to authenticate without hitting the database. This is an exceedingly dangerous misconception. If you don't check the database when every authenticated route is requested, it's impossible to revoke access. In practice, this means the JWT TTL becomes the length of time _after you discover an attacker is manipulating your production system_ before it's _possible_ to evict them.

How long do you feel is acceptable to have a hacker in and manipulating your system after you discover them? A day? An hour? Five minutes? Whatever your TTL, that's the length of time you're willing to allow a hacker to steal and destroy your data and your customers'. Worse, because this is a CDN, it's the length of time you're willing to allow your CDN to be used to DDOS a target.
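Chris's "last update time" check reduces to a single comparison per request. A minimal sketch, assuming the token carries the standard `iat` (issued-at) claim and that `get_last_update` is a hypothetical stand-in for the one cheap DB query:

```python
def validate(claims, get_last_update):
    """Accept a token only if the user's record hasn't changed since issuance.

    claims: decoded JWT claims, expected to carry 'sub' (user) and 'iat'
            (issued-at, seconds since epoch).
    get_last_update: callable mapping a username to the timestamp of the
            user's last permission/password/etc. change.
    """
    user = claims.get("sub")
    iat = claims.get("iat")
    if user is None or iat is None:
        return False
    # Any change to the user after issuance invalidates the token, which
    # is when you'd regenerate and re-send a fresh one.
    return get_last_update(user) <= iat
```

This is why long-lived tokens become tolerable under this scheme: revocation latency is bounded by this per-request query, not by the token TTL.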
Are you going to explain in court that the DDOS your system executed lasted 24 hours, or 1 hour, or 10 minutes after you discovered it, because that's the TTL you hard-coded? Are you going to explain to a judge and prosecuting attorney exactly which sensitive data was stolen in the ten minutes after you discovered the attacker in your system, before their JWT expired? If you're willing to accept the legal consequences, that's your business. Apache Traffic Control should not require users to accept those consequences, and ideally shouldn't make it possible, as many users won't understand the security risks.

The argument has been made that "authorization does not check the database to avoid congestion". Has anyone tested this in practice? The database query itself is 50ms. Assuming your database and service are 2500km apart, that's another 50ms of network latency. Traffic Ops has endpoints that take 10s to generate. Worst-case scenario, this will double the time of tiny endpoints to 200ms, and increase large endpoints inconsequentially. It's highly unlikely performance is an issue in practice.

As Jan said, we can still have the services check the auth as well after the proxy auth. Moreover, the services don't even have to know about the auth service; they can hit a mapped route on the API Gateway, which gives us better modularisation and separation of concerns. It's not difficult: it can be a trivial endpoint on the auth service, remapped in the API Gateway, which takes the JWT and returns true if it's still authorized in the database.

To be clear, this is not a problem today. Traffic Ops still uses the Mojolicious cookie today, so this would only need to be done if and when we remove that, or if we move authorized endpoints out of Traffic Ops into their own microservices.
Considering the significant security and legal risks, we should always hit the database to validate requests to authorized endpoints, and reconsider if and when someone observes performance issues in practice.

On Tue, May 9, 2017 at 6:56 AM, Dewayne Richardson <[email protected]> wrote:

If only the API GW authenticates/authorizes, we also have a single point of entry to test for security, instead of having it sprinkled across services in different ways. It also simplifies the code on the service side and makes the services easier to test with automation.

-Dew

On Mon, May 8, 2017 at 8:42 AM, Robert Butts <[email protected]> wrote:

> couldn't make nginx or http do what we need.

I was suggesting a different architecture. Not making the proxy do auth, only standard proxying.

> We can still have the services check the auth as well after the proxy auth

+1

On Mon, May 8, 2017 at 3:36 AM, Amir Yeshurun <[email protected]> wrote:

Hi,

Let me elaborate some more on the purpose of the API GW. I will put up a wiki page following our discussions here.

The main purpose is to allow innovation by creating new services that handle TO functionality, not as part of the monolithic Mojo app. The long-term vision is to decompose TO into multiple microservices, allowing new functionality to be added easily. Indeed, the goal is to eventually deprecate the current AAA model and replace it with the new AAA model currently under work (user-roles, role-capabilities).

I think that handling authorization in the API layer is a valid approach. Security-wise, I don't see much difference between that and having each module access the auth service, as long as the auth service is deployed in the backend. Having another proxy (nginx?) fronting the world and forwarding all requests to the backend GW mitigates the risk of compromising the authorization service.
However, as mentioned above, we can still have the services check the auth as well after the proxy auth.

It is a standalone process, completely optional at this point. One can choose to deploy it in order to allow integration with additional services. Deployment and management are still TBD, and feedback on this is most welcome.

Regarding token validation and revocation: tokens have an expiration time, and expired tokens do not pass token validation. In production, expiration should be set to a relatively short time, say 5 minutes. This way revocation is automatic. Re-authentication is handled via refresh tokens (not implemented yet). Hitting the DB upon every API call causes congestion on the users DB. To avoid that, we chose to have all user information self-contained inside the JWT.

Thanks
/amiry

On Mon, May 8, 2017 at 5:42 AM Jan van Doorn <[email protected]> wrote:

It's the reverse proxy we've discussed for the "microservices" version for a while now (as in https://cwiki.apache.org/confluence/display/TC/Design+Overview+v3.0).

On Sun, May 7, 2017 at 7:22 PM Eric Friedrich (efriedri) <[email protected]> wrote:

From a higher level: what is the purpose of the API Gateway? It seems like there may have been some previous discussions about the API Gateway. Are there any notes or a description that I can catch up on?

How will it be deployed? (Is it a standalone service or something that runs inside the experimental Traffic Ops?) Is this new component required or optional?

—Eric

On May 7, 2017, at 8:28 PM, Jan van Doorn <[email protected]> wrote:

I looked into this a year or so ago, and I couldn't make nginx or http do what we need. We can still have the services check the auth as well after the proxy auth, and make things better than today, where we have the same problem that if the TO mojo app is compromised, everything is compromised.
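Amir's short-TTL scheme rests on the `exp` (expiry) claim: validation rejects anything past its expiry, so revocation happens automatically within minutes. A minimal sketch, assuming a 5-minute TTL as suggested above (the helper names are illustrative, not the API GW's actual code):

```python
import time

TTL_SECONDS = 5 * 60  # the ~5-minute production expiry suggested in the thread

def issue(user, now=None):
    """Mint the claims for a short-lived token."""
    now = time.time() if now is None else now
    return {"sub": user, "iat": now, "exp": now + TTL_SECONDS}

def is_expired(claims, now=None):
    """Expired (or exp-less) tokens fail validation; no DB hit required."""
    now = time.time() if now is None else now
    return claims.get("exp", 0) <= now
```

The trade-off debated throughout this thread is visible here: until `exp` passes, nothing in this check can revoke a compromised token, which is what the refresh-token and database-check proposals are trying to address.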
If we always route to TO, we don't untangle the mess of being dependent on the monolithic TO for everything. Many services today, and more in the future, really just need a check to see if the user is authorized, and nothing more.

On Sun, May 7, 2017 at 11:55 AM Robert Butts <[email protected]> wrote:

What are the advantages of these config files over an existing reverse proxy, like Nginx or httpd? It's just as much work as configuring and deploying an existing product, but more code we have to write and maintain. I'm having trouble seeing the advantage.

-1 on auth rules as a part of the proxy. Making a proxy care about auth violates the Single Responsibility Principle, and further, is a security risk. It creates unnecessary attack surface. If your proxy app or server is compromised, the entire framework is now compromised. An attacker could simply rewrite the proxy config to make all routes no-auth.

The simple alternative is for the proxy to always route to TO, and TO checks the token against the auth service (which may also be proxied), and redirects unauthorized requests to a login endpoint (which may also be proxied).

The TO service (and any other service that requires auth) MUST hit the database (or the auth service, which itself hits the database) to verify valid tokens' users still have the permissions they did when the token was created. Otherwise, it's impossible to revoke tokens, e.g. if an employee quits, or an attacker gains a token, or a user changes their password.

On Sun, May 7, 2017 at 4:35 AM, Amir Yeshurun <[email protected]> wrote:

Seems that attachments are stripped on this list.
Examples pasted below.

*rules.json*

[
  { "host": "localhost", "path": "/login", "forward": "localhost:9004", "scheme": "https", "auth": false },
  { "host": "localhost", "path": "/api/1.2/innovation/", "forward": "localhost:8004", "scheme": "http", "auth": true, "routes-file": "innovation.json" },
  { "host": "localhost", "path": "/api/1.2/", "forward": "localhost:3000", "scheme": "http", "auth": true, "routes-file": "traffic-ops-routes.json" },
  { "host": "localhost", "path": "/internal/api/1.2/", "forward": "localhost:3000", "scheme": "http", "auth": true, "routes-file": "internal-routes.json" }
]

*traffic-ops-routes.json (partial)*

. . .
{ "match": "/cdns/health", "auth": { "GET": ["cdn-health-read"] }},
{ "match": "/cdns/capacity", "auth": { "GET": ["cdn-health-read"] }},
{ "match": "/cdns/usage/overview", "auth": { "GET": ["cdn-stats-read"] }},
{ "match": "/cdns/name/dnsseckeys/generate", "auth": { "GET": ["cdn-security-keys-read"] }},
{ "match": "/cdns/name/[^\/]+/?", "auth": { "GET": ["cdn-read"] }},
{ "match": "/cdns/name/[^\/]+/sslkeys", "auth": { "GET": ["cdn-security-keys-read"] }},
{ "match": "/cdns/name/[^\/]+/dnsseckeys", "auth": { "GET": ["cdn-security-keys-read"] }},
{ "match": "/cdns/name/[^\/]+/dnsseckeys/delete", "auth": { "GET": ["cdn-security-keys-write"] }},
{ "match": "/cdns/[^\/]+/queue_update", "auth": { "POST": ["queue-updates-write"] }},
{ "match": "/cdns/[^\/]+/snapshot", "auth": { "PUT": ["cdn-config-snapshot-write"] }},
{ "match": "/cdns/[^\/]+/health", "auth": { "GET": ["cdn-health-read"] }},
{ "match": "/cdns/[^\/]+/?", "auth": { "GET": ["cdn-read"], "PUT": ["cdn-write"], "PATCH": ["cdn-write"], "DELETE": ["cdn-write"] }},
{ "match": "/cdns", "auth": { "GET": ["cdn-read"], "POST": ["cdn-write"] }},
. . .
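To make the two rule files concrete, here is a rough sketch of how a gateway might evaluate them: prefix-match the path against the forwarding rules, then regex-match against that rule's authorization routes and compare required capabilities with the user's JWT capabilities. This is an illustration of the described design, not the actual API GW implementation; the rule subset and longest-prefix tie-break are assumptions.

```python
import re

# A trimmed-down, hypothetical version of rules.json with its
# routes-file contents inlined for the example.
FORWARDING_RULES = [
    {"path": "/login", "forward": "localhost:9004", "auth": False},
    {"path": "/api/1.2/", "forward": "localhost:3000", "auth": True,
     "routes": [
         {"match": r"/cdns/health", "auth": {"GET": ["cdn-health-read"]}},
         {"match": r"/cdns/[^/]+/snapshot", "auth": {"PUT": ["cdn-config-snapshot-write"]}},
     ]},
]

def route(method, path, user_capabilities):
    """Return the forward target, or None if unmatched or unauthorized."""
    # Step 1: prefix match against forwarding rules; longest prefix wins.
    candidates = [r for r in FORWARDING_RULES if path.startswith(r["path"])]
    if not candidates:
        return None
    rule = max(candidates, key=lambda r: len(r["path"]))
    if not rule["auth"]:
        return rule["forward"]
    # Step 2: regex match the remainder against authorization routes and
    # check the HTTP method's required capabilities against the JWT's.
    subpath = path[len(rule["path"]) - 1:]  # keep the leading "/"
    for r in rule["routes"]:
        if re.fullmatch(r["match"], subpath) and method in r["auth"]:
            if all(c in user_capabilities for c in r["auth"][method]):
                return rule["forward"]
    return None
```

For example, a GET of /api/1.2/cdns/health forwards to localhost:3000 only if the caller's JWT carries the cdn-health-read capability, while /login forwards with no auth at all, matching the `"auth": false` rule.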
On Sun, May 7, 2017 at 12:39 PM Amir Yeshurun <[email protected]> wrote:

Attached please find examples of the forwarding rules file (rules.json) and the authorization rules file (traffic-ops-routes.json).

On Sun, May 7, 2017 at 10:39 AM Amir Yeshurun <[email protected]> wrote:

Hi all,

I am about to submit a PR with a first operational version of the API GW, to the "experimental" code base. The API GW forwarding logic is as follows:

1. Find the host to forward the request to: prefix match on the request path against a list of forwarding rules. The matched forwarding rule defines the target host and the target's *authorization rules*.
2. Authorization: regex match on the request path against a list of *authorization rules*. The matched rule defines the capabilities required to perform the HTTP method on the route. These capabilities are compared against the user's capabilities in the user's JWT.

At this moment, the two sets of rules are hard-coded in JSON files. The files are provided with the API GW distribution and contain definitions for the TC 2.0 API routes. I have tested parts of the API; however, there might be mistakes in some of the routes. Please be warned.

Considering manageability and high availability, I am aware that using local files for storing the set of authorization rules is inferior to centralized configuration. We are considering different approaches for centralized configuration, with the following points in mind:

- Microservice world: the API GW will front multiple services, not only Mojo. It can also front other TC components like Traffic Stats and Traffic Monitor. Each service defines its own routes and capabilities. Here comes the question of what the "source of truth" is for the route definitions.
- Handling private routes. The API GW may front non-TC services.
- User changes to the AAA scheme.
The ability for an admin user to make changes to the required capabilities of a route, maybe even to define new capability names, was raised in the past as a use case that should be supported.
- Easy development and deployment of new services.
- Using the TO DB for expediency.

I would appreciate any feedback and views on your approach to managing route definitions.

Thanks
/amiry

--
*Shmulik Asafi*
Qwilt | Work: +972-72-2221692 | Mobile: +972-54-6581595 | [email protected]
