dwreeves opened a new pull request, #25432:
URL: https://github.com/apache/airflow/pull/25432
This PR addresses #25104
---
# Changes
## 1. JSON secrets do not need to be URL-encoded when `full_url_mode=False`.
^ The whole reason for this PR. This is the main behavior that is being
implemented.
Specifically, this behavior is implemented for when `get_connection()` is
called. This method returns a `Optional[Connection]` object. The `Connection()`
can be built either using a `uri=?`, or with kwargs corresponding with the
parts of the URI, e.g. `host=?`, `port=?`, etc. In the former case,
URL-encoding is required; in the latter case, it is not.
Users will receive a DeprecationWarning if the code detects that their
secrets are URL-encoded (more on that in section 2 below). In most cases, the
user should be able to simply decode their secret and everything will continue
working normally.
I tried to make this change, as well as the other changes, as quietly as
possible for typical Airflow runtimes. Basically, if the user isn't doing
anything weird, the only thing they will be required to do is change their
secrets from being URL-encoded to decoded.
To support this behavior, a few additional things were either needed to be
changed, or made a lot of sense to change. The rest of this message will
describe what those additional changes are.
## 2. Added `secret_values_are_urlencoded=?` kwarg for
`SecretsManagerBackend.__init__`, albeit most users do not need to touch this.
@potiuk suggested adding a kwarg for assisting in migration. I wanted to
avoid this if necessary because it is very obtrusive and causes a negative user
experience.
Is it possible to avoid needing to reconfigure the secrets manager backend?
Yes, in most cases!
What I dicovered is that **if decoding the URL-encoded values is idempotent,
the user needs to decode their secrets, but the `backend_kwargs` config option
does not need to be adjusted.** The reason why idempotency obviates a need to
touch the config is that idempotency means the intended value of the secret is
unambiguous once the values are decoded. This is a pretty big win from a user
experience perspective because adjusting the Airflow configuration can be a
nuisance in practice (e.g. a developer might need to get a system administrator
involved).
I add a longer explanation in the comments for the code:
```python
# When ``unquote(v) == v``, then removing unquote won't affect the user,
regardless of
# whether or not ``v`` is URL-encoded. For example, "foo bar" is not
URL-encoded. But
# because decoding it doesn't affect the value, then it will migrate safely
when
# ``unquote`` gets removed.
#
# When parameters are URL-encoded, but decoding is idempotent, we need to
warn the user
# to un-escape their secrets. For example, if "foo%20bar" is a URL-encoded
string, then
# decoding is idempotent because ``unquote(unquote("foo%20bar")) ==
unquote("foo%20bar")``.
#
# In the rare situation that value is URL-encoded but the decoding is _not_
idempotent,
# this causes a major issue. For example, if "foo%2520bar" is URL-encoded,
then decoding is
# _not_ idempotent because ``unquote(unquote("foo%2520bar")) !=
unquote("foo%2520bar")``
#
# This causes a problem for migration because if the user decodes their
value, we cannot
# infer that is the case by looking at the decoded value (from our vantage
point, it will
# look to be URL-encoded.)
#
# So when this uncommon situation occurs, the user _must_ adjust the
configuration and set
# ``parameters_are_urlencoded`` to False to migrate safely. In all other
cases, we do not
# need the user to adjust this object to migrate; they can transition their
secrets with
# the default configuration.
```
The kwarg is needed in the very rare case that the kwarg is not idempotent,
i.e. the string literal for the decoded secret contains a `%`. In this case,
the user will need to manually set `secret_values_are_urlencoded` to `False`.
## 3. Send DeprecationWarning in a niche situation for `get_conn_value()`.
`get_conn_value()` is allowed to return a JSON as of 2.3.0.
However, in the unlikely case that someone is both (1) using
`get_conn_value()` directly, and (2) is using `full_url_mode=False`, we want to
warn them that they will no longer receive a URL in the future.
2. The base implementation of `get_connection()` will generate a
`Connection` object when `get_conn_value()` returns a JSON-- or more
conceptually, when the secret is a valid JSON.
2. When `get_conn_value()` returns a JSON, `get_connection()` is able to
create a `Connection` object from the JSON.
a. Crucially, this does _not_ require URL-encoding for the base Airflow
APIs.
This is really challenging to do elegantly if the method
`SecretsManagerBackend.get_conn_value()` needs to retain 100% backwards
compatibility. By that I mean: if it is never allowed to return a JSON string
representation of the secret.
For example, if `_get_secret()` returns a string `'{"host": "foo",
"conn_type": "postgres", "schema": "mydb"}'`, then the current behavior is that
`SecretsManagerBackend.get_conn_value()` will return a string
`'postgres://:@foo/mydb'`.
Under the base class's implementation, `BaseSecretsBackend.get_conn_value()`
_is_ allowed to return a JSON string. But `SecretsManagerBackend` _never_ does
that. If the behavior of the overridden is relaxed to allow for JSON strings as
per the base implementation, then this becomes a little easier to do without
writing complete spaghetti.
For "typical" Airflow API usage there is no harm because `get_conn_value()`
is not typically called directly. However, this is a pretty big package, and
lots of people do lots of things you might not expect, so I opted to go with
smoothly transitioning away from returning a URI.
## 4. Deprecate `ast.literal_eval` (i.e. support for dict reprs) for JSON
decoding.
I want to be diplomatic, but I also want to get to the point, so please do
not interpret this negatively. Here is the original code:
```python
try:
secret_string = self._get_secret(self.connections_prefix, conn_id)
# json.loads gives error
secret = ast.literal_eval(secret_string) if secret_string else None
except ValueError: # 'malformed node or string: ' error, for empty conns
connection = None
secret = None
```
^ This is a somewhat common "mistake" in Python. The "mistake" being that
someone `print()`s a dict, and then they CTRL+C the `repr` for the dict, and
they think it's a JSON. So when they then CTRL+V the string somewhere that
expects a JSON, they hit a JSON decode error. And then instead of thinking, "I
made a mistake," they believe that they've uncovered a bug in their first
couple hundred hours of programming in a core Python module that has existed
for decades. (Notably, the comment in the Python code says "json.loads gives
error", which confirms this story.)
There are two reasons to remove `ast.literal_eval` and just use
`json.loads`. The first reason is a bit more philosophical, which is that the
API should not support an odd implementation and should not hand-hold for
mistakes at this level of simplicity.
The second reason is to provide a little more consistency across the Airflow
API:
- `{'conn_type': 'postgres', 'host': 'postgres'}` is not a valid secret
elsewhere in Airflow, but `{"conn_type": "postgres", "host": "postgres"}` is.
- `get_conn_value()` is allowed to return a JSON string, but not a dict
repr. When we migrate toward `get_conn_value()` returning a JSON string, we
should discourage use of dict reprs.
The original PR that implemented the `ast.literal_eval()` approach (#15104)
was primarily focused on adding more support for various aliases for keys.
Overall, this a fine addition to the code. For various reasons, maintainers
should be committed to retaining this feature. However, the use of
`ast.literal_eval` was never part of the discussion for this specification;
there was not a PR specifically devoted to migrating `json.loads` to
`ast.literal_eval`.
Removing it also should not be disruptive to the vast majority of people,
since:
- The Secrets Manager UI defaults to parsing manually input key-value pairs
as a JSON.
- If you input a custom string that is a dict repr but not a JSON, the UI
will show an error. It would be unusual for a user to submit this as a secret,
since the user would be ignoring the UI's warnings. (I just tested and
confirmed manually that this is the case by creating a secret that is `{'foo':
'bar'}`.)
## 5. Support `extra` being either a JSON string or a dict
AWS Secrets Manager allows for storage of arbitrary strings. For example,
both of these are valid secrets to store within the Secrets Manager:
```json
{"extra": "{\\"foo\\": \\"bar\\"}"}
```
and
```json
{"extra": {"foo": "bar"}}
```
Right now, the former is supported but not the latter. There doesn't seem to
be a good reason why the latter should not be supported, other than that the
AWS UI will fail to parse key-value pairs. But it's still a valid secret!
## 6. Added some type annotations.
Some function signatures were missing type annotations, so I added them in.
All of the type annotations for overridden methods adhere to the signatures
from the base implementation of the class.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]