dwreeves opened a new pull request, #25432:
URL: https://github.com/apache/airflow/pull/25432

   This PR addresses #25104
   
   ---
   
   # Changes
   
   ## 1. JSON secrets do not need to be URL-encoded when `full_url_mode=False`.
   
   ^ The whole reason for this PR. This is the main behavior that is being 
implemented.
   
   Specifically, this behavior is implemented for when `get_connection()` is 
called. This method returns a `Optional[Connection]` object. The `Connection()` 
can be built either using a `uri=?`, or with kwargs corresponding with the 
parts of the URI, e.g. `host=?`, `port=?`, etc. In the former case, 
URL-encoding is required; in the latter case, it is not.
   
   Users will receive a DeprecationWarning if the code detects that their 
secrets are URL-encoded (more on that in section 2 below). In most cases, the 
user should be able to simply decode their secret and everything will continue 
working normally.
   
   I tried to make this change, as well as the other changes, as quietly as 
possible for typical Airflow runtimes. Basically, if the user isn't doing 
anything weird, the only thing they will be required to do is change their 
secrets from being URL-encoded to decoded.
   
   To support this behavior, a few additional things were either needed to be 
changed, or made a lot of sense to change. The rest of this message will 
describe what those additional changes are.
   
   ## 2. Added `secret_values_are_urlencoded=?` kwarg for 
`SecretsManagerBackend.__init__`, albeit most users do not need to touch this.
   
   @potiuk suggested adding a kwarg for assisting in migration. I wanted to 
avoid this if necessary because it is very obtrusive and causes a negative user 
experience.
   
   Is it possible to avoid needing to reconfigure the secrets manager backend? 
Yes, in most cases!
   
   What I dicovered is that **if decoding the URL-encoded values is idempotent, 
the user needs to decode their secrets, but the `backend_kwargs` config option 
does not need to be adjusted.** The reason why idempotency obviates a need to 
touch the config is that idempotency means the intended value of the secret is 
unambiguous once the values are decoded. This is a pretty big win from a user 
experience perspective because adjusting the Airflow configuration can be a 
nuisance in practice (e.g. a developer might need to get a system administrator 
involved).
   
   I add a longer explanation in the comments for the code:
   
   ```python
   # When ``unquote(v) == v``, then removing unquote won't affect the user, 
regardless of
   # whether or not ``v`` is URL-encoded. For example, "foo bar" is not 
URL-encoded. But
   # because decoding it doesn't affect the value, then it will migrate safely 
when
   # ``unquote`` gets removed.
   #
   # When parameters are URL-encoded, but decoding is idempotent, we need to 
warn the user
   # to un-escape their secrets. For example, if "foo%20bar" is a URL-encoded 
string, then
   # decoding is idempotent because ``unquote(unquote("foo%20bar")) == 
unquote("foo%20bar")``.
   #
   # In the rare situation that value is URL-encoded but the decoding is _not_ 
idempotent,
   # this causes a major issue. For example, if "foo%2520bar" is URL-encoded, 
then decoding is
   # _not_ idempotent because ``unquote(unquote("foo%2520bar")) != 
unquote("foo%2520bar")``
   #
   # This causes a problem for migration because if the user decodes their 
value, we cannot
   # infer that is the case by looking at the decoded value (from our vantage 
point, it will
   # look to be URL-encoded.)
   #
   # So when this uncommon situation occurs, the user _must_ adjust the 
configuration and set
   # ``parameters_are_urlencoded`` to False to migrate safely. In all other 
cases, we do not
   # need the user to adjust this object to migrate; they can transition their 
secrets with
   # the default configuration.
   ```
   
   The kwarg is needed in the very rare case that the kwarg is not idempotent, 
i.e. the string literal for the decoded secret contains a `%`. In this case, 
the user will need to manually set `secret_values_are_urlencoded` to `False`.
   
   ## 3. Send DeprecationWarning in a niche situation for `get_conn_value()`.
   
   `get_conn_value()` is allowed to return a JSON as of 2.3.0.
   
   However, in the unlikely case that someone is both (1) using 
`get_conn_value()` directly, and (2) is using `full_url_mode=False`, we want to 
warn them that they will no longer receive a URL in the future.
   
   2. The base implementation of `get_connection()` will generate a 
`Connection` object when `get_conn_value()` returns a JSON-- or more 
conceptually, when the secret is a valid JSON.
   
   2. When `get_conn_value()` returns a JSON, `get_connection()` is able to 
create a `Connection` object from the JSON.
      a. Crucially, this does _not_ require URL-encoding for the base Airflow 
APIs.
   
   This is really challenging to do elegantly if the method 
`SecretsManagerBackend.get_conn_value()` needs to retain 100% backwards 
compatibility. By that I mean: if it is never allowed to return a JSON string 
representation of the secret.
   
   For example, if `_get_secret()` returns a string `'{"host": "foo", 
"conn_type": "postgres", "schema": "mydb"}'`, then the current behavior is that 
`SecretsManagerBackend.get_conn_value()` will return a string 
`'postgres://:@foo/mydb'`.
   
   Under the base class's implementation, `BaseSecretsBackend.get_conn_value()` 
_is_ allowed to return a JSON string. But `SecretsManagerBackend` _never_ does 
that. If the behavior of the overridden is relaxed to allow for JSON strings as 
per the base implementation, then this becomes a little easier to do without 
writing complete spaghetti.
   
   For "typical" Airflow API usage there is no harm because `get_conn_value()` 
is not typically called directly. However, this is a pretty big package, and 
lots of people do lots of things you might not expect, so I opted to go with 
smoothly transitioning away from returning a URI.
   
   ## 4. Deprecate `ast.literal_eval` (i.e. support for dict reprs) for JSON 
decoding.
   
   I want to be diplomatic, but I also want to get to the point, so please do 
not interpret this negatively. Here is the original code:
   
   ```python
   try:
       secret_string = self._get_secret(self.connections_prefix, conn_id)
       # json.loads gives error
       secret = ast.literal_eval(secret_string) if secret_string else None
   except ValueError:  # 'malformed node or string: ' error, for empty conns
       connection = None
       secret = None
   ```
   
   ^ This is a somewhat common "mistake" in Python. The "mistake" being that 
someone `print()`s a dict, and then they CTRL+C the `repr` for the dict, and 
they think it's a JSON. So when they then CTRL+V the string somewhere that 
expects a JSON, they hit a JSON decode error. And then instead of thinking, "I 
made a mistake," they believe that they've uncovered a bug in their first 
couple hundred hours of programming in a core Python module that has existed 
for decades. (Notably, the comment in the Python code says "json.loads gives 
error", which confirms this story.)
   
   There are two reasons to remove `ast.literal_eval` and just use 
`json.loads`. The first reason is a bit more philosophical, which is that the 
API should not support an odd implementation and should not hand-hold for 
mistakes at this level of simplicity.
   
   The second reason is to provide a little more consistency across the Airflow 
API:
   
   - `{'conn_type': 'postgres', 'host': 'postgres'}` is not a valid secret 
elsewhere in Airflow, but `{"conn_type": "postgres", "host": "postgres"}` is.
   - `get_conn_value()` is allowed to return a JSON string, but not a dict 
repr. When we migrate toward `get_conn_value()` returning a JSON string, we 
should discourage use of dict reprs.
   
   The original PR that implemented the `ast.literal_eval()` approach (#15104) 
was primarily focused on adding more support for various aliases for keys. 
Overall, this a fine addition to the code. For various reasons, maintainers 
should be committed to retaining this feature. However, the use of 
`ast.literal_eval` was never part of the discussion for this specification; 
there was not a PR specifically devoted to migrating `json.loads` to 
`ast.literal_eval`.
   
   Removing it also should not be disruptive to the vast majority of people, 
since:
   
   - The Secrets Manager UI defaults to parsing manually input key-value pairs 
as a JSON.
   - If you input a custom string that is a dict repr but not a JSON, the UI 
will show an error. It would be unusual for a user to submit this as a secret, 
since the user would be ignoring the UI's warnings. (I just tested and 
confirmed manually that this is the case by creating a secret that is `{'foo': 
'bar'}`.)
   
   ## 5. Support `extra` being either a JSON string or a dict
   
   AWS Secrets Manager allows for storage of arbitrary strings. For example, 
both of these are valid secrets to store within the Secrets Manager:
   
   ```json
   {"extra": "{\\"foo\\": \\"bar\\"}"}
   ```
   
   and
   
   ```json
   {"extra": {"foo": "bar"}}
   ```
   
   Right now, the former is supported but not the latter. There doesn't seem to 
be a good reason why the latter should not be supported, other than that the 
AWS UI will fail to parse key-value pairs. But it's still a valid secret!
   
   ## 6. Added some type annotations.
   
   Some function signatures were missing type annotations, so I added them in. 
All of the type annotations for overridden methods adhere to the signatures 
from the base implementation of the class.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to