codeant-ai-for-open-source[bot] commented on code in PR #37067:
URL: https://github.com/apache/superset/pull/37067#discussion_r2743520507
##########
superset/superset_typing.py:
##########
@@ -354,7 +354,7 @@ class OAuth2TokenResponse(TypedDict, total=False):
refresh_token: str
-class OAuth2State(TypedDict):
+class OAuth2State(TypedDict, total=False):
Review Comment:
**Suggestion:** The `OAuth2State` TypedDict is declared with `total=False`,
implying that all keys are optional, but all callers (such as
`encode_oauth2_state` in `superset/utils/oauth2.py`) unconditionally index
`database_id`, `user_id`, `default_redirect_uri`, and `tab_id`, so if any
future caller constructs a partial state that type-checks, it will trigger a
`KeyError` at runtime when encoding the state. [possible bug]
<details>
<summary><b>Severity Level:</b> Major ⚠️</summary>
```mdx
- ❌ DB OAuth2 start flow can crash with KeyError.
- ⚠️ Tests or custom callers constructing partial states.
- ⚠️ Encoded-state creation fails blocking OAuth2 redirects.
```
</details>
```suggestion
class OAuth2State(TypedDict):
```
<details>
<summary><b>Steps of Reproduction ✅ </b></summary>
```mdx
1. Trigger the OAuth2 encoding logic by calling encode_oauth2_state(state) in
superset/utils/oauth2.py:152 (function defined at lines 152-174). This
function reads
state["database_id"], state["user_id"], state["default_redirect_uri"], and
state["tab_id"]
unconditionally (see lines 156-162).
2. Construct an OAuth2State instance with one or more missing keys (e.g.,
{"database_id":
1, "user_id": 2} ) in application code or tests and pass it to
encode_oauth2_state.
Because OAuth2State is declared total=False, type checkers allow the partial
dict to
type-check, but at runtime encode_oauth2_state will attempt to index missing
keys and
raise KeyError. The exact failure happens at superset/utils/oauth2.py:158
when accessing
state["database_id"] (and similarly for other keys).
3. Observe a KeyError traceback originating at
superset/utils/oauth2.py:156-162 (payload
construction) causing the OAuth2 start flow to fail; the caller (e.g.,
whatever initiated
the OAuth2 dance) will receive an exception instead of a valid encoded state.
4. Note: decoding the state uses OAuth2StateSchema which enforces required
fields
(superset/utils/oauth2.py:176-181). However, the encode path does not
perform validation
and therefore is vulnerable to partial-state dicts constructed elsewhere.
The TypedDict
being total=False permits creation of partial states that will crash at
runtime when
encode_oauth2_state indexes missing keys.
```
</details>
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/superset_typing.py
**Line:** 357:357
**Comment:**
*Possible Bug: The `OAuth2State` TypedDict is declared with
`total=False`, implying that all keys are optional, but all callers (such as
`encode_oauth2_state` in `superset/utils/oauth2.py`) unconditionally index
`database_id`, `user_id`, `default_redirect_uri`, and `tab_id`, so if any
future caller constructs a partial state that type-checks, it will trigger a
`KeyError` at runtime when encoding the state.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
##########
superset/commands/database/oauth2.py:
##########
@@ -50,9 +53,28 @@ def run(self) -> DatabaseUserOAuth2Tokens:
if oauth2_config is None:
raise OAuth2Error("No configuration found for OAuth2")
+ # Look up PKCE code_verifier from KV store (RFC 7636)
+ code_verifier = None
+ tab_id = self._state["tab_id"]
Review Comment:
**Suggestion:** The new PKCE lookup logic assumes that the OAuth2 state
always contains a `tab_id` key and indexes it directly; however `OAuth2State`
is defined as a non-total `TypedDict`, so if a caller ever passes a state
without `tab_id` (for example, older tokens, custom clients, or future reuse of
this command with a reduced state), this will raise a `KeyError` and break the
token exchange instead of simply skipping PKCE. [possible bug]
<details>
<summary><b>Severity Level:</b> Major ⚠️</summary>
```mdx
- ❌ OAuth2 token exchange flow may raise KeyError and fail.
- ❌ Database OAuth2 connection setup blocked on callback.
- ⚠️ Breaks PKCE only for requests missing tab_id.
- ⚠️ Unit/integration tests may surface regressions.
```
</details>
```suggestion
tab_uuid = None
tab_id = self._state.get("tab_id")
if tab_id:
```
<details>
<summary><b>Steps of Reproduction ✅ </b></summary>
```mdx
1. The OAuth2 token exchange flow decodes the state in validate() at
superset/commands/database/oauth2.py:103 (self._state =
decode_oauth2_state(self._parameters["state"])). This populates self._state
with the
TypedDict from superset/superset_typing.py:357 (OAuth2State, total=False).
2. Call OAuth2StoreTokenCommand.run() which executes the PKCE lookup
starting at
superset/commands/database/oauth2.py:56-72. In the current code it indexes
self._state["tab_id"] directly at line 58.
3. If the decoded state does not include "tab_id" (allowed because
OAuth2State is
TypedDict total=False — see superset/superset_typing.py:357), Python will
raise a KeyError
at superset/commands/database/oauth2.py:58 before the code reaches the token
exchange (the
call to get_oauth2_token at lines 74-78).
4. The failure occurs during handling of the OAuth2 provider callback (token
exchange
command execution). Observed result: a KeyError traceback originating at
superset/commands/database/oauth2.py:58, aborting token storage and causing
the OAuth2
flow to fail for that callback.
5. The proposed change (use self._state.get("tab_id") and guard) prevents
KeyError and
will simply skip PKCE when tab_id is absent, restoring backward
compatibility with
older/partial states. This is necessary because OAuth2State is explicitly
non-total
(superset/superset_typing.py:357) so absence of tab_id is a valid/expected
shape.
```
</details>
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/commands/database/oauth2.py
**Line:** 58:58
**Comment:**
*Possible Bug: The new PKCE lookup logic assumes that the OAuth2 state
always contains a `tab_id` key and indexes it directly; however `OAuth2State`
is defined as a non-total `TypedDict`, so if a caller ever passes a state
without `tab_id` (for example, older tokens, custom clients, or future reuse of
this command with a reduced state), this will raise a `KeyError` and break the
token exchange instead of simply skipping PKCE.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
##########
tests/unit_tests/utils/oauth2_tests.py:
##########
@@ -93,3 +101,96 @@ def test_get_oauth2_access_token_base_no_refresh(mocker:
MockerFixture) -> None:
# check that token was deleted
db.session.delete.assert_called_with(token)
+
+
+def test_generate_code_verifier_length() -> None:
+ """
+ Test that generate_code_verifier produces a string of valid length (RFC
7636).
+ """
+ code_verifier = generate_code_verifier()
+ # RFC 7636 requires 43-128 characters
+ assert 43 <= len(code_verifier) <= 128
+
+
+def test_generate_code_verifier_uniqueness() -> None:
+ """
+ Test that generate_code_verifier produces unique values.
+ """
+ verifiers = {generate_code_verifier() for _ in range(100)}
+ # All generated verifiers should be unique
+ assert len(verifiers) == 100
+
+
+def test_generate_code_verifier_valid_characters() -> None:
+ """
+ Test that generate_code_verifier only uses valid characters (RFC 7636).
+ """
+ code_verifier = generate_code_verifier()
+ # RFC 7636 allows: [A-Z] / [a-z] / [0-9] / "-" / "." / "_" / "~"
+ # URL-safe base64 uses: [A-Z] / [a-z] / [0-9] / "-" / "_"
+ valid_chars = set(
+ "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"
+ )
+ assert all(char in valid_chars for char in code_verifier)
+
+
+def test_generate_code_challenge_s256() -> None:
+ """
+ Test that generate_code_challenge produces correct S256 challenge.
+ """
+ # Use a known code_verifier to verify the challenge computation
+ code_verifier = "dBjftJeZ4CVP-mB92K27uhbUJU1p1r_wW1gFWFOEjXk"
+
+ # Compute expected challenge manually
+ digest = hashlib.sha256(code_verifier.encode("ascii")).digest()
+ expected_challenge =
base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
+
+ code_challenge = generate_code_challenge(code_verifier)
+ assert code_challenge == expected_challenge
+
+
+def test_generate_code_challenge_rfc_example() -> None:
+ """
+ Test PKCE code challenge against RFC 7636 Appendix B example.
+
+ See: https://datatracker.ietf.org/doc/html/rfc7636#appendix-B
+ """
+ # RFC 7636 example code_verifier (Appendix B)
+ code_verifier = "dBjftJeZ4CVP-mB92K27uhbUJU1p1r_wW1gFWFOEjXk"
+ # RFC 7636 expected code_challenge for S256 method
+ expected_challenge = "E9Melhoa2OwvFrEMTJguCHaoeK1t8URWbuGJSstw-cM"
+
+ code_challenge = generate_code_challenge(code_verifier)
+ assert code_challenge == expected_challenge
+
+
+def test_encode_decode_oauth2_state(
+ mocker: MockerFixture,
+) -> None:
+ """
+ Test that encode/decode cycle preserves state fields.
+ """
+ from superset.superset_typing import OAuth2State
+
+ mocker.patch(
+ "flask.current_app.config",
+ {
+ "SECRET_KEY": "test-secret-key",
+ "DATABASE_OAUTH2_JWT_ALGORITHM": "HS256",
Review Comment:
**Suggestion:** The test currently asserts that the decoded OAuth2 state
does not contain a `code_verifier` field, which contradicts the PKCE design
where the verifier must survive the encode/decode cycle so it can be sent when
exchanging the authorization code; this test will lock in broken behavior and
prevent PKCE from working correctly. [logic error]
<details>
<summary><b>Severity Level:</b> Critical 🚨</summary>
```mdx
- ❌ PKCE support broken for OAuth2 database connections.
- ❌ Token exchange missing code_verifier parameter.
- ⚠️ Unit tests accept incorrect behavior masking regressions.
- ⚠️ OAuth2 flows for public clients become less secure.
```
</details>
```suggestion
"code_verifier": "test-code-verifier",
}
with freeze_time("2024-01-01"):
encoded = encode_oauth2_state(state)
decoded = decode_oauth2_state(encoded)
assert decoded["database_id"] == 1
assert decoded["user_id"] == 2
assert decoded.get("code_verifier") == "test-code-verifier"
```
<details>
<summary><b>Steps of Reproduction ✅ </b></summary>
```mdx
1. Inspect encode_oauth2_state implementation at
superset/utils/oauth2.py:152-173 — it
builds JWT payload with keys: exp, database_id, user_id,
default_redirect_uri, tab_id (see
lines 156-162) and does NOT include code_verifier in the encoded payload.
2. Inspect decode_oauth2_state implementation at
superset/utils/oauth2.py:204-218 — it
decodes the JWT and loads it via oauth2_state_schema (lines 211-216). The
schema
(OAuth2StateSchema) defined at superset/utils/oauth2.py:176-195 only lists
database_id,
user_id, default_redirect_uri, tab_id and does not accept or preserve
code_verifier (lines
176-181, 189-194).
3. Run the unit test
tests/unit_tests/utils/oauth2_tests.py::test_encode_decode_oauth2_state
which currently
creates a state without a code_verifier and asserts "code_verifier" not in
decoded (tests
file lines shown in PR). Because encode_oauth2_state never includes
code_verifier and the
schema excludes unknown keys, a code_verifier present in the input state
would be dropped
during encode/decode — reproducing the assertion the test makes.
4. Real PKCE flow requires the code_verifier to survive encode/decode so it
can be sent on
token exchange. To reproduce the breaking scenario concretely:
- Modify the test to include a code_verifier in the input state
(tests/unit_tests/utils/oauth2_tests.py: add "code_verifier":
"test-code-verifier").
- Call encode_oauth2_state(state) (superset/utils/oauth2.py:152) then
decode_oauth2_state(encoded) (superset/utils/oauth2.py:204).
- Observe that decoded does NOT contain "code_verifier" because the
payload omits it at
encode time (lines 156-162) and the schema excludes unknown fields
(OAuth2StateSchema.
Meta unknown = EXCLUDE at line 198).
5. Conclusion: The test's current assertion that the decoded state must not
contain
code_verifier masks a real problem — encode_oauth2_state and the schema must
be updated to
preserve code_verifier if PKCE is intended. The improved test (proposed)
will fail under
current code, proving the bug.
```
</details>
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** tests/unit_tests/utils/oauth2_tests.py
**Line:** 172:179
**Comment:**
*Logic Error: The test currently asserts that the decoded OAuth2 state
does not contain a `code_verifier` field, which contradicts the PKCE design
where the verifier must survive the encode/decode cycle so it can be sent when
exchanging the authorization code; this test will lock in broken behavior and
prevent PKCE from working correctly.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
##########
superset/db_engine_specs/base.py:
##########
@@ -474,10 +481,33 @@ def start_oauth2_dance(cls, database: Database) -> None:
tab sends a message to the original tab informing that authorization
was
successful (or not), and then closes. The original tab will
automatically
re-run the query after authorization.
+
+ PKCE (RFC 7636) is used to protect against authorization code
interception
+ attacks. A code_verifier is generated and stored server-side in the KV
store,
+ while the code_challenge (derived from the verifier) is sent to the
+ authorization server.
"""
+ # prevent circular import
+ from superset.daos.key_value import KeyValueDAO
+
tab_id = str(uuid4())
default_redirect_uri = url_for("DatabaseRestApi.oauth2",
_external=True)
+ # Generate PKCE code verifier (RFC 7636)
+ code_verifier = generate_code_verifier()
+
+ # Store the code_verifier server-side in the KV store, keyed by tab_id.
+ # This avoids exposing it in the URL/browser history via the JWT state.
+ KeyValueDAO.delete_expired_entries(KeyValueResource.PKCE_CODE_VERIFIER)
+ KeyValueDAO.create_entry(
+ resource=KeyValueResource.PKCE_CODE_VERIFIER,
+ value={"code_verifier": code_verifier},
+ codec=JsonKeyValueCodec(),
+ key=UUID(tab_id),
+ expires_on=datetime.now() + timedelta(minutes=5),
+ )
+ db.session.flush()
Review Comment:
**Suggestion:** The PKCE `code_verifier` is written to the key-value store
and then `OAuth2RedirectError` is raised without committing the transaction, so
in configurations where sessions are rolled back on exceptions the
`code_verifier` row will never actually be persisted and the subsequent token
exchange will not find it, causing the PKCE flow to fail; replacing the flush
with a commit (or otherwise ensuring a commit here) guarantees the verifier is
durable before redirecting the user. [logic error]
<details>
<summary><b>Severity Level:</b> Critical 🚨</summary>
```mdx
- ❌ OAuth2 PKCE token exchange fails for DB OAuth2 flows
- ⚠️ OAuth2 database authentication unreliable in some deployments
- ⚠️ Affects any engine using DATABASE_OAUTH2_CLIENTS config
```
</details>
```suggestion
db.session.commit()
```
<details>
<summary><b>Steps of Reproduction ✅ </b></summary>
```mdx
1. Configure a Database with OAuth2 in app config and mark engine as
OAuth2-enabled. (See
superset/db_engine_specs/base.py start_oauth2_dance where PKCE is generated
and stored;
PKCE storage occurs at the snippet starting at
superset/db_engine_specs/base.py:501.)
2. Trigger an OAuth2 authorization flow from a frontend operation that hits
the code path
which raises OAuth2RedirectError. In practice this happens when a DB call
raises an
OAuth2RedirectError and execute() calls start_oauth2_dance (execute handles
exceptions and
calls cls.start_oauth2_dance(database) — see
superset/db_engine_specs/base.py execute
method which flows into start_oauth2_dance).
3. start_oauth2_dance generates a code_verifier and calls
KeyValueDAO.create_entry, then
calls db.session.flush() (lines shown in the existing_code block at
superset/db_engine_specs/base.py:501-509).
4. Because an OAuth2RedirectError is raised immediately after (the flow
raises
OAuth2RedirectError to redirect the browser — see the raise at
superset/db_engine_specs/base.py around the OAuth redirect), some
configurations roll back
the current DB session on exceptions. In those setups the earlier
db.session.flush() does
not persist the KeyValueDAO entry (it can be rolled back), so the later
token exchange
that looks up the verifier in the KV store fails to find it and PKCE token
exchange is
rejected.
5. Reproduce locally by: (a) enabling DATABASE_OAUTH2_CLIENTS for an engine,
(b) causing a
DB operation that triggers OAuth2 redirect (the path through execute() ->
needs_oauth2()
-> start_oauth2_dance()), (c) inspect the KV store for the tab_id key — the
entry will be
missing if the session was rolled back because only flush() was called.
Observed failure:
the token exchange step cannot validate PKCE because stored verifier does
not exist.
6. Explanation: a flush() only sends SQL to the DB within the transaction;
without commit
the row can be rolled back by exception handlers. Replacing flush() with
commit() (or
ensuring the entry is persisted outside the request transaction) guarantees
durability
across the redirect and subsequent token exchange.
```
</details>
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/db_engine_specs/base.py
**Line:** 509:509
**Comment:**
*Logic Error: The PKCE `code_verifier` is written to the key-value
store and then `OAuth2RedirectError` is raised without committing the
transaction, so in configurations where sessions are rolled back on exceptions
the `code_verifier` row will never actually be persisted and the subsequent
token exchange will not find it, causing the PKCE flow to fail; replacing the
flush with a commit (or otherwise ensuring a commit here) guarantees the
verifier is durable before redirecting the user.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]