codeant-ai-for-open-source[bot] commented on code in PR #37067:
URL: https://github.com/apache/superset/pull/37067#discussion_r2743520507


##########
superset/superset_typing.py:
##########
@@ -354,7 +354,7 @@ class OAuth2TokenResponse(TypedDict, total=False):
     refresh_token: str
 
 
-class OAuth2State(TypedDict):
+class OAuth2State(TypedDict, total=False):

Review Comment:
   **Suggestion:** The `OAuth2State` TypedDict is declared with `total=False`, 
implying that all keys are optional. However, all callers (such as 
`encode_oauth2_state` in `superset/utils/oauth2.py`) unconditionally index 
`database_id`, `user_id`, `default_redirect_uri`, and `tab_id`, so any future 
caller that constructs a partial state that type-checks will trigger a 
`KeyError` at runtime when encoding the state. [possible bug]
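   
   A minimal sketch (hypothetical helper names, not Superset's actual code) of 
why a non-total `TypedDict` lets a partial state pass type checking but crash 
at runtime:
   
   ```python
   from typing import TypedDict


   class OAuth2State(TypedDict, total=False):
       database_id: int
       user_id: int
       default_redirect_uri: str
       tab_id: str


   def encode_state(state: OAuth2State) -> dict:
       # Direct indexing assumes every key is present.
       return {"database_id": state["database_id"], "tab_id": state["tab_id"]}


   partial: OAuth2State = {"database_id": 1}  # type-checks because total=False
   encode_state(partial)  # raises KeyError: 'tab_id' at runtime
   ```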
   
   <details>
   <summary><b>Severity Level:</b> Major ⚠️</summary>
   
   ```mdx
   - ❌ DB OAuth2 start flow can crash with KeyError.
   - ⚠️ Tests or custom callers constructing partial states.
   - ⚠️ Encoded-state creation fails blocking OAuth2 redirects.
   ```
   </details>
   
   ```suggestion
   class OAuth2State(TypedDict):
   ```
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. Trigger the OAuth2 encoding logic by calling encode_oauth2_state(state) in
   superset/utils/oauth2.py:152 (function defined at lines 152-174). This 
function reads
   state["database_id"], state["user_id"], state["default_redirect_uri"], and 
state["tab_id"]
   unconditionally (see lines 156-162).
   
   2. Construct an OAuth2State instance with one or more missing keys (e.g., 
{"database_id":
   1, "user_id": 2} ) in application code or tests and pass it to 
encode_oauth2_state.
   Because OAuth2State is declared total=False, type checkers allow the partial 
dict to
   type-check, but at runtime encode_oauth2_state will attempt to index missing 
keys and
   raise KeyError. The exact failure happens at superset/utils/oauth2.py:158 
when accessing
   state["database_id"] (and similarly for other keys).
   
   3. Observe a KeyError traceback originating at 
superset/utils/oauth2.py:156-162 (payload
   construction) causing the OAuth2 start flow to fail; the caller (e.g., 
whatever initiated
   the OAuth2 dance) will receive an exception instead of a valid encoded state.
   
   4. Note: decoding the state uses OAuth2StateSchema which enforces required 
fields
   (superset/utils/oauth2.py:176-181). However, the encode path does not 
perform validation
   and therefore is vulnerable to partial-state dicts constructed elsewhere. 
The TypedDict
   being total=False permits creation of partial states that will crash at 
runtime when
   encode_oauth2_state indexes missing keys.
   ```
   </details>
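   
   For contrast, a sketch of the suggested fix: with a total `TypedDict` (the 
default), a type checker such as mypy rejects the partial state before it can 
ever reach `encode_oauth2_state` at runtime:
   
   ```python
   from typing import TypedDict


   class OAuth2State(TypedDict):  # total=True by default: every key is required
       database_id: int
       user_id: int
       default_redirect_uri: str
       tab_id: str


   # mypy/pyright report missing keys here, instead of letting a partial state
   # slip through and fail with KeyError at runtime.
   partial: OAuth2State = {"database_id": 1, "user_id": 2}
   ```
   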
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/superset_typing.py
   **Line:** 357:357
   **Comment:**
        *Possible Bug: The `OAuth2State` TypedDict is declared with 
`total=False`, implying that all keys are optional, but all callers (such as 
`encode_oauth2_state` in `superset/utils/oauth2.py`) unconditionally index 
`database_id`, `user_id`, `default_redirect_uri`, and `tab_id`, so if any 
future caller constructs a partial state that type-checks, it will trigger a 
`KeyError` at runtime when encoding the state.
   
   Validate the correctness of the flagged issue. If correct, how can I resolve 
this? If you propose a fix, implement it and keep it concise.
   ```
   </details>



##########
superset/commands/database/oauth2.py:
##########
@@ -50,9 +53,28 @@ def run(self) -> DatabaseUserOAuth2Tokens:
         if oauth2_config is None:
             raise OAuth2Error("No configuration found for OAuth2")
 
+        # Look up PKCE code_verifier from KV store (RFC 7636)
+        code_verifier = None
+        tab_id = self._state["tab_id"]

Review Comment:
   **Suggestion:** The new PKCE lookup logic assumes that the OAuth2 state 
always contains a `tab_id` key and indexes it directly. However, `OAuth2State` 
is defined as a non-total `TypedDict`, so if a caller ever passes a state 
without `tab_id` (for example, older tokens, custom clients, or future reuse of 
this command with a reduced state), this will raise a `KeyError` and break the 
token exchange instead of simply skipping PKCE. [possible bug]
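   
   A minimal sketch of the guarded lookup; `kv_lookup` is a placeholder for the 
KV-store read, not the actual `KeyValueDAO` API:
   
   ```python
   from typing import Any, Callable, Optional


   def resolve_code_verifier(
       state: dict[str, Any],
       kv_lookup: Callable[[str], Optional[dict[str, Any]]],
   ) -> Optional[str]:
       """Return the stored PKCE code_verifier, or None when tab_id is absent."""
       tab_id = state.get("tab_id")  # .get() avoids KeyError on partial states
       if not tab_id:
           return None  # skip PKCE instead of failing the whole token exchange
       entry = kv_lookup(tab_id)
       return entry.get("code_verifier") if entry else None
   ```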
   
   <details>
   <summary><b>Severity Level:</b> Major ⚠️</summary>
   
   ```mdx
   - ❌ OAuth2 token exchange flow may raise KeyError and fail.
   - ❌ Database OAuth2 connection setup blocked on callback.
   - ⚠️ Breaks PKCE only for requests missing tab_id.
   - ⚠️ Unit/integration tests may surface regressions.
   ```
   </details>
   
   ```suggestion
           tab_uuid = None
           tab_id = self._state.get("tab_id")
           if tab_id:
   ```
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. The OAuth2 token exchange flow decodes the state in validate() at
   superset/commands/database/oauth2.py:103 (self._state =
   decode_oauth2_state(self._parameters["state"])). This populates self._state 
with the
   TypedDict from superset/superset_typing.py:357 (OAuth2State, total=False).
   
   2. Call OAuth2StoreTokenCommand.run() which executes the PKCE lookup 
starting at
   superset/commands/database/oauth2.py:56-72. In the current code it indexes
   self._state["tab_id"] directly at line 58.
   
   3. If the decoded state does not include "tab_id" (allowed because 
OAuth2State is
   TypedDict total=False — see superset/superset_typing.py:357), Python will 
raise a KeyError
   at superset/commands/database/oauth2.py:58 before the code reaches the token 
exchange (the
   call to get_oauth2_token at lines 74-78).
   
   4. The failure occurs during handling of the OAuth2 provider callback (token 
exchange
   command execution). Observed result: a KeyError traceback originating at
   superset/commands/database/oauth2.py:58, aborting token storage and causing 
the OAuth2
   flow to fail for that callback.
   
   5. The proposed change (use self._state.get("tab_id") and guard) prevents 
KeyError and
   will simply skip PKCE when tab_id is absent, restoring backward 
compatibility with
   older/partial states. This is necessary because OAuth2State is explicitly 
non-total
   (superset/superset_typing.py:357) so absence of tab_id is a valid/expected 
shape.
   ```
   </details>
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/commands/database/oauth2.py
   **Line:** 58:58
   **Comment:**
        *Possible Bug: The new PKCE lookup logic assumes that the OAuth2 state 
always contains a `tab_id` key and indexes it directly; however `OAuth2State` 
is defined as a non-total `TypedDict`, so if a caller ever passes a state 
without `tab_id` (for example, older tokens, custom clients, or future reuse of 
this command with a reduced state), this will raise a `KeyError` and break the 
token exchange instead of simply skipping PKCE.
   
   Validate the correctness of the flagged issue. If correct, how can I resolve 
this? If you propose a fix, implement it and keep it concise.
   ```
   </details>



##########
tests/unit_tests/utils/oauth2_tests.py:
##########
@@ -93,3 +101,96 @@ def test_get_oauth2_access_token_base_no_refresh(mocker: MockerFixture) -> None:
 
     # check that token was deleted
     db.session.delete.assert_called_with(token)
+
+
+def test_generate_code_verifier_length() -> None:
+    """
+    Test that generate_code_verifier produces a string of valid length (RFC 7636).
+    """
+    code_verifier = generate_code_verifier()
+    # RFC 7636 requires 43-128 characters
+    assert 43 <= len(code_verifier) <= 128
+
+
+def test_generate_code_verifier_uniqueness() -> None:
+    """
+    Test that generate_code_verifier produces unique values.
+    """
+    verifiers = {generate_code_verifier() for _ in range(100)}
+    # All generated verifiers should be unique
+    assert len(verifiers) == 100
+
+
+def test_generate_code_verifier_valid_characters() -> None:
+    """
+    Test that generate_code_verifier only uses valid characters (RFC 7636).
+    """
+    code_verifier = generate_code_verifier()
+    # RFC 7636 allows: [A-Z] / [a-z] / [0-9] / "-" / "." / "_" / "~"
+    # URL-safe base64 uses: [A-Z] / [a-z] / [0-9] / "-" / "_"
+    valid_chars = set(
+        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"
+    )
+    assert all(char in valid_chars for char in code_verifier)
+
+
+def test_generate_code_challenge_s256() -> None:
+    """
+    Test that generate_code_challenge produces correct S256 challenge.
+    """
+    # Use a known code_verifier to verify the challenge computation
+    code_verifier = "dBjftJeZ4CVP-mB92K27uhbUJU1p1r_wW1gFWFOEjXk"
+
+    # Compute expected challenge manually
+    digest = hashlib.sha256(code_verifier.encode("ascii")).digest()
+    expected_challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
+
+    code_challenge = generate_code_challenge(code_verifier)
+    assert code_challenge == expected_challenge
+
+
+def test_generate_code_challenge_rfc_example() -> None:
+    """
+    Test PKCE code challenge against RFC 7636 Appendix B example.
+
+    See: https://datatracker.ietf.org/doc/html/rfc7636#appendix-B
+    """
+    # RFC 7636 example code_verifier (Appendix B)
+    code_verifier = "dBjftJeZ4CVP-mB92K27uhbUJU1p1r_wW1gFWFOEjXk"
+    # RFC 7636 expected code_challenge for S256 method
+    expected_challenge = "E9Melhoa2OwvFrEMTJguCHaoeK1t8URWbuGJSstw-cM"
+
+    code_challenge = generate_code_challenge(code_verifier)
+    assert code_challenge == expected_challenge
+
+
+def test_encode_decode_oauth2_state(
+    mocker: MockerFixture,
+) -> None:
+    """
+    Test that encode/decode cycle preserves state fields.
+    """
+    from superset.superset_typing import OAuth2State
+
+    mocker.patch(
+        "flask.current_app.config",
+        {
+            "SECRET_KEY": "test-secret-key",
+            "DATABASE_OAUTH2_JWT_ALGORITHM": "HS256",

Review Comment:
   **Suggestion:** The test currently asserts that the decoded OAuth2 state 
does not contain a `code_verifier` field. This contradicts the PKCE design, in 
which the verifier must survive the encode/decode cycle so it can be sent when 
exchanging the authorization code; the test will lock in broken behavior and 
prevent PKCE from working correctly. [logic error]
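   
   If the verifier is meant to travel inside the encoded state, as this comment 
argues, a minimal marshmallow sketch (illustrative field names, not the actual 
`OAuth2StateSchema`) shows how an optional `code_verifier` can survive the 
round trip:
   
   ```python
   from marshmallow import Schema, fields


   class StateSchema(Schema):
       database_id = fields.Integer(required=True)
       user_id = fields.Integer(required=True)
       default_redirect_uri = fields.String(required=True)
       tab_id = fields.String(required=True)
       # Optional, so states created without PKCE still validate.
       code_verifier = fields.String(required=False, load_default=None)


   schema = StateSchema()
   state = {
       "database_id": 1,
       "user_id": 2,
       "default_redirect_uri": "https://example.org/oauth2",
       "tab_id": "tab-uuid",
       "code_verifier": "test-code-verifier",
   }
   decoded = schema.load(schema.dump(state))
   assert decoded["code_verifier"] == "test-code-verifier"
   ```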
   
   <details>
   <summary><b>Severity Level:</b> Critical 🚨</summary>
   
   ```mdx
   - ❌ PKCE support broken for OAuth2 database connections.
   - ❌ Token exchange missing code_verifier parameter.
   - ⚠️ Unit tests accept incorrect behavior masking regressions.
   - ⚠️ OAuth2 flows for public clients become less secure.
   ```
   </details>
   
   ```suggestion
       "code_verifier": "test-code-verifier",
   }
   
   with freeze_time("2024-01-01"):
       encoded = encode_oauth2_state(state)
       decoded = decode_oauth2_state(encoded)
   
   assert decoded["database_id"] == 1
   assert decoded["user_id"] == 2
   assert decoded.get("code_verifier") == "test-code-verifier"
   ```
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. Inspect encode_oauth2_state implementation at 
superset/utils/oauth2.py:152-173 — it
   builds JWT payload with keys: exp, database_id, user_id, 
default_redirect_uri, tab_id (see
   lines 156-162) and does NOT include code_verifier in the encoded payload.
   
   2. Inspect decode_oauth2_state implementation at 
superset/utils/oauth2.py:204-218 — it
   decodes the JWT and loads it via oauth2_state_schema (lines 211-216). The 
schema
   (OAuth2StateSchema) defined at superset/utils/oauth2.py:176-195 only lists 
database_id,
   user_id, default_redirect_uri, tab_id and does not accept or preserve 
code_verifier (lines
   176-181, 189-194).
   
   3. Run the unit test
   tests/unit_tests/utils/oauth2_tests.py::test_encode_decode_oauth2_state 
which currently
   creates a state without a code_verifier and asserts "code_verifier" not in 
decoded (tests
   file lines shown in PR). Because encode_oauth2_state never includes 
code_verifier and the
   schema excludes unknown keys, a code_verifier present in the input state 
would be dropped
   during encode/decode — reproducing the assertion the test makes.
   
   4. Real PKCE flow requires the code_verifier to survive encode/decode so it 
can be sent on
   token exchange. To reproduce the breaking scenario concretely:
   
      - Modify the test to include a code_verifier in the input state
      (tests/unit_tests/utils/oauth2_tests.py: add "code_verifier": 
"test-code-verifier").
   
      - Call encode_oauth2_state(state) (superset/utils/oauth2.py:152) then
      decode_oauth2_state(encoded) (superset/utils/oauth2.py:204).
   
      - Observe that decoded does NOT contain "code_verifier" because the 
payload omits it at
      encode time (lines 156-162) and the schema excludes unknown fields 
(OAuth2StateSchema.
      Meta unknown = EXCLUDE at line 198).
   
   5. Conclusion: The test's current assertion that the decoded state must not 
contain
   code_verifier masks a real problem — encode_oauth2_state and the schema must 
be updated to
   preserve code_verifier if PKCE is intended. The improved test (proposed) 
will fail under
   current code, proving the bug.
   ```
   </details>
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** tests/unit_tests/utils/oauth2_tests.py
   **Line:** 172:179
   **Comment:**
        *Logic Error: The test currently asserts that the decoded OAuth2 state 
does not contain a `code_verifier` field, which contradicts the PKCE design 
where the verifier must survive the encode/decode cycle so it can be sent when 
exchanging the authorization code; this test will lock in broken behavior and 
prevent PKCE from working correctly.
   
   Validate the correctness of the flagged issue. If correct, how can I resolve 
this? If you propose a fix, implement it and keep it concise.
   ```
   </details>



##########
superset/db_engine_specs/base.py:
##########
@@ -474,10 +481,33 @@ def start_oauth2_dance(cls, database: Database) -> None:
        tab sends a message to the original tab informing that authorization was
        successful (or not), and then closes. The original tab will automatically
        re-run the query after authorization.
+
+        PKCE (RFC 7636) is used to protect against authorization code interception
+        attacks. A code_verifier is generated and stored server-side in the KV store,
+        while the code_challenge (derived from the verifier) is sent to the
+        authorization server.
         """
+        # prevent circular import
+        from superset.daos.key_value import KeyValueDAO
+
         tab_id = str(uuid4())
        default_redirect_uri = url_for("DatabaseRestApi.oauth2", _external=True)
 
+        # Generate PKCE code verifier (RFC 7636)
+        code_verifier = generate_code_verifier()
+
+        # Store the code_verifier server-side in the KV store, keyed by tab_id.
+        # This avoids exposing it in the URL/browser history via the JWT state.
+        KeyValueDAO.delete_expired_entries(KeyValueResource.PKCE_CODE_VERIFIER)
+        KeyValueDAO.create_entry(
+            resource=KeyValueResource.PKCE_CODE_VERIFIER,
+            value={"code_verifier": code_verifier},
+            codec=JsonKeyValueCodec(),
+            key=UUID(tab_id),
+            expires_on=datetime.now() + timedelta(minutes=5),
+        )
+        db.session.flush()

Review Comment:
   **Suggestion:** The PKCE `code_verifier` is written to the key-value store 
and then `OAuth2RedirectError` is raised without committing the transaction. In 
configurations where sessions are rolled back on exceptions, the 
`code_verifier` row is never actually persisted, so the subsequent token 
exchange cannot find it and the PKCE flow fails. Replacing the flush with a 
commit (or otherwise ensuring a commit here) guarantees that the verifier is 
durable before redirecting the user. [logic error]
   
   <details>
   <summary><b>Severity Level:</b> Critical 🚨</summary>
   
   ```mdx
   - ❌ OAuth2 PKCE token exchange fails for DB OAuth2 flows
   - ⚠️ OAuth2 database authentication unreliable in some deployments
   - ⚠️ Affects any engine using DATABASE_OAUTH2_CLIENTS config
   ```
   </details>
   
   ```suggestion
           db.session.commit()
   ```
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. Configure a Database with OAuth2 in app config and mark engine as 
OAuth2-enabled. (See
   superset/db_engine_specs/base.py start_oauth2_dance where PKCE is generated 
and stored;
   PKCE storage occurs at the snippet starting at 
superset/db_engine_specs/base.py:501.)
   
   2. Trigger an OAuth2 authorization flow from a frontend operation that hits 
the code path
   which raises OAuth2RedirectError. In practice this happens when a DB call 
raises an
   OAuth2RedirectError and execute() calls start_oauth2_dance (execute handles 
exceptions and
   calls cls.start_oauth2_dance(database) — see 
superset/db_engine_specs/base.py execute
   method which flows into start_oauth2_dance).
   
   3. start_oauth2_dance generates a code_verifier and calls 
KeyValueDAO.create_entry, then
   calls db.session.flush() (lines shown in the existing_code block at
   superset/db_engine_specs/base.py:501-509).
   
   4. Because an OAuth2RedirectError is raised immediately after (the flow 
raises
   OAuth2RedirectError to redirect the browser — see the raise at
   superset/db_engine_specs/base.py around the OAuth redirect), some 
configurations roll back
   the current DB session on exceptions. In those setups the earlier 
db.session.flush() does
   not persist the KeyValueDAO entry (it can be rolled back), so the later 
token exchange
   that looks up the verifier in the KV store fails to find it and PKCE token 
exchange is
   rejected.
   
   5. Reproduce locally by: (a) enabling DATABASE_OAUTH2_CLIENTS for an engine, 
(b) causing a
   DB operation that triggers OAuth2 redirect (the path through execute() -> 
needs_oauth2()
   -> start_oauth2_dance()), (c) inspect the KV store for the tab_id key — the 
entry will be
   missing if the session was rolled back because only flush() was called. 
Observed failure:
   the token exchange step cannot validate PKCE because stored verifier does 
not exist.
   
   6. Explanation: a flush() only sends SQL to the DB within the transaction; 
without commit
   the row can be rolled back by exception handlers. Replacing flush() with 
commit() (or
   ensuring the entry is persisted outside the request transaction) guarantees 
durability
   across the redirect and subsequent token exchange.
   ```
   </details>
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/db_engine_specs/base.py
   **Line:** 509:509
   **Comment:**
        *Logic Error: The PKCE `code_verifier` is written to the key-value 
store and then `OAuth2RedirectError` is raised without committing the 
transaction, so in configurations where sessions are rolled back on exceptions 
the `code_verifier` row will never actually be persisted and the subsequent 
token exchange will not find it, causing the PKCE flow to fail; replacing the 
flush with a commit (or otherwise ensuring a commit here) guarantees the 
verifier is durable before redirecting the user.
   
   Validate the correctness of the flagged issue. If correct, how can I resolve 
this? If you propose a fix, implement it and keep it concise.
   ```
   </details>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

