cloud-fan opened a new pull request, #56630:
URL: https://github.com/apache/spark/pull/56630

   ### What changes were proposed in this pull request?
   
   `JDBCOptions.getRedactUrl()` is used to surface the JDBC URL in logs and in the
   `FAILED_JDBC.*` error messages (e.g. `FAILED_JDBC.CONNECTION`). Today it is implemented as:
   
   ```scala
   def getRedactUrl(): String = Utils.redact(SQLConf.get.stringRedactionPattern, url)
   ```
   
   `SQLConf.get.stringRedactionPattern` (`spark.sql.redaction.string.regex`) is **unset by
   default**, so `Utils.redact(None, url)` returns the URL unchanged. JDBC URLs routinely embed
   credentials, either as userinfo in the authority (`jdbc:mysql://user:password@host/db`) or as
   connection properties (`...?password=secret`, `...;token=abc`), so by default those secrets are
   printed verbatim in error messages and logs.
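
   To make the default behaviour concrete, here is a minimal standalone model of an
   optional-regex redaction. `redactModel` and `DefaultRedactionSketch` are hypothetical names
   used only for illustration; this is not Spark's `Utils.redact`, just a sketch of the
   behaviour described above, where no configured pattern means the text passes through as-is.

   ```scala
   import scala.util.matching.Regex

   object DefaultRedactionSketch {
     // Hypothetical stand-in for an optional-regex redaction (assumption: with no
     // pattern configured, the input is returned unchanged).
     def redactModel(regex: Option[Regex], text: String): String =
       regex match {
         case Some(r) if text != null && text.nonEmpty =>
           // Replace every match of the configured pattern with the redaction marker.
           r.replaceAllIn(text, Regex.quoteReplacement("*********(redacted)"))
         case _ =>
           // No pattern (the default), or nothing to redact: credentials stay visible.
           text
       }
   }
   ```

   With `None`, a URL such as `jdbc:mysql://user:password@host/db` comes back verbatim, which
   is the leak this PR addresses.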
   
   This PR makes `getRedactUrl()` redact credentials unconditionally, independent of the optional
   `spark.sql.redaction.string.regex`:
   - A new `JDBCOptions.redactUrl(url, regex)` helper redacts the password in the authority's
     userinfo (keeping the username) and the values of sensitive connection properties
     (`password`, `pwd`, `token`, `secret`, `accessKey`, `credential`, …), then still applies the
     user-configured `regex` on top, preserving existing behavior.
   - `getRedactUrl()` now delegates to this helper.
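
   The steps in the list above can be sketched roughly as follows. This is an illustrative
   approximation, not the code in this PR: the object name, the two regular expressions, the
   exact-key matching, and the key list (the description elides the full list with "…") are
   all assumptions. Only the `redactUrl(url, regex)` shape and the `*********(redacted)`
   marker come from the description.

   ```scala
   import scala.util.matching.Regex

   object JdbcUrlRedactionSketch {
     // Marker shown in redacted messages, per the PR description.
     val Replacement = "*********(redacted)"

     // Assumed key list: only the keys named in the description; the real list is longer.
     private val sensitiveKeys =
       Seq("password", "pwd", "token", "secret", "accessKey", "credential")

     // Assumed pattern for "//user:password@": group 1 keeps "//user:", group 2 keeps "@".
     private val userInfoPassword: Regex = """(//[^/:@?;#\s]*:)[^/@\s]*(@)""".r

     // Assumed pattern for "<sep>key=value" where <sep> is '?', '&' or ';' (case-insensitive key).
     private val sensitiveProperty: Regex =
       ("(?i)([?&;](?:" + sensitiveKeys.mkString("|") + ")=)[^&;#\\s]*").r

     def redactUrl(url: String, regex: Option[Regex]): String =
       if (url == null || url.isEmpty) {
         url // null/empty inputs pass through untouched
       } else {
         // 1. Redact the userinfo password, keeping the username.
         val noUserInfoSecret = userInfoPassword.replaceAllIn(
           url, m => Regex.quoteReplacement(m.group(1) + Replacement + m.group(2)))
         // 2. Redact values of sensitive connection properties, keeping the keys.
         val noPropertySecret = sensitiveProperty.replaceAllIn(
           noUserInfoSecret, m => Regex.quoteReplacement(m.group(1) + Replacement))
         // 3. Still apply the optional user-configured pattern on top.
         regex.fold(noPropertySecret)(
           _.replaceAllIn(noPropertySecret, Regex.quoteReplacement(Replacement)))
       }
   }
   ```

   Under these assumptions, `redactUrl("jdbc:mysql://user:password@host/db", None)` keeps the
   scheme, username, host, and database and replaces only the password, while a URL such as
   `jdbc:mysql://host:3306/db` is returned unchanged because the port is not followed by `@`.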
   
   Non-sensitive parts of the URL (scheme, host, port, database, and non-secret properties) are
   kept, so error messages remain useful for debugging while no longer leaking credentials.
   
   ### Why are the changes needed?
   
   `FAILED_JDBC.*` errors and connection logs can currently leak JDBC credentials to anyone who
   can read query error messages or driver logs, because the default redaction is a no-op.
   Redacting at the single `getRedactUrl()` chokepoint fixes every `FAILED_JDBC.*` call site at
   once.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. When a JDBC URL contains embedded credentials, `FAILED_JDBC.*` error messages (and any
   log line built from `getRedactUrl()`) now show the credentials replaced with
   `*********(redacted)` instead of in clear text. URLs without embedded credentials are
   unchanged.
   
   ### How was this patch tested?
   
   New unit test `redactUrl redacts credentials embedded in a JDBC URL` in `JdbcUtilsSuite`,
   covering: credential-free URLs (passthrough), userinfo password redaction (username kept),
   bare-username passthrough, sensitive connection properties across `&`/`;` separators and
   mixed case, non-sensitive properties kept, combined userinfo + property redaction, and
   null/empty inputs.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Opus 4.8
   

