MaxGekk commented on code in PR #56630:
URL: https://github.com/apache/spark/pull/56630#discussion_r3447465703
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala:
##########
@@ -302,6 +304,41 @@ object JDBCOptions {
name
}
+ // The userinfo of a URL authority (everything between "//" and the
authority's last "@") may
+ // carry credentials, e.g. the "user:password" in "//user:password@host".
"[^/?#]*" extends
+ // greedily to the last "@" before the authority ends (at the first "/", "?"
or "#"), so an "@"
+ // embedded in the password is covered as well; the whole userinfo is
redacted.
+ private val URL_USER_INFO_PATTERN = "(//)[^/?#]*(@)".r
Review Comment:
Blocking: this misses Oracle Thin's standard credential syntax, where the
goal of redacting credentials "unconditionally" is the whole point. Credentials
before an `@` that isn't reached via a `//...@` authority survive in clear text
— I confirmed all three forms:
- `jdbc:oracle:thin:scott/tiger@host:1521/service` → unchanged (no `//`, no
`?`/`;`)
- `jdbc:oracle:thin:scott/tiger@//host:1521/service` → unchanged (the `//`
sits *after* the `@`, so `(//)[^/?#]*(@)` never matches)
- `…scott/tiger@//host:1521/service?x=1` → only the `?x=1` tail is redacted;
`scott/tiger` survives
Oracle is a mainstream driver and this is its primary embedded-credential
form, so a feature that announces unconditional redaction while silently
exempting it is a false-confidence trap. Resolve either by broadening redaction
to also catch a `…@host` credential form not preceded by `//` (carefully, so an
`@` legitimately appearing in a path/database segment isn't over-matched), or
by scoping the "unconditionally" wording and documenting Oracle Thin inline
credentials as unhandled.
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala:
##########
@@ -302,6 +304,41 @@ object JDBCOptions {
name
}
+ // The userinfo of a URL authority (everything between "//" and the
authority's last "@") may
+ // carry credentials, e.g. the "user:password" in "//user:password@host".
"[^/?#]*" extends
Review Comment:
Edge-case partial leak: because `[^/?#]*` stops at `?`, a password
containing `?` (or `#`) prevents the userinfo `@` from being reached, and the
property pass only redacts from the `?` onward. So
`jdbc:mysql://user:pa?ss@host/db` → `jdbc:mysql://user:pa?*********(redacted)`,
leaking `user:pa`. Raw `?`/`#` in userinfo is non-RFC-compliant (should be
percent-encoded), so this is low-likelihood — noting it for completeness. (`;`
in a password is fine: it's not in the excluded class, so the `@` is still
reached.)
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala:
##########
@@ -302,6 +304,41 @@ object JDBCOptions {
name
}
+ // The userinfo of a URL authority (everything between "//" and the
authority's last "@") may
+ // carry credentials, e.g. the "user:password" in "//user:password@host".
"[^/?#]*" extends
+ // greedily to the last "@" before the authority ends (at the first "/", "?"
or "#"), so an "@"
+ // embedded in the password is covered as well; the whole userinfo is
redacted.
+ private val URL_USER_INFO_PATTERN = "(//)[^/?#]*(@)".r
+
+ // The query / connection-property portion of a JDBC URL (everything from
the first "?" or ";")
+ // can carry credentials as key=value pairs (e.g. "password=...",
"token=..."), and the set of
+ // credential-bearing property names is driver-specific and open-ended.
Rather than enumerate
+ // them, redact the whole portion, keeping only the structural
"scheme://host:port/database".
+ private val URL_PARAMS_PATTERN = "([?;]).*".r
+
+ /**
+ * Redacts credentials that may be embedded in a JDBC URL so it is safe to
surface in logs and
+ * error messages. Only the structural prefix (scheme, host, port,
database/path) is preserved;
Review Comment:
Minor: the doc says only the "structural prefix (scheme, host, port,
database/path)" is preserved, but for `;`/`?`-property URLs the database name
is carried as a property and is redacted — the test asserts
`jdbc:sqlserver://localhost:1433;*********(redacted)` for
`;databaseName=testdb`. Worth noting the db name is preserved only when it's
part of the `/`-path, not when carried as a property.
##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtilsSuite.scala:
##########
@@ -69,4 +70,48 @@ class JdbcUtilsSuite extends SparkFunSuite {
condition = "PARSE_SYNTAX_ERROR",
parameters = Map("error" -> "'.'", "hint" -> ""))
}
+
+ test("redactUrl redacts credentials embedded in a JDBC URL") {
Review Comment:
Every assertion passes `regex = None`, so the "user-configured
`spark.sql.redaction.string.regex` is still applied on top" behavior — the one
piece of prior behavior this PR preserves — has no coverage. Consider a
`Some(regex)` case asserting it redacts an additional substring of the
structural prefix (e.g. the host).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]