cloud-fan commented on code in PR #54765:
URL: https://github.com/apache/spark/pull/54765#discussion_r2942458744
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionResolution.scala:
##########
@@ -75,21 +75,52 @@ class FunctionResolution(
/**
 * Produces the ordered list of fully qualified candidate names for resolution.
+ * Every candidate is 3-part (catalog.database.function), assuming search path
+ * entries are catalog.database (2-part).
 *
 * @param nameParts The function name parts.
- * @return A sequence of fully qualified function names to attempt resolution with.
+ * @return A sequence of fully qualified 3-part names to attempt resolution with.
Review Comment:
The Scaladoc says "Every candidate is 3-part" but line 117 returns 2-part names for catalogs without a default namespace (the comment on line 116 even says "use 2-part").
```suggestion
 * Candidates are normally 3-part (catalog.database.function), assuming search path
 * entries are catalog.database (2-part). A 2-part candidate is returned when a
 * V2 catalog has no default namespace.
 *
 * @param nameParts The function name parts.
 * @return A sequence of candidate names (usually 3-part) to attempt resolution with.
```
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionResolution.scala:
##########
@@ -75,21 +75,52 @@ class FunctionResolution(
/**
 * Produces the ordered list of fully qualified candidate names for resolution.
+ * Every candidate is 3-part (catalog.database.function), assuming search path
+ * entries are catalog.database (2-part).
 *
 * @param nameParts The function name parts.
- * @return A sequence of fully qualified function names to attempt resolution with.
+ * @return A sequence of fully qualified 3-part names to attempt resolution with.
*/
private def resolutionCandidates(nameParts: Seq[String]): Seq[Seq[String]] = {
+  def ensureThreePart(parts: Seq[String]): Seq[String] =
+    if (parts.length == 3) parts
+    else if (parts.length == 2) currentCatalogPath.head +: parts
+    else parts // Safety fallback; valid paths yield 2 or 3 parts before this point.
+
  if (nameParts.size == 1) {
+    // Search path entries are 2-part (catalog.database), so path ++ nameParts is always 3-part.
    val searchPath = SQLConf.get.resolutionSearchPath(currentCatalogPath)
    searchPath.map(_ ++ nameParts)
  } else {
    nameParts.size match {
-      case 2 if FunctionResolution.sessionNamespaceKind(nameParts).isDefined =>
-        // Partially qualified builtin/session: try persistent first so user schema wins
-        Seq(nameParts, Seq(CatalogManager.SYSTEM_CATALOG_NAME) ++ nameParts)
+      case 2
+          if FunctionResolution.sessionNamespaceKind(nameParts).isDefined =>
+        // Partially qualified builtin/session: produce both candidates as 3-part
+        // (catalog.database.func).
+        val systemCandidate = CatalogManager.SYSTEM_CATALOG_NAME +: nameParts
+        val persistentCandidate = currentCatalogPath.head +: nameParts
+        if (SQLConf.get.prioritizeSystemCatalog) {
+          Seq(systemCandidate, persistentCandidate)
+        } else {
+          Seq(persistentCandidate, systemCandidate)
+        }
+      case 2 if catalogManager.isCatalogRegistered(nameParts(0)) =>
Review Comment:
This `case 2 if catalogManager.isCatalogRegistered` branch is new behavior not present in the old code and not mentioned in the PR description. Previously, 2-part names like `mycatalog.func` passed through unchanged to `resolveQualifiedFunction`, where `CatalogAndIdentifier.unapply` would create an empty-namespace identifier — the same convention used for tables (`mycatalog.mytable` also gets an empty namespace, and the catalog fills in its default internally).
The new code explicitly fetches `cat.defaultNamespace()` and pre-fills it, which:
- Is **inconsistent with table resolution**, where V2 catalogs handle empty namespaces themselves
- **Drops multi-level default namespaces** — `defaultNs(0)` only takes the first element, so `Array("ns1", "ns2")` loses `"ns2"`
- Has no test coverage
This should either be removed (let `CatalogAndIdentifier` handle it, as before) or justified and tested separately.
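To make the multi-level namespace point concrete, here is a minimal standalone sketch (hypothetical values, not the PR's actual code) of how taking only `defaultNs(0)` loses deeper namespace levels:

```scala
// Hypothetical illustration: a V2 catalog whose default namespace is two levels deep.
val defaultNs: Array[String] = Array("ns1", "ns2")
val nameParts: Seq[String] = Seq("mycatalog", "func")

// What pre-filling with only defaultNs(0) produces: "ns2" is silently dropped.
val prefilled: Seq[String] = Seq(nameParts.head, defaultNs(0), nameParts.last)
// prefilled == Seq("mycatalog", "ns1", "func")

// Preserving every level would require splicing the whole namespace.
val full: Seq[String] = nameParts.head +: (defaultNs.toSeq :+ nameParts.last)
// full == Seq("mycatalog", "ns1", "ns2", "func")
```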
##########
sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala:
##########
@@ -363,24 +364,55 @@ class SparkSessionExtensions {
  private[this] val injectedTableFunctions =
    mutable.Buffer.empty[TableFunctionDescription]
+  /**
+   * Normalizes an extension function identifier to a fully qualified one for registration.
+   * Accepts unqualified (1-part), fully qualified (3-part), or 2-part system names
+   * (builtin.func, session.func). It rejects other partially qualified names.
+   */
+  private def fullyQualifiedFunctionIdentifier(name: FunctionIdentifier): FunctionIdentifier = {
+    (name.catalog.isEmpty, name.database.isEmpty) match {
+      case (true, true) =>
+        // Unqualified names are registered in system.builtin.
+        FunctionRegistry.builtinFunctionIdentifier(name.funcName)
+      case (false, false) =>
+        // Fully qualified names are used as-is.
+        name
+      case (true, false) =>
Review Comment:
`fullyQualifiedFunctionIdentifier` maps ALL 2-part names to the system catalog, not just `builtin` and `session`. If a user calls `injectFunction` with `FunctionIdentifier("func", Some("custom_db"))`, it would be stored as `system.custom_db.func` — unreachable by any standard resolution path. No error, the function just silently never resolves.
Consider validating that the database is `builtin` or `session`, or falling through to the invalid case for other databases.
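A sketch of the suggested guard, with illustrative names (`SystemCatalog`, `systemDatabases`, `qualify` are hypothetical, not Spark's actual API):

```scala
// Hypothetical sketch: only builtin/session 2-part names are mapped into the
// system catalog; other partially qualified databases are rejected loudly.
val SystemCatalog = "system"
val systemDatabases = Set("builtin", "session")

def qualify(funcName: String, database: Option[String], catalog: Option[String]): Seq[String] =
  (catalog, database) match {
    case (Some(c), Some(d)) => Seq(c, d, funcName)               // fully qualified: as-is
    case (None, None) => Seq(SystemCatalog, "builtin", funcName) // unqualified
    case (None, Some(d)) if systemDatabases.contains(d) =>
      Seq(SystemCatalog, d, funcName)                            // builtin.func / session.func
    case _ =>
      // Reject e.g. ("func", Some("custom_db"), None) instead of silently
      // storing an unreachable system.custom_db.func.
      throw new IllegalArgumentException(s"Cannot register partially qualified function: $funcName")
  }
```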
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/command/functions.scala:
##########
@@ -197,8 +197,8 @@ case class ShowFunctionsCommand(
     sparkSession.sessionState.catalog
       .listFunctions(db, pattern.getOrElse("*"))
       .collect {
-        case (f, "USER") if showUserFunctions => f.unquotedString
-        case (f, "SYSTEM") if showSystemFunctions => f.unquotedString
+        case (f, "USER") if showUserFunctions => f.displayNameForShowFunctions
Review Comment:
With the 3-part key change, `listFunctions` now produces different tuple values for a builtin and a same-named temp function. Both display as "abs" but have different `FunctionIdentifier` shapes and different categories ("SYSTEM" vs "USER"), so the upstream `.distinct` on tuples doesn't collapse them. `ShowFunctionsCommand` would show "abs" twice when a temp function shadows a builtin.
The v2 path (`ShowFunctionsExec`) avoids this with `.distinct` on display-name strings. Consider adding `.distinct` to `functionNames` here as well — e.g., after the `.collect { ... }` block.
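A tiny standalone repro of the duplication, with hypothetical tuples standing in for the `listFunctions` output:

```scala
// Hypothetical stand-ins for listFunctions output: distinct tuples, same display name.
val listed: Seq[(String, String)] = Seq(
  ("abs", "SYSTEM"), // builtin abs, 3-part identifier upstream
  ("abs", "USER")    // temp abs shadowing it, different identifier shape
)

// The upstream .distinct ran on the tuples, which differ, so both survive rendering.
val shown: Seq[String] = listed.distinct.map { case (name, _) => name }
// shown == Seq("abs", "abs")

// Deduplicating on the rendered display names collapses the shadowed pair.
val deduped: Seq[String] = shown.distinct
// deduped == Seq("abs")
```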
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala:
##########
@@ -210,10 +210,26 @@ trait SimpleFunctionRegistryBase[T] extends FunctionRegistryBase[T] with Logging
  protected val functionBuilders =
    new mutable.HashMap[FunctionIdentifier, (ExpressionInfo, FunctionBuilder)]
-  // Resolution of the function name is always case insensitive, but the database name
-  // depends on the caller
+  // Function name resolution is case-insensitive; database and catalog are preserved so
+  // system.session.foo and spark_catalog.session.foo do not collide. Unqualified (1-part)
+  // lookups resolve to the builtin 3-part key so that callers using FunctionIdentifier("time")
+  // find functions registered as system.builtin.time. A 2-part name with database=session
+  // and no catalog resolves to the temp 3-part key.
  private def normalizeFuncName(name: FunctionIdentifier): FunctionIdentifier = {
-    FunctionIdentifier(name.funcName.toLowerCase(Locale.ROOT), name.database)
+    if (name.database.isEmpty && name.catalog.isEmpty) {
Review Comment:
`normalizeFuncName` now has branches beyond lowercasing, encoding the assumption "unqualified = builtin" and "database=session without catalog = system.session" in the storage layer. The registry is a storage layer — ideally it should only normalize case. The caller (SessionCatalog / FunctionResolution) already knows what kind of function it's registering or looking up and should provide the exact storage key.
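The separation suggested above can be sketched with illustrative types (`FuncId`, `normalizeCaseOnly`, `builtinKey` are hypothetical, not Spark's actual classes):

```scala
// Sketch of the suggestion: the registry normalizes case only, and callers
// pass the exact 3-part storage key.
import java.util.Locale

case class FuncId(catalog: Option[String], database: Option[String], funcName: String)

// Storage layer: case normalization and nothing else.
def normalizeCaseOnly(name: FuncId): FuncId =
  name.copy(funcName = name.funcName.toLowerCase(Locale.ROOT))

// Caller side (e.g. SessionCatalog) decides what "unqualified" means before
// the registry ever sees the name, keeping policy out of the storage layer.
def builtinKey(funcName: String): FuncId =
  FuncId(Some("system"), Some("builtin"), funcName)

// normalizeCaseOnly(builtinKey("TIME")) == FuncId(Some("system"), Some("builtin"), "time")
```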
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]