[jira] [Updated] (SPARK-57725) NPE in AttributeSeq column resolution when an attribute has a null name

Max Gekk (Jira) Sat, 27 Jun 2026 01:30:09 -0700


     [ https://issues.apache.org/jira/browse/SPARK-57725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Max Gekk updated SPARK-57725:
-----------------------------
    Description: 
h2. Summary

  Column resolution throws an internal {{NullPointerException}} when the input plan exposes an
  {{Attribute}} whose {{name}} is {{null}}. {{AttributeSeq}} builds case-insensitive name lookup
  maps keyed on {{attr.name.toLowerCase(Locale.ROOT)}}, and the grouping key function dereferences
  the name without a null check, so a single null-named attribute aborts resolution of the whole
  operator with an {{INTERNAL_ERROR}} instead of resolving the other columns (or producing a normal
  unresolved-column error).

h2. Affected code

  {{org.apache.spark.sql.catalyst.expressions.package.AttributeSeq}} (file
  {{sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala}}). The
  {{direct}}, {{qualified}}, {{qualified3Part}} and {{qualified4Part}} lazy maps all group by
  {{_.name.toLowerCase(Locale.ROOT)}}:

  {code:scala}
  @transient private lazy val direct: Map[String, Seq[Attribute]] = {
    unique(attrs.groupBy(_.name.toLowerCase(Locale.ROOT)))   // NPE if a.name == null
  }
  {code}

  This grouping has been present (unchanged) since well before SPARK-50037 reworked
  {{AttributeSeq.resolve}}, so the issue is long-standing rather than a recent regression.
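
  The failure mode can be reproduced without Spark on the classpath. The sketch below is not
  Catalyst code: {{Attr}} and {{groupByLowerName}} are hypothetical stand-ins that only mimic the
  shape of the key function quoted above, to show that one null name fails the whole grouping
  rather than just its own entry.

```scala
import java.util.Locale

// Hypothetical stand-in for Catalyst's Attribute; only the name matters here.
final case class Attr(name: String)

object NullNameGroupByDemo {
  // Same shape as the key function in the issue: lower-case the name, no null check.
  def groupByLowerName(attrs: Seq[Attr]): Map[String, Seq[Attr]] =
    attrs.groupBy(_.name.toLowerCase(Locale.ROOT))

  def main(args: Array[String]): Unit = {
    val attrs = Seq(Attr("A"), Attr(null))
    // The key function throws on the null-named element, so no map is built at all,
    // not even the entry for "a".
    val outcome = scala.util.Try(groupByLowerName(attrs))
    println(outcome.isFailure) // prints true
  }
}
```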

h2. Reproduction (minimal, Catalyst level)

  {code:scala}
  import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference}
  import org.apache.spark.sql.catalyst.analysis.caseInsensitiveResolution
  import org.apache.spark.sql.types.IntegerType

  val attrs: Seq[Attribute] = Seq(
    AttributeReference("a", IntegerType)(),
    AttributeReference(null, IntegerType)())   // an attribute with a null name

  // Resolving any real column forces the case-insensitive name map and throws:
  attrs.resolve(Seq("a"), caseInsensitiveResolution)
  {code}

  Result:
  {code:none}
  java.lang.NullPointerException: Cannot invoke "String.toLowerCase(java.util.Locale)" because
  the return value of "org.apache.spark.sql.catalyst.expressions.Attribute.name()" is null
    at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.$anonfun$direct$1(package.scala:...)
    at scala.collection.IterableOps.groupBy(...)
    at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.direct$lzycompute(package.scala:...)
    at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.matchWithThreeOrLessQualifierParts(...)
    at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.getCandidatesForResolution(...)
    at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(...)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:...)
    at org.apache.spark.sql.catalyst.analysis.ColumnResolutionHelper.resolveExpressionByPlanChildren(...)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences...
  {code}

  In practice this surfaces during normal analysis (e.g. a {{DataFrame.filter}} whose child plan
  carries an attribute with a null name) as an uncaught {{INTERNAL_ERROR}} (SQLSTATE {{XX000}}).

h3. How a null-named attribute arises (one example)

  The Scala DataFrame API builds attributes directly from the schema
  ({{StructField -> DataTypeUtils.toAttribute -> AttributeReference}}), and {{StructField}} permits a
  null {{name}} (no {{require(name != null)}}), so a null field name yields a null-named attribute:

  {code:scala}
  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types._

  val schema = StructType(Seq(StructField(null, IntegerType), StructField("b", IntegerType)))
  val df = spark.createDataFrame(spark.sparkContext.parallelize(Seq(Row(1, 2))), schema)
  df.select("b").collect()   // forces resolution -> the NPE above
  {code}

  Note: this is not reproducible from PySpark directly -- {{pyspark.sql.types.StructField}} asserts
  the field name is a string ({{assert isinstance(name, str)}}), so a null name can only originate on
  the JVM side (e.g. an internal or connector-produced attribute), which is how it is observed in
  practice.

h2. Root cause

  {{StructField}} permits a null {{name}} (no {{require(name != null)}}), and the name flows
  unchanged through {{DataTypeUtils.toAttribute}} into {{AttributeReference}}. When such an attribute
  reaches {{AttributeSeq}}, the {{groupBy(_.name.toLowerCase(...))}} key function NPEs. The same
  null-unsafe {{_.name.toLowerCase}} pattern exists in all four name maps.

h2. Proposed fix

  Exclude null-named attributes when building the case-insensitive name maps. A null-named attribute
  is unaddressable by any column reference (a reference's name parts are never null), so dropping it
  from the name maps cannot change resolution of any legitimate reference. The change converts the
  hard {{NullPointerException}} into correct resolution of the remaining (named) attributes, or a
  normal unresolved-column error if the null-named column is referenced:

  {code:scala}
  // Build the name maps from attributes that actually have a name.
  private lazy val namedAttrs: Seq[Attribute] = attrs.filter(_.name != null)
  // ... use `namedAttrs` instead of `attrs` in direct/qualified/qualified3Part/qualified4Part.
  {code}

  A regression test asserting that {{AttributeSeq.resolve}} no longer throws when a null-named
  attribute is present (covering the unqualified {{direct}} map and the qualified maps) accompanies
  the fix.
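
  As an illustration of the filter-then-group idea, here is a minimal Spark-free sketch. It is not
  the actual patch or test: {{Attr}}, {{nameMap}} and {{resolve}} are hypothetical names, and the
  lookup is deliberately simplified (it returns {{None}} for both missing and ambiguous names,
  whereas the real resolver reports these cases as analysis errors).

```scala
import java.util.Locale

// Hypothetical stand-in for Catalyst's Attribute; only the name matters here.
final case class Attr(name: String)

object NamedAttrsSketch {
  // Drop null-named entries first, then group case-insensitively by name.
  def nameMap(attrs: Seq[Attr]): Map[String, Seq[Attr]] = {
    val namedAttrs = attrs.filter(_.name != null)
    namedAttrs.groupBy(_.name.toLowerCase(Locale.ROOT))
  }

  // Simplified lookup: a reference resolves only if exactly one attribute matches.
  def resolve(attrs: Seq[Attr], ref: String): Option[Attr] =
    nameMap(attrs).get(ref.toLowerCase(Locale.ROOT)) match {
      case Some(Seq(single)) => Some(single)
      case _                 => None
    }
}
```

  With this shape, {{NamedAttrsSketch.resolve(Seq(Attr("a"), Attr(null)), "A")}} yields
  {{Some(Attr("a"))}} instead of throwing, and a lookup for a name that is not present yields
  {{None}}, which mirrors the behaviour a regression test for the real fix would assert.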


> NPE in AttributeSeq column resolution when an attribute has a null name
> -----------------------------------------------------------------------
>
>                 Key: SPARK-57725
>                 URL: https://issues.apache.org/jira/browse/SPARK-57725
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 4.3.0
>            Reporter: Max Gekk
>            Priority: Major
>


