[ https://issues.apache.org/jira/browse/SPARK-57725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Max Gekk updated SPARK-57725:
-----------------------------
Description:
h2. Summary
Column resolution throws an internal {{NullPointerException}} when the input
plan exposes an {{Attribute}} whose {{name}} is {{null}}. {{AttributeSeq}}
builds case-insensitive name lookup maps keyed on
{{attr.name.toLowerCase(Locale.ROOT)}}, and the grouping key function
dereferences the name without a null check. As a result, a single null-named
attribute aborts resolution of the whole operator with an {{INTERNAL_ERROR}}
instead of resolving the other columns (or producing a normal
unresolved-column error).
h2. Affected code
{{org.apache.spark.sql.catalyst.expressions.package.AttributeSeq}} (file
{{sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala}}).
The {{direct}}, {{qualified}}, {{qualified3Part}} and {{qualified4Part}} lazy
maps all group by {{_.name.toLowerCase(Locale.ROOT)}}:
{code:scala}
@transient private lazy val direct: Map[String, Seq[Attribute]] = {
  unique(attrs.groupBy(_.name.toLowerCase(Locale.ROOT))) // NPE if a.name == null
}
{code}
This grouping has been present (unchanged) since well before SPARK-50037
reworked {{AttributeSeq.resolve}}, so the issue is long-standing rather than a
recent regression.
h2. Reproduction (minimal, Catalyst level)
{code:scala}
import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference}
import org.apache.spark.sql.catalyst.analysis.caseInsensitiveResolution
import org.apache.spark.sql.types.IntegerType

val attrs: Seq[Attribute] = Seq(
  AttributeReference("a", IntegerType)(),
  AttributeReference(null, IntegerType)()) // an attribute with a null name

// Resolving any real column forces the case-insensitive name map and throws:
attrs.resolve(Seq("a"), caseInsensitiveResolution)
{code}
Result:
{code:none}
java.lang.NullPointerException: Cannot invoke "String.toLowerCase(java.util.Locale)" because the return value of "org.apache.spark.sql.catalyst.expressions.Attribute.name()" is null
    at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.$anonfun$direct$1(package.scala:...)
    at scala.collection.IterableOps.groupBy(...)
    at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.direct$lzycompute(package.scala:...)
    at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.matchWithThreeOrLessQualifierParts(...)
    at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.getCandidatesForResolution(...)
    at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(...)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:...)
    at org.apache.spark.sql.catalyst.analysis.ColumnResolutionHelper.resolveExpressionByPlanChildren(...)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences...
{code}
In practice this surfaces during normal analysis (e.g. a {{DataFrame.filter}}
whose child plan carries an attribute with a null name) as an uncaught
{{INTERNAL_ERROR}} (SQLSTATE {{XX000}}).
h3. How a null-named attribute can arise (example)
The Scala DataFrame API builds attributes directly from the schema
({{StructField -> DataTypeUtils.toAttribute -> AttributeReference}}), and
{{StructField}} permits a null {{name}} (no {{require(name != null)}}), so a
null field name yields a null-named attribute:
{code:scala}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val schema = StructType(Seq(StructField(null, IntegerType), StructField("b", IntegerType)))
val df = spark.createDataFrame(spark.sparkContext.parallelize(Seq(Row(1, 2))), schema)
df.select("b").collect() // forces resolution -> the NPE above
{code}
Note: this is not reproducible from PySpark directly.
{{pyspark.sql.types.StructField}} asserts that the field name is a string
({{assert isinstance(name, str)}}), so a null name can only originate on the
JVM side (e.g. an internal or connector-produced attribute), which is how it
is observed in practice.
h2. Root cause
{{StructField}} permits a null {{name}} (no {{require(name != null)}}), and
the name flows unchanged through {{DataTypeUtils.toAttribute}} into
{{AttributeReference}}. When such an attribute reaches {{AttributeSeq}}, the
{{groupBy(_.name.toLowerCase(...))}} key function throws a
{{NullPointerException}}. The same null-unsafe {{_.name.toLowerCase}} pattern
exists in all four name maps.
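The failure mode does not depend on Spark itself. The following is a minimal sketch, assuming a hypothetical {{DemoAttr}} case class as a stand-in for Catalyst's {{Attribute}} (it is not a Spark class, and {{buildDirect}} is an illustrative name). It shows that one null name makes the whole {{groupBy}} fail, so no map is built even for the named entries:

```scala
import java.util.Locale
import scala.util.Try

// Hypothetical stand-in for Catalyst's Attribute; not a Spark class.
final case class DemoAttr(name: String)

object NullNameGroupByDemo {
  val attrs: Seq[DemoAttr] = Seq(DemoAttr("a"), DemoAttr(null), DemoAttr("B"))

  // Same shape as the name maps: the key function dereferences `name`
  // without a null check.
  def buildDirect(in: Seq[DemoAttr]): Map[String, Seq[DemoAttr]] =
    in.groupBy(_.name.toLowerCase(Locale.ROOT))

  // Wrapped in Try so the failure can be inspected instead of propagating.
  val result: Try[Map[String, Seq[DemoAttr]]] = Try(buildDirect(attrs))

  def main(args: Array[String]): Unit =
    println(s"map construction failed: ${result.isFailure}")
}
```

Because the exception escapes from the key function, the entries for {{a}} and {{B}} are lost together with the null-named one, which matches the report that resolution of the whole operator is aborted.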
h2. Proposed fix
Exclude null-named attributes when building the case-insensitive name maps. A
null-named attribute is unaddressable by any column reference, because a
reference's name parts are never null, so dropping it from the name maps
cannot change resolution of any legitimate reference. The change converts the
hard {{NullPointerException}} into correct resolution of the remaining (named)
attributes, or a normal unresolved-column error if the null-named column is
referenced:
{code:scala}
// Build the name maps from attributes that actually have a name.
private lazy val namedAttrs: Seq[Attribute] = attrs.filter(_.name != null)
// ... use `namedAttrs` instead of `attrs` in direct/qualified/qualified3Part/qualified4Part.
{code}
A regression test asserting that {{AttributeSeq.resolve}} no longer throws
when a null-named attribute is present (covering the unqualified {{direct}}
map and the qualified maps) accompanies the fix.
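The effect of the filtering approach can be illustrated outside Spark. This is a rough sketch under stated assumptions: {{SketchAttr}} and {{SimpleAttrSeq}} are hypothetical, simplified analogues of {{Attribute}} and {{AttributeSeq}} that model only the unqualified {{direct}} map and ignore qualifiers and ambiguity handling. It is not the actual patch.

```scala
import java.util.Locale

// Hypothetical stand-in for Catalyst's Attribute; not a Spark class.
final case class SketchAttr(name: String)

// Hypothetical, simplified analogue of AttributeSeq (direct map only).
final class SimpleAttrSeq(attrs: Seq[SketchAttr]) {
  // Drop attributes without a name before building any name map.
  private lazy val namedAttrs: Seq[SketchAttr] = attrs.filter(_.name != null)

  // Case-insensitive name map built only from named attributes.
  private lazy val direct: Map[String, Seq[SketchAttr]] =
    namedAttrs.groupBy(_.name.toLowerCase(Locale.ROOT))

  // None stands in for an unresolved-column outcome in this sketch.
  def resolve(name: String): Option[SketchAttr] =
    direct.get(name.toLowerCase(Locale.ROOT)).flatMap(_.headOption)
}

object NullNameFilterDemo {
  val seq = new SimpleAttrSeq(
    Seq(SketchAttr("a"), SketchAttr(null), SketchAttr("B")))

  // A named column still resolves case-insensitively.
  val found: Option[SketchAttr] = seq.resolve("b")
  // An unknown name yields no match rather than an exception.
  val missing: Option[SketchAttr] = seq.resolve("c")
}
```

In this sketch the null-named entry never reaches the key function, so named columns resolve as before and an unknown name yields {{None}} rather than an exception.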
> NPE in AttributeSeq column resolution when an attribute has a null name
> -----------------------------------------------------------------------
>
> Key: SPARK-57725
> URL: https://issues.apache.org/jira/browse/SPARK-57725
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.3.0
> Reporter: Max Gekk
> Priority: Major