[ https://issues.apache.org/jira/browse/SPARK-57726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Max Gekk updated SPARK-57726:
-----------------------------
Description:
h2. Summary
{{AttributeReference.hashCode}} computes the name's contribution to the hash
with a direct {{name.hashCode()}} call, which throws a {{NullPointerException}}
when the attribute has a {{null}} name. {{AttributeReference.equals}} already
compares the name null-safely ({{name == ar.name}}), so {{hashCode}} is
inconsistent with {{equals}} for null-named attributes, and any use in a
hash-based collection crashes.
h2. Affected code
{{org.apache.spark.sql.catalyst.expressions.AttributeReference.hashCode}} in
{{sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala}}:
{code:scala}
override def hashCode: Int = {
  // See http://stackoverflow.com/questions/113511/hash-code-implementation
  var h = 17
  h = h * 37 + name.hashCode() // NPE if name == null
  h = h * 37 + dataType.hashCode()
  h = h * 37 + nullable.hashCode()
  h = h * 37 + metadata.hashCode()
  h = h * 37 + exprId.hashCode()
  h = h * 37 + qualifier.hashCode()
  h
}
{code}
h2. Reproduction (minimal, Catalyst level)
{code:scala}
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.types.IntegerType
val a = AttributeReference(null, IntegerType)()
Set(a) // or a.hashCode(), a HashMap/HashSet, .distinct, .toSet, ...
{code}
Result:
{code}
java.lang.NullPointerException: Cannot invoke "Object.hashCode()" because the return value of "...AttributeReference.name()" is null
  at org.apache.spark.sql.catalyst.expressions.AttributeReference.hashCode(namedExpressions.scala:...)
{code}
h2. How a null-named attribute arises
{{StructField}} permits a null name (no {{require(name != null)}}), and the
name flows unchanged through {{DataTypeUtils.toAttribute}} into
{{AttributeReference}}. Such an attribute can therefore reach hash-based
collections during planning/analysis.
h2. Root cause
{{name.hashCode()}} is not null-safe, while {{equals}} is. This violates the
equals/hashCode contract for null-named attributes and turns a recoverable
situation into a hard {{NullPointerException}}.
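The asymmetry can be seen without Spark. The following is a minimal sketch in plain Scala; {{NullNameDemo}} and its members are hypothetical names used only for illustration and are not part of the Spark code base. It shows that {{==}} on two null references evaluates without an exception, while a direct {{hashCode()}} call on a null reference throws.

```scala
import scala.util.Try

// Hypothetical demo object; it stands in for the name field of an attribute.
object NullNameDemo {
  val n1: String = null
  val n2: String = null

  // Scala's == checks for null before delegating to equals, so this is safe.
  val namesEqual: Boolean = n1 == n2

  // A direct method call on a null reference throws NullPointerException.
  val directHash: Try[Int] = Try(n1.hashCode())
}
```

Under these assumptions, {{NullNameDemo.namesEqual}} is {{true}} and {{NullNameDemo.directHash}} is a {{Failure}} wrapping a {{NullPointerException}}, which mirrors the mismatch between {{equals}} and {{hashCode}} described above.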
h2. Proposed fix
Use {{java.util.Objects.hashCode(name)}} (already imported) instead of {{name.hashCode()}}:
{code:scala}
h = h * 37 + Objects.hashCode(name)
{code}
A regression test in {{NamedExpressionSuite}} asserts that {{hashCode}} does
not throw on a null-named attribute and that the equals/hashCode contract holds.
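The properties such a test would check can be sketched on a simplified stand-in class. This is not the actual test from the PR: {{Attr}} and {{AttrContractCheck}} are hypothetical names, and {{Attr}} carries only a name and an id rather than the full set of {{AttributeReference}} fields. The only call taken from the proposed fix is {{java.util.Objects.hashCode}}, which returns 0 for a {{null}} argument.

```scala
import java.util.Objects
import scala.collection.mutable
import scala.util.Try

// Hypothetical, simplified stand-in for AttributeReference (name and id only).
final class Attr(val name: String, val id: Long) {
  override def equals(other: Any): Boolean = other match {
    case a: Attr => name == a.name && id == a.id // == is null-safe in Scala
    case _       => false
  }

  override def hashCode: Int = {
    var h = 17
    h = h * 37 + Objects.hashCode(name) // 0 for a null name instead of an NPE
    h = h * 37 + id.hashCode()
    h
  }
}

object AttrContractCheck {
  val a = new Attr(null, 1L)
  val b = new Attr(null, 1L)

  // hashCode no longer throws for a null name.
  val hashDoesNotThrow: Boolean = Try(a.hashCode).isSuccess

  // Equal objects must produce equal hash codes.
  val contractHolds: Boolean = a == b && a.hashCode == b.hashCode

  // A hash-based collection treats the two equal attributes as one element.
  val usableInHashSet: Boolean = mutable.HashSet(a, b).size == 1
}
```

A mutable {{HashSet}} is used in the last check because it always hashes its elements, so the check exercises {{hashCode}} rather than only {{equals}}.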
h2. Related
Noticed during review of SPARK-57725 (NPE in {{AttributeSeq}} column resolution
when an attribute has a null name). The two issues are independent and are
fixed in separate PRs.
> Fix NPE in AttributeReference.hashCode when the attribute name is null
> ----------------------------------------------------------------------
>
> Key: SPARK-57726
> URL: https://issues.apache.org/jira/browse/SPARK-57726
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 5.0.0
> Reporter: Max Gekk
> Priority: Major
> Labels: pull-request-available