[ 
https://issues.apache.org/jira/browse/SPARK-57726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-57726:
-----------------------------
    Affects Version/s: 4.3.0
                           (was: 5.0.0)

> Fix NPE in AttributeReference.hashCode when the attribute name is null
> ----------------------------------------------------------------------
>
>                 Key: SPARK-57726
>                 URL: https://issues.apache.org/jira/browse/SPARK-57726
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 4.3.0
>            Reporter: Max Gekk
>            Priority: Major
>              Labels: pull-request-available
>
> h2. Summary
> {{AttributeReference.hashCode}} computes the name's contribution to the hash with a
> direct {{name.hashCode()}} call, which throws a {{NullPointerException}} when the
> attribute has a {{null}} name. {{AttributeReference.equals}} already compares the name
> null-safely ({{name == ar.name}}), so {{hashCode}} is inconsistent with {{equals}} for
> null-named attributes, and any use in a hash-based collection crashes.
> h2. Affected code
> {{org.apache.spark.sql.catalyst.expressions.AttributeReference.hashCode}} in
> {{sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala}}:
> {code:scala}
> override def hashCode: Int = {
>   // See http://stackoverflow.com/questions/113511/hash-code-implementation
>   var h = 17
>   h = h * 37 + name.hashCode()   // NPE if name == null
>   h = h * 37 + dataType.hashCode()
>   h = h * 37 + nullable.hashCode()
>   h = h * 37 + metadata.hashCode()
>   h = h * 37 + exprId.hashCode()
>   h = h * 37 + qualifier.hashCode()
>   h
> }
> {code}
> h2. Reproduction (minimal, Catalyst level)
> {code:scala}
> import org.apache.spark.sql.catalyst.expressions.AttributeReference
> import org.apache.spark.sql.types.IntegerType
> val a = AttributeReference(null, IntegerType)()
> a.hashCode()  // also reached via java.util.HashMap/HashSet, mutable.HashSet, .distinct, ...
> {code}
> Result:
> {code}
> java.lang.NullPointerException: Cannot invoke "Object.hashCode()" because
> the return value of "...AttributeReference.name()" is null
>   at org.apache.spark.sql.catalyst.expressions.AttributeReference.hashCode(namedExpressions.scala:...)
> {code}
> h2. How a null-named attribute arises
> {{StructField}} permits a null name (no {{require(name != null)}}), and the name flows
> unchanged through {{DataTypeUtils.toAttribute}} into {{AttributeReference}}. Such an
> attribute can therefore reach hash-based collections during planning/analysis.
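> The path above can be sketched as follows. This is an illustration only: it assumes
> spark-catalyst is on the classpath (for example in {{spark-shell}}), that the build is an
> affected one, and that {{DataTypeUtils}} lives in {{org.apache.spark.sql.catalyst.types}},
> which this report does not state.
> {code:scala}
> // Assumption: package path of DataTypeUtils; adjust if it differs in your build.
> import org.apache.spark.sql.catalyst.types.DataTypeUtils
> import org.apache.spark.sql.types.{IntegerType, StructField}
>
> val field = StructField(null, IntegerType)    // accepted: no null check on the name
> val attr  = DataTypeUtils.toAttribute(field)  // the null name is carried over unchanged
> assert(attr.name == null)
> attr.hashCode()                               // on an affected build, expected to hit the NPE described here
> {code}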
> h2. Root cause
> {{name.hashCode()}} is not null-safe, while {{equals}} is. This violates the
> equals/hashCode contract for null-named attributes and turns a recoverable situation into
> a hard {{NullPointerException}}.
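> A minimal, Spark-free sketch of this mismatch. {{Attr}} and {{ContractDemo}} are
> hypothetical stand-ins, not Spark classes: {{Attr}} keeps only the name field and copies
> the null-safe comparison and the direct {{name.hashCode()}} call quoted in this report.
> {code:scala}
> import java.util.{HashSet => JHashSet}
>
> // Hypothetical stand-in for AttributeReference, reduced to the name field.
> final class Attr(val name: String) {
>   // Null-safe: Scala's == on references handles null on either side.
>   override def equals(other: Any): Boolean = other match {
>     case a: Attr => name == a.name
>     case _       => false
>   }
>
>   // Not null-safe: dereferences name directly, as in the affected code.
>   override def hashCode: Int = {
>     var h = 17
>     h = h * 37 + name.hashCode()
>     h
>   }
> }
>
> object ContractDemo {
>   // Returns true if inserting `a` into a hash-based set throws an NPE.
>   def hashingThrows(a: Attr): Boolean =
>     try { new JHashSet[Attr]().add(a); false }
>     catch { case _: NullPointerException => true }
>
>   def main(args: Array[String]): Unit = {
>     val a = new Attr(null)
>     val b = new Attr(null)
>     println(a == b)            // true: equals accepts null names
>     println(hashingThrows(a))  // true: hashCode does not
>   }
> }
> {code}
> Two objects that {{equals}} reports as equal cannot even be hashed, so they cannot be
> used as keys or set members.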
> h2. Proposed fix
> Use {{java.util.Objects.hashCode(name)}} (already imported) instead of
> {{name.hashCode()}}:
> {code:scala}
> h = h * 37 + Objects.hashCode(name)
> {code}
> A regression test in {{NamedExpressionSuite}} asserts that {{hashCode}} does not throw on
> a null-named attribute and that the equals/hashCode contract holds.
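> The properties such a test checks can be sketched without Spark. {{SafeAttr}} and
> {{SafeAttrCheck}} are hypothetical stand-ins and this is not the test from the PR; the
> sketch only relies on {{java.util.Objects.hashCode}} returning 0 for {{null}}.
> {code:scala}
> import java.util.Objects
>
> // Hypothetical stand-in for AttributeReference, reduced to the name field.
> final class SafeAttr(val name: String) {
>   override def equals(other: Any): Boolean = other match {
>     case a: SafeAttr => name == a.name
>     case _           => false
>   }
>
>   override def hashCode: Int = {
>     var h = 17
>     h = h * 37 + Objects.hashCode(name)  // 0 for null, name.hashCode() otherwise
>     h
>   }
> }
>
> object SafeAttrCheck {
>   def main(args: Array[String]): Unit = {
>     val a = new SafeAttr(null)
>     val b = new SafeAttr(null)
>     assert(a == b)                    // equal by name
>     assert(a.hashCode == b.hashCode)  // equal objects hash equally, and no NPE
>     assert(a.hashCode == 17 * 37)     // null contributes 0 to the hash
>     // Non-null names hash exactly as before the change.
>     assert(new SafeAttr("id").hashCode == 17 * 37 + "id".hashCode)
>   }
> }
> {code}
> For non-null names {{Objects.hashCode(name)}} equals {{name.hashCode()}}, so the change
> leaves existing hash values untouched.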
> h2. Related
> Noticed during review of SPARK-57725 (NPE in {{AttributeSeq}} column resolution when an
> attribute has a null name). The two issues are independent and are fixed in separate PRs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
