GitHub user JoshRosen commented on a diff in the pull request:
https://github.com/apache/spark/pull/7179#discussion_r33750029
--- Diff:
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala
---
@@ -42,4 +47,47 @@ class CodeGenerationSuite extends SparkFunSuite {
futures.foreach(Await.result(_, 10.seconds))
}
+
+ // Test GenerateOrdering for all common types. For each type, we construct random input rows that
+ // contain two columns of that type, then for pairs of randomly-generated rows we check that
+ // GenerateOrdering agrees with RowOrdering.
+ (DataTypeTestUtils.atomicTypes ++ Set(NullType)).foreach { dataType =>
+ test(s"GenerateOrdering with $dataType") {
+ val rowOrdering = RowOrdering.forSchema(Seq(dataType, dataType))
+ val genOrdering = GenerateOrdering.generate(
+ BoundReference(0, dataType, nullable = true).asc ::
+ BoundReference(1, dataType, nullable = true).asc :: Nil)
+ val rowType = StructType(
+ StructField("a", dataType, nullable = true) ::
+ StructField("b", dataType, nullable = true) :: Nil)
+ val toCatalyst = CatalystTypeConverters.createToCatalystConverter(rowType)
+ // Sort ordering is not defined for NaN, so skip any random inputs that contain it:
+ def isIncomparable(v: Any): Boolean = v match {
--- End diff --
Given that we might use sorting for clustering as part of a sort-based
distinct operator, I wonder whether this has any bad implications for
performing distinct on columns that contain NaN. Should we warn about this
undefined behavior somewhere in our documentation?
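
To make the concern concrete, here is a minimal sketch in plain Scala with no Spark dependencies. It is not the code under review: `naiveCompare`, `totalCompare`, `dedupAdjacent`, and `sortBasedDistinct` are hypothetical names chosen for illustration. It assumes a sort-based distinct works by sorting and then dropping any value that compares equal to the last value kept. A comparator built from primitive `<` and `>` treats NaN as tied with everything, so NaN can be silently dropped; `java.lang.Double.compare`, which defines a total order with NaN last, keeps exactly one NaN.

```scala
object NaNOrderingSketch {
  // Hypothetical comparator built only from primitive < and >.
  // Every comparison involving NaN is false, so NaN ties with every value.
  def naiveCompare(a: Double, b: Double): Int =
    if (a < b) -1 else if (a > b) 1 else 0

  // java.lang.Double.compare defines a total order: NaN equals itself
  // and sorts after every other value, including positive infinity.
  def totalCompare(a: Double, b: Double): Int =
    java.lang.Double.compare(a, b)

  // Dedup step of a sort-based distinct: keep a value only when the
  // comparator says it differs from the last value kept.
  def dedupAdjacent(sorted: Seq[Double], cmp: (Double, Double) => Int): Vector[Double] =
    sorted.foldLeft(Vector.empty[Double]) { (kept, x) =>
      if (kept.nonEmpty && cmp(kept.last, x) == 0) kept else kept :+ x
    }

  // Sort with the total order, then drop adjacent duplicates.
  def sortBasedDistinct(xs: Seq[Double]): Vector[Double] =
    dedupAdjacent(xs.sortWith((a, b) => totalCompare(a, b) < 0), totalCompare)
}
```

Under `naiveCompare`, 1.0 ties with NaN and NaN ties with 2.0, yet 1.0 is less than 2.0, so the comparator is not transitive and a sort that relies on it gives no clustering guarantee. If the generated ordering behaves like the naive version for NaN, a documentation note seems warranted.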