MGHawes commented on a change in pull request #25575: [SPARK-28818][SQL] Respect source column nullability in the arrays created by `freqItems()`
URL: https://github.com/apache/spark/pull/25575#discussion_r318092162
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala
##########
@@ -117,10 +112,16 @@ object FrequentItems extends Logging {
)
val justItems = freqItems.map(m => m.baseMap.keys.toArray)
val resultRow = Row(justItems : _*)
- // append frequent Items to the column name for easy debugging
- val outputCols = colInfo.map { v =>
- StructField(v._1 + "_freqItems", ArrayType(v._2, false))
- }
+
+ val originalSchema = df.schema
+ val outputCols = cols.map { name =>
+ val index = originalSchema.fieldIndex(name)
+ val originalField = originalSchema.fields(index)
+
+ // append frequent Items to the column name for easy debugging
+ StructField(name + "_freqItems", ArrayType(originalField.dataType, originalField.nullable))
+ }.toArray
+
Review comment:
We can, but I felt what the code was doing already was a little odd. Is there a good reason to split the creation of `colInfo` and `outputCols` into separate steps? It also seems more readable to me to name the variables explicitly rather than use `_*` and the tuple accessors `_1` and `_2`.
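To illustrate the naming point, here is a rough sketch (the pair values and variable names below are hypothetical stand-ins, not the actual `FrequentItems` code), comparing tuple accessors with a pattern-matching destructure:

```scala
object NamingSketch extends App {
  // Hypothetical stand-in for colInfo: pairs of (columnName, dataType name).
  val colInfo: Seq[(String, String)] =
    Seq(("age", "IntegerType"), ("name", "StringType"))

  // Tuple accessors: terse, but _1/_2 hide what the fields mean.
  val viaAccessors = colInfo.map(v => v._1 + "_freqItems")

  // Pattern-matching destructure: same result, but the intent is explicit.
  val viaNames = colInfo.map { case (colName, _) => colName + "_freqItems" }

  assert(viaAccessors == viaNames)
  println(viaNames.mkString(", "))
}
```

Both versions produce the same output; the second just makes it obvious that the first element of each pair is the column name.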
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]