MGHawes opened a new pull request #25575: [WIP][SPARK-28818] Respect source 
column nullability in the arrays created by `freqItems()`
URL: https://github.com/apache/spark/pull/25575
 
 
   ### What changes were proposed in this pull request?
   This PR replaces the hard-coded non-nullability of the array elements 
returned by `freqItems()` with a nullability that reflects the original schema. 
Essentially the change to the schema generation is:
   ```
   StructField(name + "_freqItems", ArrayType(dataType, false))
   ```
   Becomes:
   ```
   StructField(name + "_freqItems", ArrayType(dataType, originalField.nullable))
   ```
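
   As a minimal sketch of the rule above (not the actual patch), the snippet below derives the `_freqItems` field from a source field. The helper name `freqItemsField` and the sample columns `a` and `b` are hypothetical. It assumes only that `org.apache.spark.sql.types` is on the classpath, for example in `spark-shell`:
   ```scala
   import org.apache.spark.sql.types._

   // Hypothetical helper: build the output field for one source column,
   // carrying the column's nullability into the array's containsNull flag.
   def freqItemsField(originalField: StructField): StructField =
     StructField(
       originalField.name + "_freqItems",
       ArrayType(originalField.dataType, originalField.nullable))

   // Illustrative source columns, not taken from the PR.
   val nullableCol = StructField("a", IntegerType, nullable = true)
   val nonNullCol = StructField("b", IntegerType, nullable = false)

   // A nullable source column yields an array type that may contain nulls.
   assert(freqItemsField(nullableCol).dataType ==
     ArrayType(IntegerType, containsNull = true))
   // A non-nullable source column keeps the stricter array type.
   assert(freqItemsField(nonNullCol).dataType ==
     ArrayType(IntegerType, containsNull = false))
   ```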
   
   Respecting the original nullability prevents issues when Spark depends on 
`ArrayType`'s `containsNull` being accurate. The example that uncovered this is 
calling `collect()` on the dataframe (see 
[ticket](https://issues.apache.org/jira/browse/SPARK-28818) for full repro). 
It is likely that there are several other places where this could cause a 
problem. 
   
   I've also refactored a small amount of the surrounding code to remove some 
unnecessary steps and group together related operations.
   
   ### Why are the changes needed?
   This fixes a bug that currently prevents users from calling `collect()` on 
the result of `freqItems()`, and that may also cause other, as yet unknown, 
issues.
   
   
   ### Does this PR introduce any user-facing change?
   No
   
   
   ### How was this patch tested?
   I added a test that specifically tests the carry-through of the nullability 
as well as explicitly calling `collect()` to catch the exact regression that 
was observed.
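
   A check along these lines might look like the sketch below. It is not the test added in this PR: the column name `a`, the sample rows, and the local session settings are illustrative assumptions, and the `containsNull` assertion is only expected to hold with this change applied.
   ```scala
   import org.apache.spark.sql.{Row, SparkSession}
   import org.apache.spark.sql.types._

   // Assumption: a local session is acceptable for this sketch.
   val spark = SparkSession.builder()
     .master("local[1]")
     .appName("freqItems-nullability-sketch")
     .getOrCreate()

   // Illustrative data: a nullable integer column that contains a null.
   val schema = StructType(Seq(StructField("a", IntegerType, nullable = true)))
   val rows = java.util.Arrays.asList(Row(1), Row(null), Row(1))
   val df = spark.createDataFrame(rows, schema)

   val result = df.stat.freqItems(Seq("a"))

   // The array's containsNull should mirror the source column's nullability
   // (expected to hold only with this change applied).
   val arrayType = result.schema("a_freqItems").dataType.asInstanceOf[ArrayType]
   assert(arrayType.containsNull)

   // Exercise the code path from the ticket: collecting the result.
   result.collect()
   ```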
   

