BramBoog opened a new pull request, #41016:
URL: https://github.com/apache/spark/pull/41016

   ### What changes were proposed in this pull request?
   When a StructType contains a nested StructType column that in turn contains 
a column with `nullable = false`, converting it to a DDL string using `.toDDL` 
drops the non-nullability of the nested column from the resulting string. 
For example:
   ```
   val testSchema = StructType(List(
     StructField("key", IntegerType, false),
     StructField("value", StringType, true),
     StructField("nestedCols", StructType(List(
       StructField("nestedKey", IntegerType, false),
       StructField("nestedValue", StringType, true)
     )), false)
   ))
   
   println(testSchema.toDDL)
   println(StructType.fromDDL(testSchema.toDDL))
   ```
   gives
   ```
   key INT NOT NULL,value STRING,nestedCols STRUCT<nestedKey: INT, nestedValue: STRING> NOT NULL
   
   StructType(
     StructField(key,IntegerType,false),
     StructField(value,StringType,true),
     StructField(nestedCols,StructType(
       StructField(nestedKey,IntegerType,true),
       StructField(nestedValue,StringType,true)
     ),false)
   )
   ```
   
   This happens because `StructType.toDDL` calls `StructField.toDDL` for each 
of its fields, which in turn calls `.sql` on the field's `dataType`. If 
`dataType` is a StructType, that call to `.sql` in turn calls `.sql` on all the 
nested fields, and the nested fields' `.sql` method does not include the 
nullability of the field in its output. The proposed solution is therefore to 
have `StructField.toDDL` call `dataType.toDDL` when `dataType` is a StructType, 
since this includes the nullability of nested columns.
   
   To work around the different DDL formats of nested and non-nested structs 
(the former is wrapped in `"STRUCT<...>"` and uses `colName: dataType` for its 
fields instead of `colName dataType`), package-private nested-specific versions 
of `.toDDL` have been added for StructType and StructField.
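
The mechanism described above can be illustrated with a small, self-contained model. This is a sketch only, not Spark's code: `Field`, `Struct`, `IntType`, `StrType`, `toDDLOld`, and `nestedDDL` are hypothetical stand-ins chosen for illustration, and the actual names of the package-private methods in the patch may differ.

```scala
object NestedDdlSketch {
  // Simplified stand-ins for Spark's types; NOT Spark's actual implementation.
  sealed trait DataType { def sql: String }
  case object IntType extends DataType { val sql = "INT" }
  case object StrType extends DataType { val sql = "STRING" }

  final case class Field(name: String, dataType: DataType, nullable: Boolean) {
    private def notNull: String = if (nullable) "" else " NOT NULL"

    // Type-only rendering used inside STRUCT<...>: nullability is not emitted.
    def sql: String = s"$name: ${dataType.sql}"

    // Behaviour described as current: a nested struct is rendered via `.sql`.
    def toDDLOld: String = s"$name ${dataType.sql}$notNull"

    // Proposed behaviour: use a nested-aware renderer when the type is a struct.
    def toDDL: String = dataType match {
      case s: Struct => s"$name ${s.nestedDDL}$notNull"
      case other     => s"$name ${other.sql}$notNull"
    }

    // Hypothetical nested variant: `name: TYPE` followed by the NOT NULL suffix.
    def nestedDDL: String = dataType match {
      case s: Struct => s"$name: ${s.nestedDDL}$notNull"
      case other     => s"$name: ${other.sql}$notNull"
    }
  }

  final case class Struct(fields: List[Field]) extends DataType {
    def sql: String = fields.map(_.sql).mkString("STRUCT<", ", ", ">")
    def nestedDDL: String = fields.map(_.nestedDDL).mkString("STRUCT<", ", ", ">")
    def toDDLOld: String = fields.map(_.toDDLOld).mkString(",")
    def toDDL: String = fields.map(_.toDDL).mkString(",")
  }

  // Same shape as the example schema in this description.
  val schema: Struct = Struct(List(
    Field("key", IntType, nullable = false),
    Field("value", StrType, nullable = true),
    Field("nestedCols", Struct(List(
      Field("nestedKey", IntType, nullable = false),
      Field("nestedValue", StrType, nullable = true)
    )), nullable = false)
  ))

  def main(args: Array[String]): Unit = {
    println(schema.toDDLOld) // nested NOT NULL is lost
    println(schema.toDDL)    // nested NOT NULL is kept
  }
}
```

In this model, the top-level format (`colName dataType`) and the nested format (`colName: dataType` inside `STRUCT<...>`) are kept as separate methods, which mirrors the reason given for adding nested-specific versions of `.toDDL`.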
   
   ### Why are the changes needed?
   Currently, converting a StructType schema to a DDL string does not pass 
information about nullability of nested columns. This leads to a loss of 
information, and means converting to DDL and then back could alter the 
StructType schema. 
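
The round-trip property mentioned here can be checked directly. The following is a minimal sketch that assumes Spark's SQL types are on the classpath; `DdlRoundTrip` and `roundTrips` are hypothetical helper names, while `toDDL` and `StructType.fromDDL` are the calls already used in the example above.

```scala
import org.apache.spark.sql.types._

object DdlRoundTrip {
  // True when converting a schema to DDL and parsing it back yields an equal schema.
  def roundTrips(schema: StructType): Boolean =
    StructType.fromDDL(schema.toDDL) == schema

  // Same schema as the example in this description.
  val testSchema: StructType = StructType(Seq(
    StructField("key", IntegerType, nullable = false),
    StructField("value", StringType, nullable = true),
    StructField("nestedCols", StructType(Seq(
      StructField("nestedKey", IntegerType, nullable = false),
      StructField("nestedValue", StringType, nullable = true)
    )), nullable = false)
  ))

  def main(args: Array[String]): Unit = {
    // Result depends on whether the running Spark version includes this change.
    println(s"round-trips: ${roundTrips(testSchema)}")
  }
}
```

Based on the outputs shown in this description, the check would be expected to fail for `testSchema` without this change, because `nestedKey` comes back as nullable, and to pass with it.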
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, given the example above, the output will become:
   ```
   key INT NOT NULL,value STRING,nestedCols STRUCT<nestedKey: INT NOT NULL, nestedValue: STRING> NOT NULL
   
   StructType(
     StructField(key,IntegerType,false),
     StructField(value,StringType,true),
     StructField(nestedCols,StructType(
       StructField(nestedKey,IntegerType,false),
       StructField(nestedValue,StringType,true)
     ),false)
   )
   ```
   
   ### How was this patch tested?
   In `StructTypeSuite`, the `nestedStruct` testing value has been modified to 
include a non-nullable nested column. The relevant unit tests have been changed 
accordingly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

