BramBoog opened a new pull request, #41016:
URL: https://github.com/apache/spark/pull/41016
### What changes were proposed in this pull request?
When a StructType contains a nested StructType column, and that nested struct
contains a column with `nullable = false`, converting the outer StructType to a
DDL string using `.toDDL` drops the nested column's non-nullability.
For example:
```
import org.apache.spark.sql.types._

val testSchema = StructType(List(
  StructField("key", IntegerType, false),
  StructField("value", StringType, true),
  StructField("nestedCols", StructType(List(
    StructField("nestedKey", IntegerType, false), // non-nullable nested column
    StructField("nestedValue", StringType, true)
  )), false)
))
// Print the DDL string, then the schema obtained by parsing it back.
println(testSchema.toDDL)
println(StructType.fromDDL(testSchema.toDDL))
```
gives
```
key INT NOT NULL,value STRING,nestedCols STRUCT<nestedKey: INT, nestedValue: STRING> NOT NULL
StructType(
StructField(key,IntegerType,false),
StructField(value,StringType,true),
StructField(nestedCols,StructType(
StructField(nestedKey,IntegerType,true),
StructField(nestedValue,StringType,true)
),false)
)
```
This happens because `StructType.toDDL` calls `StructField.toDDL` for each of
its fields, which in turn calls `.sql` on the field's `dataType`. If `dataType`
is a StructType, that call to `.sql` in turn calls `.sql` on each nested field,
and that method omits the field's nullability from its output. The proposed
solution is therefore to have `StructField.toDDL` call `dataType.toDDL` when
the data type is a StructType, since this includes the nullability of nested
columns.
To work around the different DDL formats of nested and non-nested structs
(the former is wrapped in `STRUCT<...>` and uses `colName: dataType` for its
fields instead of `colName dataType`), package-private nested-specific versions
of `.toDDL` have been added to StructType and StructField.
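
To illustrate the idea, here is a minimal, self-contained sketch using simplified stand-in types rather than the real `org.apache.spark.sql.types` classes. The names `toDDLOld` and `toNestedDDL` are hypothetical and chosen only for this illustration; the actual method names and implementation in the patch may differ.

```scala
object NestedDdlSketch {
  // Simplified stand-ins for Spark's types, for illustration only.
  sealed trait DataType { def sql: String }
  case object IntegerType extends DataType { val sql = "INT" }
  case object StringType extends DataType { val sql = "STRING" }

  case class StructField(name: String, dataType: DataType, nullable: Boolean = true) {
    private def notNull: String = if (nullable) "" else " NOT NULL"

    // Old path: the data type is always rendered via .sql,
    // so nullability of nested fields is lost.
    def toDDLOld: String = s"$name ${dataType.sql}$notNull"

    // New path: struct-typed fields are rendered via the nested-aware method.
    def toDDL: String = s"$name ${typeDDL}$notNull"

    // Nested variant (hypothetical name): uses "colName: dataType".
    def toNestedDDL: String = s"$name: ${typeDDL}$notNull"

    private def typeDDL: String = dataType match {
      case st: StructType => st.toNestedDDL
      case other          => other.sql
    }
  }

  case class StructType(fields: List[StructField]) extends DataType {
    // .sql renders nested fields without any nullability information.
    def sql: String =
      fields.map(f => s"${f.name}: ${f.dataType.sql}").mkString("STRUCT<", ", ", ">")

    def toDDLOld: String = fields.map(_.toDDLOld).mkString(",")
    def toDDL: String = fields.map(_.toDDL).mkString(",")

    // Nested variant (hypothetical name): wrapped in STRUCT<...>.
    def toNestedDDL: String =
      fields.map(_.toNestedDDL).mkString("STRUCT<", ", ", ">")
  }

  val schema: StructType = StructType(List(
    StructField("key", IntegerType, nullable = false),
    StructField("value", StringType),
    StructField("nestedCols", StructType(List(
      StructField("nestedKey", IntegerType, nullable = false),
      StructField("nestedValue", StringType)
    )), nullable = false)
  ))

  val oldDdl: String = schema.toDDLOld
  val newDdl: String = schema.toDDL

  def main(args: Array[String]): Unit = {
    println(oldDdl) // nestedKey appears without NOT NULL
    println(newDdl) // nestedKey appears with NOT NULL
  }
}
```

In this sketch, only the struct-typed branch changes: leaf types are still rendered through `.sql`, while nested structs go through the nested-aware path so that `NOT NULL` survives at every level.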
### Why are the changes needed?
Currently, converting a StructType schema to a DDL string does not pass
information about nullability of nested columns. This leads to a loss of
information, and means converting to DDL and then back could alter the
StructType schema.
### Does this PR introduce _any_ user-facing change?
Yes. Given the example above, the output will become:
```
key INT NOT NULL,value STRING,nestedCols STRUCT<nestedKey: INT NOT NULL, nestedValue: STRING> NOT NULL
StructType(
StructField(key,IntegerType,false),
StructField(value,StringType,true),
StructField(nestedCols,StructType(
StructField(nestedKey,IntegerType,false),
StructField(nestedValue,StringType,true)
),false)
)
```
### How was this patch tested?
In `StructTypeSuite`, the `nestedStruct` testing value has been modified to
include a non-nullable nested column. The relevant unit tests have been changed
accordingly.