Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21305#discussion_r207723244
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala ---
@@ -336,4 +337,97 @@ object DataType {
      case (fromDataType, toDataType) => fromDataType == toDataType
    }
  }
+
+  /**
+   * Returns true if the write data type can be read using the read data type.
+   *
+   * The write type is compatible with the read type if:
+   * - Both types are arrays, the array element types are compatible, and element nullability is
+   *   compatible (read allows nulls or write does not contain nulls).
+   * - Both types are maps and the map key and value types are compatible, and value nullability
+   *   is compatible (read allows nulls or write does not contain nulls).
+   * - Both types are structs and each field in the read struct is present in the write struct and
+   *   compatible (including nullability), or is nullable if the write struct does not contain the
+   *   field. Write-side structs are not compatible if they contain fields that are not present in
+   *   the read-side struct.
+   * - Both types are atomic and the write type can be safely cast to the read type.
+   *
+   * Extra fields in write-side structs are not allowed to avoid accidentally writing data that
+   * the read schema will not read, and to ensure map key equality is not changed when data is
+   * read.
+   *
+   * @param write a write-side data type to validate against the read type
+   * @param read a read-side data type
+   * @return true if data written with the write type can be read using the read type
+   */
+  def canWrite(
+      write: DataType,
+      read: DataType,
+      resolver: Resolver,
+      context: String,
+      addError: String => Unit = (_: String) => {}): Boolean = {
+    (write, read) match {
+      case (wArr: ArrayType, rArr: ArrayType) =>
+        if (wArr.containsNull && !rArr.containsNull) {
+          addError(s"Cannot write nullable elements to array of non-nulls: '$context'")
+          false
+        } else {
+          canWrite(wArr.elementType, rArr.elementType, resolver, context + ".element", addError)
+        }
+
+      case (wMap: MapType, rMap: MapType) =>
+        // map keys cannot include data fields not in the read schema without changing equality
+        // when read. map keys can be missing fields as long as they are nullable in the read
+        // schema.
+        if (wMap.valueContainsNull && !rMap.valueContainsNull) {
+          addError(s"Cannot write nullable values to map of non-nulls: '$context'")
+          false
+        } else {
+          canWrite(wMap.keyType, rMap.keyType, resolver, context + ".key", addError) &&
+              canWrite(wMap.valueType, rMap.valueType, resolver, context + ".value", addError)
+        }
+
+      case (StructType(writeFields), StructType(readFields)) =>
+        lazy val extraFields = writeFields.map(_.name).toSet -- readFields.map(_.name)
+
+        var result = readFields.forall { readField =>
+          val fieldContext = context + "." + readField.name
+          writeFields.find(writeField => resolver(writeField.name, readField.name)) match {
--- End diff --
@cloud-fan, after thinking about this more, I think the right thing to do is to match by position and validate names when names are available. We can always make it more permissive later if we think that is reasonable and we have the capability to reorder fields.
My reasoning for this is similar to the choice between by-name and by-position mappings. For top-level columns, users expect SQL to match by position and DataFrames to match by name. Structs are similar in that the way a struct is built determines the user's expectation, but structs are always constructed with an explicit field order.
SQL has two methods to create structs (see the example after this list):
* [`named_struct(name1, expr1, ...)`](https://spark.apache.org/docs/2.3.0/api/sql/index.html#named_struct)
* [`struct(expr1, expr2, ...)`](https://spark.apache.org/docs/2.3.0/api/sql/index.html#struct), which uses the names `col1`, `col2`, ...
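To make that concrete, here is a quick sketch as run in `spark-shell` (it assumes a `SparkSession` named `spark`; the printed schemas are approximate):
```scala
// named_struct takes explicit field names from its arguments.
spark.sql("SELECT named_struct('a', 1, 'b', 2) AS s").schema
// s: struct<a:int, b:int>

// struct falls back to the positional names col1, col2, ...
spark.sql("SELECT struct(1, 2) AS s").schema
// s: struct<col1:int, col2:int>
```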
The Dataset API has two methods, [`struct(Column*)` and `struct(String*)`](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$@struct(colName:String,colNames:String*):org.apache.spark.sql.Column), that use the incoming column names or `col1`, `col2`, etc. (see the sketch below). Other methods like `from_json` require a `StructType`, so they use named structs. You can also create structs from a Dataset of case classes or Java beans, in which case the struct takes its field names from the type.
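For example, in the Scala shell (assuming `spark.implicits._` is in scope, as it is in `spark-shell`):
```scala
import org.apache.spark.sql.functions.struct

val df = Seq((1, 2)).toDF("x", "y")

// struct(Column*) keeps the incoming column names.
df.select(struct($"x", $"y").as("s")).schema   // s: struct<x:int, y:int>

// struct(String*) resolves the columns by name and keeps those names too.
df.select(struct("x", "y").as("s")).schema     // s: struct<x:int, y:int>
```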
It looks like users always build structs with an explicit field order and only sometimes with names. When field names are missing, we should clearly use field order. But even when there are field names, the order is still obvious to the user -- unlike the DataFrame API, where `withColumn` doesn't give a clear indication of column order. Because order is obvious, we should require the correct order for structs. We should *also* make sure that the field names match, since we have that extra information to validate cases like `struct<y int, x int>` written to `struct<x int, y int>`.
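To illustrate what that extra check could look like, here is a rough sketch of by-position matching with name validation. This is illustrative only, not the code in this PR: `fieldNameMismatches` is a hypothetical helper, and a simple case-insensitive comparison stands in for the resolver.
```scala
import org.apache.spark.sql.types._

// Hypothetical helper: match write fields to read fields by position and
// report any positions where the field names disagree.
def fieldNameMismatches(write: StructType, read: StructType): Seq[String] = {
  if (write.length != read.length) {
    Seq(s"Expected ${read.length} fields but found ${write.length}")
  } else {
    write.fields.zip(read.fields).zipWithIndex.collect {
      case ((w, r), i) if !w.name.equalsIgnoreCase(r.name) =>
        s"Field $i: wrote '${w.name}' but the read schema expects '${r.name}'"
    }
  }
}

// struct<y int, x int> written to struct<x int, y int> is flagged even though
// the types line up positionally, because the names are available and differ.
val writeType = new StructType().add("y", IntegerType).add("x", IntegerType)
val readType = new StructType().add("x", IntegerType).add("y", IntegerType)
fieldNameMismatches(writeType, readType).foreach(println)
```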
If we want to add field reordering in named structs, we can do that later.
Does that sound like a reasonable solution for now? I'll update the code
and finish the tests I've been working on.