Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21305#discussion_r207723244

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala ---
@@ -336,4 +337,97 @@ object DataType {
       case (fromDataType, toDataType) => fromDataType == toDataType
     }
   }
+
+  /**
+   * Returns true if the write data type can be read using the read data type.
+   *
+   * The write type is compatible with the read type if:
+   * - Both types are arrays, the array element types are compatible, and element nullability is
+   *   compatible (read allows nulls or write does not contain nulls).
+   * - Both types are maps and the map key and value types are compatible, and value nullability
+   *   is compatible (read allows nulls or write does not contain nulls).
+   * - Both types are structs and each field in the read struct is present in the write struct
+   *   and compatible (including nullability), or is nullable if the write struct does not
+   *   contain the field. Write-side structs are not compatible if they contain fields that are
+   *   not present in the read-side struct.
+   * - Both types are atomic and the write type can be safely cast to the read type.
+   *
+   * Extra fields in write-side structs are not allowed to avoid accidentally writing data that
+   * the read schema will not read, and to ensure map key equality is not changed when data is
+   * read.
+   *
+   * @param write a write-side data type to validate against the read type
+   * @param read a read-side data type
+   * @return true if data written with the write type can be read using the read type
+   */
+  def canWrite(
+      write: DataType,
+      read: DataType,
+      resolver: Resolver,
+      context: String,
+      addError: String => Unit = (_: String) => {}): Boolean = {
+    (write, read) match {
+      case (wArr: ArrayType, rArr: ArrayType) =>
+        if (wArr.containsNull && !rArr.containsNull) {
+          addError(s"Cannot write nullable elements to array of non-nulls: '$context'")
+          false
+        } else {
+          canWrite(wArr.elementType, rArr.elementType, resolver, context + ".element", addError)
+        }
+
+      case (wMap: MapType, rMap: MapType) =>
+        // map keys cannot include data fields not in the read schema without changing equality
+        // when read. map keys can be missing fields as long as they are nullable in the read
+        // schema.
+        if (wMap.valueContainsNull && !rMap.valueContainsNull) {
+          addError(s"Cannot write nullable values to map of non-nulls: '$context'")
+          false
+        } else {
+          canWrite(wMap.keyType, rMap.keyType, resolver, context + ".key", addError) &&
+            canWrite(wMap.valueType, rMap.valueType, resolver, context + ".value", addError)
+        }
+
+      case (StructType(writeFields), StructType(readFields)) =>
+        lazy val extraFields = writeFields.map(_.name).toSet -- readFields.map(_.name)
+
+        var result = readFields.forall { readField =>
+          val fieldContext = context + "." + readField.name
+          writeFields.find(writeField => resolver(writeField.name, readField.name)) match {
--- End diff --

@cloud-fan, after thinking about this more, I think the right thing to do is to match fields by position and validate names when names are available. We can always make this more permissive later, if we think that is reasonable and we have the capability to reorder fields.

My reasoning is similar to the by-name vs. by-position mapping for top-level columns: users expect SQL to match by position and DataFrames to match by name. Structs are similar in that the way a struct is built determines the user's expectation, but structs are always constructed with an explicit field order.
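Roughly, here is the struct check I have in mind (a hypothetical sketch, not the patch -- `structFieldsMatch` and its signature are mine, and the recursive type/nullability checks from `canWrite` are elided):

```scala
import org.apache.spark.sql.types.StructType

// Hypothetical sketch: match write fields to read fields by position, then
// validate names. A real version might skip the name check when a side uses
// generated names (col1, col2, ...) and would also recurse into field types
// and nullability, as canWrite does above.
def structFieldsMatch(
    write: StructType,
    read: StructType,
    resolver: (String, String) => Boolean,
    addError: String => Unit): Boolean = {
  if (write.fields.length != read.fields.length) {
    addError(s"Struct field counts do not match: " +
      s"${write.fields.length} != ${read.fields.length}")
    false
  } else {
    write.fields.zip(read.fields).forall { case (w, r) =>
      // positional match; names are validated when both sides carry them
      resolver(w.name, r.name) || {
        addError(s"Field names do not match: wrote '${w.name}', expected '${r.name}'")
        false
      }
    }
  }
}
```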
SQL has two methods to create structs:

* [`named_struct(name1, expr1, ...)`](https://spark.apache.org/docs/2.3.0/api/sql/index.html#named_struct)
* [`struct(expr1, expr2, ...)`](https://spark.apache.org/docs/2.3.0/api/sql/index.html#struct), which uses the generated names `col1`, `col2`, ...

The Dataset API has two methods, [`struct(Column*)` and `struct(String*)`](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$@struct(colName:String,colNames:String*):org.apache.spark.sql.Column), that use the incoming column names or `col1`, `col2`, etc. Other methods, like `from_json`, require a `StructType`, so they produce structs with explicit field names. You can also create structs from a Dataset of case classes or Java beans, in which case the struct takes its field names from the type. (See the sketch at the end of this comment for what each construction produces.)

So it looks like users always create structs with an explicit field order, and sometimes with names. When field names are missing, we should clearly match by field order. But even when field names are present, the order is still obvious to the user -- unlike the DataFrame API, where `withColumn` doesn't give a clear indication of column order.

Because the order is obvious, we should require the correct field order for structs. We should *also* validate that the field names match, since we have that extra information, to catch cases like `struct<y int, x int>` written to `struct<x int, y int>`. If we want to add field reordering for named structs, we can do that later.

Does that sound like a reasonable solution for now? I'll update the code and finish the tests I've been working on.
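For reference, a quick sketch of what each construction produces (the schemas in the comments are what I'd expect, abbreviated -- not verified output):

```scala
import org.apache.spark.sql.functions.{col, struct}

// SQL: explicit field names vs. generated names col1, col2, ...
spark.sql("SELECT named_struct('x', 1, 'y', 2) AS s")   // s: struct<x:int, y:int>
spark.sql("SELECT struct(1, 2) AS s")                   // s: struct<col1:int, col2:int>

// Dataset API: field names come from the incoming columns
val df = spark.range(1).select(col("id").as("x"), col("id").as("y"))
df.select(struct(col("x"), col("y")).as("s"))           // s: struct<x:bigint, y:bigint>
df.select(struct("x", "y").as("s"))                     // same, using struct(String*)
```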