Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21305#discussion_r207723244

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala ---
@@ -336,4 +337,97 @@ object DataType {
       case (fromDataType, toDataType) => fromDataType == toDataType
     }
   }
+
+  /**
+   * Returns true if the write data type can be read using the read data type.
+   *
+   * The write type is compatible with the read type if:
+   * - Both types are arrays, the array element types are compatible, and element nullability is
+   *   compatible (read allows nulls or write does not contain nulls).
+   * - Both types are maps and the map key and value types are compatible, and value nullability
+   *   is compatible (read allows nulls or write does not contain nulls).
+   * - Both types are structs and each field in the read struct is present in the write struct
+   *   and compatible (including nullability), or is nullable if the write struct does not
+   *   contain the field. Write-side structs are not compatible if they contain fields that are
+   *   not present in the read-side struct.
+   * - Both types are atomic and the write type can be safely cast to the read type.
+   *
+   * Extra fields in write-side structs are not allowed to avoid accidentally writing data that
+   * the read schema will not read, and to ensure map key equality is not changed when data is
+   * read.
+   *
+   * @param write a write-side data type to validate against the read type
+   * @param read a read-side data type
+   * @return true if data written with the write type can be read using the read type
+   */
+  def canWrite(
+      write: DataType,
+      read: DataType,
+      resolver: Resolver,
+      context: String,
+      addError: String => Unit = (_: String) => {}): Boolean = {
+    (write, read) match {
+      case (wArr: ArrayType, rArr: ArrayType) =>
+        if (wArr.containsNull && !rArr.containsNull) {
+          addError(s"Cannot write nullable elements to array of non-nulls: '$context'")
+          false
+        } else {
+          canWrite(wArr.elementType, rArr.elementType, resolver, context + ".element", addError)
+        }
+
+      case (wMap: MapType, rMap: MapType) =>
+        // map keys cannot include data fields not in the read schema without changing equality
+        // when read. map keys can be missing fields as long as they are nullable in the read
+        // schema.
+        if (wMap.valueContainsNull && !rMap.valueContainsNull) {
+          addError(s"Cannot write nullable values to map of non-nulls: '$context'")
+          false
+        } else {
+          canWrite(wMap.keyType, rMap.keyType, resolver, context + ".key", addError) &&
+            canWrite(wMap.valueType, rMap.valueType, resolver, context + ".value", addError)
+        }
+
+      case (StructType(writeFields), StructType(readFields)) =>
+        lazy val extraFields = writeFields.map(_.name).toSet -- readFields.map(_.name)
+
+        var result = readFields.forall { readField =>
+          val fieldContext = context + "." + readField.name
+          writeFields.find(writeField => resolver(writeField.name, readField.name)) match {
--- End diff --

@cloud-fan, after thinking about this more, I think the right thing to do is to match fields by position and validate names when names are available. We can always make this more permissive later, if we think that is reasonable and we have the capability to reorder fields.

My reasoning is similar to the by-name vs. by-position mapping for top-level columns: users expect SQL to match by position and DataFrames to match by name. Structs are similar in that the way a struct is built determines the user's expectation, but structs are always constructed with an explicit field order.
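Roughly, here is the struct check I have in mind (a hypothetical sketch, not the patch -- `structFieldsMatch` and its signature are mine, and the recursive type/nullability checks from `canWrite` are elided):

```scala
import org.apache.spark.sql.types.StructType

// Hypothetical sketch: match write fields to read fields by position, then
// validate names. A real version might skip the name check when a side uses
// generated names (col1, col2, ...) and would also recurse into field types
// and nullability, as canWrite does above.
def structFieldsMatch(
    write: StructType,
    read: StructType,
    resolver: (String, String) => Boolean,
    addError: String => Unit): Boolean = {
  if (write.fields.length != read.fields.length) {
    addError(s"Struct field counts do not match: " +
      s"${write.fields.length} != ${read.fields.length}")
    false
  } else {
    write.fields.zip(read.fields).forall { case (w, r) =>
      // positional match; names are validated when both sides carry them
      resolver(w.name, r.name) || {
        addError(s"Field names do not match: wrote '${w.name}', expected '${r.name}'")
        false
      }
    }
  }
}
```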
SQL has two methods to create structs:

* [`named_struct(name1, expr1, ...)`](https://spark.apache.org/docs/2.3.0/api/sql/index.html#named_struct)
* [`struct(expr1, expr2, ...)`](https://spark.apache.org/docs/2.3.0/api/sql/index.html#struct), which uses the generated names `col1`, `col2`, ...

The Dataset API has two methods, [`struct(Column*)` and `struct(String*)`](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$@struct(colName:String,colNames:String*):org.apache.spark.sql.Column), that use the incoming column names or `col1`, `col2`, etc. Other methods, like `from_json`, require a `StructType`, so they produce structs with explicit field names. You can also create structs from a Dataset of case classes or Java beans, in which case the struct takes its field names from the type. (See the sketch at the end of this comment for what each construction produces.)

So it looks like users always create structs with an explicit field order, and sometimes with names. When field names are missing, we should clearly match by field order. But even when field names are present, the order is still obvious to the user -- unlike the DataFrame API, where `withColumn` doesn't give a clear indication of column order.

Because the order is obvious, we should require the correct field order for structs. We should *also* validate that the field names match, since we have that extra information, to catch cases like `struct<y int, x int>` written to `struct<x int, y int>`. If we want to add field reordering for named structs, we can do that later.

Does that sound like a reasonable solution for now? I'll update the code and finish the tests I've been working on.
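For reference, a quick sketch of what each construction produces (the schemas in the comments are what I'd expect, abbreviated -- not verified output):

```scala
import org.apache.spark.sql.functions.{col, struct}

// SQL: explicit field names vs. generated names col1, col2, ...
spark.sql("SELECT named_struct('x', 1, 'y', 2) AS s")   // s: struct<x:int, y:int>
spark.sql("SELECT struct(1, 2) AS s")                   // s: struct<col1:int, col2:int>

// Dataset API: field names come from the incoming columns
val df = spark.range(1).select(col("id").as("x"), col("id").as("y"))
df.select(struct(col("x"), col("y")).as("s"))           // s: struct<x:bigint, y:bigint>
df.select(struct("x", "y").as("s"))                     // same, using struct(String*)
```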