GitHub user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/20208#discussion_r175579537
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaEvolutionTest.scala ---
@@ -0,0 +1,406 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import java.io.File
+
+import org.apache.spark.sql.{QueryTest, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils}
+
+/**
+ * Schema can evolve in several ways and the following are supported in file-based data sources.
+ *
+ * 1. Add a column
+ * 2. Remove a column
+ * 3. Change a column position
+ * 4. Change a column type
+ *
+ * Here, we consider safe evolution without data loss. For example, data type evolution should be
+ * from small types to larger types like `int`-to-`long`, not vice versa.
+ *
+ * So far, file-based data sources have the following schema evolution coverage.
+ *
+ * | File Format  | Coverage     | Note                                                   |
+ * | ------------ | ------------ | ------------------------------------------------------ |
+ * | TEXT         | N/A          | Schema consists of a single string column.             |
+ * | CSV          | 1, 2, 4      |                                                        |
+ * | JSON         | 1, 2, 3, 4   |                                                        |
+ * | ORC          | 1, 2, 3, 4   | Native vectorized ORC reader has the widest coverage.  |
+ * | PARQUET      | 1, 2, 3      |                                                        |
+ *
+ * This aims to provide explicit test coverage for schema evolution on file-based data sources.
+ * Since each file format has its own coverage of schema evolution, we need a test suite
+ * for each file-based data source with corresponding supported test case traits.
+ *
+ * The following is a hierarchy of test traits.
+ *
+ *   SchemaEvolutionTest
+ *     -> AddColumnEvolutionTest
+ *     -> RemoveColumnEvolutionTest
+ *     -> ChangePositionEvolutionTest
+ *     -> BooleanTypeEvolutionTest
+ *     -> IntegralTypeEvolutionTest
+ *     -> ToDoubleTypeEvolutionTest
+ *     -> ToDecimalTypeEvolutionTest
+ */
+
+trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext {
+  val format: String
+  val options: Map[String, String] = Map.empty[String, String]
+}
+
+/**
+ * Add column (Case 1).
+ * This test suite assumes that the missing column should be `null`.
+ */
+trait AddColumnEvolutionTest extends SchemaEvolutionTest {
+  import testImplicits._
+
+  test("append column at the end") {
+    withTempPath { dir =>
+      val path = dir.getCanonicalPath
+
+      val df1 = Seq("a", "b").toDF("col1")
+      val df2 = df1.withColumn("col2", lit("x"))
+      val df3 = df2.withColumn("col3", lit("y"))
+
+      val dir1 = s"$path${File.separator}part=one"
+      val dir2 = s"$path${File.separator}part=two"
+      val dir3 = s"$path${File.separator}part=three"
+
+      df1.write.format(format).options(options).save(dir1)
+      df2.write.format(format).options(options).save(dir2)
+      df3.write.format(format).options(options).save(dir3)
+
+      val df = spark.read
+        .schema(df3.schema)
--- End diff --
@gatorsmile, please see this line. This is not about **schema inference**; the schema is passed explicitly via `.schema(df3.schema)`.
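
To make the distinction concrete, here is a minimal standalone sketch (not part of the PR) of a read with a user-specified schema. It assumes Spark is on the classpath and uses JSON as the format; the object name `ExplicitSchemaRead`, the helper `run`, and the temp directory prefix are hypothetical names chosen only for illustration.

```scala
import java.io.File
import java.nio.file.Files

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{StringType, StructType}

// Hypothetical illustration, not code from the PR.
object ExplicitSchemaRead {
  def run(spark: SparkSession): Array[Row] = {
    import spark.implicits._

    val base = Files.createTempDirectory("schema-evolution").toFile.getCanonicalPath

    // The older files have one column, the newer files have two.
    val df1 = Seq("a", "b").toDF("col1")
    val df2 = df1.withColumn("col2", lit("x"))

    df1.write.format("json").save(s"$base${File.separator}part=one")
    df2.write.format("json").save(s"$base${File.separator}part=two")

    // The caller supplies the evolved schema, so the reader does not infer
    // the data columns from the files.
    val schema = new StructType()
      .add("col1", StringType)
      .add("col2", StringType)

    spark.read.schema(schema).format("json").load(base).collect()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("explicit-schema-read")
      .getOrCreate()
    try run(spark).foreach(println) finally spark.stop()
  }
}
```

In this sketch the rows written without `col2` come back with `col2` set to `null`, which matches the assumption stated in the `AddColumnEvolutionTest` doc comment. The `part` column is still discovered from the directory names, but the data columns come only from the supplied schema.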