Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/20208#discussion_r162835448
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaEvolutionTest.scala ---
@@ -0,0 +1,406 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import java.io.File
+
+import org.apache.spark.sql.{QueryTest, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils}
+
+/**
+ * Schema can evolve in several ways and the followings are supported in file-based data sources.
+ *
+ * 1. Add a column
+ * 2. Remove a column
+ * 3. Change a column position
+ * 4. Change a column type
+ *
+ * Here, we consider safe evolution without data loss. For example, data type evolution should be
+ * from small types to larger types like `int`-to-`long`, not vice versa.
+ *
+ * So far, file-based data sources have schema evolution coverages like the followings.
+ *
+ * | File Format  | Coverage     | Note                                                   |
+ * | ------------ | ------------ | ------------------------------------------------------ |
+ * | TEXT         | N/A          | Schema consists of a single string column.             |
+ * | CSV          | 1, 2, 4      |                                                        |
+ * | JSON         | 1, 2, 3, 4   |                                                        |
+ * | ORC          | 1, 2, 3, 4   | Native vectorized ORC reader has the widest coverage.  |
+ * | PARQUET      | 1, 2, 3      |                                                        |
--- End diff --
@dongjoon-hyun, how do we guarantee schema changes in Parquet and ORC?
I thought we (roughly) pick a file at random, read its footer, and then
use that schema. So I was thinking we don't properly support this. It does
make sense for Parquet with `mergeSchema`, though.
I don't think it's guaranteed for CSV either, because we rely on the
header from a single file.
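
To make the concern concrete, here is a rough sketch of the two Parquet read paths I have in mind. It assumes a local `SparkSession`; the object name, the helper `readBothWays`, and the `key=1` / `key=2` directory layout are made up for illustration and are not part of this PR. Without `mergeSchema`, the columns you get back depend on which footer Spark happens to use, so I would not rely on that result.

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession

// Hypothetical sketch, not code from this PR.
object MergeSchemaSketch {

  /**
   * Writes two Parquet directories whose schemas differ by one added column,
   * then reads the parent directory with and without `mergeSchema`.
   * Returns (columns from the default read, columns from the merged read).
   */
  def readBothWays(spark: SparkSession, dir: String): (Set[String], Set[String]) = {
    import spark.implicits._

    // Old schema: a single column `a`.
    Seq(1, 2).toDF("a").write.parquet(s"$dir/key=1")
    // Evolved schema: column `b` was added.
    Seq((3, "x")).toDF("a", "b").write.parquet(s"$dir/key=2")

    // Default read: no schema merging, so whether `b` shows up depends on
    // which file's footer is used for inference.
    val defaultCols = spark.read.parquet(dir).columns.toSet

    // Explicit merge: the footers are reconciled into one schema.
    val mergedCols = spark.read.option("mergeSchema", "true").parquet(dir).columns.toSet

    (defaultCols, mergedCols)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("merge-schema-sketch")
      .getOrCreate()
    try {
      // Use a fresh, not-yet-existing subdirectory of a temp directory.
      val dir = Files.createTempDirectory("schema-evolution").resolve("t").toString
      val (defaultCols, mergedCols) = readBothWays(spark, dir)
      println(s"default read columns: $defaultCols")
      println(s"merged read columns:  $mergedCols")
    } finally {
      spark.stop()
    }
  }
}
```

Only the merged read is guaranteed to expose both `a` and `b` (plus the `key` partition column), which is why I think the coverage claim for Parquet holds with `mergeSchema` but is shaky without it.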
---