[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

dongjoon-hyun Sun, 25 Mar 2018 02:02:59 -0700

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20208#discussion_r176933664
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -815,6 +815,54 @@ should start with, they can set `basePath` in the data 
source options. For examp
     when `path/to/table/gender=male` is the path of the data and
     users set `basePath` to `path/to/table/`, `gender` will be a partitioning 
column.
     
    +### Schema Evolution
    +
    +Users can control schema evolution in several ways. For example, new file 
can have additional
    +new column. All file-based data sources (`csv`, `json`, `orc`, and 
`parquet`) except `text`
    +data source supports this. Note that `text` data source always has a fixed 
single string column
    +schema.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala"  markdown="1">
    +val df1 = Seq("a", "b").toDF("col1")
    +val df2 = df1.withColumn("col2", lit("x"))
    +
    +df1.write.save("/tmp/evolved_data/part=1")
    +df2.write.save("/tmp/evolved_data/part=2")
    +
    +spark.read.schema("col1 string, col2 
string").load("/tmp/evolved_data").show
    ++----+----+----+
    +|col1|col2|part|
    ++----+----+----+
    +|   a|   x|   2|
    +|   b|   x|   2|
    +|   a|null|   1|
    +|   b|null|   1|
    ++----+----+----+
    +</div>
    +
    +</div>
    +
    +The following schema evolutions are supported in `csv`, `json`, `orc`, and 
`parquet` file-based
    +data sources.
    +
    +  1. Add a column
    +  2. Remove a column
    +  3. Change a column position
    +  4. Change a column type (`byte` -> `short` -> `int` -> `long`, `float` 
-> `double`)
    --- End diff --
    
    Yep. `Upcast`s are safe. This PR doesn't aim to cover or guarantee unsafe 
casting at this stage. Although these are straight-forward `upcast`s, not all 
Spark file-based data sources seems to support them (based on the test cases). 
This PR is trying to set the clear boundary and to clarify those missed things.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

Reply via email to