GitHub user dongjoon-hyun opened a pull request:
https://github.com/apache/spark/pull/20208
[SPARK-23007][SQL][TEST] Add schema evolution test suite for file-based
data sources
## What changes were proposed in this pull request?
A schema can evolve in several ways, and the following are already
supported in file-based data sources.
1. Add a column
2. Remove a column
3. Change a column position
4. Change a column type
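
As a rough illustration of the first three kinds of evolution, here is a toy model in plain Python. It is not Spark code: `read_with_schema`, the `written` rows, and the column names are hypothetical, and resolving columns by name is an assumption of this sketch. A format that resolves columns by position would behave differently for case 3.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def read_with_schema(rows: List[Row], schema: List[str]) -> List[List[Optional[Any]]]:
    """Project each stored row onto the reader's schema, matching columns by name.

    Columns missing from the stored row come back as None; stored columns
    that the reader's schema does not mention are ignored.
    """
    return [[row.get(col) for col in schema] for row in rows]


# Hypothetical data written with the original schema (a, b).
written = [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]

added = read_with_schema(written, ["a", "b", "c"])  # 1. add a column
removed = read_with_schema(written, ["a"])          # 2. remove a column
reordered = read_with_schema(written, ["b", "a"])   # 3. change a column position
```

In this model the added column `c` reads back as `None` for old rows, the removed column is simply not projected, and reordering only changes the output order.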
This issue aims to guarantee backward-compatible schema evolution coverage
for users of file-based data sources and to prevent future regressions by
*adding schema evolution test suites explicitly*.
Here, we consider only safe evolution without data loss. For example, data type
evolution should go from smaller types to larger types, such as `int` to `long`,
not vice versa.
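
The "smaller to larger" rule can be sketched as a simple ordering check. This is only an illustration in plain Python, not the logic used by Spark or by the test suites: `is_safe_upcast` is a hypothetical helper, and the widening orders below are assumptions. Only `int` to `long` is named in the text above.

```python
# Assumed widening orders, listed from smallest to largest.
# Only int -> long comes from the description; the rest is illustrative.
_INTEGRAL = ["byte", "short", "int", "long"]
_FRACTIONAL = ["float", "double"]


def is_safe_upcast(src: str, dst: str) -> bool:
    """Return True if reading `src` data as `dst` cannot lose information.

    Only conversions within one family are modeled; anything else,
    including integral-to-fractional, is treated as unsafe in this sketch.
    """
    for family in (_INTEGRAL, _FRACTIONAL):
        if src in family and dst in family:
            return family.index(src) <= family.index(dst)
    return False
```

With this definition, `is_safe_upcast("int", "long")` holds while `is_safe_upcast("long", "int")` does not, and an unchanged type counts as safe.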
As of today, in the master branch, file-based data sources have the
following schema evolution coverage.
File Format | Coverage | Note
----------- | ---------- | ------------------------------------------------
TEXT | N/A | Schema consists of a single string column.
CSV | 1, 2, 4 |
JSON | 1, 2, 3, 4 |
ORC | 1, 2, 3, 4 | Native vectorized ORC reader has the widest coverage.
PARQUET | 1, 2, 3 |
## How was this patch tested?
Pass Jenkins with the newly added test suites.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dongjoon-hyun/spark SPARK-SCHEMA-EVOLUTION
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20208.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20208
----
commit 499801e7fdd545ac5918dd5f7a9294db2d5373be
Author: Dongjoon Hyun <dongjoon@...>
Date: 2018-01-07T00:02:09Z
[SPARK-23007][SQL][TEST] Add schema evolution test suite for file-based
data sources
----