Hi, All.
A data schema can evolve in several ways, and Apache Spark 2.3 already
supports the following for file-based data sources like
CSV/JSON/ORC/Parquet (a small sketch of case 1 follows the list).
1. Add a column
2. Remove a column
3. Change a column position
4. Change a column type
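For example, here is a minimal spark-shell sketch of case 1, adding a
column; the paths and column names are hypothetical and only for
illustration:

  // Sketch of schema evolution by adding a column (hypothetical paths).
  // Assumes a spark-shell session, where `spark` is already defined.
  import org.apache.spark.sql.types._
  import spark.implicits._

  // Old data has a single int column `a`.
  Seq(1, 2).toDF("a").write.parquet("/tmp/evolve/old")
  // Newer data added a string column `b`.
  Seq((3, "x")).toDF("a", "b").write.parquet("/tmp/evolve/new")

  // Read everything back with the user-given final schema; rows from
  // the old files should come back with null for the missing `b`.
  val finalSchema = StructType(Seq(
    StructField("a", IntegerType),
    StructField("b", StringType)))
  spark.read.schema(finalSchema)
    .parquet("/tmp/evolve/old", "/tmp/evolve/new")
    .show()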
Can we guarantee users some schema evolution coverage for file-based
data sources by adding explicit schema evolution test suites? So far,
there are only a few scattered test cases.
For simplicity, I make several assumptions about schema evolution (a
sketch of assumption 1 follows the list).
1. Only safe evolutions without data loss,
   e.g. widening from smaller to larger types like int-to-long, not
   vice versa.
2. The final schema is given by users (or Hive).
3. Only simple Spark data types that are supported by Spark's
   vectorized execution.
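As a sketch of assumption 1, an int column written to a file can be
read back with a user-given schema that widens it to long. Whether
each source (CSV/JSON/ORC/Parquet) actually supports this today is
exactly what the proposed suite is meant to verify; the path below is
hypothetical:

  // Sketch of safe type widening, int -> long, with a user-given
  // final schema. May fail on sources that do not support it yet;
  // documenting that coverage is the point of the test suite.
  import org.apache.spark.sql.types._
  import spark.implicits._

  Seq(1, 2).toDF("a").write.parquet("/tmp/widen")

  // The final schema widens `a` from int to long.
  val widened = StructType(Seq(StructField("a", LongType)))
  spark.read.schema(widened).parquet("/tmp/widen").show()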
I made a test case PR to gather your opinions on this:
[SPARK-23007][SQL][TEST] Add schema evolution test suite for file-based
data sources
- https://github.com/apache/spark/pull/20208
Could you take a look and share your opinions?
Bests,
Dongjoon.