Hi Aniket,

In general, the schema of all rows in a single table must be the same. This is a basic assumption made by Spark SQL. Schema union does make sense, and we're planning to support it for Parquet. But as you've mentioned, it doesn't help if the type of a column differs between versions. Also, you need to reload the data source table after schema changes happen.
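
For illustration, here is a rough sketch of what a schema union could look like with StructType (the field names and the unionSchema helper are made up for this example; it is not the planned Parquet implementation):

import org.apache.spark.sql.types._

// Two versions of the same logical schema: v2 adds a column.
val v1 = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)))

val v2 = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType),
  StructField("email", StringType)))

// A naive union keeps every field that appears in either version.
// Rows written with v1 would simply expose `email` as null.
def unionSchema(a: StructType, b: StructType): StructType =
  StructType(a.fields ++ b.fields.filterNot(f => a.fieldNames.contains(f.name)))

val merged = unionSchema(v1, v2)  // id, name, email

// If v2 had instead changed `id` from LongType to StringType, no single
// StructField could represent both versions, which is the case where
// schema union breaks down.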

Cheng

On 1/28/15 2:12 AM, Aniket Bhatnagar wrote:
I saw the talk on Spark data sources and, looking at the interfaces, it
seems that the schema needs to be provided upfront. This works for many
data sources, but I have a situation in which I would need to integrate a
system that supports schema evolution by allowing users to change the schema
without affecting existing rows. Basically, each row contains a schema hint
(id and version), and this allows developers to evolve the schema over time
and perform migrations at will. Since the schema needs to be specified
upfront in the data source API, one possible way would be to build a union
of all schema versions and handle populating row values appropriately. This
works when columns have been added or deleted, but doesn't work if types
have changed. I was wondering if it would be possible to change the API to
provide a schema for each row instead of expecting the data source to
provide the schema upfront?
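
To make the workaround concrete, here is a rough sketch against the data sources API of a relation that reports the union of all schema versions upfront and pads each row with nulls for the columns its own version lacks (EvolvingRelation, allVersions and rawRows are made-up names standing in for whatever the underlying system provides):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.{StructField, StructType}

import scala.collection.mutable.LinkedHashMap

// A relation over a store whose rows carry a schema hint. `allVersions`
// holds every schema version seen so far; `rawRows` pairs each raw row
// with the schema version it was written under.
class EvolvingRelation(
    allVersions: Seq[StructType],
    rawRows: RDD[(StructType, Seq[Any])],
    val sqlContext: SQLContext)
  extends BaseRelation with TableScan {

  // Union of all versions, reported upfront as the data source API requires.
  // If two versions disagree on a column's type, the first one wins here,
  // which is exactly the limitation described above.
  override val schema: StructType = {
    val byName = LinkedHashMap.empty[String, StructField]
    allVersions.flatMap(_.fields).foreach(f => byName.getOrElseUpdate(f.name, f))
    StructType(byName.values.toSeq)
  }

  // Pad each raw row out to the union schema, filling missing columns with null.
  override def buildScan(): RDD[Row] = {
    val unionFields = schema.fieldNames
    rawRows.map { case (rowSchema, values) =>
      val valueByName = rowSchema.fieldNames.zip(values).toMap
      Row.fromSeq(unionFields.map(name => valueByName.getOrElse(name, null)))
    }
  }
}

As you can see, this handles added and deleted columns but gives no way to represent a column whose type changed between versions, which is why a per-row schema would help.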

Thanks,
Aniket



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
