I'm not sure why option 1 wouldn't be fine. I'm guessing the reason we want
option 2 is to have a unified way of dealing with missing columns? I feel
like that should probably be left up to the underlying datasource
implementation. For example, if columns are missing when writing to a
database, the datasource can choose a value based on the database's
metadata, if such a thing exists. I don't think Spark should really need
this level of detail, but I've also missed all of these meetings (sorry,
it's family dinner time :) ) so I may be missing something.

So my tl;dr is: let a datasource report whether or not missing columns are
OK, and let the datasource deal with the missing data based on its
underlying storage.
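
As a rough sketch, something like the following capability-style mixin
could express that; the trait and method names here are hypothetical, not
existing DSv2 APIs:

    import org.apache.spark.sql.types.StructType

    // Hypothetical capability a source could implement so that Spark's
    // append validation can ask the source about missing optional columns
    // instead of failing the write unconditionally.
    trait AcceptsMissingColumns {
      // Return true if an append whose schema omits `column` is acceptable,
      // e.g. because the underlying storage supplies its own default.
      def canFillMissingColumn(tableSchema: StructType, column: String): Boolean
    }

Spark's append validation would then only fail for missing required
columns, or for optional columns the source says it cannot fill.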

On Wed, Dec 19, 2018 at 8:23 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> I agree that we should not rewrite existing Parquet files when a new
> column is added, but we should also try our best to make the behavior the
> same as the RDBMS/SQL standard.
>
> 1. It should be the user who decides the default value of a column, via
> CREATE TABLE, ALTER TABLE ADD COLUMN, or ALTER TABLE ALTER COLUMN.
> 2. When adding a new column, the default value should be effective for all
> existing data and newly written data.
> 3. When altering an existing column to change its default value, it should
> be effective for newly written data only.
>
> A possible implementation:
> 1. A column has 2 default values: the initial one and the latest one.
> 2. When adding a column with a default value, set both the initial one and
> the latest one to this value, but do not update existing data.
> 3. When reading data, fill the missing column with the initial default
> value.
> 4. When writing data, fill the missing column with the latest default
> value.
> 5. When altering a column to change its default value, only update the
> latest default value.
>
> This works because:
> 1. New files will be written with the latest default value, so there is
> nothing to worry about at read time.
> 2. Old files will be read with the initial default value, which returns
> the expected result.
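>
> A rough sketch of the bookkeeping this implies, using a hypothetical
> ColumnDefault class (none of these names are existing Spark APIs):
>
>     // Per-column default metadata: the value in effect when the column
>     // was added, and the value currently in effect for new writes.
>     case class ColumnDefault(initial: Any, var latest: Any)
>
>     // ADD COLUMN c DEFAULT v          => ColumnDefault(v, v); existing data untouched
>     // reading an old file missing c   => fill with default.initial
>     // writing new data missing c      => fill with default.latest
>     // ALTER COLUMN c SET DEFAULT w    => default.latest = w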
>
> On Wed, Dec 19, 2018 at 8:39 AM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> Hi everyone,
>>
>> This thread is a follow-up to a discussion that we started in the DSv2
>> community sync last week.
>>
>> The problem I’m trying to solve is that the format I’m integrating
>> through DSv2 supports schema evolution. Specifically, it supports adding
>> a new optional column so that rows without that column get a default
>> value (null for Iceberg). The current validation rule for an append in
>> DSv2 fails a write
>> if it is missing a column, so adding a column to an existing table will
>> cause currently-scheduled jobs that insert data to start failing. Clearly,
>> schema evolution shouldn't break existing jobs that produce valid data.
>>
>> To fix this problem, I suggested option 1: adding a way for Spark to
>> check whether to fail when an optional column is missing. Other
>> contributors in the sync thought that Spark should go with option 2:
>> Spark’s schema should have defaults and Spark should handle filling in
>> defaults the same way across all sources, like other databases.
>>
>> I think we agree that option 2 would be ideal. The problem is that it is
>> very hard to implement.
>>
>> A source might manage data stored in millions of immutable Parquet files,
>> so rewriting those files to add a default value isn’t possible. Spark
>> would need to fill in defaults at read time for files written before the
>> column was added (it could fill in defaults for new files at write
>> time). Filling in defaults at
>> read time would require Spark to fill in defaults for only some of the
>> files in a scan, so Spark would need different handling for each task
>> depending on the schema of that task. Tasks would also be required to
>> produce a consistent schema, so a file without the new column couldn’t be
>> combined into a task with a file that has the new column. This adds quite a
>> bit of complexity.
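>>
>> To make that concrete, a read-time fill would look roughly like the
>> following sketch at the DataFrame level, with a hypothetical defaults
>> map (in a real DSv2 source this would have to happen inside each reader
>> task, per group of files that share a schema):
>>
>>     import org.apache.spark.sql.DataFrame
>>     import org.apache.spark.sql.functions.lit
>>     import org.apache.spark.sql.types.StructType
>>
>>     // Add any table columns missing from this group of files, filled with
>>     // that column's default (or null if none was given). Every group of
>>     // files with a different schema needs its own pass like this.
>>     def fillMissingColumns(filesDf: DataFrame, tableSchema: StructType,
>>                            defaults: Map[String, Any]): DataFrame = {
>>       val present = filesDf.columns.toSet
>>       tableSchema.fields.filterNot(f => present.contains(f.name)).foldLeft(filesDf) {
>>         (df, f) =>
>>           df.withColumn(f.name, lit(defaults.getOrElse(f.name, null)).cast(f.dataType))
>>       }
>>     }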
>>
>> Other sources may not need Spark to fill in the default at all. A JDBC
>> source would be capable of filling in the default values itself, so Spark
>> would need some way to communicate the default to that source. If the
>> source had a different policy for default values (write time instead of
>> read time, for example) then behavior could still be inconsistent.
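>>
>> One conceivable way to communicate the default is column metadata on the
>> schema Spark hands to the source; the "default" metadata key here is an
>> assumption, not an existing DSv2 contract:
>>
>>     import org.apache.spark.sql.types.{MetadataBuilder, StringType, StructField}
>>
>>     // A new optional column whose default the source, not Spark, applies.
>>     // A JDBC source could translate this into something like
>>     // ALTER TABLE ... ADD COLUMN status ... DEFAULT 'active' on its side.
>>     val newColumn = StructField("status", StringType, nullable = true,
>>       new MetadataBuilder().putString("default", "active").build())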
>>
>> I think that this complexity probably isn’t worth consistency in default
>> values across sources, if that is even achievable.
>>
>> In the sync we thought it was a good idea to send this out to the larger
>> group to discuss. Please reply with comments!
>>
>> rb
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
