[GitHub] [iceberg] shenodaguirguis commented on pull request #2496: [#2039] Support default value semantics - API changes

GitBox Wed, 09 Feb 2022 15:14:25 -0800


shenodaguirguis commented on pull request #2496:
URL: https://github.com/apache/iceberg/pull/2496#issuecomment-1034297274

   Thanks @RussellSpitzer for the great questions. I took some time to make 
sure I am thinking with the general Iceberg mindset; please find my responses 
below. If they make sense, I will go ahead and update the spec accordingly.
   Let me start with the example you wrote:
   > An example being:
   > 
   > Say I have file A' without column id 3 which has a default in my spec of 
"foo". If I read rows out of fileA' with this schema do I return foo? If I 
later read the table where the spec has a default for column3 of "bar" do I 
read bar? If I rewrite the datafiles inbetween the schema changes do I get foo 
or bar or null?

   The default value is part of the table’s schema/spec, so it lives in the 
Iceberg metadata file. Whenever the schema is updated, producing a new schema 
ID, subsequent reads will see the new default value. When reading an older 
snapshot, however, i.e., using that snapshot’s schema, the older default value 
saved in that schema is used.
   So, in the example you listed, reading file A the first time (snapshot 0) 
returns “foo”, while reading it after updating the spec/schema (new snapshot 1) 
returns “bar”. Of course, reading snapshot 0 even after updating the schema 
still returns “foo”.
   Rewriting data files produces a new snapshot that derives from the latest 
snapshot and uses the same schema, so the rewrite is orthogonal here: the read 
behavior is unchanged.
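
   To make the foo/bar example concrete, here is a minimal sketch in plain 
Python. It is not Iceberg code: the names `SCHEMA_DEFAULTS`, `SNAPSHOT_SCHEMA`, 
`FILE_A` and `read_column` are hypothetical, and rows are modelled as dicts 
keyed by column id. It only illustrates that the default is looked up through 
the schema of the snapshot being read.

```python
# Hypothetical in-memory model of table metadata; not Iceberg code.
# schema id -> {column id: default value}
SCHEMA_DEFAULTS = {
    0: {3: "foo"},  # schema 0: column id 3 defaults to "foo"
    1: {3: "bar"},  # schema 1: the default was changed to "bar"
}
# snapshot id -> id of the schema that snapshot reads with
SNAPSHOT_SCHEMA = {0: 0, 1: 1}

# File A was written without column id 3 (rows are keyed by column id).
FILE_A = [{1: "a", 2: 10}, {1: "b", 2: 20}]


def read_column(rows, column_id, snapshot_id):
    """Return one column, filling in the snapshot schema's default when absent."""
    defaults = SCHEMA_DEFAULTS[SNAPSHOT_SCHEMA[snapshot_id]]
    return [row.get(column_id, defaults.get(column_id)) for row in rows]


print(read_column(FILE_A, 3, snapshot_id=0))  # ['foo', 'foo']
print(read_column(FILE_A, 3, snapshot_id=1))  # ['bar', 'bar']
```

   The data file itself never changes between the two reads; only the schema 
selected by the snapshot does.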

   > For example, does a writer upon rewriting a file with a default value set, 
materialize that value?

   No. Default values belong to the table’s spec/schema and are used only on 
read, when the column is missing from the data file (i.e., not materialized).
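
   As a rough illustration of "not materialized on rewrite", the sketch below 
assumes a rewrite simply copies the stored columns. The helpers `rewrite` and 
`read_column` are hypothetical stand-ins, not Iceberg APIs, and rows are dicts 
keyed by column id.

```python
# Hypothetical sketch; rows are dicts keyed by column id, not Iceberg types.

def rewrite(rows):
    """Rewrite (e.g. compact) rows: copy stored columns, never fill defaults."""
    return [dict(row) for row in rows]


def read_column(rows, column_id, default):
    """Read one column, applying the schema default only where it is absent."""
    return [row.get(column_id, default) for row in rows]


old_file = [{1: "a"}, {1: "b"}]   # written before column id 3 existed
new_file = rewrite(old_file)      # rewritten while the default was "foo"

# Column id 3 is still absent after the rewrite ...
assert all(3 not in row for row in new_file)
# ... so a later change of the default is still visible when reading.
print(read_column(new_file, 3, default="bar"))  # ['bar', 'bar']
```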

   > I think "manifested" needs explanation as well. If defaults are used on 
writers I think we probably need to explain that in the writer section.

   Agreed, I will use “materialized” instead. Also, defaults are not used when 
writing data files. However, it is worth mentioning that we might consider 
adding DDL to ALTER a column by adding/changing/dropping its default value, if 
that fits. Such DDL would create a new schema in the metadata file, and a new 
snapshot.

   > A field is required and has a default value, What does this mean on 
writing/rewriting + reading. 

   Default values are used only when reading data. When a data row is missing 
a column id that has a default value, the default value (from the schema) is 
returned.

   >What happens if this default is changed in the schema but the column id 
remains the same.

   This is the normal case, right? The opposite case is the interesting one: 
if the column id changes, I am not sure what that would even mean. Is the 
column deleted and a new one with the same name introduced? In any case, the 
default value defined for column X in the schema is only ever used with X's 
column id.

   > 
   > A field is optional and has a default value, What does this mean on 
writing/rewriting + reading. What happens if this default is changed ...

   Optional vs. required only makes a difference when reading data if no 
default value is defined. If a default value is defined and the column id is 
missing, the default value is used in both cases (i.e., for optional and 
required fields alike). If no default value is defined, then an exception is 
thrown only if the field is required. For reference: 
https://github.com/linkedin/iceberg/pull/72/files#diff-40083c166e284232643fa343534c626bca09d488537c226bb324be6169cab571R109
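
   The rules above can be sketched as a small decision function. This is plain 
Python, not the code in the linked PR: `Field`, `NO_DEFAULT` and 
`value_for_missing_column` are hypothetical names, and returning null (`None`) 
for an optional field without a default is an assumption on my side.

```python
from dataclasses import dataclass
from typing import Any

NO_DEFAULT = object()  # sentinel: "no default value is defined"


@dataclass(frozen=True)
class Field:
    field_id: int
    required: bool
    default: Any = NO_DEFAULT


def value_for_missing_column(field):
    """Value to return when a data file does not contain field.field_id."""
    if field.default is not NO_DEFAULT:
        return field.default  # same for optional and required fields
    if field.required:
        raise ValueError(
            f"required field {field.field_id} is missing and has no default")
    return None  # assumption: optional field without a default reads as null


print(value_for_missing_column(Field(3, required=True, default="foo")))   # foo
print(value_for_missing_column(Field(3, required=False, default="foo")))  # foo
print(value_for_missing_column(Field(4, required=False)))                 # None
```

   A sentinel is used instead of `None` so that "no default defined" stays 
distinguishable from an explicit null default.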

   > 
   > Can a default valued field be applied to files where the column ID of the 
field is not present? 

   If I read this correctly, the question is: suppose the table had columns A 
and B when datafile1 was written, and we later add column X with default value 
d_x. Can we read the default value while reading datafile1? The answer is yes. 
In fact, support for non-null default values was motivated by this scenario in 
particular.
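
   A minimal sketch of that scenario, in plain Python with made-up names 
(`SCHEMA`, `datafile1`, `datafile2`, `scan`; none of them are Iceberg APIs). 
It assumes a second file, datafile2, written after X was added, to show that 
stored values win over the default.

```python
# Hypothetical sketch; column ids, names and files are made up for illustration.
# Schema of the snapshot being read: column id -> (name, default or None).
SCHEMA = {1: ("A", None), 2: ("B", None), 3: ("X", "d_x")}

datafile1 = [{1: "a1", 2: "b1"}]            # written before X was added
datafile2 = [{1: "a2", 2: "b2", 3: "x2"}]   # written after, X is stored


def scan(files, schema):
    """Project every row onto the schema, using defaults for absent column ids."""
    out = []
    for rows in files:
        for row in rows:
            out.append({name: row.get(col_id, default)
                        for col_id, (name, default) in schema.items()})
    return out


for record in scan([datafile1, datafile2], SCHEMA):
    print(record)
# {'A': 'a1', 'B': 'b1', 'X': 'd_x'}
# {'A': 'a2', 'B': 'b2', 'X': 'x2'}
```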

   > I believe this is part of the idea here but i'm not sure how it would be 
defined. Is the idea that if the column ID is not present in a given data file 
do we always just return the default of the current schema?

   Correct (where "current" means the schema of the snapshot we are reading).

   > Do defaults exist retroactively or are they always forward looking?

   Defaults can exist retroactively, since we can read older data files using 
newer snapshots’ schemata.
