shenodaguirguis commented on pull request #2496: URL: https://github.com/apache/iceberg/pull/2496#issuecomment-1034297274
Thanks @RussellSpitzer for the great questions. I took some time to make sure I am thinking with the general Iceberg mindset; please find my responses below. If they make sense, I will go ahead and update the spec accordingly. Let me start with the example you wrote:

> An example being:
>
> Say I have file A' without column id 3 which has a default in my spec of "foo". If I read rows out of fileA' with this schema do I return foo? If I later read the table where the spec has a default for column3 of "bar" do I read bar? If I rewrite the datafiles inbetween the schema changes do I get foo or bar or null?

The default value is part of the table's schema/spec, so it lives in the Iceberg metadata file. Whenever the schema is updated, producing a new schema ID, subsequent reads will see the new value. When reading an older snapshot, i.e., using that snapshot's older schema, the default value saved in that older schema is used. So, in your example, reading file A' the first time (snapshot 0) returns "foo", while reading it after updating the spec/schema (new snapshot 1) returns "bar". Reading snapshot 0 after the schema update still returns "foo". Rewriting data files produces a new snapshot that derives from the latest snapshot and keeps its schema, so a rewrite is orthogonal here: the read behavior is the same.

> For example, does a writer upon rewriting a file with a default value set, materialize that value?

No. Default values belong to the table's spec/schema and are used only at read time, when the column is missing from the data file (i.e., not materialized).

> I think "manifested" needs explanation as well. If defaults are used on writers I think we probably need to explain that in the writer section.

Agreed, I will use "materialized" instead. Also, defaults are not used when writing data files.
However, it is worth mentioning that we might consider adding DDL (ALTER COLUMN) for adding, changing, or dropping default values, if that fits. Such DDL would create a new schema in the metadata file and a new snapshot.

> A field is required and has a default value, What does this mean on writing/rewriting + reading.

Default values are used only when reading data. When a data row is missing a column id that has a default value, the default value from the schema is returned.

> What happens if this default is changed in the schema but the column id remains the same.

This is the usual case, right? The opposite case is the interesting one: if the column id changes, I am not sure what that would even mean. Is the column deleted and a new one with the same name introduced? In any case, the default value defined for column X in the schema is only ever used with X's column id.

> A field is optional and has a default value, What does this mean on writing/rewriting + reading. What happens if this default is changed ...

Optional vs. required only makes a difference at read time when no default value is defined. If a default value is defined and the column id is missing, the default value is used in both cases (optional and required). If no default value is defined, an exception is thrown only if the field is required. For reference: https://github.com/linkedin/iceberg/pull/72/files#diff-40083c166e284232643fa343534c626bca09d488537c226bb324be6169cab571R109

> Can a default valued field be applied to files where the column ID of the field is not present?

If I read this correctly, the question is: suppose the table had columns A and B and datafile1 was written; later we add column X with default value d_x. Can we get the default value when reading datafile1? The answer is yes. In fact, support for non-null default values was motivated by exactly this scenario.
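
To make the read-side rules above concrete, here is a minimal Python sketch. It is not Iceberg code: `Field`, `resolve_value`, `read_row`, and the dict-based schema and snapshot tables are hypothetical names made up for illustration, and the one-to-one mapping from snapshot to schema id is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

# Sentinel so that "no default defined" is distinct from a default of None.
_NO_DEFAULT = object()


@dataclass(frozen=True)
class Field:
    """Hypothetical schema field; not an Iceberg class."""
    field_id: int
    name: str
    required: bool
    default: Any = _NO_DEFAULT


def resolve_value(field: Field, row: Dict[int, Any]) -> Any:
    """Value a reader produces for `field` from a data-file row keyed by column id."""
    if field.field_id in row:
        return row[field.field_id]      # column was materialized in the file
    if field.default is not _NO_DEFAULT:
        return field.default            # same rule for optional and required fields
    if field.required:
        raise ValueError(f"required field {field.name!r} is missing and has no default")
    return None                         # optional, no default


# Two schema versions: column id 3 changes its default from "foo" to "bar".
schemas: Dict[int, List[Field]] = {
    0: [Field(1, "a", True), Field(3, "c", True, default="foo")],
    1: [Field(1, "a", True), Field(3, "c", True, default="bar")],
}
# Assumption for this sketch: each snapshot records the schema id it is read with.
snapshot_schema_id = {0: 0, 1: 1}


def read_row(snapshot_id: int, row: Dict[int, Any]) -> Dict[str, Any]:
    """Project one stored row through the schema of the snapshot being read."""
    schema = schemas[snapshot_schema_id[snapshot_id]]
    return {f.name: resolve_value(f, row) for f in schema}


# A row from file A', written before column id 3 existed.
row_from_file_a = {1: "x"}
print(read_row(0, row_from_file_a))  # column "c" resolves to "foo"
print(read_row(1, row_from_file_a))  # column "c" resolves to "bar"
```

The stored row never changes in this sketch; only the schema used to read it does, which is why the same file yields "foo" under snapshot 0 and "bar" under snapshot 1.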
> I believe this is part of the idea here but i'm not sure how it would be defined. Is the idea that if the column ID is not present in a given data file do we always just return the default of the current schema?

Correct, where "current" means the schema of the snapshot we are reading.

> Do defaults exist retroactively or are they always forward looking?

Defaults can apply retroactively, since we can read older data files using newer snapshots' schemas.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
