shenodaguirguis opened a new pull request #2496:
URL: https://github.com/apache/iceberg/pull/2496


   # Summary
   
   Iceberg schemas currently do not support default values, which makes it challenging to read Hive tables that are written in Avro format and have default values (see issue 2039). Specifically, if a field has a non-null default value, it is mapped to a required field, with no default value, in Iceberg. Thus, upon reading rows where this field is not present, an IllegalArgumentException (Missing required field) is thrown. Further, default values of nullable fields are lost silently: nullable fields with default values are mapped to optional fields with no default values, so null is returned when the field is absent, instead of the default value. This document describes how to support default value semantics in the Iceberg schema to resolve these issues.
   
   # Problem
   ## Default values are lost
   Default values are specified using the Avro schema keyword “default”. For example, the following is an Avro string field with the default value “unknown”:
   
   `{"name": "fieldName", "type": "string", "default": "unknown"}`
   
   Also, a nullable (optional) Avro field can define a default value as follows:
   
   `{"name": "fieldName", "type": ["null", "string"], "default": null}`
   
   Please note that nullability is specified via a union type (i.e., the `["null", "string"]`), and the default value’s type must match the first type in the union. In other words, the following are invalid:
   
   ```
   {"name": "fieldName", "type": ["null", "string"], "default": "unknown"}
   {"name": "fieldName", "type": ["string", "null"], "default": null}
   ```
   
   That is, if the default value is null, the first type in the union must be “null”; otherwise, if the default value is a string, the first type must be “string”.
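
   To make these semantics concrete, the following is a minimal sketch using the plain Avro Java library (nothing Iceberg-specific) showing how Avro’s schema resolution fills in the default when the writer’s schema lacks the field:
   
   ```java
   import java.io.ByteArrayOutputStream;
   
   import org.apache.avro.Schema;
   import org.apache.avro.generic.GenericData;
   import org.apache.avro.generic.GenericDatumReader;
   import org.apache.avro.generic.GenericDatumWriter;
   import org.apache.avro.generic.GenericRecord;
   import org.apache.avro.io.Decoder;
   import org.apache.avro.io.DecoderFactory;
   import org.apache.avro.io.Encoder;
   import org.apache.avro.io.EncoderFactory;
   
   public class AvroDefaultDemo {
     public static void main(String[] args) throws Exception {
       // Writer schema: the data was written without a "status" field.
       Schema writer = new Schema.Parser().parse(
           "{\"type\":\"record\",\"name\":\"r\",\"fields\":["
           + "{\"name\":\"id\",\"type\":\"long\"}]}");
       // Reader schema: adds "status" with default "unknown".
       Schema reader = new Schema.Parser().parse(
           "{\"type\":\"record\",\"name\":\"r\",\"fields\":["
           + "{\"name\":\"id\",\"type\":\"long\"},"
           + "{\"name\":\"status\",\"type\":\"string\",\"default\":\"unknown\"}]}");
   
       // Write a record that only has "id".
       GenericRecord rec = new GenericData.Record(writer);
       rec.put("id", 1L);
       ByteArrayOutputStream out = new ByteArrayOutputStream();
       Encoder enc = EncoderFactory.get().binaryEncoder(out, null);
       new GenericDatumWriter<GenericRecord>(writer).write(rec, enc);
       enc.flush();
   
       // Read it back with the evolved schema: Avro supplies the default.
       Decoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
       GenericRecord read =
           new GenericDatumReader<GenericRecord>(writer, reader).read(null, dec);
       System.out.println(read.get("status")); // prints "unknown"
     }
   }
   ```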
   
   When converting an Avro schema to Iceberg types, there are two cases: a nullable field maps to an optional NestedField, while a non-nullable field maps to a required NestedField. In both cases the default value is lost, since NestedField does not support default values, and this leads to wrong handling of the data when it is read: for a non-nullable field, an error is thrown if the field is not present, whereas for a nullable field the absence silently passes as an optional field and null is returned.
   ## Where in code this breaks
   When reading Avro records, [AvroSchemaUtil.buildAvroProjection() ](https://github.com/apache/iceberg/blob/aba898b1a2ea15fd091228626b6887a5a72800c0/core/src/main/java/org/apache/iceberg/avro/AvroSchemaUtil.java#L103) is invoked, which in turn invokes [BuildAvroProjection.record() ](https://github.com/linkedin/iceberg/blob/14b1d891522cfd111f50531e78ba63b8a60ccf1f/core/src/main/java/org/apache/iceberg/avro/BuildAvroProjection.java#L53) to construct Iceberg’s record. When reading rows where a field with a default value is not present in the data file, the code path checks whether the field is optional. If the field has a null default value, it is nullable, is therefore mapped to an optional field, and is simply skipped; if it has a non-null default value, this check throws an exception.
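
   For reference, the failing check has roughly this shape (a paraphrase for illustration, not the exact Iceberg code):
   
   ```java
   // Paraphrased: a projected field that is absent from the file schema must
   // be optional; otherwise the projection fails with the exception above.
   Preconditions.checkArgument(field.isOptional(),
       "Missing required field: %s", field.name());
   ```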
   
   # Solution
   ## Overview
   The fix is to add the default value to the [NestedField](https://github.com/linkedin/iceberg/blob/14b1d891522cfd111f50531e78ba63b8a60ccf1f/api/src/main/java/org/apache/iceberg/types/Types.java#L415), and to add the relevant APIs to copy the default value over when converting from the Avro schema and to use it while reading. Non-nullable fields with default values clearly need to be modeled as required fields with default values. The default value is also useful for schema evolution, for example when a required field is added: while reading older data/partitions, the default value will be returned. By similar reasoning about schema evolution, nullable fields with default values should be modeled as optional fields with default values, and the default value should be used instead of just returning null for absent optional fields. This covers two cases: (a) a null default value (as in `{"name": "fieldName", "type": ["null", "string"], "default": null}`), and (b) a non-null default value (as in `{"name": "fieldName", "type": ["string", "null"], "default": "defValue"}`).
   ## ORC and Parquet
   While Avro supports default value semantics, and Avro libraries can be used to read fields with default values, neither the ORC nor the Parquet format supports default value semantics. Iceberg must nevertheless provide consistent behavior and semantics across file formats. Therefore, once default value semantics are added to the Iceberg schema, the ORC and Parquet readers should be modified to handle them properly: when reading a field that is not present in the data file but has a default value, the default value should be used.
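
   A hedged sketch of that reader-side fallback follows; every name below (readerFor, FileColumn, ConstantReader, newColumnReader) is illustrative, not an existing Iceberg API:
   
   ```java
   // Illustrative only: resolving a projected field that is absent from the
   // data file. All names here are hypothetical.
   ValueReader<?> readerFor(Types.NestedField field, FileColumn fileColumn) {
     if (fileColumn == null) {                  // field not present in this file
       if (field.hasDefaultValue()) {
         return ConstantReader.of(field.getDefaultValue()); // emit the default
       } else if (field.isOptional()) {
         return ConstantReader.of(null);        // current behavior for optionals
       }
       throw new IllegalArgumentException("Missing required field: " + field.name());
     }
     return newColumnReader(field, fileColumn); // normal per-column reader
   }
   ```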
   
   # Planned Code Changes
   
   - First (this) PR: API changes. An Avro record schema is mapped to the Iceberg StructType, which consists of an array of NestedFields. To support default values, NestedField therefore needs to be made default-value aware. This involves adding APIs to create NestedField objects with default values, as well as to get and check for the default value.
   - Second PR: schema mapping changes, to copy default values over to the Iceberg schema during schema mapping/conversion (a rough conversion sketch follows this list).
   - Third PR: Avro reader changes, to use the default value when needed.
   - Fourth PR: ORC reader changes.
   - Fifth PR: Parquet reader changes.
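
   For the second PR, copying a default during Avro-to-Iceberg schema conversion could look roughly like the sketch below. Schema.Field.hasDefaultValue() and defaultVal() are real Avro (1.9+) APIs; the four-argument NestedField.optional()/required() overloads are the hypothetical default-aware additions from the first PR:
   
   ```java
   // Illustrative only: propagate an Avro field default to the Iceberg field.
   // Note: for a null default, Avro's defaultVal() returns the sentinel
   // JsonProperties.NULL_VALUE, which the conversion would need to translate.
   Types.NestedField toIcebergField(
       int id, Schema.Field avroField, Type icebergType, boolean isOptional) {
     if (avroField.hasDefaultValue()) {
       Object defaultValue = avroField.defaultVal();
       return isOptional
           ? Types.NestedField.optional(id, avroField.name(), icebergType, defaultValue)
           : Types.NestedField.required(id, avroField.name(), icebergType, defaultValue);
     }
     return isOptional
         ? Types.NestedField.optional(id, avroField.name(), icebergType)
         : Types.NestedField.required(id, avroField.name(), icebergType);
   }
   ```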
   

