shenodaguirguis opened a new pull request #2496:
URL: https://github.com/apache/iceberg/pull/2496
# Summary
The Iceberg schema currently does not support default values, which makes it
difficult to read Hive tables that are written in Avro format and define default
values (see issue 2039). Specifically, if a field has a non-null default value,
it is mapped to a required field - with no default value - in Iceberg. Thus,
when reading rows where this field is absent, an IllegalArgumentException
("Missing required field") is thrown. Further, default values of nullable
fields are lost silently: nullable fields with default values are mapped to
optional fields with no default values, so null is returned when the field is
absent, instead of the default value. This document describes how to support
default value semantics in the Iceberg schema to resolve these issues.
# Problem
## Default values are lost
Default values are specified using the Avro schema keyword `"default"`. For
example, the following Avro string field has the default value "unknown":
`{"name": "fieldName", "type": "string", "default": "unknown"}`
A nullable (optional) Avro field can also define a default value, as follows:
`{"name": "fieldName", "type": ["null", "string"], "default": null}`
Please note that nullability is specified via a UNION type (i.e., the `["null",
"string"]`), and the default value's type must match the first type in the
union. In other words, the following are invalid:
```
{"name": "fieldName", "type": ["null", "string"], "default": "unknown"}
{"name": "fieldName", "type": ["string", "null"], "default": null}
```
That is, if the default value is null, the first type of the union must be
"null"; otherwise, if the default value is a string, the first type of the
union must be "string".
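The union-ordering rule above can be sketched as a small validation check. This is a stdlib-only Python illustration (not Iceberg or Avro library code), covering only the "null" and "string" branches used in the examples:

```python
import json

def default_matches_first_branch(field_json):
    """Check the Avro rule illustrated above: for a union-typed field
    with a default, the default value's type must match the FIRST
    branch of the union. Minimal sketch for "null"/"string" branches."""
    field = json.loads(field_json)
    branches = field["type"]
    if not isinstance(branches, list) or "default" not in field:
        return True  # not a union, or no default to check
    first = branches[0]
    default = field["default"]
    if first == "null":
        return default is None
    if first == "string":
        return isinstance(default, str)
    return True  # other branch types omitted in this sketch

# The valid and invalid examples from the text:
default_matches_first_branch(
    '{"name": "fieldName", "type": ["null", "string"], "default": null}')
# → True
default_matches_first_branch(
    '{"name": "fieldName", "type": ["null", "string"], "default": "unknown"}')
# → False (default's type must match the first branch, "null")
```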
When converting an Avro schema to Iceberg types, there are two cases. If the
field is nullable, it maps to an optional NestedField; if it is non-nullable,
it maps to a required NestedField. In both cases the default value is dropped,
since NestedField does not support default values, which leads to wrong
handling of the data when it is read: for non-nullable fields, an error is
thrown if the field is not present, whereas for nullable fields the field is
treated as a plain optional field and its default is silently replaced by null.
## Where in code this breaks
When reading Avro records, [AvroSchemaUtil::buildAvroProjection()
](https://github.com/apache/iceberg/blob/aba898b1a2ea15fd091228626b6887a5a72800c0/core/src/main/java/org/apache/iceberg/avro/AvroSchemaUtil.java#L103)
is invoked, which invokes [BuildAvroProjection.record()
](https://github.com/linkedin/iceberg/blob/14b1d891522cfd111f50531e78ba63b8a60ccf1f/core/src/main/java/org/apache/iceberg/avro/BuildAvroProjection.java#L53)
to construct Iceberg's record. When reading rows where a field with a default
value is not present in the data file, the code path checks whether the field
is optional. If the field has a null default value, it is nullable, and is
therefore mapped to an optional field and simply skipped; if it has a non-null
default value, this check throws an exception.
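The failure mode above can be paraphrased as follows. This is an illustrative Python sketch of the current behavior, not the actual Java code in BuildAvroProjection, and the function name is hypothetical:

```python
def project_missing_field(is_optional, default=None):
    """Hypothetical paraphrase of today's behavior when a projected
    field is absent from the data file: the field's default value is
    never consulted, so optional fields silently become null, and
    required fields (fields with a non-null Avro default) raise."""
    if is_optional:
        return None  # the default, if any, is silently dropped
    raise ValueError("Missing required field")
```

For a nullable field with default null this happens to produce the right answer; for a field with a non-null default, the read fails outright.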
# Solution
## Overview
The fix is simply to add the default value to
[NestedField](https://github.com/linkedin/iceberg/blob/14b1d891522cfd111f50531e78ba63b8a60ccf1f/api/src/main/java/org/apache/iceberg/types/Types.java#L415),
and to add the relevant APIs to copy the default value over when converting
from the Avro schema and to use it while reading. Non-nullable fields with
default values clearly need to be modeled as required fields with default
values. The default value here can also be used in schema evolution: for
example, if a required field is added, the default value will be returned while
reading older data/partitions. For nullable fields, by similar reasoning about
schema evolution, we should model these fields as optional with default values
and use the default value, instead of just null, for absent optional fields.
This covers both cases: (a) a null default value (as in: `{"name": "fieldName",
"type": ["null", "string"], "default": null}`), and (b) a non-null default
value (as in: `{"name": "fieldName", "type": ["string", "null"], "default":
"defValue"}`).
## ORC and Parquet
While Avro supports default value semantics, and Avro libraries can be used to
read fields with default values, neither the ORC nor the Parquet format
supports default values. Behavior and semantics must be consistent across file
formats, however, so once default value semantics are added to the Iceberg
schema, the ORC and Parquet readers should be modified to handle them properly.
Specifically, when reading a field that is not present in the data but has a
default value, the default value should be used.
# Planned Code Changes
- First (this) PR: API changes: An Avro schema of type record is mapped
to the Iceberg StructType, which consists of an array of NestedField's.
Therefore, to support default values we need to make NestedField default-value
aware. This involves adding APIs to create NestedField objects with default
values, as well as to get and check for the default value.
- Second PR: schema mapping changes: to copy over default values to Iceberg
schema during schema mapping/conversion
- Third PR: Avro Reader changes to use the default value when needed
- Fourth PR: ORC Reader changes
- Fifth PR: Parquet Reader changes
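As a rough illustration of the first PR's direction, a default-aware field needs an explicit "has a default" notion separate from a null default. The sketch below is a hypothetical Python mirror of the Java change (the real work is in Types.NestedField; every name here is illustrative, not the proposed API):

```python
from dataclasses import dataclass
from typing import Any

_NO_DEFAULT = object()  # "no default defined", distinct from a null default

@dataclass(frozen=True)
class NestedFieldSketch:
    """Hypothetical mirror of a default-value-aware NestedField."""
    field_id: int
    name: str
    is_optional: bool
    default_value: Any = _NO_DEFAULT

    def has_default(self) -> bool:
        return self.default_value is not _NO_DEFAULT
```

The separate `has_default()` check matters because an optional field's default may legitimately be null, so `default_value is None` alone cannot tell the two cases apart.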
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]