afilipchik commented on a change in pull request #1513: [HUDI-793] Adding
proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408480553
##########
File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
##########
@@ -104,15 +105,15 @@ public static Schema addMetadataFields(Schema schema) {
     List<Schema.Field> parentFields = new ArrayList<>();
     Schema.Field commitTimeField =
-        new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field commitSeqnoField =
-        new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field recordKeyField =
-        new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field partitionPathField =
-        new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field fileNameField =
-        new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
Review comment:
On schema generation, we had to do a few things. Our service emits protobufs
to Kafka, which we transform to Avro using a ProtoToAvro converter. Because
those are service-to-service messages, they are not perfect, and we had to do
some bending to make sure the schema can evolve. Specifically:
1) We made every field optional.
2) Avro and Protobuf allow records with no fields, but Parquet doesn't, so we
inject a marker field called `exists` pretty much everywhere. Originally we
added it only to records with no fields, but found out that when real fields
are added and `exists` is auto-removed, the parquet-avro reader breaks, because
it expects every field in the Parquet schema to be present in the Avro schema.
I have a fix to parquet-avro that skips those fields; I will send a PR so we
can discuss whether it is a good idea or not.
3) We also sometimes run a SQL transformation (the same Kafka stream produces
multiple outputs). In that case, we infer the schema from Spark (using
NullTargetSchemaProvider). That schema must be correct as well, which is not
currently the case because Spark omits defaults, so compaction breaks. I have
a PR to address this too.
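For context on why the diff above swaps `(Object) null` for `NullNode.getInstance()`: in Avro 1.8.x the `Schema.Field(String, Schema, String, Object)` overload converts the Java value to a Jackson `JsonNode`, and a plain Java `null` ends up meaning "this field has no default at all," while `NullNode.getInstance()` serializes as an explicit `"default": null`. A hedged sketch of the two resulting schema shapes (the `["null", "string"]` union is my assumption about what `METADATA_FIELD_SCHEMA` expands to; the record name is made up):

```json
{
  "type": "record",
  "name": "ExampleWithMetadata",
  "fields": [
    { "name": "field_from_object_null",  "type": ["null", "string"] },
    { "name": "_hoodie_commit_time",     "type": ["null", "string"], "default": null }
  ]
}
```

Only the second shape lets readers resolve a newer schema against older data by filling the missing field with null, which is what schema evolution on the metadata columns needs.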
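The marker-field trick in (2) can be illustrated with a minimal Avro schema (record name is hypothetical, chosen for illustration): a Protobuf message with no fields becomes a record whose only field is the injected `exists` boolean, giving Parquet at least one column to write:

```json
{
  "type": "record",
  "name": "EmptyEvent",
  "fields": [
    { "name": "exists", "type": "boolean", "default": true }
  ]
}
```

The catch described above is that once a real field is later added and `exists` is auto-removed, old Parquet files still carry the `exists` column, which the stock parquet-avro reader then expects to find in the Avro schema.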
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services