afilipchik commented on a change in pull request #1513: [HUDI-793] Adding
proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408480553
##########
File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
##########
@@ -104,15 +105,15 @@ public static Schema addMetadataFields(Schema schema) {
     List<Schema.Field> parentFields = new ArrayList<>();
     Schema.Field commitTimeField =
-        new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field commitSeqnoField =
-        new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field recordKeyField =
-        new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field partitionPathField =
-        new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field fileNameField =
-        new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
Review comment:
On schema generation, we had to do a few things. Our service emits protobufs
to Kafka, which we transform to Avro using a ProtoToAvro converter. Because
those are service-to-service messages, they are not perfect, and we had to do
some bending to make sure the schema can evolve. Specifically:
1) We made every field optional.
2) Avro and Protobuf allow records with no fields, but Parquet doesn't, so we
inject a marker field called `exists` pretty much everywhere. Originally we
added it only to records with no fields, but found out that when real fields
are added and `exists` is auto-removed, the parquet-avro reader breaks, because
it expects every field in the Parquet schema to be present in the Avro schema.
I have a fix to parquet-avro that skips those fields; I will send a PR so we
can discuss whether it is a good idea or not.
3) We also sometimes run a SQL transformation (the same Kafka stream produces
multiple outputs). In that case, we infer the schema from Spark (using
NullTargetSchemaProvider). That schema must be correct as well, which is not
currently the case because Spark omits defaults, so compaction breaks. I have
a PR to address this too.
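For context on why the diff above swaps `(Object) null` for `NullNode.getInstance()`: in Avro 1.8.x the `Schema.Field(String, Schema, String, Object)` overload converts the Java value to a Jackson `JsonNode`, and a plain Java `null` ends up meaning "this field has no default at all," while `NullNode.getInstance()` serializes as an explicit `"default": null`. A hedged sketch of the two resulting schema shapes (the `["null", "string"]` union is my assumption about what `METADATA_FIELD_SCHEMA` expands to; the record name is made up):

```json
{
  "type": "record",
  "name": "ExampleWithMetadata",
  "fields": [
    { "name": "field_from_object_null",  "type": ["null", "string"] },
    { "name": "_hoodie_commit_time",     "type": ["null", "string"], "default": null }
  ]
}
```

Only the second shape lets readers resolve a newer schema against older data by filling the missing field with null, which is what schema evolution on the metadata columns needs.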
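The marker-field trick in (2) can be illustrated with a minimal Avro schema (record name is hypothetical, chosen for illustration): a Protobuf message with no fields becomes a record whose only field is the injected `exists` boolean, giving Parquet at least one column to write:

```json
{
  "type": "record",
  "name": "EmptyEvent",
  "fields": [
    { "name": "exists", "type": "boolean", "default": true }
  ]
}
```

The catch described above is that once a real field is later added and `exists` is auto-removed, old Parquet files still carry the `exists` column, which the stock parquet-avro reader then expects to find in the Avro schema.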
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services