afilipchik commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408480553
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##########
 @@ -104,15 +105,15 @@ public static Schema addMetadataFields(Schema schema) {
     List<Schema.Field> parentFields = new ArrayList<>();
 
     Schema.Field commitTimeField =
-        new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field commitSeqnoField =
-        new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field recordKeyField =
-        new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field partitionPathField =
-        new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
     Schema.Field fileNameField =
-        new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", (Object) null);
+        new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
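
  For context on what the `(Object) null` -> `NullNode.getInstance()` change buys, here is a minimal sketch, assuming Avro 1.8.x with the Jackson 1 (`org.codehaus.jackson`) classes on the classpath; the `nullableString` union and the record names are illustrative stand-ins, not Hudi's actual `METADATA_FIELD_SCHEMA`. With `(Object) null` the field carries no default at all, so anything that has to fill it in (a record builder, or a rewrite of an existing record against the metadata schema) has nothing to fall back on; with an explicit JSON null default it can fall back to null.

```java
import java.util.Arrays;
import java.util.Collections;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.codehaus.jackson.node.NullNode;

public class MetadataFieldDefaultSketch {

  public static void main(String[] args) {
    // Nullable string field type, similar in spirit to Hudi's metadata field schema.
    Schema nullableString = Schema.createUnion(
        Arrays.asList(Schema.create(Schema.Type.NULL), Schema.create(Schema.Type.STRING)));

    // Old style: (Object) null means "this field has no default value at all".
    Schema.Field noDefault =
        new Schema.Field("_hoodie_commit_time", nullableString, "", (Object) null);
    Schema withoutDefault = Schema.createRecord("WithoutDefault", null, "example", false);
    withoutDefault.setFields(Collections.singletonList(noDefault));

    // New style: an explicit JSON null default via the Jackson 1 NullNode, as in the PR.
    Schema.Field nullDefault =
        new Schema.Field("_hoodie_commit_time", nullableString, "", NullNode.getInstance());
    Schema withDefault = Schema.createRecord("WithDefault", null, "example", false);
    withDefault.setFields(Collections.singletonList(nullDefault));

    // A record builder can only fall back to a default when one is actually declared.
    GenericRecord ok = new GenericRecordBuilder(withDefault).build();
    System.out.println("with default    -> " + ok);

    try {
      new GenericRecordBuilder(withoutDefault).build();
    } catch (org.apache.avro.AvroRuntimeException e) {
      // "Field _hoodie_commit_time ... not set and has no default value"
      System.out.println("without default -> " + e.getMessage());
    }
  }
}
```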
 
 Review comment:
   On schema generation, we had to do some extra work. Our service emits protobufs to kafka, which we transform to avro using a ProtoToAvro converter. Because those are service-to-service messages, they are not perfect, and we had to do some bending to make sure the schema can evolve. Some of the things we did:
   1) We made every field optional.
   2) Avro and Protobuf allow records with no fields, but parquet doesn't, so we inject a marker field called `exists` pretty much everywhere (see the sketch below). We do it inside HUDI to avoid clutter in kafka. Originally we added it only to records with no fields, but found that when real fields are added later and `exists` is auto-removed, the parquet-avro reader breaks, since it expects every field in the parquet schema to be present in the avro schema. I have a fix to parquet-avro that skips those fields and will send a PR so we can discuss whether it is a good idea or not.
   3) We also sometimes run a sql transformation (the same kafka stream produces multiple outputs). In that case we infer the schema from Spark (using NullTargetSchemaProvider). That schema must be correct as well, which it currently isn't, because Spark omits defaults, so compaction breaks. I have a PR to address that.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
