[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1427: [HUDI-727]: Copy default values of fields if not present when rewriting incoming record with new schema

GitBox Wed, 25 Mar 2020 00:45:36 -0700

pratyakshsharma commented on a change in pull request #1427: [HUDI-727]: Copy 
default values of fields if not present when rewriting incoming record with new 
schema
URL: https://github.com/apache/incubator-hudi/pull/1427#discussion_r397658217


 ##########
 File path: 
hudi-common/src/test/java/org/apache/hudi/common/util/TestHoodieAvroUtils.java
 ##########
 @@ -57,4 +60,16 @@ public void testPropsPresent() {
     }
     Assert.assertTrue("column pii_col doesn't show up", piiPresent);
   }
+
+  @Test
+  public void testDefaultValue() {
+    GenericRecord rec = new GenericData.Record(new 
Schema.Parser().parse(EXAMPLE_SCHEMA));
+    rec.put("_row_key", "key1");
+    rec.put("non_pii_col", "val1");
+    rec.put("pii_col", "val2");
+    rec.put("timestamp", 3.5);
 
 Review comment:
   > conversion to avro is internal to Hudi and a custom avro schema (with 
default values) is not something that user can themselves pass
   
   I did not understand this. As a user I can always specify the schema that is 
to be used either via FileBasedSchemaProvider or using schema registry. 
   
   Let me give you an example. Suppose there is some table with schema S1 and 
you have published some records (R1 and R2) with this schema into kafka. Next 
you evolve the schema (it now becomes S2) and a new nullable field is added as 
below -> 
   
   {"name": "col1", "type":["string", "null"], "default": "dummy"}
   
   you again publish some records (R3 and R4) with S2 and now start consuming 
with delta streamer. So your kafka topic is having records with both the 
schemas and delta streamer is using S2 as target schema. Now while writing to 
parquet, I want R1 and R2 to be written with this default value "dummy" for 
field "col1", which is a pretty common case. Generally users prefer to have 
some default value for newly added fields rather than having written them as 
null. How do you achieve this without this PR? 
   
   Open to hearing your thoughts on this. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1427: [HUDI-727]: Copy default values of fields if not present when rewriting incoming record with new schema

Reply via email to