ad1happy2go commented on issue #10303:
URL: https://github.com/apache/hudi/issues/10303#issuecomment-1895009159

   @srinikandi Sorry for the delay on this. 
   
   I was able to reproduce the issue with Hudi versions 0.12.1 and 0.14.1. We have introduced the config `hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled`; setting it to `true` resolves the record-key inconsistency.
   
   ```java
     public static final ConfigProperty<String> KEYGENERATOR_CONSISTENT_LOGICAL_TIMESTAMP_ENABLED = ConfigProperty
         .key("hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled")
         .defaultValue("false")
         .withDocumentation("When set to true, consistent value will be generated for a logical timestamp type column, "
             + "like timestamp-millis and timestamp-micros, irrespective of whether row-writer is enabled. Disabled by default so "
             + "as not to break the pipeline that deploy either fully row-writer path or non row-writer path. For example, "
             + "if it is kept disabled then record key of timestamp type with value `2016-12-29 09:54:00` will be written as timestamp "
             + "`2016-12-29 09:54:00.0` in row-writer path, while it will be written as long value `1483023240000000` in non row-writer path. "
             + "If enabled, then the timestamp value will be written in both the cases.");
   ```
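For reference, the long value in that documentation is the epoch timestamp in microseconds. A quick sketch (plain Python, no Hudi needed) shows the conversion; note that the documented example `1483023240000000` corresponds to `2016-12-29 09:54:00` in a UTC-5 session timezone, and the exact long depends on the timezone in effect when the timestamp is parsed:

```python
from datetime import datetime, timezone, timedelta

# The non-row-writer path renders a logical timestamp as epoch microseconds.
# Assuming a UTC-5 session timezone to match the documented example value.
ts = datetime(2016, 12, 29, 9, 54, 0, tzinfo=timezone(timedelta(hours=-5)))
micros = int(ts.timestamp() * 1_000_000)
print(micros)  # 1483023240000000
```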
   
   Reproducible code below; the upsert matches the bulk-inserted records once the config is flipped to `true`:
   
   ```python
   from faker import Faker
   import pandas as pd
   from pyspark.sql import SparkSession
   from pyspark.sql.functions import expr, lit
   
   # ..........................  Fake Data Generation  ...........................
   fake = Faker()
   data = [{"transactionId": fake.uuid4(), "EventTime": "2014-01-01 23:00:01", "storeNbr": "1",
            "FullName": fake.name(), "Address": fake.address(),
            "CompanyName": fake.company(), "JobTitle": fake.job(),
            "EmailAddress": fake.email(), "PhoneNumber": fake.phone_number(),
            "RandomText": fake.sentence(), "City": fake.city(),
            "State": "NYC", "Country": "US"} for _ in range(5)]
   pandas_df = pd.DataFrame(data)
   
   hoodi_configs = {
       "hoodie.insert.shuffle.parallelism": "1",
       "hoodie.upsert.shuffle.parallelism": "1",
       "hoodie.bulkinsert.shuffle.parallelism": "1",
       "hoodie.delete.shuffle.parallelism": "1",
       "hoodie.datasource.write.row.writer.enable": "true",
       "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
       "hoodie.datasource.write.recordkey.field": "transactionId,storeNbr,EventTime",
       "hoodie.datasource.write.precombine.field": "Country",
       "hoodie.datasource.write.partitionpath.field": "State",
       "hoodie.datasource.write.payload.class": "org.apache.hudi.common.model.DefaultHoodieRecordPayload",
       "hoodie.datasource.write.hive_style_partitioning": "true",
       "hoodie.combine.before.upsert": "true",
       "hoodie.table.name": "huditransaction",
       # Flip to "true" to keep record keys consistent across the two write paths.
       "hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled": "false",
   }
   
   spark = SparkSession.builder.getOrCreate()
   spark.sparkContext.setLogLevel("WARN")
   
   PATH = "/tmp/huditransaction"  # example path; point this at your table location
   
   df = spark.createDataFrame(pandas_df).withColumn("EventTime", expr("cast(EventTime as timestamp)"))
   
   # bulk_insert goes through the row-writer path
   (df.write.format("hudi")
      .options(**hoodi_configs)
      .option("hoodie.datasource.write.operation", "bulk_insert")
      .mode("overwrite")
      .save(PATH))
   
   spark.read.options(**hoodi_configs).format("hudi").load(PATH) \
       .select("_hoodie_record_key").show(10, False)
   
   # upsert goes through the non-row-writer path
   (df.withColumn("City", lit("updated_city"))
      .write.format("hudi")
      .options(**hoodi_configs)
      .option("hoodie.datasource.write.operation", "upsert")
      .mode("append")
      .save(PATH))
   
   spark.read.options(**hoodi_configs).format("hudi").load(PATH) \
       .select("_hoodie_record_key").show(10, False)
   ```
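For context on why the upsert misses the existing rows, ComplexKeyGenerator builds `_hoodie_record_key` as comma-separated `field:value` pairs, so the two renderings of the timestamp produce two different keys for the same record. A rough sketch of the mismatch (plain Python string formatting with made-up values, not Hudi's actual key-generation code):

```python
# Illustrative only: mimics the "field1:value1,field2:value2" shape of
# ComplexKeyGenerator's composite key; the values are made up.
def composite_key(values):
    return ",".join(f"{field}:{value}" for field, value in values.items())

base = {"transactionId": "abc-123", "storeNbr": "1"}

# Row-writer path (bulk_insert) renders the logical timestamp as a string...
row_writer_key = composite_key({**base, "EventTime": "2014-01-01 23:00:01.0"})

# ...while the non-row-writer path (upsert) renders it as epoch microseconds,
# so the same record yields two different keys and the upsert inserts a duplicate.
non_row_writer_key = composite_key({**base, "EventTime": "1388617201000000"})

print(row_writer_key)
print(non_row_writer_key)
```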
   
   Let me know if you need any more help on this. Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]