ad1happy2go commented on issue #10507:
URL: https://github.com/apache/hudi/issues/10507#issuecomment-1893742984

   @zeeshan-media Thanks for raising this.
   I tried the code and realised that on the first write to an empty table, Hudi logs this warning because the `record_index` partition does not yet exist in the metadata table (there is no data yet). It therefore falls back to GLOBAL_SIMPLE for index tagging, which doesn't matter anyway since the table is empty.
   
   I confirmed that on the next run it uses RECORD_INDEX properly (verified on the Spark UI as well), and the warning no longer appears.
   
   
   ```
   from faker import Faker
   import pandas as pd
   from pyspark.sql import functions as F
   
   PATH = "/tmp/record_index_demo"  # adjust to your target table path
   
   fake = Faker()
   data = [{"ID": fake.uuid4(), "EventTime": fake.date_time(),
            "FullName": fake.name(), "Address": fake.address(),
            "CompanyName": fake.company(), "JobTitle": fake.job(),
            "EmailAddress": fake.email(), "PhoneNumber": fake.phone_number(),
            "RandomText": fake.sentence(), "City": fake.city(),
            "State": fake.state(), "Country": fake.country()}
           for _ in range(1000)]
   pandas_df = pd.DataFrame(data)
   
   hoodie_properties = {
       'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
       'hoodie.datasource.write.operation': 'upsert',
       'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.DefaultHoodieRecordPayload',
       'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
       'hoodie.datasource.write.hive_style_partitioning': 'true',
       'hoodie.datasource.write.recordkey.field': 'ID,State,City',
       'hoodie.metadata.enable': 'true',
       'hoodie.table.name': 'record_index',
       'hoodie.enable.data.skipping': 'true',
       'hoodie.index.type': 'RECORD_INDEX',
       'hoodie.metadata.record.index.enable': 'true',
       'hoodie.datasource.write.precombine.field': 'EventTime',
       'hoodie.payload.ordering.field': 'EventTime',
       'hoodie.datasource.write.partitionpath.field': 'partition',
       'hoodie.datasource.write.drop.partition.columns': 'true'
   }
   
   spark.sparkContext.setLogLevel("WARN")
   
   df = spark.createDataFrame(pandas_df)
   df = df.withColumn("partition", F.lit("record_index"))
   
   # First write: the table is empty, so the GLOBAL_SIMPLE fallback warning appears
   df.write.format("hudi").options(**hoodie_properties).mode("overwrite").save(PATH)
   spark.read.format("hudi").load(PATH).show()
   
   # Second write: the record index now exists, RECORD_INDEX is used and no warning is logged
   df.withColumn("City", F.lit("updated_city")).write.format("hudi").options(**hoodie_properties).mode("append").save(PATH)
   spark.read.format("hudi").load(PATH).show()
   ```
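
   As a quick sanity check for the behaviour above, you can look for the `record_index` partition under the metadata table directory before each write (a minimal sketch, assuming a local filesystem path and the standard `.hoodie/metadata` layout; the helper name `has_record_index` is my own):

   ```python
   import os

   def has_record_index(table_path):
       """Return True if the record_index partition exists in the metadata table.

       Assumes a local filesystem path; for cloud storage, list objects under
       <table_path>/.hoodie/metadata/record_index with the storage client instead.
       """
       return os.path.isdir(os.path.join(table_path, ".hoodie", "metadata", "record_index"))

   # Before the very first write this returns False, so Hudi tags records with
   # GLOBAL_SIMPLE; on subsequent writes it returns True and RECORD_INDEX is used.
   ```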

