zeeshan-media opened a new issue, #10507:
URL: https://github.com/apache/hudi/issues/10507

   ### Problem Detail
   I am trying out Hudi's record index on my machine. Although my PySpark job runs 
smoothly and the data is written, along with the creation of the record_index 
files in Hudi's metadata table, it emits the following warning:
    _WARN SparkMetadataTableRecordIndex: Record index not initialized so 
falling back to GLOBAL_SIMPLE for tagging records._
   Does this mean my record index is not working? Also, for just 200 MB of 
parquet data, it is creating 30 files in the output directory.
   
   
   ### Environment Description
   PySpark version: 3.3.0
   Hudi version: 0.14.0
    
   I have also tried this on EMR 6.15 with:
   PySpark version: 3.4.1
   Hudi version: 0.14.0
   
   
   ### The warning is generated from this part of the Hudi code
   
![image](https://github.com/apache/hudi/assets/64635236/714d3032-0fa3-4ed2-be31-1927ea5afe31)
   
   **Code Link**
   
[SparkMetaDataTableRecordIndex.java](https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/SparkMetadataTableRecordIndex.java)
   
   ### How to Reproduce
   ```python
   from faker import Faker
   import pandas as pd
   from pyspark.sql import SparkSession
   import pyspark.sql.functions as F


   # ..........................  Fake Data Generation  ..........................
   fake = Faker()
   data = [{"ID": fake.uuid4(), "EventTime": fake.date_time(),
            "FullName": fake.name(), "Address": fake.address(),
            "CompanyName": fake.company(), "JobTitle": fake.job(),
            "EmailAddress": fake.email(), "PhoneNumber": fake.phone_number(),
            "RandomText": fake.sentence(), "City": fake.city(),
            "State": fake.state(), "Country": fake.country()}
           for _ in range(1000)]
   pandas_df = pd.DataFrame(data)


   # ..........................   Hoodie Properties    ..........................
   hoodie_properties = {
       'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
       'hoodie.datasource.write.operation': 'upsert',
       'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.DefaultHoodieRecordPayload',
       'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
       'hoodie.datasource.write.hive_style_partitioning': 'true',
       'hoodie.datasource.write.recordkey.field': 'ID,State,City',
       'hoodie.metadata.enable': 'true',
       'hoodie.table.name': 'record_index',
       'hoodie.enable.data.skipping': 'true',
       'hoodie.index.type': 'RECORD_INDEX',
       'hoodie.metadata.record.index.enable': 'true',
       'hoodie.datasource.write.precombine.field': 'EventTime',
       'hoodie.payload.ordering.field': 'EventTime',
       'hoodie.datasource.write.partitionpath.field': 'partition',
       'hoodie.datasource.write.drop.partition.columns': 'true',
   }

   if __name__ == '__main__':
       # Adjust the jar paths below according to your machine.
       with SparkSession.builder.appName("hudi_record_index") \
           .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
           .config("spark.jars", "/jars/hadoop-lzo.jar,/jars/hudi-spark3.3-bundle_2.12-0.14.0.jar") \
           .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog") \
           .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension") \
           .config("spark.hadoop.parquet.avro.write-old-list-structure", "false") \
           .config("spark.sql.adaptive.enabled", "false") \
           .config("spark.dynamicAllocation.enabled", "true") \
           .getOrCreate() as spark:

           spark.sparkContext.setLogLevel("WARN")

           df = spark.createDataFrame(pandas_df)
           df = df.withColumn("partition", F.lit("record_index"))

           df.write.format("hudi").options(**hoodie_properties) \
               .mode("overwrite").save("your_output_file_path")
   ```
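   
   As a quick sanity check after the write, you can look for the record index 
partition on disk. This is only a sketch: the `.hoodie/metadata/record_index` 
layout is assumed from how the metadata table is stored, and the check only 
works for a local filesystem base path (for S3/HDFS you would list via the 
corresponding filesystem client instead).
   
   ```python
   import os
   
   def record_index_initialized(base_path: str) -> bool:
       """Return True if the metadata table's record_index partition
       exists under the table base path (local filesystem only).
       Assumed layout: <base_path>/.hoodie/metadata/record_index
       """
       return os.path.isdir(os.path.join(base_path, ".hoodie", "metadata", "record_index"))
   
   # Point this at the same path passed to save() above.
   print(record_index_initialized("your_output_file_path"))
   ```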
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
