zeeshan-media opened a new issue, #10507:
URL: https://github.com/apache/hudi/issues/10507
### Problem Detail:
I am trying out the Hudi record index on my machine. Although my PySpark job runs smoothly and the data is written, along with the creation of the record_index files in Hudi's metadata table, it emits the following warning:
_WARN SparkMetadataTableRecordIndex: Record index not initialized so
falling back to GLOBAL_SIMPLE for tagging records._
Does this mean my record index is not working? For just 200 MB of parquet data, it is creating 30 files in the output directory.
### **Environment Description:**
- PySpark version: 3.3.0
- Hudi version: 0.14.0

I have also tried this on EMR 6.15 with:
- PySpark version: 3.4.1
- Hudi version: 0.14.0
### The warning is generated by this part of the Hudi code

**Code Link**
[SparkMetaDataTableRecordIndex.java](https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/SparkMetadataTableRecordIndex.java)
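For illustration, a minimal pure-Python sketch of the fallback behavior that produces this warning. The names and signatures here are hypothetical simplifications, not the actual Hudi API: the idea is that when the record index partition has not yet been initialized in the metadata table (e.g. on the very first write to the table), tagging falls back to the GLOBAL_SIMPLE index.

```python
import logging

logger = logging.getLogger("SparkMetadataTableRecordIndex")

def tag_location(records, record_index_initialized,
                 global_simple_tagger, record_index_tagger):
    """Hypothetical sketch: tag incoming records with their file locations.

    If the record index is not yet initialized, warn and delegate to the
    GLOBAL_SIMPLE tagging path; otherwise use the record index.
    """
    if not record_index_initialized:
        logger.warning(
            "Record index not initialized so falling back to "
            "GLOBAL_SIMPLE for tagging records."
        )
        return global_simple_tagger(records)
    return record_index_tagger(records)
```

Under this reading, the warning on a first write would be expected (the index is built as part of that commit), while subsequent upserts should take the record-index path.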
### How to Reproduce
```python
from faker import Faker
import pandas as pd
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# ---------------------- Fake data generation ----------------------
fake = Faker()
data = [{"ID": fake.uuid4(), "EventTime": fake.date_time(),
         "FullName": fake.name(), "Address": fake.address(),
         "CompanyName": fake.company(), "JobTitle": fake.job(),
         "EmailAddress": fake.email(), "PhoneNumber": fake.phone_number(),
         "RandomText": fake.sentence(), "City": fake.city(),
         "State": fake.state(), "Country": fake.country()}
        for _ in range(1000)]
pandas_df = pd.DataFrame(data)

# ---------------------- Hudi properties ----------------------
hoodie_properties = {
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.payload.class':
        'org.apache.hudi.common.model.DefaultHoodieRecordPayload',
    'hoodie.datasource.hive_sync.partition_extractor_class':
        'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.datasource.write.recordkey.field': 'ID,State,City',
    'hoodie.metadata.enable': 'true',
    'hoodie.table.name': 'record_index',
    'hoodie.enable.data.skipping': 'true',
    'hoodie.index.type': 'RECORD_INDEX',
    'hoodie.metadata.record.index.enable': 'true',
    'hoodie.datasource.write.precombine.field': 'EventTime',
    'hoodie.payload.ordering.field': 'EventTime',
    'hoodie.datasource.write.partitionpath.field': 'partition',
    'hoodie.datasource.write.drop.partition.columns': 'true'
}

if __name__ == '__main__':
    # Adjust the jar paths below to match your machine.
    with SparkSession.builder.appName("hudi_record_index") \
            .config("spark.serializer",
                    "org.apache.spark.serializer.KryoSerializer") \
            .config("spark.jars",
                    "/jars/hadoop-lzo.jar,/jars/hudi-spark3.3-bundle_2.12-0.14.0.jar") \
            .config("spark.sql.catalog.spark_catalog",
                    "org.apache.spark.sql.hudi.catalog.HoodieCatalog") \
            .config("spark.sql.extensions",
                    "org.apache.spark.sql.hudi.HoodieSparkSessionExtension") \
            .config("spark.hadoop.parquet.avro.write-old-list-structure", "false") \
            .config("spark.sql.adaptive.enabled", "false") \
            .config("spark.dynamicAllocation.enabled", "true") \
            .getOrCreate() as spark:
        spark.sparkContext.setLogLevel("WARN")
        df = spark.createDataFrame(pandas_df)
        df = df.withColumn("partition", F.lit("record_index"))
        df.write.format("hudi").options(**hoodie_properties) \
            .mode("overwrite").save("your_output_file_path")
```
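As a side note on the comma-separated `hoodie.datasource.write.recordkey.field` above: with multiple key fields, Hudi builds a composite record key in the `field1:value1,field2:value2` style of its complex key generator. A rough pure-Python illustration of that keying (my own sketch, not Hudi source; the trimming of stray spaces around field names is an assumption):

```python
def composite_record_key(record: dict, key_fields: str) -> str:
    """Sketch of composite-key construction for a comma-separated
    recordkey.field such as 'ID,State,City'. Not the actual Hudi
    implementation; field names are trimmed of surrounding whitespace."""
    fields = [f.strip() for f in key_fields.split(",")]
    return ",".join(f"{f}:{record[f]}" for f in fields)

row = {"ID": "u-1", "State": "CA", "City": "SF"}
composite_record_key(row, "ID,State,City")  # "ID:u-1,State:CA,City:SF"
```

If the record index keys on this composite value, every lookup must match all three fields exactly, which is worth keeping in mind when checking whether tagged upserts behave as expected.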