gentrit1 commented on issue #11560: URL: https://github.com/apache/hudi/issues/11560#issuecomment-2208500534
> Hey there can you please provide sample dataset using faker so we can try locally to reproduce this behavior?

Is it okay for you to use the fake data from the [parquet file](https://we.tl/t-eBT7GTZabv), with the table options specified as below?

```python
hudi_options = {
    "hoodie.table.name": table_name,
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.recordkey.field": "UniqueNumber",  # key
    "hoodie.datasource.write.partitionpath.field": "Date,Job",
    "hoodie.datasource.write.precombine.field": "Timestamp",
    "hoodie.datasource.write.table.name": table_name,
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.combine.before.insert": "true",
    "hoodie.cleaner.commits.retained": "3",
    "hoodie.compact.inline.max.delta.commits": "2",
    "hoodie.enable.data.skipping": "true",
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.index.column.stats.enable": "true",
    "hoodie.metadata.record.index.enable": "true",
    "hoodie.index.type": "RECORD_INDEX",
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "1",
    "hoodie.clustering.plan.strategy.class": "org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy",
    "hoodie.clustering.plan.strategy.target.file.max.bytes": "40000000",
    "hoodie.clustering.plan.strategy.sort.columns": "UniqueNumber",
}
```

Also find here the Python code I used to generate the parquet:

```python
import uuid
from datetime import date

import pandas as pd
from faker import Faker

fake = Faker()
num_records = 1000

data = {
    "UniqueNumber": [str(uuid.uuid4()) for _ in range(num_records)],
    "Name": [],
    "Address": [],
    "Email": [],
    "Phone Number": [],
    "Job": [],
    "Company": [],
    "Date": [],
    "Timestamp": [],
}

for _ in range(num_records):
    data["Name"].append(fake.name())
    data["Address"].append(fake.address())
    data["Email"].append(fake.email())
    data["Phone Number"].append(fake.phone_number())
    data["Job"].append(fake.job())
    data["Company"].append(fake.company())
    # Note: the first positional argument of fake.date() is a strftime
    # pattern, so fake.date('2024-07-04') would emit that literal string
    # for every row; passing end_datetime yields random dates up to the cutoff.
    data["Date"].append(fake.date(end_datetime=date(2024, 7, 4)))
    data["Timestamp"].append(fake.unix_time())

df = pd.DataFrame(data)
df.to_parquet("fake_data.parquet", engine="fastparquet", index=False)
print("Dataset generated and saved to fake_data.parquet")
```
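One caveat worth checking before ingesting: since `UniqueNumber` is the record key and the write operation is `upsert`, any rows sharing a key would be merged away (precombined on `Timestamp`), silently shrinking the table. A minimal stdlib-only sketch of that sanity check (the `keys` list here stands in for the `UniqueNumber` column built above):

```python
import uuid

# Stand-in for the generation step above: record keys as UUID4 strings.
num_records = 1000
keys = [str(uuid.uuid4()) for _ in range(num_records)]

# Hudi upserts collapse rows that share a record key, so duplicate keys
# in the input would silently reduce the row count. Verify uniqueness.
assert len(set(keys)) == num_records, "duplicate record keys in input"
print(f"{num_records} unique record keys")
```

With UUID4 keys this check effectively always passes; it matters more if the record key is switched to a lower-entropy Faker field.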
