gentrit1 commented on issue #11560:
URL: https://github.com/apache/hudi/issues/11560#issuecomment-2208500534

   > > Hey there, can you please provide a sample dataset using Faker so we can try to reproduce this behavior locally?
   > 
   > Is it okay for you to use the fake data from the [parquet file](https://we.tl/t-eBT7GTZabv), with the table options specified as below:
   > 
   > hudi_options = {
   >     "hoodie.table.name": table_name,
   >     "hoodie.datasource.write.table.type": "MERGE_ON_READ",
   >     "hoodie.datasource.write.recordkey.field": "UniqueNumber",  # key
   >     "hoodie.datasource.write.partitionpath.field": "Date,Job",
   >     "hoodie.datasource.write.precombine.field": "Timestamp",
   >     "hoodie.datasource.write.table.name": table_name,
   >     "hoodie.datasource.write.operation": "upsert",
   >     "hoodie.combine.before.insert": "true",
   >     "hoodie.cleaner.commits.retained": "3",
   >     "hoodie.compact.inline.max.delta.commits": "2",
   >     "hoodie.enable.data.skipping": "true",
   >     "hoodie.metadata.enable": "true",
   >     "hoodie.metadata.index.column.stats.enable": "true",
   >     "hoodie.metadata.record.index.enable": "true",
   >     "hoodie.index.type": "RECORD_INDEX",
   >     "hoodie.clustering.inline": "true",
   >     "hoodie.clustering.inline.max.commits": "1",
   >     "hoodie.clustering.plan.strategy.class": "org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy",
   >     "hoodie.clustering.plan.strategy.target.file.max.bytes": "40000000",
   >     "hoodie.clustering.plan.strategy.sort.columns": "UniqueNumber",
   > }
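
   For anyone trying this locally, here is a minimal sketch of how the options above might be applied when writing the parquet file to a Hudi table with PySpark. The function name `write_fake_data`, the `base_path` argument, and the example paths are hypothetical and not from my actual job; it also assumes a `SparkSession` that already has the Hudi Spark bundle on its classpath.

```python
def write_fake_data(spark, parquet_path, base_path, hudi_options):
    """Read the fake parquet file and write it to a Hudi table (sketch only).

    spark        -- an existing SparkSession with the Hudi bundle available (assumption)
    parquet_path -- location of fake_data.parquet
    base_path    -- hypothetical target path of the Hudi table
    hudi_options -- the options dict shown above
    """
    df = spark.read.parquet(parquet_path)
    (
        df.write.format("hudi")
        .options(**hudi_options)  # every key/value pair is passed as a writer option
        .mode("append")           # append mode; the upsert behavior comes from the options
        .save(base_path)
    )
    return df


# Hypothetical usage, not executed here:
# write_fake_data(spark, "fake_data.parquet", "/tmp/hudi/fake_table", hudi_options)
```

   Calling it twice with the same file should exercise the upsert path, since the record keys repeat.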
   
   Also, here is the Python code I used to generate the parquet file:
   
   `from faker import Faker
   import pandas as pd
   import uuid
   
   fake = Faker()
   num_records = 1000
   
   # UniqueNumber is the Hudi record key; the remaining columns are filled in below.
   data = {
       "UniqueNumber": [str(uuid.uuid4()) for _ in range(num_records)],
       "Name": [],
       "Address": [],
       "Email": [],
       "Phone Number": [],
       "Job": [],
       "Company": [],
       "Date": [],
       "Timestamp": [],
   }
   
   for _ in range(num_records):
       data["Name"].append(fake.name())
       data["Address"].append(fake.address())
       data["Email"].append(fake.email())
       data["Phone Number"].append(fake.phone_number())
       data["Job"].append(fake.job())
       data["Company"].append(fake.company())
       # The argument to fake.date() is a strftime pattern; with no % directives
       # it yields the literal string "2024-07-04" for every row.
       data["Date"].append(fake.date('2024-07-04'))
       # Unix timestamp, used as the precombine field.
       data["Timestamp"].append(fake.unix_time())
   
   df = pd.DataFrame(data)
   
   # engine="fastparquet" requires the fastparquet package to be installed.
   df.to_parquet("fake_data.parquet", engine="fastparquet", index=False)
   
   print("Dataset generated and saved to fake_data.parquet")
   `
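
   A note on the `Date` column in the script above: as far as I know, the first argument of Faker's `date()` is a strftime pattern, not a date value. The standard-library sketch below shows what that means for a pattern without any `%` directives; the `sample` datetime is an arbitrary value chosen for illustration.

```python
from datetime import datetime

# An arbitrary datetime standing in for the random one Faker would generate.
sample = datetime(1999, 1, 2)

# A pattern with no % directives is returned unchanged.
literal = sample.strftime("2024-07-04")

# A pattern with directives formats the underlying datetime.
formatted = sample.strftime("%Y-%m-%d")

print(literal)    # 2024-07-04
print(formatted)  # 1999-01-02
```

   So every row in the generated file gets the same `Date`, and the `Date,Job` partition path varies only by `Job`.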


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

