asheeshgarg opened a new issue #1825:
URL: https://github.com/apache/hudi/issues/1825


   Setup: org.apache.hudi:hudi-spark-bundle_2.11:0.5.3, org.apache.spark:spark-avro_2.11:2.4.4
   Client: PySpark
   Storage: S3
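
   For reference, a minimal way to launch PySpark with those bundles (the package coordinates are from the setup above; the Kryo serializer conf follows the usual Hudi quickstart guidance and is an assumption about this setup):

   ```
   pyspark \
     --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.3,org.apache.spark:spark-avro_2.11:2.4.4 \
     --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer"
   ```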
   
   I have a few datasets arriving at different times of the day, let's say 500 datasets each day. Each dataset is mostly independent and small, let's say 5000 rows, but they all share the same column structure. I have partitioned the data using a date column.
   My objective is inline compaction, so that the data gets compacted each time we write; at the end of the day we have a single parquet file and the rest of the older parquet files get deleted.
   Following are the Hudi options I have used with PySpark:
   hudi_options = {
       "hoodie.table.name": self.table_name,
       "hoodie.datasource.write.table.type": "MERGE_ON_READ",
       "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
       "hoodie.datasource.write.recordkey.field": "snapshot_date,dataset,column",
       "hoodie.datasource.write.precombine.field": "column",
       "hoodie.datasource.write.table.name": self.table_name,
       "hoodie.compact.inline": True,
       "hoodie.compact.inline.max.delta.commits": 1,
       "hoodie.upsert.shuffle.parallelism": 2,
       "hoodie.insert.shuffle.parallelism": 2,
       "hoodie.embed.timeline.server": False,
       "hoodie.datasource.write.partitionpath.field": "snapshot_date",
   }
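
   A sketch of how these options would be passed on the write, following the standard Hudi Spark DataSource pattern (`write_to_hudi`, `base_path`, and the `"my_table"` name are placeholders, not from my actual code; `self.table_name` is replaced by a literal so the snippet stands alone):

   ```python
   def write_to_hudi(df, base_path, hudi_options):
       """Upsert a Spark DataFrame into the Hudi table at base_path,
       using the Hudi DataSource writer with the options defined above."""
       (df.write.format("hudi")
           .options(**hudi_options)
           .option("hoodie.datasource.write.operation", "upsert")
           .mode("append")
           .save(base_path))

   # Subset of the options above, with a placeholder table name:
   hudi_options = {
       "hoodie.table.name": "my_table",  # placeholder for self.table_name
       "hoodie.datasource.write.table.type": "MERGE_ON_READ",
       "hoodie.compact.inline": True,
       "hoodie.compact.inline.max.delta.commits": 1,
       "hoodie.datasource.write.partitionpath.field": "snapshot_date",
   }
   ```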
   
   The writes succeed, but I see multiple parquet files under the S3 location for a given date. Do I need to add any property to the Spark Hudi options to achieve what I am looking for, i.e. compacting the metadata and parquet files into one file?
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
