asheeshgarg opened a new issue #1825:
URL: https://github.com/apache/hudi/issues/1825
Setup:
- Packages: org.apache.hudi:hudi-spark-bundle_2.11:0.5.3, org.apache.spark:spark-avro_2.11:2.4.4
- Client: PySpark
- Storage: S3
I have a few datasets arriving at different times of the day, let's say 500
datasets each day. Each dataset is mostly independent and small, say 5000
rows, but they all share the same column structure. I have partitioned the
data by a date column.
My objective is inline compaction, so that the data gets compacted on each
write and at the end of the day we have a single parquet file, with the
older parquet files deleted.
Following are the hudi options I have used with PySpark:

hudi_options = {
    "hoodie.table.name": self.table_name,
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.recordkey.field": "snapshot_date,dataset,column",
    "hoodie.datasource.write.precombine.field": "column",
    "hoodie.datasource.write.table.name": self.table_name,
    "hoodie.compact.inline": True,
    "hoodie.compact.inline.max.delta.commits": 1,
    "hoodie.upsert.shuffle.parallelism": 2,
    "hoodie.insert.shuffle.parallelism": 2,
    "hoodie.embed.timeline.server": False,
    "hoodie.datasource.write.partitionpath.field": "snapshot_date",
}
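For reference, a minimal sketch of how these options can be assembled and passed to a DataFrame write. The helper name `build_hudi_options`, the example table name, and the save path in the comment are assumptions for illustration, not from the issue:

```python
def build_hudi_options(table_name: str) -> dict:
    """Return MERGE_ON_READ write options with inline compaction enabled
    after every delta commit (max.delta.commits = 1)."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        "hoodie.datasource.write.keygenerator.class":
            "org.apache.hudi.keygen.ComplexKeyGenerator",
        "hoodie.datasource.write.recordkey.field": "snapshot_date,dataset,column",
        "hoodie.datasource.write.precombine.field": "column",
        "hoodie.datasource.write.table.name": table_name,
        "hoodie.compact.inline": True,
        "hoodie.compact.inline.max.delta.commits": 1,
        "hoodie.upsert.shuffle.parallelism": 2,
        "hoodie.insert.shuffle.parallelism": 2,
        "hoodie.embed.timeline.server": False,
        "hoodie.datasource.write.partitionpath.field": "snapshot_date",
    }

opts = build_hudi_options("example_table")

# The options would then be unpacked into a Hudi write, e.g.:
# df.write.format("hudi").options(**opts).mode("append").save("s3://<bucket>/<path>")
print(opts["hoodie.compact.inline"])
```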
I see the writes succeed, but I still see multiple parquet files under the S3
location for a given date. Do I need to add any property to the Spark Hudi
options to achieve what I am looking for, i.e. compact the metadata and
parquet files into a single file?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]