asheeshgarg opened a new issue #1825:
URL: https://github.com/apache/hudi/issues/1825
Setup:
- Packages: org.apache.hudi:hudi-spark-bundle_2.11:0.5.3, org.apache.spark:spark-avro_2.11:2.4.4
- Client: PySpark
- Storage: S3
I have a few datasets arriving at different times of the day, let's say 500
datasets each day. Each dataset is mostly independent and small, say 5000
rows, but they all share the same column structure. I have partitioned the
data by a date column.
My objective is inline compaction, so that the data gets compacted on each
write and at the end of the day we have a single parquet file, with the
older parquet files deleted.
Following are the hudi options I have used with PySpark:

hudi_options = {
    "hoodie.table.name": self.table_name,
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.recordkey.field": "snapshot_date,dataset,column",
    "hoodie.datasource.write.precombine.field": "column",
    "hoodie.datasource.write.table.name": self.table_name,
    "hoodie.compact.inline": True,
    "hoodie.compact.inline.max.delta.commits": 1,
    "hoodie.upsert.shuffle.parallelism": 2,
    "hoodie.insert.shuffle.parallelism": 2,
    "hoodie.embed.timeline.server": False,
    "hoodie.datasource.write.partitionpath.field": "snapshot_date",
}
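For reference, a minimal sketch of how these options can be assembled and passed to a DataFrame write. The helper name `build_hudi_options`, the example table name, and the save path in the comment are assumptions for illustration, not from the issue:

```python
def build_hudi_options(table_name: str) -> dict:
    """Return MERGE_ON_READ write options with inline compaction enabled
    after every delta commit (max.delta.commits = 1)."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        "hoodie.datasource.write.keygenerator.class":
            "org.apache.hudi.keygen.ComplexKeyGenerator",
        "hoodie.datasource.write.recordkey.field": "snapshot_date,dataset,column",
        "hoodie.datasource.write.precombine.field": "column",
        "hoodie.datasource.write.table.name": table_name,
        "hoodie.compact.inline": True,
        "hoodie.compact.inline.max.delta.commits": 1,
        "hoodie.upsert.shuffle.parallelism": 2,
        "hoodie.insert.shuffle.parallelism": 2,
        "hoodie.embed.timeline.server": False,
        "hoodie.datasource.write.partitionpath.field": "snapshot_date",
    }

opts = build_hudi_options("example_table")

# The options would then be unpacked into a Hudi write, e.g.:
# df.write.format("hudi").options(**opts).mode("append").save("s3://<bucket>/<path>")
print(opts["hoodie.compact.inline"])
```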
I see the writes succeed, but I still see multiple parquet files under the S3
location for a given date. Do I need to add any property to the Spark Hudi
options to achieve what I am looking for, i.e. compact the metadata and
parquet files into a single file?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]