KarthickAN opened a new issue #2066:
URL: https://github.com/apache/hudi/issues/2066
Hi,
I wanted to understand how much storage overhead Hudi adds because of its metadata, so I ran a spike with 14GB of raw data and processed it to produce parquet files. I already have a PySpark script that does this processing without Hudi. When I processed the data using the plain PySpark script, it produced just one file (since all the data belonged to one partition) of size 294MB, snappy compressed. I then replicated the same processing logic with Hudi in place, using the configuration below (note: the config keys must be all lowercase, e.g. `hoodie.table.name`, not `hoodie.table.Name`):
```python
hudi_options = {
    'hoodie.table.name': tableName,
    'hoodie.datasource.write.keygenerator.class':
        'org.apache.hudi.keygen.ComplexKeyGenerator',
    'hoodie.datasource.write.recordkey.field':
        'sourceId,sourceAssetId,timestamp,sourceSignalId,aggregation',
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.partitionpath.field': 'date,sourceId',
    'hoodie.datasource.write.hive_style_partitioning': True,
    'hoodie.datasource.write.table.name': tableName,
    'hoodie.datasource.write.operation': 'insert',
    'hoodie.parquet.compression.codec': 'snappy',
    'hoodie.parquet.compression.ratio': '0.95',
    'hoodie.parquet.small.file.limit': '536870912',
    'hoodie.parquet.max.file.size': '1073741824',
    'hoodie.parquet.block.size': '1073741824',
    'hoodie.copyonwrite.record.size.estimate': '36',
    'hoodie.cleaner.commits.retained': 1,
    'hoodie.combine.before.insert': True,
    'hoodie.datasource.write.precombine.field': 'quality',
    'hoodie.insert.shuffle.parallelism': 10,
    'hoodie.datasource.write.insert.drop.duplicates': True
}
```
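For what it's worth, the file layout I observed is roughly what these settings imply: 72,107,388 records at the configured estimate of 36 bytes each comes to about 2.4GiB, which would be split into 3 files under the 1GiB max file size. A quick sanity check (plain Python, no Hudi involved; that Hudi bin-packs inserts this way from the size estimate is my assumption):

```python
import math

# Values taken from the configuration and the run described above.
num_records = 72_107_388
record_size_estimate = 36       # hoodie.copyonwrite.record.size.estimate (bytes)
max_file_size = 1_073_741_824   # hoodie.parquet.max.file.size (1 GiB)

# Total size implied by the estimate, and the file count if inserts
# are packed into files no larger than max_file_size.
estimated_total = num_records * record_size_estimate
estimated_files = math.ceil(estimated_total / max_file_size)

print(f"estimated total: {estimated_total / 2**30:.2f} GiB")  # ~2.42 GiB
print(f"estimated files: {estimated_files}")                  # 3
```

That lines up with the 3 files totalling ~2.4GB reported below, so the on-disk size appears to track the 36-byte-per-record estimate rather than the raw parquet size.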
But Hudi produced 3 files of 815MB each, totalling 2.4GB, for the same partition. In both cases there were 72,107,388 records, to be precise. I expected some increase in total size, but this is almost 8 times bigger, which is a problem. Can I get a confirmation: is a difference this large expected when Hudi is used, or can it be reduced further by tuning the config?
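One likely contributor (my speculation, not confirmed against the actual files) is Hudi's per-record metadata columns, `_hoodie_commit_time`, `_hoodie_commit_seqno`, `_hoodie_record_key`, `_hoodie_partition_path`, and `_hoodie_file_name`. With `ComplexKeyGenerator` over five fields, `_hoodie_record_key` is a composite `field:value` string stored on every row. A back-of-envelope estimate, where the sample field values are made up purely for illustration:

```python
# Illustrative only: the field values below are invented to gauge the
# length of a ComplexKeyGenerator record key, which is stored per record
# in the _hoodie_record_key column.
sample_key = (
    "sourceId:s-001,sourceAssetId:a-0001,"
    "timestamp:1598918400000,sourceSignalId:sig-01,aggregation:avg"
)
num_records = 72_107_388

# Uncompressed bytes contributed by the record key column alone.
key_bytes = len(sample_key) * num_records
print(f"~{key_bytes / 2**30:.1f} GiB of key strings before compression")
```

Even though parquet dictionary encoding and snappy will compress these repetitive strings well, several GiB of extra column data before compression could plausibly account for a large part of the growth; someone from the Hudi side would need to confirm.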
**Environment Description**
* Hudi version : 0.6.0
* Spark version : 2.4.3
* Hadoop version : 2.8.5-amzn-1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No. Running on AWS Glue