KarthickAN opened a new issue #2066:
URL: https://github.com/apache/hudi/issues/2066


   Hi,
       I wanted to understand how much storage overhead Hudi adds because of 
its metadata. So I ran a spike with 14GB of raw data and processed it into 
Parquet files. I already have a PySpark script that does this processing 
without Hudi. When I processed the data with the plain PySpark script it 
produced just one file (since all the data belonged to a single partition) of 
294MB, Snappy compressed. I replicated the same processing logic with Hudi 
using the configuration below:
   
   hudi_options = {
       'hoodie.table.name': tableName,
       'hoodie.datasource.write.keygenerator.class': 
'org.apache.hudi.keygen.ComplexKeyGenerator',
       'hoodie.datasource.write.recordkey.field': 
'sourceId,sourceAssetId,timestamp,sourceSignalId,aggregation',
       'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
       'hoodie.datasource.write.partitionpath.field': 'date,sourceId',
       'hoodie.datasource.write.hive_style_partitioning': True,
       'hoodie.datasource.write.table.name': tableName,
       'hoodie.datasource.write.operation': 'insert',
       'hoodie.parquet.compression.codec': 'snappy',
       'hoodie.parquet.compression.ratio': '0.95',
       'hoodie.parquet.small.file.limit': '536870912',
       'hoodie.parquet.max.file.size': '1073741824',
       'hoodie.parquet.block.size': '1073741824',
       'hoodie.copyonwrite.record.size.estimate': '36',
       'hoodie.cleaner.commits.retained': 1,
       'hoodie.combine.before.insert': True,
       'hoodie.datasource.write.precombine.field': 'quality',
       'hoodie.insert.shuffle.parallelism': 10,
       'hoodie.datasource.write.insert.drop.duplicates': True
   }
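   For context on where extra bytes can come from: Hudi stores metadata 
columns on every record (for example `_hoodie_record_key` and 
`_hoodie_partition_path`), and with the ComplexKeyGenerator the record key is 
materialized as a string of `field:value` pairs built from the five recordkey 
fields above. A rough sketch of the per-record key size, using made-up sample 
values (the field values below are hypothetical, only the field names come 
from the config):

   ```python
   # Rough sketch of the string Hudi's ComplexKeyGenerator stores per record
   # in the _hoodie_record_key column: "field1:value1,field2:value2,...".
   # The sample values are invented for illustration.
   key_fields = {
       "sourceId": "plant-01",
       "sourceAssetId": "asset-1234",
       "timestamp": "1597305600000",
       "sourceSignalId": "sig-42",
       "aggregation": "avg",
   }
   record_key = ",".join(f"{field}:{value}" for field, value in key_fields.items())
   print(record_key)
   print(len(record_key), "bytes of key material per record, before compression")
   ```

   With a five-field composite key like this, each record carries on the 
order of 100 bytes of key text before compression, which is large relative to 
a 36-byte payload estimate.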
   
   But Hudi produced 3 files of ~815MB each, totalling 2.4GB, for the same 
partition. In both cases there were exactly 72107388 records. I expected to 
see some increase in total size, but this is almost 8 times bigger, which is 
a problem for us. Can I get a confirmation: is a difference this large 
expected when Hudi is used, or can it be reduced further by tuning the 
configuration?
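   For reference, a quick back-of-the-envelope check of the reported numbers 
(assuming GB/MB here mean GiB/MiB):

   ```python
   # Sanity-check the reported sizes: inflation factor and the implied
   # compressed overhead per record.
   hudi_bytes = 2.4 * 1024**3    # 3 Hudi files totalling ~2.4GB
   plain_bytes = 294 * 1024**2   # single 294MB file from plain PySpark
   records = 72_107_388

   ratio = hudi_bytes / plain_bytes
   extra_per_record = (hudi_bytes - plain_bytes) / records
   print(f"inflation: {ratio:.1f}x")
   print(f"extra compressed bytes per record: {extra_per_record:.1f}")
   ```

   That works out to roughly 8.4x inflation, i.e. about 31 extra compressed 
bytes per record.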
   
   
   **Environment Description**
   
   * Hudi version : 0.6.0
   
   * Spark version : 2.4.3
   
   * Hadoop version : 2.8.5-amzn-1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : No. Running on AWS Glue
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

