KarthickAN commented on issue #2066: URL: https://github.com/apache/hudi/issues/2066#issuecomment-708521865
@bvaradar We have two types of data for which we use Hudi. Both are quite similar, with some differences in the schema, and I have almost completed development. For the type with the smaller data volume, the size difference after the Hudi transform is huge: 4326 objects / 580.4 MB in JSON Lines format becomes 4599 objects / 7.2 GB once written as Hudi with snappy compression enabled. For the other type, which has a much larger volume, I don't see this issue: 6895 objects / 100.0 GB of JSON Lines becomes 10597 objects / 42.4 GB of Hudi Parquet with snappy.

Following are the configs I am using right now:

- SmallFileSize = 104857600
- MaxFileSize = 125829120
- RecordSize = 35
- CompressionRatio = 5
- InsertSplitSize = 3500000
- IndexBloomNumEntries = 1500000
- KeyGenClass = org.apache.hudi.keygen.ComplexKeyGenerator
- RecordKeyFields = sourceid,sourceassetid,sourceeventid,value,timestamp
- TableType = COPY_ON_WRITE
- PartitionPathFields = date,sourceid
- HiveStylePartitioning = True
- WriteOperation = insert
- CompressionCodec = snappy
- CommitsRetained = 1
- CombineBeforeInsert = True
- PrecombineField = timestamp
- InsertDropDuplicates = True
- InsertShuffleParallelism = 100

Is there anything I should look at to improve this right now?
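For context, here is roughly how these settings are wired up in the write job. This is only a minimal sketch assuming a Spark DataFrame write with the standard `hoodie.*` option keys (Hudi 0.6.x era); the table name and S3 paths below are placeholders, not the real ones.

```scala
// Sketch only: maps the config names above to the standard hoodie.* write options.
// Table name and paths are hypothetical placeholders.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("hudi-insert-sketch")
  .getOrCreate()

// Read the JSON Lines input (placeholder path)
val df = spark.read.json("s3://my-bucket/input/")

df.write.format("hudi")
  .option("hoodie.table.name", "my_table")                                // placeholder
  .option("hoodie.datasource.write.operation", "insert")                  // WriteOperation
  .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")          // TableType
  .option("hoodie.datasource.write.keygenerator.class",
          "org.apache.hudi.keygen.ComplexKeyGenerator")                   // KeyGenClass
  .option("hoodie.datasource.write.recordkey.field",
          "sourceid,sourceassetid,sourceeventid,value,timestamp")         // RecordKeyFields
  .option("hoodie.datasource.write.partitionpath.field", "date,sourceid") // PartitionPathFields
  .option("hoodie.datasource.write.precombine.field", "timestamp")        // PrecombineField
  .option("hoodie.datasource.write.hive_style_partitioning", "true")      // HiveStylePartitioning
  .option("hoodie.datasource.write.insert.drop.duplicates", "true")       // InsertDropDuplicates
  .option("hoodie.combine.before.insert", "true")                         // CombineBeforeInsert
  .option("hoodie.parquet.small.file.limit", "104857600")                 // SmallFileSize
  .option("hoodie.parquet.max.file.size", "125829120")                    // MaxFileSize
  .option("hoodie.copyonwrite.record.size.estimate", "35")                // RecordSize
  .option("hoodie.parquet.compression.ratio", "5")                        // CompressionRatio
  .option("hoodie.copyonwrite.insert.split.size", "3500000")              // InsertSplitSize
  .option("hoodie.index.bloom.num_entries", "1500000")                    // IndexBloomNumEntries
  .option("hoodie.parquet.compression.codec", "snappy")                   // CompressionCodec
  .option("hoodie.cleaner.commits.retained", "1")                         // CommitsRetained
  .option("hoodie.insert.shuffle.parallelism", "100")                     // InsertShuffleParallelism
  .mode(SaveMode.Append)
  .save("s3://my-bucket/output/hudi_table/")                              // placeholder
```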
