KarthickAN commented on issue #2066: URL: https://github.com/apache/hudi/issues/2066#issuecomment-708521865
@bvaradar We have two types of data for which we use Hudi. Both are quite similar, with some differences in the schema, and I have almost completed development. For the type with the smaller data volume, the size difference after the Hudi transform is huge: 4326 objects / 580.4 MB in JSON Lines format becomes 4599 objects / 7.2 GB once written as Hudi with snappy compression enabled. For the other type, which has a much larger volume, I don't see this issue: 6895 objects / 100.0 GB of JSON Lines becomes 10597 objects / 42.4 GB of Hudi Parquet with snappy.

Following are the configs I am using right now:

- SmallFileSize = 104857600
- MaxFileSize = 125829120
- RecordSize = 35
- CompressionRatio = 5
- InsertSplitSize = 3500000
- IndexBloomNumEntries = 1500000
- KeyGenClass = org.apache.hudi.keygen.ComplexKeyGenerator
- RecordKeyFields = sourceid,sourceassetid,sourceeventid,value,timestamp
- TableType = COPY_ON_WRITE
- PartitionPathFields = date,sourceid
- HiveStylePartitioning = True
- WriteOperation = insert
- CompressionCodec = snappy
- CommitsRetained = 1
- CombineBeforeInsert = True
- PrecombineField = timestamp
- InsertDropDuplicates = True
- InsertShuffleParallelism = 100

Is there anything I should look at to improve this right now?
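For context, here is roughly how these settings are wired up in the write job. This is only a minimal sketch assuming a Spark DataFrame write with the standard `hoodie.*` option keys (Hudi 0.6.x era); the table name and S3 paths below are placeholders, not the real ones.

```scala
// Sketch only: maps the config names above to the standard hoodie.* write options.
// Table name and paths are hypothetical placeholders.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("hudi-insert-sketch")
  .getOrCreate()

// Read the JSON Lines input (placeholder path)
val df = spark.read.json("s3://my-bucket/input/")

df.write.format("hudi")
  .option("hoodie.table.name", "my_table")                                // placeholder
  .option("hoodie.datasource.write.operation", "insert")                  // WriteOperation
  .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")          // TableType
  .option("hoodie.datasource.write.keygenerator.class",
          "org.apache.hudi.keygen.ComplexKeyGenerator")                   // KeyGenClass
  .option("hoodie.datasource.write.recordkey.field",
          "sourceid,sourceassetid,sourceeventid,value,timestamp")         // RecordKeyFields
  .option("hoodie.datasource.write.partitionpath.field", "date,sourceid") // PartitionPathFields
  .option("hoodie.datasource.write.precombine.field", "timestamp")        // PrecombineField
  .option("hoodie.datasource.write.hive_style_partitioning", "true")      // HiveStylePartitioning
  .option("hoodie.datasource.write.insert.drop.duplicates", "true")       // InsertDropDuplicates
  .option("hoodie.combine.before.insert", "true")                         // CombineBeforeInsert
  .option("hoodie.parquet.small.file.limit", "104857600")                 // SmallFileSize
  .option("hoodie.parquet.max.file.size", "125829120")                    // MaxFileSize
  .option("hoodie.copyonwrite.record.size.estimate", "35")                // RecordSize
  .option("hoodie.parquet.compression.ratio", "5")                        // CompressionRatio
  .option("hoodie.copyonwrite.insert.split.size", "3500000")              // InsertSplitSize
  .option("hoodie.index.bloom.num_entries", "1500000")                    // IndexBloomNumEntries
  .option("hoodie.parquet.compression.codec", "snappy")                   // CompressionCodec
  .option("hoodie.cleaner.commits.retained", "1")                         // CommitsRetained
  .option("hoodie.insert.shuffle.parallelism", "100")                     // InsertShuffleParallelism
  .mode(SaveMode.Append)
  .save("s3://my-bucket/output/hudi_table/")                              // placeholder
```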
