mkk1490 commented on issue #3400:
URL: https://github.com/apache/hudi/issues/3400#issuecomment-899649238


   @nsivabalan  I tested this scenario with >400 GB of data and a few incremental 
loads. Below is the order of the loads and the data sizes. The data lake is 
snapshotted into a new partition on each load, while the Hudi table is upserted.
   1. Original data size = 414 GB; Hudi data size = 424 GB
   2. After 4 incremental loads, the original grew to around 2100 GB (snapshot 
data in each partition), while the Hudi data size was 547 GB
   
   I tested another scenario with >1.5 TB, following the same process as above
   1. Original = 1.6 TB, Hudi 1.3 TB
   2. After 1 incremental, orig = 2.9 TB and Hudi 2.5 TB
   
   The same data in a partition effectively doubled for the bigger dataset, while 
it stayed roughly the same for the 500 GB dataset. In both cases, cleaner commits 
retained was 5.
   Hudi seems to behave differently for different dataset sizes. I initially 
thought the increase in data size would only affect smaller tables. Is there 
anything I'm missing here?
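   For context on how I have been thinking about the growth, here is a rough 
back-of-the-envelope model (my own assumption, not Hudi's actual accounting): 
each upsert on a COPY_ON_WRITE table rewrites every file group it touches, and 
the cleaner keeps the superseded file versions until the retained-commit window 
passes, so the extra storage is roughly (fraction of file groups touched) × 
(table size) × (retained commits). The `touched_fraction` values below are made 
up for illustration.

```python
def estimated_cow_size_gb(base_gb: float, touched_fraction: float,
                          retained_commits: int) -> float:
    """Rough worst-case on-disk size for a COPY_ON_WRITE table.

    Assumes every retained commit rewrote `touched_fraction` of the
    file groups, so those older file versions stay on disk until the
    cleaner removes them. Illustrative model only, not Hudi's real
    bookkeeping.
    """
    extra_gb = base_gb * touched_fraction * retained_commits
    return base_gb + extra_gb

# Incrementals touching 10% of file groups, 5 commits retained:
small = estimated_cow_size_gb(424, 0.10, 5)   # 636.0 GB
# Incrementals touching 20% of file groups on a 1.3 TB table:
big = estimated_cow_size_gb(1300, 0.20, 5)    # 2600.0 GB, roughly 2x
```

   Under this model, whether a table "doubles" depends mostly on what fraction 
of file groups each incremental load touches, not on the table size itself.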
   
   Another thing I noted: for the 1.5 TB dataset, bulk_insert into Hudi produced 
only 878 GB for the same number of records, while insert_overwrite produced 
1.3 TB. I couldn't figure out why there was such a large difference in data 
size. One more observation: after doing a bulk_insert, I was not able to perform 
a delete operation on the table.
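   One plausible factor for the size gap (an assumption on my part, not 
confirmed): bulk_insert with `hoodie.bulkinsert.sort.mode: GLOBAL_SORT` writes 
records sorted by key, and sorted, clustered data compresses much better under 
Parquet's run-length and dictionary encodings. A toy illustration of that 
effect using zlib instead of Parquet:

```python
import random
import zlib

random.seed(42)
# Simulate a low-cardinality column (e.g. yr_mth partition keys).
values = random.choices(range(50), k=100_000)

unsorted_blob = bytes(values)        # arrival-order layout
sorted_blob = bytes(sorted(values))  # GLOBAL_SORT-style layout

unsorted_size = len(zlib.compress(unsorted_blob))
sorted_size = len(zlib.compress(sorted_blob))

# Sorted data forms long runs of identical bytes and compresses far
# better -- the same effect Parquet's RLE/dictionary encodings exploit
# when bulk_insert writes globally sorted files.
assert sorted_size < unsorted_size
```

   If insert_overwrite did not lay the data out in the same sorted order, that 
alone could account for a sizeable difference in on-disk footprint.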
   
   This is my config.
   hudi_upsert_options = {
         'hoodie.table.name': 'f_claim_phcy_hudi_cow',
         'hoodie.datasource.write.recordkey.field': 'yr_mth,claim_id',
         'hoodie.datasource.write.partitionpath.field': 'yr_mth',
         'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
         'hoodie.datasource.write.table.name': 'f_claim_phcy_hudi_cow',
         'hoodie.combine.before.upsert': 'true',
         'hoodie.datasource.hive_sync.enable': 'true',
         'hoodie.datasource.hive_sync.table': 'f_claim_phcy_hudi_cow',
         'hoodie.datasource.hive_sync.partition_fields': 'src_sys_nm,yr_mth',
         'hoodie.datasource.hive_sync.partition_extractor_class': 
'org.apache.hudi.hive.MultiPartKeysValueExtractor',
         'hoodie.datasource.write.hive_style_partitioning': 'true',
         'hoodie.datasource.hive_sync.database': 
'us_commercial_datalake_app_commons_dev',
         'hoodie.datasource.hive_sync.support_timestamp': 'true',
         'hoodie.datasource.hive_sync.auto_create_db': 'false',
         'hoodie.datasource.write.keygenerator.class': 
'org.apache.hudi.keygen.ComplexKeyGenerator',
         'hoodie.datasource.write.row.writer.enable': 'true',
         'hoodie.parquet.small.file.limit': '500000000',
         'hoodie.parquet.max.file.size': '900000000',
         'hoodie.upsert.shuffle.parallelism': '2000',
         'hoodie.insert.shuffle.parallelism': '2000',
         'hoodie.bulkinsert.shuffle.parallelism': '2000',
         'hoodie.delete.shuffle.parallelism': '2000',
         'hoodie.bulkinsert.sort.mode': 'GLOBAL_SORT',
         'hoodie.copyonwrite.record.size.estimate': '100',
         'hoodie.clean.automatic': 'false',
         'hoodie.cleaner.commits.retained': 6,
         'hoodie.index.type': 'SIMPLE',
         'hoodie.simple.index.update.partition.path': 'true',
         'hoodie.simple.index.use.caching': 'true',
         'hoodie.metadata.enable': 'true'
   }
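   One thing worth double-checking in a config dict like this: Python dict 
literals keep only the last value for a repeated key, with no warning, so a 
duplicated or silently overridden Hudi option is easy to miss when copy-pasting 
configs (and Hudi keys are also case-sensitive, so a typo like `table.Type` is 
simply ignored). A quick sanity check:

```python
# Python dict literals keep only the LAST value for a repeated key,
# with no warning -- a duplicated Hudi option can silently override
# an earlier one.
opts = {
    'hoodie.bulkinsert.sort.mode': 'NONE',
    'hoodie.bulkinsert.sort.mode': 'GLOBAL_SORT',  # silently wins
}
assert opts['hoodie.bulkinsert.sort.mode'] == 'GLOBAL_SORT'
assert len(opts) == 1

# A simple guard: build the config from a list of pairs and flag repeats.
pairs = [
    ('hoodie.bulkinsert.shuffle.parallelism', '2000'),
    ('hoodie.bulkinsert.shuffle.parallelism', '2000'),
]
keys = [k for k, _ in pairs]
duplicates = {k for k in keys if keys.count(k) > 1}
assert duplicates == {'hoodie.bulkinsert.shuffle.parallelism'}
```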
   
   @vinothchandar @n3nash @bvaradar @codope 

