[GitHub] [hudi] tjtoll commented on issue #4682: [SUPPORT] Upgrade from 0.8.0 to 0.10.0 decreases Upsert performance

GitBox Tue, 29 Mar 2022 07:14:35 -0700


tjtoll commented on issue #4682:
URL: https://github.com/apache/hudi/issues/4682#issuecomment-1081925888



   Sure @rkkalluri, and thank you again for your help and for what you do for
the community.
   
   I switched from a timestamp partition in yyyyMM format to a range partition
on the record key. The record key on the table is auto-incrementing, so it
lends itself well to range partitioning while still leveraging the bloom index.
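
For illustration only, here is a minimal sketch of how a range partition value could be derived from an auto-incrementing record key. The function name `range_partition` and the bucket width of 1,000,000 are assumptions made for this example; they are not taken from the actual job, which does not show how the partition column is computed.

```python
def range_partition(record_key: int, bucket_size: int = 1_000_000) -> int:
    """Return the lower bound of the fixed-width key range containing record_key.

    Hypothetical helper: both the name and the default bucket width are
    assumptions for illustration, not values from the original job.
    """
    if bucket_size <= 0:
        raise ValueError("bucket_size must be a positive integer")
    # Integer division maps every key in [n * size, (n + 1) * size) to bucket n;
    # multiplying back yields a readable partition value (the range's lower bound).
    return (record_key // bucket_size) * bucket_size


if __name__ == "__main__":
    for key in (0, 999_999, 1_000_000, 2_500_001):
        print(key, "->", range_partition(key))
```

With a key that only grows, each partition produced this way covers one contiguous interval of keys, so an incoming batch of keys maps to a small, predictable set of partitions.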
   
   BEFORE:
   
           hudiOptions = {
               'hoodie.table.name': hoodie_table_name,
               'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
               'hoodie.datasource.hive_sync.use_jdbc': 'false',
               'hoodie.datasource.hive_sync.mode': 'hms',
               'hoodie.datasource.write.recordkey.field': topics[topic_name]['record_key'],
               'hoodie.datasource.write.partitionpath.field': topics[topic_name]['partition_path'],
               'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator',
               'hoodie.datasource.write.precombine.field': 'ts_ms',
               'hoodie.datasource.hive_sync.database': args['TARGET_DATABASE'],
               'hoodie.parquet.outputtimestamptype': 'TIMESTAMP_MICROS',
               'hoodie.datasource.hive_sync.enable': 'true',
               'hoodie.datasource.hive_sync.table': hoodie_table_name,
               'hoodie.datasource.hive_sync.auto_create_database': 'true',
               'hoodie.datasource.hive_sync.partition_fields': (re.sub(r':.+?,', ',', topics[topic_name]['partition_path'])).split(":")[0],
               'hoodie.datasource.write.hive_style_partitioning': 'true',
               'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
               'hoodie.deltastreamer.keygen.timebased.timestamp.type': 'SCALAR',
               'hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit': 'microseconds',
               'hoodie.deltastreamer.keygen.timebased.output.dateformat': 'yyyyMM',
               'hoodie.deltastreamer.keygen.timebased.timezone': 'UTC',
               'hoodie.copyonwrite.record.size.estimate': 50,
               'hoodie.datasource.hive_sync.skip_ro_suffix': 'true'
           }
   
   AFTER:
   
               hudiOptions = {
                   'hoodie.table.name': hoodie_table_name,
                   'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
                   'hoodie.datasource.hive_sync.use_jdbc': 'false',
                   'hoodie.datasource.hive_sync.mode': 'hms',
                   'hoodie.datasource.write.recordkey.field': topics[topic_name]['record_key'],
                   'hoodie.datasource.write.partitionpath.field': topics[topic_name]['partition_path'],
                   'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator',
                   'hoodie.datasource.write.precombine.field': 'ts_ms',
                   'hoodie.datasource.hive_sync.database': args['TARGET_DATABASE'],
                   'hoodie.parquet.outputtimestamptype': 'TIMESTAMP_MICROS',
                   'hoodie.datasource.hive_sync.enable': 'true',
                   'hoodie.datasource.hive_sync.table': hoodie_table_name,
                   'hoodie.datasource.hive_sync.auto_create_database': 'true',
                   'hoodie.datasource.hive_sync.partition_fields': (re.sub(r':.+?,', ',', topics[topic_name]['partition_path'])).split(":")[0],
                   'hoodie.datasource.write.hive_style_partitioning': 'true',
                   'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
                   'hoodie.deltastreamer.keygen.timebased.timestamp.type': 'SCALAR',
                   'hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit': 'microseconds',
                   'hoodie.deltastreamer.keygen.timebased.output.dateformat': 'yyyyMM',
                   'hoodie.deltastreamer.keygen.timebased.timezone': 'UTC',
                   'hoodie.datasource.hive_sync.skip_ro_suffix': 'true',
                   'hoodie.parquet.compression.codec': 'snappy'
               }
   
   

