tjtoll commented on issue #4682:
URL: https://github.com/apache/hudi/issues/4682#issuecomment-1078976112


   @rkkalluri thanks, I should be able to get 0.11 tested out today. We are using the default bloom index (our full config is below). The table is right around 1 billion records, and the upsert batch we've been benchmarking with is 800K records. Our average batch is 85% updates and 15% inserts. The table is partitioned by source system (we have ~75 identical source systems) and by a range partition (500K records per range) on the table's primary key, which is an auto-incrementing integer. The record key is set to the same key that decides the range partition.
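   
   To make that concrete, the range partition is just an integer bucket of the auto-incrementing key. A minimal sketch of the idea (column names here are placeholders, not our real schema):
   
        # Illustrative only: bucket the auto-incrementing key into ranges of 500K.
        # 'source_system' and 'id' are placeholder column names.
        from pyspark.sql import SparkSession, functions as F

        RANGE_SIZE = 500_000  # 500K records per range partition

        spark = SparkSession.builder.getOrCreate()
        df = spark.createDataFrame(
            [('system_01', 1_234_567)],
            ['source_system', 'id'],
        )
        # id 1,234,567 falls into range bucket 2 (1,234,567 // 500,000)
        df = df.withColumn('id_range', F.floor(F.col('id') / RANGE_SIZE))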
   
   
        hudi_options = {
            'hoodie.table.name': hoodie_table_name,
            'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
            'hoodie.datasource.hive_sync.use_jdbc': 'false',
            'hoodie.datasource.hive_sync.mode': 'hms',
            'hoodie.datasource.write.recordkey.field': topics[topic_name]['record_key'],
            'hoodie.datasource.write.partitionpath.field': topics[topic_name]['partition_path'],
            'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator',
            'hoodie.datasource.write.precombine.field': 'ts_ms',
            'hoodie.datasource.hive_sync.database': args['TARGET_DATABASE'],
            'hoodie.parquet.outputtimestamptype': 'TIMESTAMP_MICROS',
            'hoodie.datasource.hive_sync.enable': 'true',
            'hoodie.datasource.hive_sync.table': hoodie_table_name,
            'hoodie.datasource.hive_sync.auto_create_database': 'true',
            # strip the ':TYPE' suffixes from the CustomKeyGenerator partition spec
            # so hive_sync gets plain, comma-separated field names
            'hoodie.datasource.hive_sync.partition_fields': re.sub(r':.+?,', ',', topics[topic_name]['partition_path']).split(':')[0],
            'hoodie.datasource.write.hive_style_partitioning': 'true',
            'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
            'hoodie.deltastreamer.keygen.timebased.timestamp.type': 'SCALAR',
            'hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit': 'microseconds',
            'hoodie.deltastreamer.keygen.timebased.output.dateformat': 'yyyyMM',
            'hoodie.deltastreamer.keygen.timebased.timezone': 'UTC',
            'hoodie.datasource.hive_sync.skip_ro_suffix': 'true',
            'hoodie.upsert.shuffle.parallelism': 1500,
            'hoodie.insert.shuffle.parallelism': 1500,
            'hoodie.finalize.write.parallelism': 1500,
            'hoodie.bulkinsert.shuffle.parallelism': 1500,
            'hoodie.parquet.compression.codec': 'snappy',
        }
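   
   For reference, a sketch of roughly how these options get applied via the DataFrame writer (the DataFrame variable and target path below are placeholders):
   
        # Illustrative write call; 'upsert_df' and the S3 path are placeholders.
        upsert_df.write.format('hudi') \
            .options(**hudi_options) \
            .option('hoodie.datasource.write.operation', 'upsert') \
            .mode('append') \
            .save(f's3://my-bucket/{hoodie_table_name}')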

