tjtoll commented on issue #4682:
URL: https://github.com/apache/hudi/issues/4682#issuecomment-1078976112
@rkkalluri thanks, I should be able to get 0.11 tested out today. We are using
the default bloom index (below is our entire config). The table is right
around 1 billion records, and the upsert batch we've been benchmarking with is
800K records. Our average batch is 85% updates and 15% inserts. We are
partitioned by source system (we have ~75 identical source systems) and by a
range partition (500K records per range) on the primary key of the table, which
is an auto-incrementing integer. The record key is set to the same key that
determines the range partition.
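A rough sketch of the partitioning scheme described above may make it easier to reason about. The function names, the column names `source_system` and `id_range`, and the zero-based bucket numbering are assumptions made for illustration; they are not taken from the actual job.

```python
RANGE_SIZE = 500_000  # records per range partition, as described above


def range_bucket(record_id: int, range_size: int = RANGE_SIZE) -> int:
    """Return the zero-based range bucket for an auto-incrementing integer key."""
    if record_id < 0:
        raise ValueError("record_id must be non-negative")
    return record_id // range_size


def partition_path(source_system: str, record_id: int) -> str:
    """Build a hive-style partition path (column names are hypothetical)."""
    return f"source_system={source_system}/id_range={range_bucket(record_id)}"
```

Because the record key is the same integer that picks the bucket, every record in a given partition has a key inside one contiguous 500K-wide window.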
hudi_options = {
    'hoodie.table.name': hoodie_table_name,
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.hive_sync.mode': 'hms',
    'hoodie.datasource.write.recordkey.field': topics[topic_name]['record_key'],
    'hoodie.datasource.write.partitionpath.field': topics[topic_name]['partition_path'],
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator',
    'hoodie.datasource.write.precombine.field': 'ts_ms',
    'hoodie.datasource.hive_sync.database': args['TARGET_DATABASE'],
    'hoodie.parquet.outputtimestamptype': 'TIMESTAMP_MICROS',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.table': hoodie_table_name,
    'hoodie.datasource.hive_sync.auto_create_database': 'true',
    'hoodie.datasource.hive_sync.partition_fields': (
        re.sub(r':.+?,', ',', topics[topic_name]['partition_path'])
    ).split(":")[0],
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.deltastreamer.keygen.timebased.timestamp.type': 'SCALAR',
    'hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit': 'microseconds',
    'hoodie.deltastreamer.keygen.timebased.output.dateformat': 'yyyyMM',
    'hoodie.deltastreamer.keygen.timebased.timezone': 'UTC',
    'hoodie.datasource.hive_sync.skip_ro_suffix': 'true',
    'hoodie.upsert.shuffle.parallelism': 1500,
    'hoodie.insert.shuffle.parallelism': 1500,
    'hoodie.finalize.write.parallelism': 1500,
    'hoodie.bulkinsert.shuffle.parallelism': 1500,
    'hoodie.parquet.compression.codec': 'snappy'
}
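The `hoodie.datasource.hive_sync.partition_fields` expression in the config is terse, so here is a small standalone sketch of what it appears to do: strip the `:TYPE` suffixes that `CustomKeyGenerator` expects in the partition path spec, leaving a plain comma-separated field list for Hive sync. The helper name and the sample field names are hypothetical; only the regex and the `split` come from the config.

```python
import re


def hive_partition_fields(partition_path_spec: str) -> str:
    """Strip the ':TYPE' suffixes from a CustomKeyGenerator partition spec."""
    # Replace every ':TYPE,' with ',' (covers all fields except the last one).
    without_inner_types = re.sub(r':.+?,', ',', partition_path_spec)
    # Drop the ':TYPE' that still trails the last field.
    return without_inner_types.split(":")[0]


# Hypothetical spec with two SIMPLE partition fields.
print(hive_partition_fields("source_system:SIMPLE,id_range:SIMPLE"))
# prints: source_system,id_range
```

A spec with a single field, such as `id_range:SIMPLE`, passes through the regex unchanged and is reduced to `id_range` by the final `split`.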