liujinhui1994 commented on issue #3418: URL: https://github.com/apache/hudi/issues/3418#issuecomment-906027579
> Thanks @nsivabalan and @liujinhui1994, HiveSyncTool indeed helped! So, following your suggestion, `bulk_insert` was a great help for migrating the table. Now that the migration is complete, I am trying to load data incrementally with the `upsert` operation. I observe that an `upsert` of one day's data takes 3 hours, whereas `bulk_insert` took only 2 hours for the entire history with the same resources. Do you have any suggestions for improving `upsert` performance?
>
> I am using the following configuration for `upsert`:
>
> ```
> val hudiOptions = Map[String,String](
>   HoodieWriteConfig.TABLE_NAME -> "hudi_table",
>   HoodieIndexConfig.INDEX_TYPE_PROP -> "GLOBAL_BLOOM",
>   DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY -> "true",
>   DataSourceWriteOptions.OPERATION_OPT_KEY -> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL,
>   DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "data_load_date",
>   DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY -> "org.apache.hudi.keygen.SimpleKeyGenerator",
>   DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> "COPY_ON_WRITE",
>   DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "sid",
>   DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "date_updated")
> ```

If there are a lot of updates, you can try a MOR table.
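A minimal sketch of what the suggested switch to a Merge-on-Read table could look like, based on the quoted config. The only substantive change is `TABLE_TYPE_OPT_KEY`; the inline-compaction keys at the end are an assumed tuning addition, not something stated in this thread:

```scala
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig

val morOptions = Map[String, String](
  HoodieWriteConfig.TABLE_NAME -> "hudi_table",
  // Switch from COPY_ON_WRITE to MERGE_ON_READ: updates are appended to log
  // files instead of rewriting whole base files on every upsert, which tends
  // to make write-heavy incremental loads cheaper.
  DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL,
  DataSourceWriteOptions.OPERATION_OPT_KEY -> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL,
  DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "sid",
  DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "data_load_date",
  DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "date_updated",
  DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY -> "true",
  // Assumed tuning (not from the thread): compact accumulated log files
  // inline after every few delta commits to keep read amplification bounded.
  "hoodie.compact.inline" -> "true",
  "hoodie.compact.inline.max.delta.commits" -> "5"
)
```

Note also that the quoted config uses `GLOBAL_BLOOM`, which checks record keys against every partition on each upsert; if `sid` is unique within a `data_load_date` partition, a non-global `BLOOM` index would likely reduce the index-lookup cost as well.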
