FeiZou commented on issue #3418:
URL: https://github.com/apache/hudi/issues/3418#issuecomment-905712134
Thanks @nsivabalan and @liujinhui1994, HiveSyncTool did help! Based on
your advice, `bulk_insert` helped a lot with migrating the table. Now that the
migration is done, I'm trying to use the `upsert` operation to load data
incrementally. I'm observing that an `upsert` of a single day's data takes 3
hours, while a `bulk_insert` of the entire historical dataset takes only 2 hours
with the same resources. Do you have any advice on improving the performance of
`upsert`?
I'm using the following config for `upsert`:
```scala
val hudiOptions = Map[String, String](
  HoodieWriteConfig.TABLE_NAME -> "hudi_table",
  HoodieIndexConfig.INDEX_TYPE_PROP -> "GLOBAL_BLOOM",
  DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY -> "true",
  DataSourceWriteOptions.OPERATION_OPT_KEY -> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL,
  DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "data_load_date",
  DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY -> "org.apache.hudi.keygen.SimpleKeyGenerator",
  DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> "COPY_ON_WRITE",
  DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "sid",
  DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "date_updated"
)
```
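In case it helps frame the question, here is a sketch of extra options I could layer on top of `hudiOptions`. This is only a guess at what might matter: the key names are taken from the Hudi configuration reference, the values are placeholders rather than measured recommendations, and I'm not sure whether moving from `GLOBAL_BLOOM` to a plain `BLOOM` index is even safe for my key model.

```scala
// Hypothetical tuning additions (placeholder values, not recommendations):
val tuningOptions = Map[String, String](
  // Raise shuffle parallelism on the upsert write path
  "hoodie.upsert.shuffle.parallelism" -> "200",
  // GLOBAL_BLOOM does index lookups across all partitions; plain BLOOM
  // scopes lookups to the record's own partition, but only works if a
  // record key never moves between partitions
  HoodieIndexConfig.INDEX_TYPE_PROP -> "BLOOM"
)

// Merged with the base options at write time:
val writeOptions = hudiOptions ++ tuningOptions
```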
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]