FeiZou commented on issue #3418:
URL: https://github.com/apache/hudi/issues/3418#issuecomment-905712134


   Thanks @nsivabalan and @liujinhui1994, HiveSyncTool did help! Based on your advice, `bulk_insert` helped a lot with migrating the table. Now that the migration is done, I'm trying to use the `upsert` operation to load data incrementally. I'm observing that an `upsert` of a single day's data takes 3 hours, while a `bulk_insert` of the entire historical dataset takes only 2 hours with the same resources. Do you have any advice on improving the performance of `upsert`?
   I'm using the following config for `upsert`:
   ```
   val hudiOptions = Map[String, String](
     HoodieWriteConfig.TABLE_NAME -> "hudi_table",
     HoodieIndexConfig.INDEX_TYPE_PROP -> "GLOBAL_BLOOM",
     DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY -> "true",
     DataSourceWriteOptions.OPERATION_OPT_KEY -> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL,
     DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "data_load_date",
     DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY -> "org.apache.hudi.keygen.SimpleKeyGenerator",
     DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> "COPY_ON_WRITE",
     DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "sid",
     DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "date_updated")
   ```
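   For reference, a minimal sketch of how these options are passed to the Spark datasource writer; the `incrementalDf` DataFrame and the table base path below are placeholders, not the actual job:
   ```
   import org.apache.spark.sql.SaveMode

   // Append-mode write so each daily batch is upserted into the existing Hudi table.
   // `incrementalDf` holds one day's incoming data; the path is a placeholder.
   incrementalDf.write
     .format("hudi")
     .options(hudiOptions)
     .mode(SaveMode.Append)
     .save("/path/to/hudi_table")
   ```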


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
