pengxianzi commented on issue #12589:
URL: https://github.com/apache/hudi/issues/12589#issuecomment-2574618976

   > For migration, maybe you can use bulk_insert to write the historical 
dataset from Kudu in batch execution mode; you can then ingest into this Hudi 
table after switching to the upsert operation.
   > 
   > The write is slow for COW because on each checkpoint, COW triggers a 
whole-table rewrite; the same is true for MOR compaction.
   > 
   > So maybe you can migrate the existing dataset from Kudu using the 
bulk_insert operation, and do streaming upserts with the incremental inputs. If 
the dataset itself is huge, partitioning the table by datetime should also 
help, because that would significantly reduce the scope of each rewrite.
   
   Thank you for your suggestions! Our current approach for large tables aligns 
with your recommendations:
   
   We use the bulk_insert operation to migrate the historical data from Kudu to 
the Hudi table.
   
   We then switch to the upsert operation for incremental data writes.
   
   However, we encountered the following issues during implementation:
   
   Necessity of bucketing: without the bucket index, we see duplicate records 
during Flink writes. Only after enabling bucketing does the duplication issue 
go away.
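
   For reference, a minimal Flink SQL sketch of the setup described above. The 
table name, schema, path, and bucket count are hypothetical placeholders, not 
taken from the issue; the bucket count in particular must be sized to your own 
data volume and cannot be changed after the first write.

   ```sql
   -- Hypothetical example table; adjust columns, path, and bucket count.
   CREATE TABLE hudi_events (
     id BIGINT,
     payload STRING,
     ts TIMESTAMP(3),
     dt STRING,
     PRIMARY KEY (id) NOT ENFORCED
   ) PARTITIONED BY (dt) WITH (
     'connector' = 'hudi',
     'path' = 'hdfs:///warehouse/hudi_events',
     'table.type' = 'MERGE_ON_READ',
     -- Phase 1 (batch backfill from Kudu): 'write.operation' = 'bulk_insert'
     -- Phase 2 (streaming incremental):    'write.operation' = 'upsert'
     'write.operation' = 'upsert',
     -- The bucket index deterministically maps each record key to a bucket,
     -- which is what resolved the duplicate-record issue described above.
     'index.type' = 'BUCKET',
     'hoodie.bucket.index.num.buckets' = '64'
   );
   ```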

