nsivabalan commented on issue #3605:
URL: https://github.com/apache/hudi/issues/3605#issuecomment-917429823


   Hi @Ambarish-Giri:
   For the initial bulk loading of data into Hudi, you can try the "bulk_insert" 
operation. It is expected to be faster than the regular insert/upsert operations. Ensure 
you set the right value for the [avg record size 
config](https://hudi.apache.org/docs/configurations/#hoodiecopyonwriterecordsizeestimate). 
For subsequent operations, Hudi will infer the record size from older 
commits, but for the first commit (bulk import/bulk_insert), Hudi relies on this 
config to pack records into right-sized files. 
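   To make this concrete, here is a minimal sketch of the write options involved, assuming a PySpark datasource write; the table name, path, and key/partition fields are placeholders for your own setup:

   ```python
   # Hudi write options for an initial bulk load (sketch; adapt names/paths).
   # "hoodie.copyonwrite.record.size.estimate" is the average record size in bytes.
   # Hudi uses it on the first commit to size output files: e.g. an estimate of
   # 1024 bytes with a ~120 MB max parquet file size packs roughly
   # 120 * 1024 * 1024 / 1024 ≈ 122k records per file.
   hudi_options = {
       "hoodie.table.name": "my_table",                      # placeholder table name
       "hoodie.datasource.write.operation": "bulk_insert",   # initial bulk load
       "hoodie.datasource.write.recordkey.field": "uuid",    # placeholder key field
       "hoodie.datasource.write.partitionpath.field": "dt",  # placeholder partition field
       "hoodie.copyonwrite.record.size.estimate": "1024",    # avg record size in bytes
   }

   # With a Spark DataFrame `df`, the write itself would look like:
   # df.write.format("hudi").options(**hudi_options).mode("overwrite").save("/path/to/table")
   ```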
   
   A couple of questions before we dive into perf in detail: 
   1. May I know what your upsert characteristics are? Are updates spread across all 
partitions, or only a few recent partitions? 
   2. Do your record keys have any timestamp affinity or ordering characteristics? If 
record keys are completely random, we can try the SIMPLE index, since the bloom index may not 
be very effective for completely random keys. 
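   If random keys turn out to be the case, switching the index is a single config change; a sketch (merge this with your existing write options):

   ```python
   # Override the default BLOOM index with SIMPLE for completely random keys.
   # SIMPLE joins incoming records against keys read from existing base files,
   # so it avoids bloom-filter false-positive lookups that hurt random-key workloads.
   index_options = {
       "hoodie.index.type": "SIMPLE",  # default is BLOOM
   }
   ```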
   
   
   

