nsivabalan commented on issue #3605: URL: https://github.com/apache/hudi/issues/3605#issuecomment-917429823
Hi @Ambarish-Giri: For the initial bulk load of data into Hudi, you can try the "bulk_insert" operation; it is expected to be faster than the regular write operations. Make sure you set the right value for the [avg record size config](https://hudi.apache.org/docs/configurations/#hoodiecopyonwriterecordsizeestimate). For subsequent operations, Hudi infers the record size from older commits, but for the first commit (bulk import/bulk_insert) it relies on this config to pack records into right-sized files.

A couple of questions before we dive into perf in detail:
1. What are your upsert characteristics? Are updates spread across all partitions, or just a few recent partitions?
2. Does your record key have any timestamp affinity or ordering characteristics? If record keys are completely random, we can try the SIMPLE index, since bloom filters may not be very effective for completely random keys.
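As a rough sketch of the first-load setup described above (the table name, record key, and partition field below are illustrative assumptions, not from your setup; the config keys are the ones in the Hudi docs):

```python
# Hypothetical Hudi write options for an initial bulk load.
# Table name, key field, partition field, and size estimate are assumptions.
hudi_options = {
    "hoodie.table.name": "my_table",                      # hypothetical table name
    "hoodie.datasource.write.operation": "bulk_insert",   # faster for initial load
    "hoodie.datasource.write.recordkey.field": "uuid",    # hypothetical record key
    "hoodie.datasource.write.partitionpath.field": "ds",  # hypothetical partition field
    # The first commit has no earlier commits to infer record size from,
    # so give Hudi an average record size (in bytes) to size files correctly:
    "hoodie.copyonwrite.record.size.estimate": "512",
    # If record keys turn out to be completely random, SIMPLE may beat BLOOM:
    # "hoodie.index.type": "SIMPLE",
}

# Typical Spark DataFrame writer usage (df and base_path assumed to exist):
# df.write.format("hudi").options(**hudi_options).mode("overwrite").save(base_path)
```

The commented-out `hoodie.index.type` line shows where you would switch the index if the answers to the questions above point that way.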
