nsivabalan commented on issue #3400: URL: https://github.com/apache/hudi/issues/3400#issuecomment-903878681
@mkk1490: Sorry for the late turnaround; I have been busy with work. A few clarifying questions:

1. What do you mean by "incremental loads"?
```
Original data size = 414 GB and Hudi data size = 424 GB
After 4 incremental loads, original came around 2100 GB with Snapshot data in each partition and Hudi data size was 547 GB
I tested for another scenario with >1.5 TB with the same process as above
```
Here, does an incremental load mean you took your original dataset, made updates to some of the records, and did an upsert to Hudi? Or did you ingest into Hudi without making any updates? Also, is 2100 GB = (414 GB * 5), i.e. one version of the original plus 4 incremental copies?

2. By "Hudi data size", do you mean on-disk size (which includes all versions of the data), or just the snapshot data size?

3.
```
Original = 1.6 TB, Hudi 1.3 TB
After 1 incremental, orig = 2.9 TB and Hudi 2.5 TB
```
This confuses me as to what "incremental" means here. In the previous example, I assumed you took the entire original dataset and made another copy with some updates, but here I see only 1.3 TB added to the 1.6 TB. Or does "incremental" in general mean updates to some records from the original snapshot?

Once I get clarification on these, I will respond to your questions.

@codope: will look into the delete issue with bulk_insert.
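To make question 1 concrete, here is a quick back-of-the-envelope check of the interpretation being asked about (the 414 GB and 2100 GB figures come from the numbers quoted above; the variable names are just for this sketch):

```python
# Sizes reported in the issue, in GB (illustrative variable names).
original_gb = 414
reported_after_4_loads_gb = 2100

# If each "incremental load" is a full copy of the original dataset,
# then 4 loads on top of the initial version would give 5 copies total.
expected_gb = original_gb * 5
print(expected_gb)  # 2070

# The reported 2100 GB is close to 5 full copies, which is why the
# question asks whether it is one original version + 4 incremental copies.
print(reported_after_4_loads_gb - expected_gb)  # 30
```

If, on the other hand, each incremental load only touched a subset of records, the non-Hudi copy would be expected to grow by much less than a full 414 GB per load, which is the ambiguity the question is trying to resolve.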
