nsivabalan commented on issue #1955: URL: https://github.com/apache/hudi/issues/1955#issuecomment-679352339
@tooptoop4: can you clarify what you mean by this?

```
ie for each version_no,group_company combo, i want to get the latest row by TimeCreated (ie the source-ordering-field) and then partition on whatever sys_user that latest row has.
```

But in general, yes: if you use a global index with the update partition path config set, you should not see any duplicates anywhere in your hoodie dataset. I can try to illustrate with an example. Let's say each row consists of only 4 values: v_no (version_no), cmp (group_company), time_cr and sys_user. Here the record key is (v_no, cmp), the ordering field is time_cr and the partition path is sys_user.

In case of a regular index, the combination of record key and partition path forms the unique key. If you are using a regular index and ingest

    v_1, c_1, t_1, u_1
    v_2, c_1, t_1, u_1
    v_1, c_1, t_1, u_2
    v_1, c_1, t_1, u_3

this will result in 2 rows going to partition u_1, 1 row to partition u_2 and 1 row to partition u_3. In the 2nd batch, let's say you ingest a few more rows:

    v_1, c_1, t_2, u_1
    v_3, c_1, t_2, u_1
    v_1, c_1, t_2, u_2
    v_1, c_3, t_2, u_3

Here is the result:

u_1:

    v_1, c_1, t_2, u_1 (updated with latest value)
    v_2, c_1, t_1, u_1
    v_3, c_1, t_2, u_1 (insert from 2nd batch)

u_2:

    v_1, c_1, t_2, u_2 (updated with latest value)

u_3:

    v_1, c_1, t_1, u_3
    v_1, c_3, t_2, u_3 (insert from 2nd batch)

In case of a global index, only record keys are unique. Let's see an example with global bloom, but with the update partition path config not set. If the 1st batch of ingest contains

    v_1, c_1, t_1, u_1
    v_1, c_2, t_1, u_1
    v_2, c_1, t_1, u_2
    v_3, c_1, t_1, u_3

the result will be

    v_1, c_1, t_1, u_1
    v_1, c_2, t_1, u_1
    v_2, c_1, t_1, u_2
    v_3, c_1, t_1, u_3

And the 2nd batch of ingest contains

    v_1, c_1, t_2, u_1 (updating with latest time)
    v_1, c_2, t_2, u_2 (moving v_1, c_2 from u_1 to u_2; the expectation is that this will update u_1 only, since the config is not set, and hence the new partition path, i.e. u_2, will be ignored)
    v_2, c_2, t_2, u_2 (new insert)
    v_1, c_3, t_2, u_3 (new insert)

So, the result will be

    v_1, c_1, t_2, u_1 (updated with latest time)
    v_1, c_2, t_2, u_1 (updated with latest time even though the incoming record was sent to u_2)
    v_2, c_1, t_1, u_2
    v_2, c_2, t_2, u_2 (new insert)
    v_3, c_1, t_1, u_3
    v_1, c_3, t_2, u_3 (new insert)

We can go through the same example with the config set. Result from the first batch:

    v_1, c_1, t_1, u_1
    v_1, c_2, t_1, u_1
    v_2, c_1, t_1, u_2
    v_3, c_1, t_1, u_3

And the 2nd batch of ingest contains

    v_1, c_1, t_2, u_1 (updating with latest time)
    v_1, c_2, t_2, u_2 (moving v_1, c_2 from u_1 to u_2; the expectation is that this will insert a new record into u_2 and delete the corresponding record from u_1, since the config is set)
    v_2, c_2, t_2, u_2 (new insert)
    v_1, c_3, t_2, u_3 (new insert)

So, the result will be

    v_1, c_1, t_2, u_1 (updated with latest time)
    v_1, c_2, t_2, u_2 (updated with latest time and the old record is deleted)
    v_2, c_1, t_1, u_2
    v_2, c_2, t_2, u_2 (new insert)
    v_3, c_1, t_1, u_3
    v_1, c_3, t_2, u_3 (new insert)
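
The three behaviours described above can be sketched as a toy in-memory model in Python. This is not Hudi code: the `upsert` function, the `Row`/`Table` types and the `run` helper are hypothetical names made up for illustration, timestamps are plain integers standing in for t_1 and t_2, and the fourth column is treated as the partition, as in the rows above. It replays the two global-index batches under a regular index, a global index without the update partition path config, and a global index with it.

```python
from typing import Dict, Iterable, Tuple

Row = Tuple[str, str, int, str]           # (v_no, cmp, time_cr, sys_user)
Table = Dict[Tuple[str, str, str], int]   # (v_no, cmp, partition) -> time_cr

# Assumption: in Hudi these modes roughly correspond to hoodie.index.type
# (e.g. BLOOM vs GLOBAL_BLOOM) and hoodie.bloom.index.update.partition.path.
# Check the configuration docs for your Hudi version.


def upsert(table: Table, batch: Iterable[Row], global_index: bool,
           update_partition_path: bool = False) -> Table:
    """Toy upsert: record key is (v_no, cmp), partition is sys_user."""
    result = dict(table)
    for v_no, cmp, time_cr, user in batch:
        target = user
        old = None
        if global_index:
            # A global index looks the record key up across all partitions.
            old = next((p for (v, c, p) in result if (v, c) == (v_no, cmp)), None)
        if old is not None and old != user:
            if update_partition_path:
                if time_cr < result[(v_no, cmp, old)]:
                    continue  # stale row: keep the newer stored copy
                del result[(v_no, cmp, old)]  # delete from the old partition
            else:
                target = old  # incoming partition path is ignored
        previous = result.get((v_no, cmp, target))
        if previous is None or time_cr >= previous:
            result[(v_no, cmp, target)] = time_cr
    return result


batch_1 = [("v_1", "c_1", 1, "u_1"), ("v_1", "c_2", 1, "u_1"),
           ("v_2", "c_1", 1, "u_2"), ("v_3", "c_1", 1, "u_3")]
batch_2 = [("v_1", "c_1", 2, "u_1"), ("v_1", "c_2", 2, "u_2"),
           ("v_2", "c_2", 2, "u_2"), ("v_1", "c_3", 2, "u_3")]


def run(global_index: bool, update_partition_path: bool = False) -> Table:
    table = upsert({}, batch_1, global_index, update_partition_path)
    return upsert(table, batch_2, global_index, update_partition_path)


regular = run(global_index=False)
global_keep = run(global_index=True, update_partition_path=False)
global_move = run(global_index=True, update_partition_path=True)

for name, table in [("regular", regular), ("global, config not set", global_keep),
                    ("global, config set", global_move)]:
    print(name)
    for (v_no, cmp, partition), time_cr in sorted(table.items()):
        print(f"  {v_no}, {cmp}, t_{time_cr}, {partition}")
```

With the regular index the key (v_1, c_2) ends up in both u_1 and u_2, which is the duplicate the global index avoids; the two global runs differ only in whether that record stays in u_1 or moves to u_2.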
