nsivabalan commented on issue #7897: URL: https://github.com/apache/hudi/issues/7897#issuecomment-1454333905
hey @menna224 : let me clarify a few things first, and then I'll ask for some clarification.

Commit1: key1, val1 : file1_v1.parquet.
Commit2: key2, val2 : file1_v2.parquet.

Both file1_v1 and file1_v2 belong to the same file group. When you run a read query, Hudi will only read file1_v2.parquet. This is due to small file handling: the new insert is packed into the existing small file, so file1_v2 holds both key1 and key2. The cleaner, when it gets executed later, will clean up file1_v1.parquet. But once file1_v2.parquet is created, none of your snapshot queries will read from file1_v1.

Commit3: key3, val3 : again due to small file handling, file1_v3.parquet.
Commit4: key3, val4 (same key as before, but an update) : Hudi will add a log file to file1 (the file group).

So, on disk it's file1_v3.parquet and log_file1.parquet. With rt, Hudi will read both of them, merge, and serve. In case of ro, Hudi will read just file1_v3.parquet.

Let's say we keep adding more updates for key3. More log files will be added. Once compaction kicks in, a new parquet file will be created, file1_v4.parquet (which is a merged version of file1_v3 + all associated log files).

Can you clarify what issue you are seeing? Your example wasn't very clear to me, especially these statements:

```
then after the 10th update where i changed the name to "joe", I can see 10 log files, and only 1 parquet file, the parquet file that is kept is the last one (file3.parquet) with the old values not the updates ones:
(id=3,name=mg)
(id=4,name=sa)
(id=5,name=john)
and file1.parquet &file2.parquet were delted.
rt table contained the right values (the three records and the last record has a value joe for the coloum name)
ro contained the values that's in the parquet
```
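To make the commit sequence above concrete, here is a rough in-memory sketch of one file group. This is not Hudi code and does not use any Hudi API: the `FileGroup` class and its method names are hypothetical, and each "file" is just a Python dict. It only illustrates why ro keeps returning the old value until compaction runs, while rt returns the merged one.

```python
from dataclasses import dataclass, field


@dataclass
class FileGroup:
    """Toy model of one file group (hypothetical, not a Hudi class)."""
    base_files: list = field(default_factory=list)  # one dict per base file version
    log_files: list = field(default_factory=list)   # updates written since the latest base file

    def _latest_base(self):
        # Readers only ever look at the newest base file version.
        return dict(self.base_files[-1]) if self.base_files else {}

    def insert(self, key, value):
        # Small file handling: rewrite the latest base file with the new record added.
        new_version = self._latest_base()
        new_version[key] = value
        self.base_files.append(new_version)

    def update(self, key, value):
        # An update to an existing key goes to a log file, not a new base file.
        self.log_files.append({key: value})

    def read_optimized(self):
        # ro: latest base file only, log files are ignored.
        return self._latest_base()

    def snapshot(self):
        # rt: latest base file merged with all log files, in write order.
        merged = self._latest_base()
        for log in self.log_files:
            merged.update(log)
        return merged

    def compact(self):
        # Compaction writes a new base file holding the merged view.
        self.base_files.append(self.snapshot())
        self.log_files.clear()

    def clean(self, retain=1):
        # Cleaner drops older base file versions; retain must be >= 1.
        del self.base_files[:-retain]


fg = FileGroup()
fg.insert("key1", "val1")   # commit1 -> file1_v1
fg.insert("key2", "val2")   # commit2 -> file1_v2
fg.insert("key3", "val3")   # commit3 -> file1_v3
fg.update("key3", "val4")   # commit4 -> log file

ro_view = fg.read_optimized()   # still has key3 = val3
rt_view = fg.snapshot()         # has key3 = val4

fg.compact()                    # file1_v4 = file1_v3 + log files
```

With this model, seeing old values in ro while rt is correct is the expected state whenever log files exist and compaction has not yet run.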
