nsivabalan commented on issue #7897:
URL: https://github.com/apache/hudi/issues/7897#issuecomment-1454333905

   hey @menna224 :
   let me clarify a few things first, and then I'll ask for some clarification. 
   
   Commit1: 
   Key1, val1 : file1_v1.parquet. 
   
   Commit2: 
   key2, val2: file1_v2.parquet 
   
   Both file1_v1 and file1_v2 belong to the same file group. Because of small 
file handling, key2 is written into the existing small file, so 
file1_v2.parquet contains both key1 and key2. When you run a read query, Hudi 
will only read file1_v2.parquet. The cleaner, when it gets executed later, will 
clean up file1_v1.parquet, but once file1_v2.parquet is created, none of your 
snapshot queries will read from file1_v1.
   
   Commit3:
   key3, val3: again due to small file handling, file1_v3.parquet. 
   
   Commit4: 
   key3, val4 (same key as before, but an update)
   Hudi will add a log file to file1 (file group). 
   
   So, on disk it's file1_v3.parquet and log_file1.parquet. 
   
   With rt, Hudi will read both of them, merge them and serve the result. 
   In case of ro, Hudi will read just file1_v3.parquet. 
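   
   To make the rt vs ro difference concrete, here is a toy Python model of the 
file group after commit 4. It is only a sketch of the idea, not Hudi's actual 
code: the function names are made up for illustration, and files are modelled 
as plain dicts.

```python
# Toy model of one merge-on-read file group; this is NOT Hudi's implementation.
# read_optimized() and snapshot() are hypothetical helper names.

def read_optimized(base_file):
    """ro view: return only what the latest base (parquet) file holds."""
    return dict(base_file)

def snapshot(base_file, log_files):
    """rt view: apply every log file, oldest first, on top of the base file."""
    merged = dict(base_file)
    for log in log_files:
        merged.update(log)
    return merged

# State after commit 4 in the walkthrough above.
file1_v3 = {"key1": "val1", "key2": "val2", "key3": "val3"}
log_files = [{"key3": "val4"}]

ro_view = read_optimized(file1_v3)
rt_view = snapshot(file1_v3, log_files)
print(ro_view["key3"], rt_view["key3"])  # val3 val4
```

   So ro keeps returning val3 for key3 as long as the update lives only in the 
log file, while rt already returns val4.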
   
   Let's say we keep adding more updates for key3. More log files will be 
added. 
   Once compaction kicks in, a new parquet file will be created: 
file1_v4.parquet (which is a merged version of file1_v3 + all associated log 
files).
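   
   A rough sketch of what compaction does to that file group, again as a toy 
Python model and not Hudi's real code. The compact() helper and the extra 
values val5 and val6 are assumptions added purely for illustration.

```python
# Toy model of compaction for one file group; NOT Hudi's implementation.
# compact() is a hypothetical helper; val5 and val6 are made-up later updates.

def compact(base_file, log_files):
    """Merge the base file and its log files into a new base file."""
    new_base = dict(base_file)
    for log in log_files:  # oldest first, so the latest update wins
        new_base.update(log)
    return new_base, []  # the new file slice starts with no log files

file1_v3 = {"key1": "val1", "key2": "val2", "key3": "val3"}
log_files = [{"key3": "val4"}, {"key3": "val5"}, {"key3": "val6"}]

file1_v4, remaining_logs = compact(file1_v3, log_files)
print(file1_v4["key3"], len(remaining_logs))  # val6 0
```

   In this model, a ro query only sees the latest value of key3 after 
compaction has produced file1_v4; until then it keeps reading file1_v3.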
   
   Can you clarify what issue you are seeing? Your example wasn't very 
clear to me, 
   especially these statements:
   ```
   then after the 10th update where i changed the name to "joe", I can see 10 
log files, and only 1 parquet file, the parquet file that is kept is the last 
one (file3.parquet) with the old values not the updates ones:
   (id=3,name=mg)
   (id=4,name=sa)
   (id=5,name=john)
   
   and file1.parquet &file2.parquet were delted.
   rt table contained the right values (the three records and the last record 
has a value joe for the coloum name)
   ro contained the values that's in the parquet
   ```
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
