vinothchandar commented on PR #5436:
URL: https://github.com/apache/hudi/pull/5436#issuecomment-1151213942

   > because computing the diff on the fly is not a heavy work like you 
previously mentioned.
   
   I think this is the crux of the earlier disagreement and the key concern from @YannByron . Let's take a COW example where the records changed in the latest file slice are spread evenly across row groups/pages (red lines on the side) in the previous file slice. To compute the diff, we potentially need to read the entire previous file slice.
   
   <img width="1163" alt="image" src="https://user-images.githubusercontent.com/1179324/172874055-99e4f713-21b6-4278-bed6-1b3741590f25.png">
   
   If we have a CDC log attached to the slice at t1, then we only read the changed rows. So there is, IMO, a difference in total query I/O, i.e. bytes read per CDC query. This has been our single biggest source of confusion. We say we'll make it flexible around on-the-fly vs materializing, and then circle right back to claiming either that materializing is not needed :) and/or that on-the-fly is cheap. Should we actually microbenchmark both approaches and be done with it?
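   To make the I/O argument concrete, here is a back-of-envelope model of the bytes read per CDC query under each approach. The function names and the numbers (100 row groups, 128 MB each, 1,000 changed records of ~1 KB) are purely illustrative assumptions of mine, not figures from this PR or from Hudi:

   ```python
   # Back-of-envelope I/O model; all numbers below are illustrative, not measured.

   def bytes_read_on_the_fly(touched_row_groups, row_group_bytes):
       # Computing the diff on the fly means reading every row group of the
       # previous file slice that contains at least one changed record. If
       # changes are spread evenly, that can be every row group in the slice.
       return touched_row_groups * row_group_bytes

   def bytes_read_cdc_log(changed_records, record_bytes):
       # With a materialized CDC log attached to the slice, only the change
       # records themselves are read.
       return changed_records * record_bytes

   # 1,000 changed records spread evenly across 100 row groups of 128 MB,
   # ~1 KB per change record:
   on_the_fly = bytes_read_on_the_fly(100, 128 * 1024 * 1024)
   cdc = bytes_read_cdc_log(1000, 1024)
   print(on_the_fly // cdc)  # on-the-fly reads ~13000x more bytes here
   ```

   A real microbenchmark would of course have to account for predicate pushdown, page-level skipping, and the write amplification of materializing the log; this sketch only frames the read-side gap being debated.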
   
   >the MOR table has different layout for different index,
   
   yes, that's my meta point: our design should not be based on handles, but on a more generic CDC log format. In that case, inserts are sent to the log (even for the Spark/HBase index, IIRC), so we would encode `I` in the CDC format correctly.
   This is what I mean by "we need to design this even more generically at the file group/slice level".
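   For illustration, a generic file-group-level CDC record might carry an op code plus before/after images, so an insert routed to the log is encoded as `I` with no pre-image. The field names here are hypothetical, my own sketch and not Hudi's actual CDC schema:

   ```python
   # Hypothetical CDC record shape for illustration only; the real schema is
   # whatever the RFC under discussion defines.
   from dataclasses import dataclass
   from typing import Optional

   @dataclass
   class CdcRecord:
       op: str                 # 'I' (insert), 'U' (update), 'D' (delete)
       key: str                # record key within the file group
       before: Optional[dict]  # pre-image; None for inserts
       after: Optional[dict]   # post-image; None for deletes

   # An insert sent to the log (e.g. MOR with the Spark/HBase index) becomes
   # an 'I' record with no pre-image:
   rec = CdcRecord(op="I", key="uuid-1", before=None, after={"fare": 27.7})
   print(rec.op)  # I
   ```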


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
