vinothchandar commented on PR #5436: URL: https://github.com/apache/hudi/pull/5436#issuecomment-1151213942
> because computing the diff on the fly is not a heavy work like you previously mentioned.

I think this is the crux of the earlier disagreement and the key concern from @YannByron. Let's take a COW example where the records changed in the latest file slice are spread evenly across the row groups/pages of the previous file slice (the red lines on the side). To compute the diff, we potentially need to read the entire previous file slice.

<img width="1163" alt="image" src="https://user-images.githubusercontent.com/1179324/172874055-99e4f713-21b6-4278-bed6-1b3741590f25.png">

If we have a CDC log attached to the slice at t1, then we only read the changed rows. So IMO there is a difference in total query I/O, i.e. bytes read per CDC query.

This has been our single biggest source of confusion. We say we will keep it flexible between on-the-fly and materializing, and then go right back to claiming that materializing is not needed :) and/or that on-the-fly is cheap. Should we actually microbenchmark both approaches and be done?

> the MOR table has different layout for different index,

Yes, that's my meta point: our design should not be based on handles, but on a more generic CDC log format. In the case where inserts are sent to the log (even for the Spark/HBase index they are, IIRC), we would then encode `I` in the CDC format correctly. This is what I meant by "we need to design this even more generically at the file group/slice level".
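On the COW example above, a rough back-of-the-envelope sketch of the I/O gap, not a benchmark. It assumes a row group is the smallest unit we can read and counts rows instead of bytes; the function names and the numbers are made up for illustration and are not Hudi APIs. Page-level skipping or column pruning would shrink the on-the-fly figure.

```python
def row_groups_touched(changed_positions, rows_per_group):
    """Indexes of the row groups that hold at least one changed row."""
    return {pos // rows_per_group for pos in changed_positions}


def rows_read_on_the_fly(changed_positions, rows_per_group, total_rows):
    """Rows scanned in the previous file slice to compute the diff.

    Assumption: one changed row forces a read of its whole row group.
    """
    rows = 0
    for group in row_groups_touched(changed_positions, rows_per_group):
        start = group * rows_per_group
        # The last row group may be shorter than rows_per_group.
        rows += min(rows_per_group, total_rows - start)
    return rows


def rows_read_from_cdc_log(changed_positions):
    """Rows read when a CDC log holds only the changed rows."""
    return len(set(changed_positions))


# Hypothetical file slice: 1M rows in row groups of 10k rows.
total_rows = 1_000_000
rows_per_group = 10_000

# 100 changed rows spread evenly, one per row group.
spread_evenly = list(range(0, total_rows, rows_per_group))
on_the_fly_spread = rows_read_on_the_fly(spread_evenly, rows_per_group, total_rows)
cdc_spread = rows_read_from_cdc_log(spread_evenly)

# The same 100 changed rows, all clustered in the first row group.
clustered = list(range(100))
on_the_fly_clustered = rows_read_on_the_fly(clustered, rows_per_group, total_rows)

print(on_the_fly_spread, cdc_spread, on_the_fly_clustered)
```

With evenly spread changes the on-the-fly path touches every row group, while the CDC log path reads only the 100 changed rows. When the changes are clustered the gap narrows to a single row group, which is why a microbenchmark over realistic update patterns would settle this better than either claim.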
