xushiyan commented on PR #5436:
URL: https://github.com/apache/hudi/pull/5436#issuecomment-1120389506

   I read through the RFC and Danny's design doc, and I also prefer option B,
leveraging `_hoodie_operation`, which introduces less complexity and overhead
on the storage side. I share the concern that the management and cost overhead
of `.cdc/` might not justify the gain in read efficiency. For example, we've
put a lot of effort into making the metadata table stable and in sync with the
data table; that is a reference for what we might have to do for `.cdc/`.
Storage-wise, as mentioned above, even if the fraction is small, the actual
cost can be significant because the table itself is huge. Even if the storage
size is acceptable, for cloud storage users the added API calls to save new
objects incur more charges regardless of size. In update-heavy tables, this
becomes impactful.
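
   A rough back-of-the-envelope sketch of the cost argument above, in Python.
The function name, its parameters, and every input value are hypothetical
placeholders for illustration only. They are not measurements from any Hudi
table and not prices quoted from any cloud provider.

```python
def cdc_overhead_per_month(table_tb, cdc_fraction, usd_per_gb_month,
                           commits_per_day, cdc_files_per_commit,
                           usd_per_1k_puts, days=30):
    """Estimate the extra monthly cost of keeping separate change files.

    All arguments are assumptions supplied by the caller:
      table_tb             -- size of the data table, in TB
      cdc_fraction         -- extra storage as a fraction of the table size
      usd_per_gb_month     -- storage price per GB per month
      commits_per_day      -- number of write commits per day
      cdc_files_per_commit -- extra objects written per commit
      usd_per_1k_puts      -- price per 1,000 object-write requests
    Returns (storage_usd, request_usd).
    """
    # Extra bytes stored scale with the table size, so a small fraction
    # of a huge table is still a large absolute amount.
    extra_gb = table_tb * 1024 * cdc_fraction
    storage_usd = extra_gb * usd_per_gb_month

    # Every extra object costs a write request, regardless of its size.
    extra_puts = commits_per_day * cdc_files_per_commit * days
    request_usd = extra_puts / 1000 * usd_per_1k_puts

    return storage_usd, request_usd


# Placeholder inputs: a 500 TB table, 2% extra storage, a commit every
# 5 minutes (288 per day), and 200 extra objects per commit.
storage_usd, request_usd = cdc_overhead_per_month(
    table_tb=500, cdc_fraction=0.02, usd_per_gb_month=0.02,
    commits_per_day=288, cdc_files_per_commit=200, usd_per_1k_puts=0.005)
print(f"storage: ${storage_usd:.2f}/month, requests: ${request_usd:.2f}/month")
```

   The storage term grows with the table size and the request term grows with
the commit rate, which is why update-heavy tables are the ones most affected.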
   
   On option B using `_hoodie_operation`, I agree some benchmarking would be
very helpful. It may be worth putting more energy there to optimize the logic
if needed. UX-wise, it is a better fit for users already running incremental
query pipelines: they turn on a new config and get the CDC info.
   
   In short, I prefer leveraging and improving what we already have in Hudi.
Regardless of the design approach, this is a great initiative!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
