prashantwason commented on pull request #4067: URL: https://github.com/apache/hudi/pull/4067#issuecomment-996247192
@vinothchandar and other reviewers: I have a basic question on this implementation - under what scenario would a user ever want to keep the duplicated keys in HFile?

Note that:

1. Even though HFile is a supported base file format, it cannot really be used as a base format in HUDI, as there is no concrete query-side support in query engines.
2. The way the records are saved in HFile (HFile key = hoodie record key, HFile value = Avro-encoded record) was invented for a HUDI-specific use case and hence cannot be used directly without writing a HUDI-specific record reader.
3. If this setting is ever enabled, there is no going back (no downgrade, no disabling the setting) without blowing up the metadata table (or any dataset using the HFile format).
4. This PR introduces a lot of plumbing code just to get a setting over to the HFileWriter. The main functionality of not writing the key is a small part of the change.
5. I personally feel that having too many configs and options without thinking through the use cases is not a good idea from a maintenance, testing and upgrade perspective. Why even give the user the choice to duplicate or de-duplicate when we are the ones inventing how records are saved within the HFile?
6. Such settings within HUDI (like base file format, log file format, key column, etc.) should not be controlled through HoodieWriteConfig, because once enabled it is not safe to ever disable them at will. Hence, we should think of a simpler way of enabling/disabling such settings (maybe through hoodie.properties).
7. There are other features with similar downgrade issues - e.g. virtual key support - which, once enabled, cannot be downgraded in place without blowing up the data and re-bootstrapping.

So, based on the above notes, I feel we should make de-duplication the only way and simplify the code.
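To make the duplication in point 2 concrete, here is a minimal sketch of the two layouts. It is not the PR's code: JSON stands in for Avro encoding, and `encode_entry` / `decode_entry` are hypothetical helpers, not Hudi APIs. The only name taken from HUDI is the `_hoodie_record_key` metadata field.

```python
import json

# Hudi's record-key metadata field; everything else here is illustrative.
KEY_FIELD = "_hoodie_record_key"


def encode_entry(record, dedupe):
    """Return (hfile_key, hfile_value) for one record.

    JSON is an assumed stand-in for the Avro encoding used by HUDI.
    With dedupe=True the key is dropped from the value, since it is
    already stored as the HFile key.
    """
    key = record[KEY_FIELD].encode("utf-8")
    payload = dict(record)
    if dedupe:
        payload.pop(KEY_FIELD)
    value = json.dumps(payload, sort_keys=True).encode("utf-8")
    return key, value


def decode_entry(key, value):
    """Rebuild the record, restoring the key field if it was not stored."""
    record = json.loads(value.decode("utf-8"))
    record.setdefault(KEY_FIELD, key.decode("utf-8"))
    return record


record = {KEY_FIELD: "uuid-0001", "rider": "rider-A", "fare": 27.7}
dup_key, dup_value = encode_entry(record, dedupe=False)
dedup_key, dedup_value = encode_entry(record, dedupe=True)
```

In this sketch the reader recovers the same record from either layout, so a writer-side choice is not needed for correctness; the de-duplicated value is simply smaller.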
