prashantwason commented on pull request #4067:
URL: https://github.com/apache/hudi/pull/4067#issuecomment-996247192


   @vinothchandar and other reviewers: I have a basic question about this 
implementation: under what scenario would a user ever want to keep the 
duplicated keys in the HFile? 
   
   Note that:
   1. Even though HFile is a supported base file format, it cannot really be 
used as a base format in HUDI because the query engines have no concrete 
query-side support for it.
   2. The way the records are saved in HFile (HFile key=hoodie record key, 
HFile value=avro encoded record) was invented for a HUDI-specific use case 
and hence cannot be read directly without writing a HUDI-specific record reader.
   3. If this setting is ever enabled, there is no going back (no downgrade, no 
disabling the setting) without blowing up the metadata table (or any dataset 
using HFile format).
   4. This PR introduces a lot of plumbing code just to get a setting over to 
the HFileWriter. The main functionality of not writing the key is a small part 
of the change.
   5. I personally feel that having too many configs and options without 
thinking through the use cases is not a good idea from a maintenance, testing, 
and upgrade perspective. Why even give the user the choice to duplicate or 
de-duplicate when we are the ones inventing how records are saved within the 
HFile? 
   6. Such settings within HUDI (like base file format, log file format, key 
column, etc.) should not be controlled through HoodieWriteConfig, because once 
enabled it is not safe to ever disable them at will. Hence, we should think of 
a simpler way of enabling/disabling such settings (maybe through 
hoodie.properties).
   7. There are other such features which have similar downgrade issues - e.g. 
virtual key support - which when enabled once cannot be downgraded in-place 
without blowing up the data and re-bootstrapping.
   
   So based on the above notes, I feel we should make de-duplication the only 
way and simplify the code. 
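
   To make the layout in note 2 concrete, here is a minimal sketch of key 
de-duplication. It is not the code from this PR: JSON stands in for the Avro 
encoding, `encode` and `decode` are hypothetical helper names, and 
`_hoodie_record_key` is used only as an illustrative field name for the 
record key.

```python
import json

# Illustrative name for the field that holds the record key.
KEY_FIELD = "_hoodie_record_key"


def encode(record, dedupe_key=True):
    """Return the (HFile key, HFile value) byte pair for one record.

    JSON is an assumption made for brevity; the real value is Avro encoded.
    """
    key = record[KEY_FIELD]
    payload = dict(record)  # copy so the caller's record is not mutated
    if dedupe_key:
        # The key is already stored as the HFile key, so drop it from the value.
        payload.pop(KEY_FIELD)
    value = json.dumps(payload, sort_keys=True)
    return key.encode("utf-8"), value.encode("utf-8")


def decode(hfile_key, hfile_value):
    """Rebuild the full record, restoring the key field from the HFile key."""
    record = json.loads(hfile_value.decode("utf-8"))
    record[KEY_FIELD] = hfile_key.decode("utf-8")
    return record
```

   In this sketch the reader always restores the key field from the HFile 
key, so the de-duplicated value carries everything needed to rebuild the 
record and the duplicated copy adds only bytes.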
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

