manojpec commented on pull request #4067:
URL: https://github.com/apache/hudi/pull/4067#issuecomment-979015349


   > Concept looks good. But why introduce a new block type and not do it for 
the HoodieHFileDataBlock itself?
   > 
   > When the HFile format is used, whether for the Metadata Table or elsewhere in 
HUDI, there will always be a key for the HFile, and it will be derived from 
some field of the record. Hence, this HFile key will always be redundant. 
Therefore this optimization needs to be performed for HoodieHFileDataBlock itself.
   > 
   > HoodieHFileDataBlock already accepts a "keyField". We can simplify this 
change by:
   > 
   > 1. If keyField is not None:
   >    
   >    * set keyField to "null" and do not save it
   >    * materialize the keyField from HFile key
   > 2. If keyField is None - no need to do the above
   
   I initially proposed a new on-disk block type, such as 
HFILE_METADATA_BLOCK, to differentiate it from other HFILE_DATA_BLOCKs. But that 
changes the on-disk block format, so it is not backward compatible and 
downgrades would not work. After further discussion, we chose to layer the code 
so that the metadata-record-specific key deduplication logic does not spill 
over into the lowest HFile block layer. This structure also restricts the 
functionality to specific users of HFile (here, the Metadata Table) rather than 
applying it to all of them. 
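   To illustrate the layering, here is a minimal sketch, not the code in this 
PR. All class and method names (KeyDedupSketch, GenericBlock, DedupingStore) are 
hypothetical, and records are simplified to string maps. The lower layer stores 
whatever it is given; only the opt-in higher layer drops the key field on write 
and materializes it again on read.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

final class KeyDedupSketch {

    /** Stand-in for the lowest block layer: a sorted key -> record store with no dedup logic. */
    static final class GenericBlock {
        private final TreeMap<String, Map<String, String>> entries = new TreeMap<>();

        void put(String key, Map<String, String> record) {
            entries.put(key, new LinkedHashMap<>(record));
        }

        /** Returns a copy of the stored record, or null if the key is absent. */
        Map<String, String> get(String key) {
            Map<String, String> stored = entries.get(key);
            return stored == null ? null : new LinkedHashMap<>(stored);
        }
    }

    /** Hypothetical higher layer, used only by callers that opt in to key deduplication. */
    static final class DedupingStore {
        private final GenericBlock block;
        private final String keyField;

        DedupingStore(GenericBlock block, String keyField) {
            this.block = block;
            this.keyField = keyField;
        }

        /** Write path: drop the key field from the payload; the block key already carries it. */
        void write(Map<String, String> record) {
            String key = record.get(keyField);
            if (key == null) {
                throw new IllegalArgumentException("record has no value for field '" + keyField + "'");
            }
            Map<String, String> stripped = new LinkedHashMap<>(record);
            stripped.remove(keyField);
            block.put(key, stripped);
        }

        /** Read path: materialize the key field again from the block key. */
        Map<String, String> read(String key) {
            Map<String, String> record = block.get(key);
            if (record != null) {
                record.put(keyField, key);
            }
            return record;
        }
    }

    public static void main(String[] args) {
        GenericBlock block = new GenericBlock();
        DedupingStore store = new DedupingStore(block, "key");

        Map<String, String> record = new LinkedHashMap<>();
        record.put("key", "partition-1");
        record.put("type", "2");
        store.write(record);

        System.out.println("stored in block: " + block.get("partition-1"));
        System.out.println("read via store:  " + store.read("partition-1"));
    }
}
```

   Callers that write to GenericBlock directly are unaffected, which is the 
point of keeping the deduplication out of the block layer.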
   
   Previously, the HFile block layer hard-coded keyField = "key", assuming the 
record payload would always have this field. But that is true only for the 
metadata payload. It also did not sit well with the "populate meta fields" 
config, where the key can differ depending on the table user. So, to support 
virtual keys for the metadata table, we had to pull that abstraction out to 
higher layers. Similarly, if we make the HFile block layer assume the 
de-duplication needs, it might restrict future uses of the HFile format by 
other users. 
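   As a rough sketch of why the key field is supplied by the caller, the 
snippet below contrasts a metadata-style record with a user record. The names 
KeyFieldSketch, extractKey and resolveKeyField are hypothetical, and the rule 
inside resolveKeyField (use the `_hoodie_record_key` meta field when meta fields 
are populated, otherwise the table's own key field) is an assumption made for 
illustration, not a description of the actual code path.

```java
import java.util.Map;

final class KeyFieldSketch {

    /** Lower layer: extracts the block key from whichever field the caller names. */
    static String extractKey(Map<String, String> record, String keyField) {
        String key = record.get(keyField);
        if (key == null) {
            throw new IllegalArgumentException("record has no value for field '" + keyField + "'");
        }
        return key;
    }

    /**
     * Hypothetical higher layer: picks the key field per table.
     * Assumption: with meta fields populated the key lives in "_hoodie_record_key";
     * with virtual keys it lives in a field chosen by the table user.
     */
    static String resolveKeyField(boolean populateMetaFields, String tableKeyField) {
        return populateMetaFields ? "_hoodie_record_key" : tableKeyField;
    }

    public static void main(String[] args) {
        Map<String, String> metadataRecord = Map.of("key", "partition-1", "type", "2");
        Map<String, String> userRecord = Map.of("uuid", "u-42", "fare", "17.5");

        // The metadata payload keeps its key in "key"; a user table may use another field.
        System.out.println(extractKey(metadataRecord, resolveKeyField(false, "key")));
        System.out.println(extractKey(userRecord, resolveKeyField(false, "uuid")));
    }
}
```

   With a hard-coded "key" in the lower layer, the second lookup would fail, 
since the user record has no such field.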
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


