usberkeley commented on PR #12404:
URL: https://github.com/apache/hudi/pull/12404#issuecomment-2513670371

   ### Summary
   #### 1. When enable `populateMetaFields`, why the number of record key 
fields must be equal to one?
   the number of record key fields must be equal to one, also known as 
KeyGeneratorType.SIMPLE. The Log Scanner needs to regenerate the Record Key 
from HoodieRecord. 
   
   #### 2. Why not support KeyGeneratorType.COMPLEX in this PR?
   The implementation classes for KeyGenerator are distributed across the 
hudi-client-common and hudi-spark-client packages, while The Log Scanner is in 
the hudi-common package. This makes it impossible to directly introduce the 
KeyGenerator implementation classes into the Log Scanner.
   
   Another approach would be to pass the KeyGenerator implementation class 
object when constructing the Log Scanner. However, this would require modifying 
all modules that use the Log Scanner, which are over 20 instances. Some of 
these instances (such as Hudi Reader, which does not depend on the 
hudi-client-common package) cannot be implemented, and it would also increase 
the complexity of using the Log Scanner.
   
   #### 3. When and how to support KeyGeneratorType.COMPLEX in the future?
   The next PR will support it by consolidating the KeyGenerator implementation 
classes from the hudi-client-common package (Flink) and the hudi-spark-client 
package (Spark) into the hudi-common package. I think that the KeyGenerator 
implementation classes are fundamental and should belong to the hudi-common 
package. This will also allow the Log Scanner to support 
KeyGeneratorType.COMPLEX when meta fields are disabled.
   
   #### 4. Why are `allowOperationMetadataField` allowed only when 
`populateMetaFields` is enabled?
   1) Disabling `populateMetaFields` can reduce the performance overhead of 
decoding HoodieRecords. However, if `allowOperationMetadataField` is enabled, 
decoding performance is still affected even if `populateMetaFields` is 
disabled. Therefore, the impact of these two settings on performance is 
interconnected.
   2) Both are metadata fields. `populateMetaFields` is the main switch, while 
`allowOperationMetadataField` just controls the activation of specific metadata 
fields. When the main switch is off, the sub-switches should have no effect.
   
   #### 5. In #11028, we already fixed the unnecessary rewrite when the schemas 
are exactly the same, is your benchmark based on the fix then?
   Yes
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to