usberkeley commented on PR #12404: URL: https://github.com/apache/hudi/pull/12404#issuecomment-2513670371
### Summary #### 1. When enable `populateMetaFields`, why the number of record key fields must be equal to one? the number of record key fields must be equal to one, also known as KeyGeneratorType.SIMPLE. The Log Scanner needs to regenerate the Record Key from HoodieRecord. #### 2. Why not support KeyGeneratorType.COMPLEX in this PR? The implementation classes for KeyGenerator are distributed across the hudi-client-common and hudi-spark-client packages, while The Log Scanner is in the hudi-common package. This makes it impossible to directly introduce the KeyGenerator implementation classes into the Log Scanner. Another approach would be to pass the KeyGenerator implementation class object when constructing the Log Scanner. However, this would require modifying all modules that use the Log Scanner, which are over 20 instances. Some of these instances (such as Hudi Reader, which does not depend on the hudi-client-common package) cannot be implemented, and it would also increase the complexity of using the Log Scanner. #### 3. When and how to support KeyGeneratorType.COMPLEX in the future? The next PR will support it by consolidating the KeyGenerator implementation classes from the hudi-client-common package (Flink) and the hudi-spark-client package (Spark) into the hudi-common package. I think that the KeyGenerator implementation classes are fundamental and should belong to the hudi-common package. This will also allow the Log Scanner to support KeyGeneratorType.COMPLEX when meta fields are disabled. #### 4. Why are `allowOperationMetadataField` allowed only when `populateMetaFields` is enabled? 1) Disabling `populateMetaFields` can reduce the performance overhead of decoding HoodieRecords. However, if `allowOperationMetadataField` is enabled, decoding performance is still affected even if `populateMetaFields` is disabled. Therefore, the impact of these two settings on performance is interconnected. 2) Both are metadata fields. `populateMetaFields` is the main switch, while `allowOperationMetadataField` just controls the activation of specific metadata fields. When the main switch is off, the sub-switches should have no effect. #### 5. In #11028, we already fixed the unnecessary rewrite when the schemas are exactly the same, is your benchmark based on the fix then? Yes -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
