luoyuxia commented on code in PR #21601: URL: https://github.com/apache/flink/pull/21601#discussion_r1081360275
##########
docs/content.zh/docs/connectors/table/hive/hive_functions.md:
##########
@@ -73,6 +73,31 @@ Some Hive built-in functions in older versions have [thread safety issues](https
 We recommend users patch their own Hive to fix them.
 {{< /hint >}}
+## Use Native Hive Aggregate Functions via HiveModule
+
+For Hive's built-in aggregation function, Flink currently uses sort-based aggregation strategy. Compared to hash-based aggregation strategy, the performance is one to two times worse, so from Flink 1.17, we have implemented some of Hive's aggregation functions natively in Flink.
+These functions will use the hash-agg strategy to improve performance. Currently, only five functions are supported, namely sum/count/avg/min/max, and more aggregation functions will be supported in the future.
+Users can use the native aggregation function by turning on the option `table.exec.hive.native-agg-function.enabled`, which brings significant performance improvement to the job.

Review Comment:
   I think we should also remind users that when `table.exec.hive.native-agg-function.enabled` = `true`, there will be some incompatibility issues, e.g., some data types may not be supported by the native implementation.

##########
docs/content.zh/docs/connectors/table/hive/hive_functions.md:
##########
@@ -73,6 +73,31 @@ Some Hive built-in functions in older versions have [thread safety issues](https
 We recommend users patch their own Hive to fix them.
 {{< /hint >}}
+## Use Native Hive Aggregate Functions via HiveModule

Review Comment:
   Maybe we can add a link for `HiveModule`, like [HiveModule]({{< ref "docs/dev/table/modules" >}}#hivemodule)

##########
docs/content.zh/docs/connectors/table/hive/hive_functions.md:
##########
@@ -73,6 +73,31 @@ Some Hive built-in functions in older versions have [thread safety issues](https
 We recommend users patch their own Hive to fix them.
 {{< /hint >}}
+## Use Native Hive Aggregate Functions via HiveModule
+
+For Hive's built-in aggregation function, Flink currently uses sort-based aggregation strategy. Compared to hash-based aggregation strategy, the performance is one to two times worse, so from Flink 1.17, we have implemented some of Hive's aggregation functions natively in Flink.
+These functions will use the hash-agg strategy to improve performance. Currently, only five functions are supported, namely sum/count/avg/min/max, and more aggregation functions will be supported in the future.
+Users can use the native aggregation function by turning on the option `table.exec.hive.native-agg-function.enabled`, which brings significant performance improvement to the job.
+
+<table class="table table-bordered">
+  <thead>
+    <tr>
+      <th class="text-left" style="width: 20%">Key</th>
+      <th class="text-left" style="width: 15%">Default</th>
+      <th class="text-left" style="width: 10%">Type</th>
+      <th class="text-left" style="width: 55%">Description</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td><h5>table.exec.hive.native-agg-function.enabled</h5></td>
+      <td style="word-wrap: break-word;">false</td>
+      <td>Boolean</td>
+      <td>Enabling native aggregate function for hive dialect to use hash-agg strategy that can improve the aggregation performance. This is a job-level option, user can enable it per-job.</td>

Review Comment:
   This may not be precise. The option applies not only to the Hive dialect, but also whenever HiveModule is loaded.
   ```suggestion
           <td>Enabling to use native aggregate function to use hash-agg strategy which can improve the aggregation performance after loading HiveModule. This is a job-level option, user can enable it per-job.</td>
   ```

##########
docs/content.zh/docs/connectors/table/hive/hive_functions.md:
##########
@@ -73,6 +73,31 @@ Some Hive built-in functions in older versions have [thread safety issues](https
 We recommend users patch their own Hive to fix them.
 {{< /hint >}}
+## Use Native Hive Aggregate Functions via HiveModule

Review Comment:
   If HiveModule is loaded with a higher priority than CoreModule, Flink will try to use the Hive built-in function first. Then, for Hive built-in aggregation functions, Flink will use the sort-based aggregation strategy.
   So I think the title `Use Native Hive Aggregate Functions via HiveModule` is not correct. Maybe it can be `Use Native Hive Aggregate Functions`, with some explanation added.
   Maybe we can put the sentence `If HiveModule is loaded with a higher priority than CoreModule, Flink will try to use the Hive built-in function first. And then for Hive built-in aggregation functio` here.
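
To make the module-priority point in the comments concrete, here is a toy Python model of the behaviour they describe. It is only a sketch, not Flink's resolution code: `resolve_aggregate`, `Resolution`, and `NATIVE_HIVE_AGGS` are hypothetical names, the model assumes both modules provide the requested function, and it assumes that aggregates resolved from the core module take the hash-based path.

```python
from dataclasses import dataclass

# Hypothetical model of the behaviour described in the review comments.
# The five functions the PR text says have native implementations.
NATIVE_HIVE_AGGS = {"sum", "count", "avg", "min", "max"}


@dataclass(frozen=True)
class Resolution:
    module: str    # module that supplied the function
    strategy: str  # "hash" or "sort"


def resolve_aggregate(name, module_order, native_agg_enabled=False):
    """Return the winning module and the aggregation strategy it implies.

    module_order lists loaded modules from highest to lowest priority.
    native_agg_enabled stands in for the option
    table.exec.hive.native-agg-function.enabled.
    """
    if not module_order:
        raise ValueError("at least one module must be loaded")
    first = module_order[0]
    if first == "core":
        # Assumption: core built-in aggregates use hash-based aggregation.
        return Resolution("core", "hash")
    if first == "hive":
        # Hive built-ins fall back to sort-based aggregation unless the
        # option is on and the function is one of the native five.
        if native_agg_enabled and name.lower() in NATIVE_HIVE_AGGS:
            return Resolution("hive", "hash")
        return Resolution("hive", "sort")
    raise ValueError(f"unknown module: {first}")


if __name__ == "__main__":
    print(resolve_aggregate("sum", ["hive", "core"]))
    print(resolve_aggregate("sum", ["hive", "core"], native_agg_enabled=True))
```

In this model the option changes nothing when the core module has priority, which is why the section title and the option description should both mention that HiveModule must be loaded first.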