[GitHub] [flink] luoyuxia commented on a diff in pull request #21601: [FLINK-29720][hive] Supports native avg function for hive dialect

GitBox Thu, 19 Jan 2023 07:10:49 -0800


luoyuxia commented on code in PR #21601:
URL: https://github.com/apache/flink/pull/21601#discussion_r1081360275



##########
docs/content.zh/docs/connectors/table/hive/hive_functions.md:
##########
@@ -73,6 +73,31 @@ Some Hive built-in functions in older versions have [thread safety issues](https
 We recommend users patch their own Hive to fix them.
 {{< /hint >}}
 
+## Use Native Hive Aggregate Functions via HiveModule
+
+For Hive's built-in aggregation function, Flink currently uses sort-based aggregation strategy. Compared to hash-based aggregation strategy, the performance is one to two times worse, so from Flink 1.17, we have implemented some of Hive's aggregation functions natively in Flink.
+These functions will use the hash-agg strategy to improve performance. Currently, only five functions are supported, namely sum/count/avg/min/max, and more aggregation functions will be supported in the future.
+Users can use the native aggregation function by turning on the option `table.exec.hive.native-agg-function.enabled`, which brings significant performance improvement to the job.

Review Comment:
   I think we should also remind users that when `table.exec.hive.native-agg-function.enabled` = `true`, there will be some incompatibility issues, e.g., some data types may not be supported in the native implementation.



##########
docs/content.zh/docs/connectors/table/hive/hive_functions.md:
##########
@@ -73,6 +73,31 @@ Some Hive built-in functions in older versions have [thread safety issues](https
 We recommend users patch their own Hive to fix them.
 {{< /hint >}}
 
+## Use Native Hive Aggregate Functions via HiveModule

Review Comment:
   Maybe we can add a link for `HiveModule`, like
   [HiveModule]({{< ref "docs/dev/table/modules" >}}#hivemodule)
   



##########
docs/content.zh/docs/connectors/table/hive/hive_functions.md:
##########
@@ -73,6 +73,31 @@ Some Hive built-in functions in older versions have [thread safety issues](https
 We recommend users patch their own Hive to fix them.
 {{< /hint >}}
 
+## Use Native Hive Aggregate Functions via HiveModule
+
+For Hive's built-in aggregation function, Flink currently uses sort-based aggregation strategy. Compared to hash-based aggregation strategy, the performance is one to two times worse, so from Flink 1.17, we have implemented some of Hive's aggregation functions natively in Flink.
+These functions will use the hash-agg strategy to improve performance. Currently, only five functions are supported, namely sum/count/avg/min/max, and more aggregation functions will be supported in the future.
+Users can use the native aggregation function by turning on the option `table.exec.hive.native-agg-function.enabled`, which brings significant performance improvement to the job.
+
+<table class="table table-bordered">
+  <thead>
+    <tr>
+        <th class="text-left" style="width: 20%">Key</th>
+        <th class="text-left" style="width: 15%">Default</th>
+        <th class="text-left" style="width: 10%">Type</th>
+        <th class="text-left" style="width: 55%">Description</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+        <td><h5>table.exec.hive.native-agg-function.enabled</h5></td>
+        <td style="word-wrap: break-word;">false</td>
+        <td>Boolean</td>
+        <td>Enabling native aggregate function for hive dialect to use hash-agg strategy that can improve the aggregation performance. This is a job-level option, user can enable it per-job.</td>

Review Comment:
   This may not be precise. The option applies not only to the Hive dialect, but also whenever HiveModule is loaded.
   ```suggestion
           <td>Enabling to use native aggregate function to use hash-agg 
strategy which can improve the aggregation performance after loading 
HiveModule. This is a job-level option, user can enable it per-job.</td>
   ```



##########
docs/content.zh/docs/connectors/table/hive/hive_functions.md:
##########
@@ -73,6 +73,31 @@ Some Hive built-in functions in older versions have [thread safety issues](https
 We recommend users patch their own Hive to fix them.
 {{< /hint >}}
 
+## Use Native Hive Aggregate Functions via HiveModule

Review Comment:
   If HiveModule is loaded with a higher priority than CoreModule, Flink will try to use the Hive built-in function first. For Hive built-in aggregation functions, Flink will then use the sort-based aggregation strategy.
   
   So I think the title `Use Native Hive Aggregate Functions via HiveModule` is not correct.
   Maybe it can be `Use Native Hive Aggregate Functions`, with some explanation added.
   Maybe we can put the sentence
   `
   If HiveModule is loaded with a higher priority than CoreModule, Flink will try to use the Hive built-in function first. And then for Hive built-in aggregation functio
   ` here.
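   
   For illustration, the situation described in this comment might look like the following in the SQL client. This is only a rough sketch and is not taken from the PR: it assumes the Hive connector dependencies are on the classpath, and depending on the setup `LOAD MODULE` may need a `WITH ('hive-version' = ...)` clause.
   
   ```sql
   -- Load HiveModule and list it before the core module, so that
   -- Hive built-in functions are resolved first.
   LOAD MODULE hive;
   USE MODULES hive, core;
   
   -- The option is false by default; turning it on per job opts in to the
   -- native implementations of sum/count/avg/min/max.
   SET 'table.exec.hive.native-agg-function.enabled' = 'true';
   ```
   
   With the option left at its default, the Hive built-in aggregation functions resolved through HiveModule would keep using the sort-based strategy.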
   


