wuchong commented on code in PR #20419:
URL: https://github.com/apache/flink/pull/20419#discussion_r940136586
##########
flink-connectors/flink-connector-hive/src/main/java/org/apache/flink/connectors/hive/HiveOptions.java:
##########
@@ -77,6 +77,19 @@ public class HiveOptions {
                    .withDescription(
                            "The thread number to split hive's partitions to splits. It should be bigger than 0.");
+    public static final ConfigOption<Boolean> TABLE_EXEC_HIVE_DYNAMIC_GROUPING_ENABLED =
+            key("table.exec.hive.dynamic-grouping.enabled")
Review Comment:
`table.exec.hive.sink.sort-by-dynamic-partition.enable`
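For reference, a rough sketch of what the full declaration could look like with the suggested key, following the builder pattern of the surrounding options; the default value and description wording are assumptions for illustration, not code from this PR:

```java
// Sketch only: default value and description are assumptions, not from the PR.
public static final ConfigOption<Boolean> TABLE_EXEC_HIVE_SINK_SORT_BY_DYNAMIC_PARTITION_ENABLE =
        key("table.exec.hive.sink.sort-by-dynamic-partition.enable")
                .booleanType()
                .defaultValue(true)
                .withDescription(
                        "Whether to additionally sort the data by dynamic partition columns before writing into a Hive sink table.");
```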
##########
docs/content/docs/connectors/table/hive/hive_read_write.md:
##########
@@ -472,6 +472,14 @@ This configuration is set in the `TableConfig` and will affect all sinks of the
</tbody>
</table>
+### Configuration for Dynamic Partition Inserting
+By default, if it's for dynamic partition inserting, Flink will sort the data additionally by dynamic partition columns before writing into sink table.
Review Comment:
Add the following words:
That means the sink will receive all elements of one partition and then all elements of another partition. Elements of different partitions will not be mixed. This is helpful for the Hive sink to reduce the number of partition writers and improve writing performance by writing one partition at a time. Otherwise, too many partition writers may cause an OutOfMemory exception.
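For readers following along, a minimal sketch of how a job could toggle this behavior; the option key is the one from this PR (it may still be renamed per the comment above), and the string-keyed `TableConfig#set` is assumed to be available in the targeted Flink version:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class DynamicGroupingToggle {
    public static void main(String[] args) {
        // Dynamic partition grouping applies to bounded (batch) writes.
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inBatchMode());
        // Keep the extra sort (the default) so each writer handles one
        // partition at a time; "false" skips the sort at the risk of many
        // concurrent partition writers.
        tEnv.getConfig().set("table.exec.hive.dynamic-grouping.enabled", "true");
    }
}
```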
##########
docs/content/docs/connectors/table/hive/hive_read_write.md:
##########
@@ -472,6 +472,14 @@ This configuration is set in the `TableConfig` and will affect all sinks of the
</tbody>
</table>
+### Configuration for Dynamic Partition Inserting
Review Comment:
```suggestion
### Dynamic Partition Writing
```
##########
docs/content/docs/connectors/table/hive/hive_read_write.md:
##########
@@ -472,6 +472,14 @@ This configuration is set in the `TableConfig` and will affect all sinks of the
</tbody>
</table>
+### Configuration for Dynamic Partition Inserting
+By default, if it's for dynamic partition inserting, Flink will sort the data additionally by dynamic partition columns before writing into sink table.
+
+To avoid the extra sorting, you can set job configuration `table.exec.hive.dynamic-grouping.enabled` (`true` by default) to `false`.
+But with such configuration, it'll throw OOM exception if there are too may dynamic partitions.
+
Review Comment:
Add some hints about how to tune dynamic partition writing. For example, add `DISTRIBUTED BY <partition_fields>` for hash shuffling when the data is not skewed. You can also manually add `SORTED BY <partition_fields>` to achieve the same effect as `table.exec.hive.dynamic-grouping.enabled=true`.
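As a concrete illustration of the first hint, a hypothetical sketch; the table names and schema are invented, and Hive's query syntax spells the clause `DISTRIBUTE BY`, so the exact spelling should be checked against the dialect docs:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.SqlDialect;
import org.apache.flink.table.api.TableEnvironment;

public class DynamicPartitionShuffle {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inBatchMode());
        tEnv.getConfig().setSqlDialect(SqlDialect.HIVE);
        // Hash-shuffle rows by the dynamic partition column `dt` so each
        // writer receives only a few partitions (best when `dt` is not skewed).
        tEnv.executeSql(
                "INSERT OVERWRITE TABLE logs_sink PARTITION (dt) "
                        + "SELECT id, payload, dt FROM logs_source DISTRIBUTE BY dt");
    }
}
```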
##########
docs/content/docs/connectors/table/hive/hive_read_write.md:
##########
@@ -472,6 +472,14 @@ This configuration is set in the `TableConfig` and will affect all sinks of the
</tbody>
</table>
+### Configuration for Dynamic Partition Inserting
Review Comment:
Introduce what dynamic partition writing is at the beginning.
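To sketch what such an introduction might convey, a hypothetical contrast between static and dynamic partition writing; the table names and columns are invented for illustration:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.SqlDialect;
import org.apache.flink.table.api.TableEnvironment;

public class PartitionWritingContrast {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inBatchMode());
        tEnv.getConfig().setSqlDialect(SqlDialect.HIVE);
        // Static partition writing: the partition value is fixed in the
        // statement, so exactly one partition is written.
        tEnv.executeSql(
                "INSERT OVERWRITE TABLE logs_sink PARTITION (dt='2022-08-03') "
                        + "SELECT id, payload FROM logs_source WHERE dt='2022-08-03'");
        // Dynamic partition writing: the partition value comes from the data
        // itself, so a single statement may fan out into many partitions.
        tEnv.executeSql(
                "INSERT OVERWRITE TABLE logs_sink PARTITION (dt) "
                        + "SELECT id, payload, dt FROM logs_source");
    }
}
```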
##########
docs/content/docs/connectors/table/hive/hive_read_write.md:
##########
@@ -472,6 +472,14 @@ This configuration is set in the `TableConfig` and will affect all sinks of the
</tbody>
</table>
+### Configuration for Dynamic Partition Inserting
+By default, if it's for dynamic partition inserting, Flink will sort the data additionally by dynamic partition columns before writing into sink table.
+
+To avoid the extra sorting, you can set job configuration `table.exec.hive.dynamic-grouping.enabled` (`true` by default) to `false`.
+But with such configuration, it'll throw OOM exception if there are too may dynamic partitions.
Review Comment:
```suggestion
But with such a configuration, it may throw an OutOfMemory exception if there are too many dynamic partitions.
```