godfreyhe commented on code in PR #20377:
URL: https://github.com/apache/flink/pull/20377#discussion_r939777764
##########
docs/content/docs/connectors/table/hive/hive_read_write.md:
##########
@@ -166,6 +166,39 @@ following parameters in `TableConfig` (note that these parameters affect all sou
</tbody>
</table>
+### Tuning Split Size While Reading Hive Table
+While reading Hive table, the data files will be enumerated into splits, one of which is a portion of data consumed by the source.
+Splits are granularity by which the source distributes the work and parallelize the data reading.
+Users can to do some performance tuning by tuning the split's size with the follow configurations.
Review Comment:
can do
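It may also be worth giving readers of this section a short example. A minimal sketch of setting these options through `TableConfig` (the values here are illustrative; only the two keys come from this PR):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class HiveSplitTuning {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // Lower the maximum split size to get more, smaller splits
        // (finer-grained work distribution across source readers).
        tEnv.getConfig().getConfiguration()
                .setString("table.exec.hive.split-max-size", "64mb");

        // Raise the assumed per-file open cost so that many small files
        // get packed into fewer splits.
        tEnv.getConfig().getConfiguration()
                .setString("table.exec.hive.file-open-cost", "8mb");
    }
}
```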
##########
flink-connectors/flink-connector-hive/src/test/java/org/apache/flink/connectors/hive/PartitionMonitorTest.java:
##########
@@ -99,6 +99,10 @@ private void commitPartitionWithGivenCreateTime(
    private void preparePartitionMonitor() {
        List<List<String>> seenPartitionsSinceOffset = new ArrayList<>();
        JobConf jobConf = new JobConf();
Review Comment:
do we have any more tests to verify the changes, e.g. that the split number will change when `table.exec.hive.file-open-cost` and `table.exec.hive.split-max-size` change?
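Something along these lines could pin that behavior down (a sketch only: `createSplits` is a hypothetical helper standing in for whatever enumerator entry point this PR exposes, and the assertions assume the usual monotonic effect of the two options):

```java
import java.util.List;

import org.apache.hadoop.mapred.JobConf;
import org.junit.Test;

import static org.assertj.core.api.Assertions.assertThat;

public class HiveSplitTuningTest {

    // Hypothetical helper: wire this to the real enumerator that turns
    // the test table's files into splits under the given JobConf.
    private List<?> createSplits(JobConf jobConf) {
        throw new UnsupportedOperationException("bind to the enumerator under test");
    }

    @Test
    public void testSplitNumberFollowsTuningOptions() {
        int defaultSplits = createSplits(new JobConf()).size();

        // A smaller max split size should produce at least as many splits.
        JobConf smallMax = new JobConf();
        smallMax.set("table.exec.hive.split-max-size", "32mb");
        assertThat(createSplits(smallMax).size()).isGreaterThanOrEqualTo(defaultSplits);

        // A larger file open cost should pack small files into fewer splits.
        JobConf bigOpenCost = new JobConf();
        bigOpenCost.set("table.exec.hive.file-open-cost", "64mb");
        assertThat(createSplits(bigOpenCost).size()).isLessThanOrEqualTo(defaultSplits);
    }
}
```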
##########
flink-connectors/flink-connector-hive/src/main/java/org/apache/flink/connectors/hive/HiveOptions.java:
##########
@@ -77,6 +78,22 @@ public class HiveOptions {
                    .withDescription(
                            "The thread number to split hive's partitions to splits. It should be bigger than 0.");
+    public static final ConfigOption<MemorySize> TABLE_EXEC_HIVE_SPLIT_MAX_BYTES =
+            key("table.exec.hive.split-max-size")
+                    .memoryType()
+                    .defaultValue(MemorySize.parse("128mb"))
+                    .withDescription(
+                            "The maximum number of bytes (default is 128MB) to pack into a split while reading Hive table. A split will be assigned to a reader.");
+
+    public static final ConfigOption<MemorySize> TABLE_EXEC_HIVE_FILE_OPEN_COST =
+            key("table.exec.hive.file-open-cost")
+                    .memoryType()
+                    .defaultValue(MemorySize.parse("4mb"))
+                    .withDescription(
+                            "The estimated cost (default is 4MB) to open a file. Used to split Hive's files to splits."
+                                    + " When the value is over estimated, Flink wll tend to pack Hive's data into less splits, which will help when Hive's table contains many some files."
Review Comment:
The comment should be updated: `which will be helpful when Hive's table contains many small files` ?
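For context on why an overestimated open cost reduces the split count, a back-of-the-envelope sketch (this models the common weighting heuristic where each file is charged its size plus the open cost; the PR's actual algorithm may differ):

```java
import org.apache.flink.configuration.MemorySize;

public class SplitSizeSketch {

    // Assumed heuristic: a file's contribution to a split is its size
    // plus the estimated open cost, so small files "fill" a split faster.
    static long weightedSize(long fileSizeBytes, MemorySize openCost) {
        return fileSizeBytes + openCost.getBytes();
    }

    public static void main(String[] args) {
        MemorySize openCost = MemorySize.parse("4mb");
        MemorySize maxSplitSize = MemorySize.parse("128mb");
        long smallFile = MemorySize.parse("1mb").getBytes();

        // With a 4MB open cost, roughly 25 one-megabyte files fill a
        // 128MB split, instead of 128 files with no open cost at all.
        long filesPerSplit = maxSplitSize.getBytes() / weightedSize(smallFile, openCost);
        System.out.println("small files per split: " + filesPerSplit);
    }
}
```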
##########
docs/content/docs/connectors/table/hive/hive_read_write.md:
##########
@@ -166,6 +166,39 @@ following parameters in `TableConfig` (note that these parameters affect all sou
</tbody>
</table>
+### Tuning Split Size While Reading Hive Table
+While reading Hive table, the data files will be enumerated into splits, one of which is a portion of data consumed by the source.
+Splits are granularity by which the source distributes the work and parallelize the data reading.
+Users can to do some performance tuning by tuning the split's size with the follow configurations.
+
+<table class="table table-bordered">
+  <thead>
+    <tr>
+        <th class="text-left" style="width: 20%">Key</th>
+        <th class="text-left" style="width: 15%">Default</th>
+        <th class="text-left" style="width: 10%">Type</th>
+        <th class="text-left" style="width: 55%">Description</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+        <td><h5>table.exec.hive.split-max-size</h5></td>
+        <td style="word-wrap: break-word;">128mb</td>
+        <td>MemorySize</td>
+        <td>The maximum number of bytes (default is 128MB) to pack into a split while reading Hive table.</td>
+    </tr>
+    <tr>
+        <td><h5>table.exec.hive.file-open-cost</h5></td>
+        <td style="word-wrap: break-word;">4mb</td>
+        <td>MemorySize</td>
+        <td>The estimated cost (default is 4MB) to open a file. Used to enumerate Hive's files to splits.
+        If the value is over estimated, Flink wll tend to pack Hive's data into less splits, which will help when Hive's table contains many some files.
Review Comment:
overestimated
wll -> will
which will help -> which will be helpful
some -> small