godfreyhe commented on code in PR #20377:
URL: https://github.com/apache/flink/pull/20377#discussion_r939777764
##########
docs/content/docs/connectors/table/hive/hive_read_write.md:
##########
@@ -166,6 +166,39 @@ following parameters in `TableConfig` (note that these parameters affect all sou
</tbody>
</table>
+### Tuning Split Size While Reading Hive Table
+While reading Hive table, the data files will be enumerated into splits, one of which is a portion of data consumed by the source.
+Splits are granularity by which the source distributes the work and parallelize the data reading.
+Users can to do some performance tuning by tuning the split's size with the follow configurations.
Review Comment:
can do
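It may also be worth giving readers of this section a short example. A minimal sketch of setting these options through `TableConfig` (the values here are illustrative; only the two keys come from this PR):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class HiveSplitTuning {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // Lower the maximum split size to get more, smaller splits
        // (finer-grained work distribution across source readers).
        tEnv.getConfig().getConfiguration()
                .setString("table.exec.hive.split-max-size", "64mb");

        // Raise the assumed per-file open cost so that many small files
        // get packed into fewer splits.
        tEnv.getConfig().getConfiguration()
                .setString("table.exec.hive.file-open-cost", "8mb");
    }
}
```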
##########
flink-connectors/flink-connector-hive/src/test/java/org/apache/flink/connectors/hive/PartitionMonitorTest.java:
##########
@@ -99,6 +99,10 @@ private void commitPartitionWithGivenCreateTime(
    private void preparePartitionMonitor() {
        List<List<String>> seenPartitionsSinceOffset = new ArrayList<>();
        JobConf jobConf = new JobConf();
Review Comment:
do we have any more tests to verify the changes, e.g. that the split number will change when `table.exec.hive.file-open-cost` and `table.exec.hive.split-max-size` change?
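Something along these lines could pin that behavior down (a sketch only: `createSplits` is a hypothetical helper standing in for whatever enumerator entry point this PR exposes, and the assertions assume the usual monotonic effect of the two options):

```java
import java.util.List;

import org.apache.hadoop.mapred.JobConf;
import org.junit.Test;

import static org.assertj.core.api.Assertions.assertThat;

public class HiveSplitTuningTest {

    // Hypothetical helper: wire this to the real enumerator that turns
    // the test table's files into splits under the given JobConf.
    private List<?> createSplits(JobConf jobConf) {
        throw new UnsupportedOperationException("bind to the enumerator under test");
    }

    @Test
    public void testSplitNumberFollowsTuningOptions() {
        int defaultSplits = createSplits(new JobConf()).size();

        // A smaller max split size should produce at least as many splits.
        JobConf smallMax = new JobConf();
        smallMax.set("table.exec.hive.split-max-size", "32mb");
        assertThat(createSplits(smallMax).size()).isGreaterThanOrEqualTo(defaultSplits);

        // A larger file open cost should pack small files into fewer splits.
        JobConf bigOpenCost = new JobConf();
        bigOpenCost.set("table.exec.hive.file-open-cost", "64mb");
        assertThat(createSplits(bigOpenCost).size()).isLessThanOrEqualTo(defaultSplits);
    }
}
```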
##########
flink-connectors/flink-connector-hive/src/main/java/org/apache/flink/connectors/hive/HiveOptions.java:
##########
@@ -77,6 +78,22 @@ public class HiveOptions {
                    .withDescription(
                            "The thread number to split hive's partitions to splits. It should be bigger than 0.");
+    public static final ConfigOption<MemorySize> TABLE_EXEC_HIVE_SPLIT_MAX_BYTES =
+            key("table.exec.hive.split-max-size")
+                    .memoryType()
+                    .defaultValue(MemorySize.parse("128mb"))
+                    .withDescription(
+                            "The maximum number of bytes (default is 128MB) to pack into a split while reading Hive table. A split will be assigned to a reader.");
+
+    public static final ConfigOption<MemorySize> TABLE_EXEC_HIVE_FILE_OPEN_COST =
+            key("table.exec.hive.file-open-cost")
+                    .memoryType()
+                    .defaultValue(MemorySize.parse("4mb"))
+                    .withDescription(
+                            "The estimated cost (default is 4MB) to open a file. Used to split Hive's files to splits."
+                                    + " When the value is over estimated, Flink wll tend to pack Hive's data into less splits, which will help when Hive's table contains many some files."
Review Comment:
The comment should be updated: `which will be helpful when Hive's table contains many small files` ?
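For context on why an overestimated open cost reduces the split count, a back-of-the-envelope sketch (this models the common weighting heuristic where each file is charged its size plus the open cost; the PR's actual algorithm may differ):

```java
import org.apache.flink.configuration.MemorySize;

public class SplitSizeSketch {

    // Assumed heuristic: a file's contribution to a split is its size
    // plus the estimated open cost, so small files "fill" a split faster.
    static long weightedSize(long fileSizeBytes, MemorySize openCost) {
        return fileSizeBytes + openCost.getBytes();
    }

    public static void main(String[] args) {
        MemorySize openCost = MemorySize.parse("4mb");
        MemorySize maxSplitSize = MemorySize.parse("128mb");
        long smallFile = MemorySize.parse("1mb").getBytes();

        // With a 4MB open cost, roughly 25 one-megabyte files fill a
        // 128MB split, instead of 128 files with no open cost at all.
        long filesPerSplit = maxSplitSize.getBytes() / weightedSize(smallFile, openCost);
        System.out.println("small files per split: " + filesPerSplit);
    }
}
```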
##########
docs/content/docs/connectors/table/hive/hive_read_write.md:
##########
@@ -166,6 +166,39 @@ following parameters in `TableConfig` (note that these parameters affect all sou
</tbody>
</table>
+### Tuning Split Size While Reading Hive Table
+While reading Hive table, the data files will be enumerated into splits, one of which is a portion of data consumed by the source.
+Splits are granularity by which the source distributes the work and parallelize the data reading.
+Users can to do some performance tuning by tuning the split's size with the follow configurations.
+
+<table class="table table-bordered">
+  <thead>
+    <tr>
+        <th class="text-left" style="width: 20%">Key</th>
+        <th class="text-left" style="width: 15%">Default</th>
+        <th class="text-left" style="width: 10%">Type</th>
+        <th class="text-left" style="width: 55%">Description</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+        <td><h5>table.exec.hive.split-max-size</h5></td>
+        <td style="word-wrap: break-word;">128mb</td>
+        <td>MemorySize</td>
+        <td>The maximum number of bytes (default is 128MB) to pack into a split while reading Hive table.</td>
+    </tr>
+    <tr>
+        <td><h5>table.exec.hive.file-open-cost</h5></td>
+        <td style="word-wrap: break-word;">4mb</td>
+        <td>MemorySize</td>
+        <td>The estimated cost (default is 4MB) to open a file. Used to enumerate Hive's files to splits.
+        If the value is over estimated, Flink wll tend to pack Hive's data into less splits, which will help when Hive's table contains many some files.
Review Comment:
overestimated
wll -> will
which will help -> which will be helpful
some -> small