[GitHub] [flink] swuferhong commented on a diff in pull request #20084: [FLINK-27988][table-planner] Let HiveTableSource extend from SupportStatisticsReport

GitBox Mon, 01 Aug 2022 04:41:03 -0700


swuferhong commented on code in PR #20084:
URL: https://github.com/apache/flink/pull/20084#discussion_r934433419



##########
flink-connectors/flink-connector-hive/src/main/java/org/apache/flink/connectors/hive/HiveTableSource.java:
##########
@@ -261,6 +273,99 @@ public DynamicTableSource copy() {
         return source;
     }
 
+    @Override
+    public TableStats reportStatistics() {
+        try {
+            // only support BOUNDED source
+            if (isStreamingSource()) {
+                return TableStats.UNKNOWN;
+            }
+            if 
(flinkConf.get(FileSystemConnectorOptions.SOURCE_REPORT_STATISTICS)
+                    != FileSystemConnectorOptions.FileStatisticsType.ALL) {
+                return TableStats.UNKNOWN;
+            }
+
+            HiveSourceBuilder sourceBuilder =
+                    new HiveSourceBuilder(jobConf, flinkConf, tablePath, 
hiveVersion, catalogTable)
+                            .setProjectedFields(projectedFields)
+                            .setLimit(limit);
+            int threadNum =
+                    
flinkConf.get(HiveOptions.TABLE_EXEC_HIVE_LOAD_PARTITION_SPLITS_THREAD_NUM);
+            List<HiveTablePartition> hivePartitionsToRead =
+                    getAllPartitions(
+                            jobConf,
+                            hiveVersion,
+                            tablePath,
+                            catalogTable.getPartitionKeys(),
+                            remainingPartitions);
+            BulkFormat<RowData, HiveSourceSplit> defaultBulkFormat =
+                    sourceBuilder.createDefaultBulkFormat();
+            List<HiveSourceSplit> inputSplits =
+                    HiveSourceFileEnumerator.createInputSplits(
+                            0, hivePartitionsToRead, threadNum, jobConf);
+            if (inputSplits.size() != 0) {
+                TableStats tableStats;
+                if (defaultBulkFormat instanceof 
FileBasedStatisticsReportableInputFormat) {
+                    // If HiveInputFormat's variable useMapRedReader is false, 
Flink will use hadoop
+                    // mapRed record to read data.
+                    tableStats =
+                            ((FileBasedStatisticsReportableInputFormat) 
defaultBulkFormat)
+                                    .reportStatistics(
+                                            inputSplits.stream()
+                                                    .map(FileSourceSplit::path)
+                                                    
.collect(Collectors.toList()),
+                                            
catalogTable.getSchema().toRowDataType());
+                } else {
+                    // If HiveInputFormat's variable useMapRedReader is true, 
Hive using MapRed
+                    // InputFormat to read data.
+                    tableStats =
+                            getMapRedInputFormatStatistics(
+                                    inputSplits, 
catalogTable.getSchema().toRowDataType());
+                }
+                if (limit == null) {

Review Comment:
   > Since we has built the HiveSourceBuilder with limit, is `limit` also 
needed here ?
   
   Yes, `HiveSourceBuilder.setLimit()` is only used for choosing 
`LimitableBulkFormat` or  `LimitableBulkFormat`.  We will read all files even 
if `limit != null`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [flink] swuferhong commented on a diff in pull request #20084: [FLINK-27988][table-planner] Let HiveTableSource extend from SupportStatisticsReport

Reply via email to