bvaradar commented on a change in pull request #1122: [HUDI-29]: Support hudi
COW table to use *ANALYZE TABLE table_name COMPUTE STATISTICS* to get table
current rows
URL: https://github.com/apache/incubator-hudi/pull/1122#discussion_r363992460
##########
File path:
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieParquetInputFormat.java
##########
@@ -196,7 +199,58 @@ public Configuration getConf() {
// ParquetInputFormat.setFilterPredicate(job, predicate);
// clearOutExistingPredicate(job);
// }
- return super.getRecordReader(split, job, reporter);
+
+ final Path finalPath = ((FileSplit) split).getPath();
+ FileSystem fileSystem = finalPath.getFileSystem(conf);
+ FileStatus curFileStatus = fileSystem.getFileStatus(finalPath);
+
+ HoodieTableMetaClient metadata;
+ try {
+ metadata = getTableMetaClient(finalPath.getFileSystem(conf),
+ curFileStatus.getPath().getParent());
+ } catch (DatasetNotFoundException | InvalidDatasetException e) {
+ LOG.info("Handling a non-hoodie path " + curFileStatus.getPath());
+ return super.getRecordReader(split, job, reporter);
+ }
+
+ if (LOG.isDebugEnabled()) {
+ LOG.debug("Hoodie Metadata initialized with completed commit Ts as :" + metadata);
+ }
+ String tableName = metadata.getTableConfig().getTableName();
+ String mode = HoodieHiveUtil.readMode(Job.getInstance(job), tableName);
+
+ if (HoodieHiveUtil.INCREMENTAL_SCAN_MODE.equals(mode)) {
+ return super.getRecordReader(split, job, reporter);
+ } else {
+ List<String> partitions = FSUtils.getAllFoldersWithPartitionMetaFile(metadata.getFs(), metadata.getBasePath());
Review comment:
@cdmikechen : For each input file split, this lists every partition in the
table. At a minimum, we should list only the partition that contains the input
split. You can derive the relative partition path from (1) the basePath and
(2) the full path of the input split, and then use FileSystemView. Even then,
this would be slow and resource intensive. The better solution would be to use
consolidated metadata, but that is not available yet. Is it possible to enable
this new code path only for COMPUTE STATISTICS and let regular select queries
go through the original path?
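
To illustrate the relative-partition-path suggestion, here is a minimal sketch
in plain Java. It is not the PR's code and does not call any Hudi API: the
class name PartitionPathSketch and the method relativePartitionPath are
hypothetical, and it assumes that both paths are given as strings with the
same scheme and authority. Hudi may already provide an equivalent helper.

```java
// Hypothetical helper, for illustration only; not part of Hudi.
class PartitionPathSketch {

    /**
     * Returns the partition path of a data file relative to the table base path.
     * Assumes both arguments use the same scheme/authority and '/' separators.
     */
    static String relativePartitionPath(String basePath, String splitFilePath) {
        // Normalize the base path so "/tbl/" and "/tbl" behave the same.
        String base = basePath.endsWith("/")
            ? basePath.substring(0, basePath.length() - 1)
            : basePath;

        // The partition directory is the parent of the split's file.
        int lastSlash = splitFilePath.lastIndexOf('/');
        String parent = lastSlash < 0 ? "" : splitFilePath.substring(0, lastSlash);

        // A non-partitioned table keeps data files directly under the base path.
        if (parent.equals(base)) {
            return "";
        }
        if (!parent.startsWith(base + "/")) {
            throw new IllegalArgumentException(
                splitFilePath + " is not under base path " + basePath);
        }
        // Drop the base path and the separator that follows it.
        return parent.substring(base.length() + 1);
    }

    public static void main(String[] args) {
        // Prints "2019/12/31"
        System.out.println(relativePartitionPath(
            "hdfs://nn/warehouse/trips",
            "hdfs://nn/warehouse/trips/2019/12/31/part-0001.parquet"));
    }
}
```

With the relative path in hand, the reader would only need to list that one
partition rather than every folder returned by
getAllFoldersWithPartitionMetaFile.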