umehrot2 commented on a change in pull request #2417:
URL: https://github.com/apache/hudi/pull/2417#discussion_r554242390
##########
File path:
hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
##########
@@ -49,12 +60,48 @@ public FileSystemBackedTableMetadata(SerializableConfiguration conf, String data
@Override
public List<String> getAllPartitionPaths() throws IOException {
- FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
if (assumeDatePartitioning) {
+ FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get());
return FSUtils.getAllPartitionFoldersThreeLevelsDown(fs, datasetBasePath);
- } else {
- return FSUtils.getAllFoldersWithPartitionMetaFile(fs, datasetBasePath);
}
+
+ List<Path> pathsToList = new LinkedList<>();
+ pathsToList.add(new Path(datasetBasePath));
+ List<String> partitionPaths = new ArrayList<>();
+
+ // TODO: Get the parallelism from HoodieWriteConfig
+ final int fileListingParallelism = 1500;
Review comment:
Yes, 1500 is only the upper bound. The one downside of the current code is
that parallelism cannot be raised above 1500. These classes already take too
many parameters, and adding another one here would mean changing every
consumer (and their consumers), which touches more and more files.
I think it would be better to handle this in a separate PR: get rid of the
individual parameters and have the consumers pass in HoodieMetadataConfig
(which has all these params).
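
To make the two points above concrete, here is a rough sketch, not the code in this PR. It assumes the hard-coded value is used only as a cap on the number of paths being listed, and it uses a hypothetical `ListingConfig` class as a stand-in for a config object such as HoodieMetadataConfig. None of the names below (`ListingParallelismSketch`, `ListingConfig`, `effectiveParallelism`) are Hudi APIs.

```java
// Hypothetical sketch; these names are illustrative and not part of Hudi.
final class ListingParallelismSketch {

  // Stand-in for a config object (e.g. HoodieMetadataConfig) that bundles
  // the individual parameters consumers currently pass one by one.
  static final class ListingConfig {
    private final boolean assumeDatePartitioning;
    private final int maxFileListingParallelism;

    ListingConfig(boolean assumeDatePartitioning, int maxFileListingParallelism) {
      this.assumeDatePartitioning = assumeDatePartitioning;
      this.maxFileListingParallelism = maxFileListingParallelism;
    }

    boolean shouldAssumeDatePartitioning() {
      return assumeDatePartitioning;
    }

    int getMaxFileListingParallelism() {
      return maxFileListingParallelism;
    }
  }

  // The configured value is only an upper bound: the parallelism actually
  // used never exceeds the number of paths left to list, and is at least 1.
  static int effectiveParallelism(ListingConfig config, int numPathsToList) {
    int capped = Math.min(config.getMaxFileListingParallelism(), numPathsToList);
    return Math.max(1, capped);
  }

  public static void main(String[] args) {
    ListingConfig config = new ListingConfig(false, 1500);
    System.out.println(effectiveParallelism(config, 40));
    System.out.println(effectiveParallelism(config, 5000));
  }
}
```

With a single config object in the constructor, adding or tuning a setting like the parallelism cap would not change the signatures of the consumers, which is the motivation for doing that refactoring separately.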