prashantwason commented on a change in pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#discussion_r744072738
##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##########
@@ -419,52 +394,53 @@ private boolean bootstrapFromFilesystem(HoodieEngineContext engineContext, Hoodi
    * @param dataMetaClient
    * @return Map of partition names to a list of FileStatus for all the files in the partition
    */
-  private Map<String, List<FileStatus>> getPartitionsToFilesMapping(HoodieTableMetaClient dataMetaClient) {
+  private List<DirectoryInfo> listAllPartitions(HoodieTableMetaClient datasetMetaClient) {
     List<Path> pathsToList = new LinkedList<>();
     pathsToList.add(new Path(dataWriteConfig.getBasePath()));
-    Map<String, List<FileStatus>> partitionToFileStatus = new HashMap<>();
+    List<DirectoryInfo> foundPartitionsList = new LinkedList<>();
     final int fileListingParallelism = metadataWriteConfig.getFileListingParallelism();
     SerializableConfiguration conf = new SerializableConfiguration(dataMetaClient.getHadoopConf());
     final String dirFilterRegex = dataWriteConfig.getMetadataConfig().getDirectoryFilterRegex();
+    final String datasetBasePath = dataMetaClient.getBasePath();
     while (!pathsToList.isEmpty()) {
-      int listingParallelism = Math.min(fileListingParallelism, pathsToList.size());
+      // In each round we will list a section of directories
+      int numDirsToList = Math.min(fileListingParallelism, pathsToList.size());
       // List all directories in parallel
-      List<Pair<Path, FileStatus[]>> dirToFileListing = engineContext.map(pathsToList, path -> {
+      List<DirectoryInfo> foundDirsList = engineContext.map(pathsToList.subList(0, numDirsToList), path -> {
         FileSystem fs = path.getFileSystem(conf.get());
-        return Pair.of(path, fs.listStatus(path));
-      }, listingParallelism);
-      pathsToList.clear();
+        String relativeDirPath = FSUtils.getRelativePartitionPath(new Path(datasetBasePath), path);
+        return new DirectoryInfo(relativeDirPath, fs.listStatus(path));
+      }, numDirsToList);
+
+      pathsToList = new LinkedList<>(pathsToList.subList(numDirsToList, pathsToList.size()));
       // If the listing reveals a directory, add it to queue. If the listing reveals a hoodie partition, add it to
       // the results.
-      dirToFileListing.forEach(p -> {
-        if (!dirFilterRegex.isEmpty() && p.getLeft().getName().matches(dirFilterRegex)) {
-          LOG.info("Ignoring directory " + p.getLeft() + " which matches the filter regex " + dirFilterRegex);
-          return;
+      for (DirectoryInfo dirInfo : foundDirsList) {
+        if (!dirFilterRegex.isEmpty()) {
+          final String relativePath = dirInfo.getRelativePath();
+          if (!relativePath.isEmpty()) {
+            Path partitionPath = new Path(datasetBasePath, relativePath);
+            if (partitionPath.getName().matches(dirFilterRegex)) {
+              LOG.info("Ignoring directory " + partitionPath + " which matches the filter regex " + dirFilterRegex);
+              continue;
+            }
+          }
         }
-        List<FileStatus> filesInDir = Arrays.stream(p.getRight()).parallel()
-            .filter(fs -> !fs.getPath().getName().equals(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE))
-            .collect(Collectors.toList());
-
-        if (p.getRight().length > filesInDir.size()) {
-          String partitionName = FSUtils.getRelativePartitionPath(new Path(dataMetaClient.getBasePath()), p.getLeft());
-          // deal with Non-partition table, we should exclude .hoodie
-          partitionToFileStatus.put(partitionName, filesInDir.stream()
-              .filter(f -> !f.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)).collect(Collectors.toList()));
+        if (dirInfo.isPartition()) {
+          // Add to result
+          foundPartitionsList.add(dirInfo);
         } else {
           // Add sub-dirs to the queue
-          pathsToList.addAll(Arrays.stream(p.getRight())
-              .filter(fs -> fs.isDirectory() && !fs.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME))
-              .map(fs -> fs.getPath())
-              .collect(Collectors.toList()));
+          pathsToList.addAll(dirInfo.getSubdirs());
Review comment:
The DirectoryInfo constructor parses the FileStatus[] array and builds:
1. A list of sub-directories
2. A flag for whether the directory is a partition (based on the presence of the partition meta file)

So in the code above, dirInfo.getSubdirs() returns only the sub-directories.

The DirectoryInfo constructor was not ignoring the .hoodie directory, and I will add code for that. Without it, .hoodie and its sub-dirs will still be listed (sub-optimal), but none of them will be detected as partitions since they lack partition meta files. I will update the code.
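To make the intended constructor behavior concrete, here is a hypothetical, simplified sketch (not the actual Hudi class): plain strings stand in for Hadoop's FileStatus[], with a trailing "/" marking a directory in this toy model. It shows a single pass over a directory listing that collects sub-directories (skipping the .hoodie meta folder, as proposed above) and detects a partition by the presence of the partition meta file.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, illustrative stand-in for Hudi's DirectoryInfo.
// Entry names ending in "/" are treated as directories in this sketch.
class DirectoryInfo {
  private final String relativePath;
  private final List<String> subdirs = new ArrayList<>();
  private boolean isPartition = false;

  DirectoryInfo(String relativePath, String[] entries) {
    this.relativePath = relativePath;
    for (String entry : entries) {
      if (entry.endsWith("/")) {
        String name = entry.substring(0, entry.length() - 1);
        // Per the comment above: never queue the .hoodie meta folder
        if (!name.equals(".hoodie")) {
          subdirs.add(name);
        }
      } else if (entry.equals(".hoodie_partition_metadata")) {
        // Presence of the partition meta file marks this dir as a partition
        isPartition = true;
      }
    }
  }

  String getRelativePath() { return relativePath; }
  List<String> getSubdirs() { return subdirs; }
  boolean isPartition() { return isPartition; }
}
```

With this shape, the caller's loop only ever sees pre-classified directories: partitions go straight to the result list, and getSubdirs() feeds the next listing round.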
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]