Zouxxyy commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1521170343
> Can you elaborate a little more what the gains we get here?
(1) `getFilesToReadOfInstant` traverses the metadata list to get the files
in each metadata, checks whether each file exists through `fs.exists`, and
then adds it to `uniqueIdToFileStatus`.
A pair from a later metadata overwrites an earlier pair with the same key
when it is added to `uniqueIdToFileStatus`, so we can traverse the metadata
list in reverse order and skip the keys that have already appeared. This may
reduce the number of `fs.exists` calls.
(2) `getRawWritePathsOfInstants` does not check whether the files exist, but
the caller still needs to check in the subsequent processing, like this:
```java
FileStatus[] files = WriteProfiles.getRawWritePathsOfInstants(
    path, hadoopConf, metadataList, metaClient.getTableType());
FileSystem fs = FSUtils.getFs(path.toString(), hadoopConf);
if (Arrays.stream(files).anyMatch(fileStatus ->
    !StreamerUtil.fileExists(fs, fileStatus.getPath()))) {
  LOG.warn("Found deleted files in metadata, fall back to full table scan.");
  // fallback to full table scan
  // reading from the earliest, scans the partitions and files directly
  ...
} else {
  fileStatuses = files;
}
```
Therefore, we can do the check up front, so I added a param
`tolerateNonExist` to combine `getRawWritePathsOfInstants` and
`getFilesToReadOfInstant` into one function called `getExistFileFromMetadata`.
When `tolerateNonExist` is set to false and a file does not exist, it
immediately returns null.
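
To make the two points concrete, here is a minimal, self-contained sketch of the idea, not the actual code in this PR. It assumes each commit's metadata can be modeled as a plain map from file ID to path (ordered oldest to newest in the list) and replaces `fs.exists` with a hypothetical `Predicate<String>`; the class name `ExistFileSketch` and the parameter shapes are illustrative only. In this sketch a key counts as "already appeared" only once an existing file has been recorded for it, so an older version can still be picked up when the newest one is missing.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

class ExistFileSketch {

    // Hypothetical stand-in for getExistFileFromMetadata:
    // metadataList is ordered oldest -> newest, each entry maps file ID -> path,
    // and `exists` plays the role of fs.exists.
    static Map<String, String> getExistFileFromMetadata(
            List<Map<String, String>> metadataList,
            Predicate<String> exists,
            boolean tolerateNonExist) {
        Map<String, String> uniqueIdToPath = new HashMap<>();
        // Walk from the newest metadata to the oldest.
        for (int i = metadataList.size() - 1; i >= 0; i--) {
            for (Map.Entry<String, String> entry : metadataList.get(i).entrySet()) {
                if (uniqueIdToPath.containsKey(entry.getKey())) {
                    // A newer metadata already supplied this key: skip the existence check.
                    continue;
                }
                if (exists.test(entry.getValue())) {
                    uniqueIdToPath.put(entry.getKey(), entry.getValue());
                } else if (!tolerateNonExist) {
                    // Missing file and no tolerance: signal the caller to fall back.
                    return null;
                }
            }
        }
        return uniqueIdToPath;
    }
}
```

With two commits that both touch the same file ID, the reverse traversal issues one existence check for that ID instead of two, and a caller that passes `tolerateNonExist = false` can treat a `null` result as the signal to fall back to a full table scan.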