Zouxxyy commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1521170343

   > Can you elaborate a little more what the gains we get here?
   
   (1) `getFilesToReadOfInstant` traverses the metadata list to get the files in each metadata entry, checks whether each file exists through `fs.exists`, and then adds it to `uniqueIdToFileStatus`.
   
   When pairs are added to `uniqueIdToFileStatus`, a pair from later metadata overwrites an earlier pair with the same key. So we can traverse the metadata list in reverse order and skip the keys that have already appeared, which may reduce the number of `fs.exists` calls.
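
   A minimal sketch of the reverse traversal, not the actual Hudi code. It assumes each commit's metadata is reduced to a plain map from a unique file id to a path, and the existence check is passed in as a predicate standing in for `fs.exists`. The class and method names are hypothetical. A key is skipped only once an existing file has been recorded for it, so the result matches what forward traversal with overwriting would produce.

   ```java
   import java.util.LinkedHashMap;
   import java.util.List;
   import java.util.Map;
   import java.util.function.Predicate;

   final class ReverseTraversalSketch {

       // metadataList is ordered oldest to newest; each map is (unique file id -> path).
       // exists is a stand-in for fs.exists (assumption, not the Hudi API).
       static Map<String, String> latestExistingFiles(
               List<Map<String, String>> metadataList, Predicate<String> exists) {
           Map<String, String> uniqueIdToPath = new LinkedHashMap<>();
           // Walk from the newest metadata to the oldest.
           for (int i = metadataList.size() - 1; i >= 0; i--) {
               for (Map.Entry<String, String> entry : metadataList.get(i).entrySet()) {
                   // A newer commit already supplied an existing file for this key,
                   // so the older path needs no existence check at all.
                   if (uniqueIdToPath.containsKey(entry.getKey())) {
                       continue;
                   }
                   if (exists.test(entry.getValue())) {
                       uniqueIdToPath.put(entry.getKey(), entry.getValue());
                   }
               }
           }
           return uniqueIdToPath;
       }
   }
   ```

   With three commits that all touch the same file id, this issues one existence check for that id instead of three.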
   
   (2) `getRawWritePathsOfInstants` does not check whether the files exist, but the caller still has to check later, like this:
   
   ```java
         FileStatus[] files = WriteProfiles.getRawWritePathsOfInstants(path, hadoopConf, metadataList, metaClient.getTableType());
         FileSystem fs = FSUtils.getFs(path.toString(), hadoopConf);
         if (Arrays.stream(files).anyMatch(fileStatus -> !StreamerUtil.fileExists(fs, fileStatus.getPath()))) {
           LOG.warn("Found deleted files in metadata, fall back to full table scan.");
           // fallback to full table scan
           // reading from the earliest, scans the partitions and files directly
           ...
         } else {
           fileStatuses = files;
         }
   ```
   Therefore, we can do the check up front. I added a parameter `tolerateNonExist` and combined `getRawWritePathsOfInstants` and `getFilesToReadOfInstant` into one function called `getExistFileFromMetadata`. When `tolerateNonExist` is false and a file does not exist, it returns null immediately.
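
   A rough sketch of the combined behavior, not the signature used in the PR. Only the names `getExistFileFromMetadata` and `tolerateNonExist` come from this change; the parameter types, the `exists` predicate (standing in for the file system check), and the wrapper class are simplified assumptions for illustration.

   ```java
   import java.util.ArrayList;
   import java.util.LinkedHashMap;
   import java.util.List;
   import java.util.Map;
   import java.util.function.Predicate;

   final class ExistCheckSketch {

       // metadataList is ordered oldest to newest; each map is (unique file id -> path).
       // Returns null as soon as a missing file is found while tolerateNonExist is false,
       // so the caller can fall back to a full table scan.
       static List<String> getExistFileFromMetadata(
               List<Map<String, String>> metadataList,
               Predicate<String> exists,
               boolean tolerateNonExist) {
           Map<String, String> uniqueIdToPath = new LinkedHashMap<>();
           for (int i = metadataList.size() - 1; i >= 0; i--) {
               for (Map.Entry<String, String> entry : metadataList.get(i).entrySet()) {
                   if (uniqueIdToPath.containsKey(entry.getKey())) {
                       continue; // newer metadata already covers this key
                   }
                   if (exists.test(entry.getValue())) {
                       uniqueIdToPath.put(entry.getKey(), entry.getValue());
                   } else if (!tolerateNonExist) {
                       return null; // strict mode: stop at the first missing file
                   }
                   // tolerant mode: silently skip the missing file
               }
           }
           return new ArrayList<>(uniqueIdToPath.values());
       }
   }
   ```

   A caller that needs the fallback would pass `false` and treat a null result as the signal to do the full table scan; a caller that only wants readable files would pass `true`.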
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
