[
https://issues.apache.org/jira/browse/HIVE-21040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vihang Karajgaonkar reassigned HIVE-21040:
------------------------------------------
> msck does unnecessary file listing at last level of partitions
> --------------------------------------------------------------
>
> Key: HIVE-21040
> URL: https://issues.apache.org/jira/browse/HIVE-21040
> Project: Hive
> Issue Type: Improvement
> Reporter: Vihang Karajgaonkar
> Assignee: Vihang Karajgaonkar
> Priority: Major
>
> Here is the code snippet that {{msck}} runs to list directories:
> {noformat}
> final Path currentPath = pd.p;
> final int currentDepth = pd.depth;
> FileStatus[] fileStatuses = fs.listStatus(currentPath, FileUtils.HIDDEN_FILES_PATH_FILTER);
> // found no files under a sub-directory under table base path; it is possible that the table
> // is empty and hence there are no partition sub-directories created under base path
> if (fileStatuses.length == 0 && currentDepth > 0 && currentDepth < partColNames.size()) {
>   // since maxDepth is not yet reached, we are missing partition
>   // columns in currentPath
>   logOrThrowExceptionWithMsg(
>       "MSCK is missing partition columns under " + currentPath.toString());
> } else {
>   // found files under currentPath add them to the queue if it is a directory
>   for (FileStatus fileStatus : fileStatuses) {
>     if (!fileStatus.isDirectory() && currentDepth < partColNames.size()) {
>       // found a file at depth which is less than number of partition keys
>       logOrThrowExceptionWithMsg(
>           "MSCK finds a file rather than a directory when it searches for "
>               + fileStatus.getPath().toString());
>     } else if (fileStatus.isDirectory() && currentDepth < partColNames.size()) {
>       // found a sub-directory at a depth less than number of partition keys
>       // validate if the partition directory name matches with the corresponding
>       // partition colName at currentDepth
>       Path nextPath = fileStatus.getPath();
>       String[] parts = nextPath.getName().split("=");
>       if (parts.length != 2) {
>         logOrThrowExceptionWithMsg("Invalid partition name " + nextPath);
>       } else if (!parts[0].equalsIgnoreCase(partColNames.get(currentDepth))) {
>         logOrThrowExceptionWithMsg(
>             "Unexpected partition key " + parts[0] + " found at " + nextPath);
>       } else {
>         // add sub-directory to the work queue if maxDepth is not yet reached
>         pendingPaths.add(new PathDepthInfo(nextPath, currentDepth + 1));
>       }
>     }
>   }
>   if (currentDepth == partColNames.size()) {
>     return currentPath;
>   }
> }
> {noformat}
> Note that when {{currentDepth}} has reached {{maxDepth}} (that is,
> {{partColNames.size()}}), the code still lists the files under
> {{currentPath}}, even though the result of that listing is never used. We
> can avoid this call by checking {{currentDepth}} first and bailing out early.
> This can improve the performance of the msck command significantly, especially
> when there are a lot of files in each partition on remote filesystems like S3
> or ADLS.
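
The early exit described above could look roughly like the following sketch. It is not the actual Hive patch: {{FakeFs}}, {{PathDepth}}, and {{findPartitions}} are hypothetical names, and an in-memory map stands in for {{fs.listStatus}} so that the number of listing calls can be counted. Unlike the snippet above, it silently skips unexpected entries instead of calling {{logOrThrowExceptionWithMsg}}.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MsckEarlyExitSketch {

    /** Hypothetical in-memory stand-in for a remote file system; counts listing calls. */
    static final class FakeFs {
        private final Map<String, List<String>> children = new HashMap<>();
        int listCalls = 0;

        void addDir(String path, String... childNames) {
            children.put(path, Arrays.asList(childNames));
        }

        boolean isDirectory(String path) {
            return children.containsKey(path);
        }

        List<String> list(String path) {
            listCalls++; // each call models one (potentially expensive) remote listing
            return children.getOrDefault(path, Collections.<String>emptyList());
        }
    }

    /** Hypothetical equivalent of the (path, depth) pair kept in the work queue. */
    static final class PathDepth {
        final String path;
        final int depth;

        PathDepth(String path, int depth) {
            this.path = path;
            this.depth = depth;
        }
    }

    /** Collects partition paths, checking the depth before any listing is issued. */
    static List<String> findPartitions(FakeFs fs, String basePath, List<String> partColNames) {
        List<String> partitions = new ArrayList<>();
        Deque<PathDepth> pending = new ArrayDeque<>();
        pending.add(new PathDepth(basePath, 0));
        while (!pending.isEmpty()) {
            PathDepth pd = pending.poll();
            if (pd.depth == partColNames.size()) {
                // Early exit: the last partition level is reached, so the files
                // inside this directory never need to be listed.
                partitions.add(pd.path);
                continue;
            }
            for (String child : fs.list(pd.path)) {
                String childPath = pd.path + "/" + child;
                String[] parts = child.split("=");
                // Assumption: non-matching entries are skipped here for brevity.
                if (fs.isDirectory(childPath) && parts.length == 2
                        && parts[0].equalsIgnoreCase(partColNames.get(pd.depth))) {
                    pending.add(new PathDepth(childPath, pd.depth + 1));
                }
            }
        }
        return partitions;
    }

    public static void main(String[] args) {
        FakeFs fs = new FakeFs();
        fs.addDir("/tbl", "year=2018");
        fs.addDir("/tbl/year=2018", "month=11", "month=12");
        fs.addDir("/tbl/year=2018/month=11", "f1", "f2");
        fs.addDir("/tbl/year=2018/month=12", "f3");
        List<String> found = findPartitions(fs, "/tbl", Arrays.asList("year", "month"));
        System.out.println(found + " listings=" + fs.listCalls);
    }
}
```

In this toy layout only the table directory and the {{year=2018}} directory are listed; the two {{month=...}} leaf directories are returned as partitions without being listed, which is where the saving would come from on a store such as S3 or ADLS.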