Vihang Karajgaonkar has posted comments on this change. ( http://gerrit.cloudera.org:8080/13665 )
Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories ...................................................................... Patch Set 2: (9 comments) http://gerrit.cloudera.org:8080/#/c/13665/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/13665/3//COMMIT_MSG@7 PS3, Line 7: IMPALA-8663 : [WIP] FileMetadataLoader should skip listing files in hidden and tmp directories > nit: wrap Done http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java: http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@549 PS3, Line 549: if (recursive) { : // The Hadoop FileSystem API doesn't provide a recursive listStatus call that : // doesn't also fetch block locations, and fetching block locations is expensive. : // Here, our caller specifically doesn't need block locations, so we don't want to : // call the expensive 'listFiles' call on HDFS. Instead, we need to "manually" : // recursively call FileSystem.listStatusIterator(). : // : // Note that this "manual" recursion is not actually any slower than the recursion : // provided by the HDFS 'listFiles(recursive=true)' API, since the HDFS wire : // protocol doesn't provide any such recursive support anyway. In other words, : // the API that looks like a single recursive call is just as bad as what we're : // doing here. : // : // However, S3 actually implements 'listFiles(recursive=true)' with a faster path : // which natively recurses. In that case, it's quite preferable to use 'listFiles' : // even though it returns LocatedFileStatus objects with "fake" blocks which we : // will ignore. : if (isS3AFileSystem(fs)) { : return listFiles(fs, p, recursive); : } : : return new RecursingIteratorWithFilter(fs, p, true); : } : : return new RecursingIteratorWithFilter(fs, p, fal > I think this is equivalent to... yeah, makes sense. Changed. http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@581 PS3, Line 581: erface > nit: used ? Done http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@585 PS3, Line 585: > nit: FILTER? (predicate is a java construct) Done http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@642 PS3, Line 642: * certain files/directories. This is a implementation of > This is always used, no? Hard code it? I was hoping to make this class generic enough to take in any predicate. But for now, probably makes sense to hard code. http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@643 PS3, Line 643: * {@link org.apache.hadoop.fs.RemoteIterator} > Document this? Done http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@658 PS3, Line 658: private final boolean useListStatus_; > line too long (94 > 90) Done http://gerrit.cloudera.org:8080/#/c/13665/2/fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java File fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java: http://gerrit.cloudera.org:8080/#/c/13665/2/fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java@100 PS2, Line 100: dstFs.deleteOnExit(tmpTestPath); > Are you sure? The FileSystem#close is called when the JVM exits. See this, https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L1618 given this discussion, its probably clearer to use try with resources to make it explicit to the reader. Made the change accordingly http://gerrit.cloudera.org:8080/#/c/13665/2/fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java@112 PS2, Line 112: 24 > not sure I follow. I think what were asserting here is that the number of l Isn't this approach assuming the load() method itself is not buggy? Eg, in the suggested approach test will pass if load() method is hardcoded to return a fixed number of files (my point being, a unrelated bug in FileMetadataLoader will cause a false positive in the suggested approach) In my opinion, comparing it to a known hard-coded number of files (or a trusted third party library for results) is a better source of truth to compare against. -- To view, visit http://gerrit.cloudera.org:8080/13665 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa Gerrit-Change-Number: 13665 Gerrit-PatchSet: 2 Gerrit-Owner: Vihang Karajgaonkar <[email protected]> Gerrit-Reviewer: Bharath Vissapragada <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-Reviewer: Vihang Karajgaonkar <[email protected]> Gerrit-Comment-Date: Thu, 20 Jun 2019 21:55:01 +0000 Gerrit-HasComments: Yes
