Vihang Karajgaonkar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp 
directories
......................................................................


Patch Set 2:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/13665/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13665/3//COMMIT_MSG@7
PS3, Line 7: IMPALA-8663 : [WIP] FileMetadataLoader should skip listing files 
in hidden and tmp directories
> nit: wrap
Done


http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@549
PS3, Line 549:    if (recursive) {
             :         // The Hadoop FileSystem API doesn't provide a recursive 
listStatus call that
             :         // doesn't also fetch block locations, and fetching 
block locations is expensive.
             :         // Here, our caller specifically doesn't need block 
locations, so we don't want to
             :         // call the expensive 'listFiles' call on HDFS. Instead, 
we need to "manually"
             :         // recursively call FileSystem.listStatusIterator().
             :         //
             :         // Note that this "manual" recursion is not actually any 
slower than the recursion
             :         // provided by the HDFS 'listFiles(recursive=true)' API, 
since the HDFS wire
             :         // protocol doesn't provide any such recursive support 
anyway. In other words,
             :         // the API that looks like a single recursive call is 
just as bad as what we're
             :         // doing here.
             :         //
             :         // However, S3 actually implements 
'listFiles(recursive=true)' with a faster path
             :         // which natively recurses. In that case, it's quite 
preferable to use 'listFiles'
             :         // even though it returns LocatedFileStatus objects with 
"fake" blocks which we
             :         // will ignore.
             :         if (isS3AFileSystem(fs)) {
             :           return listFiles(fs, p, recursive);
             :         }
             :
             :         return new RecursingIteratorWithFilter(fs, p, true);
             :       }
             :
             :       return new RecursingIteratorWithFilter(fs, p, fal
> I think this is equivalent to...
yeah, makes sense. Changed.


http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@581
PS3, Line 581: erface
> nit: used ?
Done


http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@585
PS3, Line 585:
> nit: FILTER? (predicate is a java construct)
Done


http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@642
PS3, Line 642: * certain files/directories. This is a implementation of
> This is always used, no? Hard code it?
I was hoping to make this class generic enough to take in any predicate. But 
for now, probably makes sense to hard code.


http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@643
PS3, Line 643:    * {@link org.apache.hadoop.fs.RemoteIterator}
> Document this?
Done


http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@658
PS3, Line 658:     private final boolean useListStatus_;
> line too long (94 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/13665/2/fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
File fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java:

http://gerrit.cloudera.org:8080/#/c/13665/2/fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java@100
PS2, Line 100: dstFs.deleteOnExit(tmpTestPath);
> Are you sure?
The FileSystem#close is called when the JVM exits. See this, 
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L1618

given this discussion, its probably clearer to use try with resources to make 
it explicit to the reader. Made the change accordingly


http://gerrit.cloudera.org:8080/#/c/13665/2/fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java@112
PS2, Line 112: 24
> not sure I follow. I think what were asserting here is that the number of l
Isn't this approach assuming the load() method itself is not buggy? Eg, in the 
suggested approach test will pass if load() method is hardcoded to return a 
fixed number of files (my point being, a unrelated bug in FileMetadataLoader 
will cause a false positive in the suggested approach)

In my opinion, comparing it to a known hard-coded number of files (or a trusted 
third party library for results) is a better source of truth to compare against.



--
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 2
Gerrit-Owner: Vihang Karajgaonkar <[email protected]>
Gerrit-Reviewer: Bharath Vissapragada <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-Reviewer: Vihang Karajgaonkar <[email protected]>
Gerrit-Comment-Date: Thu, 20 Jun 2019 21:55:01 +0000
Gerrit-HasComments: Yes

Reply via email to