Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/11227 )
Change subject: IMPALA-7047. Refreshing partitions should not make an RPC per file ...................................................................... Patch Set 4: (5 comments) http://gerrit.cloudera.org:8080/#/c/11227/4//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/11227/4//COMMIT_MSG@12 PS4, Line 12: since the prior load, it was instead calling it on every file. > General question. In HDFS, we can tell if a file was added or removed. Once In terms of files "changing", we do have to worry about a file being replaced with the same name as a pre-existing file. I think that's why we compare mtimes. In terms of balancing blocks, we detect on the backend if we did a remote read of a block, and if so, we issue a SQL warning which tells the user to issue an INVALIDATE METADATA for that table. Personally I think this sucks -- what we should probably do is have the backend notify the catalogd that the block locations for that file need to be refreshed, and do so automatically. That said, since we're eventually hoping to move more towards a fine-grained caching with TTL, any such incorrect locality info would fall out of the cache within a bounded period of time, so maybe not a big deal anyway. http://gerrit.cloudera.org:8080/#/c/11227/4//COMMIT_MSG@28 PS4, Line 28: I also tested this by pointing my dev box at a remote filesystem that it's interesitng that there was a "reverse logic" bug in my patch, but I did test it and saw the desired effects. We should make sure to circle back and re-run a similar test to make sure I saw what I thought I was seeing. http://gerrit.cloudera.org:8080/#/c/11227/4/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: http://gerrit.cloudera.org:8080/#/c/11227/4/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@568 PS4, Line 568: stats = refreshFileMetadata(partDir, Collections.singletonList(partition)); > Question about the implementation of refresh (that code is not in this patc I think Java supports arrays up to 2B elements, but HDFS only supports really up to a few hundred million files. Even so, I think many other parts of Impala are likely to fall over way before this (probably at 100k files in a partition lots of things start to break). I seem to recall that, in terms of the underlying HDFS RPC itself, it already splits it into smaller critical sections of the NS lock on the NN, and maybe even into multiple RPCs. All that to say, I'm not too concerned about this point. http://gerrit.cloudera.org:8080/#/c/11227/4/fe/src/test/java/org/apache/impala/catalog/CatalogTest.java File fe/src/test/java/org/apache/impala/catalog/CatalogTest.java: http://gerrit.cloudera.org:8080/#/c/11227/4/fe/src/test/java/org/apache/impala/catalog/CatalogTest.java@371 PS4, Line 371: assertEquals(0L, (long)opsCounts.getLong("op_get_file_block_locations")); > Something is odd. This test did not fail for the "reverse polarity" bug tha This part of the test is for "REFRESH" on the table level -- does that trigger the code path where there was the bug? http://gerrit.cloudera.org:8080/#/c/11227/4/fe/src/test/java/org/apache/impala/catalog/CatalogTest.java@379 PS4, Line 379: assertEquals(0L, (long)opsCounts.getLong("op_get_file_block_locations")); this is the one that I'm surprised didn't fail with the bug. Perhaps we can add some log statements or use the debugger to step through the test and understand better what's happening. -- To view, visit http://gerrit.cloudera.org:8080/11227 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2051b96599206164aaa06ecbdf64374c46eda956 Gerrit-Change-Number: 11227 Gerrit-PatchSet: 4 Gerrit-Owner: Todd Lipcon <t...@apache.org> Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Paul Rogers <par0...@yahoo.com> Gerrit-Reviewer: Todd Lipcon <t...@apache.org> Gerrit-Reviewer: Vuk Ercegovac <vercego...@cloudera.com> Gerrit-Comment-Date: Tue, 27 Nov 2018 19:29:15 +0000 Gerrit-HasComments: Yes