Dimitris Tsirogiannis has posted comments on this change. ( http://gerrit.cloudera.org:8080/8235 )
Change subject: IMPALA-5429: Multi threaded block metadata loading ...................................................................... Patch Set 5: (22 comments) http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@215 PS5, Line 215: // Block metadata loading stats for a single HDFS path. nit: File/block (since we're also loading/refreshing files). Also, you may want to change the name of the private class to reflect that, e.g. PathMetadataLoadingStats or something like that. http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@217 PS5, Line 217: loadedFiles_ You may want to add a comment here. What is loaded vs refreshed? Is the one superset of the other. Good to clarify to avoid confusion. http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@218 PS5, Line 218: _ I believe the convention is that we don't use '_' for public members. http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@231 PS5, Line 231: Runnable I don't know how this is used later on, but alternatively you can make PathBlockMetadataLoadRequest implement Callable<PathBLockMdLoadingStats>, hence returning the stats when calling call(). Now you seem to access the stats only through the debugString() function. http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@247 PS5, Line 247: Blocks on the loadBlockMetadata() call. Not following this comment. run() either calls refreshBlockMetadata() or loadBlockMetadata(), so I can't really interpret what this comment is saying. http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@333 PS5, Line 333: loadBlockMetadata I know I am guilty for some of these names but maybe rename this to "resetAndLoadBlockMetadata"? Then it is more clear what the differences are between this function and rerfreshBlockMetadata(). http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@363 PS5, Line 363: numUnknownDiskIds Are you overriding or incrementing the value of numUnknownDiskIds in the create() function? http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@368 PS5, Line 368: newFileDescs Hm, that doesn't seem particularly safe (i.e. using the same list for every partition). Are we certain that any other partition modification operation (e.g. alter partition set location) will not try to override the file descriptors, thereby affecting all the other partitions? http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@380 PS5, Line 380: changed : * mtime It's not just the changed mtime that we're looking for in order to determiner modification, so you may want to either remove this or mention all the criteria we use. I prefer the former. http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@383 PS5, Line 383: The initial table load still uses the listFiles() : * on the data directory that fetches both the FileStatus as well as BlockLocations in : * a single call. remove http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@398 PS5, Line 398: get(0) Comment why we take the file descriptors of one partition and that it doesn't really matter which one we choose. http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@399 PS5, Line 399: public String apply(FileDescriptor desc) { : return desc.getFileName(); : } Add @Override and make it a single line (if it fits) http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@448 PS5, Line 448: (fd == null) || (fd.getFileLength() != status.getLen()) || : (fd.getModificationTime() != status.getModificationTime()); I think we were also checking if the partition was cached. Isn't this check needed anymore? http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@455 PS5, Line 455: refreshFileMetadata Maybe call it refreshPartitionStorageMetadata? Overall, it may make sense to replace "File/Block" with "Storage" whenever it makes sense. Thoughts? http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@694 PS5, Line 694: Exception Why this change? http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@805 PS5, Line 805: ExecutorService partitionLoadingPool = Executors.newFixedThreadPool(threadPoolSize); What is the benefit of having a one thread pool per loaded table instead of a global thread pool that all tables can use? In the former approach, it will be hard to impose any constraints on the maximum number of concurrent load operations. Also, there is some overhead in initializing and tearing down the thread pool which you pay every time a table is refreshed. http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@821 PS5, Line 821: debugString This will get information from loadingStats but how do you know that loadingStats is not null? I think it may be safer to check and call debugString in the catch section of run() and don't make any assumptions about the internal state of PathBlockMetadataLoadRequest here. If you do that, then you don't even need a Map in L807, only a collection of Futures suffices. http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1237 PS5, Line 1237: HashMap nit: Map http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1303 PS5, Line 1303: Returns the HdfsPartition objects associated with the specified list of partition : * names. Update comment. http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java File fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java: http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@626 PS5, Line 626: private Table addHdfsPartitions(Table tbl, List<Partition> partitions) : throws CatalogException { : Preconditions.checkNotNull(tbl); : Preconditions.checkNotNull(partitions); : if (!(tbl instanceof HdfsTable)) { : throw new CatalogException("Table " + tbl.getFullName() + " is not an HDFS table"); : } : HdfsTable hdfsTable = (HdfsTable) tbl; : List<HdfsPartition> hdfsPartitions = hdfsTable.createAndLoadPartitions(partitions); : for (HdfsPartition hdfsPartition: hdfsPartitions) { : catalog_.addPartition(hdfsPartition); : } : return hdfsTable; : } Lot's of duplication with the previous one. How expensive would it be to simply have something like: addHdfsPartition(Table tbl, Partition partition) { addHdfsPartitions(tbl, List.newArrayList(partition)); } http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/util/ListMap.java File fe/src/main/java/org/apache/impala/util/ListMap.java: http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/util/ListMap.java@59 PS5, Line 59: synchronized (list_) { Why synchronized on list_? This confuses the locking scheme. If you want you can simply do synchronized (this). Same in L76. http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/util/ListMap.java@78 PS5, Line 78: list_ = Collections.synchronizedList(list); Hm, I think this new implementation is way too complicated. I'd start with simply making the getIndex and populate methods synchronized. If we see this becoming a bottleneck, then try to optimize it. -- To view, visit http://gerrit.cloudera.org:8080/8235 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I07eaa7151dfc4d56da8db8c2654bd65d8f808481 Gerrit-Change-Number: 8235 Gerrit-PatchSet: 5 Gerrit-Owner: Bharath Vissapragada <[email protected]> Gerrit-Reviewer: Bharath Vissapragada <[email protected]> Gerrit-Reviewer: Dimitris Tsirogiannis <[email protected]> Gerrit-Reviewer: Jim Apple <[email protected]> Gerrit-Reviewer: Mostafa Mokhtar <[email protected]> Gerrit-Reviewer: Vuk Ercegovac <[email protected]> Gerrit-Comment-Date: Wed, 11 Oct 2017 22:10:27 +0000 Gerrit-HasComments: Yes
