Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/8235 )
Change subject: IMPALA-5429: Multi threaded block metadata loading
......................................................................


Patch Set 5:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/8235/5/be/src/catalog/catalog.cc
File be/src/catalog/catalog.cc:

http://gerrit.cloudera.org:8080/#/c/8235/5/be/src/catalog/catalog.cc@35
PS5, Line 35: DEFINE_int32(num_metadata_loading_threads, 16,
  :     "(Advanced) The number of metadata loading threads (degree of parallelism) to use "
  :     "when loading catalog metadata.");
I'm confused by the commit message which talks about not loading from hms using multiple threads and this flag which indicates that hms is loaded using multiple threads.


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@217
PS5, Line 217: public int loadedFiles_ = 0;
  :     public int refreshedFiles_ = 0;
  :     public int ignoredFiles_ = 0;
add comments for these-- see the question regarding refreshedFiles below, for example.


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@368
PS5, Line 368: for (HdfsPartition partition: partitions) partition.setFileDescriptors(
Am I misreading this or does each partition get set to the same list of newly found descriptors?


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@426
PS5, Line 426: new Reference<Long>(Long.valueOf(0)
why not use numUnknownDiskIds here?


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@431
PS5, Line 431: ++loadStats.refreshedFiles_;
does refreshedFiles mean "file blocks reloaded" or "file checked for reload and possibly reloaded"?
would be good to track how many times the if-block on L418 was entered since this method is intended to be used when few changes are present.


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@433
PS5, Line 433: for (HdfsPartition partition: partitions) partition.setFileDescriptors(n
same question as in the load method.


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@773
PS5, Line 773: HDFS and S3
just to clarify, HdfsTable covers both hdfs table metadata as well as metadata needed for s3?


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@783
PS5, Line 783: getFileSystem(CONF)
I noticed that this is called in many places in this class-- is it bc a given table can be stored on multiple filesystems?


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@801
PS5, Line 801: getLoadingThreadPoolSize
can different partitions have different number of files? if so, work across threads may vary. what's costly here: per file call, per partition call, or number of blocks per file?


-- 
To view, visit http://gerrit.cloudera.org:8080/8235
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I07eaa7151dfc4d56da8db8c2654bd65d8f808481
Gerrit-Change-Number: 8235
Gerrit-PatchSet: 5
Gerrit-Owner: Bharath Vissapragada <[email protected]>
Gerrit-Reviewer: Bharath Vissapragada <[email protected]>
Gerrit-Reviewer: Jim Apple <[email protected]>
Gerrit-Reviewer: Vuk Ercegovac <[email protected]>
Gerrit-Comment-Date: Wed, 11 Oct 2017 18:51:23 +0000
Gerrit-HasComments: Yes
