Bharath Vissapragada has uploaded this change for review. ( http://gerrit.cloudera.org:8080/8235
Change subject: IMPALA-5429: Multi threaded block metadata loading ...................................................................... IMPALA-5429: Multi threaded block metadata loading Implements multi threaded block metadata loading on the Catalog server where we fetch block metadata for multiple partitions in parallel. Number of threads to load the metadata is controlled by the following two parameters (set on the Catalog server startup) -max_hdfs_parts_parallel_load(default=5) -max_s3_parts_parallel_load(default=10) We use different thread pool sizes for HDFS and S3 based tables since S3 supports much higher throughput of RPC calls for listStatus /listFiles. Based on our experiments, S3 showed a linear speed up (up to ~113x) with increasing number of loading threads where as the HDFS throughput was limited to ~5x in un-secure clusters and up to ~3.7x in secure clusters. We narrowed it down to scalability bottlenecks in HDFS RPC implementation (HADOOP-14558) on both the server and the client side. One thing to note here is that the thread pool based metadata fetching is implemented only for loading HDFS block metadata and not for loading HMS partition information. Our experiments showed that while loading large partitioned tables, ~90% of the time is spent in connecting to NN and loading the HDFS block information and optimizing the rest ~10% makes the code unnecessarily complex without much gain. Additional notes: - The multithreading approach is implemented for * INVALIDATE (loading from scratch), * REFRESH (reusing existing md) code paths, * ALTER TABLE ADD/RECOVER PARTITIONS. - This patch makes the implementation of ListMap thread-safe since we use that datastructure as a shared state between multiple partition metadata loding threads. - While the configuration param max_s3_parts_parallel_load says s3, it applies for any FileSystem implementation that doesn't support storage IDs (like ADLS). Testing and Results: - This patch doesn't add any new tests since there is enough test coverage already. Passed core/exhaustive runs with HDFS/S3. - We noticed up to ~113x speed up on S3 tables(thread_pool_size=160) and up to ~5x speed up in un-secure HDFS clusters and ~3.7x in secure HDFS clusters. - Benchmark improvements on a 16 node cluster (I = Improvement) 100K-PARTITIONS-1M-FILES-CUSTOM-11-REFRESH-PARTITION I -16.4% 100K-PARTITIONS-1M-FILES-CUSTOM-08-ADD-PARTITION I -17.25% 100K-PARTITIONS-1M-FILES-CUSTOM-11-DROP-PARTITION I -18.53% 80-PARTITIONS-250K-FILES-11-REFRESH-PARTITION I -23.57% 80-PARTITIONS-250K-FILES-S3-08-ADD-PARTITION I -23.87% 80-PARTITIONS-250K-FILES-09-INVALIDATE I -24.88% 80-PARTITIONS-250K-FILES-01-DROP I -34.82% 80-PARTITIONS-250K-FILES-03-RECOVER I -35.90% 80-PARTITIONS-250K-FILES-07-REFRESH I -43.03% 100K-PARTITIONS-1M-FILES-CUSTOM-12-QUERY-PARTITIONS I -43.93% 100K-PARTITIONS-1M-FILES-CUSTOM-05-QUERY-AFTER-INV I -46.59% 80-PARTITIONS-250K-FILES-10-REFRESH-AFTER-ADD-PARTITION I -48.71% 100K-PARTITIONS-1M-FILES-CUSTOM-07-REFRESH I -49.02% 80-PARTITIONS-250K-FILES-05-QUERY-AFTER-INV I -49.05% 100K-PARTITIONS-1M-FILES-CUSTOM-10-REFRESH-AFTER-ADD-PARTI -51.87% 80-PARTITIONS-250K-FILES-S3-03-RECOVER I -67.17% 80-PARTITIONS-250K-FILES-S3-01-DROP I -70.38% 80-PARTITIONS-250K-FILES-S3-05-QUERY-AFTER-INV I -76.45% 80-PARTITIONS-250K-FILES-S3-07-REFRESH I -87.04% 80-PARTITIONS-250K-FILES-S3-10-REFRESH-AFTER-ADD-PART I -88.57% Change-Id: I07eaa7151dfc4d56da8db8c2654bd65d8f808481 --- M be/src/catalog/catalog.cc M be/src/util/backend-gflag-util.cc M common/thrift/BackendGflags.thrift M fe/src/main/java/org/apache/impala/catalog/HdfsPartitionLocationCompressor.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/service/BackendConfig.java M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java M fe/src/main/java/org/apache/impala/service/JniCatalog.java M fe/src/main/java/org/apache/impala/util/ListMap.java 9 files changed, 386 insertions(+), 212 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/35/8235/4 -- To view, visit http://gerrit.cloudera.org:8080/8235 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I07eaa7151dfc4d56da8db8c2654bd65d8f808481 Gerrit-Change-Number: 8235 Gerrit-PatchSet: 4 Gerrit-Owner: Bharath Vissapragada <bhara...@cloudera.com>