Bharath Vissapragada has uploaded this change for review. ( http://gerrit.cloudera.org:8080/8235 )


Change subject: IMPALA-5429: Multi threaded block metadata loading
......................................................................

IMPALA-5429: Multi threaded block metadata loading

Implements multi-threaded block metadata loading on the Catalog
server, where we fetch block metadata for multiple partitions
in parallel. The number of threads used to load the metadata is
controlled by the following two parameters (set at Catalog server
startup):

-max_hdfs_parts_parallel_load(default=5)
-max_s3_parts_parallel_load(default=10)

We use different thread pool sizes for HDFS- and S3-based tables
since S3 supports a much higher throughput of listStatus/listFiles
RPC calls. In our experiments, S3 showed a linear speedup (up to
~113x) with an increasing number of loading threads, whereas the
HDFS speedup was limited to ~5x in unsecured clusters and ~3.7x in
secure clusters. We narrowed this down to scalability bottlenecks
in the HDFS RPC implementation (HADOOP-14558) on both the server
and the client side.
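
To illustrate the idea described above, here is a minimal sketch of
per-partition loading on a fixed-size thread pool. It is not the code
from this patch: the class and method names are hypothetical, the
loader is a stand-in for the real listStatus/listFiles call, and
capping the pool at the number of partitions is an assumption of the
sketch. Only the two default values come from this message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch; none of these names are taken from the Impala code base.
class ParallelPartitionLoader {
  // Defaults quoted in this commit message.
  static final int MAX_HDFS_PARTS_PARALLEL_LOAD = 5;
  static final int MAX_S3_PARTS_PARALLEL_LOAD = 10;

  // Picks the pool size by filesystem type. Capping at the partition count
  // (and using at least one thread) is an assumption of this sketch.
  static int poolSize(boolean supportsStorageIds, int numPartitions) {
    int max = supportsStorageIds
        ? MAX_HDFS_PARTS_PARALLEL_LOAD : MAX_S3_PARTS_PARALLEL_LOAD;
    return Math.max(1, Math.min(max, numPartitions));
  }

  // Stand-in for the per-partition listing RPC; returns a dummy value.
  static int loadBlockMetadata(String partitionPath) {
    return partitionPath.length();
  }

  // Loads all partitions in parallel; results keep the input order.
  static List<Integer> loadAll(List<String> partitionPaths,
      boolean supportsStorageIds) {
    ExecutorService pool = Executors.newFixedThreadPool(
        poolSize(supportsStorageIds, partitionPaths.size()));
    try {
      List<Callable<Integer>> tasks = new ArrayList<>();
      for (String path : partitionPaths) {
        tasks.add(() -> loadBlockMetadata(path));
      }
      List<Integer> results = new ArrayList<>();
      // invokeAll() blocks until every task has completed.
      for (Future<Integer> f : pool.invokeAll(tasks)) {
        results.add(f.get());
      }
      return results;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while loading", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("partition load failed", e.getCause());
    } finally {
      pool.shutdown();
    }
  }
}
```

For example, ParallelPartitionLoader.loadAll(paths, false) would use
the larger (S3-style) pool for the given list of partition paths.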

Note that the thread-pool-based metadata fetching is implemented
only for loading HDFS block metadata, not for loading HMS partition
information. Our experiments showed that when loading large
partitioned tables, ~90% of the time is spent connecting to the NN
and loading the HDFS block information; optimizing the remaining
~10% would make the code unnecessarily complex without much gain.

Additional notes:

- The multithreading approach is implemented for the following code paths:
  * INVALIDATE (loading from scratch),
  * REFRESH (reusing existing metadata),
  * ALTER TABLE ADD/RECOVER PARTITIONS.

- This patch makes the implementation of ListMap thread-safe since
we use that data structure as shared state between multiple partition
metadata loading threads.

- Although the configuration parameter max_s3_parts_parallel_load says
S3, it applies to any FileSystem implementation that doesn't support
storage IDs (like ADLS).
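
The thread-safety note on ListMap above can be illustrated with a
minimal sketch of a list-plus-map structure whose operations are
synchronized. This is not the actual ListMap.java change, which is not
shown in this message; the class name SyncListMap and its methods are
hypothetical, and coarse-grained locking is just one possible approach.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: maps each distinct item to a stable integer index.
class SyncListMap<T> {
  private final List<T> list = new ArrayList<>();
  private final Map<T, Integer> map = new HashMap<>();

  // Returns the index of the item, adding it if absent. The lookup and the
  // insert happen under one lock, so two threads can never assign different
  // indexes to the same item.
  synchronized int getIndex(T item) {
    Integer idx = map.get(item);
    if (idx == null) {
      idx = list.size();
      list.add(item);
      map.put(item, idx);
    }
    return idx;
  }

  // Returns the item stored at the given index.
  synchronized T getEntry(int index) {
    return list.get(index);
  }

  // Returns the number of distinct items seen so far.
  synchronized int size() {
    return list.size();
  }
}
```

Several loading threads can then call getIndex() on one shared
instance, and every distinct item ends up with exactly one index.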

Testing and Results:

- This patch doesn't add any new tests since there is enough test
coverage already. Passed core/exhaustive runs with HDFS/S3.

- We noticed up to ~113x speedup on S3 tables (thread_pool_size=160),
up to ~5x speedup in unsecured HDFS clusters, and ~3.7x in secure
HDFS clusters.

- Benchmark improvements on a 16-node cluster (I = Improvement)

 100K-PARTITIONS-1M-FILES-CUSTOM-11-REFRESH-PARTITION     I -16.4%
 100K-PARTITIONS-1M-FILES-CUSTOM-08-ADD-PARTITION         I -17.25%
 100K-PARTITIONS-1M-FILES-CUSTOM-11-DROP-PARTITION        I -18.53%
 80-PARTITIONS-250K-FILES-11-REFRESH-PARTITION            I -23.57%
 80-PARTITIONS-250K-FILES-S3-08-ADD-PARTITION             I -23.87%
 80-PARTITIONS-250K-FILES-09-INVALIDATE                   I -24.88%
 80-PARTITIONS-250K-FILES-01-DROP                         I -34.82%
 80-PARTITIONS-250K-FILES-03-RECOVER                      I -35.90%
 80-PARTITIONS-250K-FILES-07-REFRESH                      I -43.03%
 100K-PARTITIONS-1M-FILES-CUSTOM-12-QUERY-PARTITIONS      I -43.93%
 100K-PARTITIONS-1M-FILES-CUSTOM-05-QUERY-AFTER-INV       I -46.59%
 80-PARTITIONS-250K-FILES-10-REFRESH-AFTER-ADD-PARTITION  I -48.71%
 100K-PARTITIONS-1M-FILES-CUSTOM-07-REFRESH               I -49.02%
 80-PARTITIONS-250K-FILES-05-QUERY-AFTER-INV              I -49.05%
 100K-PARTITIONS-1M-FILES-CUSTOM-10-REFRESH-AFTER-ADD-PARTI -51.87%
 80-PARTITIONS-250K-FILES-S3-03-RECOVER                   I -67.17%
 80-PARTITIONS-250K-FILES-S3-01-DROP                      I -70.38%
 80-PARTITIONS-250K-FILES-S3-05-QUERY-AFTER-INV           I -76.45%
 80-PARTITIONS-250K-FILES-S3-07-REFRESH                   I -87.04%
 80-PARTITIONS-250K-FILES-S3-10-REFRESH-AFTER-ADD-PART    I -88.57%

Change-Id: I07eaa7151dfc4d56da8db8c2654bd65d8f808481
---
M be/src/catalog/catalog.cc
M be/src/util/backend-gflag-util.cc
M common/thrift/BackendGflags.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsPartitionLocationCompressor.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/JniCatalog.java
M fe/src/main/java/org/apache/impala/util/ListMap.java
9 files changed, 386 insertions(+), 212 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/35/8235/4
--
To view, visit http://gerrit.cloudera.org:8080/8235
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I07eaa7151dfc4d56da8db8c2654bd65d8f808481
Gerrit-Change-Number: 8235
Gerrit-PatchSet: 4
Gerrit-Owner: Bharath Vissapragada <bhara...@cloudera.com>
