Csaba Ringhofer created IMPALA-12476:
----------------------------------------

             Summary: Single thread permission check can bottleneck table loadin
                 Key: IMPALA-12476
                 URL: https://issues.apache.org/jira/browse/IMPALA-12476
             Project: IMPALA
          Issue Type: Improvement
          Components: Catalog
            Reporter: Csaba Ringhofer


Partitioned tables use multiple threads to list files in different partitions, 
but access permission checks are done before this on a single thread. 
IMPALA-7320 optimized this for full table loads (more exactly if a high 
percentage of partitions have changed), but in some cases we still do namenode 
RPCs on a single thread to get access level:
1. as mentioned above, if only a small subset of partitions are changed
2. if the path has ACL (access control list), then after getting file status an 
extra getAclStatus RPC is done, leading to partition_count number of RPCs on a 
single thread if ACL is enabled for all partitions
3. if there is some error when doing the optimized path

1. is especially problematic for metastore event processing, as partition 
events will often change  only a subset of partitions. Even if all partitions 
are changed, the catalogd may not process them in one batch (see IMPALA-12463), 
leading to choosing the unoptimized path for several smaller batches

Besides the optimization in  IMPALA-7320, there is no good reason for doing 
access level check on a single thread, so both case 1 and 2 good be made faster 
by moving to the multithreaded stage of table loading.

Note it is also a question whether all these access permission checks are 
really needed, see    IMPALA-12472.

An anomaly caused by doing these on a single thread is that the affect of flag 
max_hdfs_partitions_parallel_load can be ambiguous - while it can significantly 
speed up loading tables with multiple partitions, if the namenode (or the 
thread that communicates with namenode) is contended, then parallel loads will 
get an unfair share of the limited resources, meaning the tables where large 
amount of work is done on single thread can actually get slower.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to