Vihang Karajgaonkar created HIVE-16014:
------------------------------------------

             Summary: HiveMetastoreChecker should use 
hive.metastore.fshandler.threads instead of hive.mv.files.thread for pool size
                 Key: HIVE-16014
                 URL: https://issues.apache.org/jira/browse/HIVE-16014
             Project: Hive
          Issue Type: Improvement
            Reporter: Vihang Karajgaonkar
            Assignee: Vihang Karajgaonkar


HiveMetastoreChecker uses the {{hive.mv.files.thread}} configuration value to 
determine its pool size, as shown below:

{noformat}
private void checkPartitionDirs(Path basePath, Set<Path> allDirs, int maxDepth)
    throws IOException, HiveException {
    ConcurrentLinkedQueue<Path> basePaths = new ConcurrentLinkedQueue<>();
    basePaths.add(basePath);
    Set<Path> dirSet = Collections.newSetFromMap(new ConcurrentHashMap<Path, Boolean>());
    // Here we just reuse the THREAD_COUNT configuration for
    // HIVE_MOVE_FILES_THREAD_COUNT
    int poolSize = conf.getInt(ConfVars.HIVE_MOVE_FILES_THREAD_COUNT.varname, 15);

    // Check if too low config is provided for move files. 2x CPU is reasonable max count.
    poolSize = poolSize == 0 ? poolSize : Math.max(poolSize,
        Runtime.getRuntime().availableProcessors() * 2);
{noformat}
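
To make the effect of the sizing expression concrete, here is a minimal standalone sketch of it. The class and method names are hypothetical and not part of Hive; only the expression itself is taken from the snippet above. Note that although the comment calls 2x CPU a "max count", the expression as written uses Math.max, so for non-zero values 2x CPU acts as a lower bound, and 0 is passed through unchanged.

```java
// Hypothetical helper, not Hive code: isolates the pool-size expression
// quoted above so its behaviour can be checked for a few inputs.
final class MsckPoolSizeSketch {

    // 0 is returned unchanged; any other configured value is raised to
    // at least 2 x the number of CPUs.
    static int effectivePoolSize(int configured, int cpus) {
        return configured == 0 ? configured : Math.max(configured, cpus * 2);
    }

    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        for (int configured : new int[] {0, 4, 15, 100}) {
            System.out.println(configured + " -> " + effectivePoolSize(configured, cpus));
        }
    }
}
```

On an 8-core machine, for example, a configured value of 4 or 15 would both end up as 16, while 100 would stay at 100.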

msck is commonly used to add partitions that exist on the filesystem but are 
missing from the table. In that case, different pool sizes for HMSHandler and 
HiveMetastoreChecker can affect performance. For example, if 
{{hive.metastore.fshandler.threads}} is set to a lower value like 15 and 
{{hive.mv.files.thread}} is much higher, like 100, or vice versa, the smaller 
pool becomes the bottleneck. It would be good to use 
{{hive.metastore.fshandler.threads}} to size the pool for HiveMetastoreChecker, 
since the number of missing partitions and the number of partitions to be added 
will most likely be the same. In that case, the query should perform best when 
both pool sizes are the same.

Since both configs can be tuned independently, they are quite likely to 
differ. But because there is a strong correlation between the amount of work 
done by HiveMetastoreChecker and by the HiveMetastore.add_partitions call, it 
might be a good idea to use {{hive.metastore.fshandler.threads}} for the pool 
size instead of {{hive.mv.files.thread}}.
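
A rough sketch of the proposed change, using java.util.Properties as a stand-in for HiveConf so that it runs without Hive on the classpath. The class name, the helper method, and the fallback default of 15 are assumptions for illustration; only the two property names come from this issue.

```java
import java.util.Properties;

// Hypothetical illustration, not the actual patch: compares the checker's
// pool size when read from each of the two properties.
final class CheckerPoolConfigSketch {
    static final String FS_HANDLER_THREADS = "hive.metastore.fshandler.threads";
    static final String MOVE_FILES_THREADS = "hive.mv.files.thread";
    // Assumed fallback when the property is unset.
    static final int ASSUMED_DEFAULT = 15;

    // Reads an integer pool size from the given property key.
    static int poolSize(Properties conf, String key) {
        return Integer.parseInt(conf.getProperty(key, String.valueOf(ASSUMED_DEFAULT)));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(FS_HANDLER_THREADS, "15");
        conf.setProperty(MOVE_FILES_THREADS, "100");

        int handlerPool = poolSize(conf, FS_HANDLER_THREADS);
        int checkerCurrent = poolSize(conf, MOVE_FILES_THREADS);   // behaviour described in this issue
        int checkerProposed = poolSize(conf, FS_HANDLER_THREADS);  // behaviour proposed in this issue

        System.out.println("HMSHandler pool:         " + handlerPool);
        System.out.println("checker pool (current):  " + checkerCurrent);
        System.out.println("checker pool (proposed): " + checkerProposed);
    }
}
```

With the proposed lookup, the checker and the handler read the same key, so their pool sizes cannot drift apart when the two properties are tuned separately.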



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
