[jira] [Updated] (HIVE-21210) CombineHiveInputFormat Thread Pool Sizing

BELUGA BEHR (JIRA) Mon, 04 Feb 2019 08:11:28 -0800


     [ 
https://issues.apache.org/jira/browse/HIVE-21210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


BELUGA BEHR updated HIVE-21210:
-------------------------------
    Description: 
Threadpools.

Hive uses threadpools in several different places and each implementation is a 
little different and requires different configurations. I think that Hive needs 
to reign in and standardize the way that threadpools are used and threadpools 
should scale automatically without manual configuration. At any given time, 
there are many hundreds of threads running in the HS2 as the number of 
simultaneous connections increases and they surely cause contention with 
one-another.

Here is an example:
{code:java|title=CombineHiveInputFormat.java}
  // max number of threads we can use to check non-combinable paths
  private static final int MAX_CHECK_NONCOMBINABLE_THREAD_NUM = 50;
  private static final int DEFAULT_NUM_PATH_PER_THREAD = 100;
{code}
When building the splits for a MR job, there are up to 50 threads running per 
query and there is not much scaling here, it's simply 1 thread : 100 files 
ratio.  This implies that to process 5000 files, there are 50 threads, after 
that, 50 threads are still used. Many Hive jobs these days involve more than 
5000 files so it's not scaling well on bigger sizes.

This is not configurable (even manually), it doesn't change when the hardware 
specs increase, and 50 threads seems like a lot when a service must support up 
to 80 connections:

[https://www.cloudera.com/documentation/enterprise/5/latest/topics/admin_hive_tuning.html]

Not to mention, I have never seen a scenario where HS2 is running on a host all 
by itself and has the entire system dedicated to it. Therefore it should be 
more friendly and spin up fewer threads.

I am attaching a patch here that provides a few features:
 * Common module that produces {{ExecutorService}} which caps the number of 
threads it spins up at the number of processors a host has. Keep in mind that a 
class may submit as much work units ({{Callables}} as they would like, but the 
number of threads in the pool is capped.
 * Common module for partitioning work. That is, allow for a generic framework 
for dividing work into partitions (i.e. batches)
 * Modify {{CombineHiveInputFormat}} to take advantage of both modules, 
performing its same duties in a more Java OO way that is currently implemented
 * Add a partitioning (batching) implementation that enforces partitioning of a 
{{Collection}} based on the natural log of the {{Collection}} size so that it 
scales more slowly than a simple 1:100 ratio.
 * Simplify unit test code for {{CombineHiveInputFormat}}

My hope is to introduce these tools to {{CombineHiveInputFormat}} and then to 
drop it into other places.  One of the things I will introduce here is a 
"direct thread" {{ExecutorService}} so that even if there is a configuration 
for a thread pool to be disabled, it will still use an {{ExecutorService}} so 
that the project can avoid logic like "if this function is services by a thread 
pool, use a {{ExecutorService}} (and remember to close it later!) otherwise, 
create a single thread" so that things like [HIVE-16949] can be avoided in the 
future.

  was:
Threadpools.

Hive uses threadpools in several different places and each implementation is a 
little different and requires different configurations. I think that Hive needs 
to reign in and standardize the way that threadpools are used and threadpools 
should scale automatically without manual configuration. At any given time, 
there are many hundreds of threads running in the HS2 as the number of 
simultaneous connections increases and they surely cause contention with 
one-another.

Here is an example:
{code:java|title=CombineHiveInputFormat.java}
  // max number of threads we can use to check non-combinable paths
  private static final int MAX_CHECK_NONCOMBINABLE_THREAD_NUM = 50;
  private static final int DEFAULT_NUM_PATH_PER_THREAD = 100;
{code}
When building the splits for a MR job, there are up to 50 threads running per 
query and there is not much scaling here, it's simply 1 thread : 100 files 
ratio.  This implies that to process 5000 files, there are 50 threads, after 
that, 50 threads are still used. Many Hive jobs these days involve more than 
5000 files so it's not scaling well on bigger sizes.

This is not configurable (even manually), it doesn't change when the hardware 
specs increase, and 50 threads seems like a lot when a service must support up 
to 80 connections:

[https://www.cloudera.com/documentation/enterprise/5/latest/topics/admin_hive_tuning.html]

Not to mention, I have never seen a scenario where HS2 is running on a host all 
by itself and has the entire system dedicated to it. Therefore it should be 
more friendly and spin up fewer threads.

I am attaching a patch here that provides a few features:
 * Common module that produces {{ExecutorService}} which caps the number of 
threads it spins up at the number of processors a host has. Keep in mind that a 
class may submit as much work units ({{Callables}} as they would like, but the 
number of threads in the pool is capped.
 * Common module for partitioning work. That is, allow for a generic framework 
for dividing work into partitions (i.e. batches)
 * Modify {{CombineHiveInputFormat}} to take advantage of both modules, 
performing its same duties in a more Java OO way that is currently implemented
 * Add a partitioning (batching) implementation that enforces partitioning of a 
{{Collection}} based on the natural log of the {{Collection}} size so that it 
scales more slowly than a simple 1:100 ratio.
 * Simplify unit test code for {{CombineHiveInputFormat}}


> CombineHiveInputFormat Thread Pool Sizing
> -----------------------------------------
>
>                 Key: HIVE-21210
>                 URL: https://issues.apache.org/jira/browse/HIVE-21210
>             Project: Hive
>          Issue Type: Improvement
>    Affects Versions: 4.0.0, 3.2.0
>            Reporter: BELUGA BEHR
>            Priority: Major
>
> Threadpools.
> Hive uses threadpools in several different places and each implementation is 
> a little different and requires different configurations. I think that Hive 
> needs to reign in and standardize the way that threadpools are used and 
> threadpools should scale automatically without manual configuration. At any 
> given time, there are many hundreds of threads running in the HS2 as the 
> number of simultaneous connections increases and they surely cause contention 
> with one-another.
> Here is an example:
> {code:java|title=CombineHiveInputFormat.java}
>   // max number of threads we can use to check non-combinable paths
>   private static final int MAX_CHECK_NONCOMBINABLE_THREAD_NUM = 50;
>   private static final int DEFAULT_NUM_PATH_PER_THREAD = 100;
> {code}
> When building the splits for a MR job, there are up to 50 threads running per 
> query and there is not much scaling here, it's simply 1 thread : 100 files 
> ratio.  This implies that to process 5000 files, there are 50 threads, after 
> that, 50 threads are still used. Many Hive jobs these days involve more than 
> 5000 files so it's not scaling well on bigger sizes.
> This is not configurable (even manually), it doesn't change when the hardware 
> specs increase, and 50 threads seems like a lot when a service must support 
> up to 80 connections:
> [https://www.cloudera.com/documentation/enterprise/5/latest/topics/admin_hive_tuning.html]
> Not to mention, I have never seen a scenario where HS2 is running on a host 
> all by itself and has the entire system dedicated to it. Therefore it should 
> be more friendly and spin up fewer threads.
> I am attaching a patch here that provides a few features:
>  * Common module that produces {{ExecutorService}} which caps the number of 
> threads it spins up at the number of processors a host has. Keep in mind that 
> a class may submit as much work units ({{Callables}} as they would like, but 
> the number of threads in the pool is capped.
>  * Common module for partitioning work. That is, allow for a generic 
> framework for dividing work into partitions (i.e. batches)
>  * Modify {{CombineHiveInputFormat}} to take advantage of both modules, 
> performing its same duties in a more Java OO way that is currently implemented
>  * Add a partitioning (batching) implementation that enforces partitioning of 
> a {{Collection}} based on the natural log of the {{Collection}} size so that 
> it scales more slowly than a simple 1:100 ratio.
>  * Simplify unit test code for {{CombineHiveInputFormat}}
> My hope is to introduce these tools to {{CombineHiveInputFormat}} and then to 
> drop it into other places.  One of the things I will introduce here is a 
> "direct thread" {{ExecutorService}} so that even if there is a configuration 
> for a thread pool to be disabled, it will still use an {{ExecutorService}} so 
> that the project can avoid logic like "if this function is services by a 
> thread pool, use a {{ExecutorService}} (and remember to close it later!) 
> otherwise, create a single thread" so that things like [HIVE-16949] can be 
> avoided in the future.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (HIVE-21210) CombineHiveInputFormat Thread Pool Sizing

Reply via email to