[
https://issues.apache.org/jira/browse/HIVE-10725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546625#comment-14546625
]
Raj Bains commented on HIVE-10725:
----------------------------------
Please consider the following when looking at resource design.
1. Yes, the end goal is to prevent HS2 from becoming unresponsive or going OOM -
but we should add more context to that target. For example, are we targeting 100
concurrent sessions of a particular query on a box with 8 cores and 64GB of RAM,
load balanced across two HS2 instances? That gives us a concrete target and
something to test our resource management against. Operational databases can
handle thousands of queries per second - though those queries are simpler - so
HS2 should not be the bottleneck.
2. Admission control should be simple, but richer than a plain query count. It
should also be expressed in terms a database user understands - the number of
threads is a bad metric to expose, since it is not stated in terms of the user's
workload; exposing it just punts our engineering work to the end user. To see
why query attributes matter, assume I set up an HS2 instance for 100 short
concurrent queries and someone throws in a five-fact-table join - what should
happen then? Admission control should allow rules/limits on the number of
queries AND on query attributes; the number of tables accessed / number of joins
might be a good place to start (see the first sketch after this list). Admission
control definitely needs an API, on the assumption that there will be an
external Workload Management module / UI that lets the user configure it along
with other aspects of workload management.
3. We'll have to implement parallel compilation at some point; we should have
guardrails around the allowed parallelism and make sure resource management
still works then. Is the lifetime of a query a pipeline with stages? Can we know
how much resource a query takes in each stage? Should we allow limits on how
many queries are allowed in each stage? You might allow 100 threads, but only 5
running CBO at one time, while 50 wait on a stats fetch from a remote server or
on query execution to finish (see the per-stage sketch after this list).
4. We'll have caches inside HS2 - a statistics cache, a query compile cache -
how will the resource utilization of those be handled (see the bounded-cache
sketch after this list)?
5. What's the performance degradation model? A good model is that we admit
queries until we reach maximum throughput; beyond that we queue them -
increasing latency but not reducing throughput - so we degrade gracefully up to
a point. After a certain amount of latency increase, we reject further incoming
queries outright rather than queuing them (see the last sketch below). Is the
architecture set up to be able to reason in such terms? I haven't looked at the
code.
> Better resource management in HiveServer2
> -----------------------------------------
>
> Key: HIVE-10725
> URL: https://issues.apache.org/jira/browse/HIVE-10725
> Project: Hive
> Issue Type: Improvement
> Components: HiveServer2, JDBC
> Affects Versions: 1.3.0
> Reporter: Vaibhav Gumashta
>
> We have various ways to control the number of queries that can be run on one
> HS2 instance (max threads, thread pool queuing etc). We also have ways to run
> multiple HS2 instances using dynamic service discovery. We should do a better
> job at:
> 1. Monitoring resource utilization (sessions, ophandles, memory, threads etc).
> 2. Being upfront with the client when we cannot accept new queries.
> 3. Throttling among different server instances in case dynamic service
> discovery is used.
> 4. Consolidating existing ways to control #queries into a simpler model.
> 5. Seeing if we can recommend reasonable values for OS resources, or providing
> alerts if we run out of those.
> 6. Providing health reports and a server status API (to get number of queries,
> sessions etc).
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)