Repository: incubator-impala Updated Branches: refs/heads/master e0ba5ef6a -> 67b63f37e
IMPALA-5503: [DOCS] Document how to specify coordinator/executor nodes Cf. IMPALA-3807 and IMPALA-5147. Change-Id: Ia20db6af212122b1f87fc6999f8683860beb2bad Reviewed-on: http://gerrit.cloudera.org:8080/7237 Reviewed-by: John Russell <[email protected]> Tested-by: Impala Public Jenkins Project: http://git-wip-us.apache.org/repos/asf/incubator-impala/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-impala/commit/27b0a5e3 Tree: http://git-wip-us.apache.org/repos/asf/incubator-impala/tree/27b0a5e3 Diff: http://git-wip-us.apache.org/repos/asf/incubator-impala/diff/27b0a5e3 Branch: refs/heads/master Commit: 27b0a5e3bd5864277640a7c018b3d22bc169820a Parents: e0ba5ef Author: John Russell <[email protected]> Authored: Tue Jun 20 15:30:48 2017 -0700 Committer: Impala Public Jenkins <[email protected]> Committed: Mon Jun 26 18:01:51 2017 +0000 ---------------------------------------------------------------------- docs/impala_keydefs.ditamap | 1 + docs/topics/impala_components.xml | 6 ++ docs/topics/impala_new_features.xml | 10 +++ docs/topics/impala_scalability.xml | 112 +++++++++++++++++++++++++++++++ 4 files changed, 129 insertions(+) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/27b0a5e3/docs/impala_keydefs.ditamap ---------------------------------------------------------------------- diff --git a/docs/impala_keydefs.ditamap b/docs/impala_keydefs.ditamap index b82116e..7c9bb60 100644 --- a/docs/impala_keydefs.ditamap +++ b/docs/impala_keydefs.ditamap @@ -10926,6 +10926,7 @@ under the License. <keydef href="topics/impala_scalability.xml" keys="scalability"/> <keydef audience="hidden" href="topics/impala_scalability.xml#scalability_memory" keys="scalability_memory"/> <keydef href="topics/impala_scalability.xml#scalability_catalog" keys="scalability_catalog"/> + <keydef href="topics/impala_scalability.xml#scalability_coordinator" keys="scalability_coordinator"/> <keydef href="topics/impala_scalability.xml#statestore_scalability" keys="statestore_scalability"/> <keydef audience="hidden" href="topics/impala_scalability.xml#scalability_cluster_size" keys="scalability_cluster_size"/> <keydef audience="hidden" href="topics/impala_scalability.xml#concurrent_connections" keys="concurrent_connections"/> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/27b0a5e3/docs/topics/impala_components.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_components.xml b/docs/topics/impala_components.xml index 7480e04..63090d5 100644 --- a/docs/topics/impala_components.xml +++ b/docs/topics/impala_components.xml @@ -78,6 +78,12 @@ under the License. METADATA</codeph> statements that were needed to coordinate metadata across nodes prior to Impala 1.2. </p> + <p rev="2.9.0 IMPALA-3807 IMPALA-5147 IMPALA-5503"> + In <keyword keyref="impala29_full"/> and higher, you can control which hosts act as query coordinators + and which act as query executors, to improve scalability for highly concurrent workloads on large clusters. + See <xref keyref="scalability_coordinator"/> for details. + </p> + <p> <b>Related information:</b> <xref href="impala_config_options.xml#config_options"/>, <xref href="impala_processes.xml#processes"/>, <xref href="impala_timeouts.xml#impalad_timeout"/>, http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/27b0a5e3/docs/topics/impala_new_features.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_new_features.xml b/docs/topics/impala_new_features.xml index e6f2afa..c8d90cc 100644 --- a/docs/topics/impala_new_features.xml +++ b/docs/topics/impala_new_features.xml @@ -72,6 +72,16 @@ under the License. See <xref keyref="string_functions"/> for details. </p> </li> + <li> + <p rev="2.9.0 IMPALA-3807 IMPALA-5147 IMPALA-5503"> + Startup flags for the <cmdname>impalad</cmdname> daemon, <codeph>is_executor</codeph> + and <codeph>is_coordinator</codeph>, let you divide the work on a large, busy cluster + between a small number of hosts acting as query coordinators, and a larger number of + hosts acting as query executors. By default, each host can act in both roles, + potentially introducing bottlenecks during heavily concurrent workloads. + See <xref keyref="scalability_coordinator"/> for details. + </p> + </li> </ul> </conbody> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/27b0a5e3/docs/topics/impala_scalability.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_scalability.xml b/docs/topics/impala_scalability.xml index 0ccf0c0..af61e7a 100644 --- a/docs/topics/impala_scalability.xml +++ b/docs/topics/impala_scalability.xml @@ -305,6 +305,118 @@ Memory Usage: Additional Notes </conbody> </concept> + <concept id="scalability_coordinator" rev="2.9.0 IMPALA-3807 IMPALA-5147 IMPALA-5503"> + + <title>Controlling which Hosts are Coordinators and Executors</title> + + <conbody> + + <p> + By default, each host in the cluster that runs the <cmdname>impalad</cmdname> + daemon can act as the coordinator for an Impala query, execute the fragments + of the execution plan for the query, or both. During highly concurrent + workloads for large-scale queries, especially on large clusters, the dual + roles can cause scalability issues: + </p> + + <ul> + <li> + <p> + The extra work required for a host to act as the coordinator could interfere + with its capacity to perform other work for the earlier phases of the query. + For example, the coordinator can experience significant network and CPU overhead + during queries containing a large number of query fragments. Each coordinator + caches metadata for all table partitions and data files, which can be substantial + and contend with memory needed to process joins, aggregations, and other operations + performed by query executors. + </p> + </li> + <li> + <p> + Having a large number of hosts act as coordinators can cause unnecessary network + overhead, or even timeout errors, as each of those hosts communicates with the + <cmdname>statestored</cmdname> daemon for metadata updates. + </p> + </li> + <li> + <p> + The <q>soft limits</q> imposed by the admission control feature are more likely + to be exceeded when there are a large number of heavily loaded hosts acting as + coordinators. + </p> + </li> + </ul> + + <p> + If such scalability bottlenecks occur, you can explicitly specify that certain + hosts act as query coordinators, but not executors for query fragments. + These hosts do not participate in I/O-intensive operations such as scans, + and CPU-intensive operations such as aggregations. + </p> + + <p> + Then, you specify that the + other hosts act as executors but not coordinators. These hosts do not communicate + with the <cmdname>statestored</cmdname> daemon or process the final result sets + from queries. You cannot connect to these hosts through clients such as + <cmdname>impala-shell</cmdname> or business intelligence tools. + </p> + + <p> + This feature is available in <keyword keyref="impala29_full"/> and higher. + </p> + + <p> + To use this feature, you specify one of the following startup flags for the + <cmdname>impalad</cmdname> daemon on each host: + </p> + + <ul> + <li> + <p> + <codeph>is_executor=false</codeph> for each host that + does not act as an executor for Impala queries. + These hosts act exclusively as query coordinators. + This setting typically applies to a relatively small number of + hosts, because the most common topology is to have nearly all + DataNodes doing work for query execution. + </p> + </li> + <li> + <p> + <codeph>is_coordinator=false</codeph> for each host that + does not act as a coordinator for Impala queries. + These hosts act exclusively as executors. + The number of hosts with this setting typically increases + as the cluster grows larger and handles more table partitions, + data files, and concurrent queries. As the overhead for query + coordination increases, it becomes more important to centralize + that work on dedicated hosts. + </p> + </li> + </ul> + + <p> + By default, both of these settings are enabled for each <codeph>impalad</codeph> + instance, allowing all such hosts to act as both executors and coordinators. + </p> + + <p> + For example, on a 100-node cluster, you might specify <codeph>is_executor=false</codeph> + for 10 hosts, to dedicate those hosts as query coordinators. Then specify + <codeph>is_coordinator=false</codeph> for the remaining 90 hosts. All explicit or + load-balanced connections must go to the 10 hosts acting as coordinators. These hosts + perform the network communication to keep metadata up-to-date and route query results + to the appropriate clients. The remaining 90 hosts perform the intensive I/O, CPU, and + memory operations that make up the bulk of the work for each query. If a bottleneck or + other performance issue arises on a specific host, you can narrow down the cause more + easily because each host is dedicated to specific operations within the overall + Impala workload. + </p> + + </conbody> + </concept> + <concept audience="hidden" id="scalability_cluster_size"> <title>Scalability Considerations for Impala Cluster Size and Topology</title>
