Repository: incubator-impala
Updated Branches:
  refs/heads/master e0ba5ef6a -> 67b63f37e


IMPALA-5503: [DOCS] Document how to specify coordinator/executor nodes

Cf. IMPALA-3807 and IMPALA-5147.

Change-Id: Ia20db6af212122b1f87fc6999f8683860beb2bad
Reviewed-on: http://gerrit.cloudera.org:8080/7237
Reviewed-by: John Russell <[email protected]>
Tested-by: Impala Public Jenkins


Project: http://git-wip-us.apache.org/repos/asf/incubator-impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-impala/commit/27b0a5e3
Tree: http://git-wip-us.apache.org/repos/asf/incubator-impala/tree/27b0a5e3
Diff: http://git-wip-us.apache.org/repos/asf/incubator-impala/diff/27b0a5e3

Branch: refs/heads/master
Commit: 27b0a5e3bd5864277640a7c018b3d22bc169820a
Parents: e0ba5ef
Author: John Russell <[email protected]>
Authored: Tue Jun 20 15:30:48 2017 -0700
Committer: Impala Public Jenkins <[email protected]>
Committed: Mon Jun 26 18:01:51 2017 +0000

----------------------------------------------------------------------
 docs/impala_keydefs.ditamap         |   1 +
 docs/topics/impala_components.xml   |   6 ++
 docs/topics/impala_new_features.xml |  10 +++
 docs/topics/impala_scalability.xml  | 112 +++++++++++++++++++++++++++++++
 4 files changed, 129 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/27b0a5e3/docs/impala_keydefs.ditamap
----------------------------------------------------------------------
diff --git a/docs/impala_keydefs.ditamap b/docs/impala_keydefs.ditamap
index b82116e..7c9bb60 100644
--- a/docs/impala_keydefs.ditamap
+++ b/docs/impala_keydefs.ditamap
@@ -10926,6 +10926,7 @@ under the License.
   <keydef href="topics/impala_scalability.xml" keys="scalability"/>
   <keydef audience="hidden" 
href="topics/impala_scalability.xml#scalability_memory" 
keys="scalability_memory"/>
   <keydef href="topics/impala_scalability.xml#scalability_catalog" 
keys="scalability_catalog"/>
+  <keydef href="topics/impala_scalability.xml#scalability_coordinator" 
keys="scalability_coordinator"/>
   <keydef href="topics/impala_scalability.xml#statestore_scalability" 
keys="statestore_scalability"/>
   <keydef audience="hidden" 
href="topics/impala_scalability.xml#scalability_cluster_size" 
keys="scalability_cluster_size"/>
   <keydef audience="hidden" 
href="topics/impala_scalability.xml#concurrent_connections" 
keys="concurrent_connections"/>

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/27b0a5e3/docs/topics/impala_components.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_components.xml 
b/docs/topics/impala_components.xml
index 7480e04..63090d5 100644
--- a/docs/topics/impala_components.xml
+++ b/docs/topics/impala_components.xml
@@ -78,6 +78,12 @@ under the License.
         METADATA</codeph> statements that were needed to coordinate metadata 
across nodes prior to Impala 1.2.
       </p>
 
+      <p rev="2.9.0 IMPALA-3807 IMPALA-5147 IMPALA-5503">
+        In <keyword keyref="impala29_full"/> and higher, you can control which 
hosts act as query coordinators
+        and which act as query executors, to improve scalability for highly 
concurrent workloads on large clusters.
+        See <xref keyref="scalability_coordinator"/> for details.
+      </p>
+
       <p>
         <b>Related information:</b> <xref 
href="impala_config_options.xml#config_options"/>,
         <xref href="impala_processes.xml#processes"/>, <xref 
href="impala_timeouts.xml#impalad_timeout"/>,

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/27b0a5e3/docs/topics/impala_new_features.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_new_features.xml 
b/docs/topics/impala_new_features.xml
index e6f2afa..c8d90cc 100644
--- a/docs/topics/impala_new_features.xml
+++ b/docs/topics/impala_new_features.xml
@@ -72,6 +72,16 @@ under the License.
             See <xref keyref="string_functions"/> for details.
           </p>
         </li>
+        <li>
+          <p rev="2.9.0 IMPALA-3807 IMPALA-5147 IMPALA-5503">
+            Startup flags for the <cmdname>impalad</cmdname> daemon, 
<codeph>is_executor</codeph>
+            and <codeph>is_coordinator</codeph>, let you divide the work on a 
large, busy cluster
+            between a small number of hosts acting as query coordinators, and 
a larger number of
+            hosts acting as query executors. By default, each host can act in 
both roles,
+            potentially introducing bottlenecks during heavily concurrent 
workloads.
+            See <xref keyref="scalability_coordinator"/> for details.
+          </p>
+        </li>
       </ul>
 
     </conbody>

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/27b0a5e3/docs/topics/impala_scalability.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_scalability.xml 
b/docs/topics/impala_scalability.xml
index 0ccf0c0..af61e7a 100644
--- a/docs/topics/impala_scalability.xml
+++ b/docs/topics/impala_scalability.xml
@@ -305,6 +305,118 @@ Memory Usage: Additional Notes
     </conbody>
   </concept>
 
+  <concept id="scalability_coordinator" rev="2.9.0 IMPALA-3807 IMPALA-5147 
IMPALA-5503">
+
+    <title>Controlling Which Hosts Are Coordinators and Executors</title>
+
+    <conbody>
+
+      <p>
+        By default, each host in the cluster that runs the 
<cmdname>impalad</cmdname>
+        daemon can act as the coordinator for an Impala query, execute the 
fragments
+        of the execution plan for the query, or both. During highly concurrent
+        workloads for large-scale queries, especially on large clusters, the 
dual
+        roles can cause scalability issues:
+      </p>
+
+      <ul>
+        <li>
+          <p>
+            The extra work required for a host to act as the coordinator can interfere
+            with its capacity to perform other work for the earlier phases of the query.
+            For example, the coordinator can experience significant network and CPU overhead
+            during queries containing a large number of query fragments. Each coordinator also
+            caches metadata for all table partitions and data files; this cache can be substantial
+            and competes for the memory needed to process joins, aggregations, and other operations
+            performed by query executors.
+          </p>
+        </li>
+        <li>
+          <p>
+            Having a large number of hosts act as coordinators can cause 
unnecessary network
+            overhead, or even timeout errors, as each of those hosts 
communicates with the
+            <cmdname>statestored</cmdname> daemon for metadata updates.
+          </p>
+        </li>
+        <li>
+          <p>
+            The <q>soft limits</q> imposed by the admission control feature 
are more likely
+            to be exceeded when there are a large number of heavily loaded 
hosts acting as
+            coordinators.
+          </p>
+        </li>
+      </ul>
+
+      <p>
+        If such scalability bottlenecks occur, you can explicitly specify that certain
+        hosts act as query coordinators, but not as executors for query fragments.
+        These hosts do not participate in I/O-intensive operations such as scans
+        or CPU-intensive operations such as aggregations.
+      </p>
+
+      <p>
+        Then, you specify that the other hosts act as executors but not coordinators.
+        These hosts do not communicate with the <cmdname>statestored</cmdname> daemon
+        or process the final result sets from queries. You cannot connect to these
+        hosts through clients such as <cmdname>impala-shell</cmdname> or
+        business intelligence tools.
+      </p>
+
+      <p>
+        This feature is available in <keyword keyref="impala29_full"/> and 
higher.
+      </p>
+
+      <p>
+        To use this feature, you specify one of the following startup flags 
for the
+        <cmdname>impalad</cmdname> daemon on each host:
+      </p>
+
+      <ul>
+        <li>
+          <p>
+            <codeph>is_executor=false</codeph> for each host that
+            does not act as an executor for Impala queries.
+            These hosts act exclusively as query coordinators.
+            This setting typically applies to a relatively small number of
+            hosts, because the most common topology is to have nearly all
+            DataNodes doing work for query execution.
+          </p>
+        </li>
+        <li>
+          <p>
+            <codeph>is_coordinator=false</codeph> for each host that
+            does not act as a coordinator for Impala queries.
+            These hosts act exclusively as executors.
+            The number of hosts with this setting typically increases
+            as the cluster grows larger and handles more table partitions,
+            data files, and concurrent queries. As the overhead for query
+            coordination increases, it becomes more important to centralize
+            that work on dedicated hosts.
+          </p>
+        </li>
+      </ul>
+
+      <p>
+        By default, both of these settings are enabled for each <cmdname>impalad</cmdname>
+        instance, allowing all such hosts to act as both executors and coordinators.
+      </p>
+
+      <p>
+        For example, on a 100-node cluster, you might specify <codeph>is_executor=false</codeph>
+        for 10 hosts, to dedicate those hosts as query coordinators. Then specify
+        <codeph>is_coordinator=false</codeph> for the remaining 90 hosts. All client connections,
+        whether explicit or load-balanced, must go to the 10 hosts acting as coordinators. These hosts
+        perform the network communication to keep metadata up to date and route query results
+        to the appropriate clients. The remaining 90 hosts perform the intensive I/O, CPU, and
+        memory operations that make up the bulk of the work for each query. If a bottleneck or
+        other performance issue arises on a specific host, you can narrow down the cause more
+        easily because each host is dedicated to specific operations within the overall
+        Impala workload.
+      </p>
+
+    </conbody>
+  </concept>
+
   <concept audience="hidden" id="scalability_cluster_size">
 
     <title>Scalability Considerations for Impala Cluster Size and 
Topology</title>
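
Illustrative note on the 100-node example in the scalability topic: the
10/90 split described there can be sketched as a small planning script.
This is only a sketch under stated assumptions, not part of the commit.
The helper name impalad_role_flags and the host names are hypothetical,
and the leading dash on each flag assumes the usual command-line form
for impalad startup flags; the patch itself only gives the settings as
is_executor=false and is_coordinator=false.

```python
# Hypothetical planning helper; not part of Impala or of this commit.
def impalad_role_flags(hosts, num_coordinators):
    """Map each host to the startup flag that dedicates it to one role.

    The first `num_coordinators` hosts become dedicated coordinators
    (is_executor=false); every other host becomes a dedicated executor
    (is_coordinator=false).
    """
    if not 0 < num_coordinators < len(hosts):
        raise ValueError("need at least one coordinator and one executor")
    flags = {}
    for index, host in enumerate(hosts):
        if index < num_coordinators:
            # Coordinator only: does not execute query fragments.
            flags[host] = "-is_executor=false"
        else:
            # Executor only: does not accept client connections.
            flags[host] = "-is_coordinator=false"
    return flags


# Hypothetical host names for a 100-node cluster.
hosts = [f"impala-host-{n:03d}.example.com" for n in range(1, 101)]
flags = impalad_role_flags(hosts, num_coordinators=10)

# Clients and load balancers should target only these hosts.
coordinators = [h for h, f in flags.items() if f == "-is_executor=false"]
print(len(coordinators), "coordinators,", len(hosts) - len(coordinators), "executors")
```

Hosts that should keep the default dual role would simply receive
neither flag; the sketch omits that case to stay close to the example
in the documentation.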
