IMPALA-6886: [DOCS] Removed impala_cluster_sizing.xml Change-Id: I03d605d33ed6ced809074b1fc96def30ad0887fd Reviewed-on: http://gerrit.cloudera.org:8080/10109 Reviewed-by: Alex Behm <[email protected]> Tested-by: Impala Public Jenkins <[email protected]>
Project: http://git-wip-us.apache.org/repos/asf/impala/repo Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/dfc17b86 Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/dfc17b86 Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/dfc17b86 Branch: refs/heads/2.x Commit: dfc17b86a2df2e6a4675bce2d64676a29ec59231 Parents: 0ec3cd7 Author: Alex Rodoni <[email protected]> Authored: Wed Apr 18 16:49:51 2018 -0700 Committer: Impala Public Jenkins <[email protected]> Committed: Thu Apr 19 22:10:21 2018 +0000 ---------------------------------------------------------------------- docs/impala.ditamap | 2 +- docs/topics/impala_cluster_sizing.xml | 371 ----------------------------- 2 files changed, 1 insertion(+), 372 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/impala/blob/dfc17b86/docs/impala.ditamap ---------------------------------------------------------------------- diff --git a/docs/impala.ditamap b/docs/impala.ditamap index 7ef7d47..ec6d313 100644 --- a/docs/impala.ditamap +++ b/docs/impala.ditamap @@ -47,7 +47,7 @@ under the License. </topicref> <topicref href="topics/impala_planning.xml"> <topicref href="topics/impala_prereqs.xml#prereqs"/> - <topicref href="topics/impala_cluster_sizing.xml"/> + <!-- Removed per Alan Choi's request on 4/18/2018 <topicref href="topics/impala_cluster_sizing.xml"/> --> <topicref href="topics/impala_schema_design.xml"/> </topicref> <topicref audience="standalone" href="topics/impala_install.xml#install"> http://git-wip-us.apache.org/repos/asf/impala/blob/dfc17b86/docs/topics/impala_cluster_sizing.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_cluster_sizing.xml b/docs/topics/impala_cluster_sizing.xml deleted file mode 100644 index 7b395c5..0000000 --- a/docs/topics/impala_cluster_sizing.xml +++ /dev/null @@ -1,371 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<!-- -Licensed to the Apache Software Foundation (ASF) under one -or more contributor license agreements. See the NOTICE file -distributed with this work for additional information -regarding copyright ownership. The ASF licenses this file -to you under the Apache License, Version 2.0 (the -"License"); you may not use this file except in compliance -with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, -software distributed under the License is distributed on an -"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -KIND, either express or implied. See the License for the -specific language governing permissions and limitations -under the License. ---> -<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> -<concept id="cluster_sizing"> - - <title>Cluster Sizing Guidelines for Impala</title> - <titlealts audience="PDF"><navtitle>Cluster Sizing</navtitle></titlealts> - <prolog> - <metadata> - <data name="Category" value="Impala"/> - <data name="Category" value="Clusters"/> - <data name="Category" value="Planning"/> - <data name="Category" value="Sizing"/> - <data name="Category" value="Deploying"/> - <!-- Hoist by my own petard. Memory is an important theme of this topic but that's in a <section> title. --> - <data name="Category" value="Sectionated Pages"/> - <data name="Category" value="Memory"/> - <data name="Category" value="Scalability"/> - <data name="Category" value="Proof of Concept"/> - <data name="Category" value="Requirements"/> - <data name="Category" value="Guidelines"/> - <data name="Category" value="Best Practices"/> - <data name="Category" value="Administrators"/> - </metadata> - </prolog> - - <conbody> - - <p> - <indexterm audience="hidden">cluster sizing</indexterm> - This document provides a very rough guideline to estimate the size of a cluster needed for a specific - customer application. You can use this information when planning how much and what type of hardware to - acquire for a new cluster, or when adding Impala workloads to an existing cluster. - </p> - - <note> - Before making purchase or deployment decisions, consult organizations with relevant experience - to verify the conclusions about hardware requirements based on your data volume and workload. - </note> - -<!-- <p outputclass="toc inpage"/> --> - - <p> - Always use hosts with identical specifications and capacities for all the nodes in the cluster. Currently, - Impala divides the work evenly between cluster nodes, regardless of their exact hardware configuration. - Because work can be distributed in different ways for different queries, if some hosts are overloaded - compared to others in terms of CPU, memory, I/O, or network, you might experience inconsistent performance - and overall slowness - </p> - - <p> - For analytic workloads with star/snowflake schemas, and using consistent hardware for all nodes (64 GB RAM, - 12 2 TB hard drives, 2x E5-2630L 12 cores total, 10 GB network), the following table estimates the number of - DataNodes needed in the cluster based on data size and the number of concurrent queries, for workloads - similar to TPC-DS benchmark queries: - </p> - - <table> - <title>Cluster size estimation based on the number of concurrent queries and data size with a 20 second average query response time</title> - <tgroup cols="6"> - <colspec colnum="1" colname="col1"/> - <colspec colnum="2" colname="col2"/> - <colspec colnum="3" colname="col3"/> - <colspec colnum="4" colname="col4"/> - <colspec colnum="5" colname="col5"/> - <colspec colnum="6" colname="col6"/> - <thead> - <row> - <entry> - Data Size - </entry> - <entry> - 1 query - </entry> - <entry> - 10 queries - </entry> - <entry> - 100 queries - </entry> - <entry> - 1000 queries - </entry> - <entry> - 2000 queries - </entry> - </row> - </thead> - <tbody> - <row> - <entry> - <b>250 GB</b> - </entry> - <entry> - 2 - </entry> - <entry> - 2 - </entry> - <entry> - 5 - </entry> - <entry> - 35 - </entry> - <entry> - 70 - </entry> - </row> - <row> - <entry> - <b>500 GB</b> - </entry> - <entry> - 2 - </entry> - <entry> - 2 - </entry> - <entry> - 10 - </entry> - <entry> - 70 - </entry> - <entry> - 135 - </entry> - </row> - <row> - <entry> - <b>1 TB</b> - </entry> - <entry> - 2 - </entry> - <entry> - 2 - </entry> - <entry> - 15 - </entry> - <entry> - 135 - </entry> - <entry> - 270 - </entry> - </row> - <row> - <entry> - <b>15 TB</b> - </entry> - <entry> - 2 - </entry> - <entry> - 20 - </entry> - <entry> - 200 - </entry> - <entry> - N/A - </entry> - <entry> - N/A - </entry> - </row> - <row> - <entry> - <b>30 TB</b> - </entry> - <entry> - 4 - </entry> - <entry> - 40 - </entry> - <entry> - 400 - </entry> - <entry> - N/A - </entry> - <entry> - N/A - </entry> - </row> - <row> - <entry> - <b>60 TB</b> - </entry> - <entry> - 8 - </entry> - <entry> - 80 - </entry> - <entry> - 800 - </entry> - <entry> - N/A - </entry> - <entry> - N/A - </entry> - </row> - </tbody> - </tgroup> - </table> - - <section id="sizing_factors"> - - <title>Factors Affecting Scalability</title> - - <p> - A typical analytic workload (TPC-DS style queries) using recommended hardware is usually CPU-bound. Each - node can process roughly 1.6 GB/sec. Both CPU-bound and disk-bound workloads can scale almost linearly with - cluster size. However, for some workloads, the scalability might be bounded by the network, or even by - memory. - </p> - - <p> - If the workload is already network bound (on a 10 GB network), increasing the cluster size wonât reduce - the network load; in fact, a larger cluster could increase network traffic because some queries involve - <q>broadcast</q> operations to all DataNodes. Therefore, boosting the cluster size does not improve query - throughput in a network-constrained environment. - </p> - - <p> - Letâs look at a memory-bound workload. A workload is memory-bound if Impala cannot run any additional - concurrent queries because all memory allocated has already been consumed, but neither CPU, disk, nor - network is saturated yet. This can happen because currently Impala uses only a single core per node to - process join and aggregation queries. For a node with 128 GB of RAM, if a join node takes 50 GB, the system - cannot run more than 2 such queries at the same time. - </p> - - <p> - Therefore, at most 2 cores are used. Throughput can still scale almost linearly even for a memory-bound - workload. Itâs just that the CPU will not be saturated. Per-node throughput will be lower than 1.6 - GB/sec. Consider increasing the memory per node. - </p> - - <p> - As long as the workload is not network- or memory-bound, we can use the 1.6 GB/second per node as the - throughput estimate. - </p> - </section> - - <section id="sizing_details"> - - <title>A More Precise Approach</title> - - <p> - A more precise sizing estimate would require not only queries per minute (QPM), but also an average data - size scanned per query (D). With the proper partitioning strategy, D is usually a fraction of the total - data size. The following equation can be used as a rough guide to estimate the number of nodes (N) needed: - </p> - -<codeblock>Eq 1: N > QPM * D / 100 GB -</codeblock> - - <p> - Here is an example. Suppose, on average, a query scans 50 GB of data and the average response time is - required to be 15 seconds or less when there are 100 concurrent queries. The QPM is 100/15*60 = 400. We can - estimate the number of node using our equation above. - </p> - -<codeblock>N > QPM * D / 100GB -N > 400 * 50GB / 100GB -N > 200 -</codeblock> - - <p> - Because this figure is a rough estimate, the corresponding number of nodes could be between 100 and 500. - </p> - - <p> - Depending on the complexity of the query, the processing rate of query might change. If the query has more - joins, aggregation functions, or CPU-intensive functions such as string processing or complex UDFs, the - process rate will be lower than 1.6 GB/second per node. On the other hand, if the query only does scan and - filtering on numbers, the processing rate can be higher. - </p> - </section> - - <section id="sizing_mem_estimate"> - - <title>Estimating Memory Requirements</title> - <!-- - <prolog> - <metadata> - <data name="Category" value="Memory"/> - </metadata> - </prolog> - --> - - <p> - Impala can handle joins between multiple large tables. Make sure that statistics are collected for all the - joined tables, using the <codeph><xref href="impala_compute_stats.xml#compute_stats">COMPUTE - STATS</xref></codeph> statement. However, joining big tables does consume more memory. Follow the steps - below to calculate the minimum memory requirement. - </p> - - <p> - Suppose you are running the following join: - </p> - -<codeblock>select a.*, b.col_1, b.col_2, ⦠b.col_n -from a, b -where a.key = b.key -and b.col_1 in (1,2,4...) -and b.col_4 in (....); -</codeblock> - - <p> - And suppose table <codeph>B</codeph> is smaller than table <codeph>A</codeph> (but still a large table). - </p> - - <p> - The memory requirement for the query is the right-hand table (<codeph>B</codeph>), after decompression, - filtering (<codeph>b.col_n in ...</codeph>) and after projection (only using certain columns) must be less - than the total memory of the entire cluster. - </p> - -<codeblock>Cluster Total Memory Requirement = Size of the smaller table * - selectivity factor from the predicate * - projection factor * compression ratio -</codeblock> - - <p> - In this case, assume that table <codeph>B</codeph> is 100 TB in Parquet format with 200 columns. The - predicate on <codeph>B</codeph> (<codeph>b.col_1 in ...and b.col_4 in ...</codeph>) will select only 10% of - the rows from <codeph>B</codeph> and for projection, we are only projecting 5 columns out of 200 columns. - Usually, Snappy compression gives us 3 times compression, so we estimate a 3x compression factor. - </p> - -<codeblock>Cluster Total Memory Requirement = Size of the smaller table * - selectivity factor from the predicate * - projection factor * compression ratio - = 100TB * 10% * 5/200 * 3 - = 0.75TB - = 750GB -</codeblock> - - <p> - So, if you have a 10-node cluster, each node has 128 GB of RAM and you give 80% to Impala, then you have 1 - TB of usable memory for Impala, which is more than 750GB. Therefore, your cluster can handle join queries - of this magnitude. - </p> - </section> - </conbody> -</concept>
