[5/9] impala git commit: IMPALA-6459: [DOCS] Part 2: Stats extrapolation and sampling.

sailesh Mon, 23 Apr 2018 10:40:12 -0700

IMPALA-6459: [DOCS] Part 2: Stats extrapolation and sampling.

Adds new materials under COMPUTE STATS describing
the experimental stats extrapolation and sampling
features.


More cleanup and examples are needed. This patch provides
a reasonable starting point which we can extend.

Change-Id: Idae7a377b5873701e91f60afa62dde2bd8aacd1b
Reviewed-on: http://gerrit.cloudera.org:8080/10112
Reviewed-by: Alex Behm <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


Project: http://git-wip-us.apache.org/repos/asf/impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/7134d812
Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/7134d812
Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/7134d812

Branch: refs/heads/master
Commit: 7134d812b57a734c194cc94d32f1e212ed0f17cd
Parents: 34b2f21
Author: Alex Behm <[email protected]>
Authored: Tue Apr 17 17:12:17 2018 -0700
Committer: Impala Public Jenkins <[email protected]>
Committed: Fri Apr 20 19:45:46 2018 +0000

----------------------------------------------------------------------
 docs/topics/impala_perf_stats.xml | 135 +++++++++++++++++++++++++++++++++
 1 file changed, 135 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/impala/blob/7134d812/docs/topics/impala_perf_stats.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_perf_stats.xml 
b/docs/topics/impala_perf_stats.xml
index f503a68..dab2eb8 100644
--- a/docs/topics/impala_perf_stats.xml
+++ b/docs/topics/impala_perf_stats.xml
@@ -392,6 +392,12 @@ show column stats year_month_day;
               This feature is available since Impala 2.8.
             </p>
           </li>
+          <li>
+            <p>
+              Consider the experimental extrapolation and sampling features 
(see below)
+              to further increase the efficiency of computing stats.
+            </p>
+          </li>
           </ul>
         </p>
 
@@ -415,6 +421,135 @@ show column stats year_month_day;
 
       </conbody>
 
+      <concept id="experimental_stats_features">
+        <title>Experimental: Extrapolation and Sampling</title>
+        <conbody>
+          <p>
+            Impala 2.12 and higher includes two experimental features to 
alleviate
+            common issues for computing and maintaining statistics on very 
large tables.
+            The following shortcomings are improved upon:
+            <ul>
+            <li>
+              <p>
+                Newly added partitions do not have row count statistics. Table 
scans
+                that only access those new partitions are treated as not 
having stats.
+                Similarly, table scans that access both new and old partitions 
estimate
+                the scan cardinality based on those old partitions that have 
stats, and
+                the new partitions without stats are treated as having 0 rows.
+              </p>
+            </li>
+            <li>
+              <p>
+                The row counts of existing partitions become stale when data 
is added
+                or dropped.
+              </p>
+            </li>
+            <li>
+              <p>
+                Computing stats for tables with a 100,000 or more partitions 
might fail
+                or be very slow due to the high cost of updating the partition 
metadata
+                in the Hive Metastore.
+              </p>
+            </li>
+            <li>
+              <p>
+                With transient compute resources it is important to minimize 
the time
+                from starting a new cluster to successfully running queries.
+                Since the cluster might be relatively short-lived, users might 
prefer to
+                quickly collect stats that are "good enough" as opposed to 
spending
+                a lot of time and resouces on computing full-fidelity stats.
+              </p>
+            </li>
+            </ul>
+            For very large tables, it is often wasteful or impractical to run 
a full
+            COMPUTE STATS to address the scenarios above on a frequent basis.
+          </p>
+          <p>
+            The sampling feature makes COMPUTE STATS more efficient by 
processing a
+            fraction of the table data, and the extrapolation feature aims to 
reduce
+            the frequency at which COMPUTE STATS needs to be re-run by 
estimating
+            the row count of new and modified partitions.
+          </p>
+          <p>
+            The sampling and extrapolation features are disabled by default.
+            They can be enabled globally or for specific tables, as follows.
+            Set the impalad start-up configuration 
"--enable_stats_extrapolation" to
+            enable the features globally. To enable them only for a specific 
table, set
+            the "impala.enable.stats.extrapolation" table property to "true" 
for the
+            desired table. The tbale-level property overrides the global 
setting, so
+            it is also possible to enable sampling and extrapolation globally, 
but
+            disable it for specific tables by setting the table property to 
"false".
+            Example:
+            ALTER TABLE mytable test_table SET 
TBLPROPERTIES("impala.enable.stats.extrapolation"="true")
+          </p>
+          <note>
+            Why are these features experimental? Due to their probabilistic 
nature
+            it is possible that these features perform pathologically poorly 
on tables
+            with extreme data/file/size distributions. Since it is not 
feasible for us
+            to test all possible scenarios we only cautiously advertise these 
new
+            capabilities. That said, the features have been thoroughly tested 
and
+            are considered functionally stable. If you decide to give these 
features
+            a try, please tell us about your experience at 
[email protected]!
+            We rely on user feedback to guide future inprovements in statistics
+            collection.
+          </note>
+        </conbody>
+
+        <concept id="experimental_stats_extrapolation">
+          <title>Stats Extrapolation</title>
+          <conbody>
+            <p>
+              The main idea of stats extrapolation is to estimate the row 
count of new
+              and modified partitions based on the result of the last COMPUTE 
STATS.
+              Enabling stats extrapolation changes the behavior of COMPUTE 
STATS,
+              as well as the cardinality estimation of table scans. COMPUTE 
STATS no
+              longer computes and stores per-partition row counts, and 
instead, only
+              computes a table-level row count together with the total number 
of file
+              bytes in the table at that time. No partition metadata is 
modified. The
+              input cardinality of a table scan is estimated by converting the 
data
+              volume of relevant partitions to a row count, based on the 
table-level
+              row count and file bytes statistics. It is assumed that within 
the same
+              table, different sets of files with the same data volume 
correspond
+              to the similar number of rows on average. With extrapolation 
enabled,
+              the scan cardinality estimation ignores per-partition row 
counts. It
+              only relies on the table-level statistics and the scanned data 
volume.
+            </p>
+            <p>
+              The SHOW TABLE STATS and EXPLAIN commands distinguish between 
row counts
+              stored in the Hive Metastore, and the row counts extrapolated 
based on the
+              above process. Consult the SHOW TABLE STATS and EXPLAIN 
documentation
+              for more details.
+            </p>
+          </conbody>
+        </concept>
+
+        <concept id="experimental_stats_sampling">
+          <title>Sampling</title>
+          <conbody>
+            <p>
+              A TABLESAMPLE clause may be added to COMPUTE STATS to limit the
+              percentage of data to be processed. The final statistics are 
obtained
+              by extrapolating the statistics from the data sample over the 
entire table.
+              The extrapolated statistics are stored in the Hive Metastore, 
just as if no
+              sampling was used. The following example runs COMPUTE STATS over 
a 10 percent
+              data sample: COMPUTE STATS test_table TABLESAMPLE SYSTEM(10)
+            </p>
+            <p>
+            We have found that a 10 percent sampling rate typically offers a 
good
+            tradeoff between statistics accuracy and execution cost. A 
sampling rate
+            well below 10 percent has shown poor results and is not 
recommended.
+            </p>
+            <note type="important">
+              Sampling-based techniques sacrifice result accuracy for execution
+              efficiency, so your mileage may vary for different tables and 
columns
+              depending on their data distribution. The extrapolation 
procedure Impala
+              uses for estimating the number of distinct values per column is 
inherently
+              non-detetministic, so your results may even vary between runs of
+              COMPUTE STATS TABLESAMPLE, even if no data has changed.
+            </note>
+          </conbody>
+        </concept>
+      </concept>
     </concept>
 
     <concept id="concept_bmk_pfl_mdb">

[5/9] impala git commit: IMPALA-6459: [DOCS] Part 2: Stats extrapolation and sampling.

Reply via email to