Author: jamestaylor
Date: Sat Mar 25 01:44:50 2017
New Revision: 1788628
URL: http://svn.apache.org/viewvc?rev=1788628&view=rev
Log:
Add Tuning Guide for Phoenix (Peter Conrad)
Added:
phoenix/site/publish/tuning_guide.html
phoenix/site/source/src/site/markdown/tuning_guide.md
Added: phoenix/site/publish/tuning_guide.html
URL:
http://svn.apache.org/viewvc/phoenix/site/publish/tuning_guide.html?rev=1788628&view=auto
==============================================================================
--- phoenix/site/publish/tuning_guide.html (added)
+++ phoenix/site/publish/tuning_guide.html Sat Mar 25 01:44:50 2017
@@ -0,0 +1,678 @@
+
+<!DOCTYPE html>
+<!--
+ Generated by Apache Maven Doxia at 2017-03-24
+ Rendered using Reflow Maven Skin 1.1.0
(http://andriusvelykis.github.io/reflow-maven-skin)
+-->
+<html xml:lang="en" lang="en">
+
+ <head>
+ <meta charset="UTF-8" />
+ <title>Tuning Guide | Apache Phoenix</title>
+ <meta name="viewport" content="width=device-width,
initial-scale=1.0" />
+ <meta name="description" content="" />
+ <meta http-equiv="content-language" content="en" />
+
+ <link
href="//netdna.bootstrapcdn.com/bootswatch/2.3.2/flatly/bootstrap.min.css"
rel="stylesheet" />
+ <link
href="//netdna.bootstrapcdn.com/twitter-bootstrap/2.3.1/css/bootstrap-responsive.min.css"
rel="stylesheet" />
+ <link href="./css/bootswatch.css" rel="stylesheet" />
+ <link href="./css/reflow-skin.css" rel="stylesheet" />
+
+ <link href="//yandex.st/highlightjs/7.5/styles/default.min.css"
rel="stylesheet" />
+
+ <link href="./css/lightbox.css" rel="stylesheet" />
+
+ <link href="./css/site.css" rel="stylesheet" />
+ <link href="./css/print.css" rel="stylesheet" media="print" />
+
+ <!-- Le HTML5 shim, for IE6-8 support of HTML5 elements -->
+ <!--[if lt IE 9]>
+ <script
src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
+ <![endif]-->
+
+
+
+ </head>
+
+ <body class="page-tuning_guide project-phoenix-site" data-spy="scroll"
data-offset="60" data-target="#toc-scroll-target">
+
+ <div class="navbar navbar-fixed-top">
+ <div class="navbar-inner">
+ <div class="container">
+ <a class="btn btn-navbar"
data-toggle="collapse" data-target="#top-nav-collapse">
+ <span class="icon-bar"></span>
+ <span class="icon-bar"></span>
+ <span class="icon-bar"></span>
+ </a>
+ <a class="brand" href="index.html"><div
class="xtoplogo"></div></a>
+ <div class="nav-collapse collapse"
id="top-nav-collapse">
+ <ul class="nav pull-right">
+ <li class="dropdown">
+ <a href="#"
class="dropdown-toggle" data-toggle="dropdown">About <b class="caret"></b></a>
+ <ul
class="dropdown-menu">
+ <li ><a
href="index.html" title="Overview">Overview</a></li>
+ <li ><a
href="who_is_using.html" title="Who is Using">Who is Using</a></li>
+ <li ><a
href="recent.html" title="New Features">New Features</a></li>
+ <li ><a
href="roadmap.html" title="Roadmap">Roadmap</a></li>
+ <li ><a
href="performance.html" title="Performance">Performance</a></li>
+ <li ><a
href="team.html" title="Team">Team</a></li>
+ <li ><a
href="resources.html" title="Presentations">Presentations</a></li>
+ <li ><a
href="mailing_list.html" title="Mailing Lists">Mailing Lists</a></li>
+ <li ><a
href="source.html" title="Source Repository">Source Repository</a></li>
+ <li ><a
href="issues.html" title="Issue Tracking">Issue Tracking</a></li>
+ <li ><a
href="download.html" title="Download">Download</a></li>
+ <li ><a
href="installation.html" title="Installation">Installation</a></li>
+ <li
class="divider"/>
+ <li ><a
href="contributing.html" title="How to Contribute">How to Contribute</a></li>
+ <li ><a
href="develop.html" title="How to Develop">How to Develop</a></li>
+ <li ><a
href="building_website.html" title="How to Update Website">How to Update
Website</a></li>
+ <li ><a
href="release.html" title="How to Release">How to Release</a></li>
+ <li
class="divider"/>
+ <li ><a
href="http://www.apache.org/licenses/" title="License"
class="externalLink">License</a></li>
+ </ul>
+ </li>
+ <li class="dropdown">
+ <a href="#"
class="dropdown-toggle" data-toggle="dropdown">Using <b class="caret"></b></a>
+ <ul
class="dropdown-menu">
+ <li ><a
href="faq.html" title="F.A.Q.">F.A.Q.</a></li>
+ <li ><a
href="Phoenix-in-15-minutes-or-less.html" title="Quick Start">Quick
Start</a></li>
+ <li ><a
href="building.html" title="Building">Building</a></li>
+ <li
class="active"><a href="" title="Tuning">Tuning</a></li>
+ <li ><a
href="tuning.html" title="Configuration">Configuration</a></li>
+ <li ><a
href="upgrading.html" title="Backward Compatibility">Backward
Compatibility</a></li>
+ <li ><a
href="release_notes.html" title="Release Notes">Release Notes</a></li>
+ <li ><a
href="pherf.html" title="Performance Testing">Performance Testing</a></li>
+ <li
class="divider"/>
+ <li ><a
href="phoenix_spark.html" title="Apache Spark Integration">Apache Spark
Integration</a></li>
+ <li ><a
href="hive_storage_handler.html" title="Phoenix Storage Handler for Apache
Hive">Phoenix Storage Handler for Apache Hive</a></li>
+ <li ><a
href="pig_integration.html" title="Apache Pig Integration">Apache Pig
Integration</a></li>
+ <li ><a
href="phoenix_mr.html" title="Map Reduce Integration">Map Reduce
Integration</a></li>
+ <li ><a
href="flume.html" title="Apache Flume Plugin">Apache Flume Plugin</a></li>
+ <li ><a
href="kafka.html" title="Apache Kafka Plugin">Apache Kafka Plugin</a></li>
+ </ul>
+ </li>
+ <li class="dropdown">
+ <a href="#"
class="dropdown-toggle" data-toggle="dropdown">Addons <b class="caret"></b></a>
+ <ul
class="dropdown-menu">
+ <li ><a
href="http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-phoenix.html"
title="Phoenix on Amazon EMR" class="externalLink">Phoenix on Amazon
EMR</a></li>
+ <li ><a
href="http://python-phoenixdb.readthedocs.io/en/latest" title="Phoenix Adapter
for Python" class="externalLink">Phoenix Adapter for Python</a></li>
+ <li ><a
href="phoenix_orm.html" title="Phoenix ORM Library">Phoenix ORM Library</a></li>
+ </ul>
+ </li>
+ <li class="dropdown">
+ <a href="#"
class="dropdown-toggle" data-toggle="dropdown">Features <b
class="caret"></b></a>
+ <ul
class="dropdown-menu">
+ <li ><a
href="transactions.html" title="Transactions">Transactions</a></li>
+ <li ><a
href="udf.html" title="User-defined Functions">User-defined Functions</a></li>
+ <li
class="divider"/>
+ <li ><a
href="secondary_indexing.html" title="Secondary Indexes">Secondary
Indexes</a></li>
+ <li ><a
href="columnencoding.html" title="Storage Formats">Storage Formats</a></li>
+ <li ><a
href="atomic_upsert.html" title="Atomic Upsert">Atomic Upsert</a></li>
+ <li ><a
href="namspace_mapping.html" title="Namespace Mapping">Namespace
Mapping</a></li>
+ <li ><a
href="update_statistics.html" title="Statistics Collection">Statistics
Collection</a></li>
+ <li ><a
href="rowtimestamp.html" title="Row Timestamp Column">Row Timestamp
Column</a></li>
+ <li ><a
href="paged.html" title="Paged Queries">Paged Queries</a></li>
+ <li ><a
href="salted.html" title="Salted Tables">Salted Tables</a></li>
+ <li ><a
href="skip_scan.html" title="Skip Scan">Skip Scan</a></li>
+ <li
class="divider"/>
+ <li ><a
href="views.html" title="Views">Views</a></li>
+ <li ><a
href="multi-tenancy.html" title="Multi tenancy">Multi tenancy</a></li>
+ <li ><a
href="dynamic_columns.html" title="Dynamic Columns">Dynamic Columns</a></li>
+ <li
class="divider"/>
+ <li ><a
href="bulk_dataload.html" title="Bulk Loading">Bulk Loading</a></li>
+ <li ><a
href="server.html" title="Query Server">Query Server</a></li>
+ <li ><a
href="tracing.html" title="Tracing">Tracing</a></li>
+ <li ><a
href="metrics.html" title="Metrics">Metrics</a></li>
+ </ul>
+ </li>
+ <li class="dropdown">
+ <a href="#"
class="dropdown-toggle" data-toggle="dropdown">Reference <b
class="caret"></b></a>
+ <ul
class="dropdown-menu">
+ <li ><a
href="language/index.html" title="Grammar">Grammar</a></li>
+ <li ><a
href="language/functions.html" title="Functions">Functions</a></li>
+ <li ><a
href="language/datatypes.html" title="Datatypes">Datatypes</a></li>
+ <li ><a
href="array_type.html" title="ARRAY type">ARRAY type</a></li>
+ <li
class="divider"/>
+ <li ><a
href="sequences.html" title="Sequences">Sequences</a></li>
+ <li ><a
href="joins.html" title="Joins">Joins</a></li>
+ <li ><a
href="subqueries.html" title="Subqueries">Subqueries</a></li>
+ </ul>
+ </li>
+ </ul>
+ </div><!--/.nav-collapse -->
+ </div>
+ </div>
+ </div>
+
+ <div class="container">
+
+ <!-- Masthead
+ ================================================== -->
+
+ <header>
+ </header>
+
+ <div class="main-body">
+ <div class="row">
+ <div class="span12">
+ <div class="body-content">
+<div class="page-header">
+ <h1>Tuning Guide</h1>
+</div>
+<p>Tuning Phoenix can be complex, but with a little knowledge of how it works you can significantly improve the performance of your reads and writes. The most important factor in performance is the design of your schema, especially as it affects the underlying HBase row keys. Look in "General Tips" below to find design advice for different anticipated data access patterns. Subsequent sections describe how to use secondary indexes, hints, and explain plans.</p>
+<h1>Primary Keys</h1>
+<p>The underlying row key design is the single most important factor in Phoenix performance, and it's important to get it right at design time because you cannot change it later without re-writing the data and index tables.</p>
+<p>The Phoenix primary keys are concatenated to create the underlying row key in Apache HBase. The columns for the primary key constraint should be chosen and ordered to match your common query patterns: choose the most frequently queried columns as primary keys. The column you place in the leading position is the most performant one to filter on. For example, if you lead off with a column containing org ID values, it is easy to select all rows pertaining to a specific org. You can add the HBase row timestamp to the primary key to improve scan efficiency by skipping rows outside the queried time range.</p>
+<p>Every primary key imposes a cost because the entire row key is appended to every piece of data in memory and on disk. The larger the row key, the greater the storage overhead. Find ways to store information compactly in columns you plan to use for primary keys; for example, store deltas instead of complete timestamps.</p>
+<p>To sum up, the best practice is to design primary keys to add up to a row
key that lets you scan the smallest amount of data.</p>
+<p><b><i>Tip:</i></b> When choosing primary keys, lead with the column you filter most frequently across the queries that are most important to optimize. If you will use <tt>ORDER BY</tt> in your query, make sure your PK columns match the expressions in your <tt>ORDER BY</tt> clause.</p>
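+<p>For example, a hypothetical table that is most often filtered by org ID and then by a time range might be declared like this (the table and column names are illustrative only):</p>
+<div class="source">
+ <pre>CREATE TABLE trans.event (
+    org_id      CHAR(15) NOT NULL,
+    event_date  DATE NOT NULL,
+    event_id    BIGINT NOT NULL,
+    payload     VARCHAR,
+    CONSTRAINT pk PRIMARY KEY (org_id, event_date, event_id)
+);
+</pre>
+</div>
+<p>A query such as <tt>SELECT … WHERE org_id = ? AND event_date &gt; ?</tt> can then be served by a range scan over a small slice of the row key space, and an <tt>ORDER BY org_id, event_date</tt> needs no extra sort.</p>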
+<div class="section">
+ <div class="section">
+ <h3 id="Monotonically_increasing_Primary_keys">Monotonically increasing
Primary keys</h3>
+ <p>If your primary keys are monotonically increasing, use salting to help
distribute writes across the cluster and improve parallelization. Example:</p>
+ <p><tt>CREATE TABLE … ( … ) SALT_BUCKETS = N</tt></p>
+ <p>For optimal performance, the number of salt buckets should approximately equal the number of region servers. Do not salt automatically; use salting only when you are experiencing hotspotting. The downside of salting is that it imposes a cost on reads, because a range scan has to run multiple scans (one per bucket) and merge the results.</p>
+ <h1>General Tips</h1>
+ <p>The following sections provide a few general tips for different access
scenarios.</p>
+ </div>
+ <div class="section">
+ <h3 id="Is_the_Data_Random-Access">Is the Data Random-Access?</h3>
+ <ul>
+ <li>As with any random read workloads, SSDs can improve performance because
of their faster random seek time.</li>
+ </ul>
+ </div>
+ <div class="section">
+ <h3 id="Is_the_data_read-heavy_or_write-heavy">Is the data read-heavy or
write-heavy?</h3>
+ <ul>
+ <li>For read-heavy data:
+ <ul>
+ <li>Create global indexes. This will affect write speed depending on the
number of columns included in an index because each index writes to its own
separate table.</li>
+ <li>Use multiple indexes to provide fast access to common queries.</li>
+ <li>When specifying machines for HBase, do not skimp on cores; HBase
needs them.</li>
+ </ul></li>
+ <li>For write-heavy data:
+ <ul>
+ <li>Pre-split the table. It can be helpful to split the table into pre-defined regions, or, if the keys are monotonically increasing, to use salting to avoid creating write hotspots on a small number of nodes. Use real data types rather than raw byte data.</li>
+ <li>Create local indexes. Reads from local indexes have a performance penalty, so it's important to do performance testing. See the <a class="externalLink" href="https://phoenix.apache.org/pherf.html">Pherf</a> tool.</li>
+ </ul></li>
+ </ul>
+ </div>
+ <div class="section">
+ <h3 id="Which_columns_will_be_accessed_often">Which columns will be accessed
often?</h3>
+ <ul>
+ <li>Choose commonly-queried columns as primary keys. For more information, see "Primary Keys," above.
+ <ul>
+ <li>Create additional indexes to support common query patterns, including
heavily accessed fields that are not in the primary key.</li>
+ </ul></li>
+ </ul>
+ </div>
+ <div class="section">
+ <h3 id="Can_the_data_be_append-only_immutable">Can the data be append-only
(immutable)?</h3>
+ <ul>
+ <li>If the data is immutable or append-only, declare the table and its
indexes as immutable using the <tt>IMMUTABLE_ROWS</tt> <a class="externalLink"
href="http://phoenix.apache.org/language/index.html#options">option</a> at
creation time to reduce the write-time cost. If you need to make an existing
table immutable, you can do so with <tt>ALTER TABLE trans.event SET
IMMUTABLE_ROWS=true</tt> after creation time.
+ <ul>
+ <li>If speed is more important than data integrity, you can use the
<tt>DISABLE_WAL</tt> <a class="externalLink"
href="http://phoenix.apache.org/language/index.html#options">option</a>. Note:
it is possible to lose data with <tt>DISABLE_WAL</tt> if a region server
fails.</li>
+ </ul></li>
+ <li>Set the <tt>UPDATE_CACHE_FREQUENCY</tt> <a class="externalLink" href="http://phoenix.apache.org/language/index.html#options">option</a> to 15 minutes or so if your metadata doesn't change very often. This property determines how often an RPC is done to ensure you're seeing the latest schema.</li>
+ <li>If the data is not sparse (over 50% of the cells have values), use the SINGLE_CELL_ARRAY_WITH_OFFSETS data encoding scheme introduced in Phoenix 4.10, which obtains faster performance by reducing the size of the data. For more information, see "<a class="externalLink" href="https://blogs.apache.org/phoenix/entry/column-mapping-and-immutable-data">Column Mapping and Immutable Data Encoding</a>" on the Apache Phoenix blog.</li>
+ </ul>
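+ <p>As a minimal sketch, an append-only table could be declared with these options at creation time (the table name is illustrative, and the frequency is in milliseconds):</p>
+ <div class="source">
+ <pre>CREATE TABLE trans.event ( … )
+    IMMUTABLE_ROWS = true,
+    UPDATE_CACHE_FREQUENCY = 900000; -- 15 minutes
+</pre>
+ </div>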
+ </div>
+ <div class="section">
+ <h3 id="Is_the_table_very_large">Is the table very large?</h3>
+ <ul>
+ <li>Use the <tt>ASYNC</tt> keyword with your <tt>CREATE INDEX</tt> call to create the index asynchronously via a MapReduce job. You'll need to start the job manually; see <a class="externalLink" href="https://phoenix.apache.org/secondary_indexing.html#Index_Population">https://phoenix.apache.org/secondary_indexing.html#Index_Population</a> for details.</li>
+ <li>If the data is too large to scan the table completely, use primary keys to create an underlying composite row key that makes it easy to return a subset of the data or that facilitates <a class="externalLink" href="https://phoenix.apache.org/skip_scan.html">skip-scanning</a>: Phoenix can jump directly to matching keys when the query includes key sets in the predicate.</li>
+ </ul>
+ </div>
+ <div class="section">
+ <h3 id="Is_transactionality_required">Is transactionality required?</h3>
+ <p>A transaction is a data operation that is atomic, that is, guaranteed to succeed completely or not at all. For example, if you need to make cross-row updates to a data table, then you should consider your data transactional.</p>
+ <ul>
+ <li>If you need transactionality, use the <tt>TRANSACTIONAL</tt> <a class="externalLink" href="http://phoenix.apache.org/language/index.html#options">option</a>. (See also <a class="externalLink" href="http://phoenix.apache.org/transactions.html">http://phoenix.apache.org/transactions.html</a>.)</li>
+ </ul>
+ </div>
+ <div class="section">
+ <h3 id="Block_Encoding">Block Encoding</h3>
+ <p>Using compression or encoding is a must. Both SNAPPY and FAST_DIFF are
good all around options.</p>
+ <p><tt>FAST_DIFF</tt> encoding is automatically enabled on all Phoenix
tables by default, and almost always improves overall read latencies and
throughput by allowing more data to fit into blockcache. Note:
<tt>FAST_DIFF</tt> encoding can increase garbage produced during request
processing.</p>
+ <p>Set encoding at table creation time. Example: <tt>CREATE TABLE … ( … ) DATA_BLOCK_ENCODING='FAST_DIFF'</tt></p>
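+ <p>Compression can be set the same way. A sketch combining both options (assuming the SNAPPY codec is available on your cluster, and an illustrative table name):</p>
+ <div class="source">
+ <pre>CREATE TABLE trans.event ( … )
+    DATA_BLOCK_ENCODING = 'FAST_DIFF',
+    COMPRESSION = 'SNAPPY';
+</pre>
+ </div>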
+ <h1>Schema Design</h1>
+ <p>Because the schema affects the way the data is written to the underlying
HBase layer, Phoenix performance relies on the design of your tables, indexes,
and primary keys. </p>
+ </div>
+</div>
+<div class="section">
+ <h2 id="Phoenix_and_the_HBase_data_model">Phoenix and the HBase data
model</h2>
+ <p>HBase stores data in tables, which in turn contain columns grouped in column families. A row in an HBase table consists of versioned cells associated with one or more columns. An HBase row is a collection of many key-value pairs in which the rowkey attribute of the keys is the same. Data in an HBase table is sorted by the rowkey, and all access is via the rowkey. Phoenix creates a relational data model on top of HBase, enforcing a PRIMARY KEY constraint whose columns are concatenated to form the row key for the underlying HBase table. For this reason, it's important to be cognizant of the size and number of the columns you include in the PK constraint, because a copy of the row key is included with every cell in the underlying HBase table.</p>
+</div>
+<div class="section">
+ <h2 id="Column_Families">Column Families</h2>
+ <p>If some columns are accessed more frequently than others, <a
class="externalLink"
href="https://phoenix.apache.org/faq.html#Are_there_any_tips_for_optimizing_Phoenix">create
multiple column families</a> to separate the frequently-accessed columns from
rarely-accessed columns. This improves performance because HBase reads only the
column families specified in the query.</p>
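+ <p>In Phoenix DDL, a column family is specified as a prefix on the column name. As an illustrative sketch, a hypothetical table separating hot and cold columns might look like:</p>
+ <div class="source">
+ <pre>CREATE TABLE trans.event (
+    id         BIGINT NOT NULL PRIMARY KEY,
+    hot.status VARCHAR,  -- frequently-accessed columns in family "hot"
+    cold.notes VARCHAR   -- rarely-accessed columns in family "cold"
+);
+</pre>
+ </div>
+ <p>A query that selects only <tt>hot.status</tt> then reads just the <tt>hot</tt> family from disk.</p>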
+</div>
+<div class="section">
+ <h2 id="Columns">Columns</h2>
+ <p>Here are a few tips that apply to columns in general, whether they are
indexed or not:</p>
+ <ul>
+ <li>Keep <tt>VARCHAR</tt> columns under 1MB or so due to I/O costs. When
processing queries, HBase materializes cells in full before sending them over
to the client, and the client receives them in full before handing them off to
the application code.</li>
+ <li>For structured objects, don't use JSON, which is not very compact. Use a format such as protobuf, Avro, msgpack, or BSON.</li>
+ <li>Consider compressing data before storage using a fast LZ variant to cut
latency and I/O costs.</li>
+ <li>Use the column mapping feature (added in Phoenix 4.10), which uses numerical HBase column qualifiers for non-PK columns instead of directly using column names. This improves performance when looking for a cell in the sorted list of cells returned by HBase, adds further across-the-board performance by reducing the disk size used by tables, and speeds up DDL operations like column rename and metadata-level column drops. For more information, see "<a class="externalLink" href="https://blogs.apache.org/phoenix/entry/column-mapping-and-immutable-data">Column Mapping and Immutable Data Encoding</a>" on the Apache Phoenix blog.</li>
+ </ul>
+ <h1>Indexes</h1>
+ <p>A Phoenix index is a physical table that stores a pivoted copy of some or
all of the data in the main table, to serve specific kinds of queries. When you
issue a query, Phoenix selects the best index for the query automatically. The
primary index is created automatically based on the primary keys you select.
You can create secondary indexes, specifying which columns are included based
on the anticipated queries the index will support.</p>
+ <p>See also: <a class="externalLink"
href="https://phoenix.apache.org/secondary_indexing.html">Secondary
Indexing</a></p>
+</div>
+<div class="section">
+ <h2 id="Secondary_indexes">Secondary indexes</h2>
+ <p>Secondary indexes can improve read performance by turning what would normally be a full table scan into a point lookup (at the cost of storage space and write speed). Secondary indexes can be added or removed after table creation and don't require changes to existing queries; queries simply run faster. A small number of secondary indexes is often sufficient. Depending on your needs, consider creating <i><a class="externalLink" href="http://phoenix.apache.org/secondary_indexing.html#Covered_Indexes">covered</a></i> indexes or <i><a class="externalLink" href="http://phoenix.apache.org/secondary_indexing.html#Functional_Indexes">functional</a></i> indexes, or both.</p>
+ <p>If your table is large, use the <tt>ASYNC</tt> keyword with <tt>CREATE INDEX</tt> to create the index asynchronously. In this case, the index is built through MapReduce, which means that a client going up or down won't impact index creation, and the job is retried automatically if necessary. You'll need to start the job manually, after which you can monitor it just as you would any other MapReduce job.</p>
+ <p>Example: <tt>create index if not exists event_object_id_idx_b on
trans.event (object_id) ASYNC UPDATE_CACHE_FREQUENCY=60000;</tt></p>
+ <p>See <a class="externalLink"
href="https://phoenix.apache.org/secondary_indexing.html#Index_Population">Index
Population</a> for details.</p>
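+ <p>The MapReduce job that populates an <tt>ASYNC</tt> index is started from the command line with the Phoenix <tt>IndexTool</tt>. A sketch matching the example above (the schema, table, and output path names are illustrative):</p>
+ <div class="source">
+ <pre>hbase org.apache.phoenix.mapreduce.index.IndexTool
+    --schema TRANS --data-table EVENT
+    --index-table EVENT_OBJECT_ID_IDX_B
+    --output-path /tmp/event_idx_hfiles
+</pre>
+ </div>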
+ <p>If you can't create the index asynchronously for some reason, then increase the query timeout (<tt>phoenix.query.timeoutMs</tt>) to be larger than the time it will take to build the index. If the <tt>CREATE INDEX</tt> call times out or the client goes down before it's finished, the index build stops and must be run again. You can monitor the index table as it is created; you'll see new regions created as splits occur. You can query the <tt>SYSTEM.STATS</tt> table, which gets populated as splits and compactions happen. You can also run a <tt>count(*)</tt> query directly against the index table, though that puts more load on your system because it requires a full table scan.</p>
+ <p>Tips:</p>
+ <ul>
+ <li>Create <a class="externalLink"
href="https://phoenix.apache.org/secondary_indexing.html#Local_Indexes">local</a>
indexes for write-heavy use cases.</li>
+ <li>Create global indexes for read-heavy use cases. To save read-time
overhead, consider creating <a class="externalLink"
href="https://phoenix.apache.org/secondary_indexing.html#Covered_Indexes">covered</a>
indexes.</li>
+ <li>If the primary key is monotonically increasing, create salt buckets. The number of salt buckets can't be changed later, so design it to handle future growth. Salt buckets help avoid write hotspots, but can decrease overall throughput because of the additional scans needed on read.</li>
+ <li>Set up a cron job to build indexes. Use <tt>ASYNC</tt> with <tt>CREATE
INDEX</tt> to avoid blocking.</li>
+ <li>Only create the indexes you need.</li>
+ <li>Limit the number of indexes on frequently updated tables.</li>
+ <li>Use covered indexes to convert table scans into efficient point lookups or range queries over the index table instead of the primary table: <tt>CREATE INDEX index ON table ( … ) INCLUDE ( … )</tt></li>
+ </ul>
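+ <p>For instance, a covered index on the hypothetical <tt>trans.event</tt> table that serves lookups by <tt>object_id</tt> without ever touching the primary table (the included column names are illustrative):</p>
+ <div class="source">
+ <pre>CREATE INDEX event_object_id_idx
+    ON trans.event (object_id)
+    INCLUDE (created_date, status);
+</pre>
+ </div>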
+ <h1>Queries</h1>
+ <p>It's important to know which queries are executed on the server side versus the client side, because this can impact performance due to network I/O and other bottlenecks. If you're querying a billion-row table, you want to do as much computation as possible on the server side rather than transmitting a billion rows to the client for processing. Some queries, on the other hand, must be executed on the client. Sorting data that lives on multiple region servers, for example, requires that you aggregate and re-sort on the client.</p>
+</div>
+<div class="section">
+ <h2 id="Reading">Reading</h2>
+ <ul>
+ <li>Avoid joins unless one side is small, especially on frequent queries. For larger joins, see "Hints," below.</li>
+ <li>In the <tt>WHERE</tt> clause, filter leading columns in the primary key
constraint.</li>
+ <li>Filtering the first leading column with <tt>IN</tt> or <tt>OR</tt> in
the <tt>WHERE</tt> clause enables skip scan optimizations.</li>
+ <li>Equality or comparisons (<tt>&lt;</tt> or <tt>&gt;</tt>) in the <tt>WHERE</tt> clause enable range scan optimizations.</li>
+ <li>Let Phoenix optimize query parallelism using statistics. This provides
an automatic benefit if using Phoenix 4.2 or greater in production.</li>
+ </ul>
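+ <p>To illustrate, assuming a primary key that leads with <tt>org_id</tt> followed by <tt>event_date</tt> (the names are illustrative), a filter like the following enables a skip scan rather than a full scan:</p>
+ <div class="source">
+ <pre>SELECT * FROM trans.event
+    WHERE org_id IN ('org1', 'org2')
+    AND event_date &gt; TO_DATE('2017-01-01 00:00:00');
+</pre>
+ </div>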
+ <p>See also: <a class="externalLink"
href="https://phoenix.apache.org/joins.html">https://phoenix.apache.org/joins.html</a></p>
+ <div class="section">
+ <h3 id="Range_Queries">Range Queries</h3>
+ <p>If you regularly scan large data sets from spinning disk, you're best off with GZIP (but watch write speed). Use a lot of cores for a scan to utilize the available memory bandwidth. Apache Phoenix makes it easy to utilize many cores to increase scan performance.</p>
+ <p>For range queries, the HBase block cache does not provide much
advantage.</p>
+ </div>
+ <div class="section">
+ <h3 id="Large_Range_Queries">Large Range Queries</h3>
+ <p>For large range queries, consider setting
<tt>Scan.setCacheBlocks(false)</tt> even if the whole scan could fit into the
block cache.</p>
+ <p>If you mostly perform large range queries, you might even want to consider running HBase with a much smaller heap and sizing the block cache down, relying only on the OS cache. This alleviates some garbage collection related issues.</p>
+ </div>
+ <div class="section">
+ <h3 id="Point_Lookups">Point Lookups</h3>
+ <p>For point lookups it is quite important to have your data set cached, and
you should use the HBase block cache. </p>
+ </div>
+ <div class="section">
+ <h3 id="Hints">Hints</h3>
+ <p>Hints let you override default query processing behavior and specify such
factors as which index to use, what type of scan to perform, and what type of
join to use. </p>
+ <ul>
+ <li>Hint a global index if you want to force its use when the query includes a column that is not in the index.</li>
+ <li>If necessary, you can do bigger joins with the <tt>/*+ USE_SORT_MERGE_JOIN */</tt> hint, but a big join will be an expensive operation over huge numbers of rows.</li>
+ <li>If the overall size of all right-hand-side tables would exceed the memory size limit, use the <tt>/*+ NO_STAR_JOIN */</tt> hint.</li>
+ </ul>
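+ <p>A hint is placed immediately after the <tt>SELECT</tt> keyword. For example, forcing a sort-merge join (the table and column names are illustrative):</p>
+ <div class="source">
+ <pre>SELECT /*+ USE_SORT_MERGE_JOIN */ *
+    FROM big_table_a a JOIN big_table_b b ON a.id = b.id;
+</pre>
+ </div>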
+ <p>See also: <a class="externalLink"
href="https://phoenix.apache.org/language/#hint">Hint</a>.</p>
+ </div>
+</div>
+<div class="section">
+ <h2 id="Writing">Writing</h2>
+ <div class="section">
+ <h3 id="Batching_large_numbers_of_records">Batching large numbers of
records</h3>
+ <p>When using <tt>UPSERT</tt> to write a large number of records, turn off autocommit and batch records. <b>Note:</b> Phoenix uses <tt>commit()</tt> instead of <tt>executeBatch()</tt> to control batch updates.</p>
+ <p>Start with a batch size of 1000 and adjust as needed. Here's some pseudocode showing one way to commit records in batches:</p>
+ <div class="source">
+ <pre>try (Connection conn = DriverManager.getConnection(url)) {
+    conn.setAutoCommit(false);
+    int batchSize = 0;
+    int commitSize = 1000; // number of rows you want to commit per batch
+    try (PreparedStatement stmt = conn.prepareStatement(upsert)) {
+        while (there are records to upsert) {
+            stmt.set...; // bind the values for this record
+            stmt.executeUpdate();
+            batchSize++;
+            if (batchSize % commitSize == 0) {
+                conn.commit();
+            }
+        }
+        conn.commit(); // commit the last batch of records
+    }
+}
+</pre>
+ </div>
+ <p><b>Note:</b> Because the Phoenix client keeps uncommitted rows in memory,
be careful not to set <tt>commitSize</tt> too high.</p>
+ </div>
+ <div class="section">
+ <h3 id="Reducing_RPC_traffic">Reducing RPC traffic</h3>
+ <p>To reduce RPC traffic, set the <tt>UPDATE_CACHE_FREQUENCY</tt> property (Phoenix 4.7 or above) on your table and indexes when you create them (or issue an <tt>ALTER TABLE</tt>/<tt>ALTER INDEX</tt> call). See <a class="externalLink" href="https://phoenix.apache.org/#Altering">https://phoenix.apache.org/#Altering</a>.</p>
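+ <p>For example, to set the cache frequency on an existing, illustratively-named table to 15 minutes (the value is in milliseconds):</p>
+ <div class="source">
+ <pre>ALTER TABLE trans.event SET UPDATE_CACHE_FREQUENCY = 900000;
+</pre>
+ </div>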
+ </div>
+ <div class="section">
+ <h3 id="Using_local_indexes">Using local indexes</h3>
+ <p>If using Phoenix 4.8 or later, consider using local indexes to minimize write time. In this case, the writes for the secondary index go to the same region server as your base table. This approach does involve a performance hit on the read side, though, so make sure to quantify both the write speed improvement and the read speed reduction.</p>
+ </div>
+</div>
+<div class="section">
+ <h2 id="Deleting">Deleting</h2>
+ <p>When deleting a large data set, turn on autoCommit before issuing the <tt>DELETE</tt> query, so that the client does not need to remember the row keys of the rows as they are deleted. This prevents the client from buffering the rows affected by the <tt>DELETE</tt>, so that Phoenix can delete them directly on the region servers without the expense of returning them to the client.</p>
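+ <p>A minimal JDBC sketch of this pattern (assuming <tt>conn</tt> is an open Phoenix connection and an illustrative table name):</p>
+ <div class="source">
+ <pre>conn.setAutoCommit(true); // let the region servers delete rows directly
+try (Statement stmt = conn.createStatement()) {
+    stmt.executeUpdate(
+        "DELETE FROM trans.event WHERE event_date &lt; TO_DATE('2016-01-01 00:00:00')");
+}
+</pre>
+ </div>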
+ <h1>Explain Plans</h1>
+ <p>An <tt>EXPLAIN</tt> plan tells you a lot about how a query will be
run:</p>
+ <ul>
+ <li>All the HBase range queries that will be executed</li>
+ <li>The number of bytes that will be scanned</li>
+ <li>The number of rows that will be traversed</li>
+ <li>Which HBase table will be used for each scan</li>
+ <li>Which operations (sort, merge, scan, limit) are executed on the client
versus the server</li>
+ </ul>
+ <p>Use an <tt>EXPLAIN</tt> plan to check how a query will run, and consider
rewriting queries to meet the following goals:</p>
+ <ul>
+ <li>Emphasize operations on the server rather than the client. Server
operations are distributed across the cluster and operate in parallel, while
client operations execute within the single client JDBC driver.</li>
+ <li>Use <tt>RANGE SCAN</tt> or <tt>SKIP SCAN</tt> whenever possible rather
than <tt>TABLE SCAN</tt>.</li>
+ <li>Filter against leading columns in the primary key constraint. This assumes you have designed the primary key to lead with frequently-accessed or frequently-filtered columns as described in "Primary Keys," above.</li>
+ <li>If necessary, introduce a local index or a global index that covers your
query.</li>
+ <li>If you have an index that covers your query but the optimizer is not detecting it, try hinting the query: <tt>SELECT /*+ INDEX() */ …</tt></li>
+ </ul>
+ <div class="section">
+ <h3 id="Anatomy_of_an_Explain_Plan">Anatomy of an Explain Plan</h3>
+ <p>An explain plan consists of lines of text that describe operations that
Phoenix will perform during a query, using the following terms:</p>
+ <ul>
+ <li><tt>AGGREGATE INTO ORDERED DISTINCT ROWS</tt>: aggregates the returned rows using an operation such as addition. When <tt>ORDERED</tt> is used, the <tt>GROUP BY</tt> operation is applied to the leading part of the primary key constraint, which allows the aggregation to be done in place rather than keeping all distinct groups in memory on the server side.</li>
+ <li><tt>AGGREGATE INTO SINGLE ROW</tt>: aggregates the results into a single row using an aggregate function with no <tt>GROUP BY</tt> clause. For example, the <tt>count()</tt> statement returns one row with the total number of rows that match the query.</li>
+ <li><tt>CLIENT</tt>: the operation will be performed on the client side. It's faster to perform most operations on the server side, so you should consider whether there's a way to rewrite the query to give the server more of the work to do.</li>
+ <li><tt>FILTER BY</tt> expression: returns only results that match the expression.</li>
+ <li><tt>FULL SCAN OVER</tt> tableName: the operation will scan every row in the specified table.</li>
+ <li><tt>INNER-JOIN</tt>: the operation will join multiple tables on rows where the join condition is met.</li>
+ <li><tt>MERGE SORT</tt>: performs a merge sort on the results.</li>
+ <li><tt>RANGE SCAN OVER</tt> tableName <tt>[</tt> … <tt>]</tt>: the information in the square brackets indicates the start and stop keys for each primary key that's used in the query.</li>
+ <li><tt>ROUND ROBIN</tt>: when the query doesn't contain <tt>ORDER BY</tt> and therefore the rows can be returned in any order, <tt>ROUND ROBIN</tt> order maximizes parallelization on the client side.</li>
+ <li>x<tt>-CHUNK</tt>âdescribes how many threads will be used for the
operation. The maximum parallelism is limited to the number of threads in
thread pool. The minimum parallelization corresponds to the number of regions
the table has between the start and stop rows of the scan. The number of chunks
will increase with a lower guidepost width, as there is more than one chunk per
region.</li>
+ <li><tt>PARALLEL</tt>x-<tt>WAY</tt>âdescribes how many parallel scans
will be merge sorted during the operation.</li>
+ <li><tt>SERIAL</tt>âsome queries run serially. For example, a single row
lookup or a query that filters on the leading part of the primary key and
limits the results below a configurable threshold.</li>
+ </ul>
+ </div>
+ <div class="section">
+ <h3 id="Example">Example</h3>
+ <div class="source">
+ <pre>+------------------------------------------+
+| PLAN |
++------------------------------------------+
+| CLIENT 36-CHUNK PARALLEL 36-WAY FULL SCAN OVER exDocStoreb |
+| PARALLEL INNER-JOIN TABLE 0 (SKIP MERGE) |
+| CLIENT 36-CHUNK PARALLEL 36-WAY RANGE SCAN OVER indx_exdocb [0,'
42ecf4abd4bd7e7606025dc8eee3de 6a3cc04418cbc2619ddc01f54d88d7 c3bf'] - [0,'
42ecf4abd4bd7e7606025dc8eee3de 6a3cc04418cbc2619ddc01f54d88d7 c3bg'] |
+| SERVER FILTER BY FIRST KEY ONLY |
+| SERVER AGGREGATE INTO ORDERED DISTINCT ROWS BY ["ID"] |
+| CLIENT MERGE SORT |
+| DYNAMIC SERVER FILTER BY (A.CURRENT_TIMESTAMP, A.ID) IN
((TMP.MCT, TMP.TID)) |
++------------------------------------------+
+</pre>
+ </div>
+ <p>See also: <a class="externalLink"
href="http://phoenix.apache.org/language/index.html#explain">http://phoenix.apache.org/language/index.html#explain</a></p>
+ <h1>Improving parallelization</h1>
+ <p>You can improve parallelization with the <a class="externalLink"
href="https://phoenix.apache.org/update_statistics.html">UPDATE STATISTICS</a>
command. This command subdivides each region by determining keys called
<i>guideposts</i> that are equidistant from each other, then uses these
guideposts to break up queries into multiple parallel scans. Statistics are
turned on by default. With Phoenix 4.9, the user can set guidepost width for
each table. Optimal guidepost width depends on a number of factors such as
cluster size, cluster usage, number of cores per node, table size, and disk
I/O.</p>
+ <h1>Further Tuning</h1>
+ <p>For advice about tuning the underlying HBase and JVM layers, see <a
class="externalLink"
href="https://hbase.apache.org/book.html#schema.ops">Operational and
Performance Configuration Options</a> in the Apache HBase™ Reference
Guide.</p>
+ </div>
+</div>
+<div class="section">
+ <h2 id="Special_Cases">Special Cases</h2>
+ <p>The following sections provide Phoenix-specific additions to the tuning
recommendations in the Apache HBase™ Reference Guide section referenced
above.</p>
+ <div class="section">
+ <h3 id="For_applications_where_failing_quickly_is_better_than_waiting">For
applications where failing quickly is better than waiting</h3>
+ <p>In addition to the HBase tuning referenced above, set
<tt>phoenix.query.timeoutMs</tt> in <tt>hbase-site.xml</tt> on the client side
to the maximum tolerable wait time in milliseconds.</p>
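+ <p>For example, a minimal <tt>hbase-site.xml</tt> sketch that fails queries
after 30 seconds (the property name is real; the 30000 value is
illustrative):</p>
+ <div class="source">
+ <pre>&lt;property&gt;
+  &lt;name&gt;phoenix.query.timeoutMs&lt;/name&gt;
+  &lt;value&gt;30000&lt;/value&gt;
+&lt;/property&gt;</pre>
+ </div>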
+ </div>
+ <div class="section">
+ <h3
id="For_applications_that_can_tolerate_slightly_out_of_date_information">For
applications that can tolerate slightly out of date information</h3>
+ <p>In addition to the HBase tuning referenced above, set
<tt>phoenix.connection.consistency = timeline</tt> in <tt>hbase-site.xml</tt>
on the client side for all connections.</p>
+ </div>
+</div>
+ </div>
+ </div>
+ </div>
+ </div>
+
+ </div><!-- /container -->
+
+ <!-- Footer
+ ================================================== -->
+ <footer class="well">
+ <div class="container">
+ <div class="row">
+ <div class="span2 bottom-nav">
+ <ul class="nav nav-list">
+ <li
class="nav-header">About</li>
+ <li >
+ <a href="index.html"
title="Overview">Overview</a>
+ </li>
+ <li >
+ <a
href="who_is_using.html" title="Who is Using">Who is Using</a>
+ </li>
+ <li >
+ <a href="recent.html"
title="New Features">New Features</a>
+ </li>
+ <li >
+ <a href="roadmap.html"
title="Roadmap">Roadmap</a>
+ </li>
+ <li >
+ <a
href="performance.html" title="Performance">Performance</a>
+ </li>
+ <li >
+ <a href="team.html"
title="Team">Team</a>
+ </li>
+ <li >
+ <a
href="resources.html" title="Presentations">Presentations</a>
+ </li>
+ <li >
+ <a
href="mailing_list.html" title="Mailing Lists">Mailing Lists</a>
+ </li>
+ <li >
+ <a href="source.html"
title="Source Repository">Source Repository</a>
+ </li>
+ <li >
+ <a href="issues.html"
title="Issue Tracking">Issue Tracking</a>
+ </li>
+ <li >
+ <a href="download.html"
title="Download">Download</a>
+ </li>
+ <li >
+ <a
href="installation.html" title="Installation">Installation</a>
+ </li>
+ <li >
+ <a href="http:divider"
title=""></a>
+ </li>
+ <li >
+ <a
href="contributing.html" title="How to Contribute">How to Contribute</a>
+ </li>
+ <li >
+ <a href="develop.html"
title="How to Develop">How to Develop</a>
+ </li>
+ <li >
+ <a
href="building_website.html" title="How to Update Website">How to Update
Website</a>
+ </li>
+ <li >
+ <a href="release.html"
title="How to Release">How to Release</a>
+ </li>
+ <li >
+ <a href="http:divider"
title=""></a>
+ </li>
+ <li >
+ <a
href="http://www.apache.org/licenses/" title="License"
class="externalLink">License</a>
+ </li>
+ </ul>
+ </div>
+ <div class="span2 bottom-nav">
+ <ul class="nav nav-list">
+ <li
class="nav-header">Using</li>
+ <li >
+ <a href="faq.html"
title="F.A.Q.">F.A.Q.</a>
+ </li>
+ <li >
+ <a
href="Phoenix-in-15-minutes-or-less.html" title="Quick Start">Quick Start</a>
+ </li>
+ <li >
+ <a href="building.html"
title="Building">Building</a>
+ </li>
+ <li class="active">
+ <a href="#"
title="Tuning">Tuning</a>
+ </li>
+ <li >
+ <a href="tuning.html"
title="Configuration">Configuration</a>
+ </li>
+ <li >
+ <a
href="upgrading.html" title="Backward Compatibility">Backward Compatibility</a>
+ </li>
+ <li >
+ <a
href="release_notes.html" title="Release Notes">Release Notes</a>
+ </li>
+ <li >
+ <a href="pherf.html"
title="Performance Testing">Performance Testing</a>
+ </li>
+ <li >
+ <a href="http:divider"
title=""></a>
+ </li>
+ <li >
+ <a
href="phoenix_spark.html" title="Apache Spark Integration">Apache Spark
Integration</a>
+ </li>
+ <li >
+ <a
href="hive_storage_handler.html" title="Phoenix Storage Handler for Apache
Hive">Phoenix Storage Handler for Apache Hive</a>
+ </li>
+ <li >
+ <a
href="pig_integration.html" title="Apache Pig Integration">Apache Pig
Integration</a>
+ </li>
+ <li >
+ <a
href="phoenix_mr.html" title="Map Reduce Integration">Map Reduce Integration</a>
+ </li>
+ <li >
+ <a href="flume.html"
title="Apache Flume Plugin">Apache Flume Plugin</a>
+ </li>
+ <li >
+ <a href="kafka.html"
title="Apache Kafka Plugin">Apache Kafka Plugin</a>
+ </li>
+ </ul>
+ </div>
+ <div class="span2 bottom-nav">
+ <ul class="nav nav-list">
+ <li
class="nav-header">Features</li>
+ <li >
+ <a
href="transactions.html" title="Transactions">Transactions</a>
+ </li>
+ <li >
+ <a href="udf.html"
title="User-defined Functions">User-defined Functions</a>
+ </li>
+ <li >
+ <a href="http:divider"
title=""></a>
+ </li>
+ <li >
+ <a
href="secondary_indexing.html" title="Secondary Indexes">Secondary Indexes</a>
+ </li>
+ <li >
+ <a
href="columnencoding.html" title="Storage Formats">Storage Formats</a>
+ </li>
+ <li >
+ <a
href="atomic_upsert.html" title="Atomic Upsert">Atomic Upsert</a>
+ </li>
+ <li >
+ <a
href="namspace_mapping.html" title="Namespace Mapping">Namespace Mapping</a>
+ </li>
+ <li >
+ <a
href="update_statistics.html" title="Statistics Collection">Statistics
Collection</a>
+ </li>
+ <li >
+ <a
href="rowtimestamp.html" title="Row Timestamp Column">Row Timestamp Column</a>
+ </li>
+ <li >
+ <a href="paged.html"
title="Paged Queries">Paged Queries</a>
+ </li>
+ <li >
+ <a href="salted.html"
title="Salted Tables">Salted Tables</a>
+ </li>
+ <li >
+ <a
href="skip_scan.html" title="Skip Scan">Skip Scan</a>
+ </li>
+ <li >
+ <a href="http:divider"
title=""></a>
+ </li>
+ <li >
+ <a href="views.html"
title="Views">Views</a>
+ </li>
+ <li >
+ <a
href="multi-tenancy.html" title="Multi tenancy">Multi tenancy</a>
+ </li>
+ <li >
+ <a
href="dynamic_columns.html" title="Dynamic Columns">Dynamic Columns</a>
+ </li>
+ <li >
+ <a href="http:divider"
title=""></a>
+ </li>
+ <li >
+ <a
href="bulk_dataload.html" title="Bulk Loading">Bulk Loading</a>
+ </li>
+ <li >
+ <a href="server.html"
title="Query Server">Query Server</a>
+ </li>
+ <li >
+ <a href="tracing.html"
title="Tracing">Tracing</a>
+ </li>
+ <li >
+ <a href="metrics.html"
title="Metrics">Metrics</a>
+ </li>
+ </ul>
+ </div>
+ <div class="span3 bottom-nav">
+ <ul class="nav nav-list">
+ <li
class="nav-header">Reference</li>
+ <li >
+ <a
href="language/index.html" title="Grammar">Grammar</a>
+ </li>
+ <li >
+ <a
href="language/functions.html" title="Functions">Functions</a>
+ </li>
+ <li >
+ <a
href="language/datatypes.html" title="Datatypes">Datatypes</a>
+ </li>
+ <li >
+ <a
href="array_type.html" title="ARRAY type">ARRAY type</a>
+ </li>
+ <li >
+ <a href="http:divider"
title=""></a>
+ </li>
+ <li >
+ <a
href="sequences.html" title="Sequences">Sequences</a>
+ </li>
+ <li >
+ <a href="joins.html"
title="Joins">Joins</a>
+ </li>
+ <li >
+ <a
href="subqueries.html" title="Subqueries">Subqueries</a>
+ </li>
+ </ul>
+ </div>
+ <div class="span3 bottom-description">
+ <form
action="http://search-hadoop.com/?" method="get"><input value="Phoenix"
name="fc_project" type="hidden"><input placeholder="Search Phoenix…"
required="required" style="width:170px;" size="18" name="q" id="query"
type="search"></form>
+ </div>
+ </div>
+ </div>
+ </footer>
+
+ <div class="container subfooter">
+ <div class="row">
+ <div class="span12">
+ <p class="pull-right"><a href="#">Back to
top</a></p>
+ <p class="copyright">Copyright ©2017 <a
href="http://www.apache.org">Apache Software Foundation</a>. All Rights
Reserved.</p>
+ </div>
+ </div>
+ </div>
+
+ <!-- Le javascript
+ ================================================== -->
+ <!-- Placed at the end of the document so the pages load faster -->
+ <script
src="//ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
+
+ <script
src="//netdna.bootstrapcdn.com/twitter-bootstrap/2.3.2/js/bootstrap.min.js"></script>
+ <script src="./js/lightbox.js"></script>
+ <script src="./js/jquery.smooth-scroll.min.js"></script>
+ <!-- back button support for smooth scroll -->
+ <script src="./js/jquery.ba-bbq.min.js"></script>
+ <script src="//yandex.st/highlightjs/7.5/highlight.min.js"></script>
+
+ <script src="./js/reflow-skin.js"></script>
+
+ </body>
+</html>
Added: phoenix/site/source/src/site/markdown/tuning_guide.md
URL:
http://svn.apache.org/viewvc/phoenix/site/source/src/site/markdown/tuning_guide.md?rev=1788628&view=auto
==============================================================================
--- phoenix/site/source/src/site/markdown/tuning_guide.md (added)
+++ phoenix/site/source/src/site/markdown/tuning_guide.md Sat Mar 25 01:44:50
2017
@@ -0,0 +1,292 @@
+# Tuning Guide
+
+Tuning Phoenix can be complex, but with a little knowledge of how it works you
can make significant changes to the performance of your reads and writes. The
most important factor in performance is the design of your schema, especially
as it affects the underlying HBase row keys. Look in "General Tips" below
to find design advice for different anticipated data access patterns.
Subsequent sections describe how to use secondary indexes, hints, and explain
plans.
+
+# Primary Keys
+
+The underlying row key design is the single most important factor in Phoenix
performance, and it's important to get it right at design time because you
cannot change it later without re-writing the data and index tables.
+
+The Phoenix primary keys are concatenated to create the underlying row key in
Apache HBase. The columns of the primary key constraint should be chosen and
ordered to match common query patterns—choose the most frequently queried
columns as primary keys. The column you place in the leading position is the
most performant one to filter on. For example, if you lead with a column
containing org ID values, it is easy to select all rows pertaining to a
specific org. You can add the HBase row timestamp to the primary key to
improve scan efficiency by skipping rows outside the queried time range.
+
+Every primary key imposes a cost because the entire row key is appended to
every piece of data in memory and on disk. The larger the row key, the greater
the storage overhead. Find ways to store information compactly in columns you
plan to use for primary keys—store deltas instead of complete time stamps,
for example.
+
+To sum up, the best practice is to design primary keys to add up to a row key
that lets you scan the smallest amount of data.
+
+**Tip:** When choosing primary keys, lead with the column you filter most
frequently across the queries that are most important to optimize. If you will
use `ORDER BY` in your query, make sure your PK columns match the expressions
in your `ORDER BY` clause.
+
+### Monotonically increasing Primary keys
+
+If your primary keys are monotonically increasing, use salting to help
distribute writes across the cluster and improve parallelization. Example:
+
+`CREATE TABLE … ( … ) SALT_BUCKETS = N`
+
+For optimal performance, the number of salt buckets should approximately equal
the number of region servers. Do not salt automatically; use salting only when
you are experiencing hotspotting. The downside of salting is that it imposes a
read-time cost: a range scan must run a separate scan per salt bucket and merge
the results.
+
+
+# General Tips
+
+The following sections provide a few general tips for different access
scenarios.
+
+### Is the Data Random-Access?
+
+* As with any random read workloads, SSDs can improve performance because of
their faster random seek time.
+
+### Is the data read-heavy or write-heavy?
+
+* For read-heavy data:
+ * Create global indexes. This will affect write speed depending on the
number of columns included in an index because each index writes to its own
separate table.
+ * Use multiple indexes to provide fast access to common queries.
+ * When specifying machines for HBase, do not skimp on cores; HBase needs
them.
+* For write-heavy data:
+ * Pre-split the table. It can be helpful to split the table into
pre-defined regions, or if the keys are monotonically increasing, use salting
to avoid creating write hotspots on a small number of nodes. Use real data
types rather than raw byte data.
+ * Create local indexes. Reads from local indexes have a performance
penalty, so it's important to do performance testing. See the
[Pherf](https://phoenix.apache.org/pherf.html) tool.
+
+
+
+
+
+### Which columns will be accessed often?
+
+* Choose commonly-queried columns as primary keys. For more information, see
"Primary Keys" above.
+ * Create additional indexes to support common query patterns, including
heavily accessed fields that are not in the primary key.
+
+### Can the data be append-only (immutable)?
+
+* If the data is immutable or append-only, declare the table and its indexes
as immutable using the `IMMUTABLE_ROWS`
[option](http://phoenix.apache.org/language/index.html#options) at creation
time to reduce the write-time cost. If you need to make an existing table
immutable, you can do so with `ALTER TABLE trans.event SET IMMUTABLE_ROWS=true`
after creation time.
+ * If speed is more important than data integrity, you can use the
`DISABLE_WAL` [option](http://phoenix.apache.org/language/index.html#options).
Note: it is possible to lose data with `DISABLE_WAL` if a region server fails.
+* Set the `UPDATE_CACHE_FREQUENCY`
[option](http://phoenix.apache.org/language/index.html#options) to 15 minutes
or so if your metadata doesn't change very often. This property determines how
often an RPC is done to ensure you're seeing the latest schema.
+* If the data is not sparse (over 50% of the cells have values), use the
SINGLE_CELL_ARRAY_WITH_OFFSETS data encoding scheme introduced in Phoenix 4.10,
which obtains faster performance by reducing the size of the data. For more
information, see â[Column Mapping and Immutable Data
Encoding](https://blogs.apache.org/phoenix/entry/column-mapping-and-immutable-data)â
on the Apache Phoenix blog.
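+
+As a sketch combining the options above for a mostly-static, append-only
table (the column names and the 15-minute value are illustrative;
`UPDATE_CACHE_FREQUENCY` is in milliseconds):
+
+```
+-- illustrative: immutable rows plus a 15-minute metadata cache refresh
+CREATE TABLE trans.event (
+    object_id VARCHAR PRIMARY KEY,
+    payload VARCHAR
+) IMMUTABLE_ROWS = true, UPDATE_CACHE_FREQUENCY = 900000;
+```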
+
+### Is the table very large?
+
+* Use the `ASYNC` keyword with your `CREATE INDEX` call to create the index
asynchronously via MapReduce job. You'll need to manually start the job; see
https://phoenix.apache.org/secondary_indexing.html#Index_Population for
details.
+* If the data is too large to scan the table completely, use primary keys to
create an underlying composite row key that makes it easy to return a subset of
the data or facilitates
[skip-scanning](https://phoenix.apache.org/skip_scan.html)âPhoenix can jump
directly to matching keys when the query includes key sets in the predicate.
+
+### Is transactionality required?
+
+A transaction is a data operation that is atomicâthat is, guaranteed to
succeed completely or not at all. For example, if you need to make cross-row
updates to a data table, then you should consider your data transactional.
+
+* If you need transactionality, use the `TRANSACTIONAL`
[option](http://phoenix.apache.org/language/index.html#options). (See also
http://phoenix.apache.org/transactions.html.)
+
+### Block Encoding
+
+Using compression or encoding is a must. Both SNAPPY compression and
FAST_DIFF encoding are good all-around options.
+
+`FAST_DIFF` encoding is automatically enabled on all Phoenix tables by
default, and almost always improves overall read latencies and throughput by
allowing more data to fit into blockcache. Note: `FAST_DIFF` encoding can
increase garbage produced during request processing.
+
+Set encoding at table creation time. Example:
+`CREATE TABLE … ( … ) DATA_BLOCK_ENCODING='FAST_DIFF'`
+
+
+# Schema Design
+
+Because the schema affects the way the data is written to the underlying HBase
layer, Phoenix performance relies on the design of your tables, indexes, and
primary keys.
+
+## Phoenix and the HBase data model
+
+HBase stores data in tables, which in turn contain columns grouped in column
families. A row in an HBase table consists of versioned cells associated with
one or more columns. An HBase row is a collection of many key-value pairs in
which the rowkey attribute of the keys are equal. Data in an HBase table is
sorted by the rowkey, and all access is via the rowkey.
+Phoenix creates a relational data model on top of HBase, enforcing a PRIMARY
KEY constraint whose columns are concatenated to form the row key for the
underlying HBase table. For this reason, it's important to be cognizant of the
size and number of the columns you include in the PK constraint, because a copy
of the row key is included with every cell in the underlying HBase table.
+
+
+
+## Column Families
+
+If some columns are accessed more frequently than others, [create multiple
column
families](https://phoenix.apache.org/faq.html#Are_there_any_tips_for_optimizing_Phoenix)
to separate the frequently-accessed columns from rarely-accessed columns. This
improves performance because HBase reads only the column families specified in
the query.
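+
+For example, a hypothetical table could keep a frequently-read column in one
family (`a`) and a rarely-read one in another (`b`), so dashboard-style
queries touch only family `a` (all names here are illustrative):
+
+```
+CREATE TABLE metrics (
+    host VARCHAR PRIMARY KEY,
+    a.last_value DOUBLE,   -- read on every dashboard query
+    b.raw_payload VARCHAR  -- rarely read
+);
+```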
+
+
+
+## Columns
+
+Here are a few tips that apply to columns in general, whether they are indexed
or not:
+
+* Keep `VARCHAR` columns under 1MB or so due to I/O costs. When processing
queries, HBase materializes cells in full before sending them over to the
client, and the client receives them in full before handing them off to the
application code.
+* For structured objects, don't use JSON, which is not very compact. Use a
format such as protobuf, Avro, msgpack, or BSON.
+* Consider compressing data before storage using a fast LZ variant to cut
latency and I/O costs.
+* Use the column mapping feature (added in Phoenix 4.10), which uses numerical
HBase column qualifiers for non-PK columns instead of directly using column
names. This improves performance when looking for a cell in the sorted list of
cells returned by HBase, adds further across-the-board performance by reducing
the disk size used by tables, and speeds up DDL operations like column rename
and metadata-level column drops. For more information, see â[Column Mapping
and Immutable Data
Encoding](https://blogs.apache.org/phoenix/entry/column-mapping-and-immutable-data)â
on the Apache Phoenix blog.
+
+# Indexes
+
+A Phoenix index is a physical table that stores a pivoted copy of some or all
of the data in the main table, to serve specific kinds of queries. When you
issue a query, Phoenix selects the best index for the query automatically. The
primary index is created automatically based on the primary keys you select.
You can create secondary indexes, specifying which columns are included based
on the anticipated queries the index will support.
+
+See also:
+[Secondary Indexing](https://phoenix.apache.org/secondary_indexing.html)
+
+## Secondary indexes
+
+Secondary indexes can improve read performance by turning what would normally
be a full table scan into a point lookup (at the cost of storage space and
write speed). Secondary indexes can be added or removed after table creation
and don't require changes to existing queries — queries simply run faster. A
small number of secondary indexes is often sufficient. Depending on your needs,
consider creating
*[covered](http://phoenix.apache.org/secondary_indexing.html#Covered_Indexes)*
indexes or
*[functional](http://phoenix.apache.org/secondary_indexing.html#Functional_Indexes)*
indexes, or both.
+
+If your table is large, use the `ASYNC` keyword with `CREATE INDEX` to create
the index asynchronously. In this case, the index will be built through
MapReduce, which means that the client going up or down won't impact index
creation and the job is retried automatically if necessary. You'll need to
manually start the job, which you can then monitor just as you would any other
MapReduce job.
+
+Example:
+`create index if not exists event_object_id_idx_b on trans.event (object_id)
ASYNC UPDATE_CACHE_FREQUENCY=60000;`
+
+See [Index
Population](https://phoenix.apache.org/secondary_indexing.html#Index_Population)
for details.
+
+If you can't create the index asynchronously for some reason, then increase
the query timeout (`phoenix.query.timeoutMs`) to be larger than the time it'll
take to build the index. If the `CREATE INDEX` call times out or the client
goes down before it's finished, then the index build will stop and must be run
again. You can monitor the index table as it is created—you'll see new
regions created as splits occur. You can query the `SYSTEM.STATS` table, which
gets populated as splits and compactions happen. You can also run a `count(*)`
query directly against the index table, though that puts more load on your
system because it requires a full table scan.
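+
+For example, to spot-check progress of the index created in the example
above (again, this runs a full scan, so use it sparingly):
+
+```
+SELECT count(*) FROM trans.event_object_id_idx_b;
+```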
+
+Tips:
+
+* Create
[local](https://phoenix.apache.org/secondary_indexing.html#Local_Indexes)
indexes for write-heavy use cases.
+* Create global indexes for read-heavy use cases. To save read-time overhead,
consider creating
[covered](https://phoenix.apache.org/secondary_indexing.html#Covered_Indexes)
indexes.
+* If the primary key is monotonically increasing, create salt buckets. The
salt buckets can't be changed later, so design them to handle future growth.
Salt buckets help avoid write hotspots, but can decrease overall throughput due
to the additional scans needed on read.
+* Set up a cron job to build indexes. Use `ASYNC` with `CREATE INDEX` to avoid
blocking.
+* Only create the indexes you need.
+* Limit the number of indexes on frequently updated tables.
+* Use covered indexes to convert table scans into efficient point lookups or
range queries over the index table instead of the primary table:
+  `CREATE INDEX index ON table ( … ) INCLUDE ( … )`
+
+# Queries
+
+It's important to know which queries are executed on the server side versus
the client side, because this can impact performance due to network I/O and
other bottlenecks. If you're querying a billion-row table, you want to do as
much computation as possible on the server side rather than transmitting a
billion rows to the client for processing. Some queries, on the other hand,
must be executed on the client. Sorting data that lives on multiple region
servers, for example, requires that you aggregate and re-sort on the client.
+
+## Reading
+
+* Avoid joins unless one side is small, especially on frequent queries. For
larger joins, see âHints,â below.
+* In the `WHERE` clause, filter leading columns in the primary key constraint.
+* Filtering the first leading column with `IN` or `OR` in the `WHERE` clause
enables skip scan optimizations.
+* Equality or comparisons (`<` or `>`) in the `WHERE` clause enable range
scan optimizations.
+* Let Phoenix optimize query parallelism using statistics. This provides an
automatic benefit if using Phoenix 4.2 or greater in production.
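+
+To sketch the two filtering tips above, suppose a hypothetical table whose
primary key leads with `org_id` (table and values are illustrative):
+
+```
+-- IN on the leading PK column can use a skip scan
+SELECT * FROM events WHERE org_id IN ('org1', 'org7');
+
+-- a comparison on the leading PK column can use a range scan
+SELECT * FROM events WHERE org_id >= 'org1' AND org_id < 'org2';
+```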
+
+See also: https://phoenix.apache.org/joins.html
+
+### Range Queries
+
+If you regularly scan large data sets from spinning disk, you're best off with
GZIP (but watch write speed). Use a lot of cores for a scan to utilize the
available memory bandwidth. Apache Phoenix makes it easy to utilize many cores
to increase scan performance.
+
+For range queries, the HBase block cache does not provide much advantage.
+
+### Large Range Queries
+
+For large range queries, consider setting `Scan.setCacheBlocks(false)` even
if the whole scan could fit into the block cache.
+
+If you mostly perform large range queries you might even want to consider
running HBase with a much smaller heap and size the block cache down, to only
rely on the OS Cache. This will alleviate some garbage collection related
issues.
+
+### Point Lookups
+
+For point lookups it is quite important to have your data set cached, and you
should use the HBase block cache.
+
+### Hints
+
+Hints let you override default query processing behavior and specify such
factors as which index to use, what type of scan to perform, and what type of
join to use.
+
+* Hint a global index if you want to force its use when the query includes a
column that is not in the index.
+* If necessary, you can do bigger joins with the `/*+ USE_SORT_MERGE_JOIN */`
hint, but a big join will be an expensive operation over huge numbers of rows.
+* If the overall size of all right-hand-side tables would exceed the memory
size limit, use the `/*+ NO_STAR_JOIN */` hint.
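+
+For example, forcing a sort-merge join between two hypothetical tables
(names are illustrative):
+
+```
+SELECT /*+ USE_SORT_MERGE_JOIN */ o.order_id, c.name
+FROM orders o JOIN customers c ON o.customer_id = c.customer_id;
+```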
+
+See also: [Hint](https://phoenix.apache.org/language/#hint).
+
+## Writing
+
+### Batching large numbers of records
+
+When using `UPSERT` to write a large number of records, turn off autocommit
and batch records.
+**Note:** Phoenix uses `commit()` instead of `executeBatch()` to control batch
updates.
+
+Start with a batch size of 1000 and adjust as needed. Here's some pseudocode
showing one way to commit records in batches:
+
+```
+try (Connection conn = DriverManager.getConnection(url)) {
+    conn.setAutoCommit(false);
+    int batchSize = 0;
+    int commitSize = 1000; // number of rows you want to commit per batch
+    try (PreparedStatement stmt = conn.prepareStatement(upsert)) {
+        while (/* there are records to upsert */) {
+            // set the bind parameters for this record, e.g. stmt.setString(1, ...)
+            stmt.executeUpdate();
+            batchSize++;
+            if (batchSize % commitSize == 0) {
+                conn.commit();
+            }
+        }
+        conn.commit(); // commit the last batch of records
+    }
+}
+```
+
+**Note:** Because the Phoenix client keeps uncommitted rows in memory, be
careful not to set `commitSize` too high.
+
+### Reducing RPC traffic
+
+To reduce RPC traffic, set the `UPDATE_CACHE_FREQUENCY` property (available in
Phoenix 4.7 and above) on your table and indexes when you create them (or issue
an `ALTER TABLE`/`ALTER INDEX` call).
+See https://phoenix.apache.org/#Altering.
+
+### Using local indexes
+
+If you are using Phoenix 4.8, consider using local indexes to minimize write
time. In this case, the writes for the secondary index go to the same region
server as the base table. This approach does involve a performance hit on the
read side, though, so make sure to quantify both the write speed improvement
and the read speed reduction.
+
+## Deleting
+
+When deleting a large data set, turn on autoCommit before issuing the `DELETE`
query so that the client does not need to remember the row keys of the deleted
rows. This prevents the client from buffering the rows affected by the
`DELETE`, so that Phoenix can delete them directly on the region servers
without the expense of returning them to the client.
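+
+For example, in sqlline (the table name and predicate are illustrative):
+
+```
+!autocommit on
+DELETE FROM trans.event WHERE created_date < TO_DATE('2016-01-01');
+```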
+
+# Explain Plans
+
+An `EXPLAIN` plan tells you a lot about how a query will be run:
+
+* All the HBase range queries that will be executed
+* The number of bytes that will be scanned
+* The number of rows that will be traversed
+* Which HBase table will be used for each scan
+* Which operations (sort, merge, scan, limit) are executed on the client
versus the server
+
+Use an `EXPLAIN` plan to check how a query will run, and consider rewriting
queries to meet the following goals:
+
+* Emphasize operations on the server rather than the client. Server operations
are distributed across the cluster and operate in parallel, while client
operations execute within the single client JDBC driver.
+* Use `RANGE SCAN` or `SKIP SCAN` whenever possible rather than `TABLE SCAN`.
+* Filter against leading columns in the primary key constraint. This assumes
you have designed the primary key to lead with frequently-accessed or
frequently-filtered columns as described in "Primary Keys," above.
+* If necessary, introduce a local index or a global index that covers your
query.
+* If you have an index that covers your query but the optimizer is not
detecting it, try hinting the query:
+  `SELECT /*+ INDEX() */ …`
+
+### Anatomy of an Explain Plan
+
+An explain plan consists of lines of text that describe operations that
Phoenix will perform during a query, using the following terms:
+
+* `AGGREGATE INTO ORDERED DISTINCT ROWS`—aggregates the returned rows using
an operation such as addition. When `ORDERED` is used, the `GROUP BY` operation
is applied to the leading part of the primary key constraint, which allows the
aggregation to be done in place rather than keeping all distinct groups in
memory on the server side.
+* `AGGREGATE INTO SINGLE ROW`—aggregates the results into a single row using
an aggregate function with no `GROUP BY` clause. For example, the `count()`
statement returns one row with the total number of rows that match the query.
+* `CLIENT`—the operation will be performed on the client side. It's faster
to perform most operations on the server side, so you should consider whether
there's a way to rewrite the query to give the server more of the work to do.
+* `FILTER BY` expression—returns only results that match the expression.
+* `FULL SCAN OVER` tableName—the operation will scan every row in the
specified table.
+* `INNER-JOIN`—the operation will join multiple tables on rows where the
join condition is met.
+* `MERGE SORT`—performs a merge sort on the results.
+* `RANGE SCAN OVER` tableName `[` ... `]`—the information in the square
brackets indicates the start and stop for each primary key that's used in the
query.
+* `ROUND ROBIN`—when the query doesn't contain `ORDER BY` and therefore the
rows can be returned in any order, `ROUND ROBIN` order maximizes
parallelization on the client side.
+* x`-CHUNK`—describes how many threads will be used for the operation. The
maximum parallelism is limited to the number of threads in the thread pool. The
minimum parallelization corresponds to the number of regions the table has
between the start and stop rows of the scan. The number of chunks will increase
with a lower guidepost width, as there can be more than one chunk per region.
+* `PARALLEL` x-`WAY`—describes how many parallel scans will be merge sorted
during the operation.
+* `SERIAL`—some queries run serially. For example, a single row lookup or a
query that filters on the leading part of the primary key and limits the
results below a configurable threshold.
+
+### Example
+
+```
++------------------------------------------+
+| PLAN                                     |
++------------------------------------------+
+| CLIENT 36-CHUNK PARALLEL 36-WAY FULL SCAN OVER exDocStoreb |
+| PARALLEL INNER-JOIN TABLE 0 (SKIP MERGE) |
+| CLIENT 36-CHUNK PARALLEL 36-WAY RANGE SCAN OVER indx_exdocb [0,'42ecf4abd4bd7e7606025dc8eee3de6a3cc04418cbc2619ddc01f54d88d7c3bf'] - [0,'42ecf4abd4bd7e7606025dc8eee3de6a3cc04418cbc2619ddc01f54d88d7c3bg'] |
+| SERVER FILTER BY FIRST KEY ONLY          |
+| SERVER AGGREGATE INTO ORDERED DISTINCT ROWS BY ["ID"] |
+| CLIENT MERGE SORT                        |
+| DYNAMIC SERVER FILTER BY (A.CURRENT_TIMESTAMP, A.ID) IN ((TMP.MCT, TMP.TID)) |
++------------------------------------------+
+```
+
+See also:
+http://phoenix.apache.org/language/index.html#explain
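+
+To generate a plan like the one above, prefix the query with `EXPLAIN` (the table name here is hypothetical):
+
+```sql
+-- Prints the plan without executing the query.
+EXPLAIN SELECT host, COUNT(*) FROM MY_TABLE GROUP BY host;
+```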
+
+# Improving parallelization
+
+You can improve parallelization with the [UPDATE
STATISTICS](https://phoenix.apache.org/update_statistics.html) command. This
command subdivides each region by determining keys called *guideposts* that are
equidistant from each other, then uses these guideposts to break up queries
into multiple parallel scans.
+Statistics are turned on by default. As of Phoenix 4.9, you can set the guidepost width for each table. The optimal guidepost width depends on a number of factors, such as cluster size, cluster usage, number of cores per node, table size, and disk I/O.
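+
+As a sketch (the table name and width value are illustrative; `GUIDE_POSTS_WIDTH` is the per-table property available as of Phoenix 4.9):
+
+```sql
+-- Collect statistics for a hypothetical table.
+UPDATE STATISTICS MY_TABLE;
+
+-- Set a 10 MB guidepost width for that table (value is in bytes).
+ALTER TABLE MY_TABLE SET GUIDE_POSTS_WIDTH = 10000000;
+```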
+
+# Further Tuning
+
+For advice about tuning the underlying HBase and JVM layers, see [Operational and Performance Configuration Options](https://hbase.apache.org/book.html#schema.ops) in the Apache HBase™ Reference Guide.
+
+## Special Cases
+
+The following sections provide Phoenix-specific additions to the tuning recommendations in the Apache HBase™ Reference Guide section referenced above.
+
+### For applications where failing quickly is better than waiting
+
+In addition to the HBase tuning referenced above, set
`phoenix.query.timeoutMs` in `hbase-site.xml` on the client side to the maximum
tolerable wait time in milliseconds.
+
+### For applications that can tolerate slightly out of date information
+
+In addition to the HBase tuning referenced above, set
`phoenix.connection.consistency = timeline` in `hbase-site.xml` on the client
side for all connections.
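+
+A client-side `hbase-site.xml` fragment combining this setting with the timeout setting above might look like the following (the values are illustrative, not recommendations):
+
+```xml
+<!-- Fail queries that take longer than 30 seconds. -->
+<property>
+  <name>phoenix.query.timeoutMs</name>
+  <value>30000</value>
+</property>
+<!-- Accept timeline-consistent (possibly slightly stale) reads. -->
+<property>
+  <name>phoenix.connection.consistency</name>
+  <value>timeline</value>
+</property>
+```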