[drill] branch gh-pages updated: add index content to docs

bridgetb Mon, 17 Dec 2018 15:33:54 -0800

This is an automated email from the ASF dual-hosted git repository.

bridgetb pushed a commit to branch gh-pages
in repository https://gitbox.apache.org/repos/asf/drill.git



The following commit(s) were added to refs/heads/gh-pages by this push:
     new 2ac8a08  add index content to docs
2ac8a08 is described below

commit 2ac8a08da63e1d671b7a720e9760697c231ad2e9
Author: Bridget Bevens <[email protected]>
AuthorDate: Mon Dec 17 15:32:47 2018 -0800

    add index content to docs
---
 _docs/img/query-plan-verify.png                    | Bin 0 -> 113092 bytes
 .../010-querying-indexes-introduction.md           |  15 +++
 ...ies-that-qualify-for-index-based-query-plans.md |  59 ++++++++++
 .../querying-indexes/021-types-of-indexes.md       | 124 +++++++++++++++++++++
 .../querying-indexes/030-index-selection.md        |  23 ++++
 .../040-designing-indexes-for-your-queries.md      |  97 ++++++++++++++++
 .../050-configuring-index-planning.md              |  37 ++++++
 .../querying-indexes/060-verifying-index-use.md    | 103 +++++++++++++++++
 8 files changed, 458 insertions(+)

diff --git a/_docs/img/query-plan-verify.png b/_docs/img/query-plan-verify.png
new file mode 100644
index 0000000..4d79109
Binary files /dev/null and b/_docs/img/query-plan-verify.png differ
diff --git 
a/_docs/query-data/querying-indexes/010-querying-indexes-introduction.md 
b/_docs/query-data/querying-indexes/010-querying-indexes-introduction.md
new file mode 100644
index 0000000..13f6d71
--- /dev/null
+++ b/_docs/query-data/querying-indexes/010-querying-indexes-introduction.md
@@ -0,0 +1,15 @@
+---
+title: "Querying Indexes Introduction"
+date: 2018-09-28 21:35:21 UTC
+parent: "Querying Indexes"
+---  
+
+{% include startnote.html %}Currently, Drill only supports indexes for the 
MapR-DB storage plugin. Examples in this document were created using Drill on 
MapR Database. {% include endnote.html %}
+
+An index is a structure, defined on a table, that contains a subset of columns 
from the table sorted by key values. Well-designed indexes eliminate full table 
scans and optimize access to data to significantly improve performance. 
Starting in Drill 1.15, Drill can leverage indexes (primary or secondary) in 
data sources to create index-based query plans. 
+
+{% include startnote.html %}You cannot create, update, or delete indexes or 
tables through Drill. You must create and manage the indexes and tables in the 
data sources themselves. Also, the data source must be configured as a storage 
plugin in Drill.{% include endnote.html %}
+
+An index-based query plan is a query plan that uses indexes versus full tables 
scans to access data. When you submit a query through Drill, the query planner 
in Drill evaluates the query and compares the cost of multiple query plans to 
find an optimal plan with the lowest cost. Index planning is enabled by 
default; however, you can disable the feature or configure the behavior of 
index planning in Drill through several options. 
+
+Certain types of queries qualify for index-based query plans. You can create 
indexes and write queries such that the query planner in Drill leverages the 
indexes to create an index-based query plan for optimized performance. Drill 
can leverage indexes to create covering, non-covering, and functional index 
plans. You can verify that the query planner generated an index-based query 
plan in the query profile (Drill Web UI) or from the command line using the 
[EXPLAIN PLAN FOR]({{site.baseurl [...]
diff --git 
a/_docs/query-data/querying-indexes/020-queries-that-qualify-for-index-based-query-plans.md
 
b/_docs/query-data/querying-indexes/020-queries-that-qualify-for-index-based-query-plans.md
new file mode 100644
index 0000000..09f5420
--- /dev/null
+++ 
b/_docs/query-data/querying-indexes/020-queries-that-qualify-for-index-based-query-plans.md
@@ -0,0 +1,59 @@
+---
+title: "Queries that Qualify for Index-Based Query Plans"
+date: 2018-09-28 21:35:21 UTC
+parent: "Querying Indexes"
+---  
+
+In most cases, the planner chooses an index-based query plan for queries with 
[WHERE clause]({{site.baseurl}}/docs/where-clause/) filters and sort-based 
operations, such as [ORDER BY]({{site.baseurl}}/docs/order-by-clause/), [GROUP 
BY]({{site.baseurl}}/docs/group-by-clause/), and 
[joins]({{site.baseurl}}/docs/from-clause/#join-types).
+
+The following types of queries and conditions typically qualify for 
index-based query plans:  
+
+
+- Equality conditions (includes IN)
+- Range conditions (includes LIKE pattern matching)
+- ORDER BY queries
+- GROUP BY queries
+- JOIN queries
+- Projection queries (including DISTINCT)
+
+Drill can use indexes for queries that GROUP BY or ORDER BY the leading 
columns in an index. When a query contains GROUP BY and ORDER BY operations on 
the leading indexed column, Drill uses the sort order of the index to create 
index-based query plans with streaming aggregates and merge joins to improve 
query performance. Drill does not use indexes for queries that GROUP BY or 
ORDER BY the trailing or included columns in an index. 
+If a query contains conditions on multiple columns (multi-index query), Drill 
can scan multiple indexes and use the index intersection to optimize the query. 
 
+
+{% include startnote.html %}Each column added to an index increases the 
storage requirement for the index. As the storage size increases, the cost of 
reading the index also increases. Likewise, for the cost of adding and updating 
data. Consider the impact on storage and updates when including columns in an 
index.{% include endnote.html %}  
+
+##Queries Without Filters  
+
+Drill can create index-based query plans for the following types of queries 
without filters (queries that do not have a WHERE clause):  
+   
+
+- ORDER BY queries, as shown in the following example where L_LINENUMBER is an 
indexed column in the index selected for the query plan:    
+  
+               SELECT L_LINENUMBER FROM lineitem ORDER BY L_LINENUMBER;  
+  
+- GROUP BY queries, as shown in the following example where L_COMMITDate is an 
indexed column in the index selected for the query plan:  
+  
+               SELECT L_COMMITDate FROM lineitem GROUP BY L_COMMITDate;  
+
+- JOIN queries, as shown in the following example where L_ORDERKEY and 
O_ORDERKEY are indexed columns and L_LINESTATUS is an included column in the 
index selected for the query plan:  
+
+               SELECT L.L_LINESTATUS FROM lineitem L, orders O WHERE  
L.L_ORDERKEY=O.O_ORDERKEY;  
+
+
+       **Note:** If the planner picks two indexes, one for lineitem and one 
for orders, a sort merge join is used instead of a hash join.  
+
+- Queries with DISTINCT projections, as shown in the following examples where 
L_LINENUMBER is an indexed column in the index selected for the query plan:  
+ 
+               SELECT DISTINCT L_LINENUMBER FROM lineitem;
+               SELECT COUNT(DISTINCT L_LINENUMBER) FROM lineitem;  
+
+
+
+
+
+
+
+
+  
+
+
+  
\ No newline at end of file
diff --git a/_docs/query-data/querying-indexes/021-types-of-indexes.md 
b/_docs/query-data/querying-indexes/021-types-of-indexes.md
new file mode 100644
index 0000000..0208f66
--- /dev/null
+++ b/_docs/query-data/querying-indexes/021-types-of-indexes.md
@@ -0,0 +1,124 @@
+---
+title: "Types of Indexes"
+date: 2018-09-28 21:35:21 UTC
+parent: "Querying Indexes"
+---  
+  
+The query planner in Drill can leverage indexes to create index-based query 
plans for improved query performance. An index is either covering or 
non-covering based on the columns referenced in the query and the columns in 
the index. Indexes for queries with ORDER BY may be hashed or non-hashed, 
affecting the sort order of the indexes. Drill also supports indexes on the 
CAST function.  
+
+The following sections describe these indexes in more detail and show examples 
of query plans generated for each.  
+
+{% include startnote.html %}Currently, Drill works with the types of indexes 
supported by MapR Database. The following sections demonstrate how Drill works 
with indexes in the MapR Database.{% include endnote.html %}  
+
+##Covering Index  
+A covering index is an index that “covers” all the columns referenced in a 
query. Only the index is needed to process the query. Covering indexes avoid 
the overhead of fetching data from the primary table. A covering index can 
include indexed and non-indexed (included) columns. 
+
+For example, suppose an index is created and queried, as follows:  
+
+       index keys:{a}, included columns: {b} and the query is SELECT a, b FROM 
T WHERE a > 10 AND b < 20.
+
+Since columns a, b are present in the index, this is a covering index. The 
Drill planner generates a covering index plan (index-only plan) where all the 
columns are retrieved from the index after pushing down relevant filter 
conditions to the index scan.  
+
+###Covering Index Example  
+
+This example uses an index, l_comp_1, created on a table named lineitem with 
indexed columns L_LINENUMBER and L_ORDERKEY and also includes columns 
L_LINESTATUS and L_QUANTITY.  
+
+The following query references the L_LINESTATUS, L_QUANTITY, L_LINENUMBER, and 
L_ORDERKEY columns in the lineitem table:  
+
+       SELECT L_LINESTATUS, L_QUANTITY FROM lineitem WHERE L_LINENUMBER = 1 
AND L_ORDERKEY BETWEEN 40 AND 75;  
+
+Because the l_comp_1 index includes all columns referenced in the query, the 
query planner in Drill creates a covering index plan.
+
+Running the EXPLAIN PLAN FOR command with the query shows that Drill created a 
query plan that only uses the index to process the query:  
+
+       EXPLAIN PLAN FOR SELECT L_LINESTATUS, L_QUANTITY FROM lineitem WHERE 
L_LINENUMBER = 1 AND L_ORDERKEY BETWEEN 40 AND 75;
+       
+       00-00    Screen
+       00-01      Project(L_LINESTATUS=[$0], L_QUANTITY=[$1])
+       00-02        Scan(table=[[si, tpch_sf1_maprdb_range, lineitem]], 
groupscan=[JsonTableGroupScan [ScanSpec=JsonScanSpec 
[tableName=maprfs:///drill/testdata/tpch/sf1/maprdb/json/range/lineitem, 
condition=(((L_LINENUMBER = {"$numberLong":1}) and (L_ORDERKEY >= 
{"$numberLong":40})) and (L_ORDERKEY <= {"$numberLong":75})), 
indexName=l_comp_1], columns=[`L_LINESTATUS`, `L_QUANTITY`]]])  
+
+Reading the query plan, you can see that the plan includes an index scan, as 
indicated by `groupscan=[JsonTableGroupScan` and `indexName=l_comp_1`. Drill 
and the data source can process this query using only the index.  
+
+##Non-Covering Index
+A non-covering index is an index that does not “cover” all the columns 
referenced in a query.  A non-covering index has indexed and/or non-indexed 
(included) columns that only partially cover the columns referenced in a query. 
A non-covering query plan includes an index scan and a join back to the primary 
table. In some scenarios, a full table scan is more cost efficient than an 
index scan and Drill will not create an index-based query plan.
+
+For example, suppose an index is created and queried, as follows:  
+
+       index keys:{a, b}, included columns: {c} and the query is SELECT d, e 
FROM T WHERE a > 10 AND b < 20.
+
+Since columns d, e are not present in the index, this is a non-covering index. 
For such indexes, the Drill planner generates a non-covering index plan where 
only the row ids are fetched from the index by pushing down the WHERE clause 
filters and the rest of the columns are fetched after a join-back to the 
primary table. The join-back is performed using the row ids.  
+
+###Non-Covering Index Example  
+This example uses an index, l_comp_1, created on a table named lineitem with 
indexed columns L_LINENUMBER and L_ORDERKEY and also included columns 
L_LINESTATUS and L_QUANTITY.  
+
+The following query references the L_RETURNFLAG, L_LINESTATUS, L_QUANTITY 
L_LINENUMBER, and L_ORDERKEY columns in the lineitem table:  
+       
+       SELECT L_RETURNFLAG, L_LINESTATUS, L_QUANTITY FROM lineitem WHERE 
L_LINENUMBER = 1 AND L_ORDERKEY BETWEEN 40 AND 75;  
+
+Because the l_comp_1 index does not include the L_RETURNFLAG column, the query 
planner in Drill creates a non-covering index plan that uses the index, but 
also includes a join on the primary table.  
+
+Running the EXPLAIN PLAN FOR command with the query shows that Drill includes 
an index scan and a table scan with a rowkey join:  
+
+       EXPLAIN PLAN FOR SELECT L_RETURNFLAG, L_LINESTATUS, L_QUANTITY FROM 
lineitem WHERE L_LINENUMBER = 1 AND L_ORDERKEY BETWEEN 40 AND 75;
+       
+       00-00    Screen
+       00-01      Project(L_RETURNFLAG=[$0], L_LINESTATUS=[$1], 
L_QUANTITY=[$2])
+       00-02        Project(L_RETURNFLAG=[$2], L_LINESTATUS=[$3], 
L_QUANTITY=[$4])
+       00-03          Project(L_LINENUMBER=[$0], L_ORDERKEY=[$1], 
L_RETURNFLAG=[$2], L_LINESTATUS=[$3], L_QUANTITY=[$4])
+       00-04            RowKeyJoin(condition=[=($5, $6)], joinType=[inner])
+       00-06              Scan(table=[[si, tpch_sf1_maprdb_range, lineitem]], 
groupscan=[RestrictedJsonTableGroupScan [ScanSpec=JsonScanSpec 
[tableName=maprfs:///drill/testdata/tpch/sf1/maprdb/json/range/lineitem, 
condition=(((L_LINENUMBER = {"$numberLong":1}) and (L_ORDERKEY >= 
{"$numberLong":40})) and (L_ORDERKEY <= {"$numberLong":75}))], 
columns=[`L_LINENUMBER`, `L_ORDERKEY`, `L_RETURNFLAG`, `L_LINESTATUS`, 
`L_QUANTITY`, `_id`], rowcount=60012.15000000001]])
+       00-05              Scan(table=[[si, tpch_sf1_maprdb_range, lineitem]], 
groupscan=[JsonTableGroupScan [ScanSpec=JsonScanSpec 
[tableName=maprfs:///drill/testdata/tpch/sf1/maprdb/json/range/lineitem, 
condition=(((L_LINENUMBER = {"$numberLong":1}) and (L_ORDERKEY >= 
{"$numberLong":40})) and (L_ORDERKEY <= {"$numberLong":75})), 
indexName=l_comp_1], columns=[`_id`]]])
+
+
+Reading the non-covering index plan, you can see that the plan includes an 
index scan, as indicated by the `groupscan=[JsonTableGroupScan` and 
`indexName=l_comp_1`, and also a scan on the primary table, as indicated by the 
`groupscan=[RestrictedJsonTableGroupScan` and the `RowKeyJoin`. To process this 
query, Drill can use the index with the data source, but the data source must 
also use the rowkey join on the primary table to fetch data in the L_RETURNFLAG 
column. Note that this is not a [...]
+
+If this query ran on a regular basis, you could remove the l_comp_1 index and 
create a new index that includes all columns referenced in the query, including 
the L_RETURNFLAG column, to improve query performance. However, running a query 
only once or a few times may not justify the overhead of removing the old index 
and creating a new index.  
+
+##Non-Hashed Indexes 
+Non-hashed indexes support conditional queries with an ORDER BY clause. When 
processing ORDER BY queries, Drill does not have to perform sort operations on 
the data.  
+
+###Non-Hashed Index Plan Example  
+
+A non-hashed index, l_comp_1, was created on a table, lineitem with indexed 
columns L_LINENUMBER and L_ORDERKEY and included columns L_LINESTATUSand 
L_QUANTITY.  
+
+Running the example query with the EXPLAIN PLAN FOR command shows that Drill 
produces an index plan without the additional sort and merge operations when 
using the non-hashed index to process the query, as follows:  
+
+       EXPLAIN PLAN FOR SELECT L_LINESTATUS, L_QUANTITY FROM lineitem WHERE 
L_LINENUMBER = 1 AND L_ORDERKEY BETWEEN 40 AND 75 ORDER BY L_LINENUMBER;
+       
+       00-00    Screen
+       00-01      Project(L_LINESTATUS=[$0], L_QUANTITY=[$1])
+       00-02        Project(L_LINESTATUS=[$2], L_QUANTITY=[$3], 
L_LINENUMBER=[$0])
+       00-03          Scan(table=[[si, tpch_sf1_maprdb_range, lineitem]], 
groupscan=[JsonTableGroupScan [ScanSpec=JsonScanSpec 
[tableName=maprfs:///drill/testdata/tpch/sf1/maprdb/json/range/lineitem, 
condition=(((L_LINENUMBER = {"$numberLong":1}) and (L_ORDERKEY >= 
{"$numberLong":40})) and (L_ORDERKEY <= {"$numberLong":75})), 
indexName=l_comp_1], columns=[`L_LINENUMBER`, `L_ORDERKEY`, `L_LINESTATUS`, 
`L_QUANTITY`]]])  
+
+Reading the query plan, you can see that Drill uses the non-hashed index plan, 
as indicated by `indexName=l_comp_1`. To process the query, the data source 
uses the index and Drill does not have to perform sort and merge operations on 
the data, as indicated by the absence of the Sort and SingleMergeExchange 
operations in the query plan. The data source sorted the data in the index when 
the index was created.  
+
+##Hashed Indexes
+Hashed indexes support the same conditional queries as non-hashed indexes, but 
they do not have a guaranteed sort order. Hashed indexes enable the data source 
to evenly distribute new writes on an index across logical partitions to avoid 
hot spotting. Drill must perform a sort for ORDER BY queries that use hashed 
indexes. Sorting the data can increase the CPU costs and negatively impact 
performance.  
+
+If you notice performance issues with ORDER BY queries that use hashed 
indexes, review the query plans to see if the plans include sort and merge 
operations. If this is the case, create non-hashed indexes to support the 
queries and achieve the best performance.  
+
+###Hashed Index Plan Example
+A hashed index, l_hash_comp_1, was created with indexed columns L_LINENUMBER 
and L_ORDERKEY and included columns L_LINESTATUS and L_QUANTITY.  
+
+Running the example query with the EXPLAIN PLAN FOR command shows that Drill 
produces an index plan with sort and merge operations to process the query when 
using the hashed index, as follows:  
+
+       EXPLAIN PLAN FOR SELECT L_LINESTATUS, L_QUANTITY FROM lineitem WHERE 
L_LINENUMBER = 1 AND L_ORDERKEY BETWEEN 40 AND 75 ORDER BY L_LINENUMBER;
+       
+       00-00    Screen
+       00-01      Project(L_LINESTATUS=[$0], L_QUANTITY=[$1])
+       00-02        SingleMergeExchange(sort0=[2])
+       01-01          SelectionVectorRemover
+       01-02            Sort(sort0=[$2], dir0=[ASC])
+       01-03              Project(L_LINESTATUS=[$2], L_QUANTITY=[$3], 
L_LINENUMBER=[$0])
+       01-04                Scan(table=[[si, tpch_sf1_maprdb_hash, lineitem]], 
groupscan=[JsonTableGroupScan [ScanSpec=JsonScanSpec 
[tableName=maprfs:///drill/testdata/tpch/sf1/maprdb/json/hash/lineitem, 
condition=(((L_LINENUMBER = {"$numberLong":1}) and (L_ORDERKEY >= 
{"$numberLong":40})) and (L_ORDERKEY <= {"$numberLong":75})), 
indexName=l_hash_comp_1], columns=[`L_LINENUMBER`, `L_ORDERKEY`, 
`L_LINESTATUS`, `L_QUANTITY`]]])  
+
+Reading the query plan, you can see that Drill uses the hashed index in the 
plan, as indicated by `indexName=l_hash_comp_1`. To process the query, the data 
source can use the index, but Drill must sort and merge the data, as indicated 
by the `Sort` and `SingleMergeExchange` operations in the query plan.
+Using the hashed index plan for this ORDER BY query requires additional 
processing and negatively impacts performance.  
+
+##Functional Index  
+A functional index is an index created on functions or expressions instead of 
columns in a table. CAST functions are most commonly used in Drill views. For 
example, if the filter condition is 'WHERE CAST(zip_code as BIGINT) = 95120' 
and a functional index exists on CAST(zip_code as BIGINT), the query planner 
will leverage the index.  
+
+When issuing Drill queries through BI tools, you can include CAST functions in 
your queries to create [Drill views]({{site.baseurl}}/docs/create-view/). 
Including CAST functions provides the metadata needed to optimally process the 
queries. For more information about using the CAST function with Drill, see 
[Data Type Conversion]({{site.baseurl}}/docs/data-type-conversion/).  
+
+{% include startnote.html %}MapR Database supports functional indexes. The 
query planner in Drill can use the functional index to optimize queries with 
CAST functions.{% include endnote.html %}  
+ 
\ No newline at end of file
diff --git a/_docs/query-data/querying-indexes/030-index-selection.md 
b/_docs/query-data/querying-indexes/030-index-selection.md
new file mode 100644
index 0000000..586a230
--- /dev/null
+++ b/_docs/query-data/querying-indexes/030-index-selection.md
@@ -0,0 +1,23 @@
+---
+title: "Index Selection"
+date: 2018-09-28 21:35:21 UTC
+parent: "Querying Indexes"
+---  
+
+Drill leverages indexes during the physical planning phase of the query. The 
query planner in Drill evaluates the cost of multiple query plans and then 
selects the plan with the lowest cost for query execution. The planner uses 
CPU, memory, and network I/O estimates to calculate cost. Based on these costs, 
the query planner evaluates selectivity, collation (sortedness of data), and 
the type of index. For each candidate index, the query planner estimates the 
total cost of accessing the index.  
+
+The query planner can create three types of index plans: covering, 
non-covering, and index intersection. By default, the planner evaluates the top 
five indexes per table, though this number is configurable. The indexes chosen 
may also include an index intersection. For example, two indexes may not 
qualify for an index plan based on their individual selectivity, but their 
combined selectivity after intersection could be low enough to qualify.  
+ 
+##Selectivity    
+Selectivity is the estimated number of rows based on the selectivity of each 
conditional expression in the WHERE clause, calculated as (output row 
count)/(total table row count). For example, if a table has 100 rows and 25 of 
the rows qualify the filter condition, the selectivity is .25. Selectivity 
ranges between 0 and 1. The closer to 0, the more selective the filter. The 
more selective a filter, the lower the cost. High filter selectivity results in 
better query performance. If filter [...]
+
+The data source provides Drill with the estimated number of rows that match a 
filter filter condition and the average row size. The data source uses a filter 
condition, such as WHERE a > 10 AND b < 20, to return the estimated row count 
of the filter condition based on the leading prefix of the index columns. For 
example, if the index columns are the composite key {a, b}, the leading prefix 
of the filter condition is {a, b}, and the row count of the conjunct (number of 
rows that meet both [...]
+
+##Selectivity of Covering and Non-Covering Indexes
+For a covering index, the query planner always generates a covering index 
plan, even if the estimated selectivity is 100%. The planner always expects an 
index-only plan to be cheaper than a plan with a full table scan due to the 
smaller row widths in an index.  
+  
+For a non-covering index, the query planner estimates the cost of a join-back 
to the primary table. Due to the random I/O nature of the rowkey join-back to 
primary table, the default selectivity threshold is small: 2.5% (.025). You can 
configure the default selectivity threshold for non-covering indexes through 
the `planner.index.noncovering_selectivity_threshold` option.   
+
+If the estimated selectivity of the filter condition is above this threshold, 
the query planner does not generate a non-covering index plan for that index; 
the rationale is that each new plan adds to the search space and increases 
planning time. If the estimated row count is already high, the plan is unlikely 
to be chosen anyway, and therefore better to prune the plan out early. 
+
+ 
\ No newline at end of file
diff --git 
a/_docs/query-data/querying-indexes/040-designing-indexes-for-your-queries.md 
b/_docs/query-data/querying-indexes/040-designing-indexes-for-your-queries.md
new file mode 100644
index 0000000..0f5ac40
--- /dev/null
+++ 
b/_docs/query-data/querying-indexes/040-designing-indexes-for-your-queries.md
@@ -0,0 +1,97 @@
+---
+title: "Designing Indexes for Your Queries"
+date: 2018-09-28 21:35:21 UTC
+parent: "Querying Indexes"
+---   
+
+Design indexes that support your queries for maximum performance benefits. Use 
common query patterns that involve filters and ordering to define indexes. 
Weigh the benefits of indexes against their update and storage costs and take 
into consideration any index limitations.  
+
+##Identify Query Patterns  
+Query patterns, such as queries with filter conditions and ORDER BY clauses, 
indicate where indexes can improve performance. If a query does not contain 
selective filters, the overhead of using an index may cost more than a full 
table scan. You should also define your indexes such that a single index 
benefits either multiple queries or individual queries that you run most often. 
 
+
+###Determine Potential Indexes Based on Query Patterns  
+The following table describes the types and characteristics of indexes you 
might want to create based on some example query patterns:  
+
+| **Identified Query Pattern**                                                 
                   | **Potential Indexes to Create**                            
                                                                                
                                                                                
                         |
+|---------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Compares individual columns with selective filter conditions                 
               | Define single column indexes on the columns that   you compare 
against. Verify that the columns contain supported data types.                  
                                                                                
                 |
+| Filters against specific combinations of columns                             
               | Define   composite column indexes instead of single column 
indexes. Specify the   sequence of the index keys so columns that appear in 
equality conditions are   the prefixes in the keys.                             
                         |
+| Accesses a subset of columns   in a document, but does not filter or sort on 
these columns  | Add those columns as included columns in   indexes.            
                                                                                
                                                                                
                 |
+| Filters on a sub-column in a   nested document                               
               | Define the index key on the sub-column.                        
                                                                                
                                                                                
                 |
+| Filters on sub-columns in   nested documents that are array elements         
               | Define the index key using a container column   path: for 
example, arraycolumn[].subcolumn.                                               
                                                                                
                      |
+| Filters and projects using a   container column path                         
               | Define the container column path as both an   indexed column 
and included column.                                                            
                                                                                
                   |
+| Filters on individual   elements of an array, which can appear in any 
position in the array | Define an index using a container column path:   for 
example, arraycolumn[].                                                         
                                                                                
                           |
+| Issues Drill SQL queries with   filter conditions that contain CAST 
expressions             | Specify the CAST function when defining the   index 
key.                                                                            
                                                                                
                            |
+| Sorts on columns                                                             
               | Define the sequence and order direction of the   index keys to 
match the sequence and order direction of the columns your   query sorts. If 
the sort order of the index keys matches the insertion order   of documents, 
define hashed indexes. |
+| Sorts on one set of columns   and filters on another set using equality 
conditions          | Define a composite index so that columns using   equality 
conditions are the prefixes in the index keys, followed by the sort   columns.  
                                                                                
                      |   
+
+##Evaluate Trade-Offs and Limitations  
+
+When designing indexes for optimization, consider the following trade-offs and 
limitations:
+
+###Synchronizing Indexes
+When you design your indexes, remember that the data source must synchronize 
each index when you insert and update columns in the table. This impacts the 
throughput performance of inserts and updates because the data source must 
perform additional writes. The impact increases with each additional index.  
+
+###Index Storage Requirements
+Consider the storage costs when creating indexes and deciding on the columns 
to add to the index. Indexes increase your storage requirements. The storage 
size depends on the number of indexed and included columns in the index and the 
size of values stored in those columns. As the size of the index increases, the 
cost of reading the index also increases.  
+
+###Index Restrictions
+When designing your indexes, make sure the indexes support the functionality 
you need. 
+
+**Examples**  
+
+The following examples illustrate the concepts behind index design, though 
they do not account for sizing, storage, and updates. Always weigh the benefits 
of indexes against these other requirements.  
+
+Suppose you have the following customer data in JSON format:  
+
+       {
+          "_id": "10000",
+          "FullName": {
+             "LastName": "Smith",
+             "FirstName": "John"
+          },
+          "Address": {
+             "Street": "123 SE 22nd St.",
+             "City": "Oakland",
+             "State": "CA",
+             "Zipcode": "94601-1001"
+          },
+          "Gender": "M",
+          "AccountBalance": 999.99,
+          "Email": "[email protected]",
+          "Phones": [
+             {"Type": "Home", "Number": "555-555-1234"},
+             {"Type": "Mobile", "Number": "555-555-5678"},
+             {"Type": "Work", "Number": "555-555-9012"}
+          ],
+          "Hobbies": ["Baseball", "Cooking", "Reading"],
+          "DateOfBirth": "10/1/1985"
+       }   
+
+The following table contains columns in the document that are candidates for 
indexing based on the sample queries:  
+
+|    Query # | Query                                                           
                                                          | Candidate   columns 
for Indexing                           |
+|------------|---------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------|
+| 1          | Find all customers who were born in the 1970s.                  
                                                          | DateOfBirth         
                                       |
+| 2          | Find all customers who have an account balance   greater than 
$10K. Order the information in descending order of balance. | AccountBalance    
                                         |
+| 3          | List customers who live in   California, ordering the list by 
LastName, FirstName.                                        | Address.State     
FullName.LastName     FullName.FirstName |
+| 4          | Find the ids and emails of customers who live in   a specific 
zip code.                                                   | Address.Zip       
                                         |
+| 5          | Find customers who live in a   specific set of states and have 
an account balance less than a specific   value.           | Address.State     
AccountBalance                           |
+| 6          | Find male customers with the   last name starting with the 
letter "S".                                                    | Gender     
FullName.LastName                               |
+| 7          | Find all customers who have a   mobile phone number with a 
prefix of "650".                                               | Phones[].Type  
   Phones[].Number                          |     
+
+The following table contains indexes you might create to optimize the queries 
listed in the previous table and the reasons for doing so:  
+
+|    Index                                                                     
                      | Rationale                                               
                                                                                
                                                                                
                                                 |
+|----------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Simple   index on DateOfBirth                                                
                      | Optimizes the range condition on   DateOfBirth in query 
1.     You need not create a hashed index, because it is unlikely that the 
order   of DateOfBirth correlates with the insert order of new data.            
                                                      |
+| Simple   index on AccountBalance, specified as a descending key              
                      | Optimizes the range condition on   AccountBalance in 
query 2.     Descending order of key meets the ordering criteria in query 2.    
 Also optimizes the range condition on AccountBalance in query 5 in   
combination with the index on Address.State.                  |
+| Composite   index on:     •Address.State     •FullName.LastName     
•FullName.FirstName            | Optimizes both the equality   condition on 
Address.State and ordering in query 3.     Inclusion of the name columns in the 
index meets query 3 ordering.     Also optimizes the IN condition in query 5 
when used in combination with   the index on AccountBalance      |
+| Simple   index with:     •Indexed column on Address.Zip     •Included 
columns on Id and Email      | Optimizes the equality condition   on 
Address.Zip in query 4.     Adding the included columns avoids reading the JSON 
table in query 4.                                                               
                                                                   |
+| Composite   index on:     •Gender     •FullName.LastName                     
                      | Optimizes equality condition on   Gender and pattern 
matching condition on FullName.LastName for query 6.     Specifying Gender as 
the leading key in combination with FullName.LastName   results in more 
selective index lookups for query 6.                          |
+| Simple   index on Hobbies[]                                                  
                      | Optimizes the equality condition   on array elements of 
Hobbies in query 7:     {"$eq":{"Hobbies[]":"Reading"}}                         
                                                                                
                                                 |
+| Composite   index on:     •Phones[].Type     •Phones[].Number                
                      | Optimizes the following two   conditions in query 8:    
 •Equality condition on the Type subcolumn in nested documents in the Phones   
array.     •Pattern matching condition on the Number subcolumn in nested 
documents in   the Phones array.                         |
+
+
+
+
diff --git 
a/_docs/query-data/querying-indexes/050-configuring-index-planning.md 
b/_docs/query-data/querying-indexes/050-configuring-index-planning.md
new file mode 100644
index 0000000..a5d176f
--- /dev/null
+++ b/_docs/query-data/querying-indexes/050-configuring-index-planning.md
@@ -0,0 +1,37 @@
+---
+title: "Configuring Index Planning"
+date: 2018-09-28 21:35:21 UTC
+parent: "Querying Indexes"
+---   
+
+Certain options control how Drill uses indexes when planning and executing 
queries. You can set the query planning and execution options, at the system or 
session level, using the ALTER SYSTEM|SET commands, as shown:  
+
+       ALTER SYSTEM SET `planner.enable_index_planning` = 'true' 
+       SET `planner.enable_index_planning` = 'false'  
+
+Options set at the session level only apply to queries that you run during the 
current Drill connection. Options set at the system level affect the entire 
system and persist between restarts. Session level settings override system 
level settings. Typically, you set the options at the session level unless you 
want the setting to persist across all sessions.  
+
+The following table lists the index planning and execution options that you 
can enable, disable, or modify:  
+
+**Note:** The planning option names are prepended with planner, for example 
`planner.enable_index_planning`. The execution options are prepended with exec, 
for example `exec.query.rowkeyjoin_batchsize`.  
+
+|    **Option**                                           | **Description**    
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
              [...]
+|-----------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 [...]
+| planner.enable_index_planning                       | Enables   or disables 
index planning                                                                  
                                                                                
                                                                                
                                                                                
                                                                                
               [...]
+| planner.index.force_sort_noncovering                | Forces   Drill to sort 
for non-covering indexes. If the query has an ORDER-BY on index   columns and a 
non-covering index is chosen, by default Drill leverages the   sortedness of 
the index columns and does not sort. Fast changing primary   table data may 
produce a partial sort. This option forces a sort within   Drill.               
                                                                                
                     [...]
+| planner.enable_rowkeyjoin_conversion                | Drill can push down 
the   rowkey filter to the data source during runtime. For a query to qualify 
for   runtime filter pushdown, the join condition must filter on a rowkey. For  
 example,     SELECT t.mscIdentities FROM dfs.root.`/user/ted/MixTable` t WHERE 
t.row_key   IN (SELECT max(convert_fromutf8(i.KeyA.ENTRY_KEY)) FROM   
dfs.root.`/user/ted/TableIMSI` i WHERE i.row_key='460021050005636')     Drill 
evaluates the results of the s [...]
+| planner.rowkeyjoin_conversion_selectivity_threshold | Sets the selectivity 
(as a   percentage) under which Drill uses a rowkey join for eligible queries.  
                                                                                
                                                                                
                                                                                
                                                                                
                [...]
+| planner.rowkeyjoin_conversion_using_hashjoin        | When   enabled, Drill 
uses the hash join operator instead of a rowkey join.                           
                                                                                
                                                                                
                                                                                
                                                                                
               [...]
+| planner.index.covering_selectivity_threshold        | For   covering 
indexes, this option specifies the filter selectivity that   corresponds to the 
leading prefix of the index below which the index is   considered for planning. 
For example, for the filter ‘a > 10 AND b <   20’ if an index has indexed 
columns (a, b, c) and the combined selectivity of   the above condition is less 
than the threshold, the index is considered for   the query plan.               
                            [...]
+| planner.index.noncovering_selectivity_threshold     | For   non-covering 
indexes, this option specifies the filter selectivity that   corresponds to the 
leading prefix of the index below which the index is   considered for planning. 
                                                                                
                                                                                
                                                                                
                  [...]
+| planner.index.max_chosen_indexes_per_table          | The   maximum number 
of “chosen” indexes for a table after index costing and   ranking.              
                                                                                
                                                                                
                                                                                
                                                                                
                [...]
+| planner.index.rowkeyjoin_cost_factor                | The   cost factor that 
provides some control over the I/O cost for non-covering   indexes when the 
rowkey join back to the primary table causes random I/O from   the primary 
table.                                                                          
                                                                                
                                                                                
                       [...]
+| planner.enable_statistics                           | Enable   or disable 
statistics for the filter conditions on indexed columns.                        
                                                                                
                                                                                
                                                                                
                                                                                
                 [...]
+| exec.query.rowkeyjoin_batchsize                     | For   batch GET 
operations, this option specifies the batch size in terms of the   number of 
rowkeys. Used for non-covering index plans when doing joins back to   primary 
table.                                                                          
                                                                                
                                                                                
                          [...]
+
+
+
+
+
+
+
diff --git a/_docs/query-data/querying-indexes/060-verifying-index-use.md 
b/_docs/query-data/querying-indexes/060-verifying-index-use.md
new file mode 100644
index 0000000..2662c07
--- /dev/null
+++ b/_docs/query-data/querying-indexes/060-verifying-index-use.md
@@ -0,0 +1,103 @@
+---
+title: "Verifying Index Use"
+date: 2018-09-28 21:35:21 UTC
+parent: "Querying Indexes"
+---  
+
+Evaluate query plans to analyze query performance and determine if Drill used 
qualifying indexes. You can view query plans in the Drill Web UI or through the 
command line using the EXPLAIN PLAN FOR command. Alternatively, you can disable 
the `planner.enable_index_planning` option in Drill to see the query plan with 
a full table scan and then compare the plan with the index-based plan.  
+
+The following examples demonstrate how to view the query profile in the Drill 
Web UI and the output of the EXPLAIN PLAN FOR command to determine if the query 
planner in Drill selected an index-based query plan. There is also an example 
that shows you how to disable index planning to compare a full table scan plan 
with an index-based query plan.  
+
+##Examples  
+The subsequent sections assume that an index exists on a table named 
"lineitem." The index, l_single_c_5, is a single column index created on the 
L_QUANTITY column. The index also covers the L_SUPPKEY, L_DISCOUNT, L_SHIPDate, 
and L_SHIPMODE columns. If a query contains columns covered by the index, the 
query is a covering query. If a query contains columns not covered by the 
index, the query is non-covering and requires a lookup back into the primary 
table to retrieve data.  
+
+The following list summarizes the assumptions:  
+
+- **Table name**: `lineitem`
+- **Index name**: `l_single_c_5`
+- **Indexed column**: `L_QUANTITY`
+- **Included columns**: `L_SUPPKEY`, `L_DISCOUNT`, `L_SHIPDate`, `L_SHIPMODE`  
+
+###Query Profile
+You can view the query plan on the Profiles page of the Drill Web UI, by 
selecting the query you want to evaluate and then selecting the Physical Plan 
page. The page displays the physical plan that Drill used to execute the query. 
 
+ 
+The following image shows the physical plan that Drill used to execute this 
simple equality query:  
+
+       SELECT L_SHIPDate FROM lineitem WHERE L_QUANTITY = 5; 
+
+![](https://i.imgur.com/DkZPAYJ.png)   
+
+In the plan, you can see that Drill scanned the index, `l_single_c_5`, instead 
of the primary table. The query was completely covered by the index because the 
index contains all the columns referenced in the query and the query filtered 
on the indexed column.  
+
+###EXPLAIN Command  
+Alternatively, you can issue the [EXPLAIN 
command]({{site.baseurl}}/docs/explain/) to see how Drill executes a query. You 
can see the chosen physical execution plan for a query without running the 
query, by issuing the [EXPLAIN PLAN FOR 
command]({{site.baseurl}}/docs/explain/#explain-for-physical-plans). The output 
of the command shows you if Drill plans to use the index when executing the 
query, as shown:    
+
+       EXPLAIN PLAN FOR SELECT L_SHIPDate FROM lineitem WHERE L_QUANTITY = 5 
LIMIT 10;
+       +------+------+
+       | text | json |
+       +------+------+
+       | 00-00    Screen
+       00-01      Project(L_SHIPDate=[$0])
+       00-02        SelectionVectorRemover
+       00-03          Limit(fetch=[10])
+       00-04            Limit(fetch=[10])
+       00-05              Project(L_SHIPDate=[$1])
+       00-06                Scan(groupscan=[JsonTableGroupScan 
[ScanSpec=JsonScanSpec 
[tableName=hdfs:///drill/testdata/tpch/sf1/data/json/lineitem, 
condition=(L_QUANTITY = {"$numberLong":5}), indexName=l_single_c_5], 
columns=[`L_QUANTITY`, `L_SHIPDate`]]])  
+
+In the plan, you can see that Drill plans to use the index, `l_single_c_5`, 
instead of performing a full table scan. The query is completely covered by the 
index because the index contains all columns referenced in the query and the 
query filters on the indexed column.  
+
+###Comparing an Index-Based Plan to a Full Table Scan Plan 
+If you want to compare an index-based plan against a plan with a full table 
scan, disable the `planner.enable_index_planning` option in Drill, and run the 
EXPLAIN PLAN FOR command with the query. Running this command with the 
`planner.enable_index_planning` option disabled forces Drill to generate a plan 
that includes a full table scan. 
+
+You can compare the full table scan plan against the index-based plan to 
compare the costs and resource consumption of each plan.
+ 
+In the following example, the indexing feature is enabled, and Drill generated 
a plan using the index:  
+
+       EXPLAIN PLAN FOR SELECT L_SHIPDate FROM lineitem WHERE L_QUANTITY = 5 
LIMIT 10;
+       +------+------+
+       | text | json |
+       +------+------+
+       | 00-00    Screen
+       00-01      Project(L_SHIPDate=[$0])
+       00-02        SelectionVectorRemover
+       00-03          Limit(fetch=[10])
+       00-04            Limit(fetch=[10])
+       00-05              Project(L_SHIPDate=[$1])
+       00-06                Scan(groupscan=[JsonTableGroupScan 
[ScanSpec=JsonScanSpec 
[tableName=hdfs:///drill/testdata/tpch/sf1/data/json/lineitem, 
condition=(L_QUANTITY = {"$numberLong":5}), indexName=l_single_c_5], 
columns=[`L_QUANTITY`, `L_SHIPDate`]]])  
+
+Turning the option off, as shown:  
+
+       ALTER SESSION SET planner.enable_index_planning = false   
+ 
+And running the EXPLAIN PLAN FOR command again shows the plan with a full 
table scan:  
+
+       EXPLAIN PLAN FOR SELECT L_SHIPDate FROM lineitem WHERE L_QUANTITY = 5 
LIMIT 10;
+       +------+------+
+       | text | json |
+       +------+------+
+       | 00-00    Screen
+       00-01      Project(L_SHIPDate=[$0])
+       00-02        SelectionVectorRemover
+       00-03          Limit(fetch=[10])
+       00-04            UnionExchange
+       01-01              SelectionVectorRemover
+       01-02                Limit(fetch=[10])
+       01-03                  Project(L_SHIPDate=[$1])
+       01-04                    Scan(groupscan=[JsonTableGroupScan 
[ScanSpec=JsonScanSpec 
[tableName=hdfs:///drill/testdata/tpch/sf1/data/json/lineitem, 
condition=(L_QUANTITY = {"$numberLong":5})], columns=[`L_QUANTITY`, 
`L_SHIPDate`]]])
+        |  
+
+**Note:** To see the cost of each plan, go to the Drill Web UI and view the 
query profile for each EXPLAIN PLAN FOR command that you issue through the 
command line.
+
+
+
+
+
+ 
+
+
+
+
+
+
+
+   
\ No newline at end of file

[drill] branch gh-pages updated: add index content to docs

Reply via email to