This is an automated email from the ASF dual-hosted git repository.
bridgetb pushed a commit to branch gh-pages
in repository https://gitbox.apache.org/repos/asf/drill.git
The following commit(s) were added to refs/heads/gh-pages by this push:
new 2ac8a08 add index content to docs
2ac8a08 is described below
commit 2ac8a08da63e1d671b7a720e9760697c231ad2e9
Author: Bridget Bevens <[email protected]>
AuthorDate: Mon Dec 17 15:32:47 2018 -0800
add index content to docs
---
_docs/img/query-plan-verify.png | Bin 0 -> 113092 bytes
.../010-querying-indexes-introduction.md | 15 +++
...ies-that-qualify-for-index-based-query-plans.md | 59 ++++++++++
.../querying-indexes/021-types-of-indexes.md | 124 +++++++++++++++++++++
.../querying-indexes/030-index-selection.md | 23 ++++
.../040-designing-indexes-for-your-queries.md | 97 ++++++++++++++++
.../050-configuring-index-planning.md | 37 ++++++
.../querying-indexes/060-verifying-index-use.md | 103 +++++++++++++++++
8 files changed, 458 insertions(+)
diff --git a/_docs/img/query-plan-verify.png b/_docs/img/query-plan-verify.png
new file mode 100644
index 0000000..4d79109
Binary files /dev/null and b/_docs/img/query-plan-verify.png differ
diff --git
a/_docs/query-data/querying-indexes/010-querying-indexes-introduction.md
b/_docs/query-data/querying-indexes/010-querying-indexes-introduction.md
new file mode 100644
index 0000000..13f6d71
--- /dev/null
+++ b/_docs/query-data/querying-indexes/010-querying-indexes-introduction.md
@@ -0,0 +1,15 @@
+---
+title: "Querying Indexes Introduction"
+date: 2018-09-28 21:35:21 UTC
+parent: "Querying Indexes"
+---
+
+{% include startnote.html %}Currently, Drill only supports indexes for the
MapR-DB storage plugin. Examples in this document were created using Drill on
MapR Database. {% include endnote.html %}
+
+An index is a structure, defined on a table, that contains a subset of columns
from the table sorted by key values. Well-designed indexes eliminate full table
scans and optimize access to data to significantly improve performance.
Starting in Drill 1.15, Drill can leverage indexes (primary or secondary) in
data sources to create index-based query plans.
+
+{% include startnote.html %}You cannot create, update, or delete indexes or
tables through Drill. You must create and manage the indexes and tables in the
data sources themselves. Also, the data source must be configured as a storage
plugin in Drill.{% include endnote.html %}
+
+An index-based query plan is a query plan that uses indexes versus full tables
scans to access data. When you submit a query through Drill, the query planner
in Drill evaluates the query and compares the cost of multiple query plans to
find an optimal plan with the lowest cost. Index planning is enabled by
default; however, you can disable the feature or configure the behavior of
index planning in Drill through several options.
+
+Certain types of queries qualify for index-based query plans. You can create
indexes and write queries such that the query planner in Drill leverages the
indexes to create an index-based query plan for optimized performance. Drill
can leverage indexes to create covering, non-covering, and functional index
plans. You can verify that the query planner generated an index-based query
plan in the query profile (Drill Web UI) or from the command line using the
[EXPLAIN PLAN FOR]({{site.baseurl [...]
diff --git
a/_docs/query-data/querying-indexes/020-queries-that-qualify-for-index-based-query-plans.md
b/_docs/query-data/querying-indexes/020-queries-that-qualify-for-index-based-query-plans.md
new file mode 100644
index 0000000..09f5420
--- /dev/null
+++
b/_docs/query-data/querying-indexes/020-queries-that-qualify-for-index-based-query-plans.md
@@ -0,0 +1,59 @@
+---
+title: "Queries that Qualify for Index-Based Query Plans"
+date: 2018-09-28 21:35:21 UTC
+parent: "Querying Indexes"
+---
+
+In most cases, the planner chooses an index-based query plan for queries with
[WHERE clause]({{site.baseurl}}/docs/where-clause/) filters and sort-based
operations, such as [ORDER BY]({{site.baseurl}}/docs/order-by-clause/), [GROUP
BY]({{site.baseurl}}/docs/group-by-clause/), and
[joins]({{site.baseurl}}/docs/from-clause/#join-types).
+
+The following types of queries and conditions typically qualify for
index-based query plans:
+
+
+- Equality conditions (includes IN)
+- Range conditions (includes LIKE pattern matching)
+- ORDER BY queries
+- GROUP BY queries
+- JOIN queries
+- Projection queries (including DISTINCT)
+
+Drill can use indexes for queries that GROUP BY or ORDER BY the leading
columns in an index. When a query contains GROUP BY and ORDER BY operations on
the leading indexed column, Drill uses the sort order of the index to create
index-based query plans with streaming aggregates and merge joins to improve
query performance. Drill does not use indexes for queries that GROUP BY or
ORDER BY the trailing or included columns in an index.
+If a query contains conditions on multiple columns (multi-index query), Drill
can scan multiple indexes and use the index intersection to optimize the query.
+
+{% include startnote.html %}Each column added to an index increases the
storage requirement for the index. As the storage size increases, the cost of
reading the index also increases. Likewise, for the cost of adding and updating
data. Consider the impact on storage and updates when including columns in an
index.{% include endnote.html %}
+
+##Queries Without Filters
+
+Drill can create index-based query plans for the following types of queries
without filters (queries that do not have a WHERE clause):
+
+
+- ORDER BY queries, as shown in the following example where L_LINENUMBER is an
indexed column in the index selected for the query plan:
+
+ SELECT L_LINENUMBER FROM lineitem ORDER BY L_LINENUMBER;
+
+- GROUP BY queries, as shown in the following example where L_COMMITDate is an
indexed column in the index selected for the query plan:
+
+ SELECT L_COMMITDate FROM lineitem GROUP BY L_COMMITDate;
+
+- JOIN queries, as shown in the following example where L_ORDERKEY and
O_ORDERKEY are indexed columns and L_LINESTATUS is an included column in the
index selected for the query plan:
+
+ SELECT L.L_LINESTATUS FROM lineitem L, orders O WHERE
L.L_ORDERKEY=O.O_ORDERKEY;
+
+
+ **Note:** If the planner picks two indexes, one for lineitem and one
for orders, a sort merge join is used instead of a hash join.
+
+- Queries with DISTINCT projections, as shown in the following examples where
L_LINENUMBER is an indexed column in the index selected for the query plan:
+
+ SELECT DISTINCT L_LINENUMBER FROM lineitem;
+ SELECT COUNT(DISTINCT L_LINENUMBER) FROM lineitem;
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/_docs/query-data/querying-indexes/021-types-of-indexes.md
b/_docs/query-data/querying-indexes/021-types-of-indexes.md
new file mode 100644
index 0000000..0208f66
--- /dev/null
+++ b/_docs/query-data/querying-indexes/021-types-of-indexes.md
@@ -0,0 +1,124 @@
+---
+title: "Types of Indexes"
+date: 2018-09-28 21:35:21 UTC
+parent: "Querying Indexes"
+---
+
+The query planner in Drill can leverage indexes to create index-based query
plans for improved query performance. An index is either covering or
non-covering based on the columns referenced in the query and the columns in
the index. Indexes for queries with ORDER BY may be hashed or non-hashed,
affecting the sort order of the indexes. Drill also supports indexes on the
CAST function.
+
+The following sections describe these indexes in more detail and show examples
of query plans generated for each.
+
+{% include startnote.html %}Currently, Drill works with the types of indexes
supported by MapR Database. The following sections demonstrate how Drill works
with indexes in the MapR Database.{% include endnote.html %}
+
+##Covering Index
+A covering index is an index that “covers” all the columns referenced in a
query. Only the index is needed to process the query. Covering indexes avoid
the overhead of fetching data from the primary table. A covering index can
include indexed and non-indexed (included) columns.
+
+For example, suppose an index is created and queried, as follows:
+
+ index keys:{a}, included columns: {b} and the query is SELECT a, b FROM
T WHERE a > 10 AND b < 20.
+
+Since columns a, b are present in the index, this is a covering index. The
Drill planner generates a covering index plan (index-only plan) where all the
columns are retrieved from the index after pushing down relevant filter
conditions to the index scan.
+
+###Covering Index Example
+
+This example uses an index, l_comp_1, created on a table named lineitem with
indexed columns L_LINENUMBER and L_ORDERKEY and also includes columns
L_LINESTATUS and L_QUANTITY.
+
+The following query references the L_LINESTATUS, L_QUANTITY, L_LINENUMBER, and
L_ORDERKEY columns in the lineitem table:
+
+ SELECT L_LINESTATUS, L_QUANTITY FROM lineitem WHERE L_LINENUMBER = 1
AND L_ORDERKEY BETWEEN 40 AND 75;
+
+Because the l_comp_1 index includes all columns referenced in the query, the
query planner in Drill creates a covering index plan.
+
+Running the EXPLAIN PLAN FOR command with the query shows that Drill created a
query plan that only uses the index to process the query:
+
+ EXPLAIN PLAN FOR SELECT L_LINESTATUS, L_QUANTITY FROM lineitem WHERE
L_LINENUMBER = 1 AND L_ORDERKEY BETWEEN 40 AND 75;
+
+ 00-00 Screen
+ 00-01 Project(L_LINESTATUS=[$0], L_QUANTITY=[$1])
+ 00-02 Scan(table=[[si, tpch_sf1_maprdb_range, lineitem]],
groupscan=[JsonTableGroupScan [ScanSpec=JsonScanSpec
[tableName=maprfs:///drill/testdata/tpch/sf1/maprdb/json/range/lineitem,
condition=(((L_LINENUMBER = {"$numberLong":1}) and (L_ORDERKEY >=
{"$numberLong":40})) and (L_ORDERKEY <= {"$numberLong":75})),
indexName=l_comp_1], columns=[`L_LINESTATUS`, `L_QUANTITY`]]])
+
+Reading the query plan, you can see that the plan includes an index scan, as
indicated by `groupscan=[JsonTableGroupScan` and `indexName=l_comp_1`. Drill
and the data source can process this query using only the index.
+
+##Non-Covering Index
+A non-covering index is an index that does not “cover” all the columns
referenced in a query. A non-covering index has indexed and/or non-indexed
(included) columns that only partially cover the columns referenced in a query.
A non-covering query plan includes an index scan and a join back to the primary
table. In some scenarios, a full table scan is more cost efficient than an
index scan and Drill will not create an index-based query plan.
+
+For example, suppose an index is created and queried, as follows:
+
+ index keys:{a, b}, included columns: {c} and the query is SELECT d, e
FROM T WHERE a > 10 AND b < 20.
+
+Since columns d, e are not present in the index, this is a non-covering index.
For such indexes, the Drill planner generates a non-covering index plan where
only the row ids are fetched from the index by pushing down the WHERE clause
filters and the rest of the columns are fetched after a join-back to the
primary table. The join-back is performed using the row ids.
+
+###Non-Covering Index Example
+This example uses an index, l_comp_1, created on a table named lineitem with
indexed columns L_LINENUMBER and L_ORDERKEY and also included columns
L_LINESTATUS and L_QUANTITY.
+
+The following query references the L_RETURNFLAG, L_LINESTATUS, L_QUANTITY
L_LINENUMBER, and L_ORDERKEY columns in the lineitem table:
+
+ SELECT L_RETURNFLAG, L_LINESTATUS, L_QUANTITY FROM lineitem WHERE
L_LINENUMBER = 1 AND L_ORDERKEY BETWEEN 40 AND 75;
+
+Because the l_comp_1 index does not include the L_RETURNFLAG column, the query
planner in Drill creates a non-covering index plan that uses the index, but
also includes a join on the primary table.
+
+Running the EXPLAIN PLAN FOR command with the query shows that Drill includes
an index scan and a table scan with a rowkey join:
+
+ EXPLAIN PLAN FOR SELECT L_RETURNFLAG, L_LINESTATUS, L_QUANTITY FROM
lineitem WHERE L_LINENUMBER = 1 AND L_ORDERKEY BETWEEN 40 AND 75;
+
+ 00-00 Screen
+ 00-01 Project(L_RETURNFLAG=[$0], L_LINESTATUS=[$1],
L_QUANTITY=[$2])
+ 00-02 Project(L_RETURNFLAG=[$2], L_LINESTATUS=[$3],
L_QUANTITY=[$4])
+ 00-03 Project(L_LINENUMBER=[$0], L_ORDERKEY=[$1],
L_RETURNFLAG=[$2], L_LINESTATUS=[$3], L_QUANTITY=[$4])
+ 00-04 RowKeyJoin(condition=[=($5, $6)], joinType=[inner])
+ 00-06 Scan(table=[[si, tpch_sf1_maprdb_range, lineitem]],
groupscan=[RestrictedJsonTableGroupScan [ScanSpec=JsonScanSpec
[tableName=maprfs:///drill/testdata/tpch/sf1/maprdb/json/range/lineitem,
condition=(((L_LINENUMBER = {"$numberLong":1}) and (L_ORDERKEY >=
{"$numberLong":40})) and (L_ORDERKEY <= {"$numberLong":75}))],
columns=[`L_LINENUMBER`, `L_ORDERKEY`, `L_RETURNFLAG`, `L_LINESTATUS`,
`L_QUANTITY`, `_id`], rowcount=60012.15000000001]])
+ 00-05 Scan(table=[[si, tpch_sf1_maprdb_range, lineitem]],
groupscan=[JsonTableGroupScan [ScanSpec=JsonScanSpec
[tableName=maprfs:///drill/testdata/tpch/sf1/maprdb/json/range/lineitem,
condition=(((L_LINENUMBER = {"$numberLong":1}) and (L_ORDERKEY >=
{"$numberLong":40})) and (L_ORDERKEY <= {"$numberLong":75})),
indexName=l_comp_1], columns=[`_id`]]])
+
+
+Reading the non-covering index plan, you can see that the plan includes an
index scan, as indicated by the `groupscan=[JsonTableGroupScan` and
`indexName=l_comp_1`, and also a scan on the primary table, as indicated by the
`groupscan=[RestrictedJsonTableGroupScan` and the `RowKeyJoin`. To process this
query, Drill can use the index with the data source, but the data source must
also use the rowkey join on the primary table to fetch data in the L_RETURNFLAG
column. Note that this is not a [...]
+
+If this query ran on a regular basis, you could remove the l_comp_1 index and
create a new index that includes all columns referenced in the query, including
the L_RETURNFLAG column, to improve query performance. However, running a query
only once or a few times may not justify the overhead of removing the old index
and creating a new index.
+
+##Non-Hashed Indexes
+Non-hashed indexes support conditional queries with an ORDER BY clause. When
processing ORDER BY queries, Drill does not have to perform sort operations on
the data.
+
+###Non-Hashed Index Plan Example
+
+A non-hashed index, l_comp_1, was created on a table, lineitem with indexed
columns L_LINENUMBER and L_ORDERKEY and included columns L_LINESTATUSand
L_QUANTITY.
+
+Running the example query with the EXPLAIN PLAN FOR command shows that Drill
produces an index plan without the additional sort and merge operations when
using the non-hashed index to process the query, as follows:
+
+ EXPLAIN PLAN FOR SELECT L_LINESTATUS, L_QUANTITY FROM lineitem WHERE
L_LINENUMBER = 1 AND L_ORDERKEY BETWEEN 40 AND 75 ORDER BY L_LINENUMBER;
+
+ 00-00 Screen
+ 00-01 Project(L_LINESTATUS=[$0], L_QUANTITY=[$1])
+ 00-02 Project(L_LINESTATUS=[$2], L_QUANTITY=[$3],
L_LINENUMBER=[$0])
+ 00-03 Scan(table=[[si, tpch_sf1_maprdb_range, lineitem]],
groupscan=[JsonTableGroupScan [ScanSpec=JsonScanSpec
[tableName=maprfs:///drill/testdata/tpch/sf1/maprdb/json/range/lineitem,
condition=(((L_LINENUMBER = {"$numberLong":1}) and (L_ORDERKEY >=
{"$numberLong":40})) and (L_ORDERKEY <= {"$numberLong":75})),
indexName=l_comp_1], columns=[`L_LINENUMBER`, `L_ORDERKEY`, `L_LINESTATUS`,
`L_QUANTITY`]]])
+
+Reading the query plan, you can see that Drill uses the non-hashed index plan,
as indicated by `indexName=l_comp_1`. To process the query, the data source
uses the index and Drill does not have to perform sort and merge operations on
the data, as indicated by the absence of the Sort and SingleMergeExchange
operations in the query plan. The data source sorted the data in the index when
the index was created.
+
+##Hashed Indexes
+Hashed indexes support the same conditional queries as non-hashed indexes, but
they do not have a guaranteed sort order. Hashed indexes enable the data source
to evenly distribute new writes on an index across logical partitions to avoid
hot spotting. Drill must perform a sort for ORDER BY queries that use hashed
indexes. Sorting the data can increase the CPU costs and negatively impact
performance.
+
+If you notice performance issues with ORDER BY queries that use hashed
indexes, review the query plans to see if the plans include sort and merge
operations. If this is the case, create non-hashed indexes to support the
queries and achieve the best performance.
+
+###Hashed Index Plan Example
+A hashed index, l_hash_comp_1, was created with indexed columns L_LINENUMBER
and L_ORDERKEY and included columns L_LINESTATUS and L_QUANTITY.
+
+Running the example query with the EXPLAIN PLAN FOR command shows that Drill
produces an index plan with sort and merge operations to process the query when
using the hashed index, as follows:
+
+ EXPLAIN PLAN FOR SELECT L_LINESTATUS, L_QUANTITY FROM lineitem WHERE
L_LINENUMBER = 1 AND L_ORDERKEY BETWEEN 40 AND 75 ORDER BY L_LINENUMBER;
+
+ 00-00 Screen
+ 00-01 Project(L_LINESTATUS=[$0], L_QUANTITY=[$1])
+ 00-02 SingleMergeExchange(sort0=[2])
+ 01-01 SelectionVectorRemover
+ 01-02 Sort(sort0=[$2], dir0=[ASC])
+ 01-03 Project(L_LINESTATUS=[$2], L_QUANTITY=[$3],
L_LINENUMBER=[$0])
+ 01-04 Scan(table=[[si, tpch_sf1_maprdb_hash, lineitem]],
groupscan=[JsonTableGroupScan [ScanSpec=JsonScanSpec
[tableName=maprfs:///drill/testdata/tpch/sf1/maprdb/json/hash/lineitem,
condition=(((L_LINENUMBER = {"$numberLong":1}) and (L_ORDERKEY >=
{"$numberLong":40})) and (L_ORDERKEY <= {"$numberLong":75})),
indexName=l_hash_comp_1], columns=[`L_LINENUMBER`, `L_ORDERKEY`,
`L_LINESTATUS`, `L_QUANTITY`]]])
+
+Reading the query plan, you can see that Drill uses the hashed index in the
plan, as indicated by `indexName=l_hash_comp_1`. To process the query, the data
source can use the index, but Drill must sort and merge the data, as indicated
by the `Sort` and `SingleMergeExchange` operations in the query plan.
+Using the hashed index plan for this ORDER BY query requires additional
processing and negatively impacts performance.
+
+##Functional Index
+A functional index is an index created on functions or expressions instead of
columns in a table. CAST functions are most commonly used in Drill views. For
example, if the filter condition is 'WHERE CAST(zip_code as BIGINT) = 95120'
and a functional index exists on CAST(zip_code as BIGINT), the query planner
will leverage the index.
+
+When issuing Drill queries through BI tools, you can include CAST functions in
your queries to create [Drill views]({{site.baseurl}}/docs/create-view/).
Including CAST functions provides the metadata needed to optimally process the
queries. For more information about using the CAST function with Drill, see
[Data Type Conversion]({{site.baseurl}}/docs/data-type-conversion/).
+
+{% include startnote.html %}MapR Database supports functional indexes. The
query planner in Drill can use the functional index to optimize queries with
CAST functions.{% include endnote.html %}
+
\ No newline at end of file
diff --git a/_docs/query-data/querying-indexes/030-index-selection.md
b/_docs/query-data/querying-indexes/030-index-selection.md
new file mode 100644
index 0000000..586a230
--- /dev/null
+++ b/_docs/query-data/querying-indexes/030-index-selection.md
@@ -0,0 +1,23 @@
+---
+title: "Index Selection"
+date: 2018-09-28 21:35:21 UTC
+parent: "Querying Indexes"
+---
+
+Drill leverages indexes during the physical planning phase of the query. The
query planner in Drill evaluates the cost of multiple query plans and then
selects the plan with the lowest cost for query execution. The planner uses
CPU, memory, and network I/O estimates to calculate cost. Based on these costs,
the query planner evaluates selectivity, collation (sortedness of data), and
the type of index. For each candidate index, the query planner estimates the
total cost of accessing the index.
+
+The query planner can create three types of index plans: covering,
non-covering, and index intersection. By default, the planner evaluates the top
five indexes per table, though this number is configurable. The indexes chosen
may also include an index intersection. For example, two indexes may not
qualify for an index plan based on their individual selectivity, but their
combined selectivity after intersection could be low enough to qualify.
+
+##Selectivity
+Selectivity is the estimated number of rows based on the selectivity of each
conditional expression in the WHERE clause, calculated as (output row
count)/(total table row count). For example, if a table has 100 rows and 25 of
the rows qualify the filter condition, the selectivity is .25. Selectivity
ranges between 0 and 1. The closer to 0, the more selective the filter. The
more selective a filter, the lower the cost. High filter selectivity results in
better query performance. If filter [...]
+
+The data source provides Drill with the estimated number of rows that match a
filter filter condition and the average row size. The data source uses a filter
condition, such as WHERE a > 10 AND b < 20, to return the estimated row count
of the filter condition based on the leading prefix of the index columns. For
example, if the index columns are the composite key {a, b}, the leading prefix
of the filter condition is {a, b}, and the row count of the conjunct (number of
rows that meet both [...]
+
+##Selectivity of Covering and Non-Covering Indexes
+For a covering index, the query planner always generates a covering index
plan, even if the estimated selectivity is 100%. The planner always expects an
index-only plan to be cheaper than a plan with a full table scan due to the
smaller row widths in an index.
+
+For a non-covering index, the query planner estimates the cost of a join-back
to the primary table. Due to the random I/O nature of the rowkey join-back to
primary table, the default selectivity threshold is small: 2.5% (.025). You can
configure the default selectivity threshold for non-covering indexes through
the `planner.index.noncovering_selectivity_threshold` option.
+
+If the estimated selectivity of the filter condition is above this threshold,
the query planner does not generate a non-covering index plan for that index;
the rationale is that each new plan adds to the search space and increases
planning time. If the estimated row count is already high, the plan is unlikely
to be chosen anyway, and therefore better to prune the plan out early.
+
+
\ No newline at end of file
diff --git
a/_docs/query-data/querying-indexes/040-designing-indexes-for-your-queries.md
b/_docs/query-data/querying-indexes/040-designing-indexes-for-your-queries.md
new file mode 100644
index 0000000..0f5ac40
--- /dev/null
+++
b/_docs/query-data/querying-indexes/040-designing-indexes-for-your-queries.md
@@ -0,0 +1,97 @@
+---
+title: "Designing Indexes for Your Queries"
+date: 2018-09-28 21:35:21 UTC
+parent: "Querying Indexes"
+---
+
+Design indexes that support your queries for maximum performance benefits. Use
common query patterns that involve filters and ordering to define indexes.
Weigh the benefits of indexes against their update and storage costs and take
into consideration any index limitations.
+
+##Identify Query Patterns
+Query patterns, such as queries with filter conditions and ORDER BY clauses,
indicate where indexes can improve performance. If a query does not contain
selective filters, the overhead of using an index may cost more than a full
table scan. You should also define your indexes such that a single index
benefits either multiple queries or individual queries that you run most often.
+
+###Determine Potential Indexes Based on Query Patterns
+The following table describes the types and characteristics of indexes you
might want to create based on some example query patterns:
+
+| **Identified Query Pattern**
| **Potential Indexes to Create**
|
+|---------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Compares individual columns with selective filter conditions
| Define single column indexes on the columns that you compare
against. Verify that the columns contain supported data types.
|
+| Filters against specific combinations of columns
| Define composite column indexes instead of single column
indexes. Specify the sequence of the index keys so columns that appear in
equality conditions are the prefixes in the keys.
|
+| Accesses a subset of columns in a document, but does not filter or sort on
these columns | Add those columns as included columns in indexes.
|
+| Filters on a sub-column in a nested document
| Define the index key on the sub-column.
|
+| Filters on sub-columns in nested documents that are array elements
| Define the index key using a container column path: for
example, arraycolumn[].subcolumn.
|
+| Filters and projects using a container column path
| Define the container column path as both an indexed column
and included column.
|
+| Filters on individual elements of an array, which can appear in any
position in the array | Define an index using a container column path: for
example, arraycolumn[].
|
+| Issues Drill SQL queries with filter conditions that contain CAST
expressions | Specify the CAST function when defining the index
key.
|
+| Sorts on columns
| Define the sequence and order direction of the index keys to
match the sequence and order direction of the columns your query sorts. If
the sort order of the index keys matches the insertion order of documents,
define hashed indexes. |
+| Sorts on one set of columns and filters on another set using equality
conditions | Define a composite index so that columns using equality
conditions are the prefixes in the index keys, followed by the sort columns.
|
+
+##Evaluate Trade-Offs and Limitations
+
+When designing indexes for optimization, consider the following trade-offs and
limitations:
+
+###Synchronizing Indexes
+When you design your indexes, remember that the data source must synchronize
each index when you insert and update columns in the table. This impacts the
throughput performance of inserts and updates because the data source must
perform additional writes. The impact increases with each additional index.
+
+###Index Storage Requirements
+Consider the storage costs when creating indexes and deciding on the columns
to add to the index. Indexes increase your storage requirements. The storage
size depends on the number of indexed and included columns in the index and the
size of values stored in those columns. As the size of the index increases, the
cost of reading the index also increases.
+
+###Index Restrictions
+When designing your indexes, make sure the indexes support the functionality
you need.
+
+**Examples**
+
+The following examples illustrate the concepts behind index design, though
they do not account for sizing, storage, and updates. Always weigh the benefits
of indexes against these other requirements.
+
+Suppose you have the following customer data in JSON format:
+
+ {
+ "_id": "10000",
+ "FullName": {
+ "LastName": "Smith",
+ "FirstName": "John"
+ },
+ "Address": {
+ "Street": "123 SE 22nd St.",
+ "City": "Oakland",
+ "State": "CA",
+ "Zipcode": "94601-1001"
+ },
+ "Gender": "M",
+ "AccountBalance": 999.99,
+ "Email": "[email protected]",
+ "Phones": [
+ {"Type": "Home", "Number": "555-555-1234"},
+ {"Type": "Mobile", "Number": "555-555-5678"},
+ {"Type": "Work", "Number": "555-555-9012"}
+ ],
+ "Hobbies": ["Baseball", "Cooking", "Reading"],
+ "DateOfBirth": "10/1/1985"
+ }
+
+The following table contains columns in the document that are candidates for
indexing based on the sample queries:
+
+| Query # | Query
| Candidate columns
for Indexing |
+|------------|---------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------|
+| 1 | Find all customers who were born in the 1970s.
| DateOfBirth
|
+| 2 | Find all customers who have an account balance greater than
$10K. Order the information in descending order of balance. | AccountBalance
|
+| 3 | List customers who live in California, ordering the list by
LastName, FirstName. | Address.State
FullName.LastName FullName.FirstName |
+| 4 | Find the ids and emails of customers who live in a specific
zip code. | Address.Zip
|
+| 5 | Find customers who live in a specific set of states and have
an account balance less than a specific value. | Address.State
AccountBalance |
+| 6 | Find male customers with the last name starting with the
letter "S". | Gender
FullName.LastName |
+| 7 | Find all customers who have a mobile phone number with a
prefix of "650". | Phones[].Type
Phones[].Number |
+
+The following table contains indexes you might create to optimize the queries
listed in the previous table and the reasons for doing so:
+
+| Index
| Rationale
|
+|----------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Simple index on DateOfBirth
| Optimizes the range condition on DateOfBirth in query
1. You need not create a hashed index, because it is unlikely that the
order of DateOfBirth correlates with the insert order of new data.
|
+| Simple index on AccountBalance, specified as a descending key
| Optimizes the range condition on AccountBalance in
query 2. Descending order of key meets the ordering criteria in query 2.
Also optimizes the range condition on AccountBalance in query 5 in
combination with the index on Address.State. |
+| Composite index on: •Address.State •FullName.LastName
•FullName.FirstName | Optimizes both the equality condition on
Address.State and ordering in query 3. Inclusion of the name columns in the
index meets query 3 ordering. Also optimizes the IN condition in query 5
when used in combination with the index on AccountBalance |
+| Simple index with: •Indexed column on Address.Zip •Included
columns on Id and Email | Optimizes the equality condition on
Address.Zip in query 4. Adding the included columns avoids reading the JSON
table in query 4.
|
+| Composite index on: •Gender •FullName.LastName
| Optimizes equality condition on Gender and pattern
matching condition on FullName.LastName for query 6. Specifying Gender as
the leading key in combination with FullName.LastName results in more
selective index lookups for query 6. |
+| Simple index on Hobbies[]
| Optimizes the equality condition on array elements of
Hobbies in query 7: {"$eq":{"Hobbies[]":"Reading"}}
|
+| Composite index on: •Phones[].Type •Phones[].Number
| Optimizes the following two conditions in query 8:
•Equality condition on the Type subcolumn in nested documents in the Phones
array. •Pattern matching condition on the Number subcolumn in nested
documents in the Phones array. |
+
+
+
+
diff --git
a/_docs/query-data/querying-indexes/050-configuring-index-planning.md
b/_docs/query-data/querying-indexes/050-configuring-index-planning.md
new file mode 100644
index 0000000..a5d176f
--- /dev/null
+++ b/_docs/query-data/querying-indexes/050-configuring-index-planning.md
@@ -0,0 +1,37 @@
+---
+title: "Configuring Index Planning"
+date: 2018-09-28 21:35:21 UTC
+parent: "Querying Indexes"
+---
+
+Certain options control how Drill uses indexes when planning and executing
queries. You can set the query planning and execution options, at the system or
session level, using the ALTER SYSTEM|SET commands, as shown:
+
+ ALTER SYSTEM SET `planner.enable_index_planning` = 'true'
+ SET `planner.enable_index_planning` = 'false'
+
+Options set at the session level only apply to queries that you run during the
current Drill connection. Options set at the system level affect the entire
system and persist between restarts. Session level settings override system
level settings. Typically, you set the options at the session level unless you
want the setting to persist across all sessions.
+
+The following table lists the index planning and execution options that you
can enable, disable, or modify:
+
+**Note:** The planning option names are prepended with planner, for example
`planner.enable_index_planning`. The execution options are prepended with exec,
for example `exec.query.rowkeyjoin_batchsize`.
+
+| **Option** | **Description**
[...]
+|-----------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[...]
+| planner.enable_index_planning | Enables or disables
index planning
[...]
+| planner.index.force_sort_noncovering | Forces Drill to sort
for non-covering indexes. If the query has an ORDER-BY on index columns and a
non-covering index is chosen, by default Drill leverages the sortedness of
the index columns and does not sort. Fast changing primary table data may
produce a partial sort. This option forces a sort within Drill.
[...]
+| planner.enable_rowkeyjoin_conversion | Drill can push down
the rowkey filter to the data source during runtime. For a query to qualify
for runtime filter pushdown, the join condition must filter on a rowkey. For
example, SELECT t.mscIdentities FROM dfs.root.`/user/ted/MixTable` t WHERE
t.row_key IN (SELECT max(convert_fromutf8(i.KeyA.ENTRY_KEY)) FROM
dfs.root.`/user/ted/TableIMSI` i WHERE i.row_key='460021050005636') Drill
evaluates the results of the s [...]
+| planner.rowkeyjoin_conversion_selectivity_threshold | Sets the selectivity
(as a percentage) under which Drill uses a rowkey join for eligible queries.
[...]
+| planner.rowkeyjoin_conversion_using_hashjoin | When enabled, Drill
uses the hash join operator instead of a rowkey join.
[...]
+| planner.index.covering_selectivity_threshold | For covering
indexes, this option specifies the filter selectivity that corresponds to the
leading prefix of the index below which the index is considered for planning.
For example, for the filter ‘a > 10 AND b < 20’ if an index has indexed
columns (a, b, c) and the combined selectivity of the above condition is less
than the threshold, the index is considered for the query plan.
[...]
+| planner.index.noncovering_selectivity_threshold | For non-covering
indexes, this option specifies the filter selectivity that corresponds to the
leading prefix of the index below which the index is considered for planning.
[...]
+| planner.index.max_chosen_indexes_per_table | The maximum number
of “chosen” indexes for a table after index costing and ranking.
[...]
+| planner.index.rowkeyjoin_cost_factor | The cost factor that
provides some control over the I/O cost for non-covering indexes when the
rowkey join back to the primary table causes random I/O from the primary
table.
[...]
+| planner.enable_statistics | Enable or disable
statistics for the filter conditions on indexed columns.
[...]
+| exec.query.rowkeyjoin_batchsize | For batch GET
operations, this option specifies the batch size in terms of the number of
rowkeys. Used for non-covering index plans when doing joins back to primary
table.
[...]
+
+
+
+
+
+
+
diff --git a/_docs/query-data/querying-indexes/060-verifying-index-use.md
b/_docs/query-data/querying-indexes/060-verifying-index-use.md
new file mode 100644
index 0000000..2662c07
--- /dev/null
+++ b/_docs/query-data/querying-indexes/060-verifying-index-use.md
@@ -0,0 +1,103 @@
+---
+title: "Verifying Index Use"
+date: 2018-09-28 21:35:21 UTC
+parent: "Querying Indexes"
+---
+
+Evaluate query plans to analyze query performance and determine if Drill used
qualifying indexes. You can view query plans in the Drill Web UI or through the
command line using the EXPLAIN PLAN FOR command. Alternatively, you can disable
the `planner.enable_index_planning` option in Drill to see the query plan with
a full table scan and then compare the plan with the index-based plan.
+
+The following examples demonstrate how to view the query profile in the Drill
Web UI and the output of the EXPLAIN PLAN FOR command to determine if the query
planner in Drill selected an index-based query plan. There is also an example
that shows you how to disable index planning to compare a full table scan plan
with an index-based query plan.
+
+##Examples
+The subsequent sections assume that an index exists on a table named
"lineitem." The index, l_single_c_5, is a single column index created on the
L_QUANTITY column. The index also covers the L_SUPPKEY, L_DISCOUNT, L_SHIPDate,
and L_SHIPMODE columns. If a query contains columns covered by the index, the
query is a covering query. If a query contains columns not covered by the
index, the query is non-covering and requires a lookup back into the primary
table to retrieve data.
+
+The following list summarizes the assumptions:
+
+- **Table name**: `lineitem`
+- **Index name**: `l_single_c_5`
+- **Indexed column**: `L_QUANTITY`
+- **Included columns**: `L_SUPPKEY`, `L_DISCOUNT`, `L_SHIPDate`, `L_SHIPMODE`
+
+###Query Profile
+You can view the query plan on the Profiles page of the Drill Web UI, by
selecting the query you want to evaluate and then selecting the Physical Plan
page. The page displays the physical plan that Drill used to execute the query.
+
+The following image shows the physical plan that Drill used to execute this
simple equality query:
+
+ SELECT L_SHIPDate FROM lineitem WHERE L_QUANTITY = 5;
+
+
+
+In the plan, you can see that Drill scanned the index, `l_single_c_5`, instead
of the primary table. The query was completely covered by the index because the
index contains all the columns referenced in the query and the query filtered
on the indexed column.
+
+###EXPLAIN Command
+Alternatively, you can issue the [EXPLAIN
command]({{site.baseurl}}/docs/explain/) to see how Drill executes a query. You
can see the chosen physical execution plan for a query without running the
query, by issuing the [EXPLAIN PLAN FOR
command]({{site.baseurl}}/docs/explain/#explain-for-physical-plans). The output
of the command shows you if Drill plans to use the index when executing the
query, as shown:
+
+ EXPLAIN PLAN FOR SELECT L_SHIPDate FROM lineitem WHERE L_QUANTITY = 5
LIMIT 10;
+ +------+------+
+ | text | json |
+ +------+------+
+ | 00-00 Screen
+ 00-01 Project(L_SHIPDate=[$0])
+ 00-02 SelectionVectorRemover
+ 00-03 Limit(fetch=[10])
+ 00-04 Limit(fetch=[10])
+ 00-05 Project(L_SHIPDate=[$1])
+ 00-06 Scan(groupscan=[JsonTableGroupScan
[ScanSpec=JsonScanSpec
[tableName=hdfs:///drill/testdata/tpch/sf1/data/json/lineitem,
condition=(L_QUANTITY = {"$numberLong":5}), indexName=l_single_c_5],
columns=[`L_QUANTITY`, `L_SHIPDate`]]])
+
+In the plan, you can see that Drill plans to use the index, `l_single_c_5`,
instead of performing a full table scan. The query is completely covered by the
index because the index contains all columns referenced in the query and the
query filters on the indexed column.
+
+###Comparing an Index-Based Plan to a Full Table Scan Plan
+If you want to compare an index-based plan against a plan with a full table
scan, disable the `planner.enable_index_planning` option in Drill, and run the
EXPLAIN PLAN FOR command with the query. Running this command with the
`planner.enable_index_planning` option disabled forces Drill to generate a plan
that includes a full table scan.
+
+You can compare the full table scan plan against the index-based plan to
compare the costs and resource consumption of each plan.
+
+In the following example, the indexing feature is enabled, and Drill generated
a plan using the index:
+
+ EXPLAIN PLAN FOR SELECT L_SHIPDate FROM lineitem WHERE L_QUANTITY = 5
LIMIT 10;
+ +------+------+
+ | text | json |
+ +------+------+
+ | 00-00 Screen
+ 00-01 Project(L_SHIPDate=[$0])
+ 00-02 SelectionVectorRemover
+ 00-03 Limit(fetch=[10])
+ 00-04 Limit(fetch=[10])
+ 00-05 Project(L_SHIPDate=[$1])
+ 00-06 Scan(groupscan=[JsonTableGroupScan
[ScanSpec=JsonScanSpec
[tableName=hdfs:///drill/testdata/tpch/sf1/data/json/lineitem,
condition=(L_QUANTITY = {"$numberLong":5}), indexName=l_single_c_5],
columns=[`L_QUANTITY`, `L_SHIPDate`]]])
+
+Turning the option off, as shown:
+
+ ALTER SESSION SET planner.enable_index_planning = false
+
+And running the EXPLAIN PLAN FOR command again shows the plan with a full
table scan:
+
+ EXPLAIN PLAN FOR SELECT L_SHIPDate FROM lineitem WHERE L_QUANTITY = 5
LIMIT 10;
+ +------+------+
+ | text | json |
+ +------+------+
+ | 00-00 Screen
+ 00-01 Project(L_SHIPDate=[$0])
+ 00-02 SelectionVectorRemover
+ 00-03 Limit(fetch=[10])
+ 00-04 UnionExchange
+ 01-01 SelectionVectorRemover
+ 01-02 Limit(fetch=[10])
+ 01-03 Project(L_SHIPDate=[$1])
+ 01-04 Scan(groupscan=[JsonTableGroupScan
[ScanSpec=JsonScanSpec
[tableName=hdfs:///drill/testdata/tpch/sf1/data/json/lineitem,
condition=(L_QUANTITY = {"$numberLong":5})], columns=[`L_QUANTITY`,
`L_SHIPDate`]]])
+ |
+
+**Note:** To see the cost of each plan, go to the Drill Web UI and view the
query profile for each EXPLAIN PLAN FOR command that you issue through the
command line.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file