[incubator-pinot] branch master updated: Improving docs on index techniques under tuning pinot section. (#3964)

sunithabeeram Wed, 13 Mar 2019 08:36:15 -0700

This is an automated email from the ASF dual-hosted git repository.

sunithabeeram pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-pinot.git



The following commit(s) were added to refs/heads/master by this push:
     new 9e37800  Improving docs on index techniques under tuning pinot section. (#3964)
9e37800 is described below

commit 9e37800f2b2d33ba173e4bdf0ca03e1a7a126bf4
Author: Seunghyun Lee <[email protected]>
AuthorDate: Wed Mar 13 08:35:48 2019 -0700

    Improving docs on index techniques under tuning pinot section. (#3964)
---
 docs/conf.py                 |   4 +-
 docs/img/dictionary.png      | Bin 0 -> 80599 bytes
 docs/img/no-dictionary.png   | Bin 0 -> 111237 bytes
 docs/img/sorted-forward.png  | Bin 0 -> 70022 bytes
 docs/img/sorted-inverted.png | Bin 0 -> 89473 bytes
 docs/index_techniques.rst    | 109 +++++++++++++++++++++++++++++++++++++++----
 6 files changed, 102 insertions(+), 11 deletions(-)

diff --git a/docs/conf.py b/docs/conf.py
index e34396a..667bd8f 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -76,9 +76,9 @@ author = u'Pinot development team'
 # built documents.
 #
 # The short X.Y version.
-version = u'0.016'
+version = u'0.1.0'
 # The full version, including alpha/beta/rc tags.
-release = u'0.016'
+release = u'0.1.0'
 
 # The language for content autogenerated by Sphinx. Refer to documentation
 # for a list of supported languages.
diff --git a/docs/img/dictionary.png b/docs/img/dictionary.png
new file mode 100644
index 0000000..d4c3d20
Binary files /dev/null and b/docs/img/dictionary.png differ
diff --git a/docs/img/no-dictionary.png b/docs/img/no-dictionary.png
new file mode 100644
index 0000000..33877f7
Binary files /dev/null and b/docs/img/no-dictionary.png differ
diff --git a/docs/img/sorted-forward.png b/docs/img/sorted-forward.png
new file mode 100644
index 0000000..9c2e9a6
Binary files /dev/null and b/docs/img/sorted-forward.png differ
diff --git a/docs/img/sorted-inverted.png b/docs/img/sorted-inverted.png
new file mode 100644
index 0000000..9d607a6
Binary files /dev/null and b/docs/img/sorted-inverted.png differ
diff --git a/docs/index_techniques.rst b/docs/index_techniques.rst
index 51f0d1a..a3ade2a 100644
--- a/docs/index_techniques.rst
+++ b/docs/index_techniques.rst
@@ -24,19 +24,29 @@ Index Techniques
 ================
 
 Pinot currently supports the following index techniques, where each of them have their own advantages in different query
-scenarios.
+scenarios. By default, Pinot will use ``dictionary-encoded forward index`` for each column.
 
 Forward Index
 -------------
 
-Dictionary-Encoded Forward Index with Bit Compression
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Dictionary-Encoded Forward Index with Bit Compression (Default)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 For each unique value from a column, we assign an id to it, and build a dictionary from the id to the value. Then in the
 forward index, we only store the bit-compressed ids instead of the values.
 
 With few number of unique values, dictionary-encoding can significantly improve the space efficiency of the storage.
 
+The below diagram shows the dictionary encoding for two columns with ``integer`` and ``string`` types. As seen in the
+``colA``, dictionary encoding will save significant amount of space for duplicated values. On the other hand, ``colB``
+has no duplicated data. Dictionary encoding will not compress much data in this case where there are a lot of unique
+values in the column. For ``string`` type, we pick the length of the longest value and use it as the length for
+dictionary's fixed length value array. In this case, padding overhead can be high if there are a large number of unique
+values for a column.
+
+.. image:: img/dictionary.png
+
+
 Raw Value Forward Index
 ~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -45,13 +55,61 @@ In contrast to the dictionary-encoded forward index, raw value forward index dir
 Without the dictionary, the dictionary lookup step can be skipped for each value fetch. Also, the index can take
 advantage of the good locality of the values, thus improve the performance of scanning large number of values.
 
+A typical use case to apply raw value forward index is when the column has a large number of unique values and the
+dictionary does not provide much compression. As seen the above diagram for dictionary encoding, scanning values
+with a dictionary involves a lot of random access because we need to perform dictionary look up. On the other hand,
+we can scan values sequentially with raw value forward index and this can improve performance a lot when applied
+appropriately.
+
+.. image:: img/no-dictionary.png
+
+Raw value forward index can be configured for a table by setting it in the table config as
+
+.. code-block:: none
+
+    {
+        "noDictionaryColumns": [
+          "column_name",
+          ...
+        ],
+        ...
+    }
+
+
 Sorted Forward Index with Run-Length Encoding
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-On top of the dictionary-encoding, all the values are sorted, so sorted forward index has the advantages of both good
-compression and data locality.
+When a column is physically sorted, Pinot uses a sorted forward index with run-length encoding on top of the
+dictionary-encoding. Instead of saving dictionary ids for each document id, we store a pair of start and end
+document id for each value. (The below diagram does not include dictionary encoding layer for simplicity.)
+
+.. image:: img/sorted-forward.png
+
+
+Sorted forward index has the advantages of both good compression and data locality. Sorted forward index can
+also be used as inverted index.
+
+Sorted index can be configured for a table by setting it in the table config as
+
+.. code-block:: none
+
+    {
+        "sortedColumn": [
+          "memberId"
+        ],
+        ...
+    }
+
+Realtime server will sort data on ``sortedColumn`` when generating segment internally. For offline push, input data
+needs to be sorted before running Pinot segment conversion and push job.
+
+When applied correctly, one can find the following information on the segment metadata.
+
+.. code-block:: none
+
+    $ grep memberId <segment_name>/v3/metadata.properties | grep isSorted
+    column.memberId.isSorted = true
 
-Sorted forward index can also be used as inverted index.
 
 Inverted Index (only available with dictionary-encoded indexes)
 ---------------------------------------------------------------
@@ -59,12 +117,36 @@ Inverted Index (only available with dictionary-encoded indexes)
 Bitmap Inverted Index
 ~~~~~~~~~~~~~~~~~~~~~
 
-Pinot maintains a map from each value to a bitmap, which makes value lookup to be constant time.
+When inverted index is enabled for a column, Pinot maintains a map from each value to a bitmap, which makes value
+lookup to be constant time. When you have a column that is used for filtering frequently, adding an inverted index
+will improve the performance greatly.
+
+Inverted index can be configured for a table by setting it in the table config as
+
+.. code-block:: none
+
+    {
+        "invertedIndexColumns": [
+          "column_name"
+        ],
+        ...
+    }
+
 
 Sorted Inverted Index
 ~~~~~~~~~~~~~~~~~~~~~
-Because the values are sorted, the sorted forward index can directly be used as inverted index, with constant time
-lookup and good data locality.
+Sorted forward index can directly be used as inverted index, with ``log(n)`` time lookup and it can benefit from data locality.
+
+For the below example, if the query has a filter on ``memberId``, Pinot will perform binary search on ``memberId`` values
+to find the range pair of docIds for corresponding filtering value. If the query requires to scan values for other columns
+after filtering, values within the range docId pair will be located together; therefore, we can benefit a lot from data locality.
+
+.. image:: img/sorted-inverted.png
+
+Sorted index performs much better than inverted index; however, it can only be applied to one column. When the query performance
+with inverted index is not good enough and most of queries have a filter on a specific column (e.g. memberId), sorted index can
+improve the query performance.
+
 
 Advanced Index
 --------------
@@ -74,3 +156,12 @@ Star-Tree Index
 
 Unlike other index techniques which work on single column, Star-Tree index is built on multiple columns, and utilize the
 pre-aggregated results to significantly reduce the number of values to be processed, thus improve the query performance.
+
+
+Notes on Index Tuning
+---------------------
+
+If your use case is not site facing with a strict low latency requirement, inverted index will perform good enough for
+the most of use cases. We recommend to start with adding inverted index and if the query does not perform good enough,
+a user can consider to use more advanced indices such as sorted column and star-tree index.
+

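Illustration (not part of the commit): the dictionary-encoded forward index
described in the patch above can be sketched in a few lines of Python. This is
only a minimal model of the idea, not Pinot's implementation. The function name
``build_dictionary_forward_index``, the sample column ``col_a``, and the choice
to assign ids in sorted value order are all assumptions made for the example.

```python
def build_dictionary_forward_index(values):
    """Toy dictionary encoding for a single column.

    Returns (dictionary, dict_ids, bits_per_id), where dictionary maps
    id -> value by list position and dict_ids maps doc id -> dictionary id.
    """
    # Assumption for this sketch: ids are assigned in sorted value order.
    dictionary = sorted(set(values))
    value_to_id = {value: i for i, value in enumerate(dictionary)}

    # The forward index stores one small id per document instead of the value.
    dict_ids = [value_to_id[value] for value in values]

    # Bits needed to represent the largest id (at least one bit).
    bits_per_id = max(1, (len(dictionary) - 1).bit_length())
    return dictionary, dict_ids, bits_per_id


# Hypothetical column with many duplicated values.
col_a = [10, 10, 20, 10, 30, 20]
dictionary, dict_ids, bits_per_id = build_dictionary_forward_index(col_a)
print(dictionary, dict_ids, bits_per_id)
```

With three unique values, each document needs only two bits in the forward
index. A column with no duplicates gains nothing from the dictionary, which is
the case the patch gives for using a raw value forward index.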

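Illustration (not part of the commit): the sorted forward index and its use as
an inverted index, as described in the patch above, can be sketched as follows.
Each distinct value stores one (start, end) docId pair, and a filter is
answered by binary search over the distinct values. The function names
``build_sorted_index`` and ``lookup`` and the sample ``member_ids`` data are
hypothetical, and the dictionary layer is left out, as in the diagram the patch
refers to.

```python
from bisect import bisect_left


def build_sorted_index(sorted_values):
    """Run-length encode a physically sorted column.

    Returns (values, ranges): values holds each distinct value once, and
    ranges[i] is the inclusive (start_doc_id, end_doc_id) pair for values[i].
    """
    values, ranges = [], []
    for doc_id, value in enumerate(sorted_values):
        if values and values[-1] == value:
            start, _ = ranges[-1]
            ranges[-1] = (start, doc_id)  # extend the current run
        else:
            values.append(value)
            ranges.append((doc_id, doc_id))  # start a new run
    return values, ranges


def lookup(values, ranges, target):
    """Binary search for target; return its docId range or None."""
    i = bisect_left(values, target)
    if i < len(values) and values[i] == target:
        return ranges[i]
    return None


# Hypothetical sorted memberId column; docIds are the list positions.
member_ids = [1, 1, 1, 2, 2, 5, 5, 5, 5]
values, ranges = build_sorted_index(member_ids)
print(lookup(values, ranges, 5))  # inclusive docId range for memberId = 5
```

Because all matching documents sit in one contiguous docId range, a later scan
of other columns for those documents reads adjacent entries, which is where the
data locality benefit comes from.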
