[incubator-pinot.wiki] branch master updated: Updated Architecture (markdown)

jlli Tue, 05 Feb 2019 11:21:52 -0800

This is an automated email from the ASF dual-hosted git repository.

jlli pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-pinot.wiki.git



The following commit(s) were added to refs/heads/master by this push:
     new 88b0dcd  Updated Architecture (markdown)
88b0dcd is described below

commit 88b0dcdf70ac9f0a13b172d3eb020310a9dd695f
Author: Jialiang Li <[email protected]>
AuthorDate: Tue Feb 5 11:21:45 2019 -0800

    Updated Architecture (markdown)
---
 Architecture.md | 40 ++++++++++++++++++++--------------------
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/Architecture.md b/Architecture.md
index 537dd64..751a7aa 100644
--- a/Architecture.md
+++ b/Architecture.md
@@ -273,7 +273,7 @@ Apart from the above segment related metadata, we also store metadata for each c
 
 <td class="confluenceTd">
 
-column.<columnName>.cardinality
+column.${columnName}.cardinality
 
 </td>
 
@@ -283,7 +283,7 @@ column.<columnName>.cardinality
 
 <tr>
 
-<td class="confluenceTd"><span>column.<columnName>.totalDocs</span></td>
+<td class="confluenceTd"><span>column.${columnName}.totalDocs</span></td>
 
 <td class="confluenceTd">Number of documents in the segment</td>
 
@@ -291,7 +291,7 @@ column.<columnName>.cardinality
 
 <tr>
 
-<td colspan="1" class="confluenceTd"><span>column.<columnName>.dataType</span></td>
+<td colspan="1" class="confluenceTd"><span>column.${columnName}.dataType</span></td>
 
 <td colspan="1" class="confluenceTd">Data type of this column, INT, FLOAT,STRING etc</td>
 
@@ -299,7 +299,7 @@ column.<columnName>.cardinality
 
 <tr>
 
-<td colspan="1" class="confluenceTd"><span>column.<columnName>.bitsPerElement</span></td>
+<td colspan="1" class="confluenceTd"><span>column.${columnName}.bitsPerElement</span></td>
 
 <td colspan="1" class="confluenceTd">If dictionary encoding is applied, how many bits are needed to store each value</td>
 
@@ -307,7 +307,7 @@ column.<columnName>.cardinality
 
 <tr>
 
-<td colspan="1" class="confluenceTd"><span>column.<columnName>.lengthOfEachEntry</span></td>
+<td colspan="1" class="confluenceTd"><span>column.${columnName}.lengthOfEachEntry</span></td>
 
 <td colspan="1" class="confluenceTd">If the column type is String, this indicates the max length of character needed to store this value. Similar to varchar(100).</td>
 
@@ -315,7 +315,7 @@ column.<columnName>.cardinality
 
 <tr>
 
-<td colspan="1" class="confluenceTd"><span>column.<columnName>.columnType</span></td>
+<td colspan="1" class="confluenceTd"><span>column.${columnName}.columnType</span></td>
 
 <td colspan="1" class="confluenceTd">Type of the column Dimension, Metric, Time</td>
 
@@ -323,7 +323,7 @@ column.<columnName>.cardinality
 
 <tr>
 
-<td colspan="1" class="confluenceTd"><span>column.<columnName>.isSorted</span></td>
+<td colspan="1" class="confluenceTd"><span>column.${columnName}.isSorted</span></td>
 
 <td colspan="1" class="confluenceTd">Do the values for this column appear in sorted order in the segment</td>
 
@@ -331,7 +331,7 @@ column.<columnName>.cardinality
 
 <tr>
 
-<td colspan="1" class="confluenceTd"><span>column.<columnName>.hasNullValue</span></td>
+<td colspan="1" class="confluenceTd"><span>column.${columnName}.hasNullValue</span></td>
 
 <td colspan="1" class="confluenceTd">Can there be a null value for this column</td>
 
@@ -339,7 +339,7 @@ column.<columnName>.cardinality
 
 <tr>
 
-<td colspan="1" class="confluenceTd"><span>column.<columnName>.hasDictionary</span></td>
+<td colspan="1" class="confluenceTd"><span>column.${columnName}.hasDictionary</span></td>
 
 <td colspan="1" class="confluenceTd">Was a dictionary used to encode the data</td>
 
@@ -347,7 +347,7 @@ column.<columnName>.cardinality
 
 <tr>
 
-<td colspan="1" class="confluenceTd"><span>column.<columnName>.hasInvertedIndex</span></td>
+<td colspan="1" class="confluenceTd"><span>column.${columnName}.hasInvertedIndex</span></td>
 
 <td colspan="1" class="confluenceTd">Does this column have inverted index</td>
 
@@ -355,7 +355,7 @@ column.<columnName>.cardinality
 
 <tr>
 
-<td colspan="1" class="confluenceTd"><span>column.<columnName>.isSingleValues</span></td>
+<td colspan="1" class="confluenceTd"><span>column.${columnName}.isSingleValues</span></td>
 
 <td colspan="1" class="confluenceTd">Is this column single valued or multi valued</td>
 
@@ -363,7 +363,7 @@ column.<columnName>.cardinality
 
 <tr>
 
-<td colspan="1" class="confluenceTd"><span>column.<columnName>.maxNumberOfMultiValues</span></td>
+<td colspan="1" class="confluenceTd"><span>column.${columnName}.maxNumberOfMultiValues</span></td>
 
 <td colspan="1" class="confluenceTd">Applicable to Multi Value column.Max number of values per document</td>
 
@@ -371,7 +371,7 @@ column.<columnName>.cardinality
 
 <tr>
 
-<td colspan="1" class="confluenceTd"><span>column.<columnName>.totalNumberOfEntries</span></td>
+<td colspan="1" class="confluenceTd"><span>column.${columnName}.totalNumberOfEntries</span></td>
 
 <td colspan="1" class="confluenceTd">Applicable to Multi Value column. Total number of entries across all documents in the segment.</td>
 
@@ -389,32 +389,32 @@ column.<columnName>.cardinality
 
 <span style="line-height: 1.4285715;">Dictionary is used to encode the values in the column. Applying dictionary encoding can significantly reduce the data size. This is especially the case when the cardinality of the column is low (up to thousands). One option is to always used integer data type to encode the values. But in some cases the number of unique values are few and hence we use fixed bit encoding. For example, if the number of unique values is 3, we just need 2 bits to represen [...]
 
-While dictionary encoding saves space, it introduces additional look up cost during query processing. We need convert the dictionary id back to actual value at run time. This is known to cause significant over head when the number of looks up required is high.Look up cannot be hash map since hash map end up requiring additional memory. Linear scan on the dictionary to perform a look up will be very slow, one simple optimization is to sort the values and perform a binary search for the lo [...]
+While dictionary encoding saves space, it introduces additional look up cost during query processing. We need convert the dictionary id back to actual value at run time. This is known to cause significant over head when the number of looks up required is high. Look up cannot be hash map since hash map end up requiring additional memory. Linear scan on the dictionary to perform a look up will be very slow, one simple optimization is to sort the values and perform a binary search for the l [...]
 
 Dictionary does allow us to speed up the query processing, since dictionary allows us to know all the unique values for a column in a segment. We can skip the processing of a segment during predicate evaluation if the RHS of the predicate is absent in the dictionary.
 
-Currently, we generate one dictionary per segment, in future, we will explore the idea of maintaining a global dictionary across all segments. This will further reduce the dictionary over head and allows us convert look ups on dictionary into hash map look ups.
+Currently, we generate one dictionary per segment, in future, we will explore the idea of maintaining a global dictionary across all segments. This will further reduce the dictionary over head and allows us to convert look ups on dictionary into hash map look ups.
 
 ###### Forward Index (.sv.sorted.fwd)
 
 
-<div>Forward Index store the column value(single or multi) for a given document. Forward Indices are stored in a format that allows constant time look up given a document id. During query processing, the filter phase returns a bunch of document ids. In order to fetch the raw value corresponding a specific document Forward Index is used.</div>
+<div>Forward Index stores the column value(single or multi) for a given document. Forward Indices are stored in a format that allows constant time look up given a document id. During query processing, the filter phase returns a bunch of document ids. In order to fetch the raw value corresponding, a specific document Forward Index is used.</div>
 
 ###### Single Value Sorted Forward Index (.sv.sorted.fwd)
 
 
-<div>Its well known that sorting the columnar data on any column will dramatically speed up query execution. Most use cases chose one of the columns as their leading key. During index creation, Pinot sorts the data in every segment based on this leading key. Sorting the data allows us to significantly reduce the number size of the column. Lets say that there are 100 million rows in the segment and number of unique leading keys is only 1 million. Instead of storing 100 million values, we  [...]
+<div>It's well known that sorting the columnar data on any column will dramatically speed up query execution. Most use cases chose one of the columns as their leading key. During index creation, Pinot sorts the data in every segment based on this leading key. Sorting the data allows us to significantly reduce the number size of the column. Let's say that there are 100 million rows in the segment and number of unique leading keys is only 1 million. Instead of storing 100 million values, w [...]
 
 [[image2015-5-19 0-29-34.png]]
 
 ###### Single Value unsorted Forward Index (.sv.unsorted.fwd)
 
-If the values in a column are not sorted, we have the following possible optimizations
+If the values in a column are not sorted, we have the following possible optimizations:
 
 1.  Dictionary encoding if feasible. See previous section on when we apply dictionary encoding
 2.  <span style="line-height: 1.4285715;">Snappy or LZO or LZ4 or ZLIB compression</span>
 
-In the current version of Pinot, we only apply dictionary encoding that allows us to compress the data using Fixed bit encoding. In subsequent versions, we will evaluate other compression techniques such as snappy etc. While these compression techniques save space, there is additional over head to decompress them on the fly. The challenge here is to get the right trade-off between compressing the data and query latency
+In the current version of Pinot, we only apply dictionary encoding that allows us to compress the data using Fixed bit encoding. In subsequent versions, we will evaluate other compression techniques such as snappy etc. While these compression techniques save space, there is additional over head to decompress them on the fly. The challenge here is to get the right trade-off between compressing the data and query latency.
 
 ###### Multi Value Forward Index (.mv.fwd)
 
@@ -431,7 +431,7 @@ In some cases, the columns are multi-valued such as skill sets of a member. Whil
 
 Query Execution Phases
 
-*   Query Parsing: Pinot supports a slightly modified version of SQL which we refer to as PQL. PQL only supports a subset of SQL for example Pinot does not support Joins, nested sub queries etc. We use Antlr to parse the query into a parse tree. In this phase, all syntax validations are performed and default values are set for missing elements.
+*   Query Parsing: Pinot supports a slightly modified version of SQL which we refer to as PQL. PQL only supports a subset of SQL, for example Pinot does not support Joins, nested sub queries etc. We use Antlr to parse the query into a parse tree. In this phase, all syntax validations are performed and default values are set for missing elements.
 *   Logical Plan Phase: This phase takes in query parse tree and outputs a Logical Plan Tree. This phase is single threaded and is simple and constructs the appropriate logical plan operator tree based on the query type (selection, aggregation, group by etc) and metadata provided by the data source.
 *   Physical Plan Phase: This phase further optimizes the plan based on individual segment. The optimization applied in this phase can be different across various segments.
 *   Executor Service: Once we have per segment physical operator tree, executor service takes up the responsibility of scheduling the query processing tasks on each and every segment.
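
The dictionary passage touched by this diff describes three ideas: fixed-bit encoding sized by the number of unique values, a sorted dictionary searched with binary search, and skipping a segment when the right-hand side of a predicate is absent from the dictionary. The sketch below illustrates those ideas in Python. It is not Pinot's implementation; every function name is hypothetical, and treating a single-value column as needing 1 bit is an assumption made only to keep the example simple.

```python
from bisect import bisect_left


def build_dictionary(values):
    """Return (sorted unique values, dictionary id for each document)."""
    dictionary = sorted(set(values))
    # Every value is present in the dictionary, so bisect_left yields its id.
    ids = [bisect_left(dictionary, v) for v in values]
    return dictionary, ids


def bits_per_element(cardinality):
    """Bits needed to store ids in [0, cardinality) with fixed-bit encoding.

    Assumption: at least 1 bit is used even when cardinality is 1.
    """
    return max(1, (cardinality - 1).bit_length())


def lookup_id(dictionary, value):
    """Binary search on the sorted dictionary; -1 when the value is absent."""
    i = bisect_left(dictionary, value)
    if i < len(dictionary) and dictionary[i] == value:
        return i
    return -1


def can_skip_segment(dictionary, rhs):
    """An equality predicate on a value missing from the dictionary
    cannot match any document in this segment."""
    return lookup_id(dictionary, rhs) == -1


column = ["us", "in", "us", "uk", "in", "us"]
dictionary, ids = build_dictionary(column)
print(dictionary, ids, bits_per_element(len(dictionary)))
print(can_skip_segment(dictionary, "fr"))
```

With three unique values the ids fit in 2 bits, matching the example in the passage. Converting an id back to its value is a plain index into the sorted list (`dictionary[ids[3]]`), while the value-to-id direction costs a binary search, which is the look-up overhead the passage refers to.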


