This is an automated email from the ASF dual-hosted git repository.
mcvsubbu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-pinot.git
The following commit(s) were added to refs/heads/master by this push:
new ab203a9 [PINOT-7658] Moving design documents to cwiki (#3802)
ab203a9 is described below
commit ab203a9573771a14640560f6010cd9ae646614d8
Author: Subbu Subramaniam <[email protected]>
AuthorDate: Thu Feb 7 16:36:57 2019 -0800
[PINOT-7658] Moving design documents to cwiki (#3802)
* [PINOT-7658] Moving design documents to cwiki
See https://cwiki.apache.org/confluence/display/PINOT/Home
* Fixing design doc references in documentation
---
docs/High-level-stream.png | Bin 39800 -> 0 bytes
docs/Low-level-stream.png | Bin 37306 -> 0 bytes
docs/PlanNode.png | Bin 41056 -> 0 bytes
docs/ServerSegmentCompletion.dot.png | Bin 72968 -> 0 bytes
docs/architecture.rst | 2 +-
docs/commit-happy-path-1.png | Bin 22933 -> 0 bytes
docs/commit-happy-path-2.png | Bin 22939 -> 0 bytes
docs/committer-failed.png | Bin 15412 -> 0 bytes
docs/controller-failed.png | Bin 23599 -> 0 bytes
docs/controller-segment-completion.png | Bin 85092 -> 0 bytes
docs/delayed-server.png | Bin 21690 -> 0 bytes
docs/expressionTree.jpg | Bin 12881 -> 0 bytes
docs/expressions_udf.rst | 120 ------------------------
docs/index.rst | 23 -----
docs/llc.rst | 164 ---------------------------------
docs/multiple-server-failure.png | Bin 18036 -> 0 bytes
docs/parseTree.png | Bin 55136 -> 0 bytes
docs/partition_aware_routing.rst | 141 ----------------------------
docs/pluggable_streams.rst | 3 +-
docs/schema_timespec.rst | 111 ----------------------
docs/segment-consumer-fsm.png | Bin 30109 -> 0 bytes
docs/segment-creation.png | Bin 9830 -> 0 bytes
docs/segment-helix-fsm.png | Bin 7494 -> 0 bytes
docs/zk-setup.png | Bin 39246 -> 0 bytes
24 files changed, 3 insertions(+), 561 deletions(-)
diff --git a/docs/High-level-stream.png b/docs/High-level-stream.png
deleted file mode 100644
index ac68d60..0000000
Binary files a/docs/High-level-stream.png and /dev/null differ
diff --git a/docs/Low-level-stream.png b/docs/Low-level-stream.png
deleted file mode 100644
index 9bc87f7..0000000
Binary files a/docs/Low-level-stream.png and /dev/null differ
diff --git a/docs/PlanNode.png b/docs/PlanNode.png
deleted file mode 100644
index 65b69e3..0000000
Binary files a/docs/PlanNode.png and /dev/null differ
diff --git a/docs/ServerSegmentCompletion.dot.png
b/docs/ServerSegmentCompletion.dot.png
deleted file mode 100644
index f56d112..0000000
Binary files a/docs/ServerSegmentCompletion.dot.png and /dev/null differ
diff --git a/docs/architecture.rst b/docs/architecture.rst
index 27d583e..48a4148 100644
--- a/docs/architecture.rst
+++ b/docs/architecture.rst
@@ -120,7 +120,7 @@ Realtime segments are immutable once they are completed.
While realtime segments
in the sense that new rows can be added to them. Rows cannot be deleted from
segments.
-See :doc:`realtime design <llc>` for details.
+See `Consuming and Indexing rows in Realtime
<https://cwiki.apache.org/confluence/display/PINOT/Consuming+and+Indexing+rows+in+Realtime>`_
for details.
Pinot Segments
diff --git a/docs/commit-happy-path-1.png b/docs/commit-happy-path-1.png
deleted file mode 100644
index df65dd8..0000000
Binary files a/docs/commit-happy-path-1.png and /dev/null differ
diff --git a/docs/commit-happy-path-2.png b/docs/commit-happy-path-2.png
deleted file mode 100644
index 77c9e48..0000000
Binary files a/docs/commit-happy-path-2.png and /dev/null differ
diff --git a/docs/committer-failed.png b/docs/committer-failed.png
deleted file mode 100644
index f1b536d..0000000
Binary files a/docs/committer-failed.png and /dev/null differ
diff --git a/docs/controller-failed.png b/docs/controller-failed.png
deleted file mode 100644
index 18bef06..0000000
Binary files a/docs/controller-failed.png and /dev/null differ
diff --git a/docs/controller-segment-completion.png
b/docs/controller-segment-completion.png
deleted file mode 100644
index e6ceac2..0000000
Binary files a/docs/controller-segment-completion.png and /dev/null differ
diff --git a/docs/delayed-server.png b/docs/delayed-server.png
deleted file mode 100644
index b7fda39..0000000
Binary files a/docs/delayed-server.png and /dev/null differ
diff --git a/docs/expressionTree.jpg b/docs/expressionTree.jpg
deleted file mode 100644
index 5fe9772..0000000
Binary files a/docs/expressionTree.jpg and /dev/null differ
diff --git a/docs/expressions_udf.rst b/docs/expressions_udf.rst
deleted file mode 100644
index 1e1b675..0000000
--- a/docs/expressions_udf.rst
+++ /dev/null
@@ -1,120 +0,0 @@
-Expressions and UDFs
-====================
-
-Requirements
-~~~~~~~~~~~~
-The query language for Pinot (:doc:`PQL <reference>`) currently only supports
*selection*, *aggregation* & *group by* operations on columns and, moreover, does
not support nested operations. There is a growing number of use-cases of Pinot
that require some sort of transformation on the column values, before and/or
after performing *selection*, *aggregation* & *group by*. One very common
example is when we would want to aggregate *metrics* over different granularity
of times, without needi [...]
-
-The high level requirement here is to support *expressions* that represent a
function on a set of columns in the queries, as opposed to just columns.
-
-.. code-block:: none
-
- select <exp1> from myTable where ... [group by <exp2>]
-
-Where exp1 and exp2 can be of the form:
-
-.. code-block:: none
-
- func1(func2(col1, col2...func3(...)...), coln...)...
-
-
-Proposal
-~~~~~~~~
-
-We first propose the concept of a Transform Operator (*xform*) below. We then
propose using these *xform* operators to perform transformations before/after
*selection*, *aggregation* and *group by* operations.
-
-The *xform* operator takes following inputs:
-
-#. An expression tree capturing *functions* and the *columns* they are applied
on. The figure below shows one such tree for expression: ``f1(f2(col1, col2),
col3)``
-#. A set of document Id's on which to perform *xform*
-
-.. figure:: expressionTree.jpg
-
-The *xform* produces the following output:
-
-* For each document Id in the input, it evaluates the specified expression,
and produces one value.
-
- * It is Many:1 for columns, i.e. many columns in the input produce one
column value in the output.
- * It is 1:1 for document Id's, i.e. for each document in the input, it
produces one value in the output.
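The Many:1 (columns) and 1:1 (document Ids) semantics above can be sketched as follows. This is an illustrative sketch only: the tuple-based tree encoding and the functions ``f1``/``f2`` are assumptions for the example, not Pinot's actual classes.

```python
# Minimal sketch of the xform semantics: an expression tree is a nested
# tuple (function_name, children); a leaf is a column name.  The
# encoding and the functions f1/f2 are illustrative assumptions.

FUNCS = {
    "f1": lambda a, b: a + b,   # stand-in built-in functions
    "f2": lambda a, b: a * b,
}

def evaluate(node, columns, doc_id):
    """Evaluate the expression tree for a single document Id."""
    if isinstance(node, str):                     # leaf: column reference
        return columns[node][doc_id]
    func_name, children = node
    args = [evaluate(child, columns, doc_id) for child in children]
    return FUNCS[func_name](*args)

def xform(expr, columns, doc_ids):
    """Many:1 for columns, 1:1 for document Ids: one value per document."""
    return [evaluate(expr, columns, doc_id) for doc_id in doc_ids]

columns = {"col1": [1, 2], "col2": [3, 4], "col3": [10, 20]}
expr = ("f1", [("f2", ["col1", "col2"]), "col3"])  # f1(f2(col1, col2), col3)
print(xform(expr, columns, [0, 1]))  # -> [13, 28], one value per document
```

Note how many input columns collapse into one output column, while the number of output values equals the number of input document Ids.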
-
-The *functions* in the *expression* can be either built-in into Pinot, or can
be user-defined. We will discuss the mechanism for hooking up *UDF* and the
manageability aspects in later sections.
-
-Parser
-~~~~~~
-
-The PQL parser is already capable of parsing expressions in the *selection*,
*aggregation* and *group by* sections. Following is a sample query containing
expression, and its parse tree shown in the image.
-
-.. code-block:: none
-
- select f1(f2(col1, col2), col3) from myTable where (col4 = 'x') group by
f3(col5, f4(col6, col7))
-
-
-.. figure:: parseTree.png
-
-BrokerRequest
-~~~~~~~~~~~~~
-
-We convert the Parse Tree from the parser into what we refer to as
*BrokerRequest*, which captures the parse tree along with other information and
is serialized from Pinot broker to server.
-While the parser does already recognize expressions in these sections, the
*BrokerRequest* currently assumes these to be columns and not expressions. We
propose the following enhancements here:
-
-#. *BrokerRequest* needs to be enhanced to support not just *columns* but also
*expressions* in the *selection*, *aggregation* & *group by* sections.
*BrokerRequest* is currently implemented via 'Thrift'. We will need to enhance
*request.thrift* to be able to support expressions. There are a couple of
options here:
-
- * We use the same mechanism as *FilterQuery* (which is how the predicates
are represented).
- * Evaluate other structures that may be more suitable for expression
evaluation. (TBD).
-
-
-#. The back-end of the parser generates *BrokerRequest* based on the parse
tree of the query. We need to implement the functionality that takes the parse
tree containing expressions in these sections and generates the new/enhanced
*BrokerRequest* containing expressions.
-
-
-Query Planning
-~~~~~~~~~~~~~~
-
-In the `query planning
<https://github.com/linkedin/pinot/wiki/Query-Execution>`_ phase, Pinot server
receives a *BrokerRequest* (per query) and parses it to build a query plan,
where it hooks up various plan nodes for filtering,
*Selection/Aggregation/GroupBy*, combining together.
-
-A new *TransformPlanNode* class will be implemented that implements the
*PlanNode* interface.
-The query planning phase will be enhanced to include new *xform* plan nodes if
the *BrokerRequest* contains *expressions* for *selection*, *aggregation* &
*group by*. These plan nodes will get hooked up appropriately during planning
phase.
-
-.. figure:: PlanNode.png
-
-Query Execution
-~~~~~~~~~~~~~~~
-
-In the query execution phase, the *run* method for *TransformPlanNode* will
return a new *TransformOperator*. This operator is responsible for applying a
transformation to a given set of documents, as specified by the *expression* in
the query. The output *block* of this operator will be fed into other operators
as per the query plan.
-
-UDFs
-~~~~
-
-The functions in *expressions* can either be built-in functions in Pinot, or
they can be user-defined. There are a couple of approaches for supporting
hooking up of UDF's into Pinot:
-
-#. If the function is generic enough and reusable by more than one client, it
might be better to include it as part of Pinot code base. In this case, the
process for users would be to file a pull-request, which would then be reviewed
and become part of Pinot code base.
-
-#. Dynamic loading of user-defined functions:
-
- * Users can specify jars containing their UDF's in the class path.
- * List of UDF's can be specified in server config, and the server can ensure
that it can find and load classes for each UDF specified in the config. This
allows for a one-time static checking of availability of all specified UDF's.
- * Alternatively, the server may do a dynamic check for each query to ensure
all UDF's specified in the query are available and can be loaded.
-
-
-Backward compatibility
-~~~~~~~~~~~~~~~~~~~~~~
-
-Given that this proposal requires modifying *BrokerRequest*, we are exposed to
backward compatibility issues where different versions of broker and server
are running (one with the new feature and another without). We propose to
address this as follows:
-
-#. The changes to *BrokerRequest* to include *expressions* instead of
*columns* would only take effect if a query containing *expression* is
received. If the query contains just *columns* instead of *expressions*, we
fall back to the existing behavior and send the *columns* as they are sent in
the current design (i.e. not as a special case of an *expression*).
-
-#. This will warrant the following sequencing:
- * Broker upgraded before server.
- * New queries containing *expressions* should be sent only after both
broker and server are upgraded.
-
-Limitations
-~~~~~~~~~~~
-
-We see the following limitations in functionality currently:
-
-#. Nesting of *aggregation* functions is not supported in the expression tree.
This is because the number of documents after *aggregation* is reduced. In the
expression below, *sum* of *col2* would yield one value, whereas *xform1* on
*col1* would yield the same number of documents as in the input.
-
-.. code-block:: none
-
- sum(xform1(col1), sum(col2))
-
-#. The current parser does not support precedence/associativity of operators;
it just builds the parse tree from left to right. Addressing this is outside the
scope of this project. Once the parser is enhanced to support this,
*expression* evaluation within query execution would work correctly without any
code changes required.
diff --git a/docs/index.rst b/docs/index.rst
index f929f57..92b9b09 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -38,26 +38,3 @@ Customizing Pinot
segment_fetcher
pluggable_storage
-################
-Design Documents
-################
-
-.. toctree::
- :maxdepth: 1
-
-
- llc
- partition_aware_routing
- expressions_udf
- schema_timespec
-
-################
-Design Proposals
-################
-
-.. toctree::
- :maxdepth: 1
-
-
- multitenancy
-
diff --git a/docs/llc.rst b/docs/llc.rst
deleted file mode 100644
index 4c08e70..0000000
--- a/docs/llc.rst
+++ /dev/null
@@ -1,164 +0,0 @@
-Realtime Design
-===============
-
-Pinot consumes rows from streaming data (such as Kafka) and serves queries on
the data consumed thus far.
-
-Two modes of consumption are supported in Pinot:
-
-.. _hlc-section:
-
-High Level Consumers
---------------------
-
-.. figure:: High-level-stream.png
-
- High Level Stream Consumer Architecture
-
-
-*TODO*: Add design description of how HLC realtime works
-
-
-.. _llc-section:
-
-Low Level Consumers
--------------------
-
-.. figure:: Low-level-stream.png
-
- Low Level Stream Consumer Architecture
-
-The HighLevel Pinot consumer has the following properties:
-
-* Each consumer needs to consume from all partitions. So, if we run one
consumer in a host, we are limited by the capacity of that host to consume all
partitions of the topic, no matter what the ingestion rate is.
- A stream may provide a way by which multiple hosts can consume the same
topic, with each host receiving a subset of the messages. However in this mode
the stream may duplicate rows across the machines when the machines go
down and come back up. Pinot cannot afford that.
-* A stream consumer has no control over the messages that are received. As a
result, the consumers may have more or less the same segments, but not exactly the
same. This makes capacity expansion etc. operationally heavy (e.g. start a
consumer and wait 5 days before enabling it to serve queries). Having
equivalent segments allows us to store the segments in the controller (like the
offline segments) and download them onto a new server during capacity
expansion, drastically reducing the operat [...]
- If we have common realtime segments across servers, it allows the brokers
to use different routing algorithms (like we do with offline segments).
Otherwise, the broker needs to route a query to exactly one realtime server so
that we do not see duplicate data in results.
-
-Design
-------
-
-When a table is created, the controller determines the number of partitions
for the table, and "creates" one segment per partition, spraying these segments
evenly across all the tagged servers. The following steps are done as a part of
creating each segment:
-* Segment metadata is created in Zookeeper. The segments are named as
tableName__partitionNumber__segmentSeqNumber__Timestamp. For example:
"myTable__6__0__20180801T1647Z"
-* Segment metadata is set with the segment completion criteria -- the number
of rows. The controller computes this number by dividing the rows threshold set
in table configuration by the total number of segments of the table on the same
server.
-* Segment metadata is set with the offset from which to consume. Controller
determines the offset by querying the stream.
-* Table IdealState is set with these segment names and the appropriate server
instances on which they are hosted. The state is set to CONSUMING
-* Depending on the number of replicas set, each partition could be consumed in
multiple servers.
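The naming scheme and the per-segment row threshold from the steps above can be sketched as follows. The helper names are hypothetical; only the ``table__partition__seq__timestamp`` format and the division of the table-level threshold come from the text.

```python
# Sketch of the segment-creation bookkeeping described above.
# Helper names are illustrative, not Pinot's real code.

def make_segment_name(table, partition, seq, ts):
    # e.g. "myTable__6__0__20180801T1647Z"
    return f"{table}__{partition}__{seq}__{ts}"

def parse_segment_name(name):
    table, partition, seq, ts = name.split("__")
    return table, int(partition), int(seq), ts

def rows_threshold(table_config_threshold, segments_on_same_server):
    # The controller divides the table-level rows threshold by the
    # number of segments of the table hosted on the same server.
    return table_config_threshold // segments_on_same_server

name = make_segment_name("myTable", 6, 0, "20180801T1647Z")
print(name)                           # myTable__6__0__20180801T1647Z
print(parse_segment_name(name))       # ('myTable', 6, 0, '20180801T1647Z')
print(rows_threshold(5_000_000, 4))   # 1250000 rows per consuming segment
```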
-
-When a server completes consuming the segment and reaches the end-criteria
(either time or number of rows as per segment metadata), the server goes
through a segment completion protocol sequence (described in diagrams below).
The controller orchestrates all the replicas to reach the same consumption
level, and allows one replica to "commit" a segment. The "commit" step involves:
-
-* The server uploads the completed segment to the controller
-* The controller updates that segment's metadata to mark it completed, writes
the end offset, end time, etc. in the metadata
-* The controller creates a new segment for the same partition (e.g.
"myTable__6__1__20180801T1805Z") and sets the metadata exactly like before,
with the consumption offsets adjusted to reflect the end offset of the previous
segmentSeqNumber.
-* The controller updates the IdealState to change the state of the completing
segment to ONLINE, and adds the new segment in CONSUMING state.
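The commit bookkeeping above can be sketched roughly as follows. The dict field names and the helper name are illustrative assumptions, not Pinot's actual ZK metadata schema.

```python
# Rough sketch of the controller's "commit" step for one partition:
# mark the completing segment ONLINE and create the next CONSUMING
# segment starting at the previous end offset.  Field names are
# illustrative, not Pinot's real metadata schema.

def commit_segment(segment_name, end_offset, new_timestamp):
    table, partition, seq, _ts = segment_name.split("__")
    committed = {
        "segment": segment_name,
        "state": "ONLINE",          # IdealState: CONSUMING -> ONLINE
        "endOffset": end_offset,
    }
    next_name = f"{table}__{partition}__{int(seq) + 1}__{new_timestamp}"
    consuming = {
        "segment": next_name,
        "state": "CONSUMING",
        "startOffset": end_offset,  # consume from where the last one ended
    }
    return committed, consuming

done, nxt = commit_segment("myTable__6__0__20180801T1647Z", 4500, "20180801T1805Z")
print(nxt["segment"])       # myTable__6__1__20180801T1805Z
print(nxt["startOffset"])   # 4500
```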
-
-As a first implementation, the end-criteria in the metadata points to the
table config. It can be used at some point if we want to implement a more fancy
end-criteria, perhaps based on traffic or other conditions, something that
varies on a per-segment basis. The end-criteria could be number of rows, or
time. If number of rows is specified, then the controller divides the specified
number by the number of segments (of that table) on the same server, and sets
the appropriate row threshold [...]
-
-We change the broker to introduce a new routing strategy that prefers ONLINE
to CONSUMING segments, and ensures that there is at most one segment in
CONSUMING state on a per partition basis in the segments that a query is to be
routed to.
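A simplified sketch of that routing preference follows; it ignores replica placement, and the data shapes are assumptions for illustration.

```python
# Sketch of the routing rule described above: include ONLINE segments,
# and at most one (the latest) CONSUMING segment per partition.
# Input shape is an assumption: (name, partition, seq, state) tuples.

def route(segments):
    chosen = []
    consuming_latest = {}
    for name, partition, seq, state in segments:
        if state == "ONLINE":
            chosen.append(name)
        elif state == "CONSUMING":
            # keep only the latest CONSUMING segment per partition
            prev = consuming_latest.get(partition)
            if prev is None or seq > prev[1]:
                consuming_latest[partition] = (name, seq)
    chosen.extend(name for name, _seq in consuming_latest.values())
    return sorted(chosen)

segs = [
    ("t__0__0__a", 0, 0, "ONLINE"),
    ("t__0__1__b", 0, 1, "CONSUMING"),
    ("t__1__0__c", 1, 0, "CONSUMING"),
    ("t__1__1__d", 1, 1, "CONSUMING"),
]
print(route(segs))  # -> ['t__0__0__a', 't__0__1__b', 't__1__1__d']
```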
-
-Important tuning parameters for Realtime Pinot
-----------------------------------------------
-
-* replicasPerPartition: This number indicates how many replicas are needed for
each partition to be consumed from the stream
-* realtime.segment.flush.threshold.size: This parameter should be set to the
total number of rows of a topic that a realtime consuming server can hold in
memory. Default value is 5M. If the value is set to 0, then the number of rows
is automatically adjusted such that the size of the segment generated is as per
the setting realtime.segment.flush.desired.size
-* realtime.segment.flush.desired.size: Default value is "200M". The setting is
used only if realtime.segment.flush.threshold.size is set to 0
-* realtime.segment.flush.threshold.size.llc: This parameter overrides
realtime.segment.flush.threshold.size. Useful when migrating live from HLC to
LLC
-* pinot.server.instance.realtime.alloc.offheap: Default is false. Set it to
true if you want off-heap allocation for dictionaries and no-dictionary columns
-* pinot.server.instance.realtime.alloc.offheap.direct: Default is false. Set
it to true if you want off-heap allocation from DirectMemory (as opposed to
MMAP)
-* pinot.server.instance.realtime.max.parallel.segment.builds: Default is 0
(meaning infinite). Set it to a number if you want to limit the number of segment
builds. Segment builds take up heap memory, so it is useful to have a max
setting and limit the number of simultaneous segment builds on a single server
instance JVM.
-
-Live migration of existing use cases from HLC to LLC
-----------------------------------------------------
-
-Preparation
-~~~~~~~~~~~
-
-* Set the new configurations as desired (replicasPerPartition,
realtime.segment.flush.threshold.size.llc,
realtime.segment.flush.threshold.time.llc). Note that the ".llc" versions of
the configs are to be used only if you want to do a live migration of an
existing table from HLC to LLC and need to have different thresholds for LLC
than HLC.
-* Set loadMode of segments to MMAP
-* Set configurations to use offheap (either direct or MMAP) for dictionaries
and no-dictionary items (realtime.alloc.offheap, realtime.alloc.offheap.direct)
-* If your stream is Kafka, add ``stream.kafka.broker.list`` configurations for
per-partition consumers
-* Increase the heap size (doubling it may be useful) since we will be
consuming both HLC and LLC on the same machines now. Restart the servers
-
-Consuming the streams via both mechanisms
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Configure two consumers but keep routing to be KafkaHighLevel
-
-* Change the ``stream.<your stream here>.consumer.type`` setting to be
``highLevel,simple``. This starts both LLC and HLC consumers on the nodes.
-* Change ``stream.<your stream here>.consumer.prop.auto.offset.reset`` to have
the value ``largest`` (otherwise, LLC consumers will start consuming from the
beginning which may interfere with HLC operations)
-* Check memory consumption and query response times.
-* Set the broker routingTableBuilderName to be ``KafkaHighLevel`` so that
queries are not routed to LLC until consumption is caught up.
-* Apply the table config
-* Restart brokers and servers
-* Wait for the retention period of the table.
-
-Disabling HLC
-~~~~~~~~~~~~~
-
-* Change the ``stream.<your stream here>.consumer.type`` setting to be "simple"
-* Remove the routingTableBuilderName setting
-* Apply the table configs and restart the brokers and servers
-* The HLC segments will slowly age out on their own.
-
-Migration from HLC to LLC with downtime
----------------------------------------
-
-If it is acceptable to take downtime, then the easiest way is to disable the
table, do the last step of the previous migration steps, and restart the table
once the consumption has caught up.
-
-LLC Zookeeper setup and Segment Completion Protocol
----------------------------------------------------
-
-
-.. figure:: zk-setup.png
-
- Zookeeper setup
-
-
-.. figure:: segment-helix-fsm.png
-
- Server-side Helix State Machine
-
-.. figure:: ServerSegmentCompletion.dot.png
-
- Server-side Partition consumer state machine
-
-.. figure:: segment-consumer-fsm.png
-
- Server-side control flow
-
-.. figure:: controller-segment-completion.png
-
- Controller-side Segment completion state machine
-
-Scenarios
-~~~~~~~~~
-
-
-.. figure:: segment-creation.png
-
- Segment Creation
-
-
-.. figure:: commit-happy-path-1.png
-
- Happy path commit 1
-
-.. figure:: commit-happy-path-2.png
-
- Happy path commit 2
-
-.. figure:: delayed-server.png
-
- Delayed Server
-
-.. figure:: committer-failed.png
-
- Committer failure
-
-.. figure:: controller-failed.png
-
- Controller failure during commit
-
-.. figure:: multiple-server-failure.png
-
- Multiple failures
diff --git a/docs/multiple-server-failure.png b/docs/multiple-server-failure.png
deleted file mode 100644
index b0fa386..0000000
Binary files a/docs/multiple-server-failure.png and /dev/null differ
diff --git a/docs/parseTree.png b/docs/parseTree.png
deleted file mode 100644
index b5247bd..0000000
Binary files a/docs/parseTree.png and /dev/null differ
diff --git a/docs/partition_aware_routing.rst b/docs/partition_aware_routing.rst
deleted file mode 100644
index be9dcd4..0000000
--- a/docs/partition_aware_routing.rst
+++ /dev/null
@@ -1,141 +0,0 @@
-Partition Aware Query Routing
-=============================
-
-In ongoing efforts to support use cases with low latency high throughput
requirements, we have started to notice scaling issues in Pinot broker where
adding more replica sets does not improve throughput as expected beyond a
certain point. One of the reasons behind this is that the broker does
not perform any pruning of segments, and so every query ends up processing each
segment in the data set. This processing of potentially unnecessary segments
has been shown to impact throughp [...]
-
-Details
--------
-
-The Pinot broker component is responsible for receiving queries, fanning them
out to Pinot servers, merging the responses from servers and finally sending
the merged responses back to the client. The broker maintains multiple randomly
generated routing tables that map each server to a subset of segments, such
that one routing table covers one replica set (across various servers). This
implies that for each query all segments of a replica set are processed for a
server.
-
-This becomes an overhead when the answer for the given query lies within a
small subset of segments. One very common example is when queries have a narrow
time filter (say 30 days), but the retention is two years (730 segments, at the
rate of one segment per day). For each such query there are multiple overheads:
-
- Broker needs to use connections to servers that may not even be hosting
any segments worth processing.
- On the server side, there is query planning as well as execution once
per segment. This happens for hundreds (or even a few thousand) of segments,
when only a few need to be actually processed.
-
-We observed through experiments as well as prototyping that while these
overheads may (or may not) impact latency, they certainly impact throughput
quite a bit. In an experiment with one SSD node with 500 GB of data (720
segments), we observed a baseline QPS of 150 without any pruning; pruning on
the broker side improved QPS to about 1500.
-
-Proposed Solution
------------------
-
-We propose a solution that would allow the broker to quickly prune segments
for a given query, reducing the overheads and improving throughput. The idea is
to make information available in the segment metadata for broker to be able to
quickly prune a segment for a given query. Once the broker has compiled the
query, it has the filter query tree that represents the predicates in the
query. If there are predicates on column(s) for which there is partition/range
information in the metadata, [...]
-
-
-In our design, we propose two new components within the broker:
-
- * **Segment ZK Metadata Manager**: This component will be responsible for
caching the segment ZK metadata in memory within the broker. Its cache will be
kept up to date by listening to external view changes.
- * **Segment Pruner**: This component will be responsible for pruning segments
for a given query. Segments will be pruned based on the segment metadata and
the predicates in the query.
-
-Segment ZK Metadata Manager
----------------------------
-
-While the broker does not have access to the segments themselves, it does have
access to the ZK metadata for segments. The SegmentZKMetadataManager will be
responsible for fetching and caching this metadata for each segment.
-Upon transition to online state, the SegmentZKMetadataManager will go over the
segments of the table(s) it hosts and build its cache.
-It will use the existing External View change listener to update its cache.
Since External View does not change when segments are refreshed, and setting
watches for each segment in the table is too expensive, we are choosing to live
with a limitation where this feature does not work for refresh usecase. This
limitation can be revisited at a later time, when inexpensive solutions are
available to detect segment level changes for refresh usecases.
-
-Table Level Partition Metadata
-------------------------------
-
-We will introduce a table level config to control enabling/disabling this
feature for a table. This config can actually be the pruner class name, so that
realtime segment generation can pick it up. Absence or incorrect specification
of this table config would imply the feature is disabled for the table.
-
-Segment Level Partition Metadata
---------------------------------
-
-The partition metadata would be a list of tuples, one for each partition
column, where each tuple contains:
-Partition column: Column on which the data is partitioned.
-Partition value: A min-max range with [start, end]. For single value start ==
end.
-Prune function: Name of the class that will be used by the broker to prune a
segment based on the predicate(s) in the query. It will take as argument the
predicate value and the partition value, and return true if segment can be
pruned, and false otherwise.
-
-For example, let us consider a case where the data is naturally partitioned on
time column ‘daysSinceEpoch’. The segment zk metadata will have information
like below:
-
-.. code-block:: none
-
- {
- "partitionColumn" : "daysSinceEpoch",
- "partitionStart" : "17200",
- "partitionEnd" : "17220",
- "pruneFunction" : "TimePruner"
- }
-
-Now consider the following query comes in.
-
-.. code-block:: none
-
- Select count(*) from myTable where daysSinceEpoch between 17100 and 17110
-
-The broker will recognize the range predicate on the partition column, and
call the TimePruner which will identify that the predicate cannot be satisfied
and hence return true, thus pruning the segment. If there is no segment
metadata or there is no range predicate on partition column, the segment will
not be pruned (return false).
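A minimal sketch of such a range-based prune function, assuming the metadata keys from the JSON snippet above; ``time_prune`` is an illustrative stand-in for the TimePruner class, not its real implementation.

```python
# Sketch of a range-based prune function.  Returns True (prune) when
# the query's range predicate cannot overlap the segment's
# [partitionStart, partitionEnd] range from the ZK metadata.
# Metadata keys mirror the JSON snippet; the function is illustrative.

def time_prune(metadata, pred_start, pred_end):
    seg_start = int(metadata["partitionStart"])
    seg_end = int(metadata["partitionEnd"])
    # prune only when the two ranges are disjoint
    return pred_end < seg_start or pred_start > seg_end

meta = {"partitionColumn": "daysSinceEpoch",
        "partitionStart": "17200", "partitionEnd": "17220"}
print(time_prune(meta, 17100, 17110))  # True: predicate misses the segment
print(time_prune(meta, 17210, 17230))  # False: ranges overlap, keep segment
```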
-
-Let’s consider another example where the data is partitioned by memberId,
where a hash function was applied on the memberId to compute a partition number.
-
-.. code-block:: none
-
- {
- "partitionColumn" : "memberId",
- "partitionStart" : "10",
- "partitionEnd" : "10",
- "pruneFunction" : "HashPartitionPruner"
- }
-
- Select count(*) from myTable where memberId = 1000
-
-In this case, the HashPartitionPruner will compute the partition id for the
memberId (1000) in the query. If it turns out to be anything other than 10,
the segment would be pruned out. The prune function would contain the complete
logic (and information) to be able to compute partition ids from memberIds.
-
-Segment Pruner
---------------
-
-The broker will perform segment pruning as follows. This is not an exact
algorithm, but is meant to convey the idea. The actual implementation might differ
slightly (if needed).
-
-#. Broker will compile the query and generate filter query tree as it does
currently.
-#. The broker will perform a DFS on the filter query tree and prune the
segment as follows:
-
- * If the current node is leaf and is not the partition column, return
false (not prune).
- * If the current node is leaf and is the partition column, return the
result of calling prune function with predicate value from leaf node, and
partition value from metadata.
- * If the current node is AND, return true as long as one of its
children returned true (prune).
- * If the current node is OR, return true if all of its children
returned true (prune).
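The DFS above can be sketched as follows; the filter-tree encoding and the equality pruner are assumptions for illustration, not the actual broker code.

```python
# Sketch of the broker-side pruning DFS described above.  Tree encoding
# is an assumption: leaves are ("LEAF", column, value) predicates;
# internal nodes are ("AND", children) / ("OR", children).

def can_prune(node, partition_column, prune_fn, partition_value):
    kind = node[0]
    if kind == "AND":
        # prunable if any child proves the segment irrelevant
        return any(can_prune(c, partition_column, prune_fn, partition_value)
                   for c in node[1])
    if kind == "OR":
        # prunable only if every branch is irrelevant
        return all(can_prune(c, partition_column, prune_fn, partition_value)
                   for c in node[1])
    column, value = node[1], node[2]      # leaf predicate
    if column != partition_column:
        return False                      # cannot judge: do not prune
    return prune_fn(value, partition_value)

# Hypothetical equality pruner: prune when the predicate value does not
# fall in the segment's (single) partition value.
prune_fn = lambda pred, part: pred != part
tree = ("AND", [("LEAF", "memberId", 1000), ("LEAF", "country", "us")])
print(can_prune(tree, "memberId", prune_fn, 10))  # True: segment pruned
```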
-
-Segment Generation
-------------------
-
-The segment generation code would be enhanced as follows:
-It already auto-detects sorted columns, but only writes out the min-max range
for time column to metadata. It will be enhanced to write out the min-max range
for all sorted columns in the metadata.
-
-For columns with custom partitioning schemes, the name of partitioning
(pruner) class will be specified in the segment generation config. Segment
generator will ensure that the column values adhere to partitioning scheme and
then will write out the partition information into the metadata. In case column
values do not adhere to partition scheme, it will log a warning and will not
write partition information in the metadata. The impact of this will be that
the broker would not be able to perform [...]
-
-This will apply to both offline as well as realtime segment generation, except
that for realtime the pruner class name is specified in the table config.
-When the creation/annotation of segment ZK metadata from segment metadata
happens in controller (when adding a new segment) the partition info will also
be copied over.
-
-Backward compatibility
-----------------------
-
-This feature will be ensured to not create any backward compatibility issues.
-New code with old segments will not find any partition information and pruning
will be skipped.
-Old code will not look for the new information in segment Zk metadata and will
work as expected.
-
-Query response impact
----------------------
-
-The total number of documents in the table is returned as part of the query
response. This is computed by servers when they process segments. If some
segments are pruned, their documents will not be accounted for in the query
response.
-
-To address this, we will enhance the Segment ZK metadata to also store the
document count of the segment, which the broker has access to. The total
document count will then be computed on the broker side instead.
-
-Partitioning of existing data
------------------------------
-
-In the scope of this project, existing segments would not be partitioned. This
simply means that pruning would not apply to those segments. If there’s an
existing usecase that would benefit from this, then there are a few
possibilities that can be explored (outside the scope of this project):
-
-Data can be re-bootstrapped after partitioning
-----------------------------------------------
-
-If the existing segments already comply with some partitioning scheme, a
utility can be created to update the segment metadata and re-push the segments.
-
-Results
--------
-
-With Partition aware segment assignment and query routing, we were able to
demonstrate 6k qps with 99th %ile latency under 100ms, on a data size of 3TB,
in production.
-
-Limitations
------------
-
-The initial implementation will have the following limitations:
-It will not work for refresh use cases, because there is currently no cheap
way to detect segment ZK metadata changes on segment refresh. The only
available mechanism is to set watches at the segment level, which would be too
expensive.
-Segment generation will not partition the data itself; it will assume and
assert that the input data is partitioned as specified in the config.
diff --git a/docs/pluggable_streams.rst b/docs/pluggable_streams.rst
index 4c6b26a..f11c40c 100644
--- a/docs/pluggable_streams.rst
+++ b/docs/pluggable_streams.rst
@@ -20,7 +20,8 @@ You may encounter some limitations either in Pinot or in the
stream system while
Please feel free to get in touch with us when you start writing a stream
plug-in, and we can help you out.
We are open to receiving PRs in order to improve these abstractions if they do
not work for a certain stream implementation.
-Refer to sections :ref:`hlc-section` and :ref:`llc-section` for details on how
Pinot consumes streaming data.
+Refer to `Consuming and Indexing rows in Realtime
<https://cwiki.apache.org/confluence/display/PINOT/Consuming+and+Indexing+rows+in+Realtime>`_
+for details on how Pinot consumes streaming data.
Requirements to support Stream Level (High Level) consumers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
diff --git a/docs/schema_timespec.rst b/docs/schema_timespec.rst
deleted file mode 100644
index 531ff28..0000000
--- a/docs/schema_timespec.rst
+++ /dev/null
@@ -1,111 +0,0 @@
-Schema TimeSpec Refactoring
-============================
-
-Problems with current schema design
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The Pinot schema timeSpec looks like this:
-
-.. code-block:: none
-
- {
- "timeFieldSpec":
- {
- "name" : <name of time column>,
- "dataType" : <datatype of time column>,
- "timeFormat" : <format of time column, EPOCH or SIMPLE_DATE_FORMAT:format>,
- "timeUnitSize" : <time column granularity size>,
- "timeType" : <time unit of time column>
- }
- }
-
-We are missing data granularity information in the Pinot schema.
-TimeUnitSize, timeType and timeFormat allow us to define the granularity of
the time column, but don't provide a way for applications to know in what
buckets the data granularity is.
-Currently, we can only have one time column in the table, which limits some
use cases. We should allow multiple time columns and even allow derived time
columns. Derived columns can be useful in performing roll-ups or generating
star-tree aggregate nodes.
-
-
-Changes
-~~~~~~~
-
-We have added a List<DateTimeFieldSpec> _dateTimeFieldSpecs to the Pinot schema:
-
-.. code-block:: none
-
- {
- "dateTimeFieldSpec":
- {
- "name" : <name of the date time column>,
- "dataType" : <datatype of the date time column>,
- "format" : <string for interpreting the datetime column>,
- "granularity" : <string for data granularity buckets>,
- "dateTimeType" : <DateTimeType enum PRIMARY, SECONDARY or DERIVED>
- }
- }
-
-#. name - this is the name of the date time column, similar to the older
timeSpec
-
-#. dataType - this is the DataType of the date time column, similar to the
older timeSpec
-
-#. format - defines how to interpret the numeric value in the date time column.
-Format has to follow the pattern size:timeunit:timeformat, where size and
timeUnit together define the granularity of the time column value.
-Size is the integer value of the granularity size.
-TimeFormat tells us whether the time column value is expressed in epoch or is
a simple date format pattern.
-Consider 2 date time values, for example 2017/07/01 00:00:00 and 2017/08/29
05:20:00:
-1. If the time column value is defined in millisSinceEpoch (1498892400000,
1504009200000), this configuration will be 1:MILLISECONDS:EPOCH
-2. If the time column value is defined in 5 minutes since epoch (4996308,
5013364), this configuration will be 5:MINUTES:EPOCH
-3. If the time column value is defined in a simple date format of a day (e.g.
20170701, 20170829), this configuration will be
1:DAYS:SIMPLE_DATE_FORMAT:yyyyMMdd (the pattern can be configured as desired)
-
-#. granularity - defines in what granularity the data is bucketed.
-Granularity has to follow the pattern size:timeunit, where size and timeUnit
together define the bucket granularity of the data. This is independent of the
format, which purely defines how to interpret the numeric value in the
datetime column.
-1. If a time column is defined in millisSinceEpoch
(format=1:MILLISECONDS:EPOCH), but the data buckets are 5 minutes, the
granularity will be 5:MINUTES.
-2. If a time column is defined in hoursSinceEpoch (format=1:HOURS:EPOCH), and
the data buckets are 1 hour, the granularity will be 1:HOURS.
-
-#. dateTimeType - this is an enum of values:
-1. PRIMARY: The primary date time column. This is the date time column which
keeps the milliseconds value, and is used as the default time column in
references by Pinot code (e.g. the retention manager).
-2. SECONDARY: Date time columns other than the primary column with the
milliseconds value. These can be date time columns in other granularities, put
in by applications for their specific use cases.
-3. DERIVED: Date time columns which are derived, say using other columns,
generated via roll-ups, etc.
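The format and granularity conventions described above can be made concrete with a small sketch. This computes the three format encodings for one timestamp and then floors a raw millisecond value to a 5:MINUTES granularity bucket. The values here use UTC, so the epoch numbers differ from the timezone-dependent samples quoted earlier.

```python
from datetime import datetime, timezone

dt = datetime(2017, 7, 1, tzinfo=timezone.utc)  # 2017/07/01 00:00:00 UTC

# format = 1:MILLISECONDS:EPOCH
epoch_millis = int(dt.timestamp() * 1000)
# format = 5:MINUTES:EPOCH (number of 5-minute intervals since epoch)
five_minutes_since_epoch = epoch_millis // (5 * 60 * 1000)
# format = 1:DAYS:SIMPLE_DATE_FORMAT:yyyyMMdd
simple_date = dt.strftime("%Y%m%d")

# granularity = 5:MINUTES -- floor a raw millis value to its bucket start,
# independently of the storage format above.
BUCKET_MS = 5 * 60 * 1000
raw_millis = epoch_millis + 184_000  # some value 3 min 4 s into a bucket
bucket_start = (raw_millis // BUCKET_MS) * BUCKET_MS

print(epoch_millis, five_minutes_since_epoch, simple_date, bucket_start)
# 1498867200000 4996224 20170701 1498867200000
```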
-
-Examples:
-
-.. code-block:: none
-
- "dateTimeFieldSpec":
- {
-
- "name" : "Date",
- "dataType" : "LONG",
- "format" : "1:HOURS:EPOCH",
- "granularity" : "1:HOURS",
- "dateTimeType" : "PRIMARY"
-
- }
-
- "dateTimeFieldSpec":
- {
-
- "name" : "Date",
- "dataType" : "LONG",
- "format" : "1:MILLISECONDS:EPOCH",
- "granularity" : "5:MINUTES",
- "dateTimeType" : "PRIMARY"
-
- }
-
- "dateTimeFieldSpec":
- {
-
- "name" : "Date",
- "dataType" : "LONG",
- "format" : "1:DAYS:SIMPLE_DATE_FORMAT:yyyyMMdd",
- "granularity" : "1:DAYS",
- "dateTimeType" : "SECONDARY"
-
- }
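A parser for the size:timeunit:timeformat[:pattern] format strings used in the specs above might look like this. It is a hypothetical sketch, not Pinot's actual implementation; the `DateTimeFormat`/`parse_format` names are made up for illustration.

```python
from typing import NamedTuple, Optional

class DateTimeFormat(NamedTuple):
    size: int
    time_unit: str
    time_format: str
    pattern: Optional[str]  # only for SIMPLE_DATE_FORMAT

def parse_format(fmt: str) -> DateTimeFormat:
    # Split into at most 4 pieces so a SIMPLE_DATE_FORMAT pattern that
    # itself contains ':' is not broken apart.
    parts = fmt.split(":", 3)
    if len(parts) < 3:
        raise ValueError(f"expected size:timeunit:timeformat, got {fmt!r}")
    size, unit, time_format = int(parts[0]), parts[1], parts[2]
    pattern = parts[3] if len(parts) == 4 else None
    if time_format == "SIMPLE_DATE_FORMAT" and pattern is None:
        raise ValueError("SIMPLE_DATE_FORMAT requires a pattern")
    return DateTimeFormat(size, unit, time_format, pattern)

print(parse_format("5:MINUTES:EPOCH"))
print(parse_format("1:DAYS:SIMPLE_DATE_FORMAT:yyyyMMdd").pattern)  # yyyyMMdd
```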
-
-Migration
-~~~~~~~~~
-
-Once this change is pushed in, we will migrate all our clients to start
populating the new DateTimeFieldSpec, along with the TimeSpec.
-We can then go over all older schemas, and fill in the DateTimeFieldSpec by
referring to the TimeFieldSpec.
-We then migrate our clients to start using DateTimeFieldSpec instead of
TimeFieldSpec.
-At this point, we can deprecate the TimeFieldSpec.
diff --git a/docs/segment-consumer-fsm.png b/docs/segment-consumer-fsm.png
deleted file mode 100644
index 718ba5d..0000000
Binary files a/docs/segment-consumer-fsm.png and /dev/null differ
diff --git a/docs/segment-creation.png b/docs/segment-creation.png
deleted file mode 100644
index 1f8b442..0000000
Binary files a/docs/segment-creation.png and /dev/null differ
diff --git a/docs/segment-helix-fsm.png b/docs/segment-helix-fsm.png
deleted file mode 100644
index 8fee5f3..0000000
Binary files a/docs/segment-helix-fsm.png and /dev/null differ
diff --git a/docs/zk-setup.png b/docs/zk-setup.png
deleted file mode 100644
index ca1ae24..0000000
Binary files a/docs/zk-setup.png and /dev/null differ
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]