This is an automated email from the ASF dual-hosted git repository.
mcvsubbu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-pinot.git
The following commit(s) were added to refs/heads/master by this push:
new ab203a9 [PINOT-7658] Moving design documents to cwiki (#3802)
ab203a9 is described below
commit ab203a9573771a14640560f6010cd9ae646614d8
Author: Subbu Subramaniam <[email protected]>
AuthorDate: Thu Feb 7 16:36:57 2019 -0800
[PINOT-7658] Moving design documents to cwiki (#3802)
* [PINOT-7658] Moving design documents to cwiki
See https://cwiki.apache.org/confluence/display/PINOT/Home
* Fixing design doc references in documentation
---
docs/High-level-stream.png | Bin 39800 -> 0 bytes
docs/Low-level-stream.png | Bin 37306 -> 0 bytes
docs/PlanNode.png | Bin 41056 -> 0 bytes
docs/ServerSegmentCompletion.dot.png | Bin 72968 -> 0 bytes
docs/architecture.rst | 2 +-
docs/commit-happy-path-1.png | Bin 22933 -> 0 bytes
docs/commit-happy-path-2.png | Bin 22939 -> 0 bytes
docs/committer-failed.png | Bin 15412 -> 0 bytes
docs/controller-failed.png | Bin 23599 -> 0 bytes
docs/controller-segment-completion.png | Bin 85092 -> 0 bytes
docs/delayed-server.png | Bin 21690 -> 0 bytes
docs/expressionTree.jpg | Bin 12881 -> 0 bytes
docs/expressions_udf.rst | 120 ------------------------
docs/index.rst | 23 -----
docs/llc.rst | 164 ---------------------------------
docs/multiple-server-failure.png | Bin 18036 -> 0 bytes
docs/parseTree.png | Bin 55136 -> 0 bytes
docs/partition_aware_routing.rst | 141 ----------------------------
docs/pluggable_streams.rst | 3 +-
docs/schema_timespec.rst | 111 ----------------------
docs/segment-consumer-fsm.png | Bin 30109 -> 0 bytes
docs/segment-creation.png | Bin 9830 -> 0 bytes
docs/segment-helix-fsm.png | Bin 7494 -> 0 bytes
docs/zk-setup.png | Bin 39246 -> 0 bytes
24 files changed, 3 insertions(+), 561 deletions(-)
diff --git a/docs/High-level-stream.png b/docs/High-level-stream.png
deleted file mode 100644
index ac68d60..0000000
Binary files a/docs/High-level-stream.png and /dev/null differ
diff --git a/docs/Low-level-stream.png b/docs/Low-level-stream.png
deleted file mode 100644
index 9bc87f7..0000000
Binary files a/docs/Low-level-stream.png and /dev/null differ
diff --git a/docs/PlanNode.png b/docs/PlanNode.png
deleted file mode 100644
index 65b69e3..0000000
Binary files a/docs/PlanNode.png and /dev/null differ
diff --git a/docs/ServerSegmentCompletion.dot.png
b/docs/ServerSegmentCompletion.dot.png
deleted file mode 100644
index f56d112..0000000
Binary files a/docs/ServerSegmentCompletion.dot.png and /dev/null differ
diff --git a/docs/architecture.rst b/docs/architecture.rst
index 27d583e..48a4148 100644
--- a/docs/architecture.rst
+++ b/docs/architecture.rst
@@ -120,7 +120,7 @@ Realtime segments are immutable once they are completed.
While realtime segments
in the sense that new rows can be added to them. Rows cannot be deleted from
segments.
-See :doc:`realtime design <llc>` for details.
+See `Consuming and Indexing rows in Realtime
<https://cwiki.apache.org/confluence/display/PINOT/Consuming+and+Indexing+rows+in+Realtime>`_
for details.
Pinot Segments
diff --git a/docs/commit-happy-path-1.png b/docs/commit-happy-path-1.png
deleted file mode 100644
index df65dd8..0000000
Binary files a/docs/commit-happy-path-1.png and /dev/null differ
diff --git a/docs/commit-happy-path-2.png b/docs/commit-happy-path-2.png
deleted file mode 100644
index 77c9e48..0000000
Binary files a/docs/commit-happy-path-2.png and /dev/null differ
diff --git a/docs/committer-failed.png b/docs/committer-failed.png
deleted file mode 100644
index f1b536d..0000000
Binary files a/docs/committer-failed.png and /dev/null differ
diff --git a/docs/controller-failed.png b/docs/controller-failed.png
deleted file mode 100644
index 18bef06..0000000
Binary files a/docs/controller-failed.png and /dev/null differ
diff --git a/docs/controller-segment-completion.png
b/docs/controller-segment-completion.png
deleted file mode 100644
index e6ceac2..0000000
Binary files a/docs/controller-segment-completion.png and /dev/null differ
diff --git a/docs/delayed-server.png b/docs/delayed-server.png
deleted file mode 100644
index b7fda39..0000000
Binary files a/docs/delayed-server.png and /dev/null differ
diff --git a/docs/expressionTree.jpg b/docs/expressionTree.jpg
deleted file mode 100644
index 5fe9772..0000000
Binary files a/docs/expressionTree.jpg and /dev/null differ
diff --git a/docs/expressions_udf.rst b/docs/expressions_udf.rst
deleted file mode 100644
index 1e1b675..0000000
--- a/docs/expressions_udf.rst
+++ /dev/null
@@ -1,120 +0,0 @@
-Expressions and UDFs
-====================
-
-Requirements
-~~~~~~~~~~~~
-The query language for Pinot (:doc:`PQL <reference>`) currently only supports
*selection*, *aggregation* & *group by* operations on columns and, moreover, does
not support nested operations. There is a growing number of use-cases of Pinot
that require some sort of transformation on the column values, before and/or
after performing *selection*, *aggregation* & *group by*. One very common
example is when we would want to aggregate *metrics* over different granularity
of times, without needi [...]
-
-The high level requirement here is to support *expressions* that represent a
function on a set of columns in the queries, as opposed to just columns.
-
-.. code-block:: none
-
- select <exp1> from myTable where ... [group by <exp2>]
-
-Where exp1 and exp2 can be of the form:
-
-.. code-block:: none
-
- func1(func2(col1, col2...func3(...)...), coln...)...
-
-
-Proposal
-~~~~~~~~
-
-We first propose the concept of a Transform Operator (*xform*) below. We then
propose using these *xform* operators to perform transformations before/after
*selection*, *aggregation* and *group by* operations.
-
-The *xform* operator takes following inputs:
-
-#. An expression tree capturing *functions* and the *columns* they are applied
on. The figure below shows one such tree for expression: ``f1(f2(col1, col2),
col3)``
-#. A set of document Id's on which to perform *xform*
-
-.. figure:: expressionTree.jpg
-
-The *xform* produces the following output:
-
-* For each document Id in the input, it evaluates the specified expression,
and produces one value.
-
- * It is Many:1 for columns, i.e. many columns in the input produce one
column value in the output.
- * It is 1:1 for document Id's, i.e. for each document in the input, it
produces one value in the output.
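The Many:1 (columns) and 1:1 (document Ids) semantics above can be sketched as follows. This is an illustrative sketch only: the tuple-based tree encoding and the functions ``f1``/``f2`` are assumptions for the example, not Pinot's actual classes.

```python
# Minimal sketch of the xform semantics: an expression tree is a nested
# tuple (function_name, children); a leaf is a column name.  The
# encoding and the functions f1/f2 are illustrative assumptions.

FUNCS = {
    "f1": lambda a, b: a + b,   # stand-in built-in functions
    "f2": lambda a, b: a * b,
}

def evaluate(node, columns, doc_id):
    """Evaluate the expression tree for a single document Id."""
    if isinstance(node, str):                     # leaf: column reference
        return columns[node][doc_id]
    func_name, children = node
    args = [evaluate(child, columns, doc_id) for child in children]
    return FUNCS[func_name](*args)

def xform(expr, columns, doc_ids):
    """Many:1 for columns, 1:1 for document Ids: one value per document."""
    return [evaluate(expr, columns, doc_id) for doc_id in doc_ids]

columns = {"col1": [1, 2], "col2": [3, 4], "col3": [10, 20]}
expr = ("f1", [("f2", ["col1", "col2"]), "col3"])  # f1(f2(col1, col2), col3)
print(xform(expr, columns, [0, 1]))  # -> [13, 28], one value per document
```

Note how many input columns collapse into one output column, while the number of output values equals the number of input document Ids.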
-
-The *functions* in the *expression* can be either built-in into Pinot, or can
be user-defined. We will discuss the mechanism for hooking up *UDF* and the
manageability aspects in later sections.
-
-Parser
-~~~~~~
-
-The PQL parser is already capable of parsing expressions in the *selection*,
*aggregation* and *group by* sections. Following is a sample query containing
expression, and its parse tree shown in the image.
-
-.. code-block:: none
-
- select f1(f2(col1, col2), col3) from myTable where (col4 = 'x') group by
f3(col5, f4(col6, col7))
-
-
-.. figure:: parseTree.png
-
-BrokerRequest
-~~~~~~~~~~~~~
-
-We convert the Parse Tree from the parser into what we refer to as
*BrokerRequest*, which captures the parse tree along with other information and
is serialized from Pinot broker to server.
-While the parser does already recognize expressions in these sections, the
*BrokerRequest* currently assumes these to be columns and not expressions. We
propose the following enhancements here:
-
-#. *BrokerRequest* needs to be enhanced to support not just *columns* but also
*expressions* in the *selection*, *aggregation* & *group by* sections.
*BrokerRequest* is currently implemented via 'Thrift'. We will need to enhance
*request.thrift* to be able to support expressions. There are a couple of
options here:
-
- * We use the same mechanism as *FilterQuery* (which is how the predicates
are represented).
- * Evaluate other structures that may be more suitable for expression
evaluation. (TBD).
-
-
-#. The back-end of the parser generates *BrokerRequest* based on the parse
tree of the query. We need to implement the functionality that takes the parse
tree containing expressions in these sections and generates the new/enhanced
*BrokerRequest* containing expressions.
-
-
-Query Planning
-~~~~~~~~~~~~~~
-
-In the `query planning
<https://github.com/linkedin/pinot/wiki/Query-Execution>`_ phase, Pinot server
receives a *BrokerRequest* (per query) and parses it to build a query plan,
where it hooks up various plan nodes for filtering,
*Selection/Aggregation/GroupBy*, combining together.
-
-A new *TransformPlanNode* class will be implemented that implements the
*PlanNode* interface.
-The query planning phase will be enhanced to include new *xform* plan nodes if
the *BrokerRequest* contains *expressions* for *selection*, *aggregation* &
*group by*. These plan nodes will get hooked up appropriately during planning
phase.
-
-.. figure:: PlanNode.png
-
-Query Execution
-~~~~~~~~~~~~~~~
-
-In the query execution phase, the *run* method for *TransformPlanNode* will
return a new *TransformOperator*. This operator is responsible for applying a
transformation to a given set of documents, as specified by the *expression* in
the query. The output *block* of this operator will be fed into other operators
as per the query plan.
-
-UDFs
-~~~~
-
-The functions in *expressions* can either be built-in functions in Pinot, or
they can be user-defined. There are a couple of approaches for supporting
hooking up of UDF's into Pinot:
-
-#. If the function is generic enough and reusable by more than one client, it
might be better to include it as part of Pinot code base. In this case, the
process for users would be to file a pull-request, which would then be reviewed
and become part of Pinot code base.
-
-#. Dynamic loading of user-defined functions:
-
- * Users can specify jars containing their UDF's in the class path.
- * List of UDF's can be specified in server config, and the server can ensure
that it can find and load classes for each UDF specified in the config. This
allows for a one-time static checking of availability of all specified UDF's.
- * Alternatively, the server may do a dynamic check for each query to ensure
all UDF's specified in the query are available and can be loaded.
-
-
-Backward compatibility
-~~~~~~~~~~~~~~~~~~~~~~
-
-Given that this proposal requires modifying *BrokerRequest*, we are exposed to
backward compatibility issues where different versions of broker and server
are running (one with the new feature and another without). We propose to
address this as follows:
-
-#. The changes to *BrokerRequest* to include *expressions* instead of
*columns* would only take effect if a query containing *expression* is
received. If the query contains just *columns* instead of *expressions*, we
fall back to the existing behavior and send the *columns* as they are sent in
the current design (i.e. not as a special case of an *expression*).
-
-#. This will warrant the following sequencing:
- * Broker upgraded before server.
- * New queries containing *expressions* should be sent only after both
broker and server are upgraded.
-
-Limitations
-~~~~~~~~~~~
-
-We see the following limitations in functionality currently:
-
-#. Nesting of *aggregation* functions is not supported in the expression tree.
This is because the number of documents after *aggregation* is reduced. In the
expression below, *sum* of *col2* would yield one value, whereas *xform1* on
*col1* would yield the same number of documents as in the input.
-
-.. code-block:: none
-
- sum(xform1(col1), sum(col2))
-
-#. The current parser does not support precedence/associativity of operators;
it just builds the parse tree from left to right. Addressing this is outside the
scope of this project. Once the parser is enhanced to support this,
*expression* evaluation within query execution would work correctly without any
code changes required.
diff --git a/docs/index.rst b/docs/index.rst
index f929f57..92b9b09 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -38,26 +38,3 @@ Customizing Pinot
segment_fetcher
pluggable_storage
-################
-Design Documents
-################
-
-.. toctree::
- :maxdepth: 1
-
-
- llc
- partition_aware_routing
- expressions_udf
- schema_timespec
-
-################
-Design Proposals
-################
-
-.. toctree::
- :maxdepth: 1
-
-
- multitenancy
-
diff --git a/docs/llc.rst b/docs/llc.rst
deleted file mode 100644
index 4c08e70..0000000
--- a/docs/llc.rst
+++ /dev/null
@@ -1,164 +0,0 @@
-Realtime Design
-===============
-
-Pinot consumes rows from streaming data (such as Kafka) and serves queries on
the data consumed thus far.
-
-Two modes of consumption are supported in Pinot:
-
-.. _hlc-section:
-
-High Level Consumers
---------------------
-
-.. figure:: High-level-stream.png
-
- High Level Stream Consumer Architecture
-
-
-*TODO*: Add design description of how HLC realtime works
-
-
-.. _llc-section:
-
-Low Level Consumers
--------------------
-
-.. figure:: Low-level-stream.png
-
- Low Level Stream Consumer Architecture
-
-The HighLevel Pinot consumer has the following properties:
-
-* Each consumer needs to consume from all partitions. So, if we run one
consumer in a host, we are limited by the capacity of that host to consume all
partitions of the topic, no matter what the ingestion rate is.
- A stream may provide a way by which multiple hosts can consume the same
topic, with each host receiving a subset of the messages. However in this mode
the stream may duplicate rows across the machines when the machines go
down and come back up. Pinot cannot afford that.
-* A stream consumer has no control over the messages that are received. As a
result, the consumers may have more or less the same segments, but not exactly the
same. This makes capacity expansion etc. operationally heavy (e.g. start a
consumer and wait 5 days before enabling it to serve queries). Having
equivalent segments allows us to store the segments in the controller (like the
offline segments) and download them onto a new server during capacity
expansion, drastically reducing the operat [...]
- If we have common realtime segments across servers, it allows the brokers
to use different routing algorithms (like we do with offline segments).
Otherwise, the broker needs to route a query to exactly one realtime server so
that we do not see duplicate data in results.
-
-Design
-------
-
-When a table is created, the controller determines the number of partitions
for the table, and "creates" one segment per partition, spraying these segments
evenly across all the tagged servers. The following steps are done as a part of
creating each segment:
-* Segment metadata is created in Zookeeper. The segments are named as
tableName__partitionNumber__segmentSeqNumber__Timestamp. For example:
"myTable__6__0__20180801T1647Z"
-* Segment metadata is set with the segment completion criteria -- the number
of rows. The controller computes this number by dividing the rows threshold set
in table configuration by the total number of segments of the table on the same
server.
-* Segment metadata is set with the offset from which to consume. Controller
determines the offset by querying the stream.
-* Table IdealState is set with these segment names and the appropriate server
instances on which they are hosted. The state is set to CONSUMING
-* Depending on the number of replicas set, each partition could be consumed in
multiple servers.
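The naming scheme and the per-segment row threshold from the steps above can be sketched as follows. The helper names are hypothetical; only the ``table__partition__seq__timestamp`` format and the division of the table-level threshold come from the text.

```python
# Sketch of the segment-creation bookkeeping described above.
# Helper names are illustrative, not Pinot's real code.

def make_segment_name(table, partition, seq, ts):
    # e.g. "myTable__6__0__20180801T1647Z"
    return f"{table}__{partition}__{seq}__{ts}"

def parse_segment_name(name):
    table, partition, seq, ts = name.split("__")
    return table, int(partition), int(seq), ts

def rows_threshold(table_config_threshold, segments_on_same_server):
    # The controller divides the table-level rows threshold by the
    # number of segments of the table hosted on the same server.
    return table_config_threshold // segments_on_same_server

name = make_segment_name("myTable", 6, 0, "20180801T1647Z")
print(name)                           # myTable__6__0__20180801T1647Z
print(parse_segment_name(name))       # ('myTable', 6, 0, '20180801T1647Z')
print(rows_threshold(5_000_000, 4))   # 1250000 rows per consuming segment
```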
-
-When a server completes consuming the segment and reaches the end-criteria
(either time or number of rows as per segment metadata), the server goes
through a segment completion protocol sequence (described in diagrams below).
The controller orchestrates all the replicas to reach the same consumption
level, and allows one replica to "commit" a segment. The "commit" step involves:
-
-* The server uploads the completed segment to the controller
-* The controller updates that segment's metadata to mark it completed, writes
the end offset, end time, etc. in the metadata
-* The controller creates a new segment for the same partition (e.g.
"myTable__6__1__20180801T1805Z") and sets the metadata exactly like before,
with the consumption offsets adjusted to reflect the end offset of the previous
segmentSeqNumber.
-* The controller updates the IdealState to change the state of the completing
segment to ONLINE, and adds the new segment in CONSUMING state.
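The commit bookkeeping above can be sketched roughly as follows. The dict field names and the helper name are illustrative assumptions, not Pinot's actual ZK metadata schema.

```python
# Rough sketch of the controller's "commit" step for one partition:
# mark the completing segment ONLINE and create the next CONSUMING
# segment starting at the previous end offset.  Field names are
# illustrative, not Pinot's real metadata schema.

def commit_segment(segment_name, end_offset, new_timestamp):
    table, partition, seq, _ts = segment_name.split("__")
    committed = {
        "segment": segment_name,
        "state": "ONLINE",          # IdealState: CONSUMING -> ONLINE
        "endOffset": end_offset,
    }
    next_name = f"{table}__{partition}__{int(seq) + 1}__{new_timestamp}"
    consuming = {
        "segment": next_name,
        "state": "CONSUMING",
        "startOffset": end_offset,  # consume from where the last one ended
    }
    return committed, consuming

done, nxt = commit_segment("myTable__6__0__20180801T1647Z", 4500, "20180801T1805Z")
print(nxt["segment"])       # myTable__6__1__20180801T1805Z
print(nxt["startOffset"])   # 4500
```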
-
-As a first implementation, the end-criteria in the metadata points to the
table config. It can be used at some point if we want to implement a more fancy
end-criteria, perhaps based on traffic or other conditions, something that
varies on a per-segment basis. The end-criteria could be number of rows, or
time. If number of rows is specified, then the controller divides the specified
number by the number of segments (of that table) on the same server, and sets
the appropriate row threshold [...]
-
-We change the broker to introduce a new routing strategy that prefers ONLINE
to CONSUMING segments, and ensures that there is at most one segment in
CONSUMING state on a per partition basis in the segments that a query is to be
routed to.
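A simplified sketch of that routing preference follows; it ignores replica placement, and the data shapes are assumptions for illustration.

```python
# Sketch of the routing rule described above: include ONLINE segments,
# and at most one (the latest) CONSUMING segment per partition.
# Input shape is an assumption: (name, partition, seq, state) tuples.

def route(segments):
    chosen = []
    consuming_latest = {}
    for name, partition, seq, state in segments:
        if state == "ONLINE":
            chosen.append(name)
        elif state == "CONSUMING":
            # keep only the latest CONSUMING segment per partition
            prev = consuming_latest.get(partition)
            if prev is None or seq > prev[1]:
                consuming_latest[partition] = (name, seq)
    chosen.extend(name for name, _seq in consuming_latest.values())
    return sorted(chosen)

segs = [
    ("t__0__0__a", 0, 0, "ONLINE"),
    ("t__0__1__b", 0, 1, "CONSUMING"),
    ("t__1__0__c", 1, 0, "CONSUMING"),
    ("t__1__1__d", 1, 1, "CONSUMING"),
]
print(route(segs))  # -> ['t__0__0__a', 't__0__1__b', 't__1__1__d']
```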
-
-Important tuning parameters for Realtime Pinot
-----------------------------------------------
-
-* replicasPerPartition: This number indicates how many replicas are needed for
each partition to be consumed from the stream
-* realtime.segment.flush.threshold.size: This parameter should be set to the
total number of rows of a topic that a realtime consuming server can hold in
memory. Default value is 5M. If the value is set to 0, then the number of rows
is automatically adjusted such that the size of the segment generated is as per
the setting realtime.segment.flush.desired.size
-* realtime.segment.flush.desired.size: Default value is "200M". The setting is
used only if realtime.segment.flush.threshold.size is set to 0
-* realtime.segment.flush.threshold.size.llc: This parameter overrides
realtime.segment.flush.threshold.size. Useful when migrating live from HLC to
LLC
-* pinot.server.instance.realtime.alloc.offheap: Default is false. Set it to
true if you want off-heap allocation for dictionaries and no-dictionary columns
-* pinot.server.instance.realtime.alloc.offheap.direct: Default is false. Set
it to true if you want off-heap allocation from DirectMemory (as opposed to
MMAP)
-* pinot.server.instance.realtime.max.parallel.segment.builds: Default is 0
(meaning infinite). Set it to a number if you want to limit the number of segment
builds. Segment builds take up heap memory, so it is useful to have a max
setting and limit the number of simultaneous segment builds on a single server
instance JVM.
-
-Live migration of existing use cases from HLC to LLC
-----------------------------------------------------
-
-Preparation
-~~~~~~~~~~~
-
-* Set the new configurations as desired (replicasPerPartition,
realtime.segment.flush.threshold.size.llc,
realtime.segment.flush.threshold.time.llc). Note that the ".llc" versions of
the configs are to be used only if you want to do a live migration of an
existing table from HLC to LLC and need to have different thresholds for LLC
than HLC.
-* Set loadMode of segments to MMAP
-* Set configurations to use offheap (either direct or MMAP) for dictionaries
and no-dictionary items (realtime.alloc.offheap, realtime.alloc.offheap.direct)
-* If your stream is Kafka, add ``stream.kafka.broker.list`` configurations for
per-partition consumers
-* Increase the heap size (doubling it may be useful) since we will be
consuming both HLC and LLC on the same machines now. Restart the servers
-
-Consuming the streams via both mechanisms
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Configure two consumers but keep routing to be KafkaHighLevel
-
-* Change the ``stream.<your stream here>.consumer.type`` setting to be
``highLevel,simple``. This starts both LLC and HLC consumers on the nodes.
-* Change ``stream.<your stream here>.consumer.prop.auto.offset.reset`` to have
the value ``largest`` (otherwise, LLC consumers will start consuming from the
beginning which may interfere with HLC operations)
-* Check memory consumption and query response times.
-* Set the broker routingTableBuilderName to be ``KafkaHighLevel`` so that
queries are not routed to LLC until consumption is caught up.
-* Apply the table config
-* Restart brokers and servers
-* Wait for the retention period of the table.
-
-Disabling HLC
-~~~~~~~~~~~~~
-
-* Change the ``stream.<your stream here>.consumer.type`` setting to be "simple"
-* Remove the routingTableBuilderName setting
-* Apply the table configs and restart the brokers and servers
-* The HLC segments will slowly age out on their own.
-
-Migration from HLC to LLC with downtime
----------------------------------------
-
-If it is acceptable to take downtime, then the easiest way is to disable the
table, do the last step of the previous migration steps, and restart the table
once the consumption has caught up.
-
-LLC Zookeeper setup and Segment Completion Protocol
----------------------------------------------------
-
-
-.. figure:: zk-setup.png
-
- Zookeeper setup
-
-
-.. figure:: segment-helix-fsm.png
-
- Server-side Helix State Machine
-
-.. figure:: ServerSegmentCompletion.dot.png
-
- Server-side Partition consumer state machine
-
-.. figure:: segment-consumer-fsm.png
-
- Server-side control flow
-
-.. figure:: controller-segment-completion.png
-
- Controller-side Segment completion state machine
-
-Scenarios
-~~~~~~~~~
-
-
-.. figure:: segment-creation.png
-
- Segment Creation
-
-
-.. figure:: commit-happy-path-1.png
-
- Happy path commit 1
-
-.. figure:: commit-happy-path-2.png
-
- Happy path commit 2
-
-.. figure:: delayed-server.png
-
- Delayed Server
-
-.. figure:: committer-failed.png
-
- Committer failure
-
-.. figure:: controller-failed.png
-
- Controller failure during commit
-
-.. figure:: multiple-server-failure.png
-
- Multiple failures
diff --git a/docs/multiple-server-failure.png b/docs/multiple-server-failure.png
deleted file mode 100644
index b0fa386..0000000
Binary files a/docs/multiple-server-failure.png and /dev/null differ
diff --git a/docs/parseTree.png b/docs/parseTree.png
deleted file mode 100644
index b5247bd..0000000
Binary files a/docs/parseTree.png and /dev/null differ
diff --git a/docs/partition_aware_routing.rst b/docs/partition_aware_routing.rst
deleted file mode 100644
index be9dcd4..0000000
--- a/docs/partition_aware_routing.rst
+++ /dev/null
@@ -1,141 +0,0 @@
-Partition Aware Query Routing
-=============================
-
-In ongoing efforts to support use cases with low latency high throughput
requirements, we have started to notice scaling issues in Pinot broker where
adding more replica sets does not improve throughput as expected beyond a
certain point. One of the reasons behind this is that the broker does
not perform any pruning of segments, and so every query ends up processing each
segment in the data set. This processing of potentially unnecessary segments
has been shown to impact throughp [...]
-
-Details
--------
-
-The Pinot broker component is responsible for receiving queries, fanning them
out to Pinot servers, merging the responses from servers and finally sending
the merged responses back to the client. The broker maintains multiple randomly
generated routing tables that map each server to a subset of segments, such
that one routing table covers one replica set (across various servers). This
implies that for each query all segments of a replica set are processed for a
server.
-
-This becomes an overhead when the answer for the given query lies within a
small subset of segments. One very common example is when queries have a narrow
time filter (say 30 days), but the retention is two years (730 segments, at the
rate of one segment per day). For each such query there are multiple overheads:
-
- Broker needs to use connections to servers that may not even be hosting
any segments worth processing.
- On the server side, there is query planning as well as execution once
per segment. This happens for hundreds (or even a few thousand) of segments,
when only a few need to be actually processed.
-
-We observed through experiments as well as prototyping that while these
overheads may (or may not) impact latency, they certainly impact throughput
quite a bit. In an experiment with one SSD node with 500 GB of data (720
segments), we observed a baseline QPS of 150 without any pruning; pruning on
the broker side improved QPS to about 1500.
-
-Proposed Solution
------------------
-
-We propose a solution that would allow the broker to quickly prune segments
for a given query, reducing the overheads and improving throughput. The idea is
to make information available in the segment metadata for broker to be able to
quickly prune a segment for a given query. Once the broker has compiled the
query, it has the filter query tree that represents the predicates in the
query. If there are predicates on column(s) for which there is partition/range
information in the metadata, [...]
-
-
-In our design, we propose two new components within the broker:
-
- * **Segment ZK Metadata Manager**: This component will be responsible for
caching the segment ZK metadata in memory within the broker. Its cache will be
kept up to date by listening to external view changes.
- * **Segment Pruner**: This component will be responsible for pruning segments
for a given query. Segments will be pruned based on the segment metadata and
the predicates in the query.
-
-Segment ZK Metadata Manager
----------------------------
-
-While the broker does not have access to the segments themselves, it does have
access to the ZK metadata for segments. The SegmentZKMetadataManager will be
responsible for fetching and caching this metadata for each segment.
-Upon transition to online state, the SegmentZKMetadataManager will go over the
segments of the table(s) it hosts and build its cache.
-It will use the existing External View change listener to update its cache.
Since External View does not change when segments are refreshed, and setting
watches for each segment in the table is too expensive, we are choosing to live
with a limitation where this feature does not work for refresh usecase. This
limitation can be revisited at a later time, when inexpensive solutions are
available to detect segment level changes for refresh usecases.
-
-Table Level Partition Metadata
-------------------------------
-
-We will introduce a table level config to control enabling/disabling this
feature for a table. This config can actually be the pruner class name, so that
realtime segment generation can pick it up. Absence or incorrect specification
of this table config would imply the feature is disabled for the table.
-
-Segment Level Partition Metadata
---------------------------------
-
-The partition metadata would be a list of tuples, one for each partition
column, where each tuple contains:
-Partition column: Column on which the data is partitioned.
-Partition value: A min-max range with [start, end]. For single value start ==
end.
-Prune function: Name of the class that will be used by the broker to prune a
segment based on the predicate(s) in the query. It will take as argument the
predicate value and the partition value, and return true if segment can be
pruned, and false otherwise.
-
-For example, let us consider a case where the data is naturally partitioned on
time column ‘daysSinceEpoch’. The segment zk metadata will have information
like below:
-
-.. code-block:: none
-
- {
- "partitionColumn" : "daysSinceEpoch",
- "partitionStart" : "17200",
- "partitionEnd" : "17220",
- "pruneFunction" : "TimePruner"
- }
-
-Now consider the following query comes in.
-
-.. code-block:: none
-
- Select count(*) from myTable where daysSinceEpoch between 17100 and 17110
-
-The broker will recognize the range predicate on the partition column, and
call the TimePruner which will identify that the predicate cannot be satisfied
and hence return true, thus pruning the segment. If there is no segment
metadata or there is no range predicate on partition column, the segment will
not be pruned (return false).
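A minimal sketch of such a range-based prune function, assuming the metadata keys from the JSON snippet above; ``time_prune`` is an illustrative stand-in for the TimePruner class, not its real implementation.

```python
# Sketch of a range-based prune function.  Returns True (prune) when
# the query's range predicate cannot overlap the segment's
# [partitionStart, partitionEnd] range from the ZK metadata.
# Metadata keys mirror the JSON snippet; the function is illustrative.

def time_prune(metadata, pred_start, pred_end):
    seg_start = int(metadata["partitionStart"])
    seg_end = int(metadata["partitionEnd"])
    # prune only when the two ranges are disjoint
    return pred_end < seg_start or pred_start > seg_end

meta = {"partitionColumn": "daysSinceEpoch",
        "partitionStart": "17200", "partitionEnd": "17220"}
print(time_prune(meta, 17100, 17110))  # True: predicate misses the segment
print(time_prune(meta, 17210, 17230))  # False: ranges overlap, keep segment
```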
-
-Let’s consider another example where the data is partitioned by memberId,
where a hash function was applied on the memberId to compute a partition number.
-
-.. code-block:: none
-
- {
- "partitionColumn" : "memberId",
- "partitionStart" : "10",
- "partitionEnd" : "10",
- "pruneFunction" : "HashPartitionPruner"
- }
-
- Select count(*) from myTable where memberId = 1000
-
-In this case, the HashPartitionPruner will compute the partition id for the
memberId (1000) in the query. If it turns out to be anything other than 10,
the segment would be pruned out. The prune function would contain the complete
logic (and information) to be able to compute partition ids from memberIds.
-
-Segment Pruner
---------------
-
-The broker will perform segment pruning as follows. This is not an exact
algorithm, but is meant to convey the idea. The actual implementation might differ
slightly (if needed).
-
-#. Broker will compile the query and generate filter query tree as it does
currently.
-#. The broker will perform a DFS on the filter query tree and prune the
segment as follows:
-
- * If the current node is leaf and is not the partition column, return
false (not prune).
- * If the current node is leaf and is the partition column, return the
result of calling prune function with predicate value from leaf node, and
partition value from metadata.
- * If the current node is AND, return true as long as one of its
children returned true (prune).
- * If the current node is OR, return true if all of its children
returned true (prune).
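The DFS above can be sketched as follows; the filter-tree encoding and the equality pruner are assumptions for illustration, not the actual broker code.

```python
# Sketch of the broker-side pruning DFS described above.  Tree encoding
# is an assumption: leaves are ("LEAF", column, value) predicates;
# internal nodes are ("AND", children) / ("OR", children).

def can_prune(node, partition_column, prune_fn, partition_value):
    kind = node[0]
    if kind == "AND":
        # prunable if any child proves the segment irrelevant
        return any(can_prune(c, partition_column, prune_fn, partition_value)
                   for c in node[1])
    if kind == "OR":
        # prunable only if every branch is irrelevant
        return all(can_prune(c, partition_column, prune_fn, partition_value)
                   for c in node[1])
    column, value = node[1], node[2]      # leaf predicate
    if column != partition_column:
        return False                      # cannot judge: do not prune
    return prune_fn(value, partition_value)

# Hypothetical equality pruner: prune when the predicate value does not
# fall in the segment's (single) partition value.
prune_fn = lambda pred, part: pred != part
tree = ("AND", [("LEAF", "memberId", 1000), ("LEAF", "country", "us")])
print(can_prune(tree, "memberId", prune_fn, 10))  # True: segment pruned
```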
-
-Segment Generation
-------------------
-
-The segment generation code would be enhanced as follows:
-It already auto-detects sorted columns, but only writes out the min-max range
for time column to metadata. It will be enhanced to write out the min-max range
for all sorted columns in the metadata.
-
-For columns with custom partitioning schemes, the name of partitioning
(pruner) class will be specified in the segment generation config. Segment
generator will ensure that the column values adhere to partitioning scheme and
then will write out the partition information into the metadata. In case column
values do not adhere to partition scheme, it will log a warning and will not
write partition information in the metadata. The impact of this will be that
the broker would not be able to perform [...]
-
-This will apply to both offline as well as realtime segment generation, except
that for realtime the pruner class name is specified in the table config.
-When the creation/annotation of segment ZK metadata from segment metadata
happens in controller (when adding a new segment) the partition info will also
be copied over.
-
-Backward compatibility
-----------------------
-
-This feature will be ensured to not create any backward compatibility issues.
-New code with old segments will not find any partition information and pruning
will be skipped.
-Old code will not look for the new information in segment Zk metadata and will
work as expected.
-
-Query response impact
----------------------
-
-The total number of documents in the table is returned as part of the query
response. This is computed by servers when they process segments. If some
segments are pruned, their documents will not be accounted for in the query
response.
-
-To address this, we will enhance the Segment ZK metadata to also store the
document count of the segment, which the broker has access to. The total
document count will then be computed on the broker side instead.
-
-Partitioning of existing data
------------------------------
-
-In the scope of this project, existing segments would not be partitioned. This
simply means that pruning would not apply to those segments. If there’s an
existing usecase that would benefit from this, then there are a few
possibilities that can be explored (outside the scope of this project):
-
-Data can be re-bootstrapped after partitioning
-----------------------------------------------
-
-If the existing segments already comply with some partitioning scheme, a
utility can be created to update the segment metadata and re-push the segments.
-
-Results
--------
-
-With Partition aware segment assignment and query routing, we were able to
demonstrate 6k qps with 99th %ile latency under 100ms, on a data size of 3TB,
in production.
-
-Limitations
------------
-
-The initial implementation will have the following limitations:
-It will not work for refresh use cases, because there is currently no cheap
way to detect segment ZK metadata changes on segment refresh. The only
available mechanism is to set watches at the segment level, which would be too
expensive.
-Segment generation will not partition the data itself; it will assume and
assert that the input data is partitioned as specified in the config.
diff --git a/docs/pluggable_streams.rst b/docs/pluggable_streams.rst
index 4c6b26a..f11c40c 100644
--- a/docs/pluggable_streams.rst
+++ b/docs/pluggable_streams.rst
@@ -20,7 +20,8 @@ You may encounter some limitations either in Pinot or in the
stream system while
Please feel free to get in touch with us when you start writing a stream
plug-in, and we can help you out.
We are open to receiving PRs in order to improve these abstractions if they do
not work for a certain stream implementation.
-Refer to sections :ref:`hlc-section` and :ref:`llc-section` for details on how
Pinot consumes streaming data.
+Refer to `Consuming and Indexing rows in Realtime
<https://cwiki.apache.org/confluence/display/PINOT/Consuming+and+Indexing+rows+in+Realtime>`_
+for details on how Pinot consumes streaming data.
Requirements to support Stream Level (High Level) consumers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
diff --git a/docs/schema_timespec.rst b/docs/schema_timespec.rst
deleted file mode 100644
index 531ff28..0000000
--- a/docs/schema_timespec.rst
+++ /dev/null
@@ -1,111 +0,0 @@
-Schema TimeSpec Refactoring
-============================
-
-Problems with current schema design
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The Pinot schema timeSpec looks like this:
-
-.. code-block:: none
-
- {
- "timeFieldSpec":
- {
- "name" : <name of time column>,
- "dataType" : <datatype of time column>,
- "timeFormat" : <format of time column, EPOCH or SIMPLE_DATE_FORMAT:format>,
- "timeUnitSize" : <time column granularity size>,
- "timeType" : <time unit of time column>
- }
- }
-
-We are missing data granularity information in the Pinot schema.
-TimeUnitSize, timeType and timeFormat allow us to define the granularity of
the time column, but don't provide a way for applications to know in what
buckets the data granularity is.
-Currently, we can only have one time column in the table, which limits some
use cases. We should allow multiple time columns and even allow derived time
columns. Derived columns can be useful in performing roll-ups or generating
star-tree aggregate nodes.
-
-
-Changes
-~~~~~~~
-
-We have added a List<DateTimeFieldSpec> _dateTimeFieldSpecs to the Pinot schema:
-
-.. code-block:: none
-
- {
- "dateTimeFieldSpec":
- {
- "name" : <name of the date time column>,
- "dataType" : <datatype of the date time column>,
- "format" : <string for interpreting the datetime column>,
- "granularity" : <string for data granularity buckets>,
- "dateTimeType" : <DateTimeType enum PRIMARY, SECONDARY or DERIVED>
- }
- }
-
-#. name - this is the name of the date time column, similar to the older
timeSpec
-
-#. dataType - this is the DataType of the date time column, similar to the
older timeSpec
-
-#. format - defines how to interpret the numeric value in the date time column.
-Format has to follow the pattern size:timeunit:timeformat, where size and
timeUnit together define the granularity of the time column value.
-Size is the integer value of the granularity size.
-TimeFormat tells us whether the time column value is expressed in epoch or is
a simple date format pattern.
-Consider 2 date time values, for example 2017/07/01 00:00:00 and 2017/08/29
05:20:00:
-1. If the time column value is defined in millisSinceEpoch (1498892400000,
1504009200000), this configuration will be 1:MILLISECONDS:EPOCH
-2. If the time column value is defined in 5 minutes since epoch (4996308,
5013364), this configuration will be 5:MINUTES:EPOCH
-3. If the time column value is defined in a simple date format of a day (e.g.
20170701, 20170829), this configuration will be
1:DAYS:SIMPLE_DATE_FORMAT:yyyyMMdd (the pattern can be configured as desired)
-
-#. granularity - defines in what granularity the data is bucketed.
-Granularity has to follow the pattern size:timeunit, where size and timeUnit
together define the bucket granularity of the data. This is independent of the
format, which purely defines how to interpret the numeric value in the
datetime column.
-1. If a time column is defined in millisSinceEpoch
(format=1:MILLISECONDS:EPOCH), but the data buckets are 5 minutes, the
granularity will be 5:MINUTES.
-2. If a time column is defined in hoursSinceEpoch (format=1:HOURS:EPOCH), and
the data buckets are 1 hour, the granularity will be 1:HOURS.
-
-#. dateTimeType - this is an enum of values:
-1. PRIMARY: The primary date time column. This is the date time column which
keeps the milliseconds value, and is used as the default time column in
references by Pinot code (e.g. the retention manager).
-2. SECONDARY: Date time columns other than the primary column with the
milliseconds value. These can be date time columns in other granularities, put
in by applications for their specific use cases.
-3. DERIVED: Date time columns which are derived, say using other columns,
generated via roll-ups, etc.
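The format and granularity conventions described above can be made concrete with a small sketch. This computes the three format encodings for one timestamp and then floors a raw millisecond value to a 5:MINUTES granularity bucket. The values here use UTC, so the epoch numbers differ from the timezone-dependent samples quoted earlier.

```python
from datetime import datetime, timezone

dt = datetime(2017, 7, 1, tzinfo=timezone.utc)  # 2017/07/01 00:00:00 UTC

# format = 1:MILLISECONDS:EPOCH
epoch_millis = int(dt.timestamp() * 1000)
# format = 5:MINUTES:EPOCH (number of 5-minute intervals since epoch)
five_minutes_since_epoch = epoch_millis // (5 * 60 * 1000)
# format = 1:DAYS:SIMPLE_DATE_FORMAT:yyyyMMdd
simple_date = dt.strftime("%Y%m%d")

# granularity = 5:MINUTES -- floor a raw millis value to its bucket start,
# independently of the storage format above.
BUCKET_MS = 5 * 60 * 1000
raw_millis = epoch_millis + 184_000  # some value 3 min 4 s into a bucket
bucket_start = (raw_millis // BUCKET_MS) * BUCKET_MS

print(epoch_millis, five_minutes_since_epoch, simple_date, bucket_start)
# 1498867200000 4996224 20170701 1498867200000
```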
-
-Examples:
-
-.. code-block:: none
-
- "dateTimeFieldSpec":
- {
-
- "name" : "Date",
- "dataType" : "LONG",
- "format" : "1:HOURS:EPOCH",
- "granularity" : "1:HOURS",
- "dateTimeType" : "PRIMARY"
-
- }
-
- "dateTimeFieldSpec":
- {
-
- "name" : "Date",
- "dataType" : "LONG",
- "format" : "1:MILLISECONDS:EPOCH",
- "granularity" : "5:MINUTES",
- "dateTimeType" : "PRIMARY"
-
- }
-
- "dateTimeFieldSpec":
- {
-
- "name" : "Date",
- "dataType" : "LONG",
- "format" : "1:DAYS:SIMPLE_DATE_FORMAT:yyyyMMdd",
- "granularity" : "1:DAYS",
- "dateTimeType" : "SECONDARY"
-
- }
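A parser for the size:timeunit:timeformat[:pattern] format strings used in the specs above might look like this. It is a hypothetical sketch, not Pinot's actual implementation; the `DateTimeFormat`/`parse_format` names are made up for illustration.

```python
from typing import NamedTuple, Optional

class DateTimeFormat(NamedTuple):
    size: int
    time_unit: str
    time_format: str
    pattern: Optional[str]  # only for SIMPLE_DATE_FORMAT

def parse_format(fmt: str) -> DateTimeFormat:
    # Split into at most 4 pieces so a SIMPLE_DATE_FORMAT pattern that
    # itself contains ':' is not broken apart.
    parts = fmt.split(":", 3)
    if len(parts) < 3:
        raise ValueError(f"expected size:timeunit:timeformat, got {fmt!r}")
    size, unit, time_format = int(parts[0]), parts[1], parts[2]
    pattern = parts[3] if len(parts) == 4 else None
    if time_format == "SIMPLE_DATE_FORMAT" and pattern is None:
        raise ValueError("SIMPLE_DATE_FORMAT requires a pattern")
    return DateTimeFormat(size, unit, time_format, pattern)

print(parse_format("5:MINUTES:EPOCH"))
print(parse_format("1:DAYS:SIMPLE_DATE_FORMAT:yyyyMMdd").pattern)  # yyyyMMdd
```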
-
-Migration
-~~~~~~~~~
-
-Once this change is pushed in, we will migrate all our clients to start
populating the new DateTimeFieldSpec, along with the TimeSpec.
-We can then go over all older schemas, and fill in the DateTimeFieldSpec by
referring to the TimeFieldSpec.
-We then migrate our clients to start using DateTimeFieldSpec instead of
TimeFieldSpec.
-At this point, we can deprecate the TimeFieldSpec.
diff --git a/docs/segment-consumer-fsm.png b/docs/segment-consumer-fsm.png
deleted file mode 100644
index 718ba5d..0000000
Binary files a/docs/segment-consumer-fsm.png and /dev/null differ
diff --git a/docs/segment-creation.png b/docs/segment-creation.png
deleted file mode 100644
index 1f8b442..0000000
Binary files a/docs/segment-creation.png and /dev/null differ
diff --git a/docs/segment-helix-fsm.png b/docs/segment-helix-fsm.png
deleted file mode 100644
index 8fee5f3..0000000
Binary files a/docs/segment-helix-fsm.png and /dev/null differ
diff --git a/docs/zk-setup.png b/docs/zk-setup.png
deleted file mode 100644
index ca1ae24..0000000
Binary files a/docs/zk-setup.png and /dev/null differ
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]