[incubator-pinot] branch master updated: Fixes to doc (#3558)

mcvsubbu Tue, 27 Nov 2018 16:38:54 -0800

This is an automated email from the ASF dual-hosted git repository.

mcvsubbu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-pinot.git



The following commit(s) were added to refs/heads/master by this push:
     new 232360f  Fixes to doc (#3558)
232360f is described below

commit 232360fa054e05b82d3fdd1ee2d0fbfa506dad1c
Author: Subbu Subramaniam <[email protected]>
AuthorDate: Tue Nov 27 16:38:22 2018 -0800

    Fixes to doc (#3558)
---
 docs/architecture.rst            | 128 +++++++++++++++++++++++++--------------
 docs/creating_pinot_segments.rst |   2 +-
 docs/expressions_udf.rst         |   2 +-
 docs/intro.rst                   |  26 +++-----
 4 files changed, 94 insertions(+), 64 deletions(-)

diff --git a/docs/architecture.rst b/docs/architecture.rst
index 9b61f2a..27d583e 100644
--- a/docs/architecture.rst
+++ b/docs/architecture.rst
@@ -8,87 +8,125 @@ Architecture
 Terminology
 -----------
 
-* Table: A table is a logical abstraction to refer to a collection of related 
data. It consists of columns and rows (Document). Table Schema defines column 
names and their metadata.
-* Segment: Data in table is divided into shards referred to as segments.
+*Table*
+    A table is a logical abstraction to refer to a collection of related data. 
It consists of columns and rows (documents).
+*Segment*
+    Data in table is divided into (horizontal) shards referred to as segments.
 
 Pinot Components
 ----------------
 
-* Pinot Controller: Manages other pinot components (brokers, servers) as well 
as controls assignment of tables/segments to servers.
-* Pinot Server: Hosts one or more segments and serves queries from those 
segments
-* Pinot Broker: Accepts queries from clients and routes them to one or more 
servers, and returns consolidated response to the server.
+*Pinot Controller*
+    Manages other pinot components (brokers, servers) as well as controls 
assignment of tables/segments to servers.
+*Pinot Server*
+    Hosts one or more segments and serves queries from those segments
+*Pinot Broker*
+    Accepts queries from clients and routes them to one or more servers, and 
returns consolidated response to the client.
 
 Pinot leverages `Apache Helix <http://helix.apache.org>`_ for cluster 
management. 
-Apache Helix is a generic cluster management framework to manage partitions 
and replicas in a distributed system. See http://helix.apache.org for 
additional information.
+Helix is a cluster management framework to manage replicated, partitioned 
resources in a distributed system.
 Helix uses Zookeeper to store cluster state and metadata.
 
-Briefly, Helix divides nodes into 3 logical components based on their 
responsibilities:
+Briefly, Helix divides nodes into three logical components based on their 
responsibilities:
 
-*  **Participant**: The nodes that host distributed, partitioned resources
-*  **Spectator**: The nodes that observe the current state of each Participant 
and use that information to access the resources.
-   Spectators are notified of state changes in the cluster (state of a 
participant, or that of a partition in a participant).
-*  **Controller**: The node that observes and controls the Participant nodes. 
It is responsible for coordinating all transitions
-   in the cluster and ensuring that state constraints are satisfied while 
maintaining cluster stability
+*Participant*
+    The nodes that host distributed, partitioned resources
+*Spectator*
+    The nodes that observe the current state of each Participant and use that 
information to access the resources.
+    Spectators are notified of state changes in the cluster (state of a 
participant, or that of a partition in a participant).
+*Controller*
+    The node that observes and controls the Participant nodes. It is 
responsible for coordinating all transitions
+    in the cluster and ensuring that state constraints are satisfied while 
maintaining cluster stability
 
-Pinot Controller hosts Helix Controller, in addition to hosting APIs for Pinot 
cluster administration and data ingestion.
+Pinot Controller hosts Helix Controller, in addition to hosting REST APIs for 
Pinot cluster administration and data ingestion.
 There can be multiple instances of Pinot controller for redundancy. If there 
are multiple controllers, Pinot expects that all
-of them are configured with the same back-end storage system so that they have 
a common view of the segments (_e.g._ NFS).
+of them are configured with the same back-end storage system so that they have 
a common view of the segments (*e.g.* NFS).
 Pinot can use other storage systems such as HDFS or `ADLS 
<https://azure.microsoft.com/en-us/services/storage/data-lake-storage/>`_.
 
-Pinot Servers are modeled as Helix Participants, hosting Pinot tables 
(referred to as 'resources' in helix terminology).
+Pinot Servers are modeled as Helix Participants, hosting Pinot tables 
(referred to as *resources* in helix terminology).
 Segments of a table are modeled as Helix partitions (of a resource). Thus, a 
Pinot server hosts one or more helix partitions of one
-or more helix resources (_i.e._ one or more segments of one or more tables).
+or more helix resources (*i.e.* one or more segments of one or more tables).
+
+Pinot Brokers are modeled as Spectators. They need to know the location of 
each segment of a table (and each replica of the
+segments)
+and route requests to the
+appropriate server that hosts the segments of the table being queried. The 
broker ensures that all the rows of the table
+are queried exactly once so as to return correct, consistent results for a 
query. The brokers (or servers) may optimize
+to prune some of the segments as long as accuracy is not sacrificed. In case of 
hybrid tables, the brokers ensure that
+the overlap between realtime and offline segment data is queried exactly once.
+Helix provides the framework by which spectators can learn the location 
(*i.e.* participant) in which each partition
+of a resource resides. The brokers use this mechanism to learn the servers 
that host specific segments of a table.
+
+Pinot Tables
+------------
 
-Pinot Brokers are modeled as Spectators. They need to know the location of 
each segment of a table and route requests to the
-appropriate server that hosts the segments of the table being queried. In 
general, all segments must be queried exactly once
-in order for a query to return the correct response. There may be multilpe 
copies of a segment (for redundancy). Helix provides
-the framework by which spectators can learn the location (i.e. participant) in 
which each partition of a resource resides.
+Pinot supports realtime, or offline, or hybrid tables. Data in Pinot tables is 
contained in the segments
+belonging to that table. A Pinot table is modeled as a Helix resource.  Each 
segment of a table is modeled as a Helix partition.
 
-Pinot tables
-------------
+Table Schema defines column names and their metadata. Table configuration and 
schema are stored in Zookeeper.
 
-Tables in Pinot can be configured to be offline only, or realtime only, or a 
hybrid of these two.
+Offline tables ingest pre-built pinot-segments from external data stores, 
whereas Realtime tables
+ingest data from streams (such as Kafka) and build segments.
 
-Segments for offline tables are constructed outside of Pinot, typically in 
Hadoop via map-reduce jobs. These segments are then ingested
-into Pinot via REST API provided by the Controller. The controller looks up 
the table's configuration and assigns the segment
-to the servers that host the table. It may assign multiple servers for each 
servers depending on the number of replicas 
-configured for that table.
+A hybrid Pinot table essentially has both realtime as well as offline tables. 
+In such a table, offline segments may be pushed periodically (say, once a 
day). The retention on the offline table
+can be set to a high value (say, a few years) since segments are coming in on 
a periodic basis, whereas the retention
+on the realtime part can be small (say, a few days). Once an offline segment 
is pushed to cover a recent time period,
+the brokers automatically switch to using the offline table for segments in 
*that* time period, and use realtime table
+only to cover later segments for which offline data may not be available yet.
+
+Note that the query does not know the existence of offline or realtime tables. 
It only specifies the table name
+in the query.
+
+
+Ingesting Offline data
+^^^^^^^^^^^^^^^^^^^^^^
+Segments for offline tables are constructed outside of Pinot, typically in 
Hadoop via map-reduce jobs
+and ingested into Pinot via REST API provided by the Controller.
 Pinot provides libraries to create Pinot segments out of input files in AVRO, 
JSON or CSV formats in a hadoop job, and push
 the constructed segments to the controllers via REST APIs.
 
+When an Offline segment is ingested, the controller looks up the table's 
configuration and assigns the segment
+to the servers that host the table. It may assign multiple servers for each 
segment depending on the number of replicas 
+configured for that table.
+
 Pinot supports different segment assignment strategies that are optimized for 
various use cases.
 
 Once segments are assigned, Pinot servers get notified via Helix to "host" the 
segment. The servers download the segments
 (as a cached local copy to serve queries) and load them into local memory. All 
segment data is maintained in memory as long
 as the server hosts that segment.
 
-Once the server has loaded the segment, brokers come to know of the 
availability of these segments and start include the new
-segments for queries. Brokers support different routing strategeies depending 
on the type of table, the segment assignment
+Once the server has loaded the segment, Helix notifies brokers of the 
availability of these segments. The brokers 
+start including the new
+segments for queries. Brokers support different routing strategies depending 
on the type of table, the segment assignment
 strategy and the use case.
 
-Realtime tables, on the other hand, ingest data directly from incoming data 
streams (such as Kafka). Multiple servers may
-ingest the same data for replication. The servers stop ingesting data after 
reaching a threshold and "build" a segment of
-the data ingested so far. Once that segment is loaded (just like the offline 
segments described earlier), they continue
-to consume the next set of events from the stream.
+Data in offline segments is immutable (rows cannot be added, deleted, or 
modified). However, segments may be replaced with modified data.
 
-Depending on the type of consumer configured, realtime segments may be held 
locally in the server, or pushed the controller.
+Ingesting Realtime Data
+^^^^^^^^^^^^^^^^^^^^^^^
+Segments for realtime tables are constructed by Pinot servers. The servers 
ingest rows from realtime streams (such as
+Kafka) until
+some completion threshold (such as number of rows, or a time threshold) and 
build a segment out of those rows. Depending
+on the type of ingestion mechanism used (stream or partition level), segments 
may be locally stored in the servers
+or in the controller's segment store.
 
-**TODO Add reference to the realtime section here**
+Multiple servers may ingest the same data to increase availability and share 
query load.
 
-A hybrid Pinot table essentially has both realtime as well as offline tables. 
-In such a table, offline segments may be pushed periodically (say, once a 
day). The retention on the offline table
-can be set to a high value (say, a few years) since segments are coming in on 
a periodic basis, whereas the retention
-on the realtime part can be small (say, a few days). Once an offline segment 
is pushed to cover a recent time period,
-the brokers automatically switch to using the offline table for segments in 
_that_ time period, and use realtime table
-only to cover later segments for which offline data may not be available yet.
+Once a realtime segment is built and loaded the servers continue
+to consume from where they left off.
+
+Realtime segments are immutable once they are completed. While realtime 
segments are being consumed they are mutable,
+in the sense that new rows can be added to them. Rows cannot be deleted from 
segments.
 
-Note that the query does not know the existence of offline or realtime tables. 
It only specifies the table name
-in the query.
+
+See :doc:`realtime design <llc>` for details.
 
 
 Pinot Segments
 --------------
-As mentioned earlier, each Pinot segment is a horizontal shard of the Pinot 
table. The segment is laid out in a columnar format
+
+A segment is laid out in a columnar format
 so that it can be directly mapped into memory for serving queries. Columns may 
be single or multi-valued. Column types may be
 STRING, INT, LONG, FLOAT, DOUBLE or BYTES. Columns may be declared to be 
metric or dimension (or specifically as a time dimension)
 in the schema.
@@ -102,5 +140,3 @@ configured for any set of columns. Inverted indices, while 
taking up more storag
 
 Specialized indexes like the StarTree index are also supported.
 
-**TODO Add/Link startree doc here**
-
diff --git a/docs/creating_pinot_segments.rst b/docs/creating_pinot_segments.rst
index 0425c70..5bae71e 100644
--- a/docs/creating_pinot_segments.rst
+++ b/docs/creating_pinot_segments.rst
@@ -5,7 +5,7 @@ This document describes steps required for creating Pinot2_0 
segments from stand
 
 Compiling the code
 ------------------
-Follow the steps described in `trying_pinot`_ to build pinot. Locate 
``pinot-admin.sh`` in ``pinot-tools/trget/pinot-tools=pkg/bin/pinot-admin.sh``.
+Follow the steps described in the section on :doc:`Demonstration 
<trying_pinot>` to build pinot. Locate ``pinot-admin.sh`` in 
``pinot-tools/trget/pinot-tools=pkg/bin/pinot-admin.sh``.
 
 
 Data Preparation
diff --git a/docs/expressions_udf.rst b/docs/expressions_udf.rst
index 7a42f0f..d373424 100644
--- a/docs/expressions_udf.rst
+++ b/docs/expressions_udf.rst
@@ -3,7 +3,7 @@ Expressions and UDFs
 
 Requirements
 ~~~~~~~~~~~~
-The query language for Pinot (pql_) currently only supports *selection*, 
*aggregation* & *group by* operations on columns, and moreover, do not support 
nested operations. There are a growing number of use-cases of Pinot that 
require some sort of transformation on the column values, before and/or after 
performing *selection*, *aggregation* & *group by*. One very common example is 
when we would want to aggregate *metrics* over different granularity of times, 
without needing to pre-aggregat [...]
+The query language for Pinot (:doc:`PQL <reference>`) currently only supports 
*selection*, *aggregation* & *group by* operations on columns, and moreover, do 
not support nested operations. There are a growing number of use-cases of Pinot 
that require some sort of transformation on the column values, before and/or 
after performing *selection*, *aggregation* & *group by*. One very common 
example is when we would want to aggregate *metrics* over different granularity 
of times, without needi [...]
 
 The high level requirement here is to support *expressions* that represent a 
function on a set of columns in the queries, as opposed to just columns.
 
diff --git a/docs/intro.rst b/docs/intro.rst
index c989b04..b169b4a 100644
--- a/docs/intro.rst
+++ b/docs/intro.rst
@@ -29,25 +29,19 @@ Because of the design choices we made to achieve these 
goals, there are certain
 
 Pinot works very well for querying time series data with lots of Dimensions 
and Metrics. For example:
 
-::
+.. code-block:: sql
 
-    SELECT sum(clicks), sum(impressions) FROM AdAnalyticsTable WHERE 
((daysSinceEpoch >= 17849 AND daysSinceEpoch <= 17856)) AND accountId IN 
(123456789) GROUP BY daysSinceEpoch TOP 15000
-    SELECT sum(impressions) FROM AdAnalyticsTable WHERE (daysSinceEpoch >= 
17824 and daysSinceEpoch <= 17854) AND adveriserId = '1234356789' GROUP BY 
daysSinceEpoch,advertiserId TOP 1000
-    SELECT sum(cost) FROM AdAnalyticsTable GROUP BY advertiserId TOP 50
-
-
-Terminology
+    SELECT sum(clicks), sum(impressions) FROM AdAnalyticsTable
+      WHERE ((daysSinceEpoch >= 17849 AND daysSinceEpoch <= 17856)) AND 
accountId IN (123456789)
+      GROUP BY daysSinceEpoch TOP 100
 
-* Table: A table is a logical abstraction to refer to a collection of related 
data. It consists of columns and rows (Document). Table Schema defines column 
names and their metadata.
-* Segment: Data in table is divided into shards referred to as segments.
+.. code-block:: sql
 
-Pinot Components
+    SELECT sum(impressions) FROM AdAnalyticsTable
+      WHERE (daysSinceEpoch >= 17824 and daysSinceEpoch <= 17854) AND 
advertiserId = '1234356789'
+      GROUP BY daysSinceEpoch,advertiserId TOP 100
 
-* Pinot Controller: Manages other pinot components (brokers, servers) as well 
as controls assignment of tables/segments to servers.
-* Pinot Server: Hosts one or more segments and serves queries from those 
segments
-* Pinot Broker: Accepts queries from clients and routes them to one or more 
servers, and returns consolidated response to the server.
+.. code-block:: sql
 
-Pinot leverages [Apache Helix](http://helix.apache.org) for cluster 
management. 
-
-For more information on Pinot Design and Architecture can be found 
[here](https://github.com/linkedin/pinot/wiki/Architecture)
+    SELECT sum(cost) FROM AdAnalyticsTable GROUP BY advertiserId TOP 50
 

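Note on the hybrid-table paragraph added to docs/architecture.rst above: the
text says that brokers use the offline table for time periods it already
covers and the realtime table only for later data, so that the overlap is
queried exactly once. The sketch below illustrates that idea only. It is not
Pinot's broker code; the names Segment, time_boundary and split_query_range
are hypothetical, and the rule "boundary = latest offline end day" is an
assumption made for illustration (the real boundary computation may differ).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

DayRange = Tuple[int, int]  # inclusive (first_day, last_day), in daysSinceEpoch


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in for an offline segment's time metadata."""
    name: str
    start_day: int  # inclusive
    end_day: int    # inclusive


def time_boundary(offline_segments: List[Segment]) -> Optional[int]:
    """Latest day covered by any offline segment (assumed boundary rule)."""
    if not offline_segments:
        return None
    return max(s.end_day for s in offline_segments)


def split_query_range(
    start_day: int, end_day: int, boundary: Optional[int]
) -> Tuple[Optional[DayRange], Optional[DayRange]]:
    """Split a query's time range into (offline_part, realtime_part).

    Days up to and including the boundary go to the offline table, later
    days go to the realtime table, so no day is queried twice.
    """
    if boundary is None:
        # No offline data yet: everything is served from realtime.
        return None, (start_day, end_day)
    offline = (start_day, min(end_day, boundary)) if start_day <= boundary else None
    realtime = (max(start_day, boundary + 1), end_day) if end_day > boundary else None
    return offline, realtime


if __name__ == "__main__":
    offline_segments = [Segment("seg_a", 17840, 17845), Segment("seg_b", 17846, 17850)]
    boundary = time_boundary(offline_segments)
    print(split_query_range(17849, 17856, boundary))
```

With the two sample segments, a query over days 17849 to 17856 is split at
day 17850: the earlier days are answered from offline segments and the
remaining days from realtime segments. The client still names only the
table, as the doc notes.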
