This is an automated email from the ASF dual-hosted git repository. mcvsubbu pushed a commit to branch doc-reorg in repository https://gitbox.apache.org/repos/asf/incubator-pinot.git
commit a28c8ca4086269551be3c1447ee3e2c8f11b4a3e Author: Subbu Subramaniam <[email protected]> AuthorDate: Tue Nov 27 17:10:20 2018 -0800 Re-org documentation Combined the sections on creating pinot segments into one. Removed extra pictures from pluggable stream section, referencing the realtime design instead. Created a new top level section on customizing pinot Other minor edits and warning fixes --- docs/client_api.rst | 56 ++++++++-------- docs/conf.py | 2 +- docs/creating_pinot_segments.rst | 98 --------------------------- docs/expressions_udf.rst | 9 +-- docs/index.rst | 36 ++++++---- docs/intro.rst | 4 +- docs/llc.rst | 26 ++++++-- docs/management_api.rst | 6 +- docs/multitenancy.rst | 24 +++---- docs/partition_aware_routing.rst | 6 +- docs/pinot_hadoop.rst | 139 ++++++++++++++++++++++++++++++-------- docs/pluggable_streams.rst | 64 +++++++++--------- docs/pql_examples.rst | 141 +++++++++++++++++++++++---------------- docs/reference.rst | 7 +- docs/schema_timespec.rst | 6 +- docs/segment_fetcher.rst | 16 ++--- docs/trying_pinot.rst | 6 +- 17 files changed, 341 insertions(+), 305 deletions(-) diff --git a/docs/client_api.rst b/docs/client_api.rst index f802d02..6a99d9a 100644 --- a/docs/client_api.rst +++ b/docs/client_api.rst @@ -1,12 +1,14 @@ -REST API -======== +Executing queries via REST API on the Broker +============================================ -The Pinot REST API can be accessed by ``POST`` ing a JSON object containing the parameter ``pql`` to the ``/query`` endpoint on a broker. Depending on the type of query, the results can take different shapes. For example, using curl: +The Pinot REST API can be accessed by invoking ``POST`` operation witha a JSON body containing the parameter ``pql`` +to the ``/query`` URI endpoint on a broker. Depending on the type of query, the results can take different shapes. +The examples below use curl. Aggregation ----------- -:: +.. code-block:: none curl -X POST -d '{"pql":"select count(*) from flights"}' http://localhost:8099/query @@ -30,7 +32,7 @@ Aggregation Aggregation with grouping ------------------------- -:: +.. code-block:: none curl -X POST -d '{"pql":"select count(*) from flights group by Carrier"}' http://localhost:8099/query @@ -68,7 +70,7 @@ Aggregation with grouping Selection --------- -:: +.. code-block:: none curl -X POST -d '{"pql":"select * from flights limit 3"}' http://localhost:8099/query @@ -145,12 +147,12 @@ Connections to Pinot are created using the ConnectionFactory class' utility meth .. code-block:: java - Connection connection = ConnectionFactory.fromZookeeper - (some-zookeeper-server:2191/zookeeperPath"); + Connection connection = ConnectionFactory.fromZookeeper + ("some-zookeeper-server:2191/zookeeperPath"); - Connection connection = ConnectionFactory.fromProperties("demo.properties"); + Connection connection = ConnectionFactory.fromProperties("demo.properties"); - Connection connection = ConnectionFactory.fromHostList + Connection connection = ConnectionFactory.fromHostList ("some-server:1234", "some-other-server:1234", ...); @@ -158,8 +160,8 @@ Queries can be sent directly to the Pinot cluster using the Connection.execute(j .. 
code-block:: java - ResultSetGroup resultSetGroup = connection.execute("select * from foo..."); - Future<ResultSetGroup> futureResultSetGroup = connection.executeAsync + ResultSetGroup resultSetGroup = connection.execute("select * from foo..."); + Future<ResultSetGroup> futureResultSetGroup = connection.executeAsync ("select * from foo..."); @@ -167,38 +169,38 @@ Queries can also use a PreparedStatement to escape query parameters: .. code-block:: java - PreparedStatement statement = connection.prepareStatement + PreparedStatement statement = connection.prepareStatement ("select * from foo where a = ?"); - statement.setString(1, "bar"); + statement.setString(1, "bar"); - ResultSetGroup resultSetGroup = statement.execute(); - Future<ResultSetGroup> futureResultSetGroup = statement.executeAsync(); + ResultSetGroup resultSetGroup = statement.execute(); + Future<ResultSetGroup> futureResultSetGroup = statement.executeAsync(); In the case of a selection query, results can be obtained with the various get methods in the first ResultSet, obtained through the getResultSet(int) method: .. code-block:: java - ResultSet resultSet = connection.execute + ResultSet resultSet = connection.execute ("select foo, bar from baz where quux = 'quuux'").getResultSet(0); - for(int i = 0; i < resultSet.getRowCount(); ++i) { - System.out.println("foo: " + resultSet.getString(i, 0); - System.out.println("bar: " + resultSet.getInt(i, 1); - } + for (int i = 0; i < resultSet.getRowCount(); ++i) { + System.out.println("foo: " + resultSet.getString(i, 0)); + System.out.println("bar: " + resultSet.getInt(i, 1)); + } - resultSet.close(); + resultSet.close(); -In the case where there is an aggregation, each aggregation function is within its own ResultSet: +In the case of aggregation, each aggregation function is within its own ResultSet: .. code-block:: java - ResultSetGroup resultSetGroup = connection.execute("select count(*) from foo"); + ResultSetGroup resultSetGroup = connection.execute("select count(*) from foo"); - ResultSet resultSet = resultSetGroup.getResultSet(0); - System.out.println("Number of records: " + resultSet.getInt(0)); - resultSet.close(); + ResultSet resultSet = resultSetGroup.getResultSet(0); + System.out.println("Number of records: " + resultSet.getInt(0)); + resultSet.close(); There can be more than one ResultSet, each of which can contain multiple results grouped by a group key. diff --git a/docs/conf.py b/docs/conf.py index a95d37e..1cf24c1 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -48,6 +48,7 @@ extensions = [ 'sphinx.ext.todo', 'sphinx.ext.mathjax', 'sphinx.ext.githubpages', + 'sphinx.ext.intersphinx', ] # Add any paths that contain templates here, relative to this directory. @@ -293,7 +294,6 @@ texinfo_documents = [ 'Miscellaneous'), ] -extensions = ['sphinx.ext.intersphinx'] # Documents to append as an appendix to all manuals. #texinfo_appendices = [] diff --git a/docs/creating_pinot_segments.rst b/docs/creating_pinot_segments.rst deleted file mode 100644 index 5bae71e..0000000 --- a/docs/creating_pinot_segments.rst +++ /dev/null @@ -1,98 +0,0 @@ -Creating Pinot segments outside of Hadoop -========================================= - -This document describes steps required for creating Pinot2_0 segments from standard formats like CSV/JSON. - -Compiling the code ------------------- -Follow the steps described in the section on :doc: `Demonstration <trying_pinot>` to build pinot. Locate ``pinot-admin.sh`` in ``pinot-tools/trget/pinot-tools=pkg/bin/pinot-admin.sh``. 
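For example, a typical build-and-run sequence might look like the sketch below. The Maven invocation and the ``pinot-tools/target/pinot-tools-pkg`` packaging path are assumptions based on a standard build of the repository and may differ between releases; the ``CreateSegment`` options are the ones documented in this section.

.. code-block:: none

   # Build Pinot from the repository root (skip tests for speed)
   mvn clean install -DskipTests

   # The admin script is packaged under the pinot-tools module
   cd pinot-tools/target/pinot-tools-pkg

   # Generate segments from a directory of CSV files
   bin/pinot-admin.sh CreateSegment -dataDir /path/to/csv-input -format CSV \
     -readerConfigFile csv-config.json -schemaFile my-schema.json \
     -segmentName mySegment -tableName myTable -outDir /path/to/segment-output -overwrite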
- - -Data Preparation ----------------- - -#. Create a top level directory containing all the CSV/JSON files that need to be converted. -#. The file name extensions are expected to be the same as the format name (_i.e_ ``.csv``, or ``.json``), and are case insensitive. - Note that the converter expects the .csv extension even if the data is delimited using tabs or spaces instead. -#. Prepare a schema file describing the schema of the input data. This file needs to be in JSON format. An example is provided at the end of the article. -#. Specifically for CSV format, an optional csv config file can be provided (also in JSON format). This is used to configure parameters like the delimiter/header for the CSV file etc. - A detailed description of this follows below. - -Creating a segment ------------------- -Run the pinot-admin command to generate the segments. The command can be invoked as follows. Options within "[ ]" are optional. For -format, the default value is AVRO. - -:: - - bin/pinot-admin.sh CreateSegment -dataDir <input_data_dir> [-format [CSV/JSON/AVRO]] [-readerConfigFile <csv_config_file>] [-generatorConfigFile <generator_config_file>] -segmentName <segment_name> -schemaFile <input_schema_file> -tableName <table_name> -outDir <output_data_dir> [-overwrite] - -CSV Reader Config file ----------------------- -To configure various parameters for CSV a config file in JSON format can be provided. This file is optional, as are each of its parameters. When not provided, default values used for these parameters are described below: - -#. fileFormat: Specify one of the following. Default is EXCEL. - - ##. EXCEL - ##. MYSQL - ##. RFC4180 - ##. TDF - -#. header: If the input CSV file does not contain a header, it can be specified using this field. Note, if this is specified, then the input file is expected to not contain the header row, or else it will result in parse error. The columns in the header must be delimited by the same delimiter character as the rest of the CSV file. -#. delimiter: Use this to specify a delimiter character. The default value is ",". -#. dateFormat: If there are columns that are in date format and need to be converted into Epoch (in milliseconds), use this to specify the format. Default is "mm-dd-yyyy". -#. dateColumns: If there are multiple date columns, use this to list those columns. - -Below is a sample config file. 
- -:: - - { - "fileFormat" : "EXCEL", - "header" : "col1,col2,col3,col4", - "delimiter" : "\t", - "dateFormat" : "mm-dd-yy" - "dateColumns" : ["col1", "col2"] - } - -Sample JSON schema file: - -:: - - { - "dimensionFieldSpecs" : [ - { - "dataType" : "STRING", - "delimiter" : null, - "singleValueField" : true, - "name" : "name" - }, - { - "dataType" : "INT", - "delimiter" : null, - "singleValueField" : true, - "name" : "age" - } - ], - "timeFieldSpec" : { - "incomingGranularitySpec" : { - "timeType" : "DAYS", - "dataType" : "LONG", - "name" : "incomingName1" - }, - "outgoingGranularitySpec" : { - "timeType" : "DAYS", - "dataType" : "LONG", - "name" : "outgoingName1" - } - }, - "metricFieldSpecs" : [ - { - "dataType" : "FLOAT", - "delimiter" : null, - "singleValueField" : true, - "name" : "percent" - } - ] - }, - "schemaName" : "mySchema", - } diff --git a/docs/expressions_udf.rst b/docs/expressions_udf.rst index d373424..270be67 100644 --- a/docs/expressions_udf.rst +++ b/docs/expressions_udf.rst @@ -7,13 +7,13 @@ The query language for Pinot (:doc:`PQL <reference>`) currently only supports *s The high level requirement here is to support *expressions* that represent a function on a set of columns in the queries, as opposed to just columns. -:: +.. code-block:: none select <exp1> from myTable where ... [group by <exp2>] Where exp1 and exp2 can be of the form: -:: +.. code-block:: none func1(func2(col1, col2...func3(...)...), coln...)... @@ -44,7 +44,8 @@ Parser The PQL parser is already capable of parsing expressions in the *selection*, *aggregation* and *group by* sections. Following is a sample query containing expression, and its parse tree shown in the image. -:: +.. code-block:: none + select f1(f2(col1, col2), col3) from myTable where (col4 = 'x') group by f3(col5, f4(col6, col7)) @@ -112,7 +113,7 @@ We see the following limitations in functionality currently: #. Nesting of *aggregation* functions is not supported in the expression tree. This is because the number of documents after *aggregation* is reduced. In the expression below, *sum* of *col2* would yield one value, whereas *xform1* one *col1* would yield the same number of documents as in the input. -:: +.. code-block:: none sum(xform1(col1), sum(col2)) diff --git a/docs/index.rst b/docs/index.rst index 66d6ec2..f1defaa 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -3,12 +3,10 @@ You can adapt this file completely to your liking, but it should at least contain the root `toctree` directive. -Welcome to Pinot's documentation! -================================= +############ +Introduction +############ -######## -Contents -######## .. toctree:: :maxdepth: 1 @@ -24,8 +22,20 @@ Reference .. toctree:: :maxdepth: 1 - in_production + reference + in_production + +################# +Customizing Pinot +################# + +.. toctree:: + :maxdepth: 1 + + + pluggable_streams + segment_fetcher ################ Design Documents @@ -34,19 +44,19 @@ Design Documents .. toctree:: :maxdepth: 1 + llc partition_aware_routing expressions_udf - multitenancy schema_timespec - +################ +Design Proposals +################ +.. 
toctree:: + :maxdepth: 1 -Indices and tables -================== -* :ref:`genindex` -* :ref:`modindex` -* :ref:`search` + multitenancy diff --git a/docs/intro.rst b/docs/intro.rst index b169b4a..9f5cb93 100644 --- a/docs/intro.rst +++ b/docs/intro.rst @@ -1,5 +1,5 @@ -Introduction -============ +About Pinot +=========== Pinot is a realtime distributed OLAP datastore, which is used at LinkedIn to deliver scalable real time analytics with low latency. It can ingest data from offline data sources (such as Hadoop and flat files) as well as streaming events (such as Kafka). Pinot is designed to scale horizontally, diff --git a/docs/llc.rst b/docs/llc.rst index 15d2e0a..4c08e70 100644 --- a/docs/llc.rst +++ b/docs/llc.rst @@ -1,14 +1,28 @@ -LLC Design -========== +Realtime Design +=============== + +Pinot consumes rows from streaming data (such as Kafka) and serves queries on the data consumed thus far. + +Two modes of consumption are supported in Pinot: + +.. _hlc-section: + +High Level Consumers +-------------------- -High Level Consumer -------------------- .. figure:: High-level-stream.png High Level Stream Consumer Architecture -Low Level Consumer ------------------- + +*TODO*: Add design description of how HLC realtime works + + +.. _llc-section: + +Low Level Consumers +------------------- + .. figure:: Low-level-stream.png Low Level Stream Consumer Architecture diff --git a/docs/management_api.rst b/docs/management_api.rst index a2c367a..5e4d843 100644 --- a/docs/management_api.rst +++ b/docs/management_api.rst @@ -1,5 +1,7 @@ -Management REST API -------------------- +Managing Pinot via REST API on the Controller +============================================= + +*TODO* : Remove this section altogether and find a place somewhere for a pointer to the management API. Maybe in the 'Running pinot in production' section? There is a REST API which allows management of tables, tenants, segments and schemas. It can be accessed by going to ``http://[controller host]/help`` which offers a web UI to do these tasks, as well as document the REST API. diff --git a/docs/multitenancy.rst b/docs/multitenancy.rst index edd4bf8..ffcf157 100644 --- a/docs/multitenancy.rst +++ b/docs/multitenancy.rst @@ -87,7 +87,7 @@ Adding Nodes to cluster Adding node to cluster can be done in two ways, manual or automatic. This is controlled by a property set in cluster config called "allowPariticpantAutoJoin". If this is set to true, participants can join the cluster when they are started. If not, they need to be pre-registered in Helix via `Helix Admin <http://helix.apache.org/0.6.4-docs/tutorial_admin.html>`_ command addInstance. -:: +.. code-block:: none { "id" : "PinotPerfTestCluster", @@ -102,7 +102,7 @@ In Pinot 2.0 we will set AUTO_JOIN to true. This means after the SRE's procure t The znode ``CONFIGS/PARTICIPANT/ServerInstanceName`` looks lik below: -:: +.. code-block:: none { "id":"Server_localhost_8098" @@ -120,7 +120,7 @@ The znode ``CONFIGS/PARTICIPANT/ServerInstanceName`` looks lik below: And the znode ``CONFIGS/PARTICIPANT/BrokerInstanceName`` looks like below: -:: +.. code-block:: none { "id":"Broker_localhost_8099" @@ -143,7 +143,7 @@ There is one resource idealstate created for Broker by default called broker_res *CLUSTERNAME/IDEALSTATES/BrokerResource (Broker IdealState before adding data resource)* -:: +.. code-block:: none { "id" : "brokerResource", @@ -166,7 +166,7 @@ After adding a resource using the following data resource creation command, a re Sample Curl request ------------------- -:: +.. 
code-block:: none curl -i -X POST -H 'Content-Type: application/json' -d '{"requestType":"create", "resourceName":"XLNT","tableName":"T1", "timeColumnName":"daysSinceEpoch", "timeType":"daysSinceEpoch","numberOfDataInstances":4,"numberOfCopies":2,"retentionTimeUnit":"DAYS", "retentionTimeValue":"700","pushFrequency":"daily", "brokerTagName":"XLNT", "numberOfBrokerInstances":1, "segmentAssignmentStrategy":"BalanceNumSegmentAssignmentStrategy", "resourceType":"OFFLINE", "metadata":{}}' @@ -175,7 +175,7 @@ This is how it looks in Helix after running the above command. The znode ``CONFIGS/PARTICIPANT/Broker_localhost_8099`` looks as follows: -:: +.. code-block:: none { "id":"Broker_localhost_8099" @@ -193,7 +193,7 @@ The znode ``CONFIGS/PARTICIPANT/Broker_localhost_8099`` looks as follows: And the znode ``IDEALSTATES/brokerResource`` looks like below after Data resource is created -:: +.. code-block:: none { "id":"brokerResource" @@ -220,7 +220,7 @@ Server Info in Helix The znode ``CONFIGS/PARTICIPANT/Server_localhost_8098`` looks as below -:: +.. code-block:: none { "id":"Server_localhost_8098" @@ -238,7 +238,7 @@ The znode ``CONFIGS/PARTICIPANT/Server_localhost_8098`` looks as below And the znode ``/IDEALSTATES/XLNT (XLNT Data Resource IdealState)`` looks as below: -:: +.. code-block:: none { "id":"XLNT" @@ -267,7 +267,7 @@ Add a table to data resource Sample Curl request -:: +.. code-block:: none curl -i -X PUT -H 'Content-Type: application/json' -d '{"requestType":"addTableToResource","resourceName":"XLNT","tableName":"T1", "resourceType":"OFFLINE", "metadata":{}}' <span class="nolink">[http://CONTROLLER-HOST:PORT/dataresources](http://CONTROLLER-HOST:PORT/dataresources) @@ -275,7 +275,7 @@ After the table is added, mapping between Resources and Tables are maintained in The znode ``/PROPERTYSTORE/CONFIGS/RESOURCE/XLNT`` like like: -:: +.. code-block:: none { "id":"mirrorProfileViewOfflineEvents1_O" @@ -307,7 +307,7 @@ The znode ``/PROPERTYSTORE/CONFIGS/RESOURCE/XLNT`` like like: The znode ``/IDEALSTATES/XLNT (XLNT Data Resource IdealState)`` -:: +.. code-block:: none { "id":"XLNT_O" diff --git a/docs/partition_aware_routing.rst b/docs/partition_aware_routing.rst index 1a6a4d9..be9dcd4 100644 --- a/docs/partition_aware_routing.rst +++ b/docs/partition_aware_routing.rst @@ -48,7 +48,7 @@ Prune function: Name of the class that will be used by the broker to prune a seg For example, let us consider a case where the data is naturally partitioned on time column ‘daysSinceEpoch’. The segment zk metadata will have information like below: -:: +.. code-block:: none { “partitionColumn” : “daysSinceEpoch”, @@ -59,7 +59,7 @@ For example, let us consider a case where the data is naturally partitioned on t Now consider the following query comes in. -:: +.. code-block:: none Select count(*) from myTable where daysSinceEpoch between 17100 and 17110 @@ -67,7 +67,7 @@ The broker will recognize the range predicate on the partition column, and call Let’s consider another example where the data is partitioned by memberId, where a hash function was applied on the memberId to compute a partition number. -:: +.. 
code-block:: none { “partitionColumn” : “memberId”, diff --git a/docs/pinot_hadoop.rst b/docs/pinot_hadoop.rst index 0fa4181..479095d 100644 --- a/docs/pinot_hadoop.rst +++ b/docs/pinot_hadoop.rst @@ -1,39 +1,43 @@ -Creating Pinot segments in Hadoop -================================= +Creating Pinot segments +======================= -Pinot index files can be created offline on Hadoop, then pushed onto a production cluster. Because index generation does not happen on the Pinot nodes serving traffic, this means that these nodes can continue to serve traffic without impacting performance while data is being indexed. The index files are then pushed onto the Pinot cluster, where the files are distributed and loaded by the server nodes with minimal performance impact. +Pinot segments can be created offline on Hadoop, or via command line from data files. Controller REST endpoint +can then be used to add the segment to the table to which the segment belongs. + +Creating segments using hadoop +------------------------------ .. figure:: Pinot-Offline-only-flow.png Offline Pinot workflow -To create index files offline a Hadoop workflow can be created to complete the following steps: +To create Pinot segments on Hadoop, a workflow can be created to complete the following steps: -1. Pre-aggregate, clean up and prepare the data, writing it as Avro format files in a single HDFS directory -2. Create the index files -3. Upload the index files to the Pinot cluster +#. Pre-aggregate, clean up and prepare the data, writing it as Avro format files in a single HDFS directory +#. Create segments +#. Upload segments to the Pinot cluster -Step one can be done using your favorite tool (such as Pig, Hive or Spark), while Pinot provides two MapReduce jobs to do step two and three. +Step one can be done using your favorite tool (such as Pig, Hive or Spark), Pinot provides two MapReduce jobs to do step two and three. -Configuration -------------- +Configuring the job +^^^^^^^^^^^^^^^^^^^ Create a job properties configuration file, such as one below: -:: +.. code-block:: none # === Index segment creation job config === # path.to.input: Input directory containing Avro files path.to.input=/user/pinot/input/data - # path.to.output: Output directory containing Pinot index segments + # path.to.output: Output directory containing Pinot segments path.to.output=/user/pinot/output # path.to.schema: Schema file for the table, stored locally path.to.schema=flights-schema.json - # segment.table.name: Name of the table for which to generate index segments + # segment.table.name: Name of the table for which to generate segments segment.table.name=flights # === Segment tar push job config === @@ -45,28 +49,109 @@ Create a job properties configuration file, such as one below: push.to.port=8888 -Index file creation -------------------- +Executing the job +^^^^^^^^^^^^^^^^^ The Pinot Hadoop module contains a job that you can incorporate into your -workflow to generate Pinot indices. Note that this will only create data for you. -In order to have this data on your cluster, you want to also run the SegmentTarPush -job, details below. To run SegmentCreation through the command line: +workflow to generate Pinot segments. -:: +.. code-block:: none mvn clean install -DskipTests -Pbuild-shaded-jar hadoop jar pinot-hadoop-0.016-shaded.jar SegmentCreation job.properties +You can then use the SegmentTarPush job to push segments via the controller REST API. 
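Putting the pieces together, an end-to-end run over the ``job.properties`` file above might look like the sketch below; the shaded jar version (``0.016`` in the commands in this section) will vary with your build.

.. code-block:: none

   # Build the shaded Pinot Hadoop jar once
   mvn clean install -DskipTests -Pbuild-shaded-jar

   # Step 2 of the workflow: create segments from the Avro files under path.to.input
   hadoop jar pinot-hadoop-0.016-shaded.jar SegmentCreation job.properties

   # Step 3 of the workflow: push the generated segments to the controller named in the push job config
   hadoop jar pinot-hadoop-0.016-shaded.jar SegmentTarPush job.properties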
-Index file push --------------- - -This job takes generated Pinot index files from an input directory and pushes -them to a Pinot controller node. - -:: +.. code-block:: none - mvn clean install -DskipTests -Pbuild-shaded-jar hadoop jar pinot-hadoop-0.016-shaded.jar SegmentTarPush job.properties + +Creating Pinot segments outside of Hadoop +----------------------------------------- + +This section describes the steps required for creating Pinot segments from standard formats like CSV/JSON. + +#. Follow the steps described in the section on :doc:`Demonstration <trying_pinot>` to build Pinot. Locate ``pinot-admin.sh`` in ``pinot-tools/target/pinot-tools-pkg/bin/pinot-admin.sh``. +#. Create a top level directory containing all the CSV/JSON files that need to be converted into segments. +#. The file name extensions are expected to be the same as the format name (*i.e.*, ``.csv`` or ``.json``), and are case insensitive. + Note that the converter expects the ``.csv`` extension even if the data is delimited using tabs or spaces instead. +#. Prepare a schema file describing the schema of the input data. The schema needs to be in JSON format. See the example later in this section. +#. Specifically for the CSV format, an optional CSV config file can be provided (also in JSON format). This is used to configure parameters such as the delimiter, header, etc. for the CSV file. + A detailed description of this follows below. + +Run the pinot-admin command to generate the segments. The command can be invoked as follows. Options within "[ ]" are optional. For ``-format``, the default value is AVRO. + +.. code-block:: none + + bin/pinot-admin.sh CreateSegment -dataDir <input_data_dir> [-format [CSV/JSON/AVRO]] [-readerConfigFile <csv_config_file>] [-generatorConfigFile <generator_config_file>] -segmentName <segment_name> -schemaFile <input_schema_file> -tableName <table_name> -outDir <output_data_dir> [-overwrite] + + +To configure various parameters for CSV, a config file in JSON format can be provided. This file is optional, as is each of its parameters. When a parameter is not provided, the default value described below is used: + +#. fileFormat: Specify one of the following. Default is EXCEL. + + #. EXCEL + #. MYSQL + #. RFC4180 + #. TDF + +#. header: If the input CSV file does not contain a header, it can be specified using this field. Note that if this is specified, the input file is expected to not contain the header row, or else it will result in a parse error. The columns in the header must be delimited by the same delimiter character as the rest of the CSV file. +#. delimiter: Use this to specify a delimiter character. The default value is ",". +#. dateFormat: If there are columns that are in date format and need to be converted into Epoch (in milliseconds), use this to specify the format. Default is "mm-dd-yyyy". +#. dateColumns: If there are multiple date columns, use this to list those columns. + +Below is a sample config file. + +.. code-block:: none + + { + "fileFormat" : "EXCEL", + "header" : "col1,col2,col3,col4", + "delimiter" : "\t", + "dateFormat" : "mm-dd-yy", + "dateColumns" : ["col1", "col2"] + } + +Sample Schema: + +..
code-block:: none + + { + "dimensionFieldSpecs" : [ + { + "dataType" : "STRING", + "delimiter" : null, + "singleValueField" : true, + "name" : "name" + }, + { + "dataType" : "INT", + "delimiter" : null, + "singleValueField" : true, + "name" : "age" + } + ], + "timeFieldSpec" : { + "incomingGranularitySpec" : { + "timeType" : "DAYS", + "dataType" : "LONG", + "name" : "incomingName1" + }, + "outgoingGranularitySpec" : { + "timeType" : "DAYS", + "dataType" : "LONG", + "name" : "outgoingName1" + } + }, + "metricFieldSpecs" : [ + { + "dataType" : "FLOAT", + "delimiter" : null, + "singleValueField" : true, + "name" : "percent" + } + ], + "schemaName" : "mySchema" + } diff --git a/docs/pluggable_streams.rst b/docs/pluggable_streams.rst index 876e31c..4c6b26a 100644 --- a/docs/pluggable_streams.rst +++ b/docs/pluggable_streams.rst @@ -16,17 +16,14 @@ Some of the streams for which plug-ins can be added are: * `Pulsar <https://pulsar.apache.org/docs/en/client-libraries-java/>`_ -You may encounter some limitations either in Pinot or in the stream system while developing plug-ins. Please feel free to get in touch with us when you start writing a stream plug-in, and we can help you out. We are open to receiving PRs in order to improve these abstractions if they do not work for a certain stream implementation. +You may encounter some limitations either in Pinot or in the stream system while developing plug-ins. +Please feel free to get in touch with us when you start writing a stream plug-in, and we can help you out. +We are open to receiving PRs in order to improve these abstractions if they do not work for a certain stream implementation. -Pinot Stream Consumers ---------------------- -Pinot consumes rows from event streams and serves queries on the data consumed. Rows may be consumed either at stream level (also referred to as high level) or at partition level (also referred to as low level). +Refer to sections :ref:`hlc-section` and :ref:`llc-section` for details on how Pinot consumes streaming data. -.. figure:: pluggable_streams.png - -.. figure:: High-level-stream.png - - Stream Level Consumer +Requirements to support Stream Level (High Level) consumers +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The stream should provide the following guarantees: @@ -36,13 +33,12 @@ The stream should provide the following guarantees: * The checkpoints should be recorded only when Pinot makes a call to do so. * The consumer should be able to start consumption from one of: - ** latest avaialble data - ** earliest available data - ** last saved checkpoint - -.. figure:: Low-level-stream.png + * latest available data + * earliest available data + * last saved checkpoint - Partition Level Consumer +Requirements to support Partition Level (Low Level) consumers +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ While consuming rows at a partition level, the stream should support the following properties: @@ -52,9 +48,10 @@ properties: * Refer to a partition as a number not exceeding 32 bits long. * Stream should provide the following mechanisms to get an offset for a given partition of the stream: - ** get the offset of the oldest event available (assuming events are aged out periodically) in the partition. - ** get the offset of the most recent event published in the partition - ** (optionally) get the offset of an event that was published at a specified time + * get the offset of the oldest event available (assuming events are aged out periodically) in the partition.
+ * get the offset of the most recent event published in the partition + * (optionally) get the offset of an event that was published at a specified time + * Stream should provide a mechanism to consume a set of events from a partition starting from a specified offset. * Events with higher offsets should be more recent (the offsets of events need not be contiguous) @@ -91,24 +88,25 @@ The rest of the configuration properties for your stream should be set with the All values should be strings. For example: -.:: - "streamType" : "foo", - "stream.foo.topic.name" : "SomeTopic", - "stream.foo.consumer.type": "lowlevel", - "stream.foo.consumer.factory.class.name": "fully.qualified.pkg.ConsumerFactoryClassName", - "stream.foo.consumer.prop.auto.offset.reset": "largest", - "stream.foo.decoder.class.name" : "fully.qualified.pkg.DecoderClassName", - "stream.foo.decoder.prop.a.decoder.property" : "decoderPropValue", - "stream.foo.connection.timeout.millis" : "10000", // default 30_000 - "stream.foo.fetch.timeout.millis" : "10000" // default 5_000 +.. code-block:: none + + "streamType" : "foo", + "stream.foo.topic.name" : "SomeTopic", + "stream.foo.consumer.type": "lowlevel", + "stream.foo.consumer.factory.class.name": "fully.qualified.pkg.ConsumerFactoryClassName", + "stream.foo.consumer.prop.auto.offset.reset": "largest", + "stream.foo.decoder.class.name" : "fully.qualified.pkg.DecoderClassName", + "stream.foo.decoder.prop.a.decoder.property" : "decoderPropValue", + "stream.foo.connection.timeout.millis" : "10000", // default 30_000 + "stream.foo.fetch.timeout.millis" : "10000" // default 5_000 You can have additional properties that are specific to your stream. For example: -.:: +.. code-block:: none -"stream.foo.some.buffer.size" : "24g" + "stream.foo.some.buffer.size" : "24g" In addition to these properties, you can define thresholds for the consuming segments: @@ -117,10 +115,10 @@ In addition to these properties, you can define thresholds for the consuming seg The properties for the thresholds are as follows: -.:: +.. code-block:: none -"realtime.segment.flush.threshold.size" : "100000" -"realtime.segment.flush.threshold.time" : "6h" + "realtime.segment.flush.threshold.size" : "100000" + "realtime.segment.flush.threshold.time" : "6h" An example of this implementation can be found in the `KafkaConsumerFactory <com.linkedin.pinot.core.realtime.impl.kafka.KafkaConsumerFactory>`_, which is an implementation for the kafka stream. diff --git a/docs/pql_examples.rst b/docs/pql_examples.rst index a160763..4688617 100644 --- a/docs/pql_examples.rst +++ b/docs/pql_examples.rst @@ -77,66 +77,30 @@ Note: results might not be consistent if column ordered by has same value in mul ORDER BY bar DESC LIMIT 50, 100 -Wild-card match ---------------- +Wild-card match (in WHERE clause only) +-------------------------------------- + +To count rows where the column ``airlineName`` starts with ``U`` .. code-block:: sql SELECT count(*) FROM SomeTable - WHERE regexp_like(columnName, '.*regex-here?') - GROUP BY someOtherColumn TOP 10 + WHERE regexp_like(airlineName, '^U.*') + GROUP BY airlineName TOP 10 + +Examples with UDF +----------------- -Time-Convert UDF ----------------- +As of now, functions have to be implemented within Pinot. Injecting functions is not allowed yet. +The examples below demonstrate the use of UDFs .. 
code-block:: sql SELECT count(*) FROM myTable GROUP BY timeConvert(timeColumnName, 'SECONDS', 'DAYS') -Differences with SQL --------------------- - -* ``JOIN`` is not supported -* Use ``TOP`` instead of ``LIMIT`` for truncation -* ``LIMIT n`` has no effect in grouping queries, should use ``TOP n`` instead. If no ``TOP n`` defined, PQL will use ``TOP 10`` as default truncation setting. -* No need to select the columns to group with. - -The following two queries are both supported in PQL, where the non-aggregation columns are ignored. - -.. code-block:: sql - - SELECT MIN(foo), MAX(foo), SUM(foo), AVG(foo) FROM mytable - GROUP BY bar, baz - TOP 50 - - SELECT bar, baz, MIN(foo), MAX(foo), SUM(foo), AVG(foo) FROM mytable - GROUP BY bar, baz - TOP 50 - -* Always order by the aggregated value - The results will always order by the aggregated value itself. -* Results equivalent to grouping on each aggregation - The results for query: - -.. code-block:: sql - - SELECT MIN(foo), MAX(foo) FROM myTable - GROUP BY bar - TOP 50 - -will be the same as the combining results from the following queries: - -.. code-block:: sql - - SELECT MIN(foo) FROM myTable - GROUP BY bar - TOP 50 - SELECT MAX(foo) FROM myTable - GROUP BY bar - TOP 50 - -where we don't put the results for the same group together. + SELECT count(*) FROM myTable + GROUP BY div(tim PQL Specification ----------------- @@ -189,10 +153,12 @@ WHERE Supported predicates are comparisons with a constant using the standard SQL operators (``=``, ``<``, ``<=``, ``>``, ``>=``, ``<>``, '!=') , range comparisons using ``BETWEEN`` (``foo BETWEEN 42 AND 69``), set membership (``foo IN (1, 2, 4, 8)``) and exclusion (``foo NOT IN (1, 2, 4, 8)``). For ``BETWEEN``, the range is inclusive. +Comparison with a regular expression is supported using the regexp_like function, as in ``WHERE regexp_like(columnName, 'regular expression')`` + GROUP BY ^^^^^^^^ -The ``GROUP BY`` clause groups aggregation results by a list of columns. +The ``GROUP BY`` clause groups aggregation results by a list of columns, or transform functions on columns (see below) ORDER BY @@ -224,11 +190,72 @@ For example, the following query will calculate the maximum value of column ``fo Supported transform functions ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +``ADD`` + Sum of at least two values + +``SUB`` + Difference between two values -* ``ADD``: sum of at least two values -* ``SUB``: difference between two values -* ``MULT``: product of at least two values -* ``DIV``: quotient of two values -* ``TIMECONVERT``: takes 3 arguments, converts the value into another time unit. E.g. ``TIMECONVERT(time, 'MILLISECONDS', 'SECONDS')`` -* ``DATETIMECONVERT``: takes 4 arguments, converts the value into another date time format, and buckets time based on the given time granularity. E.g. ``DATETIMECONVERT(date, '1:MILLISECONDS:EPOCH', '1:SECONDS:EPOCH', '15:MINUTES')`` -* ``VALUEIN``: takes at least 2 arguments, where the first argument is a multi-valued column, and the following arguments are constant values. The transform function will filter the value from the multi-valued column with the given constant values. The ``VALUEIN`` transform function is especially useful when the same multi-valued column is both filtering column and grouping column. E.g. ``VALUEIN(mvColumn, 3, 5, 15)`` +``MULT`` + Product of at least two values + +``DIV`` + Quotient of two values + +``TIMECONVERT`` + Takes 3 arguments, converts the value into another time unit. E.g. 
``TIMECONVERT(time, 'MILLISECONDS', 'SECONDS')`` + +``DATETIMECONVERT`` + Takes 4 arguments, converts the value into another date time format, and buckets time based on the given time granularity. + *e.g.* ``DATETIMECONVERT(date, '1:MILLISECONDS:EPOCH', '1:SECONDS:EPOCH', '15:MINUTES')`` + +``VALUEIN`` + Takes at least 2 arguments, where the first argument is a multi-valued column, and the following arguments are constant values. + The transform function will filter the value from the multi-valued column with the given constant values. + The ``VALUEIN`` transform function is especially useful when the same multi-valued column is both filtering column and grouping column. + *e.g.* ``VALUEIN(mvColumn, 3, 5, 15)`` + + +Differences with SQL +-------------------- + +* ``JOIN`` is not supported +* Use ``TOP`` instead of ``LIMIT`` for truncation +* ``LIMIT n`` has no effect in grouping queries, should use ``TOP n`` instead. If no ``TOP n`` defined, PQL will use ``TOP 10`` as default truncation setting. +* No need to select the columns to group with. + +The following two queries are both supported in PQL, where the non-aggregation columns are ignored. + +.. code-block:: sql + + SELECT MIN(foo), MAX(foo), SUM(foo), AVG(foo) FROM mytable + GROUP BY bar, baz + TOP 50 + + SELECT bar, baz, MIN(foo), MAX(foo), SUM(foo), AVG(foo) FROM mytable + GROUP BY bar, baz + TOP 50 + +* Always order by the aggregated value + The results will always order by the aggregated value itself. +* Results equivalent to grouping on each aggregation + The results for query: + +.. code-block:: sql + + SELECT MIN(foo), MAX(foo) FROM myTable + GROUP BY bar + TOP 50 + +will be the same as the combining results from the following queries: + +.. code-block:: sql + + SELECT MIN(foo) FROM myTable + GROUP BY bar + TOP 50 + SELECT MAX(foo) FROM myTable + GROUP BY bar + TOP 50 + +where we don't put the results for the same group together. diff --git a/docs/reference.rst b/docs/reference.rst index 51881d8..2171119 100644 --- a/docs/reference.rst +++ b/docs/reference.rst @@ -1,15 +1,10 @@ .. _reference: -Pinot Reference Manual -====================== .. toctree:: - :maxdepth: 2 + :maxdepth: 1 pql_examples client_api management_api pinot_hadoop - creating_pinot_segments - pluggable_streams - segment_fetcher diff --git a/docs/schema_timespec.rst b/docs/schema_timespec.rst index 1d7e769..531ff28 100644 --- a/docs/schema_timespec.rst +++ b/docs/schema_timespec.rst @@ -6,7 +6,7 @@ Problems with current schema design The pinot schema timespec looks like this: -:: +.. code-block:: none { "timeFieldSpec": @@ -29,7 +29,7 @@ Changes We have added a List<DateTimeFieldSpec> _dateTimeFieldSpecs to the pinot schema -:: +.. code-block:: none { “dateTimeFieldSpec”: @@ -67,7 +67,7 @@ We have added a List<DateTimeFieldSpec> _dateTimeFieldSpecs to the pinot schema Examples: -:: +.. code-block:: none “dateTimeFieldSpec”: { diff --git a/docs/segment_fetcher.rst b/docs/segment_fetcher.rst index 7e57602..97e2339 100644 --- a/docs/segment_fetcher.rst +++ b/docs/segment_fetcher.rst @@ -14,15 +14,15 @@ HDFS segment fetcher configs ----------------------------- In your Pinot controller/server configuration, you will need to provide the following configs: -:: + +.. code-block:: none pinot.controller.segment.fetcher.hdfs.hadoop.conf.path=`<file path to hadoop conf folder> or - -:: +.. 
code-block:: none pinot.server.segment.fetcher.hdfs.hadoop.conf.path=`<file path to hadoop conf folder> @@ -30,14 +30,14 @@ or This path should point the local folder containing ``core-site.xml`` and ``hdfs-site.xml`` files from your Hadoop installation -:: +.. code-block:: none pinot.controller.segment.fetcher.hdfs.hadoop.kerberos.principle=`<your kerberos principal> pinot.controller.segment.fetcher.hdfs.hadoop.kerberos.keytab=`<your kerberos keytab> or -:: +.. code-block:: none pinot.server.segment.fetcher.hdfs.hadoop.kerberos.principle=`<your kerberos principal> pinot.server.segment.fetcher.hdfs.hadoop.kerberos.keytab=`<your kerberos keytab> @@ -54,7 +54,7 @@ To push HDFS segment files to Pinot controller, you just need to ensure you have For example, the following curl requests to Controller will notify it to download segment files to the proper table: -:: +.. code-block:: none curl -X POST -H "UPLOAD_TYPE:URI" -H "DOWNLOAD_URI:hdfs://nameservice1/hadoop/path/to/segment/file.gz" -H "content-type:application/json" -d '' localhost:9000/segments @@ -63,13 +63,13 @@ Implement your own segment fetcher for other systems You can also implement your own segment fetchers for other file systems and load into Pinot system with an external jar. All you need to do is to implement a class that extends the interface of `SegmentFetcher <https://github.com/linkedin/pinot/blob/master/pinot-common/src/main/java/com/linkedin/pinot/common/segment/fetcher/SegmentFetcher.java>`_ and provides config to Pinot Controller and Server as follows: -:: +.. code-block:: none pinot.controller.segment.fetcher.`<protocol>`.class =`<class path to your implementation> or -:: +.. code-block:: none pinot.server.segment.fetcher.`<protocol>`.class =`<class path to your implementation> diff --git a/docs/trying_pinot.rst b/docs/trying_pinot.rst index ddc1197..40d932d 100644 --- a/docs/trying_pinot.rst +++ b/docs/trying_pinot.rst @@ -1,5 +1,5 @@ -Running the Pinot Demonstration -=============================== +Quickstart guide +================ A quick way to get familiar with Pinot is to run the Pinot examples. The examples can be run either by compiling the code or by running the prepackaged Docker images. @@ -35,7 +35,7 @@ Trying out the demo Once the Pinot cluster is running, you can query it by going to http://localhost:9000/query/ You can also use the REST API to query Pinot, as well as the Java client. As this is outside of the scope of this -introduction, the reference documentation to use the Pinot client APIs is in the :ref:`client-api` section. +introduction, the reference documentation to use the Pinot client APIs is in the :doc:`client_api` section. Pinot uses PQL, a SQL-like query language, to query data. Here are some sample queries: --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
