[incubator-pinot] branch master updated: Update doc for "Creating Pinot segments" to reflect the current code base (#4086)

jackie Mon, 08 Apr 2019 16:04:10 -0700

This is an automated email from the ASF dual-hosted git repository.

jackie pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-pinot.git



The following commit(s) were added to refs/heads/master by this push:
     new 435e8be  Update doc for "Creating Pinot segments" to reflect the current code base (#4086)
435e8be is described below

commit 435e8be323427057b6ed2b40887707e226aa642b
Author: Xiaotian (Jackie) Jiang <[email protected]>
AuthorDate: Mon Apr 8 16:04:00 2019 -0700

    Update doc for "Creating Pinot segments" to reflect the current code base (#4086)
    
    1. Remove delimiter from schema
    2. Update fields in CSV record reader config
    3. Remove redundant fields in table config
    4. Reformat to follow rst convention and fix some wrong format (e.g. index number)
---
 docs/pinot_hadoop.rst | 262 ++++++++++++++++++++++++--------------------------
 1 file changed, 123 insertions(+), 139 deletions(-)

diff --git a/docs/pinot_hadoop.rst b/docs/pinot_hadoop.rst
index fc3573b..81f0728 100644
--- a/docs/pinot_hadoop.rst
+++ b/docs/pinot_hadoop.rst
@@ -31,7 +31,7 @@ Creating segments using hadoop
 
 .. figure:: img/Pinot-Offline-only-flow.png
 
-  Offline Pinot workflow
+   Offline Pinot workflow
 
 To create Pinot segments on Hadoop, a workflow can be created to complete the following steps:
 
@@ -48,28 +48,27 @@ Create a job properties configuration file, such as one below:
 
 .. code-block:: none
 
-  # === Index segment creation job config ===
+   # === Index segment creation job config ===
 
-  # path.to.input: Input directory containing Avro files
-  path.to.input=/user/pinot/input/data
+   # path.to.input: Input directory containing Avro files
+   path.to.input=/user/pinot/input/data
 
-  # path.to.output: Output directory containing Pinot segments
-  path.to.output=/user/pinot/output
+   # path.to.output: Output directory containing Pinot segments
+   path.to.output=/user/pinot/output
 
-  # path.to.schema: Schema file for the table, stored locally
-  path.to.schema=flights-schema.json
+   # path.to.schema: Schema file for the table, stored locally
+   path.to.schema=flights-schema.json
 
-  # segment.table.name: Name of the table for which to generate segments
-  segment.table.name=flights
+   # segment.table.name: Name of the table for which to generate segments
+   segment.table.name=flights
 
-  # === Segment tar push job config ===
+   # === Segment tar push job config ===
 
-  # push.to.hosts: Comma separated list of controllers host names to which to push
-  push.to.hosts=controller_host_0,controller_host_1
-
-  # push.to.port: The port on which the controller runs
-  push.to.port=8888
+   # push.to.hosts: Comma separated list of controllers host names to which to push
+   push.to.hosts=controller_host_0,controller_host_1
 
+   # push.to.port: The port on which the controller runs
+   push.to.port=8888
 
 Executing the job
 ^^^^^^^^^^^^^^^^^
@@ -77,122 +76,105 @@ Executing the job
 The Pinot Hadoop module contains a job that you can incorporate into your
 workflow to generate Pinot segments.
 
-.. code-block:: none
+.. code-block:: bash
 
-  mvn clean install -DskipTests -Pbuild-shaded-jar
-  hadoop jar pinot-hadoop-<version>-SNAPSHOT-shaded.jar SegmentCreation job.properties
+   mvn clean install -DskipTests -Pbuild-shaded-jar
+   hadoop jar pinot-hadoop-<version>-SNAPSHOT-shaded.jar SegmentCreation job.properties
 
 You can then use the SegmentTarPush job to push segments via the controller REST API.
 
-.. code-block:: none
-
-  hadoop jar pinot-hadoop-<version>-SNAPSHOT-shaded.jar SegmentTarPush job.properties
+.. code-block:: bash
 
+   hadoop jar pinot-hadoop-<version>-SNAPSHOT-shaded.jar SegmentTarPush job.properties
 
 Creating Pinot segments outside of Hadoop
 -----------------------------------------
 
-Here is how you can create Pinot segments from standard formats like CSV/JSON.
+Here is how you can create Pinot segments from standard formats like CSV/JSON/AVRO.
 
 #. Follow the steps described in the section on :ref:`compiling-code-section` to build pinot. Locate ``pinot-admin.sh`` in ``pinot-tools/target/pinot-tools=pkg/bin/pinot-admin.sh``.
-#.  Create a top level directory containing all the CSV/JSON files that need to be converted into segments.
-#.  The file name extensions are expected to be the same as the format name (*i.e* ``.csv``, or ``.json``), and are case insensitive.
-    Note that the converter expects the ``.csv`` extension even if the data is delimited using tabs or spaces instead.
-#.  Prepare a schema file describing the schema of the input data. The schema needs to be in JSON format. See example later in this section.
-#.  Specifically for CSV format, an optional csv config file can be provided (also in JSON format). This is used to configure parameters like the delimiter/header for the CSV file etc.
-        A detailed description of this follows below.
+#. Create a top level directory containing all the CSV/JSON/AVRO files that need to be converted into segments.
+#. The file name extensions are expected to be the same as the format name (*i.e* ``.csv``, ``.json`` or ``.avro``), and are case insensitive. Note that the converter expects the ``.csv`` extension even if the data is delimited using tabs or spaces instead.
+#. Prepare a schema file describing the schema of the input data. The schema needs to be in JSON format. See example later in this section.
+#. Specifically for CSV format, an optional csv config file can be provided (also in JSON format). This is used to configure parameters like the delimiter/header for the CSV file etc. A detailed description of this follows below.
 
 Run the pinot-admin command to generate the segments. The command can be invoked as follows. Options within "[ ]" are optional. For -format, the default value is AVRO.
 
-.. code-block:: none
-
-    bin/pinot-admin.sh CreateSegment -dataDir <input_data_dir> [-format [CSV/JSON/AVRO]] [-readerConfigFile <csv_config_file>] [-generatorConfigFile <generator_config_file>] -segmentName <segment_name> -schemaFile <input_schema_file> -tableName <table_name> -outDir <output_data_dir> [-overwrite]
+.. code-block:: bash
 
+   bin/pinot-admin.sh CreateSegment -dataDir <input_data_dir> [-format [CSV/JSON/AVRO]] [-readerConfigFile <csv_config_file>] [-generatorConfigFile <generator_config_file>] -segmentName <segment_name> -schemaFile <input_schema_file> -tableName <table_name> -outDir <output_data_dir> [-overwrite]
 
 To configure various parameters for CSV a config file in JSON format can be provided. This file is optional, as are each of its parameters. When not provided, default values used for these parameters are described below:
 
-#.  fileFormat: Specify one of the following. Default is EXCEL.
+#. fileFormat: Specify one of the following. Default is EXCEL.
 
- ##.  EXCEL
- ##.  MYSQL
- ##.  RFC4180
- ##.  TDF
+   #. EXCEL
+   #. MYSQL
+   #. RFC4180
+   #. TDF
 
-#.  header: If the input CSV file does not contain a header, it can be specified using this field. Note, if this is specified, then the input file is expected to not contain the header row, or else it will result in parse error. The columns in the header must be delimited by the same delimiter character as the rest of the CSV file.
-#.  delimiter: Use this to specify a delimiter character. The default value is ",".
-#.  dateFormat: If there are columns that are in date format and need to be converted into Epoch (in milliseconds), use this to specify the format. Default is "mm-dd-yyyy".
-#.  dateColumns: If there are multiple date columns, use this to list those columns.
+#. header: If the input CSV file does not contain a header, it can be specified using this field. Note, if this is specified, then the input file is expected to not contain the header row, or else it will result in parse error. The columns in the header must be delimited by the same delimiter character as the rest of the CSV file.
+#. delimiter: Use this to specify a delimiter character. The default value is ",".
+#. multiValueDelimiter: Use this to specify a delimiter character for each value in multi-valued columns. The default value is ";".
 
 Below is a sample config file.
 
-.. code-block:: none
+.. code-block:: json
 
-  {
-    "fileFormat" : "EXCEL",
-    "header" : "col1,col2,col3,col4",
-    "delimiter" : "\t",
-    "dateFormat" : "mm-dd-yy"
-    "dateColumns" : ["col1", "col2"]
-  }
+   {
+     "fileFormat": "EXCEL",
+     "header": "col1,col2,col3,col4",
+     "delimiter": "\t",
+     "multiValueDelimiter": ","
+   }
 
 Sample Schema:
 
-.. code-block:: none
-
-  {
-    "dimensionFieldSpecs" : [
-      {
-        "dataType" : "STRING",
-        "delimiter" : null,
-        "singleValueField" : true,
-        "name" : "name"
-      },
-      {
-        "dataType" : "INT",
-        "delimiter" : null,
-        "singleValueField" : true,
-        "name" : "age"
-      }
-    ],
-    "timeFieldSpec" : {
-      "incomingGranularitySpec" : {
-        "timeType" : "DAYS",
-        "dataType" : "LONG",
-        "name" : "incomingName1"
-      },
-      "outgoingGranularitySpec" : {
-        "timeType" : "DAYS",
-        "dataType" : "LONG",
-        "name" : "outgoingName1"
-      }
-    },
-    "metricFieldSpecs" : [
-      {
-        "dataType" : "FLOAT",
-        "delimiter" : null,
-        "singleValueField" : true,
-        "name" : "percent"
-      }
-     ]
-    },
-    "schemaName" : "mySchema",
-  }
+.. code-block:: json
+
+   {
+     "schemaName": "flights",
+     "dimensionFieldSpecs": [
+       {
+         "name": "flightNumber",
+         "dataType": "LONG"
+       },
+       {
+         "name": "tags",
+         "dataType": "STRING",
+         "singleValueField": false
+       }
+     ],
+     "metricFieldSpecs": [
+       {
+         "name": "price",
+         "dataType": "DOUBLE"
+       }
+     ],
+     "timeFieldSpec": {
+       "incomingGranularitySpec": {
+         "name": "daysSinceEpoch",
+         "dataType": "INT",
+         "timeType": "DAYS"
+       }
+     }
+   }
 
 Pushing offline segments to Pinot
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 You can use curl to push a segment to pinot:
 
-.. code-block:: none
+.. code-block:: bash
 
    curl -X POST -F segment=@<segment-tar-file-path> http://controllerHost:controllerPort/segments
 
 
 Alternatively you can use the pinot-admin.sh utility to upload one or more segments:
 
-.. code-block:: none
+.. code-block:: bash
 
-  pinot-tools/target/pinot-tools-pkg/bin//pinot-admin.sh UploadSegment -controllerHost <hostname> -controllerPort <port> -segmentDir <segmentDirectoryPath>
+   pinot-tools/target/pinot-tools-pkg/bin//pinot-admin.sh UploadSegment -controllerHost <hostname> -controllerPort <port> -segmentDir <segmentDirectoryPath>
 
 The command uploads all the segments found in ``segmentDirectoryPath``.
 The segments could be either tar-compressed (in which case it is a file under ``segmentDirectoryPath``)
@@ -201,64 +183,66 @@ or uncompressed (in which case it is a directory under ``segmentDirectoryPath``)
 Realtime segment generation
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-To consume in realtime, we simply need to create a table that uses the same schema and points to the Kafka topic to
-consume from, using a table definition such as this one:
-
-.. code-block:: none
-
-  {
-    "tableName":"flights",
-    "segmentsConfig" : {
-        "retentionTimeUnit":"DAYS",
-        "retentionTimeValue":"7",
-        "segmentPushFrequency":"daily",
-        "segmentPushType":"APPEND",
-        "replication" : "1",
-        "schemaName" : "flights",
-        "timeColumnName" : "daysSinceEpoch",
-        "timeType" : "DAYS",
-        "segmentAssignmentStrategy" : "BalanceNumSegmentAssignmentStrategy"
-    },
-    "tableIndexConfig" : {
-        "invertedIndexColumns" : ["Carrier"],
-        "loadMode"  : "HEAP",
-        "lazyLoad"  : "false",
-                "streamConfigs": {
-                        "streamType": "kafka",
-                        "stream.kafka.consumer.type": "highLevel",
-                        "stream.kafka.topic.name": "flights-realtime",
-                        "stream.kafka.decoder.class.name": "org.apache.pinot.core.realtime.impl.kafka.KafkaJSONMessageDecoder",
-                        "stream.kafka.zk.broker.url": "localhost:2181",
-                        "stream.kafka.hlc.zk.connect.string": "localhost:2181"
-                }
-    },
-    "tableType":"REALTIME",
-        "tenants" : {
-                "broker":"DefaultTenant_BROKER",
-                "server":"DefaultTenant_SERVER"
-        },
-    "metadata": {
-    }
-  }
+To consume in realtime, we simply need to create a table with the same name as the schema and point to the Kafka topic
+to consume from, using a table definition such as this one:
+
+.. code-block:: json
+
+   {
+     "tableName": "flights",
+     "tableType": "REALTIME",
+     "segmentsConfig": {
+       "retentionTimeUnit": "DAYS",
+       "retentionTimeValue": "7",
+       "segmentPushFrequency": "daily",
+       "segmentPushType": "APPEND",
+       "replication": "1",
+       "timeColumnName": "daysSinceEpoch",
+       "timeType": "DAYS",
+       "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy"
+     },
+     "tableIndexConfig": {
+       "invertedIndexColumns": [
+         "flightNumber",
+         "tags",
+         "daysSinceEpoch"
+       ],
+       "loadMode": "MMAP",
+       "streamConfigs": {
+         "streamType": "kafka",
+         "stream.kafka.consumer.type": "highLevel",
+         "stream.kafka.topic.name": "flights-realtime",
+         "stream.kafka.decoder.class.name": "org.apache.pinot.core.realtime.impl.kafka.KafkaJSONMessageDecoder",
+         "stream.kafka.zk.broker.url": "localhost:2181",
+         "stream.kafka.hlc.zk.connect.string": "localhost:2181"
+       }
+     },
+     "tenants": {
+       "broker": "brokerTenant",
+       "server": "serverTenant"
+     },
+     "metadata": {
+     }
+   }
 
 First, we'll start a local instance of Kafka and start streaming data into it:
 
-.. code-block:: none
+.. code-block:: bash
 
-  bin/pinot-admin.sh StartKafka &
-  bin/pinot-admin.sh StreamAvroIntoKafka -avroFile flights-2014.avro -kafkaTopic flights-realtime &
+   bin/pinot-admin.sh StartKafka &
+   bin/pinot-admin.sh StreamAvroIntoKafka -avroFile flights-2014.avro -kafkaTopic flights-realtime &
 
 This will stream one event per second from the Avro file to the Kafka topic. Then, we'll create a realtime table, which
 will start consuming from the Kafka topic.
 
-.. code-block:: none
+.. code-block:: bash
 
-  bin/pinot-admin.sh AddTable -filePath flights-definition-realtime.json
+   bin/pinot-admin.sh AddTable -filePath flights-definition-realtime.json
 
-We can then query the table and see the events stream in:
+We can then query the table with the following query to see the events stream in:
 
-.. code-block:: none
+.. code-block:: sql
 
-  {"traceInfo":{},"numDocsScanned":17,"aggregationResults":[{"function":"count_star","value":"17"}],"timeUsedMs":27,"segmentStatistics":[],"exceptions":[],"totalDocs":17}
+   SELECT COUNT(*) FROM flights
 
 Repeating the query multiple times should show the events slowly being streamed into the table.
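
Illustration (not part of the commit): the CSV record reader fields documented in this change, "header", "delimiter" and "multiValueDelimiter", can be easier to follow with a small sketch. The Python below is only an assumption about how such a config might be applied to a CSV file; it is not Pinot's record reader, and the function name parse_rows, the column names and the sample data are hypothetical. It relies on the documented defaults of "," for delimiter and ";" for multiValueDelimiter.

```python
import csv
import io
import json


def parse_rows(data, config, multi_value_columns):
    """Hypothetical helper: apply a CSV reader config to raw CSV text.

    This is a sketch of the documented semantics, not Pinot's implementation.
    """
    # Defaults taken from the doc text: "," and ";".
    delimiter = config.get("delimiter", ",")
    mv_delimiter = config.get("multiValueDelimiter", ";")
    header = config.get("header")

    reader = csv.reader(io.StringIO(data), delimiter=delimiter)
    if header is not None:
        # The header uses the same delimiter as the data, and the file
        # itself is then expected to contain no header row.
        columns = header.split(delimiter)
    else:
        # Otherwise the first row of the file is the header.
        columns = next(reader)

    rows = []
    for values in reader:
        row = {}
        for name, value in zip(columns, values):
            if name in multi_value_columns:
                # Multi-valued columns are split on the second delimiter.
                row[name] = value.split(mv_delimiter)
            else:
                row[name] = value
        rows.append(row)
    return rows


# Assumed sample config and data, loosely modelled on the sample schema.
config = json.loads('{"header": "flightNumber,tags,price", "delimiter": ","}')
rows = parse_rows("101,red;blue,9.5\n102,green,12.0\n", config, {"tags"})
print(rows)
```

Note that the header string in this sketch uses the same delimiter as the data rows, which is the constraint the doc states for the "header" field.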

