This is an automated email from the ASF dual-hosted git repository.
jackie pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-pinot.git
The following commit(s) were added to refs/heads/master by this push:
new 435e8be Update doc for "Creating Pinot segments" to reflect the current code base (#4086)
435e8be is described below
commit 435e8be323427057b6ed2b40887707e226aa642b
Author: Xiaotian (Jackie) Jiang <[email protected]>
AuthorDate: Mon Apr 8 16:04:00 2019 -0700
Update doc for "Creating Pinot segments" to reflect the current code base (#4086)
1. Remove delimiter from schema
2. Update fields in CSV record reader config
3. Remove redundant fields in table config
4. Reformat to follow rst convention and fix some wrong format (e.g. index number)
---
docs/pinot_hadoop.rst | 262 ++++++++++++++++++++++++--------------------------
1 file changed, 123 insertions(+), 139 deletions(-)
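
As an illustration of item 2 above, the sketch below shows one way the CSV reader config fields touched by this change (``header``, ``delimiter``, ``multiValueDelimiter``) could be interpreted. It is plain Python, not Pinot code: the helper name ``parse_row``, its parameters, and the sample row are assumptions made up for the example, and Pinot's actual record reader may behave differently.

```python
import csv
import io


def parse_row(line, header, delimiter=",", multi_value_delimiter=";",
              multi_value_columns=()):
    """Split one CSV line into a dict keyed by the header columns.

    Hypothetical helper: it mimics the documented meaning of the config
    fields, not Pinot's actual record reader.
    """
    # The header is delimited by the same character as the data rows.
    columns = next(csv.reader(io.StringIO(header), delimiter=delimiter))
    values = next(csv.reader(io.StringIO(line), delimiter=delimiter))
    record = dict(zip(columns, values))
    # Multi-valued columns are split again on their own delimiter.
    for name in multi_value_columns:
        record[name] = record[name].split(multi_value_delimiter)
    return record


# Tab-delimited row with one multi-valued column ("tags").
row = parse_row(
    "1001\tcheap;direct\t99.5",
    "flightNumber\ttags\tprice",
    delimiter="\t",
    multi_value_columns=("tags",),
)
print(row)
```

The column names are borrowed from the sample schema in the diff below; all values stay strings here, since type conversion would depend on the schema.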
diff --git a/docs/pinot_hadoop.rst b/docs/pinot_hadoop.rst
index fc3573b..81f0728 100644
--- a/docs/pinot_hadoop.rst
+++ b/docs/pinot_hadoop.rst
@@ -31,7 +31,7 @@ Creating segments using hadoop
.. figure:: img/Pinot-Offline-only-flow.png
- Offline Pinot workflow
+ Offline Pinot workflow
To create Pinot segments on Hadoop, a workflow can be created to complete the
following steps:
@@ -48,28 +48,27 @@ Create a job properties configuration file, such as one below:
.. code-block:: none
- # === Index segment creation job config ===
+ # === Index segment creation job config ===
- # path.to.input: Input directory containing Avro files
- path.to.input=/user/pinot/input/data
+ # path.to.input: Input directory containing Avro files
+ path.to.input=/user/pinot/input/data
- # path.to.output: Output directory containing Pinot segments
- path.to.output=/user/pinot/output
+ # path.to.output: Output directory containing Pinot segments
+ path.to.output=/user/pinot/output
- # path.to.schema: Schema file for the table, stored locally
- path.to.schema=flights-schema.json
+ # path.to.schema: Schema file for the table, stored locally
+ path.to.schema=flights-schema.json
- # segment.table.name: Name of the table for which to generate segments
- segment.table.name=flights
+ # segment.table.name: Name of the table for which to generate segments
+ segment.table.name=flights
- # === Segment tar push job config ===
+ # === Segment tar push job config ===
- # push.to.hosts: Comma separated list of controllers host names to which to push
- push.to.hosts=controller_host_0,controller_host_1
-
- # push.to.port: The port on which the controller runs
- push.to.port=8888
+ # push.to.hosts: Comma separated list of controllers host names to which to push
+ push.to.hosts=controller_host_0,controller_host_1
+ # push.to.port: The port on which the controller runs
+ push.to.port=8888
Executing the job
^^^^^^^^^^^^^^^^^
@@ -77,122 +76,105 @@ Executing the job
The Pinot Hadoop module contains a job that you can incorporate into your
workflow to generate Pinot segments.
-.. code-block:: none
+.. code-block:: bash
- mvn clean install -DskipTests -Pbuild-shaded-jar
- hadoop jar pinot-hadoop-<version>-SNAPSHOT-shaded.jar SegmentCreation job.properties
+ mvn clean install -DskipTests -Pbuild-shaded-jar
+ hadoop jar pinot-hadoop-<version>-SNAPSHOT-shaded.jar SegmentCreation job.properties
You can then use the SegmentTarPush job to push segments via the controller
REST API.
-.. code-block:: none
-
- hadoop jar pinot-hadoop-<version>-SNAPSHOT-shaded.jar SegmentTarPush job.properties
+.. code-block:: bash
+ hadoop jar pinot-hadoop-<version>-SNAPSHOT-shaded.jar SegmentTarPush job.properties
Creating Pinot segments outside of Hadoop
-----------------------------------------
-Here is how you can create Pinot segments from standard formats like CSV/JSON.
+Here is how you can create Pinot segments from standard formats like CSV/JSON/AVRO.
#. Follow the steps described in the section on :ref:`compiling-code-section`
to build pinot. Locate ``pinot-admin.sh`` in
``pinot-tools/target/pinot-tools-pkg/bin/pinot-admin.sh``.
-#. Create a top level directory containing all the CSV/JSON files that need to be converted into segments.
-#. The file name extensions are expected to be the same as the format name (*i.e* ``.csv``, or ``.json``), and are case insensitive.
- Note that the converter expects the ``.csv`` extension even if the data is delimited using tabs or spaces instead.
-#. Prepare a schema file describing the schema of the input data. The schema needs to be in JSON format. See example later in this section.
-#. Specifically for CSV format, an optional csv config file can be provided (also in JSON format). This is used to configure parameters like the delimiter/header for the CSV file etc.
- A detailed description of this follows below.
+#. Create a top level directory containing all the CSV/JSON/AVRO files that need to be converted into segments.
+#. The file name extensions are expected to be the same as the format name (*i.e* ``.csv``, ``.json`` or ``.avro``), and are case insensitive. Note that the converter expects the ``.csv`` extension even if the data is delimited using tabs or spaces instead.
+#. Prepare a schema file describing the schema of the input data. The schema needs to be in JSON format. See example later in this section.
+#. Specifically for CSV format, an optional csv config file can be provided (also in JSON format). This is used to configure parameters like the delimiter/header for the CSV file etc. A detailed description of this follows below.
Run the pinot-admin command to generate the segments. The command can be
invoked as follows. Options within "[ ]" are optional. For -format, the default
value is AVRO.
-.. code-block:: none
-
- bin/pinot-admin.sh CreateSegment -dataDir <input_data_dir> [-format [CSV/JSON/AVRO]] [-readerConfigFile <csv_config_file>] [-generatorConfigFile <generator_config_file>] -segmentName <segment_name> -schemaFile <input_schema_file> -tableName <table_name> -outDir <output_data_dir> [-overwrite]
+.. code-block:: bash
+ bin/pinot-admin.sh CreateSegment -dataDir <input_data_dir> [-format [CSV/JSON/AVRO]] [-readerConfigFile <csv_config_file>] [-generatorConfigFile <generator_config_file>] -segmentName <segment_name> -schemaFile <input_schema_file> -tableName <table_name> -outDir <output_data_dir> [-overwrite]
To configure various parameters for CSV a config file in JSON format can be
provided. This file is optional, as are each of its parameters. When not
provided, default values used for these parameters are described below:
-#. fileFormat: Specify one of the following. Default is EXCEL.
+#. fileFormat: Specify one of the following. Default is EXCEL.
- ##. EXCEL
- ##. MYSQL
- ##. RFC4180
- ##. TDF
+ #. EXCEL
+ #. MYSQL
+ #. RFC4180
+ #. TDF
-#. header: If the input CSV file does not contain a header, it can be specified using this field. Note, if this is specified, then the input file is expected to not contain the header row, or else it will result in parse error. The columns in the header must be delimited by the same delimiter character as the rest of the CSV file.
-#. delimiter: Use this to specify a delimiter character. The default value is ",".
-#. dateFormat: If there are columns that are in date format and need to be converted into Epoch (in milliseconds), use this to specify the format. Default is "mm-dd-yyyy".
-#. dateColumns: If there are multiple date columns, use this to list those columns.
+#. header: If the input CSV file does not contain a header, it can be specified using this field. Note, if this is specified, then the input file is expected to not contain the header row, or else it will result in parse error. The columns in the header must be delimited by the same delimiter character as the rest of the CSV file.
+#. delimiter: Use this to specify a delimiter character. The default value is ",".
+#. multiValueDelimiter: Use this to specify a delimiter character for each value in multi-valued columns. The default value is ";".
Below is a sample config file.
-.. code-block:: none
+.. code-block:: json
- {
- "fileFormat" : "EXCEL",
- "header" : "col1,col2,col3,col4",
- "delimiter" : "\t",
- "dateFormat" : "mm-dd-yy"
- "dateColumns" : ["col1", "col2"]
- }
+ {
+ "fileFormat": "EXCEL",
+ "header": "col1,col2,col3,col4",
+ "delimiter": "\t",
+ "multiValueDelimiter": ","
+ }
Sample Schema:
-.. code-block:: none
-
- {
- "dimensionFieldSpecs" : [
- {
- "dataType" : "STRING",
- "delimiter" : null,
- "singleValueField" : true,
- "name" : "name"
- },
- {
- "dataType" : "INT",
- "delimiter" : null,
- "singleValueField" : true,
- "name" : "age"
- }
- ],
- "timeFieldSpec" : {
- "incomingGranularitySpec" : {
- "timeType" : "DAYS",
- "dataType" : "LONG",
- "name" : "incomingName1"
- },
- "outgoingGranularitySpec" : {
- "timeType" : "DAYS",
- "dataType" : "LONG",
- "name" : "outgoingName1"
- }
- },
- "metricFieldSpecs" : [
- {
- "dataType" : "FLOAT",
- "delimiter" : null,
- "singleValueField" : true,
- "name" : "percent"
- }
- ]
- },
- "schemaName" : "mySchema",
- }
+.. code-block:: json
+
+ {
+ "schemaName": "flights",
+ "dimensionFieldSpecs": [
+ {
+ "name": "flightNumber",
+ "dataType": "LONG"
+ },
+ {
+ "name": "tags",
+ "dataType": "STRING",
+ "singleValueField": false
+ }
+ ],
+ "metricFieldSpecs": [
+ {
+ "name": "price",
+ "dataType": "DOUBLE"
+ }
+ ],
+ "timeFieldSpec": {
+ "incomingGranularitySpec": {
+ "name": "daysSinceEpoch",
+ "dataType": "INT",
+ "timeType": "DAYS"
+ }
+ }
+ }
Pushing offline segments to Pinot
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You can use curl to push a segment to pinot:
-.. code-block:: none
+.. code-block:: bash
curl -X POST -F segment=@<segment-tar-file-path> http://controllerHost:controllerPort/segments
Alternatively you can use the pinot-admin.sh utility to upload one or more
segments:
-.. code-block:: none
+.. code-block:: bash
- pinot-tools/target/pinot-tools-pkg/bin//pinot-admin.sh UploadSegment -controllerHost <hostname> -controllerPort <port> -segmentDir <segmentDirectoryPath>
+ pinot-tools/target/pinot-tools-pkg/bin//pinot-admin.sh UploadSegment -controllerHost <hostname> -controllerPort <port> -segmentDir <segmentDirectoryPath>
The command uploads all the segments found in ``segmentDirectoryPath``.
The segments could be either tar-compressed (in which case it is a file under ``segmentDirectoryPath``)
@@ -201,64 +183,66 @@ or uncompressed (in which case it is a directory under ``segmentDirectoryPath``)
Realtime segment generation
^^^^^^^^^^^^^^^^^^^^^^^^^^^
-To consume in realtime, we simply need to create a table that uses the same schema and points to the Kafka topic to
-consume from, using a table definition such as this one:
-
-.. code-block:: none
-
- {
- "tableName":"flights",
- "segmentsConfig" : {
- "retentionTimeUnit":"DAYS",
- "retentionTimeValue":"7",
- "segmentPushFrequency":"daily",
- "segmentPushType":"APPEND",
- "replication" : "1",
- "schemaName" : "flights",
- "timeColumnName" : "daysSinceEpoch",
- "timeType" : "DAYS",
- "segmentAssignmentStrategy" : "BalanceNumSegmentAssignmentStrategy"
- },
- "tableIndexConfig" : {
- "invertedIndexColumns" : ["Carrier"],
- "loadMode" : "HEAP",
- "lazyLoad" : "false",
- "streamConfigs": {
- "streamType": "kafka",
- "stream.kafka.consumer.type": "highLevel",
- "stream.kafka.topic.name": "flights-realtime",
- "stream.kafka.decoder.class.name":
"org.apache.pinot.core.realtime.impl.kafka.KafkaJSONMessageDecoder",
- "stream.kafka.zk.broker.url": "localhost:2181",
- "stream.kafka.hlc.zk.connect.string": "localhost:2181"
- }
- },
- "tableType":"REALTIME",
- "tenants" : {
- "broker":"DefaultTenant_BROKER",
- "server":"DefaultTenant_SERVER"
- },
- "metadata": {
- }
- }
+To consume in realtime, we simply need to create a table with the same name as the schema and point to the Kafka topic
+to consume from, using a table definition such as this one:
+
+.. code-block:: json
+
+ {
+ "tableName": "flights",
+ "tableType": "REALTIME",
+ "segmentsConfig": {
+ "retentionTimeUnit": "DAYS",
+ "retentionTimeValue": "7",
+ "segmentPushFrequency": "daily",
+ "segmentPushType": "APPEND",
+ "replication": "1",
+ "timeColumnName": "daysSinceEpoch",
+ "timeType": "DAYS",
+ "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy"
+ },
+ "tableIndexConfig": {
+ "invertedIndexColumns": [
+ "flightNumber",
+ "tags",
+ "daysSinceEpoch"
+ ],
+ "loadMode": "MMAP",
+ "streamConfigs": {
+ "streamType": "kafka",
+ "stream.kafka.consumer.type": "highLevel",
+ "stream.kafka.topic.name": "flights-realtime",
+ "stream.kafka.decoder.class.name":
"org.apache.pinot.core.realtime.impl.kafka.KafkaJSONMessageDecoder",
+ "stream.kafka.zk.broker.url": "localhost:2181",
+ "stream.kafka.hlc.zk.connect.string": "localhost:2181"
+ }
+ },
+ "tenants": {
+ "broker": "brokerTenant",
+ "server": "serverTenant"
+ },
+ "metadata": {
+ }
+ }
First, we'll start a local instance of Kafka and start streaming data into it:
-.. code-block:: none
+.. code-block:: bash
- bin/pinot-admin.sh StartKafka &
- bin/pinot-admin.sh StreamAvroIntoKafka -avroFile flights-2014.avro -kafkaTopic flights-realtime &
+ bin/pinot-admin.sh StartKafka &
+ bin/pinot-admin.sh StreamAvroIntoKafka -avroFile flights-2014.avro -kafkaTopic flights-realtime &
This will stream one event per second from the Avro file to the Kafka topic. Then, we'll create a realtime table, which
will start consuming from the Kafka topic.
-.. code-block:: none
+.. code-block:: bash
- bin/pinot-admin.sh AddTable -filePath flights-definition-realtime.json
+ bin/pinot-admin.sh AddTable -filePath flights-definition-realtime.json
-We can then query the table and see the events stream in:
+We can then query the table with the following query to see the events stream in:
-.. code-block:: none
+.. code-block:: sql
- {"traceInfo":{},"numDocsScanned":17,"aggregationResults":[{"function":"count_star","value":"17"}],"timeUsedMs":27,"segmentStatistics":[],"exceptions":[],"totalDocs":17}
+ SELECT COUNT(*) FROM flights
Repeating the query multiple times should show the events slowly being streamed into the table.
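
The sample response removed in the last hunk shows the shape of a count query result (an ``aggregationResults`` list with a ``count_star`` entry). A minimal Python sketch of watching that count grow across repeated queries follows. It only parses that response shape; the ``run_query`` callable that would actually send ``SELECT COUNT(*) FROM flights`` to a broker is a hypothetical placeholder, since the commit does not describe the query endpoint.

```python
import json
import time


def extract_count(response_text):
    """Return the COUNT(*) value from a response shaped like the sample above."""
    response = json.loads(response_text)
    for result in response["aggregationResults"]:
        if result["function"] == "count_star":
            # The sample response encodes the value as a string ("17").
            return int(result["value"])
    raise ValueError("no count_star aggregation in response")


def watch_count(run_query, repeats=3, delay_seconds=1.0):
    """Call run_query() repeatedly and collect the counts it reports.

    run_query is a hypothetical zero-argument callable that returns the raw
    JSON response text for the count query.
    """
    counts = []
    for attempt in range(repeats):
        counts.append(extract_count(run_query()))
        if attempt < repeats - 1:
            time.sleep(delay_seconds)
    return counts


# Response text copied from the sample in the diff.
sample = (
    '{"traceInfo":{},"numDocsScanned":17,'
    '"aggregationResults":[{"function":"count_star","value":"17"}],'
    '"timeUsedMs":27,"segmentStatistics":[],"exceptions":[],"totalDocs":17}'
)
print(extract_count(sample))
```

With a real ``run_query`` in place, a list of counts that grows over successive calls would indicate that events are still streaming in.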