[GitHub] mcvsubbu commented on a change in pull request #3563: Re-org documentation

GitBox Wed, 28 Nov 2018 15:13:53 -0800

mcvsubbu commented on a change in pull request #3563: Re-org documentation
URL: https://github.com/apache/incubator-pinot/pull/3563#discussion_r237298875


 ##########
 File path: docs/pinot_hadoop.rst
 ##########
 @@ -45,28 +49,109 @@ Create a job properties configuration file, such as one below:
   push.to.port=8888
 
 
-Index file creation
--------------------
+Executing the job
+^^^^^^^^^^^^^^^^^
 
 The Pinot Hadoop module contains a job that you can incorporate into your
-workflow to generate Pinot indices. Note that this will only create data for you.
-In order to have this data on your cluster, you want to also run the SegmentTarPush
-job, details below. To run SegmentCreation through the command line:
+workflow to generate Pinot segments.
 
-::
+.. code-block:: none
 
   mvn clean install -DskipTests -Pbuild-shaded-jar
   hadoop jar pinot-hadoop-0.016-shaded.jar SegmentCreation job.properties
 
+You can then use the SegmentTarPush job to push segments via the controller REST API.
 
-Index file push
----------------
-
-This job takes generated Pinot index files from an input directory and pushes
-them to a Pinot controller node.
-
-::
+.. code-block:: none
 
-  mvn clean install -DskipTests -Pbuild-shaded-jar
   hadoop jar pinot-hadoop-0.016-shaded.jar SegmentTarPush job.properties
 
+
+Creating Pinot segments outside of Hadoop
+-----------------------------------------
+
+This section describes the steps required to create Pinot segments from standard formats such as CSV and JSON.
+
+#.  Follow the steps described in the section on :doc:`Demonstration <trying_pinot>` to build Pinot. Locate ``pinot-admin.sh`` at ``pinot-tools/target/pinot-tools-pkg/bin/pinot-admin.sh``.
+#.  Create a top-level directory containing all the CSV/JSON files that need to be converted into segments.
+#.  The file name extensions are expected to match the format name (*i.e.*, ``.csv`` or ``.json``) and are case insensitive.
+    Note that the converter expects the ``.csv`` extension even if the data is delimited using tabs or spaces instead.
+#.  Prepare a schema file describing the schema of the input data. The schema needs to be in JSON format. See the example later in this section.
+#.  Specifically for the CSV format, an optional CSV config file can be provided (also in JSON format). It configures parameters such as the delimiter and header for the CSV file.
+    A detailed description of this follows below.
+
+Run the pinot-admin command to generate the segments. The command can be invoked as follows. Options within "[ ]" are optional. For -format, the default value is AVRO.
+
+.. code-block:: none
+
+    bin/pinot-admin.sh CreateSegment -dataDir <input_data_dir> [-format [CSV/JSON/AVRO]] [-readerConfigFile <csv_config_file>] [-generatorConfigFile <generator_config_file>] -segmentName <segment_name> -schemaFile <input_schema_file> -tableName <table_name> -outDir <output_data_dir> [-overwrite]
+
+
+To configure various parameters for CSV, a config file in JSON format can be provided. This file is optional, as is each of its parameters. The parameters and their default values are described below:
+
+#.  fileFormat: Specify one of the following. Default is EXCEL.
+
+    #.  EXCEL
+    #.  MYSQL
+    #.  RFC4180
+    #.  TDF
+
+#.  header: If the input CSV file does not contain a header, it can be specified using this field. Note that if this is specified, the input file is expected not to contain a header row; otherwise a parse error results. The columns in the header must be delimited by the same delimiter character as the rest of the CSV file.
+#.  delimiter: Use this to specify a delimiter character. The default value is ",".
+#.  dateFormat: If there are columns that are in date format and need to be converted into epoch time (in milliseconds), use this to specify the format. Default is "mm-dd-yyyy".
+#.  dateColumns: If there are multiple date columns, use this to list those columns.
+
+Below is a sample config file.
+
+.. code-block:: none
+
+  {
+    "fileFormat" : "EXCEL",
+    "header" : "col1,col2,col3,col4",
+    "delimiter" : "\t",
+    "dateFormat" : "mm-dd-yy",
+    "dateColumns" : ["col1", "col2"]
+  }
+
+Sample Schema:
+
+.. code-block:: none
+
+  {
+    "dimensionFieldSpecs" : [
+      {                           
+        "dataType" : "STRING",
+        "delimiter" : null,
+        "singleValueField" : true,
+        "name" : "name"
+      },
+      {
+        "dataType" : "INT",
+        "delimiter" : null,
+        "singleValueField" : true,
+        "name" : "age"
+      }
+    ],
+    "timeFieldSpec" : {
+      "incomingGranularitySpec" : {
+        "timeType" : "DAYS",
+        "dataType" : "LONG",
+        "name" : "incomingName1"
+      },
+      "outgoingGranularitySpec" : {
+        "timeType" : "DAYS",
+        "dataType" : "LONG",
+        "name" : "outgoingName1"
+      }
+    },
+    "metricFieldSpecs" : [
+      {
+        "dataType" : "FLOAT",
+        "delimiter" : null,
+        "singleValueField" : true,
+        "name" : "percent"
+      }
+    ],
+    "schemaName" : "mySchema"
+  }
 
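Note on the JSON samples in this hunk: files like the CSV reader config and the schema are easy to break with a missing or trailing comma. A minimal Python sketch for checking such a file before handing it to pinot-admin follows. It assumes the files must be strict JSON; the helper name check_json is made up for illustration and is not part of Pinot.

```python
import json


def check_json(text):
    """Return None if text is strict JSON, else a short error description."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


# Reader config modeled on the sample in the docs (raw string keeps "\t" intact).
reader_config = r"""
{
  "fileFormat" : "EXCEL",
  "header" : "col1,col2,col3,col4",
  "delimiter" : "\t",
  "dateFormat" : "mm-dd-yy",
  "dateColumns" : ["col1", "col2"]
}
"""

if __name__ == "__main__":
    problem = check_json(reader_config)
    print("OK" if problem is None else problem)
```

The same helper can be pointed at the schema file by reading it with open(path).read() first.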
 Review comment:
   done
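
For anyone scripting the CreateSegment step, the invocation documented above can be assembled programmatically. This is only a sketch: the flag names are the ones shown in the docs, while the function name, argument names, and example paths are hypothetical.

```python
import shlex


def create_segment_command(data_dir, segment_name, schema_file, table_name,
                           out_dir, data_format=None, reader_config_file=None,
                           generator_config_file=None, overwrite=False):
    """Build the argument list for pinot-admin.sh CreateSegment."""
    cmd = ["bin/pinot-admin.sh", "CreateSegment", "-dataDir", data_dir]
    # Optional flags are added only when a value is supplied.
    if data_format is not None:
        cmd += ["-format", data_format]
    if reader_config_file is not None:
        cmd += ["-readerConfigFile", reader_config_file]
    if generator_config_file is not None:
        cmd += ["-generatorConfigFile", generator_config_file]
    cmd += ["-segmentName", segment_name,
            "-schemaFile", schema_file,
            "-tableName", table_name,
            "-outDir", out_dir]
    if overwrite:
        cmd.append("-overwrite")
    return cmd


if __name__ == "__main__":
    # Example values are placeholders, not real paths.
    cmd = create_segment_command("input/", "mySegment", "schema.json",
                                 "myTable", "output/", data_format="CSV",
                                 overwrite=True)
    print(shlex.join(cmd))
```

The list form can be passed directly to subprocess.run, which avoids shell quoting problems with paths that contain spaces.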
