This is an automated email from the ASF dual-hosted git repository. mcvsubbu pushed a commit to branch doc-reorg in repository https://gitbox.apache.org/repos/asf/incubator-pinot.git
commit a28c8ca4086269551be3c1447ee3e2c8f11b4a3e Author: Subbu Subramaniam <[email protected]> AuthorDate: Tue Nov 27 17:10:20 2018 -0800 Re-org documentation Combined the sections on creating pinot segments into one. Removed extra pictures from pluggable stream section, referencing the realtime design instead. Created a new top level section on customizing pinot Other minor edits and warning fixes --- docs/client_api.rst | 56 ++++++++-------- docs/conf.py | 2 +- docs/creating_pinot_segments.rst | 98 --------------------------- docs/expressions_udf.rst | 9 +-- docs/index.rst | 36 ++++++---- docs/intro.rst | 4 +- docs/llc.rst | 26 ++++++-- docs/management_api.rst | 6 +- docs/multitenancy.rst | 24 +++---- docs/partition_aware_routing.rst | 6 +- docs/pinot_hadoop.rst | 139 ++++++++++++++++++++++++++++++-------- docs/pluggable_streams.rst | 64 +++++++++--------- docs/pql_examples.rst | 141 +++++++++++++++++++++++---------------- docs/reference.rst | 7 +- docs/schema_timespec.rst | 6 +- docs/segment_fetcher.rst | 16 ++--- docs/trying_pinot.rst | 6 +- 17 files changed, 341 insertions(+), 305 deletions(-) diff --git a/docs/client_api.rst b/docs/client_api.rst index f802d02..6a99d9a 100644 --- a/docs/client_api.rst +++ b/docs/client_api.rst @@ -1,12 +1,14 @@ -REST API -======== +Executing queries via REST API on the Broker +============================================ -The Pinot REST API can be accessed by ``POST`` ing a JSON object containing the parameter ``pql`` to the ``/query`` endpoint on a broker. Depending on the type of query, the results can take different shapes. For example, using curl: +The Pinot REST API can be accessed by invoking ``POST`` operation witha a JSON body containing the parameter ``pql`` +to the ``/query`` URI endpoint on a broker. Depending on the type of query, the results can take different shapes. +The examples below use curl. Aggregation ----------- -:: +.. code-block:: none curl -X POST -d '{"pql":"select count(*) from flights"}' http://localhost:8099/query @@ -30,7 +32,7 @@ Aggregation Aggregation with grouping ------------------------- -:: +.. code-block:: none curl -X POST -d '{"pql":"select count(*) from flights group by Carrier"}' http://localhost:8099/query @@ -68,7 +70,7 @@ Aggregation with grouping Selection --------- -:: +.. code-block:: none curl -X POST -d '{"pql":"select * from flights limit 3"}' http://localhost:8099/query @@ -145,12 +147,12 @@ Connections to Pinot are created using the ConnectionFactory class' utility meth .. code-block:: java - Connection connection = ConnectionFactory.fromZookeeper - (some-zookeeper-server:2191/zookeeperPath"); + Connection connection = ConnectionFactory.fromZookeeper + ("some-zookeeper-server:2191/zookeeperPath"); - Connection connection = ConnectionFactory.fromProperties("demo.properties"); + Connection connection = ConnectionFactory.fromProperties("demo.properties"); - Connection connection = ConnectionFactory.fromHostList + Connection connection = ConnectionFactory.fromHostList ("some-server:1234", "some-other-server:1234", ...); @@ -158,8 +160,8 @@ Queries can be sent directly to the Pinot cluster using the Connection.execute(j .. 
code-block:: java - ResultSetGroup resultSetGroup = connection.execute("select * from foo..."); - Future<ResultSetGroup> futureResultSetGroup = connection.executeAsync + ResultSetGroup resultSetGroup = connection.execute("select * from foo..."); + Future<ResultSetGroup> futureResultSetGroup = connection.executeAsync ("select * from foo..."); @@ -167,38 +169,38 @@ Queries can also use a PreparedStatement to escape query parameters: .. code-block:: java - PreparedStatement statement = connection.prepareStatement + PreparedStatement statement = connection.prepareStatement ("select * from foo where a = ?"); - statement.setString(1, "bar"); + statement.setString(1, "bar"); - ResultSetGroup resultSetGroup = statement.execute(); - Future<ResultSetGroup> futureResultSetGroup = statement.executeAsync(); + ResultSetGroup resultSetGroup = statement.execute(); + Future<ResultSetGroup> futureResultSetGroup = statement.executeAsync(); In the case of a selection query, results can be obtained with the various get methods in the first ResultSet, obtained through the getResultSet(int) method: .. code-block:: java - ResultSet resultSet = connection.execute + ResultSet resultSet = connection.execute ("select foo, bar from baz where quux = 'quuux'").getResultSet(0); - for(int i = 0; i < resultSet.getRowCount(); ++i) { - System.out.println("foo: " + resultSet.getString(i, 0); - System.out.println("bar: " + resultSet.getInt(i, 1); - } + for (int i = 0; i < resultSet.getRowCount(); ++i) { + System.out.println("foo: " + resultSet.getString(i, 0)); + System.out.println("bar: " + resultSet.getInt(i, 1)); + } - resultSet.close(); + resultSet.close(); -In the case where there is an aggregation, each aggregation function is within its own ResultSet: +In the case of aggregation, each aggregation function is within its own ResultSet: .. code-block:: java - ResultSetGroup resultSetGroup = connection.execute("select count(*) from foo"); + ResultSetGroup resultSetGroup = connection.execute("select count(*) from foo"); - ResultSet resultSet = resultSetGroup.getResultSet(0); - System.out.println("Number of records: " + resultSet.getInt(0)); - resultSet.close(); + ResultSet resultSet = resultSetGroup.getResultSet(0); + System.out.println("Number of records: " + resultSet.getInt(0)); + resultSet.close(); There can be more than one ResultSet, each of which can contain multiple results grouped by a group key. diff --git a/docs/conf.py b/docs/conf.py index a95d37e..1cf24c1 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -48,6 +48,7 @@ extensions = [ 'sphinx.ext.todo', 'sphinx.ext.mathjax', 'sphinx.ext.githubpages', + 'sphinx.ext.intersphinx', ] # Add any paths that contain templates here, relative to this directory. @@ -293,7 +294,6 @@ texinfo_documents = [ 'Miscellaneous'), ] -extensions = ['sphinx.ext.intersphinx'] # Documents to append as an appendix to all manuals. #texinfo_appendices = [] diff --git a/docs/creating_pinot_segments.rst b/docs/creating_pinot_segments.rst deleted file mode 100644 index 5bae71e..0000000 --- a/docs/creating_pinot_segments.rst +++ /dev/null @@ -1,98 +0,0 @@ -Creating Pinot segments outside of Hadoop -========================================= - -This document describes steps required for creating Pinot2_0 segments from standard formats like CSV/JSON. - -Compiling the code ------------------- -Follow the steps described in the section on :doc: `Demonstration <trying_pinot>` to build pinot. Locate ``pinot-admin.sh`` in ``pinot-tools/trget/pinot-tools=pkg/bin/pinot-admin.sh``. 
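For example, a typical build-and-run sequence might look like the sketch below. The Maven invocation and the ``pinot-tools/target/pinot-tools-pkg`` packaging path are assumptions based on a standard build of the repository and may differ between releases; the ``CreateSegment`` options are the ones documented in this section.

.. code-block:: none

   # Build Pinot from the repository root (skip tests for speed)
   mvn clean install -DskipTests

   # The admin script is packaged under the pinot-tools module
   cd pinot-tools/target/pinot-tools-pkg

   # Generate segments from a directory of CSV files
   bin/pinot-admin.sh CreateSegment -dataDir /path/to/csv-input -format CSV \
     -readerConfigFile csv-config.json -schemaFile my-schema.json \
     -segmentName mySegment -tableName myTable -outDir /path/to/segment-output -overwrite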
- - -Data Preparation ----------------- - -#. Create a top level directory containing all the CSV/JSON files that need to be converted. -#. The file name extensions are expected to be the same as the format name (_i.e_ ``.csv``, or ``.json``), and are case insensitive. - Note that the converter expects the .csv extension even if the data is delimited using tabs or spaces instead. -#. Prepare a schema file describing the schema of the input data. This file needs to be in JSON format. An example is provided at the end of the article. -#. Specifically for CSV format, an optional csv config file can be provided (also in JSON format). This is used to configure parameters like the delimiter/header for the CSV file etc. - A detailed description of this follows below. - -Creating a segment ------------------- -Run the pinot-admin command to generate the segments. The command can be invoked as follows. Options within "[ ]" are optional. For -format, the default value is AVRO. - -:: - - bin/pinot-admin.sh CreateSegment -dataDir <input_data_dir> [-format [CSV/JSON/AVRO]] [-readerConfigFile <csv_config_file>] [-generatorConfigFile <generator_config_file>] -segmentName <segment_name> -schemaFile <input_schema_file> -tableName <table_name> -outDir <output_data_dir> [-overwrite] - -CSV Reader Config file ----------------------- -To configure various parameters for CSV a config file in JSON format can be provided. This file is optional, as are each of its parameters. When not provided, default values used for these parameters are described below: - -#. fileFormat: Specify one of the following. Default is EXCEL. - - ##. EXCEL - ##. MYSQL - ##. RFC4180 - ##. TDF - -#. header: If the input CSV file does not contain a header, it can be specified using this field. Note, if this is specified, then the input file is expected to not contain the header row, or else it will result in parse error. The columns in the header must be delimited by the same delimiter character as the rest of the CSV file. -#. delimiter: Use this to specify a delimiter character. The default value is ",". -#. dateFormat: If there are columns that are in date format and need to be converted into Epoch (in milliseconds), use this to specify the format. Default is "mm-dd-yyyy". -#. dateColumns: If there are multiple date columns, use this to list those columns. - -Below is a sample config file. 
- -:: - - { - "fileFormat" : "EXCEL", - "header" : "col1,col2,col3,col4", - "delimiter" : "\t", - "dateFormat" : "mm-dd-yy" - "dateColumns" : ["col1", "col2"] - } - -Sample JSON schema file: - -:: - - { - "dimensionFieldSpecs" : [ - { - "dataType" : "STRING", - "delimiter" : null, - "singleValueField" : true, - "name" : "name" - }, - { - "dataType" : "INT", - "delimiter" : null, - "singleValueField" : true, - "name" : "age" - } - ], - "timeFieldSpec" : { - "incomingGranularitySpec" : { - "timeType" : "DAYS", - "dataType" : "LONG", - "name" : "incomingName1" - }, - "outgoingGranularitySpec" : { - "timeType" : "DAYS", - "dataType" : "LONG", - "name" : "outgoingName1" - } - }, - "metricFieldSpecs" : [ - { - "dataType" : "FLOAT", - "delimiter" : null, - "singleValueField" : true, - "name" : "percent" - } - ] - }, - "schemaName" : "mySchema", - } diff --git a/docs/expressions_udf.rst b/docs/expressions_udf.rst index d373424..270be67 100644 --- a/docs/expressions_udf.rst +++ b/docs/expressions_udf.rst @@ -7,13 +7,13 @@ The query language for Pinot (:doc:`PQL <reference>`) currently only supports *s The high level requirement here is to support *expressions* that represent a function on a set of columns in the queries, as opposed to just columns. -:: +.. code-block:: none select <exp1> from myTable where ... [group by <exp2>] Where exp1 and exp2 can be of the form: -:: +.. code-block:: none func1(func2(col1, col2...func3(...)...), coln...)... @@ -44,7 +44,8 @@ Parser The PQL parser is already capable of parsing expressions in the *selection*, *aggregation* and *group by* sections. Following is a sample query containing expression, and its parse tree shown in the image. -:: +.. code-block:: none + select f1(f2(col1, col2), col3) from myTable where (col4 = 'x') group by f3(col5, f4(col6, col7)) @@ -112,7 +113,7 @@ We see the following limitations in functionality currently: #. Nesting of *aggregation* functions is not supported in the expression tree. This is because the number of documents after *aggregation* is reduced. In the expression below, *sum* of *col2* would yield one value, whereas *xform1* one *col1* would yield the same number of documents as in the input. -:: +.. code-block:: none sum(xform1(col1), sum(col2)) diff --git a/docs/index.rst b/docs/index.rst index 66d6ec2..f1defaa 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -3,12 +3,10 @@ You can adapt this file completely to your liking, but it should at least contain the root `toctree` directive. -Welcome to Pinot's documentation! -================================= +############ +Introduction +############ -######## -Contents -######## .. toctree:: :maxdepth: 1 @@ -24,8 +22,20 @@ Reference .. toctree:: :maxdepth: 1 - in_production + reference + in_production + +################# +Customizing Pinot +################# + +.. toctree:: + :maxdepth: 1 + + + pluggable_streams + segment_fetcher ################ Design Documents @@ -34,19 +44,19 @@ Design Documents .. toctree:: :maxdepth: 1 + llc partition_aware_routing expressions_udf - multitenancy schema_timespec - +################ +Design Proposals +################ +.. 
toctree:: + :maxdepth: 1 -Indices and tables -================== -* :ref:`genindex` -* :ref:`modindex` -* :ref:`search` + multitenancy diff --git a/docs/intro.rst b/docs/intro.rst index b169b4a..9f5cb93 100644 --- a/docs/intro.rst +++ b/docs/intro.rst @@ -1,5 +1,5 @@ -Introduction -============ +About Pinot +=========== Pinot is a realtime distributed OLAP datastore, which is used at LinkedIn to deliver scalable real time analytics with low latency. It can ingest data from offline data sources (such as Hadoop and flat files) as well as streaming events (such as Kafka). Pinot is designed to scale horizontally, diff --git a/docs/llc.rst b/docs/llc.rst index 15d2e0a..4c08e70 100644 --- a/docs/llc.rst +++ b/docs/llc.rst @@ -1,14 +1,28 @@ -LLC Design -========== +Realtime Design +=============== + +Pinot consumes rows from streaming data (such as Kafka) and serves queries on the data consumed thus far. + +Two modes of consumption are supported in Pinot: + +.. _hlc-section: + +High Level Consumers +-------------------- -High Level Consumer -------------------- .. figure:: High-level-stream.png High Level Stream Consumer Architecture -Low Level Consumer ------------------- + +*TODO*: Add design description of how HLC realtime works + + +.. _llc-section: + +Low Level Consumers +------------------- + .. figure:: Low-level-stream.png Low Level Stream Consumer Architecture diff --git a/docs/management_api.rst b/docs/management_api.rst index a2c367a..5e4d843 100644 --- a/docs/management_api.rst +++ b/docs/management_api.rst @@ -1,5 +1,7 @@ -Management REST API -------------------- +Managing Pinot via REST API on the Controller +============================================= + +*TODO* : Remove this section altogether and find a place somewhere for a pointer to the management API. Maybe in the 'Running pinot in production' section? There is a REST API which allows management of tables, tenants, segments and schemas. It can be accessed by going to ``http://[controller host]/help`` which offers a web UI to do these tasks, as well as document the REST API. diff --git a/docs/multitenancy.rst b/docs/multitenancy.rst index edd4bf8..ffcf157 100644 --- a/docs/multitenancy.rst +++ b/docs/multitenancy.rst @@ -87,7 +87,7 @@ Adding Nodes to cluster Adding node to cluster can be done in two ways, manual or automatic. This is controlled by a property set in cluster config called "allowPariticpantAutoJoin". If this is set to true, participants can join the cluster when they are started. If not, they need to be pre-registered in Helix via `Helix Admin <http://helix.apache.org/0.6.4-docs/tutorial_admin.html>`_ command addInstance. -:: +.. code-block:: none { "id" : "PinotPerfTestCluster", @@ -102,7 +102,7 @@ In Pinot 2.0 we will set AUTO_JOIN to true. This means after the SRE's procure t The znode ``CONFIGS/PARTICIPANT/ServerInstanceName`` looks lik below: -:: +.. code-block:: none { "id":"Server_localhost_8098" @@ -120,7 +120,7 @@ The znode ``CONFIGS/PARTICIPANT/ServerInstanceName`` looks lik below: And the znode ``CONFIGS/PARTICIPANT/BrokerInstanceName`` looks like below: -:: +.. code-block:: none { "id":"Broker_localhost_8099" @@ -143,7 +143,7 @@ There is one resource idealstate created for Broker by default called broker_res *CLUSTERNAME/IDEALSTATES/BrokerResource (Broker IdealState before adding data resource)* -:: +.. code-block:: none { "id" : "brokerResource", @@ -166,7 +166,7 @@ After adding a resource using the following data resource creation command, a re Sample Curl request ------------------- -:: +.. 
code-block:: none curl -i -X POST -H 'Content-Type: application/json' -d '{"requestType":"create", "resourceName":"XLNT","tableName":"T1", "timeColumnName":"daysSinceEpoch", "timeType":"daysSinceEpoch","numberOfDataInstances":4,"numberOfCopies":2,"retentionTimeUnit":"DAYS", "retentionTimeValue":"700","pushFrequency":"daily", "brokerTagName":"XLNT", "numberOfBrokerInstances":1, "segmentAssignmentStrategy":"BalanceNumSegmentAssignmentStrategy", "resourceType":"OFFLINE", "metadata":{}}' @@ -175,7 +175,7 @@ This is how it looks in Helix after running the above command. The znode ``CONFIGS/PARTICIPANT/Broker_localhost_8099`` looks as follows: -:: +.. code-block:: none { "id":"Broker_localhost_8099" @@ -193,7 +193,7 @@ The znode ``CONFIGS/PARTICIPANT/Broker_localhost_8099`` looks as follows: And the znode ``IDEALSTATES/brokerResource`` looks like below after Data resource is created -:: +.. code-block:: none { "id":"brokerResource" @@ -220,7 +220,7 @@ Server Info in Helix The znode ``CONFIGS/PARTICIPANT/Server_localhost_8098`` looks as below -:: +.. code-block:: none { "id":"Server_localhost_8098" @@ -238,7 +238,7 @@ The znode ``CONFIGS/PARTICIPANT/Server_localhost_8098`` looks as below And the znode ``/IDEALSTATES/XLNT (XLNT Data Resource IdealState)`` looks as below: -:: +.. code-block:: none { "id":"XLNT" @@ -267,7 +267,7 @@ Add a table to data resource Sample Curl request -:: +.. code-block:: none curl -i -X PUT -H 'Content-Type: application/json' -d '{"requestType":"addTableToResource","resourceName":"XLNT","tableName":"T1", "resourceType":"OFFLINE", "metadata":{}}' <span class="nolink">[http://CONTROLLER-HOST:PORT/dataresources](http://CONTROLLER-HOST:PORT/dataresources) @@ -275,7 +275,7 @@ After the table is added, mapping between Resources and Tables are maintained in The znode ``/PROPERTYSTORE/CONFIGS/RESOURCE/XLNT`` like like: -:: +.. code-block:: none { "id":"mirrorProfileViewOfflineEvents1_O" @@ -307,7 +307,7 @@ The znode ``/PROPERTYSTORE/CONFIGS/RESOURCE/XLNT`` like like: The znode ``/IDEALSTATES/XLNT (XLNT Data Resource IdealState)`` -:: +.. code-block:: none { "id":"XLNT_O" diff --git a/docs/partition_aware_routing.rst b/docs/partition_aware_routing.rst index 1a6a4d9..be9dcd4 100644 --- a/docs/partition_aware_routing.rst +++ b/docs/partition_aware_routing.rst @@ -48,7 +48,7 @@ Prune function: Name of the class that will be used by the broker to prune a seg For example, let us consider a case where the data is naturally partitioned on time column ‘daysSinceEpoch’. The segment zk metadata will have information like below: -:: +.. code-block:: none { “partitionColumn” : “daysSinceEpoch”, @@ -59,7 +59,7 @@ For example, let us consider a case where the data is naturally partitioned on t Now consider the following query comes in. -:: +.. code-block:: none Select count(*) from myTable where daysSinceEpoch between 17100 and 17110 @@ -67,7 +67,7 @@ The broker will recognize the range predicate on the partition column, and call Let’s consider another example where the data is partitioned by memberId, where a hash function was applied on the memberId to compute a partition number. -:: +.. 
code-block:: none { “partitionColumn” : “memberId”, diff --git a/docs/pinot_hadoop.rst b/docs/pinot_hadoop.rst index 0fa4181..479095d 100644 --- a/docs/pinot_hadoop.rst +++ b/docs/pinot_hadoop.rst @@ -1,39 +1,43 @@ -Creating Pinot segments in Hadoop -================================= +Creating Pinot segments +======================= -Pinot index files can be created offline on Hadoop, then pushed onto a production cluster. Because index generation does not happen on the Pinot nodes serving traffic, this means that these nodes can continue to serve traffic without impacting performance while data is being indexed. The index files are then pushed onto the Pinot cluster, where the files are distributed and loaded by the server nodes with minimal performance impact. +Pinot segments can be created offline on Hadoop, or via command line from data files. Controller REST endpoint +can then be used to add the segment to the table to which the segment belongs. + +Creating segments using hadoop +------------------------------ .. figure:: Pinot-Offline-only-flow.png Offline Pinot workflow -To create index files offline a Hadoop workflow can be created to complete the following steps: +To create Pinot segments on Hadoop, a workflow can be created to complete the following steps: -1. Pre-aggregate, clean up and prepare the data, writing it as Avro format files in a single HDFS directory -2. Create the index files -3. Upload the index files to the Pinot cluster +#. Pre-aggregate, clean up and prepare the data, writing it as Avro format files in a single HDFS directory +#. Create segments +#. Upload segments to the Pinot cluster -Step one can be done using your favorite tool (such as Pig, Hive or Spark), while Pinot provides two MapReduce jobs to do step two and three. +Step one can be done using your favorite tool (such as Pig, Hive or Spark), Pinot provides two MapReduce jobs to do step two and three. -Configuration -------------- +Configuring the job +^^^^^^^^^^^^^^^^^^^ Create a job properties configuration file, such as one below: -:: +.. code-block:: none # === Index segment creation job config === # path.to.input: Input directory containing Avro files path.to.input=/user/pinot/input/data - # path.to.output: Output directory containing Pinot index segments + # path.to.output: Output directory containing Pinot segments path.to.output=/user/pinot/output # path.to.schema: Schema file for the table, stored locally path.to.schema=flights-schema.json - # segment.table.name: Name of the table for which to generate index segments + # segment.table.name: Name of the table for which to generate segments segment.table.name=flights # === Segment tar push job config === @@ -45,28 +49,109 @@ Create a job properties configuration file, such as one below: push.to.port=8888 -Index file creation -------------------- +Executing the job +^^^^^^^^^^^^^^^^^ The Pinot Hadoop module contains a job that you can incorporate into your -workflow to generate Pinot indices. Note that this will only create data for you. -In order to have this data on your cluster, you want to also run the SegmentTarPush -job, details below. To run SegmentCreation through the command line: +workflow to generate Pinot segments. -:: +.. code-block:: none mvn clean install -DskipTests -Pbuild-shaded-jar hadoop jar pinot-hadoop-0.016-shaded.jar SegmentCreation job.properties +You can then use the SegmentTarPush job to push segments via the controller REST API. 
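Putting the pieces together, an end-to-end run over the ``job.properties`` file above might look like the sketch below; the shaded jar version (``0.016`` in the commands in this section) will vary with your build.

.. code-block:: none

   # Build the shaded Pinot Hadoop jar once
   mvn clean install -DskipTests -Pbuild-shaded-jar

   # Step 2 of the workflow: create segments from the Avro files under path.to.input
   hadoop jar pinot-hadoop-0.016-shaded.jar SegmentCreation job.properties

   # Step 3 of the workflow: push the generated segments to the controller named in the push job config
   hadoop jar pinot-hadoop-0.016-shaded.jar SegmentTarPush job.properties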
-Index file push --------------- - -This job takes generated Pinot index files from an input directory and pushes -them to a Pinot controller node. - -:: +.. code-block:: none - mvn clean install -DskipTests -Pbuild-shaded-jar hadoop jar pinot-hadoop-0.016-shaded.jar SegmentTarPush job.properties + +Creating Pinot segments outside of Hadoop +----------------------------------------- + +This section describes the steps required for creating Pinot segments from standard formats like CSV/JSON. + +#. Follow the steps described in the section on :doc:`Demonstration <trying_pinot>` to build Pinot. Locate ``pinot-admin.sh`` in ``pinot-tools/target/pinot-tools-pkg/bin/pinot-admin.sh``. +#. Create a top level directory containing all the CSV/JSON files that need to be converted into segments. +#. The file name extensions are expected to be the same as the format name (*i.e.*, ``.csv`` or ``.json``), and are case insensitive. + Note that the converter expects the ``.csv`` extension even if the data is delimited using tabs or spaces instead. +#. Prepare a schema file describing the schema of the input data. The schema needs to be in JSON format. See the example later in this section. +#. Specifically for the CSV format, an optional CSV config file can be provided (also in JSON format). This is used to configure parameters such as the delimiter, header, etc. for the CSV file. + A detailed description of this follows below. + +Run the pinot-admin command to generate the segments. The command can be invoked as follows. Options within "[ ]" are optional. For ``-format``, the default value is AVRO. + +.. code-block:: none + + bin/pinot-admin.sh CreateSegment -dataDir <input_data_dir> [-format [CSV/JSON/AVRO]] [-readerConfigFile <csv_config_file>] [-generatorConfigFile <generator_config_file>] -segmentName <segment_name> -schemaFile <input_schema_file> -tableName <table_name> -outDir <output_data_dir> [-overwrite] + + +To configure various parameters for CSV, a config file in JSON format can be provided. This file is optional, as is each of its parameters. When a parameter is not provided, the default value described below is used: + +#. fileFormat: Specify one of the following. Default is EXCEL. + + #. EXCEL + #. MYSQL + #. RFC4180 + #. TDF + +#. header: If the input CSV file does not contain a header, it can be specified using this field. Note that if this is specified, the input file is expected to not contain the header row, or else it will result in a parse error. The columns in the header must be delimited by the same delimiter character as the rest of the CSV file. +#. delimiter: Use this to specify a delimiter character. The default value is ",". +#. dateFormat: If there are columns that are in date format and need to be converted into Epoch (in milliseconds), use this to specify the format. Default is "mm-dd-yyyy". +#. dateColumns: If there are multiple date columns, use this to list those columns. + +Below is a sample config file. + +.. code-block:: none + + { + "fileFormat" : "EXCEL", + "header" : "col1,col2,col3,col4", + "delimiter" : "\t", + "dateFormat" : "mm-dd-yy", + "dateColumns" : ["col1", "col2"] + } + +Sample Schema: + +..
code-block:: none + + { + "dimensionFieldSpecs" : [ + { + "dataType" : "STRING", + "delimiter" : null, + "singleValueField" : true, + "name" : "name" + }, + { + "dataType" : "INT", + "delimiter" : null, + "singleValueField" : true, + "name" : "age" + } + ], + "timeFieldSpec" : { + "incomingGranularitySpec" : { + "timeType" : "DAYS", + "dataType" : "LONG", + "name" : "incomingName1" + }, + "outgoingGranularitySpec" : { + "timeType" : "DAYS", + "dataType" : "LONG", + "name" : "outgoingName1" + } + }, + "metricFieldSpecs" : [ + { + "dataType" : "FLOAT", + "delimiter" : null, + "singleValueField" : true, + "name" : "percent" + } + ], + "schemaName" : "mySchema" + } diff --git a/docs/pluggable_streams.rst b/docs/pluggable_streams.rst index 876e31c..4c6b26a 100644 --- a/docs/pluggable_streams.rst +++ b/docs/pluggable_streams.rst @@ -16,17 +16,14 @@ Some of the streams for which plug-ins can be added are: * `Pulsar <https://pulsar.apache.org/docs/en/client-libraries-java/>`_ -You may encounter some limitations either in Pinot or in the stream system while developing plug-ins. Please feel free to get in touch with us when you start writing a stream plug-in, and we can help you out. We are open to receiving PRs in order to improve these abstractions if they do not work for a certain stream implementation. +You may encounter some limitations either in Pinot or in the stream system while developing plug-ins. +Please feel free to get in touch with us when you start writing a stream plug-in, and we can help you out. +We are open to receiving PRs in order to improve these abstractions if they do not work for a certain stream implementation. -Pinot Stream Consumers ---------------------- -Pinot consumes rows from event streams and serves queries on the data consumed. Rows may be consumed either at stream level (also referred to as high level) or at partition level (also referred to as low level). +Refer to sections :ref:`hlc-section` and :ref:`llc-section` for details on how Pinot consumes streaming data. -.. figure:: pluggable_streams.png - -.. figure:: High-level-stream.png - - Stream Level Consumer +Requirements to support Stream Level (High Level) consumers +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The stream should provide the following guarantees: @@ -36,13 +33,12 @@ The stream should provide the following guarantees: * The checkpoints should be recorded only when Pinot makes a call to do so. * The consumer should be able to start consumption from one of: - ** latest avaialble data - ** earliest available data - ** last saved checkpoint - -.. figure:: Low-level-stream.png + * latest available data + * earliest available data + * last saved checkpoint - Partition Level Consumer +Requirements to support Partition Level (Low Level) consumers +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ While consuming rows at a partition level, the stream should support the following properties: @@ -52,9 +48,10 @@ properties: * Refer to a partition as a number not exceeding 32 bits long. * Stream should provide the following mechanisms to get an offset for a given partition of the stream: - ** get the offset of the oldest event available (assuming events are aged out periodically) in the partition. - ** get the offset of the most recent event published in the partition - ** (optionally) get the offset of an event that was published at a specified time + * get the offset of the oldest event available (assuming events are aged out periodically) in the partition.
+ * get the offset of the most recent event published in the partition + * (optionally) get the offset of an event that was published at a specified time + * Stream should provide a mechanism to consume a set of events from a partition starting from a specified offset. * Events with higher offsets should be more recent (the offsets of events need not be contiguous) @@ -91,24 +88,25 @@ The rest of the configuration properties for your stream should be set with the All values should be strings. For example: -.:: - "streamType" : "foo", - "stream.foo.topic.name" : "SomeTopic", - "stream.foo.consumer.type": "lowlevel", - "stream.foo.consumer.factory.class.name": "fully.qualified.pkg.ConsumerFactoryClassName", - "stream.foo.consumer.prop.auto.offset.reset": "largest", - "stream.foo.decoder.class.name" : "fully.qualified.pkg.DecoderClassName", - "stream.foo.decoder.prop.a.decoder.property" : "decoderPropValue", - "stream.foo.connection.timeout.millis" : "10000", // default 30_000 - "stream.foo.fetch.timeout.millis" : "10000" // default 5_000 +.. code-block:: none + + "streamType" : "foo", + "stream.foo.topic.name" : "SomeTopic", + "stream.foo.consumer.type": "lowlevel", + "stream.foo.consumer.factory.class.name": "fully.qualified.pkg.ConsumerFactoryClassName", + "stream.foo.consumer.prop.auto.offset.reset": "largest", + "stream.foo.decoder.class.name" : "fully.qualified.pkg.DecoderClassName", + "stream.foo.decoder.prop.a.decoder.property" : "decoderPropValue", + "stream.foo.connection.timeout.millis" : "10000", // default 30_000 + "stream.foo.fetch.timeout.millis" : "10000" // default 5_000 You can have additional properties that are specific to your stream. For example: -.:: +.. code-block:: none -"stream.foo.some.buffer.size" : "24g" + "stream.foo.some.buffer.size" : "24g" In addition to these properties, you can define thresholds for the consuming segments: @@ -117,10 +115,10 @@ In addition to these properties, you can define thresholds for the consuming seg The properties for the thresholds are as follows: -.:: +.. code-block:: none -"realtime.segment.flush.threshold.size" : "100000" -"realtime.segment.flush.threshold.time" : "6h" + "realtime.segment.flush.threshold.size" : "100000" + "realtime.segment.flush.threshold.time" : "6h" An example of this implementation can be found in the `KafkaConsumerFactory <com.linkedin.pinot.core.realtime.impl.kafka.KafkaConsumerFactory>`_, which is an implementation for the kafka stream. diff --git a/docs/pql_examples.rst b/docs/pql_examples.rst index a160763..4688617 100644 --- a/docs/pql_examples.rst +++ b/docs/pql_examples.rst @@ -77,66 +77,30 @@ Note: results might not be consistent if column ordered by has same value in mul ORDER BY bar DESC LIMIT 50, 100 -Wild-card match ---------------- +Wild-card match (in WHERE clause only) +-------------------------------------- + +To count rows where the column ``airlineName`` starts with ``U`` .. code-block:: sql SELECT count(*) FROM SomeTable - WHERE regexp_like(columnName, '.*regex-here?') - GROUP BY someOtherColumn TOP 10 + WHERE regexp_like(airlineName, '^U.*') + GROUP BY airlineName TOP 10 + +Examples with UDF +----------------- -Time-Convert UDF ----------------- +As of now, functions have to be implemented within Pinot. Injecting functions is not allowed yet. +The examples below demonstrate the use of UDFs .. 
code-block:: sql SELECT count(*) FROM myTable GROUP BY timeConvert(timeColumnName, 'SECONDS', 'DAYS') -Differences with SQL --------------------- - -* ``JOIN`` is not supported -* Use ``TOP`` instead of ``LIMIT`` for truncation -* ``LIMIT n`` has no effect in grouping queries, should use ``TOP n`` instead. If no ``TOP n`` defined, PQL will use ``TOP 10`` as default truncation setting. -* No need to select the columns to group with. - -The following two queries are both supported in PQL, where the non-aggregation columns are ignored. - -.. code-block:: sql - - SELECT MIN(foo), MAX(foo), SUM(foo), AVG(foo) FROM mytable - GROUP BY bar, baz - TOP 50 - - SELECT bar, baz, MIN(foo), MAX(foo), SUM(foo), AVG(foo) FROM mytable - GROUP BY bar, baz - TOP 50 - -* Always order by the aggregated value - The results will always order by the aggregated value itself. -* Results equivalent to grouping on each aggregation - The results for query: - -.. code-block:: sql - - SELECT MIN(foo), MAX(foo) FROM myTable - GROUP BY bar - TOP 50 - -will be the same as the combining results from the following queries: - -.. code-block:: sql - - SELECT MIN(foo) FROM myTable - GROUP BY bar - TOP 50 - SELECT MAX(foo) FROM myTable - GROUP BY bar - TOP 50 - -where we don't put the results for the same group together. + SELECT count(*) FROM myTable + GROUP BY div(tim PQL Specification ----------------- @@ -189,10 +153,12 @@ WHERE Supported predicates are comparisons with a constant using the standard SQL operators (``=``, ``<``, ``<=``, ``>``, ``>=``, ``<>``, '!=') , range comparisons using ``BETWEEN`` (``foo BETWEEN 42 AND 69``), set membership (``foo IN (1, 2, 4, 8)``) and exclusion (``foo NOT IN (1, 2, 4, 8)``). For ``BETWEEN``, the range is inclusive. +Comparison with a regular expression is supported using the regexp_like function, as in ``WHERE regexp_like(columnName, 'regular expression')`` + GROUP BY ^^^^^^^^ -The ``GROUP BY`` clause groups aggregation results by a list of columns. +The ``GROUP BY`` clause groups aggregation results by a list of columns, or transform functions on columns (see below) ORDER BY @@ -224,11 +190,72 @@ For example, the following query will calculate the maximum value of column ``fo Supported transform functions ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +``ADD`` + Sum of at least two values + +``SUB`` + Difference between two values -* ``ADD``: sum of at least two values -* ``SUB``: difference between two values -* ``MULT``: product of at least two values -* ``DIV``: quotient of two values -* ``TIMECONVERT``: takes 3 arguments, converts the value into another time unit. E.g. ``TIMECONVERT(time, 'MILLISECONDS', 'SECONDS')`` -* ``DATETIMECONVERT``: takes 4 arguments, converts the value into another date time format, and buckets time based on the given time granularity. E.g. ``DATETIMECONVERT(date, '1:MILLISECONDS:EPOCH', '1:SECONDS:EPOCH', '15:MINUTES')`` -* ``VALUEIN``: takes at least 2 arguments, where the first argument is a multi-valued column, and the following arguments are constant values. The transform function will filter the value from the multi-valued column with the given constant values. The ``VALUEIN`` transform function is especially useful when the same multi-valued column is both filtering column and grouping column. E.g. ``VALUEIN(mvColumn, 3, 5, 15)`` +``MULT`` + Product of at least two values + +``DIV`` + Quotient of two values + +``TIMECONVERT`` + Takes 3 arguments, converts the value into another time unit. E.g. 
``TIMECONVERT(time, 'MILLISECONDS', 'SECONDS')`` + +``DATETIMECONVERT`` + Takes 4 arguments, converts the value into another date time format, and buckets time based on the given time granularity. + *e.g.* ``DATETIMECONVERT(date, '1:MILLISECONDS:EPOCH', '1:SECONDS:EPOCH', '15:MINUTES')`` + +``VALUEIN`` + Takes at least 2 arguments, where the first argument is a multi-valued column, and the following arguments are constant values. + The transform function will filter the value from the multi-valued column with the given constant values. + The ``VALUEIN`` transform function is especially useful when the same multi-valued column is both filtering column and grouping column. + *e.g.* ``VALUEIN(mvColumn, 3, 5, 15)`` + + +Differences with SQL +-------------------- + +* ``JOIN`` is not supported +* Use ``TOP`` instead of ``LIMIT`` for truncation +* ``LIMIT n`` has no effect in grouping queries, should use ``TOP n`` instead. If no ``TOP n`` defined, PQL will use ``TOP 10`` as default truncation setting. +* No need to select the columns to group with. + +The following two queries are both supported in PQL, where the non-aggregation columns are ignored. + +.. code-block:: sql + + SELECT MIN(foo), MAX(foo), SUM(foo), AVG(foo) FROM mytable + GROUP BY bar, baz + TOP 50 + + SELECT bar, baz, MIN(foo), MAX(foo), SUM(foo), AVG(foo) FROM mytable + GROUP BY bar, baz + TOP 50 + +* Always order by the aggregated value + The results will always order by the aggregated value itself. +* Results equivalent to grouping on each aggregation + The results for query: + +.. code-block:: sql + + SELECT MIN(foo), MAX(foo) FROM myTable + GROUP BY bar + TOP 50 + +will be the same as the combining results from the following queries: + +.. code-block:: sql + + SELECT MIN(foo) FROM myTable + GROUP BY bar + TOP 50 + SELECT MAX(foo) FROM myTable + GROUP BY bar + TOP 50 + +where we don't put the results for the same group together. diff --git a/docs/reference.rst b/docs/reference.rst index 51881d8..2171119 100644 --- a/docs/reference.rst +++ b/docs/reference.rst @@ -1,15 +1,10 @@ .. _reference: -Pinot Reference Manual -====================== .. toctree:: - :maxdepth: 2 + :maxdepth: 1 pql_examples client_api management_api pinot_hadoop - creating_pinot_segments - pluggable_streams - segment_fetcher diff --git a/docs/schema_timespec.rst b/docs/schema_timespec.rst index 1d7e769..531ff28 100644 --- a/docs/schema_timespec.rst +++ b/docs/schema_timespec.rst @@ -6,7 +6,7 @@ Problems with current schema design The pinot schema timespec looks like this: -:: +.. code-block:: none { "timeFieldSpec": @@ -29,7 +29,7 @@ Changes We have added a List<DateTimeFieldSpec> _dateTimeFieldSpecs to the pinot schema -:: +.. code-block:: none { “dateTimeFieldSpec”: @@ -67,7 +67,7 @@ We have added a List<DateTimeFieldSpec> _dateTimeFieldSpecs to the pinot schema Examples: -:: +.. code-block:: none “dateTimeFieldSpec”: { diff --git a/docs/segment_fetcher.rst b/docs/segment_fetcher.rst index 7e57602..97e2339 100644 --- a/docs/segment_fetcher.rst +++ b/docs/segment_fetcher.rst @@ -14,15 +14,15 @@ HDFS segment fetcher configs ----------------------------- In your Pinot controller/server configuration, you will need to provide the following configs: -:: + +.. code-block:: none pinot.controller.segment.fetcher.hdfs.hadoop.conf.path=`<file path to hadoop conf folder> or - -:: +.. 
code-block:: none pinot.server.segment.fetcher.hdfs.hadoop.conf.path=`<file path to hadoop conf folder> @@ -30,14 +30,14 @@ or This path should point the local folder containing ``core-site.xml`` and ``hdfs-site.xml`` files from your Hadoop installation -:: +.. code-block:: none pinot.controller.segment.fetcher.hdfs.hadoop.kerberos.principle=`<your kerberos principal> pinot.controller.segment.fetcher.hdfs.hadoop.kerberos.keytab=`<your kerberos keytab> or -:: +.. code-block:: none pinot.server.segment.fetcher.hdfs.hadoop.kerberos.principle=`<your kerberos principal> pinot.server.segment.fetcher.hdfs.hadoop.kerberos.keytab=`<your kerberos keytab> @@ -54,7 +54,7 @@ To push HDFS segment files to Pinot controller, you just need to ensure you have For example, the following curl requests to Controller will notify it to download segment files to the proper table: -:: +.. code-block:: none curl -X POST -H "UPLOAD_TYPE:URI" -H "DOWNLOAD_URI:hdfs://nameservice1/hadoop/path/to/segment/file.gz" -H "content-type:application/json" -d '' localhost:9000/segments @@ -63,13 +63,13 @@ Implement your own segment fetcher for other systems You can also implement your own segment fetchers for other file systems and load into Pinot system with an external jar. All you need to do is to implement a class that extends the interface of `SegmentFetcher <https://github.com/linkedin/pinot/blob/master/pinot-common/src/main/java/com/linkedin/pinot/common/segment/fetcher/SegmentFetcher.java>`_ and provides config to Pinot Controller and Server as follows: -:: +.. code-block:: none pinot.controller.segment.fetcher.`<protocol>`.class =`<class path to your implementation> or -:: +.. code-block:: none pinot.server.segment.fetcher.`<protocol>`.class =`<class path to your implementation> diff --git a/docs/trying_pinot.rst b/docs/trying_pinot.rst index ddc1197..40d932d 100644 --- a/docs/trying_pinot.rst +++ b/docs/trying_pinot.rst @@ -1,5 +1,5 @@ -Running the Pinot Demonstration -=============================== +Quickstart guide +================ A quick way to get familiar with Pinot is to run the Pinot examples. The examples can be run either by compiling the code or by running the prepackaged Docker images. @@ -35,7 +35,7 @@ Trying out the demo Once the Pinot cluster is running, you can query it by going to http://localhost:9000/query/ You can also use the REST API to query Pinot, as well as the Java client. As this is outside of the scope of this -introduction, the reference documentation to use the Pinot client APIs is in the :ref:`client-api` section. +introduction, the reference documentation to use the Pinot client APIs is in the :doc:`client_api` section. Pinot uses PQL, a SQL-like query language, to query data. Here are some sample queries: --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
