This is an automated email from the ASF dual-hosted git repository. mcvsubbu pushed a commit to branch 0.2.0 in repository https://gitbox.apache.org/repos/asf/incubator-pinot.git
commit 8164dc50c1464312e1e0fa6c940f818305041d63 Author: Mayank Shrivastava <[email protected]> AuthorDate: Tue Sep 24 16:59:02 2019 -0700 Adding documentation for Pinot Schema. (#4637) --- docs/admin_guide.rst | 1 + docs/schema.rst | 124 +++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 125 insertions(+) diff --git a/docs/admin_guide.rst b/docs/admin_guide.rst index 2a50913..aa72486 100644 --- a/docs/admin_guide.rst +++ b/docs/admin_guide.rst @@ -27,6 +27,7 @@ Admin Guide :maxdepth: 1 tableconfig_schema + schema in_production pinot_hadoop customizations diff --git a/docs/schema.rst b/docs/schema.rst new file mode 100644 index 0000000..8cdb98d --- /dev/null +++ b/docs/schema.rst @@ -0,0 +1,124 @@ +.. +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at +.. +.. http://www.apache.org/licenses/LICENSE-2.0 +.. +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. +.. + +.. _schema-section: + +Pinot Schema +============ + +Pinot Schema consists of columns that can be categorized as dimensions, metric or time. + +**Dimensions** - These are columns that organize the data. For example, ``accountId``, ``country``, ``industry``, etc. These columns are used to slice/dice the data and typically appear in the ``selection``, ``filter`` and ``group-by`` sections in queries. + +**Metrics** - These are columns that represent quantative measurements. For example, ``numClicks``, ``pageViews``, etc. These columns typically appear in the aggregation section of the query, e.g., ``select sum(pageViews) from...``. + +**Time** - The time column represents the timestamp of the data. There is only one time column in a schema, and typically appears in the ``filter`` and ``group-by`` sections in queries. + +A sample schema is shown below and will be used as an example to described the various fields. + +Sample Schema: +~~~~~~~~~~~~~~ + +.. code-block:: json + + { + "schemaName": "flights", + "dimensionFieldSpecs": [ + { + "name": "flightNumber", + "dataType": "LONG" + }, + { + "name": "tags", + "dataType": "STRING", + "singleValueField": false + } + ], + "metricFieldSpecs": [ + { + "name": "price", + "dataType": "DOUBLE" + } + ], + "timeFieldSpec": { + "incomingGranularitySpec": { + "name": "secondsSinceEpoch", + "dataType": "INT", + "timeFormat" : "EPOCH", + "timeType": "SECONDS" + }, + "outgoingGranularitySpec": { + "name": "messageTimeHours", + "dataType": "INT", + "timeFormat" : "EPOCH", + "timeType": "DAYS" + } + } + } + +Schema Name: +~~~~~~~~~~~~ + +Every Pinot schema has a name that is used to identify it. For example, schema for a table is specified in the table conig by its name. + + +FieldSpecs: +~~~~~~~~~~~ + +Each column in the schema is described using a ``fieldSpec`` that captures various attributes of the column, such as name, data-type, etc. A schema may contain an array of ``dimensionFieldSpecs`` that describe all the dimension columns, ``metricFieldSpecs`` that describe all the metric columns, and a ``timeFieldSpec`` that describes the time column. + +Dimensions: +~~~~~~~~~~~ + +The schema example above has two dimensions ``flightNumber`` of type ``Long`` and ``tags`` of type ``String``. Data types supported for dimension columns are: + +* **INT** +* **FLOAT** +* **LONG** +* **DOUBLE** +* **STRING** +* **BYTES** + +Dimension columns also have an optional attribute named ``singleValuedFiled`` with a default value of ``true``. This attribute describes whether the column can take a single value or multiple values for a row. In the example above, the dimension ``tags`` is multi-valued. This means that it can have multiple values for a particular row, say ``tag1, tag2, tag3``. For a multi-valued column, individual rows don't necessarily need to have the same number of values. Typical use case for this w [...] + + +Metrics: +~~~~~~~~ + +The column ``price`` in the schema is a metric column, as one can query aggregations such as ``average``, ``min``, ``max``, etc on it and are typically single-valued. Metric columns can have the following data types: + +* **INT** +* **FLOAT** +* **LONG** +* **DOUBLE** +* **BYTES** + +Metric columns are typically numeric. However, note that ``BYTES`` is an allowed data type for metric columns. This is typically used in cases of specialized representations such as HLL, TDigest, etc, where the column actually stores byte serialized version of the value. + +Time: +~~~~~ + +The schema above also contains a ``timeFieldSpec`` that is used to specify the attributes of the time column: + +* **incomingGranularitySpec** : Specifies the name, data type and time type for the time stamp present in the incoming data into Pinot. +* **outgoingGranularitySpec** : Specifies the name, data type and time type for the time stamp as desired to be stored in Pinot. + +In this example, the input timestamp specified in ``SECONDS`` will be automatically converted into ``DAYS`` before storing into Pinot. The ``timeFieldSpec`` also has an optional attribute ``timeFormat`` that can take values ``EPOCH`` (default) and ``SIMPLE_DATE_FORMAT:<format>``. + +Time columns are mandatory for ``APPEND`` (incremental data push) use cases but optional for ``REFRESH`` (data refresh with each push) use cases. More details on this can be found at the `Segment Config <tableconfig_schema.html#segments-config-section>`_ section. --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
