[GitHub] jon-wei closed pull request #6137: [Backport] Add docs for virtual columns and transform specs

GitBox Thu, 09 Aug 2018 16:18:24 -0700

jon-wei closed pull request #6137: [Backport] Add docs for virtual columns and 
transform specs
URL: https://github.com/apache/incubator-druid/pull/6137


This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/docs/content/ingestion/index.md b/docs/content/ingestion/index.md
index 4d09374703a..f5393d1023e 100644
--- a/docs/content/ingestion/index.md
+++ b/docs/content/ingestion/index.md
@@ -87,7 +87,8 @@ An example dataSchema is shown below:
     "segmentGranularity" : "DAY",
     "queryGranularity" : "NONE",
     "intervals" : [ "2013-08-31/2013-09-01" ]
-  }
+  },
+  "transformSpec" : null
 }
 ```
 
@@ -97,6 +98,7 @@ An example dataSchema is shown below:
 | parser | JSON Object | Specifies how ingested data can be parsed. | yes |
 | metricsSpec | JSON Object array | A list of 
[aggregators](../querying/aggregations.html). | yes |
 | granularitySpec | JSON Object | Specifies how to create segments and roll up 
data. | yes |
+| transformSpec | JSON Object | Specifies how to filter and transform input 
data. See [transform specs](../ingestion/transform-spec.html).| no |
 
 ## Parser
 
@@ -233,7 +235,9 @@ For example, the following `dimensionsSpec` section from a 
`dataSchema` ingests
 }
 ```
 
-
+## metricsSpec
+ The `metricsSpec` is a list of [aggregators](../querying/aggregations.html). 
If `rollup` is false in the granularity spec, the metrics spec should be an 
empty list and all columns should be defined in the `dimensionsSpec` instead 
(without rollup, there isn't a real distinction between dimensions and metrics 
at ingestion time). This convention is optional, however.
+ 
 ## GranularitySpec
 
 The default granularity spec is `uniform`, and can be changed by setting the 
`type` field.
@@ -260,6 +264,10 @@ This spec is used to generate segments with arbitrary 
intervals (it tries to cre
 | rollup | boolean | rollup or not | no (default == true) |
 | intervals | string | A list of intervals for the raw data being ingested. 
Ignored for real-time ingestion. | yes for batch, no for real-time |
 
+# Transform Spec
+
+Transform specs allow Druid to transform and filter input data during 
ingestion. See [Transform specs](../ingestion/transform-spec.html).
+
 # IO Config
 
 Stream Push Ingestion: Stream push ingestion with Tranquility does not require 
an IO Config.
diff --git a/docs/content/ingestion/transform-spec.md 
b/docs/content/ingestion/transform-spec.md
new file mode 100644
index 00000000000..eedaaa6950b
--- /dev/null
+++ b/docs/content/ingestion/transform-spec.md
@@ -0,0 +1,84 @@
+---
+layout: doc_page
+---
+
+# Transform Specs
+
+Transform specs allow Druid to filter and transform input data during 
ingestion. 
+
+## Syntax
+
+The syntax for the transformSpec is shown below:
+
+```
+"transformSpec": {
+  "transforms": <list of transforms>,
+  "filter": <filter>
+}
+```
+
+|property|description|required?|
+|--------|-----------|---------|
+|transforms|A list of [transforms](#transforms) to be applied to input rows. 
|no|
+|filter|A [filter](../querying/filters.html) that will be applied to input 
rows; only rows that pass the filter will be ingested.|no|
+
+## Transforms
+
+The `transforms` list allows the user to specify a set of column 
transformations to be performed on input data.
+
+Transforms allow adding new fields to input rows. Each transform has a "name" 
(the name of the new field) which can be referred to by DimensionSpecs, 
AggregatorFactories, etc.
+
+A transform behaves as a "row function", taking an entire row as input and 
outputting a column value.
+
+If a transform has the same name as a field in an input row, then it will 
shadow the original field. Transforms that shadow fields may still refer to the 
fields they shadow. This can be used to transform a field "in-place".
+
+Transforms do have some limitations. They can only refer to fields present in 
the actual input rows; in particular, they cannot refer to other transforms. 
They also cannot remove fields, only add them. However, they can shadow a field 
with another field containing all nulls, which will act similarly to removing 
the field.
+
+Note that the transforms are applied before the filter.
+
+### Expression Transform
+
+Druid currently supports one kind of transform, the expression transform.
+
+An expression transform has the following syntax:
+
+```
+{
+  "type": "expression",
+  "name": <output field name>,
+  "expression": <expr>
+}
+```
+
+|property|description|required?|
+|--------|-----------|---------|
+|name|The output field name of the expression transform.|yes|
+|expression|An [expression](../misc/math-expr.html) that will be applied to 
input rows to produce a value for the transform's output field.|yes|
+
+For example, the following expression transform prepends "foo" to the values 
of a `page` column in the input data, and creates a `fooPage` column.
+
+```
+    {
+      "type": "expression",
+      "name": "fooPage",
+      "expression": "concat('foo', page)"
+    }
+```
+
+## Filtering
+
+The transformSpec allows Druid to filter out input rows during ingestion. A 
row that fails to pass the filter will not be ingested.
+
+Any of Druid's standard [filters](../querying/filters.html) can be used.
+
+Note that the filtering takes place after the transforms, so filters will 
operate on transformed rows and not the raw input data if transforms are 
present.
+
+For example, the following filter would ingest only input rows where a 
`country` column has the value "United States":
+
+```
+"filter": {
+  "type": "selector",
+  "dimension": "country",
+  "value": "United States"
+}
+```
\ No newline at end of file
diff --git a/docs/content/misc/math-expr.md b/docs/content/misc/math-expr.md
index abcebdd3b5e..d8214916c22 100644
--- a/docs/content/misc/math-expr.md
+++ b/docs/content/misc/math-expr.md
@@ -2,6 +2,12 @@
 layout: doc_page
 ---
 
+# Druid Expressions
+
+<div class="note info">
+This feature is still experimental. It has not been optimized for performance 
yet, and its implementation is known to have significant inefficiencies.
+</div>
+ 
 This expression language supports the following operators (listed in 
decreasing order of precedence).
 
 |Operators|Description|
diff --git a/docs/content/querying/virtual-columns.md 
b/docs/content/querying/virtual-columns.md
new file mode 100644
index 00000000000..117b75ea559
--- /dev/null
+++ b/docs/content/querying/virtual-columns.md
@@ -0,0 +1,60 @@
+---
+layout: doc_page
+---
+
+# Virtual Columns
+
+Virtual columns are queryable column "views" created from a set of columns 
during a query. 
+
+A virtual column can potentially draw from multiple underlying columns, 
although a virtual column always presents itself as a single column.
+
+Virtual columns can be used as dimensions or as inputs to aggregators.
+
+Each Druid query can accept a list of virtual columns as a parameter. The 
following scan query is provided as an example:
+
+```
+{
+ "queryType": "scan",
+ "dataSource": "page_data",
+ "columns":[],
+ "virtualColumns": [
+    {
+      "type": "expression",
+      "name": "fooPage",
+      "expression": "concat('foo', page)",
+      "outputType": "STRING"
+    },
+    {
+      "type": "expression",
+      "name": "tripleWordCount",
+      "expression": "wordCount * 3",
+      "outputType": "LONG"
+    }
+  ],
+ "intervals": [
+   "2013-01-01/2019-01-02"
+ ] 
+}
+```
+
+
+## Virtual Column Types
+
+### Expression virtual column
+
+The expression virtual column has the following syntax:
+
+```
+{
+  "type": "expression",
+  "name": <name of the virtual column>,
+  "expression": <row expression>,
+  "outputType": <output value type of expression>
+}
+```
+
+|property|description|required?|
+|--------|-----------|---------|
+|name|The name of the virtual column.|yes|
+|expression|An [expression](../misc/math-expr.html) that takes a row as input 
and outputs a value for the virtual column.|yes|
+|outputType|The expression's output will be coerced to this type. Can be LONG, 
FLOAT, DOUBLE, or STRING.|no, default is FLOAT|
\ No newline at end of file
diff --git a/docs/content/toc.md b/docs/content/toc.md
index e51f95c05d4..7524c0b5906 100644
--- a/docs/content/toc.md
+++ b/docs/content/toc.md
@@ -22,6 +22,7 @@ layout: toc
     * [Stream Pull](/docs/VERSION/ingestion/stream-pull.html)
   * [Updating Existing Data](/docs/VERSION/ingestion/update-existing-data.html)
   * [Ingestion Tasks](/docs/VERSION/ingestion/tasks.html)
+  * [Transform Specs](/docs/VERSION/ingestion/transform-spec.html)
   * [FAQ](/docs/VERSION/ingestion/faq.html)
 
 ## Querying
@@ -50,6 +51,7 @@ layout: toc
   * [Multitenancy](/docs/VERSION/querying/multitenancy.html)
   * [Caching](/docs/VERSION/querying/caching.html)
   * [Sorting Orders](/docs/VERSION/querying/sorting-orders.html)
+  * [Virtual Columns](/docs/VERSION/querying/virtual-columns.html)
 
 ## Design
   * [Overview](/docs/VERSION/design/design.html)
@@ -110,5 +112,6 @@ layout: toc
 
 
 ## Misc
+  * [Druid Expressions Language](/docs/VERSION/misc/math-expr.html)
   * [Papers & Talks](/docs/VERSION/misc/papers-and-talks.html)
   * [Thanks](/thanks.html)
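
The row semantics described in transform-spec.md above (each transform reads only the original input row, same-named outputs shadow input fields, and the filter runs after the transforms) can be sketched in Python. This is only an illustration under stated assumptions, not Druid's implementation: `apply_transform_spec` is a hypothetical helper, the lambdas stand in for Druid expression strings, and the sample rows are made up.

```python
# Illustrative sketch only: Druid evaluates expression strings; here plain
# Python callables stand in for them. All names below are hypothetical.

def apply_transform_spec(row, transforms, row_filter):
    """Return the transformed row, or None if the filter rejects it."""
    # Every transform reads the ORIGINAL row, so it cannot see another
    # transform's output, but it can still read a field it shadows.
    added = {name: fn(row) for name, fn in transforms.items()}
    # Outputs that share a name with an input field shadow that field.
    transformed = {**row, **added}
    # The filter runs after the transforms, on the transformed row.
    return transformed if row_filter(transformed) else None


transforms = {
    # Stand-in for the fooPage expression transform from the docs.
    "fooPage": lambda r: "foo" + r["page"],
    # Shadows "country" to clean it up "in-place".
    "country": lambda r: r["country"].strip(),
}

# Stand-in for a selector filter on country == "United States".
def selector(r):
    return r["country"] == "United States"


rows = [
    {"page": "Main", "country": " United States "},
    {"page": "Other", "country": "Canada"},
]

ingested = []
for r in rows:
    out = apply_transform_spec(r, transforms, selector)
    if out is not None:
        ingested.append(out)

print(ingested)
```

In this sketch the first row is kept only because the shadowing `country` transform strips the padding before the filter sees it, which is the practical consequence of applying transforms before the filter.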


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
