[GitHub] [druid] suneet-s commented on a change in pull request #9704: Refresh query docs.

GitBox Wed, 15 Apr 2020 11:16:28 -0700

suneet-s commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409024524


 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common 
kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table 
datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always 
creating or writing into a table
+datasource. But at query time, there are many other types of datasources 
available.
 
-A data source is the Apache Druid equivalent of a database table. However, a 
query can also masquerade as a data source, providing subquery-like 
functionality. Query data sources are currently supported only by 
[GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM` 
clause](sql.html#from).
 
-### Table datasource
-The table data source is the most common type. It's represented by a string, 
or by the full structure:
+The word "datasource" is generally spelled `dataSource` (with a capital S) 
when it appears in API requests and
+responses.
 
+## Datasource type
+
+### `table`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT column1, column2 FROM "druid"."dataSourceName"
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": "dataSourceName",
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+The table datasource is the most common type. This is the kind of datasource 
you get when you perform
+[data ingestion](../ingestion/index.html). They are split up into segments, 
distributed around the cluster,
+and queried in parallel.
+
+In [Druid SQL](sql.html#from), table datasources reside in the the `druid` 
schema. This is the default schema, so table
+datasources can be referenced as either `druid.dataSourceName` or simply 
`dataSourceName`.
+
+In native queries, table datasources can be referenced using their names as 
strings (as in the example above), or by
+using JSON objects of the form:
+
+```json
+"dataSource": {
+  "type": "table",
+  "name": "dataSourceName"
+}
+```
+
+To see a list of all table datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'druid'`.
+
+### `lookup`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT k, v FROM lookup.countries
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": {
+    "type": "lookup",
+    "lookup": "countries"
+  },
+  "columns": ["k", "v"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Lookup datasources correspond to Druid's key-value [lookup](lookups.html) 
objects. In [Druid SQL](sql.html#from),
+they reside in the the `lookup` schema. They are preloaded in memory on all 
servers, so they can be accessed rapidly.
+They can be joined onto regular tables using the [join operator](#join).
+
+Lookup datasources are key-value oriented and always have exactly two columns: 
`k` (the key) and `v` (the value), and
+both are always strings.
+
+To see a list of all lookup datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'lookup'`.
+
+> Performance tip: Lookups can be joined with a base table either using an 
explicit [join](#join), or by using the
+> SQL [`LOOKUP` function](sql.html#string-functions).
+> However, the join operator must evaluate the condition on each row, whereas 
the
+> `LOOKUP` function can defer evaluation until after an aggregation phase. 
This means that the `LOOKUP` function is
+> usually faster than joining to a lookup datasource.
+
+### `query`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Uses a subquery to count hits per page, then takes the average.
+SELECT
+  AVG(cnt) AS average_hits_per_page
+FROM
+  (SELECT page, COUNT(*) AS hits FROM site_traffic GROUP BY page)
+```
+<!--Native-->
+```json
+{
+  "queryType": "timeseries",
+  "dataSource": {
+    "type": "query",
+    "query": {
+      "queryType": "groupBy",
+      "dataSource": "site_traffic",
+      "intervals": ["0000/3000"],
+      "granularity": "all",
+      "dimensions": ["page"],
+      "aggregations": [
+        { "type": "count", "name": "hits" }
+      ]
+    }
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "aggregations": [
+    { "type": "longSum", "name": "hits", "fieldName": "hits" },
+    { "type": "count", "name": "pages" }
+  ],
+  "postAggregations": [
+    { "type": "expression", "name": "average_hits_per_page", "expression": 
"hits / pages" }
+  ]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Query datasources allow you to issue subqueries. In native queries, they can 
appear anywhere that accepts a
+`dataSource`. In SQL, they can appear in the following places, always 
surrounded by parentheses:
+
+- The FROM clause: `FROM (<subquery>)`.
+- As inputs to a JOIN: `<table-or-subquery-1> t1 INNER JOIN 
<table-or-subquery-2> t2 ON t1.<col1> = t2.<col2>`.
+- In the WHERE clause: `WHERE <column> { IN | NOT IN } (<subquery>)`. These 
are translated to joins by the SQL planner.
+
+> Performance tip: Subquery results need to be fully transferred to the Broker 
as part of query execution.
+> This means that subqueries with large result sets can cause performance 
bottlenecks or run into memory usage limits.
+> See the [Query execution](query-execution.md) documentation for more details 
on how subqueries are executed and what
+> limits will apply.
+
+### `join`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Joins "sales" with "countries" (using "store" as the join key) to get sales 
by country.
+SELECT
+  countries.v AS country,
+  SUM(sales.revenue) AS country_revenue
+FROM
+  sales
+  INNER JOIN lookup.store_to_country ON sales.store = store_to_country.k
+GROUP BY
+  countries.v
+```
+<!--Native-->
 ```json
 {
-       "type": "table",
-       "name": "<string_value>"
+  "queryType": "groupBy",
+  "dataSource": {
+    "type": "join",
+    "left": "sales",
+    "right": {
+      "type": "lookup",
+      "lookup": "store_to_country"
+    },
+    "rightPrefix": "r.",
+    "condition": "store == \"r.k\"",
+    "joinType": "INNER"
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "dimensions": [
+    { "type": "default", "outputName": "country", "dimension": "r.v" }
+  ],
+  "aggregations": [
+    { "type": "longSum", "name": "country_revenue", "fieldName": "revenue" }
+  ]
 }
 ```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Join datasources allow you to do a SQL-style join of two datasources. Stacking 
joins on top of each other allows
+you to join arbitrarily many datasources.
+
+In Druid {{DRUIDVERSION}}, joins are implemented with a broadcast hash-join 
algorithm. This means that all tables
+other than the leftmost "base" table must fit in memory. It also means that 
the join condition must be an equality. This
+feature is intended mainly to allow joining regular Druid tables with 
[lookup](#lookup), [inline](#inline), and
+[query](#query) datasources.
+
+For information about how Druid executes queries involving joins, refer to the
+[Query execution](query-execution.html#join) page.
+
+#### Joins in SQL
+
+SQL joins take the form:
+
+```
+<o1> [ INNER | LEFT [OUTER] ] JOIN <o2> ON <condition>
+```
+
+The condition must involve only equalities, but functions are okay, and there 
can be multiple equalities ANDed together.
+Conditions like `t1.x = t2.x`, or `LOWER(t1.x) = t2.x`, or `t1.x = t2.x AND 
t1.y = t2.y` can all be handled. Conditions
+like `t1.x <> t2.x` cannot currently be handled.
 
-### Union datasource
+Note that Druid SQL is less rigid than what native join datasources can 
handle. In cases where a SQL query does
+something that is not allowed as-is with a native join datasource, Druid SQL 
will generate a subquery. This can have
+a substantial effect on performance and scalability, so it is something to 
watch out for. Some examples of when the
+SQL layer will generate subqueries include:
 
-This data source unions two or more table data sources.
+- Joining a regular Druid table to itself, or to another regular Druid table. 
The native join datasource can accept
+a table on the left-hand side, but not the right, so a subquery is needed.
 
+- Join conditions where the expressions on either side are of different types.
+
+- Join conditions where the right-hand expression is not a direct column 
access.
+
+For more information about how Druid translates SQL to native queries, refer 
to the
+[Druid SQL](sql.md#query-translation) documentation.
+
+#### Joins in native queries
+
+Native join datasources have the following properties. All are required.
+
+|Field|Description|
+|-----|-----------|
+|`left`|Left-hand datasource. Must be of type `table`, `join`, `lookup`, 
`query`, or `inline`. Placing another join as the left datasource allows you to 
join arbitrarily many datasources.|
+|`right`|Right-hand datasource. Must be of type `lookup`, `query`, or 
`inline`. Note that this is more rigid than what Druid SQL requires.|
+|`rightPrefix`|String prefix that will be applied to all columns from the 
right-hand datasource, to prevent them from colliding with columns from the 
left-hand datasource. Can be any string, so long as it is nonempty and is not 
be a prefix of the string `__time`. Any columns from the left-hand side that 
start with your `rightPrefix` will be shadowed. It is up to you to provide a 
prefix that will not shadow any important columns from the left side.|
+|`condition`|[Expression](../misc/math-expr.md) that must be an equality where 
one side is an expression of the left-hand side, and the other side is a simple 
column reference to the right-hand side. Note that this is more rigid than what 
Druid SQL requires: here, the right-hand reference must be a simple column 
reference; in SQL it can be an expression.|
+|`joinType`|`INNER` or `LEFT`.|
+
+#### Join performance
+
+Joins are a feature that can significantly affect performance of your queries. 
Some performance tips and notes:
+
+1. Joins are especially useful with [lookup datasources](#lookup), but in most 
cases, the
+[`LOOKUP` function](sql.html#string-functions) performs better than a join. 
Consider using the `LOOKUP` function if
+it is appropriate for your use case.
+2. When using joins in Druid SQL, keep in mind that it can generate subqueries 
that you did not explicitly include in
+your queries. Refer to the [Druid SQL](sql.md#query-translation) documentation 
for more details about when this happens
+and how to detect it.
+3. One common reason for implicit subquery generation is if the types of the 
two halves of an equality do not match.
+For example, since lookup keys are always strings, the condition `druid.d JOIN 
lookup.l ON d.field = l.field` will
+perform best if `d.field` is a string.
+4. As of Druid {{DRUIDVERSION}}, the join operator must evaluate the condition 
for each row. In the future, we expect
 
 Review comment:
   I think this is an extension of the first point about using the LOOKUP 
function instead of a join?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org

[GitHub] [druid] suneet-s commented on a change in pull request #9704: Refresh query docs.

Reply via email to