techdocsmith commented on code in PR #13588:
URL: https://github.com/apache/druid/pull/13588#discussion_r1061929995
##########
examples/quickstart/jupyter-notebooks/partitioned-by-tutorial.ipynb:
##########
@@ -0,0 +1,428 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "ad4e60b6",
+ "metadata": {
+ "deletable": true,
+ "editable": true,
+ "tags": []
+ },
+ "source": [
+ "# Tutorial: Druid SQL segment sizing and partitioning\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ " \n",
+ "Partitioning is a method of organizing a large datasource into
independent partitions.\n",
+ "Partitioning reduces the size of your data and increases query
performance.\n",
Review Comment:
Partitioning is a way of organizing your segments. The size of the segment
is irrelevant. Partitioning doesn't reduce the size of the data; it doesn't
affect the size of the data at all. What it does is increase data "locality". This
means that for dimensions you partition on, rows with the same value are stored
together, making it quicker for Druid to eliminate any data that doesn't match
the filter on the partitioned dimension.
Consider this example as one segment file:

Timestamp is the primary partition. Secondary partitions (on dimensions) are
in order. So imagine that `Page` is a secondary partition. In this case, all the
rows with the same page value are stored together. When Druid executes the
query, it can first check whether the timestamp falls within the times included in
the segment and only open segments that have relevant timestamps. This is why
filtering on time makes things faster. Then, within the time chunk, all the rows for
the same `Page` are stored together. This means that if we want rows that relate to
`Ke$ha`, Druid can filter out any rows that don't match.
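A rough way to picture the time-based pruning described above, as a sketch only: the segment files, time ranges, and the `prune_segments` helper are invented for illustration and are not Druid internals or APIs.

```python
from datetime import datetime

# Hypothetical segment index: each segment file covers one time chunk.
segments = [
    {"file": "seg-2016-06-27.bin", "start": datetime(2016, 6, 27), "end": datetime(2016, 6, 28)},
    {"file": "seg-2016-06-28.bin", "start": datetime(2016, 6, 28), "end": datetime(2016, 6, 29)},
]

def prune_segments(segments, query_start, query_end):
    """Keep only segments whose time chunk overlaps the query's time filter."""
    return [s for s in segments if s["start"] < query_end and s["end"] > query_start]

# A query filtered to June 27 only needs to open one of the two files.
hits = prune_segments(segments, datetime(2016, 6, 27), datetime(2016, 6, 28))
print([s["file"] for s in hits])
```

The same overlap check is why a time filter lets Druid skip whole segment files before reading any rows.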
##########
examples/quickstart/jupyter-notebooks/partitioned-by-tutorial.ipynb:
##########
@@ -0,0 +1,428 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "ad4e60b6",
+ "metadata": {
+ "deletable": true,
+ "editable": true,
+ "tags": []
+ },
+ "source": [
+ "# Tutorial: Druid SQL segment sizing and partitioning\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ " \n",
+ "Partitioning is a method of organizing a large datasource into
independent partitions.\n",
+ "Partitioning reduces the size of your data and increases query
performance.\n",
+ "\n",
+ "At ingestion, Apache Druid always partitions its data by time.\n",
+ "Each time chunk is then divided into one or more
[segments](https://druid.apache.org/docs/latest/design/segments.html).\n",
+ "\n",
+ "This tutorial describes how to configure partitioning for the Druid SQL
ingestion method. For information about partitioning configurations supported
by other ingestion methods, see [How to configure
partitioning](https://druid.apache.org/docs/latest/ingestion/partitioning.html#how-to-configure-partitioning)."
Review Comment:
```suggestion
"This tutorial describes how to configure partitioning during ingestion
using Druid SQL. For information about partitioning configurations supported by
other ingestion methods, see [How to configure
partitioning](https://druid.apache.org/docs/latest/ingestion/partitioning.html#how-to-configure-partitioning)."
```
##########
examples/quickstart/jupyter-notebooks/partitioned-by-tutorial.ipynb:
##########
@@ -0,0 +1,428 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "ad4e60b6",
+ "metadata": {
+ "deletable": true,
+ "editable": true,
+ "tags": []
+ },
+ "source": [
+ "# Tutorial: Druid SQL segment sizing and partitioning\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ " \n",
+ "Partitioning is a method of organizing a large datasource into
independent partitions.\n",
+ "Partitioning reduces the size of your data and increases query
performance.\n",
+ "\n",
+ "At ingestion, Apache Druid always partitions its data by time.\n",
+ "Each time chunk is then divided into one or more
[segments](https://druid.apache.org/docs/latest/design/segments.html).\n",
+ "\n",
+ "This tutorial describes how to configure partitioning for the Druid SQL
ingestion method. For information about partitioning configurations supported
by other ingestion methods, see [How to configure
partitioning](https://druid.apache.org/docs/latest/ingestion/partitioning.html#how-to-configure-partitioning)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8d6bbbcb",
+ "metadata": {
+ "deletable": true,
+ "tags": []
+ },
+ "source": [
+ "## Prerequisites\n",
+ "\n",
+ "Make sure that you meet the requirements outlined in the README.md file
of the [apache/druid
repo](https://github.com/apache/druid/tree/master/examples/quickstart/jupyter-notebooks/).\n",
+ "Specifically, you need the following:\n",
+ "- Knowledge of SQL\n",
+ "- [Python3](https://www.python.org/downloads/)\n",
+ "- [The `requests` package for
Python](https://requests.readthedocs.io/en/latest/user/install/)\n",
+ "- [JupyterLab](https://jupyter.org/install#jupyterlab) (recommended) or
[Jupyter Notebook](https://jupyter.org/install#jupyter-notebook) running on a
non-default port. Druid and Jupyter both default to port `8888`, so you need to
start Jupyter on a different port. \n",
+ "- An available Druid instance. This tutorial uses the `micro-quickstart`
configuration described in the [Druid
quickstart](https://druid.apache.org/docs/latest/tutorials/index.html), so no
authentication or authorization is required unless explicitly mentioned. If you
haven’t already, download Druid version 24.0 or higher and start Druid services
as described in the quickstart."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8f8e64f0-c29a-473c-8783-a2ff8648acd7",
+ "metadata": {},
+ "source": [
+ "## Prepare your environment\n",
+ "\n",
+ "Start by running the following cell. It imports the required Python
packages and defines a variable for the Druid host."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b7f08a52",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "import requests\n",
+ "import json\n",
+ "\n",
+ "# druid_host is the hostname and port for your Druid deployment. \n",
+ "# In a distributed environment, use the Router service as the
`druid_host`. \n",
+ "druid_host = \"http://localhost:8888\"\n",
+ "dataSourceName = \"partitioning-tutorial\"\n",
+ "print(f\"\\033[1mDruid host\\033[0m: {druid_host}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e893ef7d-7136-442f-8bd9-31b5a5276518",
+ "metadata": {},
+ "source": [
+ "In the rest of the tutorial, the `endpoint`, `http_method`, and `payload`
variables are updated to accomplish different tasks."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ebd8c7db-c39f-4ef7-86ec-81f405e02550",
+ "metadata": {},
+ "source": [
+ "## Segment size\n",
+ "\n",
+ "A segment is the smallest unit of storage in Druid.\n",
+ "It is recommended that you optimize your segment file size at ingestion
time for Druid to operate well under a heavy query load.\n",
+ "\n",
+ "Consider the following to optimize your segment file size:\n",
+ "\n",
+ "- The number of rows per segment should be around five million. You can
set the number of rows per segment using the `rowsPerSegment` query context
parameter in the [Druid SQL
API](https://druid.apache.org/docs/latest/querying/sql-api.html) or as a [JDBC
connection properties
object](https://druid.apache.org/docs/latest/querying/sql-jdbc.html). To
specify the `rowsPerSegment` parameters in the Druid web console, navigate to
the **Query** page, then click **Engine > Edit context** to bring up the **Edit
query context** dialog. For more information on how to specify query context
parameters, see [Setting the query
context](https://druid.apache.org/docs/latest/querying/sql-query-context.html#setting-the-query-context).\n",
Review Comment:
We should demonstrate how to do this using the API. See
https://github.com/apache/druid/pull/13465/files#r1061870567
```
{
"query": "SELECT 1 + 1",
"context": {
"<key>": "<value>",
"rowsPerSegment": 5000000
}
}
```
Consider excluding the information about how to set the query context in
the UI, since this is an API-based tutorial.
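In the notebook's own Python/requests style, a cell demonstrating this could look like the sketch below. The `/druid/v2/sql/task` endpoint and the `context` field follow the Druid SQL API the tutorial already links to; the `SELECT 1 + 1` query is just a placeholder mirroring the JSON above.

```python
import json

# Sketch of a notebook cell that sets rowsPerSegment in the query context.
druid_host = "http://localhost:8888"
endpoint = "/druid/v2/sql/task"

payload = json.dumps({
    "query": "SELECT 1 + 1",
    "context": {
        "rowsPerSegment": 5000000
    }
})

# In the notebook this would then be submitted with something like:
# response = requests.post(druid_host + endpoint, data=payload,
#                          headers={"Content-Type": "application/json"})
print(payload)
```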
##########
examples/quickstart/jupyter-notebooks/partitioned-by-tutorial.ipynb:
##########
@@ -0,0 +1,428 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "ad4e60b6",
+ "metadata": {
+ "deletable": true,
+ "editable": true,
+ "tags": []
+ },
+ "source": [
+ "# Tutorial: Druid SQL segment sizing and partitioning\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ " \n",
+ "Partitioning is a method of organizing a large datasource into
independent partitions.\n",
+ "Partitioning reduces the size of your data and increases query
performance.\n",
Review Comment:
I see that we are only covering time in this tutorial. That is fine, so you
won't need to go into so much detail about secondary partitioning, but we
should mention it and note that it will be covered in another tutorial. I think
the scope you've already set out is good enough for this tutorial.
##########
examples/quickstart/jupyter-notebooks/partitioned-by-tutorial.ipynb:
##########
@@ -0,0 +1,428 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "ad4e60b6",
+ "metadata": {
+ "deletable": true,
+ "editable": true,
+ "tags": []
+ },
+ "source": [
+ "# Tutorial: Druid SQL segment sizing and partitioning\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ " \n",
+ "Partitioning is a method of organizing a large datasource into
independent partitions.\n",
+ "Partitioning reduces the size of your data and increases query
performance.\n",
+ "\n",
+ "At ingestion, Apache Druid always partitions its data by time.\n",
+ "Each time chunk is then divided into one or more
[segments](https://druid.apache.org/docs/latest/design/segments.html).\n",
Review Comment:
I feel like the description here is best (Druid docs could use some
consolidation):
https://druid.apache.org/docs/latest/design/architecture.html#datasources-and-segments
> Druid data is stored in datasources, which are similar to tables in a
traditional RDBMS. Each datasource is partitioned by time and, optionally,
further partitioned by other attributes.
> Each time range is called a chunk (for example, a single day, if your
datasource is partitioned by day). Within a chunk, data is partitioned into one
or more [segments](https://druid.apache.org/docs/latest/design/segments.html).
> Each segment is a single file, typically comprising up to a few million
rows of data.
And then say that, this way, if you filter your query on time, Druid doesn't
have to open any files that don't match the filter.
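The day-granularity chunking in that description can be sketched as follows. This is illustrative only: Druid does this internally during ingestion, and the rows and names here are made up.

```python
from collections import defaultdict
from datetime import datetime

# Illustrative only: group rows into day-granularity time chunks,
# mimicking what PARTITIONED BY DAY does inside Druid.
rows = [
    {"__time": datetime(2016, 6, 27, 10, 0), "page": "Ke$ha"},
    {"__time": datetime(2016, 6, 27, 23, 59), "page": "Druid"},
    {"__time": datetime(2016, 6, 28, 0, 1), "page": "Ke$ha"},
]

chunks = defaultdict(list)
for row in rows:
    chunk_key = row["__time"].date()  # floor the timestamp to its DAY chunk
    chunks[chunk_key].append(row)

# Each chunk would then be written as one or more segment files.
for key in sorted(chunks):
    print(key, len(chunks[key]))
```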
##########
examples/quickstart/jupyter-notebooks/partitioned-by-tutorial.ipynb:
##########
@@ -0,0 +1,428 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "ad4e60b6",
+ "metadata": {
+ "deletable": true,
+ "editable": true,
+ "tags": []
+ },
+ "source": [
+ "# Tutorial: Druid SQL segment sizing and partitioning\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ " \n",
+ "Partitioning is a method of organizing a large datasource into
independent partitions.\n",
+ "Partitioning reduces the size of your data and increases query
performance.\n",
+ "\n",
+ "At ingestion, Apache Druid always partitions its data by time.\n",
+ "Each time chunk is then divided into one or more
[segments](https://druid.apache.org/docs/latest/design/segments.html).\n",
+ "\n",
+ "This tutorial describes how to configure partitioning for the Druid SQL
ingestion method. For information about partitioning configurations supported
by other ingestion methods, see [How to configure
partitioning](https://druid.apache.org/docs/latest/ingestion/partitioning.html#how-to-configure-partitioning)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8d6bbbcb",
+ "metadata": {
+ "deletable": true,
+ "tags": []
+ },
+ "source": [
+ "## Prerequisites\n",
Review Comment:
see same comments in #13465 related to removing the "prepare your
environment" section
https://github.com/apache/druid/pull/13465/files#r1061863499
And regarding the formatting
https://github.com/apache/druid/pull/13465/files#r1061870567
##########
examples/quickstart/jupyter-notebooks/partitioned-by-tutorial.ipynb:
##########
@@ -0,0 +1,428 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "ad4e60b6",
+ "metadata": {
+ "deletable": true,
+ "editable": true,
+ "tags": []
+ },
+ "source": [
+ "# Tutorial: Druid SQL segment sizing and partitioning\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ " \n",
+ "Partitioning is a method of organizing a large datasource into
independent partitions.\n",
+ "Partitioning reduces the size of your data and increases query
performance.\n",
+ "\n",
+ "At ingestion, Apache Druid always partitions its data by time.\n",
+ "Each time chunk is then divided into one or more
[segments](https://druid.apache.org/docs/latest/design/segments.html).\n",
+ "\n",
+ "This tutorial describes how to configure partitioning for the Druid SQL
ingestion method. For information about partitioning configurations supported
by other ingestion methods, see [How to configure
partitioning](https://druid.apache.org/docs/latest/ingestion/partitioning.html#how-to-configure-partitioning)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8d6bbbcb",
+ "metadata": {
+ "deletable": true,
+ "tags": []
+ },
+ "source": [
+ "## Prerequisites\n",
+ "\n",
+ "Make sure that you meet the requirements outlined in the README.md file
of the [apache/druid
repo](https://github.com/apache/druid/tree/master/examples/quickstart/jupyter-notebooks/).\n",
+ "Specifically, you need the following:\n",
+ "- Knowledge of SQL\n",
+ "- [Python3](https://www.python.org/downloads/)\n",
+ "- [The `requests` package for
Python](https://requests.readthedocs.io/en/latest/user/install/)\n",
+ "- [JupyterLab](https://jupyter.org/install#jupyterlab) (recommended) or
[Jupyter Notebook](https://jupyter.org/install#jupyter-notebook) running on a
non-default port. Druid and Jupyter both default to port `8888`, so you need to
start Jupyter on a different port. \n",
+ "- An available Druid instance. This tutorial uses the `micro-quickstart`
configuration described in the [Druid
quickstart](https://druid.apache.org/docs/latest/tutorials/index.html), so no
authentication or authorization is required unless explicitly mentioned. If you
haven’t already, download Druid version 24.0 or higher and start Druid services
as described in the quickstart."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8f8e64f0-c29a-473c-8783-a2ff8648acd7",
+ "metadata": {},
+ "source": [
+ "## Prepare your environment\n",
+ "\n",
+ "Start by running the following cell. It imports the required Python
packages and defines a variable for the Druid host."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b7f08a52",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "import requests\n",
+ "import json\n",
+ "\n",
+ "# druid_host is the hostname and port for your Druid deployment. \n",
+ "# In a distributed environment, use the Router service as the
`druid_host`. \n",
+ "druid_host = \"http://localhost:8888\"\n",
+ "dataSourceName = \"partitioning-tutorial\"\n",
+ "print(f\"\\033[1mDruid host\\033[0m: {druid_host}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e893ef7d-7136-442f-8bd9-31b5a5276518",
+ "metadata": {},
+ "source": [
+ "In the rest of the tutorial, the `endpoint`, `http_method`, and `payload`
variables are updated to accomplish different tasks."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ebd8c7db-c39f-4ef7-86ec-81f405e02550",
+ "metadata": {},
+ "source": [
+ "## Segment size\n",
+ "\n",
+ "A segment is the smallest unit of storage in Druid.\n",
+ "It is recommended that you optimize your segment file size at ingestion
time for Druid to operate well under a heavy query load.\n",
+ "\n",
+ "Consider the following to optimize your segment file size:\n",
+ "\n",
+ "- The number of rows per segment should be around five million. You can
set the number of rows per segment using the `rowsPerSegment` query context
parameter in the [Druid SQL
API](https://druid.apache.org/docs/latest/querying/sql-api.html) or as a [JDBC
connection properties
object](https://druid.apache.org/docs/latest/querying/sql-jdbc.html). To
specify the `rowsPerSegment` parameters in the Druid web console, navigate to
the **Query** page, then click **Engine > Edit context** to bring up the **Edit
query context** dialog. For more information on how to specify query context
parameters, see [Setting the query
context](https://druid.apache.org/docs/latest/querying/sql-query-context.html#setting-the-query-context).\n",
+ "- Segment file size should be within the range of 300-700 MB. The number
of rows per segment takes precedence over the segment byte size. \n",
+ "\n",
+ "For more information on segment sizing, see [Segment size
optimization](https://druid.apache.org/docs/latest/operations/segment-optimization.html)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "84cb68a0-beb1-47d5-9fd5-384ea0caa35d",
+ "metadata": {},
+ "source": [
+ "## PARTITIONED BY\n",
+ "\n",
+ "In Druid SQL, the granularity of a segment is defined by the granularity
of the PARTITIONED BY clause.\n",
+ "\n",
+
"[INSERT](https://druid.apache.org/docs/latest/multi-stage-query/reference.html#insert)
and
[REPLACE](https://druid.apache.org/docs/latest/multi-stage-query/reference.html#replace)
statements both require the PARTITIONED BY clause.\n",
+ "\n",
+ "PARTITIONED BY accepts the following time granularity arguments:\n",
+ "- `time_unit`\n",
+ "- `TIME_FLOOR(__time, period)` \n",
+ "- `FLOOR(__time TO time_unit)`\n",
+ "- `ALL` or `ALL TIME`\n",
+ "\n",
+ "Continue reading to learn about each of the supported arguments.\n",
+ "\n",
+ "### Time unit\n",
+ "\n",
+ "`PARTITIONED BY(time_unit)`. Partition by `SECOND`, `MINUTE`, `HOUR`,
`DAY`, `WEEK`, `MONTH`, `QUARTER`, or `YEAR`.\n",
+ "\n",
+ "For example, run the following cell to ingest data from an external
source into a table named `partitioning-tutorial` and partition the datasource
by `DAY`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "045f782c-74d8-4447-9487-529071812b51",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = \"/druid/v2/sql/task\"\n",
+ "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
+ "http_method = \"POST\"\n",
+ "\n",
+ "# If you already have an existing datasource named partitioning-tutorial,
use REPLACE INTO instead of INSERT INTO.\n",
+ "payload = json.dumps({\n",
+ "\"query\": \"INSERT INTO \\\"partitioning-tutorial\\\" SELECT
TIME_PARSE(\\\"timestamp\\\") \\\n",
+ " AS __time, * FROM TABLE \\\n",
+ " (EXTERN('{\\\"type\\\": \\\"http\\\", \\\"uris\\\":
[\\\"https://druid.apache.org/data/wikipedia.json.gz\\\"]}', '{\\\"type\\\":
\\\"json\\\"}', '[{\\\"name\\\": \\\"added\\\", \\\"type\\\": \\\"long\\\"},
{\\\"name\\\": \\\"channel\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"cityName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"comment\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"commentLength\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\":
\\\"countryIsoCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"countryName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"deleted\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"delta\\\",
\\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"deltaBucket\\\", \\\"type\\\":
\\\"string\\\"}, {\\\"name\\\": \\\"diffUrl\\\", \\\"type\\\": \\\"string\\\"},
{\\\"name\\\": \\\"flags\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"isAnonymous\\\", \\\"type\\\": \\\"string\\\"}, {\\\
"name\\\": \\\"isMinor\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"isNew\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isRobot\\\",
\\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isUnpatrolled\\\",
\\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"metroCode\\\", \\\"type\\\":
\\\"string\\\"}, {\\\"name\\\": \\\"namespace\\\", \\\"type\\\":
\\\"string\\\"}, {\\\"name\\\": \\\"page\\\", \\\"type\\\": \\\"string\\\"},
{\\\"name\\\": \\\"regionIsoCode\\\", \\\"type\\\": \\\"string\\\"},
{\\\"name\\\": \\\"regionName\\\", \\\"type\\\": \\\"string\\\"},
{\\\"name\\\": \\\"timestamp\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"user\\\", \\\"type\\\": \\\"string\\\"}]')) \\\n",
Review Comment:
This datasource file only has 24K rows. Wondering if we could use the taxi
data with this:
```
REPLACE INTO "trips_xaa" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
EXTERN(
'{"type":"http","uris":["https://static.imply.io/example-data/trips/trips_xaa.csv.gz","https://static.imply.io/example-data/trips/trips_xab.csv.gz","https://static.imply.io/example-data/trips/trips_xac.csv.gz"]}',
'{"type":"csv","findColumnsFromHeader":false,"columns":["trip_id","vendor_id","pickup_datetime","dropoff_datetime","store_and_fwd_flag","rate_code_id","pickup_longitude","pickup_latitude","dropoff_longitude","dropoff_latitude","passenger_count","trip_distance","fare_amount","extra","mta_tax","tip_amount","tolls_amount","ehail_fee","improvement_surcharge","total_amount","payment_type","trip_type","pickup","dropoff","cab_type","precipitation","snow_depth","snowfall","max_temperature","min_temperature","average_wind_speed","pickup_nyct2010_gid","pickup_ctlabel","pickup_borocode","pickup_boroname","pickup_ct2010","pickup_boroct2010","pickup_cdeligibil","pickup_ntacode","pickup_ntaname","pickup_puma","dropoff_nyct2010_gid","dropoff_ctlabel","dropoff_borocode","dropoff_boroname","dropoff_ct2010","dropoff_boroct2010","dropoff_cdeligibil","dropoff_ntacode","dropoff_ntaname","dropoff_puma"]}',
'[{"name":"trip_id","type":"long"},{"name":"vendor_id","type":"long"},{"name":"pickup_datetime","type":"string"},{"name":"dropoff_datetime","type":"string"},{"name":"store_and_fwd_flag","type":"string"},{"name":"rate_code_id","type":"long"},{"name":"pickup_longitude","type":"double"},{"name":"pickup_latitude","type":"double"},{"name":"dropoff_longitude","type":"double"},{"name":"dropoff_latitude","type":"double"},{"name":"passenger_count","type":"long"},{"name":"trip_distance","type":"double"},{"name":"fare_amount","type":"double"},{"name":"extra","type":"double"},{"name":"mta_tax","type":"double"},{"name":"tip_amount","type":"double"},{"name":"tolls_amount","type":"double"},{"name":"ehail_fee","type":"string"},{"name":"improvement_surcharge","type":"string"},{"name":"total_amount","type":"double"},{"name":"payment_type","type":"long"},{"name":"trip_type","type":"string"},{"name":"pickup","type":"string"},{"name":"dropoff","type":"string"},{"name":"cab_type","type":"string"},{
"name":"precipitation","type":"double"},{"name":"snow_depth","type":"long"},{"name":"snowfall","type":"long"},{"name":"max_temperature","type":"long"},{"name":"min_temperature","type":"long"},{"name":"average_wind_speed","type":"double"},{"name":"pickup_nyct2010_gid","type":"long"},{"name":"pickup_ctlabel","type":"long"},{"name":"pickup_borocode","type":"long"},{"name":"pickup_boroname","type":"string"},{"name":"pickup_ct2010","type":"long"},{"name":"pickup_boroct2010","type":"long"},{"name":"pickup_cdeligibil","type":"string"},{"name":"pickup_ntacode","type":"string"},{"name":"pickup_ntaname","type":"string"},{"name":"pickup_puma","type":"long"},{"name":"dropoff_nyct2010_gid","type":"long"},{"name":"dropoff_ctlabel","type":"long"},{"name":"dropoff_borocode","type":"long"},{"name":"dropoff_boroname","type":"string"},{"name":"dropoff_ct2010","type":"long"},{"name":"dropoff_boroct2010","type":"long"},{"name":"dropoff_cdeligibil","type":"string"},{"name":"dropoff_ntacode","type":"strin
g"},{"name":"dropoff_ntaname","type":"string"},{"name":"dropoff_puma","type":"long"}]'
)
))
SELECT
TIME_PARSE("pickup_datetime") AS "__time",
"trip_id",
"vendor_id",
"dropoff_datetime",
"store_and_fwd_flag",
"rate_code_id",
"pickup_longitude",
"pickup_latitude",
"dropoff_longitude",
"dropoff_latitude",
"passenger_count",
"trip_distance",
"fare_amount",
"extra",
"mta_tax",
"tip_amount",
"tolls_amount",
"ehail_fee",
"improvement_surcharge",
"total_amount",
"payment_type",
"trip_type",
"pickup",
"dropoff",
"cab_type",
"precipitation",
"snow_depth",
"snowfall",
"max_temperature",
"min_temperature",
"average_wind_speed",
"pickup_nyct2010_gid",
"pickup_ctlabel",
"pickup_borocode",
"pickup_boroname",
"pickup_ct2010",
"pickup_boroct2010",
"pickup_cdeligibil",
"pickup_ntacode",
"pickup_ntaname",
"pickup_puma",
"dropoff_nyct2010_gid",
"dropoff_ctlabel",
"dropoff_borocode",
"dropoff_boroname",
"dropoff_ct2010",
"dropoff_boroct2010",
"dropoff_cdeligibil",
"dropoff_ntacode",
"dropoff_ntaname",
"dropoff_puma"
FROM "ext"
PARTITIONED BY DAY
```
Also, consider not adding every single column. We should load only enough
columns to illustrate our point.
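A trimmed-down cell could build the ingestion payload from a small column subset, for example like this sketch. The data URL is the one the notebook already uses; the particular columns chosen here are arbitrary, picked only to keep the EXTERN schema short.

```python
import json

# Sketch: build the ingestion payload with only a few columns, enough to
# illustrate PARTITIONED BY without listing the full schema.
columns = json.dumps([
    {"name": "channel", "type": "string"},
    {"name": "page", "type": "string"},
    {"name": "added", "type": "long"},
    {"name": "timestamp", "type": "string"},
])
input_source = json.dumps({"type": "http",
                           "uris": ["https://druid.apache.org/data/wikipedia.json.gz"]})
input_format = json.dumps({"type": "json"})

query = (
    'INSERT INTO "partitioning-tutorial" '
    'SELECT TIME_PARSE("timestamp") AS __time, channel, page, added '
    f"FROM TABLE(EXTERN('{input_source}', '{input_format}', '{columns}')) "
    "PARTITIONED BY DAY"
)
payload = json.dumps({"query": query})
print(payload)
```

Building the payload with `json.dumps` this way also avoids the long runs of escaped quotes in the current cell.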
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]