techdocsmith commented on code in PR #13588:
URL: https://github.com/apache/druid/pull/13588#discussion_r1061929995
##########
examples/quickstart/jupyter-notebooks/partitioned-by-tutorial.ipynb:
##########
@@ -0,0 +1,428 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "ad4e60b6",
+ "metadata": {
+ "deletable": true,
+ "editable": true,
+ "tags": []
+ },
+ "source": [
+ "# Tutorial: Druid SQL segment sizing and partitioning\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ " \n",
+ "Partitioning is a method of organizing a large datasource into
independent partitions.\n",
+ "Partitioning reduces the size of your data and increases query
performance.\n",
Review Comment:
Partitioning is a way of organizing your segments. The size of the segment
is irrelevant. Partitioning doesn't reduce the size of the data; it doesn't
affect the size of the data at all. What it does is increase data "locality". This
means that for dimensions you partition on, rows with the same value are stored
together, making it quicker for Druid to eliminate any data that doesn't match
the filter on the partitioned dimension.
Consider this example as one segment file:

Timestamp is the primary partition. Secondary partitions (on dimensions) are
in order. So imagine that `Page` is a secondary partition. In this case, all the
rows with the same page value are stored together. When Druid executes the
query, it can first check whether the timestamp falls within the times included in
the segment and only open segments that have relevant timestamps. This is why
filtering on time makes things faster. Then, within the time chunk, all the rows for
the same `Page` are stored together. This means that if we want rows that relate to
`Ke$ha`, Druid can filter out any rows that don't match.
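A rough way to picture the time-based pruning described above, as a sketch only: the segment files, time ranges, and the `prune_segments` helper are invented for illustration and are not Druid internals or APIs.

```python
from datetime import datetime

# Hypothetical segment index: each segment file covers one time chunk.
segments = [
    {"file": "seg-2016-06-27.bin", "start": datetime(2016, 6, 27), "end": datetime(2016, 6, 28)},
    {"file": "seg-2016-06-28.bin", "start": datetime(2016, 6, 28), "end": datetime(2016, 6, 29)},
]

def prune_segments(segments, query_start, query_end):
    """Keep only segments whose time chunk overlaps the query's time filter."""
    return [s for s in segments if s["start"] < query_end and s["end"] > query_start]

# A query filtered to June 27 only needs to open one of the two files.
hits = prune_segments(segments, datetime(2016, 6, 27), datetime(2016, 6, 28))
print([s["file"] for s in hits])
```

The same overlap check is why a time filter lets Druid skip whole segment files before reading any rows.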
##########
examples/quickstart/jupyter-notebooks/partitioned-by-tutorial.ipynb:
##########
@@ -0,0 +1,428 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "ad4e60b6",
+ "metadata": {
+ "deletable": true,
+ "editable": true,
+ "tags": []
+ },
+ "source": [
+ "# Tutorial: Druid SQL segment sizing and partitioning\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ " \n",
+ "Partitioning is a method of organizing a large datasource into
independent partitions.\n",
+ "Partitioning reduces the size of your data and increases query
performance.\n",
+ "\n",
+ "At ingestion, Apache Druid always partitions its data by time.\n",
+ "Each time chunk is then divided into one or more
[segments](https://druid.apache.org/docs/latest/design/segments.html).\n",
+ "\n",
+ "This tutorial describes how to configure partitioning for the Druid SQL
ingestion method. For information about partitioning configurations supported
by other ingestion methods, see [How to configure
partitioning](https://druid.apache.org/docs/latest/ingestion/partitioning.html#how-to-configure-partitioning)."
Review Comment:
```suggestion
"This tutorial describes how to configure partitioning during ingestion
using Druid SQL. For information about partitioning configurations supported by
other ingestion methods, see [How to configure
partitioning](https://druid.apache.org/docs/latest/ingestion/partitioning.html#how-to-configure-partitioning)."
```
##########
examples/quickstart/jupyter-notebooks/partitioned-by-tutorial.ipynb:
##########
@@ -0,0 +1,428 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "ad4e60b6",
+ "metadata": {
+ "deletable": true,
+ "editable": true,
+ "tags": []
+ },
+ "source": [
+ "# Tutorial: Druid SQL segment sizing and partitioning\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ " \n",
+ "Partitioning is a method of organizing a large datasource into
independent partitions.\n",
+ "Partitioning reduces the size of your data and increases query
performance.\n",
+ "\n",
+ "At ingestion, Apache Druid always partitions its data by time.\n",
+ "Each time chunk is then divided into one or more
[segments](https://druid.apache.org/docs/latest/design/segments.html).\n",
+ "\n",
+ "This tutorial describes how to configure partitioning for the Druid SQL
ingestion method. For information about partitioning configurations supported
by other ingestion methods, see [How to configure
partitioning](https://druid.apache.org/docs/latest/ingestion/partitioning.html#how-to-configure-partitioning)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8d6bbbcb",
+ "metadata": {
+ "deletable": true,
+ "tags": []
+ },
+ "source": [
+ "## Prerequisites\n",
+ "\n",
+ "Make sure that you meet the requirements outlined in the README.md file
of the [apache/druid
repo](https://github.com/apache/druid/tree/master/examples/quickstart/jupyter-notebooks/).\n",
+ "Specifically, you need the following:\n",
+ "- Knowledge of SQL\n",
+ "- [Python3](https://www.python.org/downloads/)\n",
+ "- [The `requests` package for
Python](https://requests.readthedocs.io/en/latest/user/install/)\n",
+ "- [JupyterLab](https://jupyter.org/install#jupyterlab) (recommended) or
[Jupyter Notebook](https://jupyter.org/install#jupyter-notebook) running on a
non-default port. Druid and Jupyter both default to port `8888`, so you need to
start Jupyter on a different port. \n",
+ "- An available Druid instance. This tutorial uses the `micro-quickstart`
configuration described in the [Druid
quickstart](https://druid.apache.org/docs/latest/tutorials/index.html), so no
authentication or authorization is required unless explicitly mentioned. If you
haven’t already, download Druid version 24.0 or higher and start Druid services
as described in the quickstart."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8f8e64f0-c29a-473c-8783-a2ff8648acd7",
+ "metadata": {},
+ "source": [
+ "## Prepare your environment\n",
+ "\n",
+ "Start by running the following cell. It imports the required Python
packages and defines a variable for the Druid host."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b7f08a52",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "import requests\n",
+ "import json\n",
+ "\n",
+ "# druid_host is the hostname and port for your Druid deployment. \n",
+ "# In a distributed environment, use the Router service as the
`druid_host`. \n",
+ "druid_host = \"http://localhost:8888\"\n",
+ "dataSourceName = \"partitioning-tutorial\"\n",
+ "print(f\"\\033[1mDruid host\\033[0m: {druid_host}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e893ef7d-7136-442f-8bd9-31b5a5276518",
+ "metadata": {},
+ "source": [
+ "In the rest of the tutorial, the `endpoint`, `http_method`, and `payload`
variables are updated to accomplish different tasks."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ebd8c7db-c39f-4ef7-86ec-81f405e02550",
+ "metadata": {},
+ "source": [
+ "## Segment size\n",
+ "\n",
+ "A segment is the smallest unit of storage in Druid.\n",
+ "It is recommended that you optimize your segment file size at ingestion
time for Druid to operate well under a heavy query load.\n",
+ "\n",
+ "Consider the following to optimize your segment file size:\n",
+ "\n",
+ "- The number of rows per segment should be around five million. You can
set the number of rows per segment using the `rowsPerSegment` query context
parameter in the [Druid SQL
API](https://druid.apache.org/docs/latest/querying/sql-api.html) or as a [JDBC
connection properties
object](https://druid.apache.org/docs/latest/querying/sql-jdbc.html). To
specify the `rowsPerSegment` parameters in the Druid web console, navigate to
the **Query** page, then click **Engine > Edit context** to bring up the **Edit
query context** dialog. For more information on how to specify query context
parameters, see [Setting the query
context](https://druid.apache.org/docs/latest/querying/sql-query-context.html#setting-the-query-context).\n",
Review Comment:
We should demonstrate how to do this using the API. See
https://github.com/apache/druid/pull/13465/files#r1061870567
```
{
"query": "SELECT 1 + 1",
"context": {
"<key>": "<value>",
"rowsPerSegment": 5000000
}
}
```
Consider excluding the information about how to set the query context in
the UI, since this is an API-based tutorial.
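In the notebook's own Python/requests style, a cell demonstrating this could look like the sketch below. The `/druid/v2/sql/task` endpoint and the `context` field follow the Druid SQL API the tutorial already links to; the `SELECT 1 + 1` query is just a placeholder mirroring the JSON above.

```python
import json

# Sketch of a notebook cell that sets rowsPerSegment in the query context.
druid_host = "http://localhost:8888"
endpoint = "/druid/v2/sql/task"

payload = json.dumps({
    "query": "SELECT 1 + 1",
    "context": {
        "rowsPerSegment": 5000000
    }
})

# In the notebook this would then be submitted with something like:
# response = requests.post(druid_host + endpoint, data=payload,
#                          headers={"Content-Type": "application/json"})
print(payload)
```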
##########
examples/quickstart/jupyter-notebooks/partitioned-by-tutorial.ipynb:
##########
@@ -0,0 +1,428 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "ad4e60b6",
+ "metadata": {
+ "deletable": true,
+ "editable": true,
+ "tags": []
+ },
+ "source": [
+ "# Tutorial: Druid SQL segment sizing and partitioning\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ " \n",
+ "Partitioning is a method of organizing a large datasource into
independent partitions.\n",
+ "Partitioning reduces the size of your data and increases query
performance.\n",
Review Comment:
I see that we are only covering time in this tutorial. That is fine, so you
won't need to go into so much detail about secondary partitioning, but we
should mention it and note that it will be covered in another tutorial. I think
the scope you've already set out is good enough for this tutorial.
##########
examples/quickstart/jupyter-notebooks/partitioned-by-tutorial.ipynb:
##########
@@ -0,0 +1,428 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "ad4e60b6",
+ "metadata": {
+ "deletable": true,
+ "editable": true,
+ "tags": []
+ },
+ "source": [
+ "# Tutorial: Druid SQL segment sizing and partitioning\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ " \n",
+ "Partitioning is a method of organizing a large datasource into
independent partitions.\n",
+ "Partitioning reduces the size of your data and increases query
performance.\n",
+ "\n",
+ "At ingestion, Apache Druid always partitions its data by time.\n",
+ "Each time chunk is then divided into one or more
[segments](https://druid.apache.org/docs/latest/design/segments.html).\n",
Review Comment:
I feel like the description here is best (Druid docs could use some
consolidation):
https://druid.apache.org/docs/latest/design/architecture.html#datasources-and-segments
> Druid data is stored in datasources, which are similar to tables in a
traditional RDBMS. Each datasource is partitioned by time and, optionally,
further partitioned by other attributes.
> Each time range is called a chunk (for example, a single day, if your
datasource is partitioned by day). Within a chunk, data is partitioned into one
or more [segments](https://druid.apache.org/docs/latest/design/segments.html).
> Each segment is a single file, typically comprising up to a few million
rows of data.
And then say that, this way, if you filter your query on time, Druid doesn't
have to open any files that don't match the filter.
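The day-granularity chunking in that description can be sketched as follows. This is illustrative only: Druid does this internally during ingestion, and the rows and names here are made up.

```python
from collections import defaultdict
from datetime import datetime

# Illustrative only: group rows into day-granularity time chunks,
# mimicking what PARTITIONED BY DAY does inside Druid.
rows = [
    {"__time": datetime(2016, 6, 27, 10, 0), "page": "Ke$ha"},
    {"__time": datetime(2016, 6, 27, 23, 59), "page": "Druid"},
    {"__time": datetime(2016, 6, 28, 0, 1), "page": "Ke$ha"},
]

chunks = defaultdict(list)
for row in rows:
    chunk_key = row["__time"].date()  # floor the timestamp to its DAY chunk
    chunks[chunk_key].append(row)

# Each chunk would then be written as one or more segment files.
for key in sorted(chunks):
    print(key, len(chunks[key]))
```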
##########
examples/quickstart/jupyter-notebooks/partitioned-by-tutorial.ipynb:
##########
@@ -0,0 +1,428 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "ad4e60b6",
+ "metadata": {
+ "deletable": true,
+ "editable": true,
+ "tags": []
+ },
+ "source": [
+ "# Tutorial: Druid SQL segment sizing and partitioning\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ " \n",
+ "Partitioning is a method of organizing a large datasource into
independent partitions.\n",
+ "Partitioning reduces the size of your data and increases query
performance.\n",
+ "\n",
+ "At ingestion, Apache Druid always partitions its data by time.\n",
+ "Each time chunk is then divided into one or more
[segments](https://druid.apache.org/docs/latest/design/segments.html).\n",
+ "\n",
+ "This tutorial describes how to configure partitioning for the Druid SQL
ingestion method. For information about partitioning configurations supported
by other ingestion methods, see [How to configure
partitioning](https://druid.apache.org/docs/latest/ingestion/partitioning.html#how-to-configure-partitioning)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8d6bbbcb",
+ "metadata": {
+ "deletable": true,
+ "tags": []
+ },
+ "source": [
+ "## Prerequisites\n",
Review Comment:
see same comments in #13465 related to removing the "prepare your
environment" section
https://github.com/apache/druid/pull/13465/files#r1061863499
And regarding the formatting
https://github.com/apache/druid/pull/13465/files#r1061870567
##########
examples/quickstart/jupyter-notebooks/partitioned-by-tutorial.ipynb:
##########
@@ -0,0 +1,428 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "ad4e60b6",
+ "metadata": {
+ "deletable": true,
+ "editable": true,
+ "tags": []
+ },
+ "source": [
+ "# Tutorial: Druid SQL segment sizing and partitioning\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ " \n",
+ "Partitioning is a method of organizing a large datasource into
independent partitions.\n",
+ "Partitioning reduces the size of your data and increases query
performance.\n",
+ "\n",
+ "At ingestion, Apache Druid always partitions its data by time.\n",
+ "Each time chunk is then divided into one or more
[segments](https://druid.apache.org/docs/latest/design/segments.html).\n",
+ "\n",
+ "This tutorial describes how to configure partitioning for the Druid SQL
ingestion method. For information about partitioning configurations supported
by other ingestion methods, see [How to configure
partitioning](https://druid.apache.org/docs/latest/ingestion/partitioning.html#how-to-configure-partitioning)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8d6bbbcb",
+ "metadata": {
+ "deletable": true,
+ "tags": []
+ },
+ "source": [
+ "## Prerequisites\n",
+ "\n",
+ "Make sure that you meet the requirements outlined in the README.md file
of the [apache/druid
repo](https://github.com/apache/druid/tree/master/examples/quickstart/jupyter-notebooks/).\n",
+ "Specifically, you need the following:\n",
+ "- Knowledge of SQL\n",
+ "- [Python3](https://www.python.org/downloads/)\n",
+ "- [The `requests` package for
Python](https://requests.readthedocs.io/en/latest/user/install/)\n",
+ "- [JupyterLab](https://jupyter.org/install#jupyterlab) (recommended) or
[Jupyter Notebook](https://jupyter.org/install#jupyter-notebook) running on a
non-default port. Druid and Jupyter both default to port `8888`, so you need to
start Jupyter on a different port. \n",
+ "- An available Druid instance. This tutorial uses the `micro-quickstart`
configuration described in the [Druid
quickstart](https://druid.apache.org/docs/latest/tutorials/index.html), so no
authentication or authorization is required unless explicitly mentioned. If you
haven’t already, download Druid version 24.0 or higher and start Druid services
as described in the quickstart."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8f8e64f0-c29a-473c-8783-a2ff8648acd7",
+ "metadata": {},
+ "source": [
+ "## Prepare your environment\n",
+ "\n",
+ "Start by running the following cell. It imports the required Python
packages and defines a variable for the Druid host."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b7f08a52",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "import requests\n",
+ "import json\n",
+ "\n",
+ "# druid_host is the hostname and port for your Druid deployment. \n",
+ "# In a distributed environment, use the Router service as the
`druid_host`. \n",
+ "druid_host = \"http://localhost:8888\"\n",
+ "dataSourceName = \"partitioning-tutorial\"\n",
+ "print(f\"\\033[1mDruid host\\033[0m: {druid_host}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e893ef7d-7136-442f-8bd9-31b5a5276518",
+ "metadata": {},
+ "source": [
+ "In the rest of the tutorial, the `endpoint`, `http_method`, and `payload`
variables are updated to accomplish different tasks."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ebd8c7db-c39f-4ef7-86ec-81f405e02550",
+ "metadata": {},
+ "source": [
+ "## Segment size\n",
+ "\n",
+ "A segment is the smallest unit of storage in Druid.\n",
+ "It is recommended that you optimize your segment file size at ingestion
time for Druid to operate well under a heavy query load.\n",
+ "\n",
+ "Consider the following to optimize your segment file size:\n",
+ "\n",
+ "- The number of rows per segment should be around five million. You can
set the number of rows per segment using the `rowsPerSegment` query context
parameter in the [Druid SQL
API](https://druid.apache.org/docs/latest/querying/sql-api.html) or as a [JDBC
connection properties
object](https://druid.apache.org/docs/latest/querying/sql-jdbc.html). To
specify the `rowsPerSegment` parameters in the Druid web console, navigate to
the **Query** page, then click **Engine > Edit context** to bring up the **Edit
query context** dialog. For more information on how to specify query context
parameters, see [Setting the query
context](https://druid.apache.org/docs/latest/querying/sql-query-context.html#setting-the-query-context).\n",
+ "- Segment file size should be within the range of 300-700 MB. The number
of rows per segment takes precedence over the segment byte size. \n",
+ "\n",
+ "For more information on segment sizing, see [Segment size
optimization](https://druid.apache.org/docs/latest/operations/segment-optimization.html)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "84cb68a0-beb1-47d5-9fd5-384ea0caa35d",
+ "metadata": {},
+ "source": [
+ "## PARTITIONED BY\n",
+ "\n",
+ "In Druid SQL, the granularity of a segment is defined by the granularity
of the PARTITIONED BY clause.\n",
+ "\n",
+
"[INSERT](https://druid.apache.org/docs/latest/multi-stage-query/reference.html#insert)
and
[REPLACE](https://druid.apache.org/docs/latest/multi-stage-query/reference.html#replace)
statements both require the PARTITIONED BY clause.\n",
+ "\n",
+ "PARTITIONED BY accepts the following time granularity arguments:\n",
+ "- `time_unit`\n",
+ "- `TIME_FLOOR(__time, period)` \n",
+ "- `FLOOR(__time TO time_unit)`\n",
+ "- `ALL` or `ALL TIME`\n",
+ "\n",
+ "Continue reading to learn about each of the supported arguments.\n",
+ "\n",
+ "### Time unit\n",
+ "\n",
+ "`PARTITIONED BY(time_unit)`. Partition by `SECOND`, `MINUTE`, `HOUR`,
`DAY`, `WEEK`, `MONTH`, `QUARTER`, or `YEAR`.\n",
+ "\n",
+ "For example, run the following cell to ingest data from an external
source into a table named `partitioning-tutorial` and partition the datasource
by `DAY`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "045f782c-74d8-4447-9487-529071812b51",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = \"/druid/v2/sql/task\"\n",
+ "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
+ "http_method = \"POST\"\n",
+ "\n",
+ "# If you already have an existing datasource named partitioning-tutorial,
use REPLACE INTO instead of INSERT INTO.\n",
+ "payload = json.dumps({\n",
+ "\"query\": \"INSERT INTO \\\"partitioning-tutorial\\\" SELECT
TIME_PARSE(\\\"timestamp\\\") \\\n",
+ " AS __time, * FROM TABLE \\\n",
+ " (EXTERN('{\\\"type\\\": \\\"http\\\", \\\"uris\\\":
[\\\"https://druid.apache.org/data/wikipedia.json.gz\\\"]}', '{\\\"type\\\":
\\\"json\\\"}', '[{\\\"name\\\": \\\"added\\\", \\\"type\\\": \\\"long\\\"},
{\\\"name\\\": \\\"channel\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"cityName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"comment\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"commentLength\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\":
\\\"countryIsoCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"countryName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"deleted\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"delta\\\",
\\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"deltaBucket\\\", \\\"type\\\":
\\\"string\\\"}, {\\\"name\\\": \\\"diffUrl\\\", \\\"type\\\": \\\"string\\\"},
{\\\"name\\\": \\\"flags\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"isAnonymous\\\", \\\"type\\\": \\\"string\\\"}, {\\\
"name\\\": \\\"isMinor\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"isNew\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isRobot\\\",
\\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isUnpatrolled\\\",
\\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"metroCode\\\", \\\"type\\\":
\\\"string\\\"}, {\\\"name\\\": \\\"namespace\\\", \\\"type\\\":
\\\"string\\\"}, {\\\"name\\\": \\\"page\\\", \\\"type\\\": \\\"string\\\"},
{\\\"name\\\": \\\"regionIsoCode\\\", \\\"type\\\": \\\"string\\\"},
{\\\"name\\\": \\\"regionName\\\", \\\"type\\\": \\\"string\\\"},
{\\\"name\\\": \\\"timestamp\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"user\\\", \\\"type\\\": \\\"string\\\"}]')) \\\n",
Review Comment:
This datasource file only has 24K rows. Wondering if we could use the taxi
data with this:
```
REPLACE INTO "trips_xaa" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
EXTERN(
'{"type":"http","uris":["https://static.imply.io/example-data/trips/trips_xaa.csv.gz","https://static.imply.io/example-data/trips/trips_xab.csv.gz","https://static.imply.io/example-data/trips/trips_xac.csv.gz"]}',
'{"type":"csv","findColumnsFromHeader":false,"columns":["trip_id","vendor_id","pickup_datetime","dropoff_datetime","store_and_fwd_flag","rate_code_id","pickup_longitude","pickup_latitude","dropoff_longitude","dropoff_latitude","passenger_count","trip_distance","fare_amount","extra","mta_tax","tip_amount","tolls_amount","ehail_fee","improvement_surcharge","total_amount","payment_type","trip_type","pickup","dropoff","cab_type","precipitation","snow_depth","snowfall","max_temperature","min_temperature","average_wind_speed","pickup_nyct2010_gid","pickup_ctlabel","pickup_borocode","pickup_boroname","pickup_ct2010","pickup_boroct2010","pickup_cdeligibil","pickup_ntacode","pickup_ntaname","pickup_puma","dropoff_nyct2010_gid","dropoff_ctlabel","dropoff_borocode","dropoff_boroname","dropoff_ct2010","dropoff_boroct2010","dropoff_cdeligibil","dropoff_ntacode","dropoff_ntaname","dropoff_puma"]}',
'[{"name":"trip_id","type":"long"},{"name":"vendor_id","type":"long"},{"name":"pickup_datetime","type":"string"},{"name":"dropoff_datetime","type":"string"},{"name":"store_and_fwd_flag","type":"string"},{"name":"rate_code_id","type":"long"},{"name":"pickup_longitude","type":"double"},{"name":"pickup_latitude","type":"double"},{"name":"dropoff_longitude","type":"double"},{"name":"dropoff_latitude","type":"double"},{"name":"passenger_count","type":"long"},{"name":"trip_distance","type":"double"},{"name":"fare_amount","type":"double"},{"name":"extra","type":"double"},{"name":"mta_tax","type":"double"},{"name":"tip_amount","type":"double"},{"name":"tolls_amount","type":"double"},{"name":"ehail_fee","type":"string"},{"name":"improvement_surcharge","type":"string"},{"name":"total_amount","type":"double"},{"name":"payment_type","type":"long"},{"name":"trip_type","type":"string"},{"name":"pickup","type":"string"},{"name":"dropoff","type":"string"},{"name":"cab_type","type":"string"},{
"name":"precipitation","type":"double"},{"name":"snow_depth","type":"long"},{"name":"snowfall","type":"long"},{"name":"max_temperature","type":"long"},{"name":"min_temperature","type":"long"},{"name":"average_wind_speed","type":"double"},{"name":"pickup_nyct2010_gid","type":"long"},{"name":"pickup_ctlabel","type":"long"},{"name":"pickup_borocode","type":"long"},{"name":"pickup_boroname","type":"string"},{"name":"pickup_ct2010","type":"long"},{"name":"pickup_boroct2010","type":"long"},{"name":"pickup_cdeligibil","type":"string"},{"name":"pickup_ntacode","type":"string"},{"name":"pickup_ntaname","type":"string"},{"name":"pickup_puma","type":"long"},{"name":"dropoff_nyct2010_gid","type":"long"},{"name":"dropoff_ctlabel","type":"long"},{"name":"dropoff_borocode","type":"long"},{"name":"dropoff_boroname","type":"string"},{"name":"dropoff_ct2010","type":"long"},{"name":"dropoff_boroct2010","type":"long"},{"name":"dropoff_cdeligibil","type":"string"},{"name":"dropoff_ntacode","type":"strin
g"},{"name":"dropoff_ntaname","type":"string"},{"name":"dropoff_puma","type":"long"}]'
)
))
SELECT
TIME_PARSE("pickup_datetime") AS "__time",
"trip_id",
"vendor_id",
"dropoff_datetime",
"store_and_fwd_flag",
"rate_code_id",
"pickup_longitude",
"pickup_latitude",
"dropoff_longitude",
"dropoff_latitude",
"passenger_count",
"trip_distance",
"fare_amount",
"extra",
"mta_tax",
"tip_amount",
"tolls_amount",
"ehail_fee",
"improvement_surcharge",
"total_amount",
"payment_type",
"trip_type",
"pickup",
"dropoff",
"cab_type",
"precipitation",
"snow_depth",
"snowfall",
"max_temperature",
"min_temperature",
"average_wind_speed",
"pickup_nyct2010_gid",
"pickup_ctlabel",
"pickup_borocode",
"pickup_boroname",
"pickup_ct2010",
"pickup_boroct2010",
"pickup_cdeligibil",
"pickup_ntacode",
"pickup_ntaname",
"pickup_puma",
"dropoff_nyct2010_gid",
"dropoff_ctlabel",
"dropoff_borocode",
"dropoff_boroname",
"dropoff_ct2010",
"dropoff_boroct2010",
"dropoff_cdeligibil",
"dropoff_ntacode",
"dropoff_ntaname",
"dropoff_puma"
FROM "ext"
PARTITIONED BY DAY
```
Also, consider not adding every single column. We should load only enough
columns to illustrate our point.
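A trimmed-down cell could build the ingestion payload from a small column subset, for example like this sketch. The data URL is the one the notebook already uses; the particular columns chosen here are arbitrary, picked only to keep the EXTERN schema short.

```python
import json

# Sketch: build the ingestion payload with only a few columns, enough to
# illustrate PARTITIONED BY without listing the full schema.
columns = json.dumps([
    {"name": "channel", "type": "string"},
    {"name": "page", "type": "string"},
    {"name": "added", "type": "long"},
    {"name": "timestamp", "type": "string"},
])
input_source = json.dumps({"type": "http",
                           "uris": ["https://druid.apache.org/data/wikipedia.json.gz"]})
input_format = json.dumps({"type": "json"})

query = (
    'INSERT INTO "partitioning-tutorial" '
    'SELECT TIME_PARSE("timestamp") AS __time, channel, page, added '
    f"FROM TABLE(EXTERN('{input_source}', '{input_format}', '{columns}')) "
    "PARTITIONED BY DAY"
)
payload = json.dumps({"query": query})
print(payload)
```

Building the payload with `json.dumps` this way also avoids the long runs of escaped quotes in the current cell.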
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]