Re: [PR] Adding data generation pod to jupyter notebooks deployment (druid)

via GitHub Fri, 04 Aug 2023 10:19:38 -0700


sergioferragut commented on code in PR #14742:
URL: https://github.com/apache/druid/pull/14742#discussion_r1284675234



##########
examples/quickstart/jupyter-notebooks/notebooks/01-introduction/02-datagen-intro.ipynb:
##########
@@ -0,0 +1,540 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "9e07b3f5-d919-4179-91a1-0f6b66c42757",
+   "metadata": {},
+   "source": [
+    "# Data Generator Server\n",
+    "The default docker compose deployment includes a data generation service 
created from the published docker image `imply/datagen:latest`. \n",
+    "This image is built by the project 
https://github.com/implydata/druid-datagenerator. \n",
+    "\n",
+    "To interact with the data generation service, you can use the rest client 
provided in the druidapi python package."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f84766c7-c6a5-4496-91a3-abdb8ddd2375",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import druidapi\n",
+    "import os\n",
+    "\n",
+    "# Datagen client \n",
+    "datagen = druidapi.rest.DruidRestClient(\"http://datagen:9999\";)\n",
+    "\n",
+    "if (os.environ['DRUID_HOST'] == None):\n",
+    "    druid_host=f\"http://router:8888\"\n";,
+    "else:\n",
+    "    druid_host=f\"http://{os.environ['DRUID_HOST']}:8888\"\n",
+    "\n",
+    "# Druid client\n",
+    "druid = druidapi.jupyter_client(druid_host)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c54af617-0998-4010-90c3-9b5a38a09a5f",
+   "metadata": {},
+   "source": [
+    "### List available configurations\n",
+    "Use /list API to get the data generator's available `config_file` values 
with pre-defined data generator schemas."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1ba6a80a-c49b-4abf-943b-9dad82f2ae13",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "display(datagen.get(f\"/list\", require_ok=False).json())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ae88a3b7-60da-405d-bcf4-fb4affcfe973",
+   "metadata": {},
+   "source": [
+    "### Generate a data file for back filling history\n",
+    "When generating a file for backfill purposes it is useful to be able to 
select the start time and the duration of the simulation.\n",
+    "This example shows how to do that:\n",
+    "- \"target\" specifies \"type\":\"file\" which generates a data file.\n",
+    "- \"path\" within the \"target\" is only a filename, it will ignore any 
path specified on the file.\n",
+    "- The data generator simulates time when you specify a start time in the 
\"time_type\" property and a duration in the \"time\" property.\n",
+    "- \"concurrency\" indicates the maximum number of entities used 
concurrently to generate events. Each entity is a separate state machine that 
simulates things like user sessions, IoT devices, or other concurrent sources 
of event data. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "811ff58f-75af-4092-a08d-5e07a51592ff",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from datetime import datetime, timedelta\n",
+    "import json\n",
+    "\n",
+    "# determine start time, in this example we are starting one hour ago \n",
+    "startDateTime = (datetime.now() - timedelta(hours = 
1)).strftime('%Y-%m-%dT%H:%M:%S.001')\n",
+    "print(f\"Starting to generate history at {startDateTime}.\")\n",
+    "\n",
+    "job_name=\"gen_clickstream1\"\n",
+    "\n",
+    "headers = {\n",
+    "  'Content-Type': 'application/json'\n",
+    "}\n",
+    "\n",
+    "# this request if generating a data file at on the datagen server\n",
+    "datagen_request = {\n",
+    "    \"name\": job_name,\n",
+    "    \"target\": { \"type\": \"file\", \"path\":\"clicks.json\"},\n",
+    "    \"config_file\": \"clickstream/clickstream.json\", \n",
+    "    \"time\": \"1h\",\n",
+    "    \"concurrency\":100,\n",
+    "    \"time_type\": startDateTime\n",
+    "}\n",
+    "response = datagen.post(\"/start\", json.dumps(datagen_request), 
headers=headers, require_ok=False)\n",
+    "response.json()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d407d1d9-3f01-4128-a014-6a5f371c25a5",
+   "metadata": {},
+   "source": [
+    "### Display jobs\n",
+    "Use the /jobs API to get the current jobs and their status."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3de698c5-bcf4-40c7-b295-728fb54d1f0a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "display(datagen.get(f\"/jobs\").json())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "972ebed0-34a1-4ad2-909d-69b8b27c3046",
+   "metadata": {},
+   "source": [
+    "### Get status of a job\n",
+    "Use the /jobs API to get the current jobs and their status."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "debce4f8-9c16-476c-9593-21ec984985d2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "display(datagen.get(f\"/status/{job_name}\", require_ok=False).json())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ef818d78-6aa6-4d38-8a43-83416aede96f",
+   "metadata": {},
+   "source": [
+    "### Stop a job\n",
+    "Use the /stop/\\<job_name> API to stop a job."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7631b8b8-d3d6-4803-9162-587f440d2ef2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "display(datagen.post(f\"/stop/{job_name}\", '').json())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0a8dc7d3-64e5-41e3-8c28-c5f19c0536f5",
+   "metadata": {},
+   "source": [
+    "### List files created on datagen server\n",
+    "Use the /files API to list files available on the server."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "06ee36bd-2d2b-4904-9987-10636cf52aac",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "display(datagen.get(f\"/files\", '').json())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "83ef9edb-98e2-45b4-88e8-578703faedc1",
+   "metadata": {},
+   "source": [
+    "### Batch Loading of Generated Files\n",
+    "Use a [Druid HTTP input 
source](https://druid.apache.org/docs/latest/ingestion/native-batch-input-sources.html#http-input-source)
 in the [EXTERN 
function](https://druid.apache.org/docs/latest/multi-stage-query/reference.html#extern-function)
 of a [SQL Based 
ingestion](https://druid.apache.org/docs/latest/multi-stage-query/index.html) 
to load generated files.\n",
+    "The files can be accessed by name using the 
`http://datagen:9999/file/<name of the file>` or if ingesting into a Druid 
instance outside of docker, but still running locally, then use 
`http://localhost:9999/file/<name of the file>`.\n",
+    "The following example assumes that both Druid and the data generator 
server are running in docker compose."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0d72b015-f8ec-4713-b6f2-fe7a15afff59",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql = '''\n",
+    "REPLACE INTO \"clicks\" OVERWRITE ALL\n",
+    "WITH \"ext\" AS (SELECT *\n",
+    "FROM TABLE(\n",
+    "  EXTERN(\n",
+    "    
'{\"type\":\"http\",\"uris\":[\"http://datagen:9999/file/clicks.json\"]}',\n",
+    "    '{\"type\":\"json\"}'\n",
+    "  )\n",
+    ") EXTEND (\"time\" VARCHAR, \"user_id\" VARCHAR, \"event_type\" VARCHAR, 
\"client_ip\" VARCHAR, \"client_device\" VARCHAR, \"client_lang\" VARCHAR, 
\"client_country\" VARCHAR, \"referrer\" VARCHAR, \"keyword\" VARCHAR, 
\"product\" VARCHAR))\n",
+    "SELECT\n",
+    "  TIME_PARSE(\"time\") AS \"__time\",\n",
+    "  \"user_id\",\n",
+    "  \"event_type\",\n",
+    "  \"client_ip\",\n",
+    "  \"client_device\",\n",
+    "  \"client_lang\",\n",
+    "  \"client_country\",\n",
+    "  \"referrer\",\n",
+    "  \"keyword\",\n",
+    "  \"product\"\n",
+    "FROM \"ext\"\n",
+    "PARTITIONED BY DAY\n",
+    "'''\n",
+    "druid.sql.run_task(sql)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b0997b38-02c2-483e-bd15-439c4bf0097a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "druid.display.sql('''\n",
+    "SELECT  event_type, \n",
+    "        count( DISTINCT \"user_id\") users, \n",
+    "        count( DISTINCT \"client_ip\") ips, \n",
+    "        count( DISTINCT \"client_ip\") - count( DISTINCT \"user_id\") 
ips_minus_users\n",
+    "FROM \"clicks\"\n",
+    "GROUP BY 1\n",
+    "HAVING count( DISTINCT \"user_id\") - count( DISTINCT \"client_ip\") < 
0\n",
+    "ORDER BY 4 DESC\n",
+    "''')\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "66ec013f-28e4-4d5a-94a6-06e0ed537b4e",
+   "metadata": {},
+   "source": [
+    "## Generating custom data\n",
+    "\n",
+    "You can fine the full set of configuration option in the [data generator 
project's 
readme](https://github.com/implydata/druid-datagenerator#data-generator-configuration).\n",
+    "\n",
+    "In this section we use a simple custom configuration as an example to 
generate some data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d6451310-b7dd-4b39-a23b-7b735b152d6c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "gen_config = {\n",
+    "  \"emitters\": [\n",
+    "    {\n",
+    "      \"name\": \"simple_record\",\n",
+    "      \"dimensions\": [\n",
+    "        {\n",
+    "          \"type\": \"string\",\n",
+    "          \"name\": \"random_string_column\",\n",
+    "          \"length_distribution\": {\n",
+    "            \"type\": \"constant\",\n",
+    "            \"value\": 13\n",
+    "          },\n",
+    "          \"cardinality\": 0,\n",
+    "          \"chars\": \"#.abcdefghijklmnopqrstuvwxyz\"\n",
+    "        },\n",
+    "        {\n",
+    "          \"type\": \"int\",\n",
+    "          \"name\": \"distributed_number\",\n",
+    "          \"distribution\": {\n",
+    "            \"type\": \"uniform\",\n",
+    "            \"min\": 0,\n",
+    "            \"max\": 1000\n",
+    "          },\n",
+    "          \"cardinality\": 10,\n",
+    "          \"cardinality_distribution\": {\n",
+    "            \"type\": \"exponential\",\n",
+    "            \"mean\": 5\n",
+    "          }\n",
+    "        }\n",
+    "      ]\n",
+    "    }\n",
+    "  ],\n",
+    "  \"interarrival\": {\n",
+    "    \"type\": \"constant\",\n",
+    "    \"value\": 1\n",
+    "  },\n",
+    "  \"states\": [\n",
+    "    {\n",
+    "      \"name\": \"state_1\",\n",
+    "      \"emitter\": \"simple_record\",\n",
+    "      \"delay\": {\n",
+    "        \"type\": \"constant\",\n",
+    "        \"value\": 1\n",
+    "      },\n",
+    "      \"transitions\": [\n",
+    "        {\n",
+    "          \"next\": \"state_1\",\n",
+    "          \"probability\": 1.0\n",
+    "        }\n",
+    "      ]\n",
+    "    }\n",
+    "  ]\n",
+    "}\n",
+    "\n",
+    "target = { \"type\":\"file\", \"path\":\"sample_data.json\"}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "89a22645-aea5-4c15-b81a-959b27df731f",
+   "metadata": {},
+   "source": [
+    "Now, instead of using a config_file, we use the config attribute of the 
request to use our new custom data generator."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e5e5c535-3474-42b4-9772-14279e712f3d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# generate 1 hour of simulated time using custom configuration\n",
+    "datagen_request = {\n",
+    "    \"name\": \"sample_custom\",\n",
+    "    \"target\": target,\n",
+    "    \"config\": gen_config, \n",
+    "    \"time\": \"1h\",\n",
+    "    \"concurrency\":10,\n",
+    "    \"time_type\": \"SIM\"\n",
+    "}\n",
+    "response = datagen.post(\"/start\", json.dumps(datagen_request), 
headers=headers, require_ok=False)\n",
+    "response.json()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "952386f7-8181-4325-972b-5f30dc12cf21",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "display(datagen.get(f\"/jobs\", require_ok=False).json())"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "470b3a2a-4fd9-45a2-9221-497d906f62a9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "\n",
+    "display( datagen.get(f\"/file/sample_data.json\").content[:1024])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "77bff054-0f16-4fd5-8ade-2d44b30d0cf2",
+   "metadata": {},
+   "source": [
+    "## Streaming generated data\n",
+    "\n",
+    "The data generator works exactly the same whether it is outputing data to 
a file or publishing messages into a stream, all you need to change is the 
target configuration.\n",
+    "\n",
+    "To use the kafka container running on the docker compose set use the host 
name `kafka:9092`. This piece of code uses the KAFKA_HOST variable specified 
when bringing up the cluster to designate the appropriate host. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9959b7c3-6223-479d-b0c2-115a1c555090",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "if (os.environ['KAFKA_HOST'] == None):\n",
+    "    kafka_host=f\"kafka:9092\"\n",
+    "else:\n",
+    "    kafka_host=f\"{os.environ['KAFKA_HOST']}:9092\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "497abc18-6538-4536-a17f-fe10c4367611",
+   "metadata": {},
+   "source": [
+    "The simplest `target` object for kafka (and similarly confluent) is:"

Review Comment:
   actually, I'm referring to the value 'confluent' for the target type, 
perhaps it is best not to mention here. I've already pointed them at the README 
on the data gen project which has these options spelled out. so maybe this is 
just "The simplest `target` object for kafka is:"



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Adding data generation pod to jupyter notebooks deployment (druid)

Reply via email to