Re: [PR] Adding data generation pod to jupyter notebooks deployment (druid)

via GitHub Fri, 04 Aug 2023 10:15:58 -0700


sergioferragut commented on code in PR #14742:
URL: https://github.com/apache/druid/pull/14742#discussion_r1284672436



##########
examples/quickstart/jupyter-notebooks/notebooks/01-introduction/02-datagen-intro.ipynb:
##########
@@ -0,0 +1,540 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "9e07b3f5-d919-4179-91a1-0f6b66c42757",
+   "metadata": {},
+   "source": [
+    "# Data Generator Server\n",
+    "The default docker compose deployment includes a data generation service created from the published docker image `imply/datagen:latest`. \n",
+    "This image is built by the project https://github.com/implydata/druid-datagenerator. \n",
+    "\n",
+    "To interact with the data generation service, you can use the rest client provided in the druidapi python package."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f84766c7-c6a5-4496-91a3-abdb8ddd2375",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import druidapi\n",
+    "import os\n",
+    "\n",
+    "# Datagen client \n",
+    "datagen = druidapi.rest.DruidRestClient(\"http://datagen:9999\")\n",
+    "\n",
+    "if (os.environ['DRUID_HOST'] == None):\n",
+    "    druid_host=f\"http://router:8888\"\n",
+    "else:\n",
+    "    druid_host=f\"http://{os.environ['DRUID_HOST']}:8888\"\n",
+    "\n",
+    "# Druid client\n",
+    "druid = druidapi.jupyter_client(druid_host)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c54af617-0998-4010-90c3-9b5a38a09a5f",
+   "metadata": {},
+   "source": [
+    "### List available configurations\n",
+    "Use /list API to get the data generator's available `config_file` values with pre-defined data generator schemas."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1ba6a80a-c49b-4abf-943b-9dad82f2ae13",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "display(datagen.get(f\"/list\", require_ok=False).json())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ae88a3b7-60da-405d-bcf4-fb4affcfe973",
+   "metadata": {},
+   "source": [
+    "### Generate a data file for back filling history\n",
+    "When generating a file for backfill purposes it is useful to be able to select the start time and the duration of the simulation.\n",
+    "This example shows how to do that:\n",
+    "- \"target\" specifies \"type\":\"file\" which generates a data file.\n",
+    "- \"path\" within the \"target\" is only a filename, it will ignore any path specified on the file.\n",
+    "- The data generator simulates time when you specify a start time in the \"time_type\" property and a duration in the \"time\" property.\n",

Review Comment:
   The start timestamp you specify in "time_type" is the starting point: event 
timestamps begin there. The datagen configuration specifies the distribution 
of time between events, and each delay is calculated at runtime for each state 
machine. When simulating time, the simulated clock advances from event to 
event without actually waiting that amount of time, so the data is generated 
much faster. In REAL mode, each state machine waits for the delay and then 
publishes the next message.
   
   This explanation is a bit wordy, so I wonder whether it is worth it in the 
notebook. Any ideas on how to describe that in a few words? :-) 
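
   For illustration, the difference between the two modes could be sketched 
roughly as below. This is a hypothetical sketch, not the datagen 
implementation: the function names, the single event stream (rather than one 
clock per state machine), and the exponential delay distribution are all 
assumptions made for the example.

```python
import random
import time
from datetime import datetime, timedelta


def simulated_events(start, duration, mean_delay_s, rng):
    """Yield event timestamps by advancing a simulated clock, never sleeping."""
    clock = start
    end = start + duration
    while True:
        # Draw the gap to the next event. An exponential distribution is
        # an assumption here; the real distribution comes from the config.
        clock += timedelta(seconds=rng.expovariate(1.0 / mean_delay_s))
        if clock >= end:
            break
        yield clock


def real_events(count, mean_delay_s, rng, sleep=time.sleep):
    """Yield wall-clock timestamps, actually waiting out each delay."""
    for _ in range(count):
        sleep(rng.expovariate(1.0 / mean_delay_s))
        yield datetime.now()


rng = random.Random(42)
start = datetime(2023, 1, 1)
# One simulated hour of events is produced almost instantly.
events = list(simulated_events(start, timedelta(hours=1), 10.0, rng))
print(len(events), "simulated events between", events[0], "and", events[-1])
```

   In the simulated case the loop only does arithmetic on the clock, which is 
why a long backfill can be generated quickly; the real-time variant spends 
its time in `sleep`.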



