sergioferragut commented on code in PR #14742:
URL: https://github.com/apache/druid/pull/14742#discussion_r1289195932
##########
examples/quickstart/jupyter-notebooks/notebooks/02-ingestion/01-streaming-from-kafka.ipynb:
##########
@@ -126,81 +105,70 @@
"metadata": {},
"outputs": [],
"source": [
- "from kafka import KafkaProducer\n",
- "from kafka import KafkaConsumer\n",
+ "# Use the kafka_host variable when connecting to Kafka\n",
+ "if 'KAFKA_HOST' not in os.environ.keys():\n",
+ "    kafka_host=\"localhost:9092\"\n",
+ "else:\n",
+ "    kafka_host=f\"{os.environ['KAFKA_HOST']}:9092\"\n",
"\n",
- "# Kafka runs on kafka:9092 in multi-container tutorial application\n",
- "producer = KafkaProducer(bootstrap_servers='kafka:9092')\n",
+ "# The Kafka topic we will work with:\n",
"topic_name = \"social_media\""
]
},
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Create the `social_media` topic and send a sample event. The `send()` command returns a metadata descriptor for the record."
- ]
- },
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
- "event = {\n",
- " \"__time\": \"2023-01-03T16:40:21.501\",\n",
- " \"username\": \"willow\",\n",
- " \"post_title\": \"This title is required\",\n",
- " \"views\": 15284,\n",
- " \"upvotes\": 124,\n",
- " \"comments\": 21,\n",
- " \"edited\": \"True\"\n",
- "}\n",
"\n",
- "producer.send(topic_name, json.dumps(event).encode('utf-8'))"
+ "import json\n",
+ "\n",
+ "# Shortcuts for the display and SQL APIs\n",
+ "display = druid.display\n",
+ "sql_client = druid.sql\n",
+ "\n",
+ "\n",
+ "# client for Data Generator API\n",
+ "datagen = druidapi.rest.DruidRestClient(\"http://datagen:9999\")\n",
+ "\n",
+ "# client for Druid API\n",
+ "rest_client = druid.rest\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "To verify that the Kafka topic stored the event, create a consumer client to read records from the Kafka cluster, and get the next (only) message:"
+ "## Publish generated data directly to Kafka topic"
]
},
{
- "cell_type": "code",
- "execution_count": null,
+ "cell_type": "markdown",
"metadata": {},
- "outputs": [],
"source": [
- "consumer = KafkaConsumer(topic_name, bootstrap_servers=['kafka:9092'], auto_offset_reset='earliest',\n",
- " enable_auto_commit=True)\n",
- "\n",
- "print(next(consumer).value.decode('utf-8'))"
+ "We use the data generator included in this cluster to produce a stream of messages. The data generator creates the topic if it does not exist. This configuration outputs messages to the `social_media` topic."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Load data into Kafka topic"
+ "### Generate data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "Instead of manually creating events to send to the Kafka topic, use a data generator to simulate a continuous data stream. This tutorial makes use of Druid Data Driver to simulate a continuous data stream into the `social_media` Kafka topic. To learn more about the Druid Data Driver, see the Druid Summit talk, [Generating Time centric Data for Apache Druid](https://www.youtube.com/watch?v=3zAOeLe3iAo).\n",
- "\n",
- "In this notebook, you use a background process to continuously load data into the Kafka topic.\n",
- "This allows you to keep executing commands in this notebook while data is constantly being streamed into the topic."
+ "Instead of manually creating events to send to the Kafka topic, use a data generator to simulate a continuous data stream. This tutorial uses the Druid Data Generator to stream data continuously into the `social_media` Kafka topic. To learn more about the Druid Data Generator, see the [project repository](https://github.com/implydata/druid-datagenerator) or the [data generation notebook](../01-introduction/02-datagen-intro.ipynb)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "Run the following cells to load sample data into the `social_media` Kafka topic:"
+ "Run the following cells to load sample data into the `social_media` Kafka topic. The generator is configured to produce events until it reaches 50,000 messages:"
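
The `KAFKA_HOST` fallback logic in the hunk above can be captured in a small helper. A minimal sketch (the function name is illustrative, not part of the notebook; note that Kafka bootstrap addresses are plain `host:port` pairs with no URL scheme):

```python
import os

def resolve_kafka_host(env=None):
    """Return the Kafka bootstrap address, preferring the KAFKA_HOST
    environment variable (set in the multi-container tutorial setup)
    and falling back to localhost otherwise."""
    if env is None:
        env = os.environ
    host = env.get("KAFKA_HOST", "localhost")
    # Bootstrap servers are host:port, without an http:// prefix.
    return f"{host}:9092"

print(resolve_kafka_host({}))                       # localhost:9092
print(resolve_kafka_host({"KAFKA_HOST": "kafka"}))  # kafka:9092
```
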
Review Comment:
done
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]