rajucomp commented on PR #1766:
URL: https://github.com/apache/stormcrawler/pull/1766#issuecomment-3734655502
Hi @rzo1. Followed the steps. Here is the output
## 1. Infrastructure Status
All Docker containers are running and healthy:
```
NAMES STATUS PORTS
supervisor Up 20 minutes
ui Up 20 minutes 127.0.0.1:8080->8080/tcp
nimbus Up 20 minutes
zookeeper Up 20 minutes 2181/tcp, 2888/tcp, 3888/tcp,
8080/tcp
mysql Up 20 minutes (healthy) 127.0.0.1:3306->3306/tcp
urlfrontier Up 20 minutes 127.0.0.1:7071->7071/tcp
```
**Key Points:**
- MySQL container shows `(healthy)` status indicating successful health
checks
- All Storm components (nimbus, supervisor, ui) are operational
- URLFrontier service is accessible on port 7071
---
## 2. Storm Worker Logs - URL Fetching & SQL Persistence
### 2.1 URLFrontier Spout Initialization
```log
2026-01-11 14:11:37.397 o.a.s.e.s.SpoutExecutor Thread-23-spout-executor[9,
9] [INFO] Opening spout spout:[9]
2026-01-11 14:11:37.425 o.a.s.u.ManagedChannelUtil
Thread-23-spout-executor[9, 9] [INFO] Initialisation of connection to
URLFrontier service on urlfrontier:7071
2026-01-11 14:11:37.540 o.a.s.u.Spout Thread-23-spout-executor[9, 9] [INFO]
Initialized URLFrontier Spout without crawlId
2026-01-11 14:11:37.541 o.a.s.e.s.SpoutExecutor Thread-23-spout-executor[9,
9] [INFO] Opened spout spout:[9]
2026-01-11 14:11:37.541 o.a.s.e.s.SpoutExecutor Thread-23-spout-executor[9,
9] [INFO] Activating spout spout:[9]
```
**Evidence:** URLFrontier Spout successfully connected to the URLFrontier
service.
### 2.2 URL Fetching (FetcherBolt)
```log
# Initial seed URL fetch - HTTP 308 redirect
2026-01-11 14:11:38.736 o.a.s.b.FetcherBolt FetcherThread #31 [INFO]
[Fetcher #4] Fetched https://kodis.iao.fraunhofer.de with status 308 in msec 69
# Successful page fetch - HTTP 200
2026-01-11 14:17:07.143 o.a.s.b.FetcherBolt FetcherThread #44 [INFO]
[Fetcher #4] Fetched https://www.kodis.iao.fraunhofer.de/ with status 200 in
msec 315
# Subsequent fetches showing continuous crawling
2026-01-11 14:22:08.262 o.a.s.b.FetcherBolt FetcherThread #0 [INFO] [Fetcher
#4] Fetched https://www.kodis.iao.fraunhofer.de/ with status 200 in msec 282
2026-01-11 14:27:09.316 o.a.s.b.FetcherBolt FetcherThread #19 [INFO]
[Fetcher #4] Fetched https://www.kodis.iao.fraunhofer.de/ with status 200 in
msec 290
```
**Evidence:** FetcherBolt successfully fetching URLs with proper HTTP status
codes.
### 2.3 HTML Parsing (JSoupParserBolt)
```log
2026-01-11 14:17:07.148 o.a.s.b.JSoupParserBolt Thread-15-parse-executor[6,
6] [INFO] Parsing : starting https://www.kodis.iao.fraunhofer.de/
2026-01-11 14:17:07.244 o.a.s.b.JSoupParserBolt Thread-15-parse-executor[6,
6] [INFO] Parsed https://www.kodis.iao.fraunhofer.de/ in 50 msec
2026-01-11 14:17:07.252 o.a.s.b.JSoupParserBolt Thread-15-parse-executor[6,
6] [INFO] Total for https://www.kodis.iao.fraunhofer.de/ - 58 msec
```
**Evidence:** JSoupParserBolt successfully parsing HTML content.
### 2.4 Content Indexing (StdOutIndexer)
```log
2026-01-11 14:17:07.253 o.a.s.i.StdOutIndexer Thread-20-index-executor[5, 5]
[INFO] content 561 chars
2026-01-11 14:17:07.253 o.a.s.i.StdOutIndexer Thread-20-index-executor[5, 5]
[INFO] url https://www.kodis.iao.fraunhofer.de/
2026-01-11 14:17:07.253 o.a.s.i.StdOutIndexer Thread-20-index-executor[5, 5]
[INFO] keywords IAO Kodis
2026-01-11 14:17:07.253 o.a.s.i.StdOutIndexer Thread-20-index-executor[5, 5]
[INFO] domain fraunhofer.de
2026-01-11 14:17:07.253 o.a.s.i.StdOutIndexer Thread-20-index-executor[5, 5]
[INFO] format html
2026-01-11 14:17:07.253 o.a.s.i.StdOutIndexer Thread-20-index-executor[5, 5]
[INFO] description 137 chars
2026-01-11 14:17:07.253 o.a.s.i.StdOutIndexer Thread-20-index-executor[5, 5]
[INFO] title Fraunhofer IAO - KODIS: Forschungs- und Innovationszentrum
Kognitive Dienstleistungssysteme
```
**Evidence:** Content successfully extracted and indexed with metadata.
### 2.5 SQL StatusUpdaterBolt - Batch Persistence
```log
# Single URL insert (redirect status)
2026-01-11 14:11:41.498 o.a.s.s.StatusUpdaterBolt pool-9-thread-1 [INFO]
About to execute batch - triggered by time. Due 1768140700807, now
1768140701498
2026-01-11 14:11:41.504 o.a.s.s.StatusUpdaterBolt pool-9-thread-1 [INFO]
Batched 1 inserts executed in 5 msec
# Batch insert of discovered URLs (27 URLs from parsed page)
2026-01-11 14:17:09.501 o.a.s.s.StatusUpdaterBolt pool-9-thread-1 [INFO]
About to execute batch - triggered by time. Due 1768141029256, now
1768141029501
2026-01-11 14:17:09.520 o.a.s.s.StatusUpdaterBolt pool-9-thread-1 [INFO]
Batched 27 inserts executed in 18 msec
```
**Evidence:** SQL StatusUpdaterBolt successfully persisting URL statuses to
MySQL in batches.
---
## 3. MySQL Database - Direct Verification
### 3.1 Database Connection & Schema Verification
```
sh-5.1# mysql -u root -prootpassword
mysql: [Warning] Using a password on the command line interface can be
insecure.
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 134
Server version: 8.0.44 MySQL Community Server - GPL
mysql> SHOW DATABASES;
+--------------------+
| Database |
+--------------------+
| crawl |
| information_schema |
| mysql |
| performance_schema |
| sys |
+--------------------+
5 rows in set (0.00 sec)
mysql> USE crawl;
Database changed
mysql> SHOW TABLES;
+-----------------+
| Tables_in_crawl |
+-----------------+
| content |
| metrics |
| urls |
+-----------------+
3 rows in set (0.01 sec)
```
**Evidence:** Database `crawl` exists with all required tables (`urls`,
`metrics`, `content`).
### 3.2 URL Status Distribution
```sql
mysql> SELECT status, COUNT(*) as count FROM urls GROUP BY status;
+-------------+-------+
| status | count |
+-------------+-------+
| DISCOVERED | 29 |
| REDIRECTION | 1 |
+-------------+-------+
2 rows in set (0.00 sec)
mysql> SELECT COUNT(*) as total_urls FROM urls;
+------------+
| total_urls |
+------------+
| 30 |
+------------+
1 row in set (0.00 sec)
```
**Evidence:** 30 URLs tracked in database - 29 discovered, 1 redirection.
### 3.3 Complete URLs Table Data
```sql
mysql> SELECT * FROM urls;
+----------------------------------------------------------------------------------------------------------------------------------+-------------+---------------------+--------------------------------------------------------+--------+-----------------------------+
| url
| status |
nextfetchdate | metadata |
bucket | host |
+----------------------------------------------------------------------------------------------------------------------------------+-------------+---------------------+--------------------------------------------------------+--------+-----------------------------+
| https://kodis.iao.fraunhofer.de
| REDIRECTION |
2026-01-12 14:26:41 | _redirTo=https://www.kodis.iao.fraunhofer.de/ |
0 | kodis.iao.fraunhofer.de |
|
https://publica.fraunhofer.de/entities/publication/253205a3-ed71-4926-b26d-b07bca516800/details
| DISCOVERED | 2026-01-11 14:17:07 |
url.path=https://www.kodis.iao.fraunhofer.de/ depth=1 | 0 |
publica.fraunhofer.de |
| https://www.iao.fraunhofer.de/
| DISCOVERED |
2026-01-11 14:17:07 | url.path=https://www.kodis.iao.fraunhofer.de/ depth=1 |
0 | www.iao.fraunhofer.de |
| https://www.kodis.iao.fraunhofer.de/
| DISCOVERED |
2026-01-11 14:11:39 | url.path=https://kodis.iao.fraunhofer.de depth=1 |
0 | www.kodis.iao.fraunhofer.de |
| https://www.kodis.iao.fraunhofer.de/de/aktuelles.html
| DISCOVERED |
2026-01-11 14:17:07 | url.path=https://www.kodis.iao.fraunhofer.de/ depth=1 |
0 | www.kodis.iao.fraunhofer.de |
| https://www.kodis.iao.fraunhofer.de/de/leistungen.html
| DISCOVERED |
2026-01-11 14:17:07 | url.path=https://www.kodis.iao.fraunhofer.de/ depth=1 |
0 | www.kodis.iao.fraunhofer.de |
| https://www.kodis.iao.fraunhofer.de/de/projekte.html
| DISCOVERED |
2026-01-11 14:17:07 | url.path=https://www.kodis.iao.fraunhofer.de/ depth=1 |
0 | www.kodis.iao.fraunhofer.de |
| https://www.kodis.iao.fraunhofer.de/de/publikationen.html
| DISCOVERED |
2026-01-11 14:17:07 | url.path=https://www.kodis.iao.fraunhofer.de/ depth=1 |
0 | www.kodis.iao.fraunhofer.de |
| https://www.kodis.iao.fraunhofer.de/de/ueber-uns.html
| DISCOVERED |
2026-01-11 14:17:07 | url.path=https://www.kodis.iao.fraunhofer.de/ depth=1 |
0 | www.kodis.iao.fraunhofer.de |
| https://www.kodis.iao.fraunhofer.de/de/veranstaltungen.html
| DISCOVERED |
2026-01-11 14:17:07 | url.path=https://www.kodis.iao.fraunhofer.de/ depth=1 |
0 | www.kodis.iao.fraunhofer.de |
| https://www.kodis.iao.fraunhofer.de/en.html
| DISCOVERED |
2026-01-11 14:17:07 | url.path=https://www.kodis.iao.fraunhofer.de/ depth=1 |
0 | www.kodis.iao.fraunhofer.de |
+----------------------------------------------------------------------------------------------------------------------------------+-------------+---------------------+--------------------------------------------------------+--------+-----------------------------+
(30 rows total - truncated for brevity)
```
**Key Observations:**
- `status` column correctly stores URL states (DISCOVERED, REDIRECTION)
- `nextfetchdate` shows scheduled refetch times
- `metadata` preserves crawl path and depth information
- `host` column enables per-host politeness scheduling
---
## 4. SQL MetricsConsumer - Metrics Table
```sql
mysql> SELECT * FROM metrics ORDER BY timestamp DESC LIMIT 15;
+------+----------------+-----------+--------------+----------------+-----------------------------------------------+-------+---------------------+
| id | srcComponentId | srcTaskId | srcWorkerHost| srcWorkerPort | name
| value | timestamp |
+------+----------------+-----------+--------------+----------------+-----------------------------------------------+-------+---------------------+
| 2189 | fetcher | 4 | cbead6cecad0 | 6700 |
in_queues | 0 | 2026-01-11 14:27:38 |
| 2188 | fetcher | 4 | cbead6cecad0 | 6700 |
fetcher_average_persec.fetched_perSec | 0 | 2026-01-11 14:27:38 |
| 2187 | fetcher | 4 | cbead6cecad0 | 6700 |
fetcher_average_persec.bytes_fetched_perSec | 0 | 2026-01-11 14:27:38 |
| 2186 | fetcher | 4 | cbead6cecad0 | 6700 |
activethreads | 0 | 2026-01-11 14:27:38 |
| 2185 | fetcher | 4 | cbead6cecad0 | 6700 |
fetcher_counter.fetched | 0 | 2026-01-11 14:27:38 |
| 2184 | fetcher | 4 | cbead6cecad0 | 6700 |
fetcher_counter.robots.fromCache | 0 | 2026-01-11 14:27:38 |
| 2183 | fetcher | 4 | cbead6cecad0 | 6700 |
fetcher_counter.bytes_fetched | 0 | 2026-01-11 14:27:38 |
| 2182 | fetcher | 4 | cbead6cecad0 | 6700 |
fetcher_counter.robots.fetched | 0 | 2026-01-11 14:27:38 |
| 2181 | fetcher | 4 | cbead6cecad0 | 6700 |
fetcher_counter.status_200 | 0 | 2026-01-11 14:27:38 |
| 2180 | fetcher | 4 | cbead6cecad0 | 6700 |
fetcher_counter.status_308 | 0 | 2026-01-11 14:27:38 |
| 2179 | fetcher | 4 | cbead6cecad0 | 6700 |
num_queues | 0 | 2026-01-11 14:27:38 |
| 2178 | spout | 9 | cbead6cecad0 | 6700 |
inPurgatory | 2 | 2026-01-11 14:27:38 |
| 2177 | spout | 9 | cbead6cecad0 | 6700 |
numQueues | 0 | 2026-01-11 14:27:38 |
| 2176 | spout | 9 | cbead6cecad0 | 6700 |
beingProcessed | 0 | 2026-01-11 14:27:38 |
| 2175 | spout | 9 | cbead6cecad0 | 6700 |
buffer_size | 0 | 2026-01-11 14:27:38 |
+------+----------------+-----------+--------------+----------------+-----------------------------------------------+-------+---------------------+
```
**Evidence:** SQL MetricsConsumer successfully writing crawler metrics to
MySQL including:
- Fetcher statistics (fetched count, bytes, HTTP status codes)
- Spout metrics (queue sizes, purgatory count)
- Worker identification (host, port, component)
---
## 5. URLFrontier Statistics
```
$ java -cp target/stormcrawler-sql-1.0-SNAPSHOT.jar \
crawlercommons.urlfrontier.client.Client -t localhost -p 7071 GetStats
Number of queues: 2
Active URLs: 2
In process: 2
active_queues = 2
completed = 0
$ java -cp target/stormcrawler-sql-1.0-SNAPSHOT.jar \
crawlercommons.urlfrontier.client.Client -t localhost -p 7071 ListQueues
kodis.iao.fraunhofer.de
www.kodis.iao.fraunhofer.de
```
**Evidence:** URLFrontier managing 2 queues (one per host) with active URL
processing.
---
## 6. Summary
| Component | Status | Evidence |
|-----------|--------|----------|
| **Storm Topology** | ✅ Running | `sql-crawler-1-1768140688` deployed and
active |
| **URLFrontier Spout** | ✅ Working | Connected to `urlfrontier:7071`, 2
queues active |
| **FetcherBolt** | ✅ Working | Fetched URLs with HTTP 200/308 responses |
| **JSoupParserBolt** | ✅ Working | Parsed pages, extracted content in ~50ms
|
| **SQL StatusUpdaterBolt** | ✅ Working | Batch inserts (1-27 URLs) in
2-18ms |
| **SQL MetricsConsumer** | ✅ Working | 2189+ metrics records in `metrics`
table |
| **MySQL Persistence** | ✅ Working | 30 URLs tracked with status in `urls`
table |
| **Docker Infrastructure** | ✅ Healthy | All 6 containers running, MySQL
health check passing |
---
## 7. Configuration Reference
### SQL Connection Settings (from crawler-conf.yaml)
```yaml
sql.connection:
url:
"jdbc:mysql://mysql:3306/crawl?useSSL=false&allowPublicKeyRetrieval=true&serverTimezone=UTC"
user: "crawler" password: "crawler" rewriteBatchedStatements: true
useBatchMultiSend: true
sql.status.table: "urls"
sql.metrics.table: "metrics"
```
### Key Topology Components (from crawler.flux)
- **Spout:** `crawlercommons.urlfrontier.stormcrawler.Spout` (URLFrontier
integration)
- **Status Bolt:** `org.apache.stormcrawler.sql.StatusUpdaterBolt` (SQL
persistence)
- **Metrics Consumer:**
`org.apache.stormcrawler.sql.metrics.MetricsConsumer`
---
**Conclusion:** The StormCrawler SQL module integration is fully functional.
URLs are being fetched, parsed, and their statuses are correctly persisted to
MySQL. The MetricsConsumer is recording crawler performance data, and the
URLFrontier is managing URL scheduling across multiple host queues.
There were some changes required in the config file. So, I create a fork of
your repo and update the configs at
https://github.com/rajucomp/sc-sql--warc-demo. This will help in future testing
efforts.
Also attached a screenshot of the storm UI running. Not sure if further
verifications are needed.
<img width="1488" height="1023" alt="Screenshot 2026-01-11 at 14 25 17"
src="https://github.com/user-attachments/assets/7191ed45-072b-47f1-bb3c-46a006620453"
/>
<img width="1481" height="1015" alt="Screenshot 2026-01-11 at 14 25 33"
src="https://github.com/user-attachments/assets/fb8239d1-9021-49f1-bcfb-20279e059b09"
/>
Let me know your thoughts. Thanks!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]