rajucomp commented on PR #1766:
URL: https://github.com/apache/stormcrawler/pull/1766#issuecomment-3734655502

   Hi @rzo1. I followed the steps; here is the output:
   
   
    ## 1. Infrastructure Status  
     
   All Docker containers are running and healthy:  
     
   ```  
   NAMES         STATUS                    PORTS  
   supervisor    Up 20 minutes  
   ui            Up 20 minutes             127.0.0.1:8080->8080/tcp  
   nimbus        Up 20 minutes  
   zookeeper     Up 20 minutes             2181/tcp, 2888/tcp, 3888/tcp, 8080/tcp
   mysql         Up 20 minutes (healthy)   127.0.0.1:3306->3306/tcp  
   urlfrontier   Up 20 minutes             127.0.0.1:7071->7071/tcp  
   ```  
     
   **Key Points:**  
   - MySQL container shows `(healthy)` status, indicating successful health checks
   - All Storm components (nimbus, supervisor, ui) are operational  
   - URLFrontier service is accessible on port 7071  
     
   ---  
     
   ## 2. Storm Worker Logs - URL Fetching & SQL Persistence  
     
   ### 2.1 URLFrontier Spout Initialization  
     
   ```log  
   2026-01-11 14:11:37.397 o.a.s.e.s.SpoutExecutor Thread-23-spout-executor[9, 9] [INFO] Opening spout spout:[9]
   2026-01-11 14:11:37.425 o.a.s.u.ManagedChannelUtil Thread-23-spout-executor[9, 9] [INFO] Initialisation of connection to URLFrontier service on urlfrontier:7071
   2026-01-11 14:11:37.540 o.a.s.u.Spout Thread-23-spout-executor[9, 9] [INFO] Initialized URLFrontier Spout without crawlId
   2026-01-11 14:11:37.541 o.a.s.e.s.SpoutExecutor Thread-23-spout-executor[9, 9] [INFO] Opened spout spout:[9]
   2026-01-11 14:11:37.541 o.a.s.e.s.SpoutExecutor Thread-23-spout-executor[9, 9] [INFO] Activating spout spout:[9]
   ```  
     
   **Evidence:** URLFrontier Spout successfully connected to the URLFrontier service.
     
   ### 2.2 URL Fetching (FetcherBolt)  
     
   ```log  
   # Initial seed URL fetch - HTTP 308 redirect
   2026-01-11 14:11:38.736 o.a.s.b.FetcherBolt FetcherThread #31 [INFO] [Fetcher #4] Fetched https://kodis.iao.fraunhofer.de with status 308 in msec 69

   # Successful page fetch - HTTP 200
   2026-01-11 14:17:07.143 o.a.s.b.FetcherBolt FetcherThread #44 [INFO] [Fetcher #4] Fetched https://www.kodis.iao.fraunhofer.de/ with status 200 in msec 315

   # Subsequent fetches showing continuous crawling
   2026-01-11 14:22:08.262 o.a.s.b.FetcherBolt FetcherThread #0 [INFO] [Fetcher #4] Fetched https://www.kodis.iao.fraunhofer.de/ with status 200 in msec 282
   2026-01-11 14:27:09.316 o.a.s.b.FetcherBolt FetcherThread #19 [INFO] [Fetcher #4] Fetched https://www.kodis.iao.fraunhofer.de/ with status 200 in msec 290
   ```  
     
   **Evidence:** FetcherBolt successfully fetching URLs and logging the returned HTTP status codes (308 redirect, 200 OK).
     
   ### 2.3 HTML Parsing (JSoupParserBolt)  
     
   ```log  
   2026-01-11 14:17:07.148 o.a.s.b.JSoupParserBolt Thread-15-parse-executor[6, 6] [INFO] Parsing : starting https://www.kodis.iao.fraunhofer.de/
   2026-01-11 14:17:07.244 o.a.s.b.JSoupParserBolt Thread-15-parse-executor[6, 6] [INFO] Parsed https://www.kodis.iao.fraunhofer.de/ in 50 msec
   2026-01-11 14:17:07.252 o.a.s.b.JSoupParserBolt Thread-15-parse-executor[6, 6] [INFO] Total for https://www.kodis.iao.fraunhofer.de/ - 58 msec
   ```  
     
   **Evidence:** JSoupParserBolt successfully parsing HTML content.  
     
   ### 2.4 Content Indexing (StdOutIndexer)  
     
   ```log  
   2026-01-11 14:17:07.253 o.a.s.i.StdOutIndexer Thread-20-index-executor[5, 5] [INFO] content     561 chars
   2026-01-11 14:17:07.253 o.a.s.i.StdOutIndexer Thread-20-index-executor[5, 5] [INFO] url         https://www.kodis.iao.fraunhofer.de/
   2026-01-11 14:17:07.253 o.a.s.i.StdOutIndexer Thread-20-index-executor[5, 5] [INFO] keywords    IAO Kodis
   2026-01-11 14:17:07.253 o.a.s.i.StdOutIndexer Thread-20-index-executor[5, 5] [INFO] domain      fraunhofer.de
   2026-01-11 14:17:07.253 o.a.s.i.StdOutIndexer Thread-20-index-executor[5, 5] [INFO] format      html
   2026-01-11 14:17:07.253 o.a.s.i.StdOutIndexer Thread-20-index-executor[5, 5] [INFO] description 137 chars
   2026-01-11 14:17:07.253 o.a.s.i.StdOutIndexer Thread-20-index-executor[5, 5] [INFO] title       Fraunhofer IAO - KODIS: Forschungs- und Innovationszentrum Kognitive Dienstleistungssysteme
   ```  
     
   **Evidence:** Content successfully extracted and indexed with metadata.  
     
   ### 2.5 SQL StatusUpdaterBolt - Batch Persistence  
     
   ```log  
   # Single URL insert (redirect status)
   2026-01-11 14:11:41.498 o.a.s.s.StatusUpdaterBolt pool-9-thread-1 [INFO] About to execute batch - triggered by time. Due 1768140700807, now 1768140701498
   2026-01-11 14:11:41.504 o.a.s.s.StatusUpdaterBolt pool-9-thread-1 [INFO] Batched 1 inserts executed in 5 msec

   # Batch insert of discovered URLs (27 URLs from parsed page)
   2026-01-11 14:17:09.501 o.a.s.s.StatusUpdaterBolt pool-9-thread-1 [INFO] About to execute batch - triggered by time. Due 1768141029256, now 1768141029501
   2026-01-11 14:17:09.520 o.a.s.s.StatusUpdaterBolt pool-9-thread-1 [INFO] Batched 27 inserts executed in 18 msec
   ```  
     
   **Evidence:** SQL StatusUpdaterBolt successfully persisting URL statuses to MySQL in batches.
     
   ---  
     
   ## 3. MySQL Database - Direct Verification  
     
   ### 3.1 Database Connection & Schema Verification  
     
   ```  
   sh-5.1# mysql -u root -prootpassword  
   mysql: [Warning] Using a password on the command line interface can be insecure.
   Welcome to the MySQL monitor.  Commands end with ; or \g.  
   Your MySQL connection id is 134  
   Server version: 8.0.44 MySQL Community Server - GPL  
     
   mysql> SHOW DATABASES;  
   +--------------------+  
   | Database           |  
   +--------------------+  
   | crawl              |  
   | information_schema |  
   | mysql              |  
   | performance_schema |  
   | sys                |  
   +--------------------+  
   5 rows in set (0.00 sec)  
     
   mysql> USE crawl;  
   Database changed  
     
   mysql> SHOW TABLES;  
   +-----------------+  
   | Tables_in_crawl |  
   +-----------------+  
   | content         |  
   | metrics         |  
   | urls            |  
   +-----------------+  
   3 rows in set (0.01 sec)  
   ```  
     
   **Evidence:** Database `crawl` exists with all required tables (`urls`, `metrics`, `content`).
     
   ### 3.2 URL Status Distribution  
     
   ```sql  
   mysql> SELECT status, COUNT(*) as count FROM urls GROUP BY status;  
   +-------------+-------+  
   | status      | count |  
   +-------------+-------+  
   | DISCOVERED  |    29 |  
   | REDIRECTION |     1 |  
   +-------------+-------+  
   2 rows in set (0.00 sec)  
     
   mysql> SELECT COUNT(*) as total_urls FROM urls;  
   +------------+  
   | total_urls |  
   +------------+  
   |         30 |  
   +------------+  
   1 row in set (0.00 sec)  
   ```  
     
   **Evidence:** 30 URLs tracked in database - 29 discovered, 1 redirection.  
     
   ### 3.3 Complete URLs Table Data  
     
   ```sql  
   mysql> SELECT * FROM urls;
   +---------------------------------------------------------------------------------------------------+-------------+---------------------+--------------------------------------------------------+--------+-----------------------------+
   | url                                                                                                 | status      | nextfetchdate       | metadata                                               | bucket | host                        |
   +---------------------------------------------------------------------------------------------------+-------------+---------------------+--------------------------------------------------------+--------+-----------------------------+
   | https://kodis.iao.fraunhofer.de                                                                     | REDIRECTION | 2026-01-12 14:26:41 | _redirTo=https://www.kodis.iao.fraunhofer.de/          |      0 | kodis.iao.fraunhofer.de     |
   | https://publica.fraunhofer.de/entities/publication/253205a3-ed71-4926-b26d-b07bca516800/details    | DISCOVERED  | 2026-01-11 14:17:07 | url.path=https://www.kodis.iao.fraunhofer.de/  depth=1 |      0 | publica.fraunhofer.de       |
   | https://www.iao.fraunhofer.de/                                                                      | DISCOVERED  | 2026-01-11 14:17:07 | url.path=https://www.kodis.iao.fraunhofer.de/  depth=1 |      0 | www.iao.fraunhofer.de       |
   | https://www.kodis.iao.fraunhofer.de/                                                                | DISCOVERED  | 2026-01-11 14:11:39 | url.path=https://kodis.iao.fraunhofer.de       depth=1 |      0 | www.kodis.iao.fraunhofer.de |
   | https://www.kodis.iao.fraunhofer.de/de/aktuelles.html                                               | DISCOVERED  | 2026-01-11 14:17:07 | url.path=https://www.kodis.iao.fraunhofer.de/  depth=1 |      0 | www.kodis.iao.fraunhofer.de |
   | https://www.kodis.iao.fraunhofer.de/de/leistungen.html                                              | DISCOVERED  | 2026-01-11 14:17:07 | url.path=https://www.kodis.iao.fraunhofer.de/  depth=1 |      0 | www.kodis.iao.fraunhofer.de |
   | https://www.kodis.iao.fraunhofer.de/de/projekte.html                                                | DISCOVERED  | 2026-01-11 14:17:07 | url.path=https://www.kodis.iao.fraunhofer.de/  depth=1 |      0 | www.kodis.iao.fraunhofer.de |
   | https://www.kodis.iao.fraunhofer.de/de/publikationen.html                                           | DISCOVERED  | 2026-01-11 14:17:07 | url.path=https://www.kodis.iao.fraunhofer.de/  depth=1 |      0 | www.kodis.iao.fraunhofer.de |
   | https://www.kodis.iao.fraunhofer.de/de/ueber-uns.html                                               | DISCOVERED  | 2026-01-11 14:17:07 | url.path=https://www.kodis.iao.fraunhofer.de/  depth=1 |      0 | www.kodis.iao.fraunhofer.de |
   | https://www.kodis.iao.fraunhofer.de/de/veranstaltungen.html                                         | DISCOVERED  | 2026-01-11 14:17:07 | url.path=https://www.kodis.iao.fraunhofer.de/  depth=1 |      0 | www.kodis.iao.fraunhofer.de |
   | https://www.kodis.iao.fraunhofer.de/en.html                                                         | DISCOVERED  | 2026-01-11 14:17:07 | url.path=https://www.kodis.iao.fraunhofer.de/  depth=1 |      0 | www.kodis.iao.fraunhofer.de |
   +---------------------------------------------------------------------------------------------------+-------------+---------------------+--------------------------------------------------------+--------+-----------------------------+
   (30 rows total - truncated for brevity)
   ```  
     
   **Key Observations:**  
   - `status` column correctly stores URL states (DISCOVERED, REDIRECTION)  
   - `nextfetchdate` shows scheduled refetch times  
   - `metadata` preserves crawl path and depth information  
   - `host` column enables per-host politeness scheduling (see the sample queries below)
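
   For a quick cross-check of these observations, a couple of ad-hoc queries along the following lines can be run against the `urls` table (column names taken from the output above; the exact counts will depend on when the queries are run):

   ```sql
   -- Per-host breakdown: should line up with the per-host queues reported by URLFrontier
   SELECT host, status, COUNT(*) AS cnt
   FROM urls
   GROUP BY host, status
   ORDER BY host, status;

   -- URLs whose next fetch is already due at query time
   SELECT url, status, nextfetchdate
   FROM urls
   WHERE nextfetchdate <= NOW()
   ORDER BY nextfetchdate
   LIMIT 10;
   ```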
     
     
   ---  
     
   ## 4. SQL MetricsConsumer - Metrics Table  
     
   ```sql  
   mysql> SELECT * FROM metrics ORDER BY timestamp DESC LIMIT 15;
   +------+----------------+-----------+---------------+---------------+-----------------------------------------------+-------+---------------------+
   | id   | srcComponentId | srcTaskId | srcWorkerHost | srcWorkerPort | name                                          | value | timestamp           |
   +------+----------------+-----------+---------------+---------------+-----------------------------------------------+-------+---------------------+
   | 2189 | fetcher        |         4 | cbead6cecad0  |          6700 | in_queues                                     |     0 | 2026-01-11 14:27:38 |
   | 2188 | fetcher        |         4 | cbead6cecad0  |          6700 | fetcher_average_persec.fetched_perSec         |     0 | 2026-01-11 14:27:38 |
   | 2187 | fetcher        |         4 | cbead6cecad0  |          6700 | fetcher_average_persec.bytes_fetched_perSec   |     0 | 2026-01-11 14:27:38 |
   | 2186 | fetcher        |         4 | cbead6cecad0  |          6700 | activethreads                                 |     0 | 2026-01-11 14:27:38 |
   | 2185 | fetcher        |         4 | cbead6cecad0  |          6700 | fetcher_counter.fetched                       |     0 | 2026-01-11 14:27:38 |
   | 2184 | fetcher        |         4 | cbead6cecad0  |          6700 | fetcher_counter.robots.fromCache              |     0 | 2026-01-11 14:27:38 |
   | 2183 | fetcher        |         4 | cbead6cecad0  |          6700 | fetcher_counter.bytes_fetched                 |     0 | 2026-01-11 14:27:38 |
   | 2182 | fetcher        |         4 | cbead6cecad0  |          6700 | fetcher_counter.robots.fetched                |     0 | 2026-01-11 14:27:38 |
   | 2181 | fetcher        |         4 | cbead6cecad0  |          6700 | fetcher_counter.status_200                    |     0 | 2026-01-11 14:27:38 |
   | 2180 | fetcher        |         4 | cbead6cecad0  |          6700 | fetcher_counter.status_308                    |     0 | 2026-01-11 14:27:38 |
   | 2179 | fetcher        |         4 | cbead6cecad0  |          6700 | num_queues                                    |     0 | 2026-01-11 14:27:38 |
   | 2178 | spout          |         9 | cbead6cecad0  |          6700 | inPurgatory                                   |     2 | 2026-01-11 14:27:38 |
   | 2177 | spout          |         9 | cbead6cecad0  |          6700 | numQueues                                     |     0 | 2026-01-11 14:27:38 |
   | 2176 | spout          |         9 | cbead6cecad0  |          6700 | beingProcessed                                |     0 | 2026-01-11 14:27:38 |
   | 2175 | spout          |         9 | cbead6cecad0  |          6700 | buffer_size                                   |     0 | 2026-01-11 14:27:38 |
   +------+----------------+-----------+---------------+---------------+-----------------------------------------------+-------+---------------------+
   ```  
     
   **Evidence:** SQL MetricsConsumer successfully writing crawler metrics to MySQL, including:
   - Fetcher statistics (fetched count, bytes, HTTP status codes)  
   - Spout metrics (queue sizes, purgatory count)  
   - Worker identification (host, port, component)  
     
   ---  
     
   ## 5. URLFrontier Statistics  
     
   ```  
   $ java -cp target/stormcrawler-sql-1.0-SNAPSHOT.jar \  
    crawlercommons.urlfrontier.client.Client -t localhost -p 7071 GetStats  
   Number of queues: 2  
   Active URLs: 2  
   In process: 2  
   active_queues = 2  
   completed = 0  
     
   $ java -cp target/stormcrawler-sql-1.0-SNAPSHOT.jar \  
    crawlercommons.urlfrontier.client.Client -t localhost -p 7071 ListQueues  
   kodis.iao.fraunhofer.de  
   www.kodis.iao.fraunhofer.de  
   ```  
     
   **Evidence:** URLFrontier managing 2 queues (one per host) with active URL processing.
     
   ---  
     
   ## 6. Summary  
     
   | Component | Status | Evidence |  
   |-----------|--------|----------|  
   | **Storm Topology** | ✅ Running | `sql-crawler-1-1768140688` deployed and active |
   | **URLFrontier Spout** | ✅ Working | Connected to `urlfrontier:7071`, 2 queues active |
   | **FetcherBolt** | ✅ Working | Fetched URLs with HTTP 200/308 responses |
   | **JSoupParserBolt** | ✅ Working | Parsed pages, extracted content in ~50ms |
   | **SQL StatusUpdaterBolt** | ✅ Working | Batch inserts (1-27 URLs) in 2-18ms |
   | **SQL MetricsConsumer** | ✅ Working | 2189+ metrics records in `metrics` table |
   | **MySQL Persistence** | ✅ Working | 30 URLs tracked with status in `urls` table |
   | **Docker Infrastructure** | ✅ Healthy | All 6 containers running, MySQL health check passing |
     
   ---  
     
   ## 7. Configuration Reference  
     
   ### SQL Connection Settings (from crawler-conf.yaml)  
     
   ```yaml  
   sql.connection:
     url: "jdbc:mysql://mysql:3306/crawl?useSSL=false&allowPublicKeyRetrieval=true&serverTimezone=UTC"
     user: "crawler"
     password: "crawler"
     rewriteBatchedStatements: true
     useBatchMultiSend: true

   sql.status.table: "urls"
   sql.metrics.table: "metrics"
   ```  
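
   If useful, the table names referenced by `sql.status.table` and `sql.metrics.table` can also be cross-checked against the actual schema from the MySQL shell (same `crawl` database as in section 3.1):

   ```sql
   -- Inspect the schema of the tables configured above
   USE crawl;
   DESCRIBE urls;
   DESCRIBE metrics;
   ```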
     
   ### Key Topology Components (from crawler.flux)  
     
   - **Spout:** `crawlercommons.urlfrontier.stormcrawler.Spout` (URLFrontier integration)
   - **Status Bolt:** `org.apache.stormcrawler.sql.StatusUpdaterBolt` (SQL persistence)
   - **Metrics Consumer:** `org.apache.stormcrawler.sql.metrics.MetricsConsumer`
     
   ---  
     
   **Conclusion:** The StormCrawler SQL module integration is fully functional. URLs are being fetched and parsed, their statuses are correctly persisted to MySQL, the MetricsConsumer is recording crawler performance data, and the URLFrontier is managing URL scheduling across multiple host queues.
   
   
   
   There were some changes required in the config file, so I created a fork of your repo and updated the configs at https://github.com/rajucomp/sc-sql--warc-demo. This should help with future testing efforts.
   
   I have also attached screenshots of the Storm UI running. Not sure if further verification is needed.
   
   <img width="1488" height="1023" alt="Screenshot 2026-01-11 at 14 25 17" 
src="https://github.com/user-attachments/assets/7191ed45-072b-47f1-bb3c-46a006620453";
 />
   <img width="1481" height="1015" alt="Screenshot 2026-01-11 at 14 25 33" 
src="https://github.com/user-attachments/assets/fb8239d1-9021-49f1-bcfb-20279e059b09";
 />
   
   Let me know your thoughts. Thanks!
   
   
   

