This is an automated email from the ASF dual-hosted git repository.
rzo1 pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/stormcrawler.git
The following commit(s) were added to refs/heads/main by this push:
new 2a6930af #1542 - Migrate Documentation from Wiki to Living
Documentation in Code (#1714)
2a6930af is described below
commit 2a6930af21c4c8265d8c1c923598bd77b64dbe26
Author: Richard Zowalla <[email protected]>
AuthorDate: Tue Dec 23 10:10:15 2025 +0100
#1542 - Migrate Documentation from Wiki to Living Documentation in Code
(#1714)
* WIP: Started to work on a first documentation of SC
* WIP: Started to work on a first documentation of SC
* WIP: Converting wiki to code / adoc documentation.
* WIP: Converting wiki to code / adoc documentation.
* WIP: Disable WIKI as content is now in adoc
* Fix parent version
* WIP: Add option table
* WIP: Minor Update and migration from StormCrawler wiki
* WIP: Fix version
* add images. minor changes
* fix output
* add generated docs to maven build as tar.gz so it can be used for website
* Fix internal links, move from http to https
* Fix line numbers in links
* Change error direction
* Fix icon configuration for highlighting
* Fix archetype command with old version
---
.asf.yaml | 3 +-
DISCLAIMER-BINARIES.txt | 9 +-
docs/pom.xml | 83 ++++
docs/src/assembly/docs.xml | 38 ++
docs/src/main/asciidoc/architecture.adoc | 97 +++++
docs/src/main/asciidoc/configuration.adoc | 316 +++++++++++++++
docs/src/main/asciidoc/debugging.adoc | 21 +
docs/src/main/asciidoc/images/stormcrawler.drawio | 238 +++++++++++
.../main/asciidoc/images/stormcrawler.drawio.jpg | Bin 0 -> 172493 bytes
.../main/asciidoc/images/stormcrawler.drawio.pdf | Bin 0 -> 50189 bytes
docs/src/main/asciidoc/index.adoc | 33 ++
docs/src/main/asciidoc/internals.adoc | 443 +++++++++++++++++++++
docs/src/main/asciidoc/overview.adoc | 53 +++
docs/src/main/asciidoc/powered-by.adoc | 40 ++
docs/src/main/asciidoc/presentations.adoc | 28 ++
docs/src/main/asciidoc/quick-start.adoc | 211 ++++++++++
pom.xml | 6 +-
17 files changed, 1614 insertions(+), 5 deletions(-)
diff --git a/.asf.yaml b/.asf.yaml
index be2f3d0e..df446845 100644
--- a/.asf.yaml
+++ b/.asf.yaml
@@ -33,8 +33,7 @@ github:
dependabot_updates: false
features:
- # Enable wiki for documentation
- wiki: true
+ wiki: false
# Enable issue management
issues: true
# Enable projects for project management boards
diff --git a/DISCLAIMER-BINARIES.txt b/DISCLAIMER-BINARIES.txt
index dc42c01a..0259d338 100644
--- a/DISCLAIMER-BINARIES.txt
+++ b/DISCLAIMER-BINARIES.txt
@@ -2,8 +2,15 @@
The following binaries are included in this project for testing purposes only:
-`./core/src/test/resources/tripadvisor.sitemap.xml.gz`: A compressed sitemap
for testing sitemap parsing.
+- `./core/src/test/resources/tripadvisor.sitemap.xml.gz`: A compressed sitemap
for testing sitemap parsing.
- `./external/warc/src/test/resources/test.warc.gz`: A WARC file for testing
WARC functionality.
- `./external/warc/src/test/resources/unparsable-date.warc.gz`: A WARC file
with an unparseable date for testing WARC functionality.
These files are used to validate the functionality and reliability of Apache
Stormcrawler. They are not intended for production use or distribution beyond
the scope of testing within this project.
+
+## Disclaimer for Documentation Binaries
+
+The following binaries are included in this project for documentation purposes
only:
+
+- `docs/src/main/asciidoc/images/stormcrawler.jpg`
+- `docs/src/main/asciidoc/images/stormcrawler.pdf`
\ No newline at end of file
diff --git a/docs/pom.xml b/docs/pom.xml
new file mode 100644
index 00000000..bc8362e8
--- /dev/null
+++ b/docs/pom.xml
@@ -0,0 +1,83 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+ xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd">
+ <modelVersion>4.0.0</modelVersion>
+ <parent>
+ <groupId>org.apache.stormcrawler</groupId>
+ <artifactId>stormcrawler</artifactId>
+ <version>3.5.1-SNAPSHOT</version>
+ </parent>
+
+ <artifactId>stormcrawler-docs</artifactId>
+
+ <build>
+ <plugins>
+ <plugin>
+ <groupId>org.asciidoctor</groupId>
+ <artifactId>asciidoctor-maven-plugin</artifactId>
+ <version>3.2.0</version>
+ <executions>
+ <execution>
+ <id>output-html</id>
+ <phase>generate-resources</phase>
+ <goals>
+ <goal>process-asciidoc</goal>
+ </goals>
+ <configuration>
+ <doctype>article</doctype>
+ <attributes>
+
<source-highlighter>coderay</source-highlighter>
+ <toc />
+ <linkcss>false</linkcss>
+ <icons>font</icons>
+ </attributes>
+ </configuration>
+ </execution>
+ </executions>
+ <configuration>
+ <sourceDirectory>src/main/asciidoc</sourceDirectory>
+ </configuration>
+ </plugin>
+
+ <!-- Maven Assembly plugin to create tar.gz -->
+ <plugin>
+ <artifactId>maven-assembly-plugin</artifactId>
+ <executions>
+ <execution>
+ <id>make-docs-archive</id>
+ <phase>package</phase>
+ <goals>
+ <goal>single</goal>
+ </goals>
+ <configuration>
+ <descriptors>
+ <descriptor>src/assembly/docs.xml</descriptor>
+ </descriptors>
+
<finalName>${project.artifactId}-${project.version}</finalName>
+ </configuration>
+ </execution>
+ </executions>
+ </plugin>
+ </plugins>
+ </build>
+
+</project>
\ No newline at end of file
diff --git a/docs/src/assembly/docs.xml b/docs/src/assembly/docs.xml
new file mode 100644
index 00000000..c50761a6
--- /dev/null
+++ b/docs/src/assembly/docs.xml
@@ -0,0 +1,38 @@
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+<assembly
xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.3"
+ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+
xsi:schemaLocation="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.3
+ https://maven.apache.org/xsd/assembly-1.1.3.xsd">
+ <id>docs</id>
+ <formats>
+ <format>tar.gz</format>
+ </formats>
+ <includeBaseDirectory>false</includeBaseDirectory>
+ <fileSets>
+ <fileSet>
+ <directory>${project.build.directory}/generated-docs</directory>
+ <outputDirectory>/</outputDirectory>
+ <includes>
+ <include>**/*</include>
+ </includes>
+ </fileSet>
+ </fileSets>
+</assembly>
diff --git a/docs/src/main/asciidoc/architecture.adoc
b/docs/src/main/asciidoc/architecture.adoc
new file mode 100644
index 00000000..b4ba45b0
--- /dev/null
+++ b/docs/src/main/asciidoc/architecture.adoc
@@ -0,0 +1,97 @@
+////
+Licensed under the Apache License, Version 2.0 (the "License");
+You may not use this file except in compliance with the License.
+You may obtain a copy of the License at:
+https://www.apache.org/licenses/LICENSE-2.0
+////
+
+== Understanding StormCrawler's Architecture
+
+=== Architecture Overview
+
+Apache StormCrawler is built as a distributed, stream-oriented web crawling
system
+on top of Apache Storm. Its architecture emphasizes clear separation between
+*crawl control* and *content processing*, with the URL frontier acting as the
+central coordination point.
+
+.Architecture overview of StormCrawler
+image::stormcrawler.drawio.jpg[StormCrawler Architecture, width=100%]
+
+Figure 1 illustrates StormCrawler’s stream-processing crawl pipeline, built on
Apache Storm.
+The architecture is intentionally modular and centers around two core
abstractions:
+
+- The URL frontier: decides *what* to crawl and *when*
+- The parsing and indexing pipeline: decides *what* to *extract*, *keep*, and
*store*
+
+Black arrows show the *main data flow*, gray arrows represent URLs *taken from
the frontier*, and purple arrows indicate *URL status updates* fed back to the
frontier.
+
+==== Crawl Flow and Core Components
+
+The crawl begins with the *Frontier*, which is responsible for scheduling,
+prioritization, politeness, and retry logic. URLs are emitted by a
+`FrontierSpout` and partitioned by the `URLPartitioner`, typically using the
+host as a key to enforce politeness constraints.
+
+The `Fetcher` retrieves web resources and emits both the fetched content and
+associated metadata such as HTTP status codes, headers, and MIME types. Based
+on the content type, documents are routed to specialized parsers, including
+`SiteMapParser`, `JSoupParser` for HTML content, and `TikaParser` for binary
+formats via Apache Tika.
+
+Parsed content is then sent to the `Indexer` and persisted by the `Storage`
+layer. Throughout the pipeline, fetch and parse outcomes are reported to the
+`StatusUpdater`, which feeds URL status information back to the frontier,
+closing the crawl feedback loop.
+
+==== URL Filters
+
+URL Filters determine whether a URL should be accepted, rejected, or modified
+before it is scheduled for fetching. They operate on seed URLs, discovered
+links, and redirect targets, ensuring that only crawl-worthy URLs enter the
+frontier.
+
+In Figure 1, URL Filters are conceptually positioned between link discovery
+and the frontier. Their primary role is to control crawl scope and prevent
+frontier explosion.
+
+Typical URL Filters include:
+
+* *URL Length*: rejects excessively long URLs that often indicate session IDs
+or crawler traps.
+* *Path Repetition*: detects repeating path segments that can lead to infinite
+crawl loops.
+* *URL Normalization*: canonicalizes URLs by removing fragments, sorting query
+parameters, or enforcing consistent schemes.
+* *MIME Type*: avoids scheduling URLs unlikely to yield useful content.
+
+By applying these filters early, StormCrawler prevents unnecessary fetches and
+maintains an efficient, focused crawl.
+
+==== Parse Filters
+
+Parse Filters operate after content has been successfully fetched and parsed.
+They allow fine-grained control over how extracted data and outgoing links are
+processed.
+
+Parse Filters are applied within the parsing bolts, following parsing by
+`SiteMapParser`, `JSoupParser`, or `TikaParser`. They can modify extracted
text,
+metadata, and links before the content is indexed or new URLs are emitted.
+
+Common Parse Filters include:
+
+* *URL Filters (post-parse)*: further refine outgoing links extracted from
+content.
+* *XPath*: extract structured fields from HTML documents.
+* *Text Extraction*: control which parts of a document contribute to the
+indexed text.
+* *Enrichment*: add custom metadata such as language detection, entity tags,
+or domain-specific signals.
+
+Parse Filters enable domain-specific logic without coupling it directly to the
+crawler’s core components.
+
+==== Interaction Between URL Filters and Parse Filters
+
+URL Filters focus on deciding *what should be crawled*, while Parse
+Filters focus on deciding *what should be kept and how it should be
+interpreted*.
diff --git a/docs/src/main/asciidoc/configuration.adoc
b/docs/src/main/asciidoc/configuration.adoc
new file mode 100644
index 00000000..02da291c
--- /dev/null
+++ b/docs/src/main/asciidoc/configuration.adoc
@@ -0,0 +1,316 @@
+////
+Licensed under the Apache License, Version 2.0 (the "License");
+You may not use this file except in compliance with the License.
+You may obtain a copy of the License at:
+https://www.apache.org/licenses/LICENSE-2.0
+////
+== Configuration
+
+=== User Agent Configuration
+
+Crawlers should always act responsibly and ethically when accessing websites.
A key aspect of this is properly identifying themselves through the
`User-Agent` header. By providing a clear and accurate user agent string,
webmasters can understand who is visiting their site and why, and can apply
rules in robots.txt accordingly. Respecting these rules, avoiding excessive
request rates, and honoring content restrictions not only ensures legal
compliance but also maintains a healthy relation [...]
+Transparent identification is a fundamental part of ethical web crawling.
+
+The configuration of the
link:https://www.w3.org/WAI/UA/work/wiki/Definition_of_User_Agent[user agent]
in StormCrawler has two purposes:
+
+. Identification of the crawler for webmasters
+. Selection of rules from robots.txt
+
+==== Crawler Identification
+
+The politeness of a web crawler is not limited to how frequently it fetches
pages from a site, but also in how it identifies itself to sites it crawls.
This is done by setting the HTTP header `User-Agent`, just like
link:https://www.whatismybrowser.com/detect/what-is-my-user-agent/[your web
browser does].
+
+The full user agent string is built from the concatenation of the
configuration elements:
+
+* `http.agent.name`: name of your crawler
+* `http.agent.version`: version of your crawler
+* `http.agent.description`: description of what it does
+* `http.agent.url`: URL webmasters can go to to learn about it
+* `http.agent.email`: an email so that they can get in touch with you
+
+Whereas StormCrawler used to provide a default value for these, this is not
the case since version 2.11 and you will now be asked to provide a value.
+
+You can specify the user agent verbatim with the config `http.agent` but you
will still need to provide a `http.agent.name` for parsing robots.txt files.
+
+==== Robots Exclusion Protocol
+
+This is also known as the robots.txt protocol, it is formalised in
link:https://www.rfc-editor.org/rfc/rfc9309.html[RFC 9309]. Part of what the
robots directives do is to define rules to specify which parts of a website (if
any) are allowed to be crawled. The rules are organised by `User-Agent`, with a
`*` to match any agent not otherwise specified explicitly, e.g.:
+
+----
+User-Agent: *
+Disallow: *.gif$
+Disallow: /example/
+Allow: /publications/
+----
+
+In the example above the rule allows access to the URLs with the
_/publications/_ path prefix, and it restricts access to the URLs with the
_/example/_ path prefix and to all URLs with a _.gif_ suffix. The `"*"`
character designates any character, including the otherwise-required forward
slash.
+
+The value of `http.agent.name` is what StormCrawler looks for in the
robots.txt. It MUST contain only uppercase and lowercase letters ("a-z" and
"A-Z"), underscores ("_"), and hyphens ("-").
+
+Unless you are running a well known web crawler, it is unlikely that its agent
name will be listed explicitly in the robots.txt (if it is, well,
congratulations!). While you want the agent name value to reflect who your
crawler is, you might want to follow rules set for better known crawlers. For
instance, if you were a responsible AI company crawling the web to build a
dataset to train LLMs, you would want to follow the rules set for
`Google-Extended` (see link:https://developers.google [...]
+
+This is what the configuration `http.robots.agents` allows you to do. It is a
comma-separated string but can also take a list of values. By setting it
alongside `http.agent.name` (which should also be the first value it contains),
you are able to broaden the match rules based on the identity as well as the
purpose of your crawler.
+
+=== Proxy
+
+StormCrawler's proxy system is built on top of the
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/proxy/SCProxy.java[SCProxy]
class and the
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/proxy/ProxyManager.java[ProxyManager]
interface. Every proxy used in the system is formatted as a **SCProxy**. The
**ProxyManager** implementations handle the management and delegation of their
internal pr [...]
+
+The **ProxyManager** interface can be implemented in a custom class to create
custom logic for proxy management and load balancing. The default
**ProxyManager** implementation is **SingleProxyManager**. This ensures
backwards compatibility for prior StormCrawler releases. To use
**MultiProxyManager** or custom implementations, pass the class path and name
via the config parameter `http.proxy.manager`:
+
+----
+http.proxy.manager: "org.apache.stormcrawler.proxy.MultiProxyManager"
+----
+
+StormCrawler implements two **ProxyManager** classes by default:
+
+*
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/proxy/SingleProxyManager.java[SingleProxyManager]
+Manages a single proxy passed by the backwards compatible proxy fields in the
configuration:
+
+ ----
+ http.proxy.host
+ http.proxy.port
+ http.proxy.type
+ http.proxy.user (optional)
+ http.proxy.pass (optional)
+ ----
+
+*
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/proxy/MultiProxyManager.java[MultiProxyManager]
+Manages multiple proxies passed through a TXT file. The file should contain
connection strings for all proxies including the protocol and authentication
(if needed). The file supports comment lines (`//` or `#`) and empty lines. The
file path should be passed via the config at the below field. The TXT file must
be available to all nodes participating in the topology:
+
+ ----
+ http.proxy.file
+ ----
+
+The **MultiProxyManager** load balances across proxies using one of the
following schemes. The load balancing scheme can be passed via the config using
`http.proxy.rotation`; the default value is `ROUND_ROBIN`:
+
+* ROUND_ROBIN
+Evenly distributes load across all proxies
+* RANDOM
+Randomly selects proxies using the native Java random number generator. RNG is
seeded with the nanos at instantiation
+* LEAST_USED
+Selects the proxy with the least amount of usage. This is performed lazily for
speed and therefore will not account for changes to usages during the selection
process. If no custom implementations are made this should theoretically
operate the same as **ROUND_ROBIN**
+
+The **SCProxy** class contains all of the information associated with proxy
connection. In addition, it tracks the total usage of the proxy and optionally
tracks the location of the proxy IP. Usage information is used for the
**LEAST_USED** load balancing scheme. The location information is currently
unused but left to enable custom implementations the ability to select proxies
by location.
+
+=== Metadata
+
+==== Registering Metadata for Kryo Serialization
+
+If your Apache StormCrawler topology doesn't extend
`org.apache.storm.crawler.ConfigurableTopology`, you will need to manually
register StormCrawler's `Metadata` class for serialization in Storm. For more
information on Kryo serialization in Apache Storm, you can refer to the
link:https://storm.apache.org/documentation/Serialization.html[documentation].
+
+To register `Metadata` for serialization, you'll need to import
`backtype.storm.Config` and `org.apache.storm.crawler.Metadata`. Then, in your
topology class, you'll register the class with:
+
+[source,java]
+----
+Config.registerSerialization(conf, Metadata.class);
+----
+
+where `conf` is your Storm configuration for the topology.
+
+Alternatively, you can specify in the configuration file:
+
+[source,yaml]
+----
+topology.kryo.register:
+ - org.apache.storm.crawler.Metadata
+----
+
+==== MetadataTransfer
+
+The class
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/util/MetadataTransfer.java[MetadataTransfer]
is an important part of the framework and is used in key parts of a pipeline:
+
+* Fetching
+* Parsing
+* Updating bolts
+
+An instance (or extension) of **MetadataTransfer** gets created and configured
with the method:
+
+[source,java]
+----
+public static MetadataTransfer getInstance(Map<String, Object> conf)
+----
+
+which takes as parameter the standard Storm [[Configuration]].
+
+A **MetadataTransfer** instance has mainly two methods, both returning
`Metadata` objects:
+
+* `getMetaForOutlink(String targetURL, String sourceURL, Metadata parentMD)`
+* `filter(Metadata metadata)`
+
+The former is used when creating
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/parse/Outlink.java[Outlinks],
i.e., in the parsing bolts but also for handling redirections in the
[[FetcherBolt(s)]].
+
+The latter is used by extensions of the
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/persistence/AbstractStatusUpdaterBolt.java[AbstractStatusUpdaterBolt]
class to determine which **Metadata** should be persisted.
+
+The behavior of the default **MetadataTransfer** class is driven by
configuration only. It has the following options:
+
+* `metadata.transfer`:: list of metadata key values to filter or transfer to
the outlinks. Please see the corresponding comments in
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/resources/crawler-default.yaml[crawler-default.yaml]
+* `metadata.persist`:: list of metadata key values to persist in the status
storage. Please see the corresponding comments in
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/resources/crawler-default.yaml[crawler-default.yaml]
+* `metadata.track.path`:: whether to track the URL path or not. Boolean value,
true by default.
+* `metadata.track.depth`:: whether to track the depth from seed. Boolean
value, true by default.
+
+Note that the method `getMetaForOutlink` calls `filter` to determine which key
values to keep.
+
+=== Configuration Options
+
+The following tables describe all available configuration options and their
default values.
+If one of the keys is not present in your YAML file, the default value will be
taken.
+
+Note: Some configuration options may not be applicable depending on the
specific components and
+features you are using in your Apache StormCrawler topology. Some external
modules might define additional options not listed here.
+
+==== Fetching and Partitioning
+
+[cols="1,1,3", options="header"]
+|===
+| key | default value | description
+
+| fetcher.max.crawl.delay | 30 | The maximum number in seconds that will be
accepted by Crawl-delay
+directives in robots.txt files. If the crawl-delay exceeds this value the
behavior depends on the value of fetcher.max.crawl.delay.force.
+| fetcher.max.crawl.delay.force | false | Configures the behavior of fetcher
if the robots.txt crawl-delay exceeds fetcher.max.crawl.delay. If false: the
tuple is emitted to the StatusStream
+as an ERROR. If true: the queue delay is set to fetcher.max.crawl.delay.
+| fetcher.max.queue.size | -1 | The maximum length of the queue used to store
items to be fetched by the FetcherBolt. A setting of -1 sets the length to
Integer.MAX_VALUE.
+| fetcher.max.throttle.sleep | -1 | The maximum amount of time to wait between
fetches; if exceeded, the item is sent to the back of the queue. Used in
SimpleFetcherBolt. -1 disables it.
+| fetcher.max.urls.in.queues | -1 | Limits the number of URLs that can be
stored in a fetch queue. -1 disables the limit.
+| fetcher.maxThreads.host/domain/ip | fetcher.threads.per.queue | Overwrites
fetcher.threads.per.queue. Useful for crawling some domains/hosts/IPs more
intensively.
+| fetcher.metrics.time.bucket.secs | 10 | Metrics events emitted every value
seconds to the system stream.
+| fetcher.queue.mode | byHost | Possible values: byHost, byDomain, byIP.
Determines queue grouping.
+| fetcher.server.delay | 1 | Delay between crawls in the same queue if no
Crawl-delay
+is defined.
+| fetcher.server.delay.force | false | Defines fetcher behavior when the
robots.txt crawl-delay is smaller than fetcher.server.delay.
+| fetcher.server.min.delay | 0 | Delay between crawls for queues with >1
thread. Ignores robots.txt.
+| fetcher.threads.number | 10 | Total concurrent threads fetching pages.
Adjust carefully based on system capacity.
+| fetcher.threads.per.queue | 1 | Default number of threads per queue. Can be
overridden.
+| fetcher.timeout.queue | -1 | Maximum wait time (seconds) for items in the
queue. -1 disables timeout.
+| fetcherbolt.queue.debug.filepath | "" | Path to a debug log (e.g.
/tmp/fetcher-dump-{port}).
+| http.agent.description | - | Description for the User-Agent header.
+| http.agent.email | - | Email address in User-Agent header.
+| http.agent.name | - | Name in User-Agent header.
+| http.agent.url | - | URL in User-Agent header.
+| http.agent.version | - | Version in User-Agent header.
+| http.basicauth.password | - | Password for http.basicauth.user.
+| http.basicauth.user | - | Username for Basic Authentication.
+| http.content.limit | -1 | Maximum HTTP response body size (bytes). Default:
no limit.
+| http.protocol.implementation |
org.apache.stormcrawler.protocol.httpclient.HttpProtocol | HTTP Protocol
+implementation.
+| http.proxy.host | - | HTTP proxy host.
+| http.proxy.pass | - | Proxy password.
+| http.proxy.port | 8080 | Proxy port.
+| http.proxy.user | - | Proxy username.
+| http.robots.403.allow | true | Defines behavior when robots.txt returns HTTP
403.
+| http.robots.agents | '' | Additional user-agent strings for interpreting
robots.txt.
+| http.robots.file.skip | false | Ignore robots.txt rules (1.17+).
+| http.skip.robots | false | Deprecated (replaced by http.robots.file.skip).
+| http.store.headers | false | Whether to store response headers.
+| http.store.responsetime | true | Not yet implemented — store response time
in Metadata.
+| http.timeout | 10000 | Connection timeout (ms).
+| http.use.cookies | false | Use cookies in subsequent requests.
+| https.protocol.implementation |
org.apache.stormcrawler.protocol.httpclient.HttpProtocol | HTTPS Protocol
+implementation.
+| partition.url.mode | byHost | Defines how URLs are partitioned: byHost,
byDomain, or byIP.
+| protocols | http,https | Supported protocols.
+| redirections.allowed | true | Allow URL redirects.
+| sitemap.discovery | false | Enable automatic sitemap discovery.
+|===
+
+==== Protocol
+
+[cols="1,1,3", options="header"]
+|===
+| key | default value | description
+
+| cacheConfigParamName | maximumSize=10000,expireAfterWrite=6h | CacheBuilder
configuration for robots cache.
+| errorcacheConfigParamName | maximumSize=10000,expireAfterWrite=1h |
CacheBuilder configuration for error cache.
+| file.encoding | UTF-8 | Encoding for FileProtocol.
+| http.custom.headers | - | Custom HTTP headers.
+| http.accept | - | HTTP Accept
+header.
+| http.accept.language | - | HTTP Accept-Language
+header.
+| http.content.partial.as.trimmed | false | Accepts partially fetched content
in OKHTTP.
+| http.trust.everything | true | If true, trust all SSL/TLS connections.
+| navigationfilters.config.file | - | JSON config for NavigationFilter. See
blog post
+.
+| selenium.addresses | - | WebDriver server addresses.
+| selenium.capabilities | - | Desired WebDriver capabilities
+.
+| selenium.delegated.protocol | - | Delegated protocol for selective Selenium
usage.
+| selenium.implicitlyWait | 0 | WebDriver element search timeout.
+| selenium.instances.num | 1 | Number of instances per WebDriver connection.
+| selenium.pageLoadTimeout | 0 | WebDriver page load timeout.
+| selenium.setScriptTimeout | 0 | WebDriver script execution timeout.
+| topology.message.timeout.secs | -1 | OKHTTP message timeout.
+|===
+
+==== Indexing
+
+The values below are used by sub-classes of `AbstractIndexerBolt`.
+
+[cols="1,1,3", options="header"]
+|===
+| key | default value | description
+
+| indexer.md.filter | - | YAML list of key=value filters for metadata-based
indexing.
+| indexer.md.mapping | - | YAML mapping from metadata fields to persistence
layer fields.
+| indexer.text.fieldname | - | Field name for indexed HTML body text.
+| indexer.url.fieldname | - | Field name for indexed URL.
+|===
+
+==== Status Persistence
+
+This refers to persisting the status of a URL (e.g. ERROR, DISCOVERED etc.)
along with something like a `nextFetchDate`
+that is being calculated by a
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/persistence/DefaultScheduler.java[Scheduler].
+
+[cols="1,1,3", options="header"]
+|===
+| key | default value | description
+
+| fetchInterval.default | 1440 | Default revisit interval (minutes). Used by
DefaultScheduler
+.
+| fetchInterval.error | 44640 | Revisit interval for error pages (minutes).
+| fetchInterval.fetch.error | 120 | Revisit interval for fetch errors
(minutes).
+| status.updater.cache.spec | maximumSize=10000, expireAfterAccess=1h | Cache
specification
+.
+| status.updater.use.cache | true | Whether to use cache to avoid
re-persisting URLs.
+|===
+
+==== Parsing
+
+Configures parsing of fetched text and the handling of discovered URIs
+
+[cols="1,1,3", options="header"]
+|===
+| key | default value | description
+
+| collections.file | collections.json | Config for CollectionTagger
+.
+| collections.key | collections | Key under which tags are stored in metadata.
+| feed.filter.hours.since.published | -1 | Discard feeds older than value
hours.
+| feed.sniffContent | false | Try to detect feeds automatically.
+| parsefilters.config.file | parsefilters.json | Path to JSON config defining
ParseFilters. See example
+.
+| parser.emitOutlinks | true | Emit discovered links as DISCOVERED tuples.
+| parser.emitOutlinks.max.per.page | -1 | Limit number of emitted links per
page.
+| textextractor.exclude.tags | "" | HTML tags ignored by TextExtractor.
+| textextractor.include.pattern | "" | Regex patterns to include for
TextExtractor.
+| textextractor.no.text | false | Disable text extraction entirely.
+| track.anchors | true | Add anchor text to outlink metadata.
+| urlfilters.config.file | urlfilters.json | JSON file defining URL filters.
See default
+.
+|===
+
+==== Metadata
+
+Options on how StormCrawler should handle metadata tracking as well as
+minimising metadata clashes
+
+[cols="1,1,3", options="header"]
+|===
+| key | default value | description
+
+| metadata.persist | - | Metadata to persist but not transfer to outlinks.
+| metadata.track.depth | true | Track crawl depth of URLs.
+| metadata.track.path | true | Track URL path history in metadata.
+| metadata.transfer | - | Metadata to transfer to outlinks.
+| metadata.transfer.class | org.apache.stormcrawler.util.MetadataTransfer |
Class handling metadata transfer.
+| protocol.md.prefix | - | Prefix for remote metadata keys to avoid collisions.
+|===
\ No newline at end of file
diff --git a/docs/src/main/asciidoc/debugging.adoc
b/docs/src/main/asciidoc/debugging.adoc
new file mode 100644
index 00000000..963efad3
--- /dev/null
+++ b/docs/src/main/asciidoc/debugging.adoc
@@ -0,0 +1,21 @@
+////
+Licensed under the Apache License, Version 2.0 (the "License");
+You may not use this file except in compliance with the License.
+You may obtain a copy of the License at:
+https://www.apache.org/licenses/LICENSE-2.0
+////
+== Debugging a StormCrawler Topology
+
+Assuming you have a StormCrawler setup all ready to use (e.g., generated from
the archetype), you can debug it by either:
+
+* Running the Java topology from your IDE in debug mode.
+* Using the following command to enable remote debugging:
+
+[source,bash]
+----
+export
STORM_JAR_JVM_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=localhost:8000"
+----
+
+Then run the topology in the usual way with `storm jar ...` and remote debug
from your IDE.
+
+In both cases, the topology will need to run in local mode, i.e., not deployed
to a Storm cluster.
diff --git a/docs/src/main/asciidoc/images/stormcrawler.drawio
b/docs/src/main/asciidoc/images/stormcrawler.drawio
new file mode 100644
index 00000000..740813e0
--- /dev/null
+++ b/docs/src/main/asciidoc/images/stormcrawler.drawio
@@ -0,0 +1,238 @@
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ https://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+<mxfile host="app.diagrams.net" agent="Mozilla/5.0 (Macintosh; Intel Mac OS X
10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/143.0.0.0 Safari/537.36"
version="29.2.7">
+ <diagram name="Seite-1" id="PRhTqK9N1ZGVMDB96As0">
+ <mxGraphModel dx="1018" dy="659" grid="1" gridSize="10" guides="1"
tooltips="1" connect="1" arrows="1" fold="1" page="1" pageScale="1"
pageWidth="827" pageHeight="1169" math="0" shadow="0">
+ <root>
+ <mxCell id="0" />
+ <mxCell id="1" parent="0" />
+ <mxCell id="19sVcG4BV6jm642oUi7z-2" parent="1"
style="rounded=0;whiteSpace=wrap;html=1;" value="FrontierSpout" vertex="1">
+ <mxGeometry height="60" width="120" x="180" y="15" as="geometry" />
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-3" parent="1"
style="rounded=0;whiteSpace=wrap;html=1;fillColor=#f5f5f5;strokeColor=#666666;fontColor=#333333;"
value="Frontier" vertex="1">
+ <mxGeometry height="60" width="120" x="10" y="15" as="geometry" />
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-4" parent="1"
style="rounded=0;whiteSpace=wrap;html=1;" value="URLPartitioner" vertex="1">
+ <mxGeometry height="60" width="120" x="350" y="15" as="geometry" />
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-5" parent="1"
style="rounded=0;whiteSpace=wrap;html=1;" value="Fetcher" vertex="1">
+ <mxGeometry height="60" width="120" x="520" y="15" as="geometry" />
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-6" parent="1"
style="rounded=0;whiteSpace=wrap;html=1;" value="SiteMapParser" vertex="1">
+ <mxGeometry height="60" width="120" x="690" y="15" as="geometry" />
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-7" parent="1"
style="rounded=0;whiteSpace=wrap;html=1;" value="JSoupParser" vertex="1">
+ <mxGeometry height="60" width="120" x="690" y="245" as="geometry" />
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-8" parent="1"
style="rounded=0;whiteSpace=wrap;html=1;" value="TikaShunt" vertex="1">
+ <mxGeometry height="60" width="120" x="550" y="245" as="geometry" />
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-9" parent="1"
style="rounded=0;whiteSpace=wrap;html=1;" value="TikaParser" vertex="1">
+ <mxGeometry height="60" width="120" x="350" y="245" as="geometry" />
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-10" parent="1"
style="rounded=0;whiteSpace=wrap;html=1;" value="Indexer" vertex="1">
+ <mxGeometry height="60" width="120" x="190" y="245" as="geometry" />
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-11" parent="1"
style="rounded=0;whiteSpace=wrap;html=1;fillColor=#f5f5f5;strokeColor=#666666;fontColor=#333333;"
value="Storage" vertex="1">
+ <mxGeometry height="60" width="120" x="10" y="245" as="geometry" />
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-12" edge="1" parent="1"
source="19sVcG4BV6jm642oUi7z-3"
style="endArrow=classic;html=1;rounded=0;exitX=1;exitY=0.5;exitDx=0;exitDy=0;strokeWidth=3;strokeColor=#999999;entryX=0;entryY=0.5;entryDx=0;entryDy=0;"
target="19sVcG4BV6jm642oUi7z-2" value="">
+ <mxGeometry height="50" relative="1" width="50" as="geometry">
+ <Array as="points">
+ <mxPoint x="160" y="45" />
+ </Array>
+ <mxPoint x="120" y="245" as="sourcePoint" />
+ <mxPoint x="170" y="110" as="targetPoint" />
+ </mxGeometry>
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-14" edge="1" parent="1"
source="19sVcG4BV6jm642oUi7z-2"
style="endArrow=classic;html=1;rounded=0;entryX=0;entryY=0.5;entryDx=0;entryDy=0;exitX=1;exitY=0.5;exitDx=0;exitDy=0;strokeWidth=2;"
target="19sVcG4BV6jm642oUi7z-4" value="">
+ <mxGeometry height="50" relative="1" width="50" as="geometry">
+ <mxPoint x="260" y="125" as="sourcePoint" />
+ <mxPoint x="210" y="125" as="targetPoint" />
+ </mxGeometry>
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-16" edge="1" parent="1"
style="endArrow=classic;html=1;rounded=0;entryX=0;entryY=0.5;entryDx=0;entryDy=0;exitX=1;exitY=0.5;exitDx=0;exitDy=0;strokeWidth=2;"
value="">
+ <mxGeometry height="50" relative="1" width="50" as="geometry">
+ <mxPoint x="470" y="44.66" as="sourcePoint" />
+ <mxPoint x="520" y="44.66" as="targetPoint" />
+ </mxGeometry>
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-17" edge="1" parent="1"
style="endArrow=classic;html=1;rounded=0;entryX=0;entryY=0.5;entryDx=0;entryDy=0;exitX=1;exitY=0.5;exitDx=0;exitDy=0;strokeWidth=2;"
value="">
+ <mxGeometry height="50" relative="1" width="50" as="geometry">
+ <mxPoint x="640" y="44.66" as="sourcePoint" />
+ <mxPoint x="690" y="44.66" as="targetPoint" />
+ </mxGeometry>
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-18" parent="1"
style="rounded=0;whiteSpace=wrap;html=1;" value="StatusUpdater" vertex="1">
+ <mxGeometry height="60" width="120" x="350" y="155" as="geometry" />
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-19" edge="1" parent="1"
style="endArrow=classic;html=1;rounded=0;entryX=1;entryY=0.25;entryDx=0;entryDy=0;exitX=1;exitY=0.5;exitDx=0;exitDy=0;strokeColor=#990099;strokeWidth=2;"
target="19sVcG4BV6jm642oUi7z-18" value="">
+ <mxGeometry height="50" relative="1" width="50" as="geometry">
+ <Array as="points">
+ <mxPoint x="570" y="170" />
+ </Array>
+ <mxPoint x="570" y="75" as="sourcePoint" />
+ <mxPoint x="620" y="75" as="targetPoint" />
+ </mxGeometry>
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-20" edge="1" parent="1"
style="endArrow=classic;html=1;rounded=0;entryX=1;entryY=0.5;entryDx=0;entryDy=0;exitX=1;exitY=0.5;exitDx=0;exitDy=0;strokeColor=#990099;strokeWidth=2;"
target="19sVcG4BV6jm642oUi7z-18" value="">
+ <mxGeometry height="50" relative="1" width="50" as="geometry">
+ <Array as="points">
+ <mxPoint x="750" y="185" />
+ </Array>
+ <mxPoint x="750" y="75" as="sourcePoint" />
+ <mxPoint x="650" y="185" as="targetPoint" />
+ </mxGeometry>
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-21" edge="1" parent="1"
source="19sVcG4BV6jm642oUi7z-18"
style="endArrow=classic;html=1;rounded=0;entryX=0.5;entryY=1;entryDx=0;entryDy=0;exitX=0;exitY=0.5;exitDx=0;exitDy=0;strokeColor=#990099;strokeWidth=2;"
target="19sVcG4BV6jm642oUi7z-3" value="">
+ <mxGeometry height="50" relative="1" width="50" as="geometry">
+ <Array as="points">
+ <mxPoint x="70" y="185" />
+ </Array>
+ <mxPoint x="170" y="35" as="sourcePoint" />
+ <mxPoint x="70" y="145" as="targetPoint" />
+ </mxGeometry>
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-22" edge="1" parent="1"
source="19sVcG4BV6jm642oUi7z-6"
style="endArrow=classic;html=1;rounded=0;entryX=1;entryY=0.5;entryDx=0;entryDy=0;exitX=1;exitY=0.5;exitDx=0;exitDy=0;strokeWidth=2;"
target="19sVcG4BV6jm642oUi7z-7" value="">
+ <mxGeometry height="50" relative="1" width="50" as="geometry">
+ <Array as="points">
+ <mxPoint x="840" y="45" />
+ <mxPoint x="840" y="215" />
+ <mxPoint x="840" y="275" />
+ </Array>
+ <mxPoint x="1090" y="145" as="sourcePoint" />
+ <mxPoint x="810" y="270" as="targetPoint" />
+ </mxGeometry>
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-23" edge="1" parent="1"
source="19sVcG4BV6jm642oUi7z-7"
style="endArrow=classic;html=1;rounded=0;entryX=1;entryY=0.5;entryDx=0;entryDy=0;exitX=0;exitY=0.5;exitDx=0;exitDy=0;strokeWidth=2;"
target="19sVcG4BV6jm642oUi7z-8" value="">
+ <mxGeometry height="50" relative="1" width="50" as="geometry">
+ <mxPoint x="680" y="355" as="sourcePoint" />
+ <mxPoint x="730" y="355" as="targetPoint" />
+ </mxGeometry>
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-24" edge="1" parent="1"
source="19sVcG4BV6jm642oUi7z-7"
style="endArrow=classic;html=1;rounded=0;entryX=1;entryY=0.75;entryDx=0;entryDy=0;exitX=0.5;exitY=0;exitDx=0;exitDy=0;strokeColor=#990099;strokeWidth=2;"
target="19sVcG4BV6jm642oUi7z-18" value="">
+ <mxGeometry height="50" relative="1" width="50" as="geometry">
+ <Array as="points">
+ <mxPoint x="750" y="200" />
+ </Array>
+ <mxPoint x="1020" y="225" as="sourcePoint" />
+ <mxPoint x="740" y="350" as="targetPoint" />
+ </mxGeometry>
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-25" edge="1" parent="1"
source="19sVcG4BV6jm642oUi7z-8"
style="endArrow=classic;html=1;rounded=0;entryX=1;entryY=0.5;entryDx=0;entryDy=0;exitX=0;exitY=0.5;exitDx=0;exitDy=0;strokeWidth=2;"
target="19sVcG4BV6jm642oUi7z-9" value="">
+ <mxGeometry height="50" relative="1" width="50" as="geometry">
+ <mxPoint x="450" y="395" as="sourcePoint" />
+ <mxPoint x="500" y="395" as="targetPoint" />
+ </mxGeometry>
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-26" edge="1" parent="1"
style="endArrow=classic;html=1;rounded=0;entryX=1;entryY=0.5;entryDx=0;entryDy=0;strokeWidth=2;"
value="">
+ <mxGeometry height="50" relative="1" width="50" as="geometry">
+ <mxPoint x="350" y="275" as="sourcePoint" />
+ <mxPoint x="310" y="274.67" as="targetPoint" />
+ </mxGeometry>
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-27" edge="1" parent="1"
source="19sVcG4BV6jm642oUi7z-10"
style="endArrow=classic;html=1;rounded=0;entryX=1;entryY=0.5;entryDx=0;entryDy=0;exitX=0;exitY=0.5;exitDx=0;exitDy=0;strokeWidth=2;"
value="">
+ <mxGeometry height="50" relative="1" width="50" as="geometry">
+ <mxPoint x="170" y="275" as="sourcePoint" />
+ <mxPoint x="130" y="274.67" as="targetPoint" />
+ </mxGeometry>
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-33" edge="1" parent="1"
source="19sVcG4BV6jm642oUi7z-9"
style="endArrow=classic;html=1;rounded=0;entryX=1;entryY=0.25;entryDx=0;entryDy=0;exitX=0.5;exitY=0;exitDx=0;exitDy=0;strokeColor=#990099;strokeWidth=2;"
value="">
+ <mxGeometry height="50" relative="1" width="50" as="geometry">
+ <Array as="points" />
+ <mxPoint x="510" y="120" as="sourcePoint" />
+ <mxPoint x="410" y="215" as="targetPoint" />
+ </mxGeometry>
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-37" parent="1"
style="ellipse;whiteSpace=wrap;html=1;aspect=fixed;gradientColor=none;fillColor=#330000;"
value="" vertex="1">
+ <mxGeometry height="10" width="10" x="797" y="18" as="geometry" />
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-38" parent="1"
style="ellipse;whiteSpace=wrap;html=1;aspect=fixed;gradientColor=none;fillColor=#330000;"
value="" vertex="1">
+ <mxGeometry height="10" width="10" x="797" y="248" as="geometry" />
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-39" parent="1"
style="ellipse;whiteSpace=wrap;html=1;aspect=fixed;gradientColor=none;fillColor=#330000;"
value="" vertex="1">
+ <mxGeometry height="10" width="10" x="457" y="249" as="geometry" />
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-41" parent="1"
style="ellipse;whiteSpace=wrap;html=1;aspect=fixed;" value="" vertex="1">
+ <mxGeometry height="10" width="10" x="627" y="17" as="geometry" />
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-43" parent="1"
style="shape=process;whiteSpace=wrap;html=1;backgroundOutline=1;" value=""
vertex="1">
+ <mxGeometry height="170" width="260" x="877" y="10" as="geometry" />
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-44" parent="1"
style="rounded=1;whiteSpace=wrap;html=1;" value="URL Length" vertex="1">
+ <mxGeometry height="30" width="180" x="917" y="20" as="geometry" />
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-45" parent="1"
style="rounded=1;whiteSpace=wrap;html=1;" value="Path Repetition" vertex="1">
+ <mxGeometry height="30" width="180" x="917" y="60" as="geometry" />
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-46" parent="1"
style="rounded=1;whiteSpace=wrap;html=1;" value="URL Normalization" vertex="1">
+ <mxGeometry height="30" width="180" x="917" y="100" as="geometry" />
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-47" parent="1"
style="rounded=1;whiteSpace=wrap;html=1;" value="MIME Type" vertex="1">
+ <mxGeometry height="30" width="180" x="917" y="140" as="geometry" />
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-48" parent="1"
style="ellipse;whiteSpace=wrap;html=1;aspect=fixed;" value="" vertex="1">
+ <mxGeometry height="10" width="10" x="1120" y="15" as="geometry" />
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-49" parent="1"
style="shape=process;whiteSpace=wrap;html=1;backgroundOutline=1;" value=""
vertex="1">
+ <mxGeometry height="170" width="260" x="877" y="190" as="geometry" />
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-50" parent="1"
style="rounded=1;whiteSpace=wrap;html=1;" value="URL Filters" vertex="1">
+ <mxGeometry height="30" width="180" x="917" y="200" as="geometry" />
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-51" parent="1"
style="rounded=1;whiteSpace=wrap;html=1;" value="Text Extraction" vertex="1">
+ <mxGeometry height="30" width="180" x="917" y="280" as="geometry" />
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-55" parent="1"
style="ellipse;whiteSpace=wrap;html=1;aspect=fixed;" value="" vertex="1">
+ <mxGeometry height="10" width="10" x="1077" y="205" as="geometry" />
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-42" parent="1"
style="ellipse;whiteSpace=wrap;html=1;aspect=fixed;gradientColor=none;fillColor=#330000;"
value="" vertex="1">
+ <mxGeometry height="10" width="10" x="1117" y="195" as="geometry" />
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-58" parent="1"
style="rounded=1;whiteSpace=wrap;html=1;" value="XPath" vertex="1">
+ <mxGeometry height="30" width="180" x="917" y="240" as="geometry" />
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-59" parent="1"
style="rounded=1;whiteSpace=wrap;html=1;" value="Enrichment" vertex="1">
+ <mxGeometry height="30" width="180" x="917" y="320" as="geometry" />
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-60" parent="1"
style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=none;align=center;verticalAlign=middle;rounded=0;rotation=-90;"
value="ParseFilters" vertex="1">
+ <mxGeometry height="30" width="60" x="857" y="260" as="geometry" />
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-61" parent="1"
style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=none;align=center;verticalAlign=middle;rounded=0;rotation=-90;"
value="URLFilters" vertex="1">
+ <mxGeometry height="30" width="60" x="859" y="80" as="geometry" />
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-63" parent="1"
style="line;strokeWidth=7;direction=south;html=1;" value="" vertex="1">
+ <mxGeometry height="350" width="10" x="857" y="10" as="geometry" />
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-66" parent="1"
style="swimlane;fontStyle=0;childLayout=stackLayout;horizontal=1;startSize=30;horizontalStack=0;resizeParent=1;resizeParentMax=0;resizeLast=0;collapsible=1;marginBottom=0;whiteSpace=wrap;html=1;"
value="Legend" vertex="1">
+ <mxGeometry height="120" width="140" x="1167" y="10" as="geometry" />
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-67" parent="19sVcG4BV6jm642oUi7z-66"
style="text;strokeColor=none;fillColor=none;align=left;verticalAlign=middle;spacingLeft=4;spacingRight=4;overflow=hidden;points=[[0,0.5],[1,0.5]];portConstraint=eastwest;rotatable=0;whiteSpace=wrap;html=1;"
value="Purple: URL Status" vertex="1">
+ <mxGeometry height="30" width="140" y="30" as="geometry" />
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-68" parent="19sVcG4BV6jm642oUi7z-66"
style="text;strokeColor=none;fillColor=none;align=left;verticalAlign=middle;spacingLeft=4;spacingRight=4;overflow=hidden;points=[[0,0.5],[1,0.5]];portConstraint=eastwest;rotatable=0;whiteSpace=wrap;html=1;"
value="Black: Data Flow" vertex="1">
+ <mxGeometry height="30" width="140" y="60" as="geometry" />
+ </mxCell>
+ <mxCell id="19sVcG4BV6jm642oUi7z-69" parent="19sVcG4BV6jm642oUi7z-66"
style="text;strokeColor=none;fillColor=none;align=left;verticalAlign=middle;spacingLeft=4;spacingRight=4;overflow=hidden;points=[[0,0.5],[1,0.5]];portConstraint=eastwest;rotatable=0;whiteSpace=wrap;html=1;"
value="Gray: URL from Frontier" vertex="1">
+ <mxGeometry height="30" width="140" y="90" as="geometry" />
+ </mxCell>
+ </root>
+ </mxGraphModel>
+ </diagram>
+</mxfile>
diff --git a/docs/src/main/asciidoc/images/stormcrawler.drawio.jpg
b/docs/src/main/asciidoc/images/stormcrawler.drawio.jpg
new file mode 100644
index 00000000..12bde527
Binary files /dev/null and
b/docs/src/main/asciidoc/images/stormcrawler.drawio.jpg differ
diff --git a/docs/src/main/asciidoc/images/stormcrawler.drawio.pdf
b/docs/src/main/asciidoc/images/stormcrawler.drawio.pdf
new file mode 100644
index 00000000..4b4e73a8
Binary files /dev/null and
b/docs/src/main/asciidoc/images/stormcrawler.drawio.pdf differ
diff --git a/docs/src/main/asciidoc/index.adoc
b/docs/src/main/asciidoc/index.adoc
new file mode 100644
index 00000000..8c99dbb5
--- /dev/null
+++ b/docs/src/main/asciidoc/index.adoc
@@ -0,0 +1,33 @@
+Apache StormCrawler 3.x - Documentation
+========================================
+Apache Software Foundation
+:doctype: article
+:toc: left
+:toclevels: 3
+:toc-position: left
+:toc-title: Apache StormCrawler 3.x - Documentation
+:numbered:
+
+////
+Licensed under the Apache License, Version 2.0 (the "License");
+You may not use this file except in compliance with the License.
+You may obtain a copy of the License at:
+https://www.apache.org/licenses/LICENSE-2.0
+////
+:imagesdir: images
+
+include::overview.adoc[]
+
+include::quick-start.adoc[]
+
+include::architecture.adoc[]
+
+include::internals.adoc[]
+
+include::configuration.adoc[]
+
+include::debugging.adoc[]
+
+include::powered-by.adoc[]
+
+include::presentations.adoc[]
diff --git a/docs/src/main/asciidoc/internals.adoc
b/docs/src/main/asciidoc/internals.adoc
new file mode 100644
index 00000000..7f552ded
--- /dev/null
+++ b/docs/src/main/asciidoc/internals.adoc
@@ -0,0 +1,443 @@
+////
+Licensed under the Apache License, Version 2.0 (the "License");
+You may not use this file except in compliance with the License.
+You may obtain a copy of the License at:
+https://www.apache.org/licenses/LICENSE-2.0
+////
+== Understanding StormCrawler's Internals
+
+=== Status Stream
+
+The Apache StormCrawler components rely on two Apache Storm streams: the
_default_ one and another one called _status_.
+
+The aim of the _status_ stream is to pass information about URLs to a
persistence layer. Typically, a bespoke bolt will take the tuples coming from
the _status_ stream and update the information about URLs in some sort of
storage (e.g., ElasticSearch, HBase, etc...), which is then used by a Spout to
send new URLs down the topology.
+
+This is critical for building recursive crawls (i.e., you discover new URLs
and not just process known ones). The _default_ stream is used for the URL
being processed and is generally used at the end of the pipeline by an indexing
bolt (which could also be ElasticSearch, HBase, etc...), regardless of whether
the crawler is recursive or not.
+
+Tuples are emitted on the _status_ stream by the parsing bolts for handling
outlinks but also to notify that there has been a problem with a URL (e.g.,
unparsable content). It is also used by the fetching bolts to handle
redirections, exceptions, and unsuccessful fetch status (e.g., HTTP code 400).
+
+A bolt which sends tuples on the _status_ stream declares its output in the
following way:
+
+[source,java]
+----
+declarer.declareStream(
+ org.apache.stormcrawler.Constants.StatusStreamName,
+ new Fields("url", "metadata", "status"));
+----
+
+As you can see for instance in
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/SimpleFetcherBolt.java#L149[SimpleFetcherBolt].
+
+The
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/persistence/Status.java[Status]
enum has the following values:
+
+* DISCOVERED:: outlinks found by the parsers or "seed" URLs emitted into the
topology by one of the
link:https://stormcrawler.net/docs/api/com/digitalpebble/stormcrawler/spout/package-summary.html[spouts]
or "injected" into the storage. The URLs can be already known in the storage.
+* REDIRECTION:: set by the fetcher bolts.
+* FETCH_ERROR:: set by the fetcher bolts.
+* ERROR:: used by either the fetcher, parser, or indexer bolts.
+* FETCHED:: set by the StatusStreamBolt bolt (see below).
+
+The difference between FETCH_ERROR and ERROR is that the former is possibly
transient whereas the latter is terminal. The bolt which is in charge of
updating the status (see below) can then decide when and whether to schedule a
new fetch for a URL based on the status value.
+
+The
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/indexing/DummyIndexer.java[DummyIndexer]
is useful for notifying the storage layer that a URL has been successfully
processed, i.e., fetched, parsed, and anything else we want to do with the main
content. It must be placed just before the StatusUpdaterBolt and sends a tuple
for the URL on the status stream with a Status value of `fetched`.
+
+The class
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/persistence/AbstractStatusUpdaterBolt.java[AbstractStatusUpdaterBolt]
can be extended to handle status updates for a specific backend. It has an
internal cache of URLs with a `discovered` status so that they don't get added
to the backend if they already exist, which is a simple but efficient
optimisation. It also uses
link:https://github.com/apache/stormcrawler/blob/main/core/src/m [...]
+
+In most cases, the extending classes will just need to implement the method
`store(String URL, Status status, Metadata metadata, Date nextFetch)` and
handle their own initialisation in `prepare()`. You can find an example of a
class which extends it in the
link:https://github.com/apache/stormcrawler/blob/main/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/persistence/StatusUpdaterBolt.java[StatusUpdaterBolt]
for OpenSearch.
+
+
+=== Bolts
+
+==== Fetcher Bolts
+
+There are actually two different bolts for fetching the content of URLs:
+
+*
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/SimpleFetcherBolt.java[SimpleFetcherBolt]
+*
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/FetcherBolt.java[FetcherBolt]
+
+Both declare the same output:
+
+[source,java]
+----
+declarer.declare(new Fields("url", "content", "metadata"));
+declarer.declareStream(
+ org.apache.stormcrawler.Constants.StatusStreamName,
+ new Fields("url", "metadata", "status"));
+----
+
+with the `StatusStream` being used for handling redirections, restrictions by
robots directives, or fetch errors, whereas the default stream gets the binary
content returned by the server as well as the metadata to the following
components (typically a parsing bolt).
+
+Both use the same xref:protocols[Protocols] implementations and
xref:urlfilters[URLFilters] to control the redirections.
+
+The **FetcherBolt** has an internal set of queues where the incoming URLs are
placed based on their hostname/domain/IP (see config `fetcher.queue.mode`) and
a number of **FetchingThreads** (config `fetcher.threads.number` – 10 by
default) which pull the URLs to fetch from the **FetchQueues**. When doing so,
they make sure that a minimal amount of time (set with `fetcher.server.delay` –
default 1 sec) has passed since the previous URL was fetched from the same
queue. This mechanism ensure [...]
+
+Incoming tuples spend very little time in the
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/FetcherBolt.java#L768[execute]
method of the **FetcherBolt** as they are put in the FetchQueues, which is why
you'll find that the value of **Execute latency** in the Storm UI is pretty
low. They get acked later on, after they've been fetched. The metric to watch
for in the Storm UI is **Process latency**.
+
+The **SimpleFetcherBolt** does not do any of this, hence its name. It just
fetches incoming tuples in its `execute` method and does not do
multi-threading. It does enforce politeness by checking when a URL can be
fetched and will wait until it is the case. It is up to the user to declare
multiple instances of the bolt in the Topology class and to manage how the URLs
get distributed across the instances of **SimpleFetcherBolt**, often with the
help of the link:https:/
+
+=== Indexer Bolts
+The purpose of crawlers is often to index web pages to make them searchable.
The project contains resources for indexing with popular search solutions such
as:
+
+*
link:https://github.com/apache/stormcrawler/blob/main/external/solr/src/main/java/com/digitalpebble/stormcrawler/solr/bolt/IndexerBolt.java[Apache
SOLR]
+*
link:https://github.com/apache/stormcrawler/blob/main/external/opensearch/src/main/java/com/digitalpebble/stormcrawler/elasticsearch/bolt/IndexerBolt.java[Elasticsearch]
+*
link:https://github.com/apache/stormcrawler/blob/main/external/aws/src/main/java/com/digitalpebble/stormcrawler/aws/bolt/CloudSearchIndexerBolt.java[AWS
CloudSearch]
+
+All of these extend the class
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/indexing/AbstractIndexerBolt.java[AbstractIndexerBolt].
+
+The core module also contains a
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/indexing/StdOutIndexer.java[simple
indexer] which dumps the documents into the standard output – useful for
debugging – as well as a
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/indexing/DummyIndexer.java[DummyIndexer].
+
+The basic functionalities of filtering a document to index, mapping the
metadata (which determines which metadata to keep for indexing and under what
field name), or using the canonical tag (if any) are handled by the abstract
class. This allows implementations to focus on communication with the indexing
APIs.
+
+Indexing is often the penultimate component in a pipeline and takes the output
of a Parsing bolt on the standard stream. The output of the indexing bolts is
on the _status_ stream:
+
+[source,java]
+----
+public void declareOutputFields(OutputFieldsDeclarer declarer) {
+ declarer.declareStream(
+ org.apache.stormcrawler.Constants.StatusStreamName,
+ new Fields("url", "metadata", "status"));
+}
+----
+
+The
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/indexing/DummyIndexer.java[DummyIndexer]
is used for cases where no actual indexing is required. It simply generates a
tuple on the _status_ stream so that any StatusUpdater bolt knows that the URL
was processed successfully and can update its status and scheduling in the
corresponding backend.
+
+You can easily build your own custom indexer to integrate with other storage
systems, such as a vector database for semantic search, a graph database for
network analysis, or any other specialized data store. By extending
AbstractIndexerBolt, you only need to implement the logic to communicate with
your target system, while StormCrawler handles the rest of the pipeline and
status updates.
+
+=== Parser Bolts
+==== JSoupParserBolt
+
+The
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/JSoupParserBolt.java[JSoupParserBolt]
can be used to parse HTML documents and extract the outlinks, text, and
metadata it contains. If you want to parse non-HTML documents, use the
link:https://github.com/apache/stormcrawler/tree/main/external/src/main/java/com/digitalpebble/storm/crawler/tika[Tika-based
ParserBolt] from the external modules.
+
+This parser calls the xref:urlfilters[URLFilters] and
xref:parsefilters[ParseFilters] defined in the configuration. Please note that
it calls xref:metadatatransfer[MetadataTransfer] prior to calling the
xref:parsefilters[ParseFilters]. If you create new Outlinks in your
xref:parsefilters[ParseFilters], you'll need to make sure that you use MetadataTransfer there
to inherit the Metadata from the parent document.
+
+The **JSoupParserBolt** automatically identifies the charset of the documents.
It uses the _status_ stream to report parsing errors but also
for the outlinks it extracts from a page. These would typically be used by an
extension of
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/persistence/AbstractStatusUpdaterBolt.java[AbstractStatusUpdaterBolt]
and persisted in some form of storage.
+
+==== SiteMapParserBolt
+StormCrawler can handle sitemap files thanks to the **SiteMapParserBolt**.
This bolt should be placed before the standard **ParserBolt** in the topology,
as illustrated in
link:https://github.com/apache/stormcrawler/blob/main/archetype/src/main/resources/archetype-resources/src/main/java/CrawlTopology.java[CrawlTopology].
+
+The reason for this is that the **SiteMapParserBolt** acts as a filter: it
passes on any incoming tuples to the default stream so that they get processed
by the **ParserBolt**, unless the tuple contains `isSitemap=true` in its
metadata, in which case the **SiteMapParserBolt** will parse it itself. Any
outlinks found in the sitemap files are then emitted on the _status_ stream.
+
+The **SiteMapParserBolt** applies any configured
xref:parsefilters[ParseFilters] to the documents it parses and, just like its
equivalent for HTML pages, it uses xref:metadatatransfer[MetadataTransfer] to
populate the Metadata objects for the Outlinks it finds.
+
+=== Filters
+
+[[parsefilters]]
+==== Parse Filters
+
+ParseFilters are called from parsing bolts such as
link:https://github.com/apache/stormcrawler/wiki/JSoupParserBolt[JSoupParserBolt]
and
link:https://github.com/apache/stormcrawler/wiki/SiteMapParserBolt[SiteMapParserBolt]
to extract data from web pages. The extracted data is stored in the Metadata
object. ParseFilters can also modify the Outlinks and, in that sense, act as
URLFilters.
+
+ParseFilters need to implement the interface
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/parse/ParseFilter.java[ParseFilter],
which defines three methods:
+
+[source,java]
+----
+public void filter(String URL, byte[] content, DocumentFragment doc,
ParseResult parse);
+
+public void configure(Map stormConf, JsonNode filterParams);
+
+public boolean needsDOM();
+----
+
+* The `filter` method is where the extraction occurs. ParseResult objects
contain the outlinks extracted from the document as well as a Map of String to
ParseData, where the String is the URL of a subdocument or the main document
itself. ParseData objects contain Metadata, binary content, and text for the
subdocuments, which is useful for indexing subdocuments independently of the
main document.
+* The `needsDOM` method indicates whether the ParseFilter instance requires
the DOM structure. If no ParseFilters need it, the parsing bolt will skip
generating the DOM, slightly improving performance.
+* The `configure` method takes a JSON object loaded by the wrapper class
ParseFilters. The Storm configuration map can also be used to configure the
filters, as described in xref:configuration.adoc[Configuration].
+
+Here is the default
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/resources/parsefilters.json[JSON
configuration file] for ParseFilters. The configuration allows multiple
instances of the same filter class with different parameters and supports
complex parameter objects. ParseFilters are executed in the order they appear
in the JSON file.
+
+===== Provided ParseFilters
+
+* **CollectionTagger** –
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/parse/filter/CollectionTagger.java[CollectionTagger]
assigns one or more tags to the metadata of a document based on URL patterns
defined in a JSON resource file. The resource file supports both include and
exclude regular expressions:
+
+[source,json]
+----
+{
+ "collections": [
+ {
+ "name": "stormcrawler",
+ "includePatterns": ["https://stormcrawler.net/.+"]
+ },
+ {
+ "name": "crawler",
+ "includePatterns": [".+crawler.+", ".+nutch.+"],
+ "excludePatterns": [".+baby.+", ".+spider.+"]
+ }
+ ]
+}
+----
+
+* **CommaSeparatedToMultivaluedMetadata** –
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/parse/filter/CommaSeparatedToMultivaluedMetadata.java[CommaSeparatedToMultivaluedMetadata]
rewrites single metadata values containing comma-separated entries into
multiple values for the same key, useful for keyword tags.
+* **DebugParseFilter** –
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/parse/filter/DebugParseFilter.java[DebugParseFilter]
dumps an XML representation of the DOM structure to a temporary file.
+* **DomainParseFilter** –
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/parse/filter/DomainParseFilter.java[DomainParseFilter]
stores the domain or host name in the metadata for later indexing.
+* **LDJsonParseFilter** –
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/parse/filter/LDJsonParseFilter.java[LDJsonParseFilter]
extracts data from JSON-LD representations.
+* **LinkParseFilter** –
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/parse/filter/LinkParseFilter.java[LinkParseFilter]
extracts outlinks from documents using XPath expressions defined in the
configuration.
+* **MD5SignatureParseFilter** –
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/parse/filter/MD5SignatureParseFilter.java[MD5SignatureParseFilter]
generates an MD5 signature of a document based on the binary content, text, or
URL (as a last resort). It can be combined with content filtering to exclude
boilerplate text.
+* **MimeTypeNormalization** –
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/parse/filter/MimeTypeNormalization.java[MimeTypeNormalization]
converts server-reported or inferred mime-type values into human-readable
values such as _pdf_, _html_, or _image_ and stores them in the metadata,
useful for indexing and filtering search results.
+* **XPathFilter** –
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/parse/filter/XPathFilter.java[XPathFilter]
allows extraction of data using XPath expressions and storing them in the
Metadata object.
+
+You can also implement custom ParseFilters to extend the capabilities of the
parsing pipeline. For example, you might create a filter to enrich a document's
metadata with additional information, such as language detection, sentiment
analysis, named entity recognition, or custom tags extracted from the content.
Custom filters can also modify or remove outlinks, normalize text, or integrate
external data sources, allowing you to tailor the crawler to your specific
processing or indexing re [...]
+By implementing the ParseFilter interface and configuring the filter in the
JSON file, your custom logic will be seamlessly executed within the parsing
bolt.
+
+[[urlfilters]]
+==== URL Filters
+
+The URL filters can be used to remove or modify incoming URLs (unlike
Nutch where these functionalities are separated between URLFilters and
URLNormalizers). This is generally used within a parsing bolt to normalize and
filter outgoing URLs, but is also called within the FetcherBolt to handle
redirections.
+
+URLFilters need to implement the interface
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/filtering/URLFilter.java[URLFilter]
which defines a single method:
+
+[source, java]
+----
+public String filter(URL sourceUrl, Metadata sourceMetadata,
+ String urlToFilter);
+----
+
+and inherits a default one from
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/util/Configurable.java[Configurable]:
+
+[source, java]
+----
+public void configure(Map stormConf, JsonNode jsonNode);
+----
+
+The configuration is done via a JSON file which is loaded by the wrapper class
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/filtering/URLFilters.java[URLFilters].
The URLFilter instances can be used directly, but it is easier to use the
class URLFilters instead. Some filter implementations can also be configured
with the
link:https://github.com/apache/stormcrawler/wiki/Configuration[standard
configuration mechanism].
+
+Here is an example of a
link:https://github.com/apache/stormcrawler/blob/main/archetype/src/main/resources/archetype-resources/src/main/resources/urlfilters.json[JSON
configuration file].
+
+The JSON configuration allows loading several instances of the same filtering
class with different parameters and can handle complex configuration objects
since it makes no assumptions about the content of the field `param`. The
URLFilters are executed in the order in which they are defined in the JSON file.
+
+===== Built-in URL Filters
+
+====== Basic
+The
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/filtering/basic/BasicURLFilter.java[BasicURLFilter]
filters based on the length of the URL and the repetition of path elements.
+
+The
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/filtering/basic/BasicURLNormalizer.java[BasicURLNormalizer]
removes the anchor part of URLs based on the value of the parameter
`removeAnchorPart`. It also removes query elements based on the configuration
and whether their value corresponds to a 32-bit hash.
+
+====== FastURLFilter
+The
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/filtering/regex/FastURLFilter.java[FastURLFilter]
is based on regex patterns and organized by scope (host | domain | metadata |
global). For a given URL, the scopes are tried in the order given above and the
URL is kept or removed based on the first matching rule. The default policy is
to accept a URL if no match is found.
+
+The resource file is in JSON and looks like this:
+
+[source,json]
+----
+[{
+ "scope": "GLOBAL",
+ "patterns": [
+ "DenyPathQuery \\.jpg"
+ ]
+ },
+ {
+ "scope": "domain:stormcrawler.net",
+ "patterns": [
+ "AllowPath /digitalpebble/",
+ "DenyPath .+"
+ ]
+ },
+ {
+ "scope": "metadata:key=value",
+ "patterns": [
+ "DenyPath .+"
+ ]
+}]
+----
+
+_DenyPathQuery_ indicates that the pattern should be applied on the URL
path and the query element, whereas _DenyPath_ means the path alone.
+
+====== Host
+The
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/filtering/host/HostURLFilter.java[HostURLFilter]
filters URLs based on whether they belong to the same host or domain name as
the source URL. This is configured with the parameters `ignoreOutsideDomain`
and `ignoreOutsideHost`. The latter takes precedence over the former.
+
+====== MaxDepth
+The
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/filtering/depth/MaxDepthFilter.java[MaxDepthFilter]
is configured with the parameter `maxDepth` and requires
`metadata.track.depth` to be set to true in the Configuration. This removes
outlinks found too far from the seed URL and controls the expansion of the
crawl.
+
+If the filter is configured with a value of 0, all outlinks will be removed,
regardless of whether the depth is being tracked.
+
+The max depth can also be set on a per-seed basis using the key/value
`max.depth`, which is automatically transferred to the outlinks if
`metadata.track.depth` is set to true.
+
+====== Metadata
+The
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/filtering/metadata/MetadataFilter.java[MetadataFilter]
filters URLs based on metadata in the source document.
+
+====== RegexURLFilter
+The
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/filtering/regex/RegexURLFilter.java[RegexURLFilter]
uses a configuration file or a JSON ArrayNode containing regular expressions
to determine whether a URL should be kept or not. The most specific rule must
be placed first as a URL is kept or removed based on the first matching rule.
+
+[source,json]
+----
+{
+ "urlFilters": [
+ "-^(file|ftp|mailto):",
+ "+."
+ ]
+}
+----
+
+====== RegexURLNormalizer
+The
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/filtering/regex/RegexURLNormalizer.java[RegexURLNormalizer]
uses a configuration file or a JSON ArrayNode containing regular expressions
and replacements to normalize URLs.
+
+[source,json]
+----
+{
+ "urlNormalizers": [
+ {
+ "pattern": "#.*?(\\?|&|$)",
+ "substitution": "$1"
+ },
+ {
+ "pattern": "\\?&",
+ "substitution": "\\?"
+ }
+ ]
+}
+----
+
+====== RobotsFilter
+The
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/filtering/robots/RobotsFilter.java[RobotsFilter]
discards URLs based on the robots.txt directives. This is meant for small,
limited crawls where the number of hosts is finite. Using this on a larger or
open crawl would impact performance as the filter tries to retrieve the
robots.txt files for any host found.
+
+====== SitemapFilter
+The
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/filtering/sitemap/SitemapFilter.java[SitemapFilter]
discards the outlinks of URLs which are not sitemaps when sitemaps have been
found.
+
+[[metadatatransfer]]
+=== Metadata Transfer
+
+The class
https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/util/MetadataTransfer.java[MetadataTransfer]
is an important part of the framework and is used in key parts of a pipeline.
+
+* Fetching
+* Parsing
+* Updating bolts
+
+An instance (or extension) of MetadataTransfer gets created and configured
with the method `public static MetadataTransfer getInstance(Map++<++String,
Object++>++ conf)`, which takes the standard Apache Storm configuration as
its parameter.
+
+A *MetadataTransfer* instance has mainly two methods, both returning Metadata
objects:
+
+* `getMetaForOutlink(String targetURL, String sourceURL, Metadata
parentMD)`
+* `filter(Metadata metadata)`
+
+The former is used when creating
https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/parse/Outlink.java[Outlinks],
i.e. in the parsing bolts, but also for handling redirections in the
FetcherBolt(s).
+The latter is used by extensions of the
https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/persistence/AbstractStatusUpdaterBolt.java[AbstractStatusUpdaterBolt]
class to determine which *Metadata* should be persisted.
+
+The behavior of the default MetadataTransfer class is driven by configuration
only. It has the following options.
+
+* `metadata.transfer` list of metadata key values to filter or transfer to the
outlinks. See
https://github.com/apache/stormcrawler/blob/main/core/src/main/resources/crawler-default.yaml#L23[crawler-default.yaml]
+* `metadata.persist` list of metadata key values to persist in the status
storage. See
https://github.com/apache/stormcrawler/blob/main/core/src/main/resources/crawler-default.yaml#L28[crawler-default.yaml]
+* `metadata.track.path` whether to track the URL path or not. Boolean value,
true by default.
+* `metadata.track.depth` whether to track the depth from seed. Boolean value,
true by default.
+
+Note that the method `getMetaForOutlink` calls `filter` to determine which
key values to keep.
+
+[[protocols]]
+=== Protocols
+
+StormCrawler supports multiple *network protocols* for fetching content from
various sources on the web.
+Each protocol implementation defines how the crawler connects to a resource,
sends requests, and handles responses such as status codes, headers, and
content streams.
+
+Protocols are a key part of the fetching process and are used by
StormCrawler’s *bolts* to retrieve data from remote servers.
+While HTTP and HTTPS are the most commonly used, other protocols like `file:`
are also supported for local or distributed filesystem access.
+
+Use these configurations to fine-tune fetching performance, authentication,
connection handling, and protocol-level optimizations across your crawler
topology.
+
+==== Network Protocols
+
+The following network protocols are implemented in StormCrawler:
+
+===== File
+*
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/protocol/file/FileProtocol.java[FileProtocol]
+
+===== HTTP/S
+
+See <<_metadata_dependent_behavior_for_http_protocols,Metadata-dependent Behavior For HTTP Protocols>> for the effect of metadata content on protocol behaviour.
+
+To change the implementation, add the following lines to your
_crawler-conf.yaml_:
+
+[source,yaml]
+----
+http.protocol.implementation:
"org.apache.stormcrawler.protocol.okhttp.HttpProtocol"
+https.protocol.implementation:
"org.apache.stormcrawler.protocol.okhttp.HttpProtocol"
+----
+
+*
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/protocol/httpclient/HttpProtocol.java[HttpClient]
+*
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/protocol/selenium/SeleniumProtocol.java[Selenium]
+*
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/protocol/okhttp/HttpProtocol.java[OKHttp]
+
+==== Feature grid
+
+[cols="2,1,1,1", options="header"]
+|===
+| Features | HTTPClient | OKHttp | Selenium
+
+| Basic authentication |
link:https://github.com/apache/stormcrawler/pull/589[Y] |
link:https://github.com/apache/stormcrawler/issues/792[Y] | N
+| Proxy (w. credentials?) | Y / Y | Y /
link:https://github.com/apache/stormcrawler/issues/751[Y] | ?
+| Interruptible / trimmable
link:https://github.com/apache/stormcrawler/issues/463[#463] | N / Y | Y / Y |
Y / N
+| Cookies | Y | link:https://github.com/apache/stormcrawler/issues/632[Y] | N
+| Response headers | Y | Y | N
+| Trust all certificates | N |
link:https://github.com/apache/stormcrawler/issues/615[Y] | N
+| HEAD method | link:https://github.com/apache/stormcrawler/issues/485[Y] |
link:https://github.com/apache/stormcrawler/pull/923[Y] | N
+| POST method | N | link:https://github.com/apache/stormcrawler/issues/641[Y]
| N
+| Verbatim response header |
link:https://github.com/apache/stormcrawler/issues/317[Y] |
link:https://github.com/apache/stormcrawler/issues/506[Y] | N
+| Verbatim request header | N |
link:https://github.com/apache/stormcrawler/issues/506[Y] | N
+| IP address capture | N |
link:https://github.com/apache/stormcrawler/pull/691[Y] | N
+| Navigation and javascript | N | N | Y
+| HTTP/2 | N | Y | (Y)
+| Configurable connection pool | N |
link:https://github.com/apache/stormcrawler/issues/918[Y] | N
+|===
+
+==== HTTP/2
+
+* The
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/protocol/okhttp/HttpProtocol.java[OKHttp]
protocol supports link:https://en.wikipedia.org/wiki/HTTP/2[HTTP/2] if the JDK
includes
link:https://en.wikipedia.org/wiki/Application-Layer_Protocol_Negotiation[ALPN]
(Java 9 and upwards or Java 8 builds starting early/mid 2020).
+*
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/protocol/httpclient/HttpProtocol.java[HttpClient]
does not yet support HTTP/2.
+*
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/protocol/selenium/SeleniumProtocol.java[Selenium]:
whether HTTP/2 is used depends on the driver in use.
+
+Since link:https://github.com/apache/stormcrawler/pull/829[#829], the HTTP
protocol version used is configurable via `http.protocol.versions` (see also
comments in
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/resources/crawler-default.yaml[crawler-default.yaml]).
+
+For example, to force that only HTTP/1.1 is used:
+
+[source,yaml]
+----
+http.protocol.versions:
+- "http/1.1"
+----
+
+==== Metadata-dependent Behavior For HTTP Protocols
+
+The `metadata` argument to
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/protocol/Protocol.java#L53[HTTPProtocol.getProtocolOutput()]
can affect the behavior of the protocol. The following metadata keys are
detected by `HTTPProtocol` implementations and utilized in performing the
request:
+
+* `last-modified`: If this key is present in `metadata`, the protocol will use
the metadata value as the date for the `If-Modified-Since` header field of the
HTTP request. If the key is not present, the `If-Modified-Since` field won't be
added to the request header.
+
+* `protocol.etag`: If this key is present in `metadata`, the protocol will use
the metadata value as the ETag for the `If-None-Match` header field of the HTTP
request. If the key is not present, the `If-None-Match` field won't be added to
the request header.
+
+* `http.accept`: If this key is present in `metadata`, the protocol will use
the value to override the value for the `Accept` header field of the HTTP
request. If the key is not present, the `http.accept` global configuration
value is used instead. (Available in v1.11+)
+
+* `http.accept.language`: If this key is present in `metadata`, the protocol
will use the value to override the value for the `Accept-Language` header field
of the HTTP request. If the key is not present, the `http.accept.language`
global configuration value is used instead. (Available in v1.11+)
+
+* `protocol.set-cookie`: If this key is present in `metadata` and
`http.use.cookies` is true, the protocol will send cookies stored from the
response this page was linked to, given the cookie is applicable to the domain
of the link.
+
+* `http.method.head`: If this key is present in `metadata`, the protocol sends
a HEAD request. (Available in v1.12+ only for httpclient, see
link:https://github.com/apache/stormcrawler/issues/485[#485])
+
+* `http.post.json`: If this key is present in `metadata`, the protocol sends a
POST request. (Available in v1.12+ only for okhttp, see
link:https://github.com/apache/stormcrawler/issues/641[#641])
+
+* `protocol.set-header`: If this key is present in metadata, the protocol
adds the specified headers to the request. See
link:https://github.com/apache/stormcrawler/pull/993[#993]
+
+Example:
+
+[source,json]
+----
+"protocol%2Eset-header": [
+ "header1=value1",
+ "header2=value2"
+]
+----
+
+Notes:
+
+* Metadata values starting with `protocol.` may start with a different prefix
instead. See `protocol.md.prefix` and
link:https://github.com/apache/stormcrawler/issues/776[#776].
+* Metadata used for requests needs to be persisted. For example:
+
+[source,yaml]
+----
+metadata.persist:
+ - last-modified
+ - protocol.etag
+ - protocol.set-cookie
+ - ...
+----
+
+* Cookies need to be transferred to outlinks by setting:
+
+[source,yaml]
+----
+metadata.transfer:
+ - set-cookie
+----
+
+
diff --git a/docs/src/main/asciidoc/overview.adoc
b/docs/src/main/asciidoc/overview.adoc
new file mode 100644
index 00000000..50df79d5
--- /dev/null
+++ b/docs/src/main/asciidoc/overview.adoc
@@ -0,0 +1,53 @@
+////
+Licensed under the Apache License, Version 2.0 (the "License");
+You may not use this file except in compliance with the License.
+You may obtain a copy of the License at:
+https://www.apache.org/licenses/LICENSE-2.0
+////
+== Overview
+
+Apache StormCrawler is an open source collection of resources for building
low-latency, scalable web crawlers on link:https://storm.apache.org/[Apache
Storm]. It is provided under the
link:https://www.apache.org/licenses/LICENSE-2.0[Apache License] and is written
mostly in Java.
+
+The aims of StormCrawler are to help build web crawlers that are:
+
+* Scalable
+* Low latency
+* Easy to extend
+* Polite yet efficient
+
+StormCrawler is both a library and a collection of reusable components
designed to help developers build custom web crawlers with ease.
+Getting started is simple — the Maven archetypes allow you to quickly scaffold
a new project, which you can then adapt to fit your specific needs.
+
+In addition to its core modules, StormCrawler offers a range of external
resources that can be easily integrated into your project.
+These include spouts and bolts for OpenSearch, as well as a ParserBolt that
leverages Apache Tika to handle various document formats and many more.
+
+StormCrawler is well-suited for scenarios where URLs to fetch and parse arrive
as continuous streams, but it also performs exceptionally in large-scale,
recursive crawls where low latency is essential.
+The project is actively maintained, widely adopted in production environments,
and supported by an engaged community.
+
+You can find links to recent talks and demos later in this document,
showcasing real-world applications and use cases.
+
+== Key Features
+
+Here is a short list of provided features:
+
+* Integration with
link:https://github.com/crawler-commons/url-frontier[URLFrontier] for
distributed URL management
+* Pluggable components (Spouts and Bolts from
link:https://storm.apache.org/[Apache Storm]) for flexibility and modularity —
adding custom components is straightforward
+* Support for link:https://tika.apache.org/[Apache Tika] for document parsing
via `ParserBolt`
+* Integration with link:https://opensearch.org/[OpenSearch] and
link:https://solr.apache.org/[Apache Solr] for indexing and status storage
+* Option to store crawled data as WARC (Web ARChive) files
+* Support for headless crawling using link:https://playwright.dev/[Playwright]
+* Support for LLM-based advanced text extraction
+* Proxy support for distributed and controlled crawling
+* Flexible and pluggable filtering mechanisms:
+** URL Filters for pre-fetch filtering
+** Parse Filters for post-fetch content filtering
+* Built-in support for crawl metrics and monitoring
+* Configurable politeness policies (e.g., crawl delay, user agent management)
+* Robust HTTP fetcher based on link:https://hc.apache.org/[Apache
HttpComponents] or link:https://square.github.io/okhttp/[OkHttp].
+* MIME type detection and response-based filtering
+* Support for parsing and honoring `robots.txt` and sitemaps
+* Stream-based, real-time architecture using
link:https://storm.apache.org/[Apache Storm] — suitable for both recursive and
one-shot crawling tasks
+* Can run in both local and distributed environments
+* Apache Maven archetypes for quickly bootstrapping new crawler projects
+* Actively developed and used in production by xref:poweredby[multiple
organizations]
+
diff --git a/docs/src/main/asciidoc/powered-by.adoc
b/docs/src/main/asciidoc/powered-by.adoc
new file mode 100644
index 00000000..fa12f0c7
--- /dev/null
+++ b/docs/src/main/asciidoc/powered-by.adoc
@@ -0,0 +1,40 @@
+////
+Licensed under the Apache License, Version 2.0 (the "License");
+You may not use this file except in compliance with the License.
+You may obtain a copy of the License at:
+https://www.apache.org/licenses/LICENSE-2.0
+////
+[[poweredby]]
+== Companies & Projects Using StormCrawler
+
+Apache StormCrawler has been adopted by a wide variety of organizations across
industries, from startups to large enterprises and research institutions.
+The following is a non-exhaustive list of companies, projects, and
institutions that have used Apache StormCrawler in production or research.
+If your organization is also making use of Apache StormCrawler, we’d love to
hear from you!
+
+* link:https://www.careerbuilder.com/[CareerBuilder]
+* link:https://www.stolencamerafinder.com/[StolenCameraFinder]
+* link:https://www.weborama.com/[Weborama]
+* link:https://www.ontopic.io/[Ontopic]
+* link:https://www.shopstyle.com/[ShopStyle]
+* link:https://www.wombatsoftware.de/[Wombat Software]
+* link:https://commoncrawl.org/2016/10/news-dataset-available/[CommonCrawl]
+* link:https://webfinery.com/[WebFinery]
+* link:https://www.reportlinker.com/[ReportLinker]
+* link:https://www.tokenmill.lt/[TokenMill]
+* link:https://www.polecat.com/[Polecat]
+* link:https://www.wizenoze.com/en/[WizeNoze]
+* link:https://iproduct.io/[IProduct.io]
+* link:https://www.cgi.com/[CGI]
+* link:https://github.com/miras-tech/MirasText[MirasText]
+* link:https://www.g2webservices.com/[G2 Web Services]
+* link:https://www.gov.nt.ca/[Government of Northwest Territories]
+*
link:https://digitalpebble.blogspot.com/2019/02/meet-stormcrawler-users-q-with-pixray.html[Pixray]
+* link:https://www.cameraforensics.com/[CameraForensics]
+* link:https://gagepiracy.com/[Gage Piracy]
+* link:https://www.clarin.eu/[Clarin ERIC]
+* link:https://openwebsearch.eu/owler/[OpenWebSearch]
+* link:https://shc-info.zml.hs-heilbronn.de/[Heilbronn University]
+* link:https://www.contexity.com[Contexity]
+*
link:https://www.kodis.iao.fraunhofer.de/de/projekte/SPIDERWISE.html[Fraunhofer
IAO - KODIS]
+
+Drop us a line at
mailto:[email protected][[email protected]] if you want
to be added to this page.
diff --git a/docs/src/main/asciidoc/presentations.adoc
b/docs/src/main/asciidoc/presentations.adoc
new file mode 100644
index 00000000..4d7f0e96
--- /dev/null
+++ b/docs/src/main/asciidoc/presentations.adoc
@@ -0,0 +1,28 @@
+////
+Licensed under the Apache License, Version 2.0 (the "License");
+You may not use this file except in compliance with the License.
+You may obtain a copy of the License at:
+https://www.apache.org/licenses/LICENSE-2.0
+////
+== Talks, tutorials, articles and interviews
+
+* **OWler: preliminary results for building a collaborative open web
crawler**, Dinzinger and al., OSSYM 2023
link:https://ca-roll.github.io/downloads/owler.pdf[]
+* **Presentation: StormCrawler and OpenSearch** (Feb 2023)
link:https://www.youtube.com/watch?v=azHYI9pnjos[]
+* **Crawling the German Health Web: Exploratory Study and Graph Analysis**,
Zowalla R, Wetter T, Pfeifer D. J Med Internet Res 2020;22(7):e17853
link:https://www.jmir.org/2020/7/e17853[]
+* **Tutorial: StormCrawler 1.16 + Elasticsearch 7.5.0**
link:https://youtu.be/8kpJLPdhvLw[]
+* **StormCrawler open source web crawler strengthened by Elasticsearch,
Kibana**
link:https://www.elastic.co/blog/stormcrawler-open-source-web-crawler-strengthened-by-elasticsearch-kibana[]
+* **Harvesting Online Health Information**, Richard Zowalla
link:https://www.slideshare.net/secret/v8Y0qFlGBk7IbB[slides]
+* _**Tutorial: StormCrawler 1.10 + Apache SOLR 4.7.0**_
link:https://youtu.be/F8nvGj03XLo[]
+* **Patent-Crawler: A Web Crawler to Gather Virtual Patent Marking
Information**, Etienne Orliac, l’Université de Lausanne/École Polytechnique
Fédérale de Lausanne (UNIL/EPFL)
link:https://www.hpcadvisorycouncil.com/events/2018/swiss-workshop/pdf/Weds11April/Orliac_PatentCrawler_Wed110418.pdf[slides]
- link:https://youtu.be/2v6Y_3Q0vT0[video]
+* _**DigitalPebble's Blog: Crawl dynamic content with Selenium and
StormCrawler**_
link:https://digitalpebble.blogspot.co.uk/2017/04/crawl-dynamic-content-with-selenium-and.html[]
+* _**The Battle of the Crawlers: Apache Nutch vs. StormCrawler**_
link:https://dzone.com/articles/the-battle-of-the-crawlers-apache-nutch-vs-stormcr[]
+* _**Tutorial: StormCrawler + Elasticsearch + Kibana**_
link:https://digitalpebble.blogspot.co.uk/2017/04/video-tutorial-stormcrawler.html[]
+* _**Q&A with InfoQ**_
link:https://www.infoq.com/news/2016/12/nioche-stormcrawler-web-crawler[]
+* _**DigitalPebble's Blog: Index the web with StormCrawler (revisited)**_
link:https://digitalpebble.blogspot.co.uk/2016/09/index-web-with-stormcrawler-revisited.html[]
+* _**DigitalPebble's Blog: Index the web with AWS CloudSearch**_
link:https://digitalpebble.blogspot.co.uk/2015/09/index-web-with-aws-cloudsearch.html[]
+* _**Low latency scalable web crawling on Apache Storm**_
link:https://www.slideshare.net/digitalpebble/j-nioche-berlinbuzzwords20150601[slides]
- link:https://t.co/A3bRKroDn3[video], by Julien Nioche. BerlinBuzzwords 2015
+* _**Storm Crawler: A real-time distributed web crawling and monitoring
framework**_
link:https://www.slideshare.net/ontopic/storm-crawler-apacheconna2015[slides],
by Jake Dodd - link:https://www.ontopic.io/[Ontopic], ApacheCon North America
2015
+* _**A quick introduction to Storm Crawler**_
link:https://www.slideshare.net/digitalpebble/j-nioche-apacheconeu2014fastfeather[slides],
by Julien Nioche. ApacheCon Europe, Budapest, Nov 2014
+* _**StormCrawler in the wild**_
link:https://www.slideshare.net/digitalpebble/storm-crawler-ontopic20141113[slides],
by Jake Dodd - link:https://www.ontopic.io/[Ontopic], ApacheCon Europe,
Budapest, Nov 2014
+
+Drop us a line at
mailto:[email protected][[email protected]] if you want
to be added to this page.
diff --git a/docs/src/main/asciidoc/quick-start.adoc
b/docs/src/main/asciidoc/quick-start.adoc
new file mode 100644
index 00000000..b3f6f89b
--- /dev/null
+++ b/docs/src/main/asciidoc/quick-start.adoc
@@ -0,0 +1,211 @@
+////
+Licensed under the Apache License, Version 2.0 (the "License");
+You may not use this file except in compliance with the License.
+You may obtain a copy of the License at:
+https://www.apache.org/licenses/LICENSE-2.0
+////
+== Quick Start
+
+These instructions should help you get Apache StormCrawler up and running in 5
to 15 minutes.
+
+=== Prerequisites
+
+To run StormCrawler, you will need Java SE 17 or later.
+
+Additionally, since we'll be running the required Apache Storm cluster using
Docker Compose,
+make sure Docker is installed on your operating system.
+
+=== Terminology
+
+Before starting, we will give a quick overview of **central** Storm concepts
and terminology that you need to know before starting with StormCrawler:
+
+- *Topology*: A topology is the overall data processing graph in Storm,
consisting of spouts and bolts connected together to perform continuous,
real-time computations.
+
+- *Spout*: A spout is a source component in a Storm topology that emits
streams of data into the processing pipeline.
+
+- *Bolt*: A bolt processes, transforms, or routes data streams emitted by
spouts or other bolts within the topology.
+
+- *Flux*: In Apache Storm, Flux is a declarative configuration framework that
enables you to define and run Storm topologies using YAML files instead of
writing Java code. This simplifies topology management and deployment.
+
+- *Frontier*: In the context of a web crawler, the Frontier is the component
responsible for managing and prioritizing the list of URLs to be fetched next.
+
+- *Seed*: In web crawling, a seed is an initial URL or set of URLs from which
the crawler starts its discovery and fetching process.
+
+=== Bootstrapping a StormCrawler Project
+
+You can quickly generate a new StormCrawler project using the Maven archetype:
+
+[source,shell]
+----
+mvn archetype:generate -DarchetypeGroupId=org.apache.stormcrawler \
+ -DarchetypeArtifactId=stormcrawler-archetype \
+ -DarchetypeVersion=<CURRENT_VERSION>
+----
+
+Be sure to replace `<CURRENT_VERSION>` with the latest released version of
StormCrawler, which you can find on
link:https://search.maven.org/artifact/org.apache.stormcrawler/stormcrawler-archetype[search.maven.org].
+
+During the process, you’ll be prompted to provide the following:
+
+* `groupId` (e.g. `com.mycompany.crawler`)
+* `artifactId` (e.g. `stormcrawler`)
+* Version
+* Package name
+* User agent details
+
+IMPORTANT: Specifying a user agent is important for crawler ethics because it
identifies your crawler to websites, promoting transparency and allowing site
owners to manage or block requests if needed. Be sure to provide a crawler
information website as well.
+
+The archetype will generate a fully-structured project including:
+
+* A pre-configured `pom.xml` with the necessary dependencies
+* Default resource files
+* A sample `crawler.flux` configuration
+* A basic configuration file
+
+After generation, navigate into the newly created directory (named after the
`artifactId` you specified).
+
+TIP: You can learn more about the architecture and how each component works
together if you look into link:architecture.adoc[the architecture
documentation].
+By exploring that part of the documentation, you can gain a better
understanding of how StormCrawler performs crawling and how bolts, spouts, as
well as parse and URL filters, collaborate in the process.
+
+==== Docker Compose Setup
+
+Below is a simple `docker-compose.yaml` configuration to spin up URLFrontier,
Zookeeper, Storm Nimbus, Storm Supervisor, and the Storm UI:
+
+[source,yaml]
+----
+services:
+ zookeeper:
+ image: zookeeper:3.9.3
+ container_name: zookeeper
+ restart: always
+
+ nimbus:
+ image: storm:latest
+ container_name: nimbus
+ hostname: nimbus
+ command: storm nimbus
+ depends_on:
+ - zookeeper
+ restart: always
+
+ supervisor:
+ image: storm:latest
+ container_name: supervisor
+ command: storm supervisor -c worker.childopts=-Xmx%HEAP-MEM%m
+ depends_on:
+ - nimbus
+ - zookeeper
+ restart: always
+
+ ui:
+ image: storm:latest
+ container_name: ui
+ command: storm ui
+ depends_on:
+ - nimbus
+ restart: always
+ ports:
+ - "127.0.0.1:8080:8080"
+
+ urlfrontier:
+ image: crawlercommons/url-frontier:latest
+ container_name: urlfrontier
+ restart: always
+ ports:
+ - "127.0.0.1:7071:7071"
+----
+
+Notes:
+
+- This example Docker Compose uses the official Apache Storm and Apache
Zookeeper images.
+- URLFrontier is an additional service used by StormCrawler to act as the
Frontier. Please note that we also offer other Frontier implementations like
OpenSearch or Apache Solr.
+- Ports may need adjustment depending on your environment.
+- The Storm UI runs on port 8080 by default.
+- Ensure network connectivity between services; Docker Compose handles this by
default.
+
+After setting up your Docker Compose, you should start it up:
+
+[source,shell]
+----
+docker compose up -d
+----
+
+Check the logs to see if every service is up and running:
+
+[source,shell]
+----
+docker compose logs -f
+----
+
+Next, access the Storm UI via `http://localhost:8080` and check that both a
Storm Nimbus and a Storm Supervisor are available.
+
+==== Compile
+
+Build the generated archetype by running
+
+[source,shell]
+----
+mvn package
+----
+
+This will create an uberjar named `${artifactId}-${version}.jar` (matching the
artifact id and the version specified during the archetype generation) in your
`target` directory.
+
+==== Inject Your First Seeds
+
+Now you are ready to insert your first seeds into URLFrontier. To do so,
create a file `seeds.txt` containing your seeds:
+
+[source,text]
+----
+https://stormcrawler.apache.org
+----
+
+After you have saved it, we need to inject the seeds into URLFrontier. This
can be done by running URLFrontier's client:
+
+[source,shell]
+----
+java -cp target/${artifactId}-${version}.jar
crawlercommons.urlfrontier.client.Client PutURLs -f seeds.txt
+----
+
+where _seeds.txt_ is the previously created file containing URLs to inject,
with one URL per line.
+
+==== Run Your First Crawl
+
+Now it is time to run our first crawl. To do so, we need to start our crawler
topology in distributed mode and deploy it on our Storm Cluster.
+
+[source,shell]
+----
+docker run --network ${NETWORK} -it \
+--rm \
+-v "$(pwd)/crawler-conf.yaml:/apache-storm/crawler-conf.yaml" \
+-v "$(pwd)/crawler.flux:/apache-storm/crawler.flux" \
+-v
"$(pwd)/target/${artifactId}-${version}.jar:/apache-storm/${artifactId}-${version}.jar"
\
+storm:latest \
+storm jar ${artifactId}-${version}.jar org.apache.storm.flux.Flux --remote
crawler.flux
+----
+
+where `${NETWORK}` is the name of the Docker network of the previously started
Docker Compose. You can find this name by running
+
+[source,shell]
+----
+docker network ls
+----
+
+After running the `storm jar` command, you should carefully monitor the logs
via
+
+[source,shell]
+----
+docker compose logs -f
+----
+
+as well as the Storm UI. It should now list a running topology.
+
+In the default archetype, the fetched content is printed out to the default
system out print stream.
+
+NOTE: In a Storm topology defined with Flux, parallelism specifies the number
of tasks or instances of a spout or bolt to run concurrently, enabling scalable
and efficient processing. In the archetype every component is set to a
parallelism of **1**.
+
+Congratulations! You learned how to start your first simple crawl using Apache
StormCrawler.
+
+Feel free to explore the rest of our documentation to build more complex
crawler topologies.
+
+=== Summary
+
+This document shows how simple it is to get Apache StormCrawler up and running
and to run a simple crawl.
diff --git a/pom.xml b/pom.xml
index f5f183ec..183a1830 100644
--- a/pom.xml
+++ b/pom.xml
@@ -41,7 +41,7 @@ under the License.
<licenses>
<license>
<name>The Apache License, Version 2.0</name>
-
<url>http://www.apache.org/licenses/LICENSE-2.0.txt</url>
+
<url>https://www.apache.org/licenses/LICENSE-2.0.txt</url>
</license>
</licenses>
@@ -519,6 +519,7 @@ under the License.
<exclude>**/*.flux</exclude>
<exclude>**/*.txt</exclude>
<exclude>**/*.rss</exclude>
+ <exclude>**/*.pdf</exclude>
<exclude>**/*.tar.gz</exclude>
<exclude>**/README.md</exclude>
<exclude>**/target/**</exclude>
@@ -692,6 +693,7 @@ under the License.
<module>archetype</module>
<module>external/opensearch/archetype</module>
<module>external/solr/archetype</module>
- </modules>
+ <module>docs</module>
+ </modules>
</project>